2.10 Missing Values (NA)

In applications, you may encounter data sets in which some values are missing. R uses NA to represent such values, indicating that they are not available. Let’s look at the following example.

a <- 1:10
a[11]
#> [1] NA

Since you have defined a as a vector of length 10, there are 10 values in a. If you try to access the 11th element of a, it is not available, hence you will see NA as the result.
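As a small sketch of the same behavior, the code below shows that indexing past the end of a vector returns NA for each out-of-range position, and that assigning beyond the end fills the skipped positions with NA. The name a2 is introduced here only for illustration.

```r
a <- 1:10
length(a)
#> [1] 10
a[c(10, 11, 12)]     # positions 11 and 12 do not exist
#> [1] 10 NA NA

a2 <- a              # work on a copy so that a is unchanged
a2[12] <- 12L        # assigning past the end extends the vector
a2                   # the skipped 11th position is filled with NA
#> [1]  1  2  3  4  5  6  7  8  9 10 NA 12
```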

Sometimes the values of some elements in a vector are missing, and you can use NA for these elements. Here is an example of a vector containing NAs. If you ask for the 2nd and 4th elements of b, you will get NA NA as the result.

b <- c(1, NA, 2, NA, 3)
b
#> [1]  1 NA  2 NA  3
b[c(2,4)]
#> [1] NA NA

Now you have a basic understanding of NA. In the rest of this section, we introduce several properties of NA. Let’s start with the fact that NA is contagious.

2.10.1 NA is contagious

NA implies that the underlying value is not available; in other words, the value is uncertain. As a result, most operations involving NA also return NA, which is why we say that NA is contagious.

y <- NA
y + 3
#> [1] NA
y == 3
#> [1] NA

As you can see here, y is NA, indicating that the value of y is not available. When you evaluate y + 3 or y == 3, the answers are not available either, so both take the value NA.

How about we create another NA object and compare it with y?

z <- NA 
y == z 
#> [1] NA

It is again NA, which may be confusing at first. However, keep in mind that since both y and z are not available, there is no way to tell whether they are the same. Hence y == z is also NA.

  1. Think about what the value of 1^NA is in R. Try to run it in R. Does it agree with your thoughts?

  2. Think about what the value of 0^NA is in R. Try to run it in R. Does it agree with your thoughts?

  • Answer 1. For any \(x\), we have \(1^x = \exp(\log (1^x)) = \exp(x \log 1) = \exp(x \cdot 0) = 1\). Since the result does not depend on the exponent, there is no uncertainty, and the value of 1^NA is 1.
  • Answer 2. We have \(0^0 = \lim_{x\to 0^+}x^x =\exp[\lim_{x\to 0^+} x\log(x)]=\exp[0] = 1\), and \(0^1 = 0\). Since NA represents an unknown value, it could be 0, 1, or any other number. So 0^NA is not determined, because the result changes with the exponent. Hence, the value of 0^NA is NA.
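The same reasoning explains a few other cases where NA is not contagious. The following sketch lists some of them: when the result is the same no matter what value NA stands for, R returns that result instead of NA. The comments give one informal way to read each line.

```r
NA^0          # any number to the power 0 is 1
#> [1] 1
NA & FALSE    # FALSE regardless of the unknown value
#> [1] FALSE
NA | TRUE     # TRUE regardless of the unknown value
#> [1] TRUE
NA & TRUE     # depends on the unknown value, so still NA
#> [1] NA
NA * 0        # still NA; the unknown value could be infinite
#> [1] NA
```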

Now, let’s talk about how NA values affect statistical functions. Let’s create a vector containing NA values and apply some functions to it.

x <- c(1, NA, 3, 4, NA, 2)
x
#> [1]  1 NA  3  4 NA  2
sum(x) 
#> [1] NA
mean(x)
#> [1] NA
sd(x)
#> [1] NA

As you can see, for many statistical functions on vectors, a single NA value in the input makes the result impossible to determine, so the function returns NA as well.

As a result, you may want to ignore the NA values during the function evaluation. Fortunately, most statistical functions on vectors provide an optional argument na.rm, which takes a logical value, indicating whether to remove NA before applying the functions. Let’s see the following examples.

sum(x, na.rm = TRUE) 
#> [1] 10
mean(x, na.rm = TRUE)
#> [1] 2.5
sd(x, na.rm = TRUE)
#> [1] 1.290994

It is easy to verify that the results are what we expect to get if the NA values are removed. Feel free to try the following code, which applies the same functions to the subvector of non-missing values.

x_no_na <- c(1, 3, 4, 2)
sum(x_no_na) 
mean(x_no_na)
sd(x_no_na)
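The na.rm argument is not limited to sum(), mean(), and sd(). As a brief sketch, here are a few other base R functions that accept it, applied to the same vector x.

```r
x <- c(1, NA, 3, 4, NA, 2)
max(x)                     # NA without na.rm
#> [1] NA
max(x, na.rm = TRUE)
#> [1] 4
min(x, na.rm = TRUE)
#> [1] 1
median(x, na.rm = TRUE)
#> [1] 2.5
var(x, na.rm = TRUE)       # the square of sd(x, na.rm = TRUE)
#> [1] 1.666667
```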

Interestingly, the summary() function handles NA values automatically by removing them before computing the five quantiles and the mean. In addition, its output includes an entry showing the number of NAs in x.

summary(x) 
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
#>    1.00    1.75    2.50    2.50    3.25    4.00       2

2.10.2 Work with NA values

When there are NA values in our vector, there is nothing to be afraid of as there are many useful tools we can use.

Let’s use x <- c(1, NA, 3, 4, NA, 2) throughout this part.

a. Find indices with missing values

First, we will introduce how to find the indices of missing values. To find them, you may be tempted to use the comparison operator introduced in Section 2.6. Let’s try to compare x with NA, as in the following code.

x == NA
#> [1] NA NA NA NA NA NA

You get a vector in which every value is NA! This is actually not surprising. Since the NA you are comparing against stands for an unknown value, any comparison with it results in NA due to the lack of information.

Instead of using == to find missing values, the correct way is to use the is.na() function, which returns a logical vector indicating whether each element is missing; here we store it as x_na. Then, you can use the which() function, which returns the locations of all TRUE values, to find the indices of the NA values. Finally, applying the sum() function to the logical vector x_na returns the number of NA values in the vector, following the coercion rule described in Section 2.5.3.

x_na <- is.na(x)        # logical vector
x_na
#> [1] FALSE  TRUE FALSE FALSE  TRUE FALSE
which(x_na)             # numeric vector
#> [1] 2 5
sum(x_na)               # the number of NAs in x
#> [1] 2
sum(!x_na)              # the number of non-NAs in x
#> [1] 4
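As a short sketch building on the same coercion rule, you can also call which() on is.na(x) directly, and use mean() to get the proportion of missing values. The last line shows what happens if you try which() on the == comparison instead.

```r
x <- c(1, NA, 3, 4, NA, 2)
which(is.na(x))     # same indices, without storing x_na first
#> [1] 2 5
mean(is.na(x))      # proportion of missing values: 2 out of 6
#> [1] 0.3333333
which(x == NA)      # the comparison is all NA, so nothing is found
#> integer(0)
```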

If you only want to detect whether there are any NA values in the vector, you can use the anyNA() function.

anyNA(c(NA, 1))
#> [1] TRUE
anyNA(1:3)
#> [1] FALSE
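For comparison, the sketch below applies anyNA() to the vector x and shows that combining is.na() with the base R function any() gives the same answer.

```r
x <- c(1, NA, 3, 4, NA, 2)
anyNA(x)
#> [1] TRUE
any(is.na(x))       # same answer, built from is.na() and any()
#> [1] TRUE
```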

b. Remove missing values

Sometimes, you may want to simply remove the missing values. To do that, you can use a logical vector to do vector subsetting as introduced in Section 2.6.3. The specific logical vector you want to use is the opposite (!) of the logical vector that represents missing values. Then you will get a subvector of x which keeps all values except NA.

x2 <- x[!x_na]
x2
#> [1] 1 3 4 2
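You can also skip the intermediate logical vector, or use the na.omit() function from the stats package, which is attached by default. The sketch below is one possible alternative; the names x2_direct and x2_omit are introduced here only for illustration. Note that na.omit() attaches extra attributes recording which positions were dropped, so as.vector() is used to strip them before comparing.

```r
x <- c(1, NA, 3, 4, NA, 2)
x2_direct <- x[!is.na(x)]           # subsetting in one step
x2_direct
#> [1] 1 3 4 2
x2_omit <- as.vector(na.omit(x))    # drop NAs, then strip the attributes
x2_omit
#> [1] 1 3 4 2
identical(x2_direct, x2_omit)
#> [1] TRUE
```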

c. Impute missing values

In many applications, naively removing the missing values before doing the analysis may lead to incorrect inference. Often, it is useful to make the data complete by imputing the missing values. For example, you can use mean imputation or median imputation, which replaces the missing values with the mean or median of the non-missing values.

x_impute <- x
mean(x, na.rm = TRUE)
#> [1] 2.5
x_impute[x_na] <- mean(x, na.rm = TRUE)
x_impute
#> [1] 1.0 2.5 3.0 4.0 2.5 2.0
x
#> [1]  1 NA  3  4 NA  2

If you want to compare the values of an object before and after some operations, you can create a new object with the same value as the original object (here, we create x_impute, which has the same value as x), then perform the operations on x_impute without changing the value of x.

In x_impute, values of the 2nd and 5th elements are replaced by 2.5, the average of the non-missing values of x.
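For x, the mean and the median of the non-missing values are both 2.5, so the two imputation methods give the same result. To see a difference, here is a minimal sketch on a hypothetical vector w (not part of the original example) that contains one large value. The names w_mean and w_median are also introduced only for illustration.

```r
w <- c(1, NA, 3, 10, NA, 2)        # hypothetical data with one large value

w_mean <- w                        # mean imputation on a copy of w
w_mean[is.na(w)] <- mean(w, na.rm = TRUE)
w_mean
#> [1]  1  4  3 10  4  2

w_median <- w                      # median imputation on another copy
w_median[is.na(w)] <- median(w, na.rm = TRUE)
w_median
#> [1]  1.0  2.5  3.0 10.0  2.5  2.0
```

The large value 10 pulls the mean up to 4, while the median stays at 2.5, so the choice of method changes the imputed values.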

d. Replace non-standard missing values with NA in the vector

Sometimes, the data we collect may not use NA to indicate missingness. For example, in the following vector x3, the value 999 indicates that the corresponding element is missing.

x3 <- c(4, 999, 1, 999, 3, 999, 999)

It is highly recommended to convert these values into NA before carrying out any analysis.

x3[x3 == 999] <- NA
x3
#> [1]  4 NA  1 NA  3 NA NA

Let’s see another example where both 999 and -999 indicate that the value is missing. You can convert all 999 and -999 into NA by using the operations introduced in Section 2.8.1.

x4 <- c(4, 999, 1, -999, 3, -999, 999)
# 999 and -999 are the values indicating missingness
x4[x4 %in% c(999, -999)] <- NA
x4
#> [1]  4 NA  1 NA  3 NA NA
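As an alternative sketch, the same conversion can be written with two comparisons joined by the logical OR operator |. The name x5 is introduced here only for illustration and holds the same data as x4 before conversion.

```r
x5 <- c(4, 999, 1, -999, 3, -999, 999)   # same data as x4 before conversion
x5[x5 == 999 | x5 == -999] <- NA         # TRUE where either code appears
x5
#> [1]  4 NA  1 NA  3 NA NA
sum(is.na(x5))                           # number of values converted
#> [1] 4
```

With only two codes, both forms are equally readable; %in% tends to scale better when there are many codes for missingness.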

2.10.3 Exercises

  1. For the vector x <- rep(c(1, 2, NA), 3:5),
     a. verify each value of summary(x) by using other functions;
     b. find the indices with missing values;
     c. create a vector x_no_na containing the non-missing values in x;
     d. replace those missing values by the median of the non-missing values in x.
  2. For the vector y <- rep(c("N", 2, "A"), 5:3), the values of both "N" and "A" indicate missingness. Convert non-standard missing values to NA, then find the indices of y that correspond to missing values.