2.10 Missing Values (NA)
In applications, you may encounter data sets in which some values are missing. R uses NA to represent such values, indicating that they are not available. Let’s see the following example.
a <- 1:10
a[11]
#> [1] NA
Since the vector a was defined with length 10, it holds 10 values. If you try to access the 11th element of a, it is not available, hence you will see NA as the result.
Sometimes the values of some elements in a vector are missing, and you can use NA for these elements. Here is an example of a vector b that contains NAs. If you extract the 2nd and 4th elements of b, you will get NA NA as the result.
b <- c(1, NA, 2, NA, 3)
b
#> [1]  1 NA  2 NA  3
b[c(2,4)]
#> [1] NA NA
Now that you have a basic understanding of NA, the rest of this section introduces several of its properties. Let’s start with the fact that NA is contagious.
2.10.1 NA is contagious
NA implies that the underlying value is not available; in other words, there is uncertainty about the value. As a result, most operations involving NA also return NA, which is why we say that NA is contagious.
y <- NA
y + 3
#> [1] NA
y == 3
#> [1] NA
As you can see here, y is NA, indicating that the value of y is not available. When you try operations like y + 3 or y == 3, the answers are not available either, so both take the value NA.
What if we create another NA object, z, and compare it with y?
z <- NA
y == z
#> [1] NA
The result is again NA, which may be confusing at first. However, keep in mind that since both y and z are not available, there is no way to tell whether they are the same. Hence y == z is also NA.
Think about what the value of 1^NA is in R. Try running it in R. Does the result agree with your expectation?
Think about what the value of 0^NA is in R. Try running it in R. Does the result agree with your expectation?
- Answer 1. For any \(x\), we have \(1^x = \exp(\log (1^x)) = \exp(x \log 1) = \exp(x \cdot 0) = 1\). Since the result does not depend on the exponent, there is no uncertainty, and the value of 1^NA is 1.
- Answer 2. We have \(0^0 = \lim_{x\to 0^+}x^x =\exp[\lim_{x\to 0^+} x\log(x)]=\exp[0] = 1\), and \(0^1 = 0\). Since NA represents an unknown value, it could be 0, 1, or any other number. So 0^NA is not determined, because its value depends on the exponent. Hence, the value of 0^NA is NA.
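Both answers can be checked directly in R. The short sketch below also includes a few related cases that are not discussed in the text (NA^0 and the logical operators | and &); they are added only as further illustrations of the same idea, namely that the result is NA only when it genuinely depends on the unknown value.

```r
# Results that do not depend on the unknown value are not NA
one_pow_na  <- 1^NA        # 1 raised to any power is 1
zero_pow_na <- 0^NA        # depends on the exponent, so NA
na_pow_zero <- NA^0        # anything raised to the power 0 is 1 in R
or_true     <- NA | TRUE   # TRUE whatever the unknown value is
and_false   <- NA & FALSE  # FALSE whatever the unknown value is

one_pow_na
#> [1] 1
zero_pow_na
#> [1] NA
na_pow_zero
#> [1] 1
or_true
#> [1] TRUE
and_false
#> [1] FALSE
```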
Now, let’s look at the impact of NA values when we apply statistical functions. Let’s create a vector containing NA values and apply some functions to it.
x <- c(1, NA, 3, 4, NA, 2)
x
#> [1]  1 NA  3  4 NA  2
sum(x)
#> [1] NA
mean(x)
#> [1] NA
sd(x)
#> [1] NA
As you can see, for many statistical functions on vectors, if the vector contains at least one NA value, the result cannot be determined, so the function returns NA as well.
As a result, you may want to ignore the NA values when evaluating the function. Fortunately, most statistical functions on vectors provide an optional argument na.rm, which takes a logical value indicating whether to remove the NA values before applying the function. Let’s see the following examples.
sum(x, na.rm = TRUE)
#> [1] 10
mean(x, na.rm = TRUE)
#> [1] 2.5
sd(x, na.rm = TRUE)
#> [1] 1.290994
It is easy to verify that these are the results we would expect if the NA values were removed. Feel free to try the following code, which applies the same functions to the subvector of non-missing values.
x_no_na <- c(1, 3, 4, 2)
sum(x_no_na)
mean(x_no_na)
sd(x_no_na)
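The na.rm argument is not specific to sum(), mean(), and sd(). As a small sketch, the base R functions min(), max(), median(), and range() accept it in the same way; the vector is redefined here so that the code runs on its own.

```r
x <- c(1, NA, 3, 4, NA, 2)

# Without na.rm, the missing values propagate to the result
max(x)
#> [1] NA

# With na.rm = TRUE, only the non-missing values 1, 3, 4, 2 are used
min(x, na.rm = TRUE)
#> [1] 1
max(x, na.rm = TRUE)
#> [1] 4
median(x, na.rm = TRUE)
#> [1] 2.5
range(x, na.rm = TRUE)
#> [1] 1 4
```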
Interestingly, the summary() function deals with the NA values automatically by removing them before computing the five quantiles (minimum, first quartile, median, third quartile, and maximum) and the mean. In addition, the summary() function provides a column that shows the number of NAs in x.
summary(x)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
#>    1.00    1.75    2.50    2.50    3.25    4.00       2
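One way to cross-check the quantiles reported by summary() is the quantile() function. The sketch below is only an illustration; note that, unlike sum() or mean(), quantile() does not return NA when missing values are present but stops with an error unless na.rm = TRUE is supplied, so the failing call is wrapped in try().

```r
x <- c(1, NA, 3, 4, NA, 2)

# quantile() raises an error when NAs are present and na.rm is FALSE
# (the default); try() lets the script continue
res <- try(quantile(x), silent = TRUE)
inherits(res, "try-error")
#> [1] TRUE

# With na.rm = TRUE, the five quantiles agree with the summary() output
q <- quantile(x, na.rm = TRUE)
unname(q)   # drop the percentage labels for compact printing
#> [1] 1.00 1.75 2.50 3.25 4.00
```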
2.10.2 Work with NA values
When there are NA values in a vector, there is nothing to be afraid of, as there are many useful tools we can use.
Let’s use x <- c(1, NA, 3, 4, NA, 2) throughout this part.
a. Find indices with missing values
First, we will look at how to find the indices of missing values. You may be tempted to use the comparison operator introduced in Section 2.6. Let’s try comparing x with NA using the following code.
x == NA
#> [1] NA NA NA NA NA NA
You get a vector whose values are all NA! This is actually not surprising: since the NA you are comparing against could be any unknown value, every comparison with it results in NA due to the lack of information.
Instead of using == to find missing values, the correct way is to use the is.na() function, which returns a logical vector (stored as x_na in the code that follows) indicating whether each element is missing. You can then use the which() function, which returns the locations of all TRUE values, to find the indices of the NA values. In addition, applying the sum() function to the logical vector x_na returns the number of NA values in the vector, following the coercion rule described in Section 2.5.3.
x_na <- is.na(x) #logical vector
x_na
#> [1] FALSE  TRUE FALSE FALSE  TRUE FALSE
which(x_na) #numeric vector
#> [1] 2 5
sum(x_na) #the number of NAs in x
#> [1] 2
sum(!x_na) #the number of non-NAs in x
#> [1] 4
If you only want to detect whether there are any NA values in the vector, you can use the anyNA() function.
anyNA(c(NA, 1))
#> [1] TRUE
anyNA(1:3)
#> [1] FALSE
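The tools in this part can also be combined without creating an intermediate object. As a brief sketch, the code below applies which() to is.na(x) directly and uses mean() on the logical vector to get the proportion of missing values, relying on the same coercion of TRUE and FALSE to 1 and 0 that makes sum() work.

```r
x <- c(1, NA, 3, 4, NA, 2)

# Indices of the missing values in one step
which(is.na(x))
#> [1] 2 5

# Proportion of missing values: the mean of a logical vector
# is the fraction of TRUE values
mean(is.na(x))
#> [1] 0.3333333

# any(is.na(x)) gives the same answer as anyNA(x)
any(is.na(x)) == anyNA(x)
#> [1] TRUE
```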
b. Remove missing values
Sometimes, you may want to simply remove the missing values. To do that, you can use a logical vector for vector subsetting, as introduced in Section 2.6.3. The logical vector you want is the negation (!) of the logical vector that marks the missing values. You will then get a subvector of x that keeps all values except the NAs.
x2 <- x[!x_na]
x2
#> [1] 1 3 4 2
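As an alternative sketch, the same subvector can be obtained in one step with x[!is.na(x)], or with the base R function na.omit(), which is not covered in the text. Note that na.omit() attaches extra attributes recording which positions were dropped, so the result is converted back to a plain vector here.

```r
x <- c(1, NA, 3, 4, NA, 2)

# Subsetting with the negated is.na() vector in one step
x[!is.na(x)]
#> [1] 1 3 4 2

# na.omit() also drops the NAs, but keeps a record of the removed
# positions in the "na.action" attribute
x_omit <- na.omit(x)
as.numeric(x_omit)                     # plain numeric vector
#> [1] 1 3 4 2
as.integer(attr(x_omit, "na.action"))  # positions that were removed
#> [1] 2 5
```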
c. Impute missing values
In many applications, naively removing the missing values before doing the analysis may lead to incorrect inference. It is often useful to make the data complete by imputing the missing values. For example, you can use mean imputation or median imputation, which replaces the missing values with the mean or median of the non-missing values.
x_impute <- x
mean(x, na.rm = TRUE)
#> [1] 2.5
x_impute[x_na] <- mean(x, na.rm = TRUE)
x_impute
#> [1] 1.0 2.5 3.0 4.0 2.5 2.0
x
#> [1]  1 NA  3  4 NA  2
If you want to compare the values of an object before and after some operations, you can create a new object with the same value as the original object (here, we create x_impute, which has the same value as x), then perform the operations on x_impute without changing the value of x.
In x_impute, the values of the 2nd and 5th elements are replaced by 2.5, the average of the non-missing values of x.
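Median imputation follows the same pattern. The sketch below wraps the replacement step in a small helper; the function name impute_na and the example vector w are hypothetical and chosen only for illustration. The helper assumes its fun argument accepts na.rm, as mean() and median() do.

```r
# Hypothetical helper (not part of base R): replace the NAs in a numeric
# vector with a statistic computed from the non-missing values
impute_na <- function(v, fun = median) {
  v[is.na(v)] <- fun(v, na.rm = TRUE)
  v
}

w <- c(1, NA, 10, 2)      # illustrative data, not from the text

impute_na(w)              # the median of 1, 10, 2 is 2
#> [1]  1  2 10  2
impute_na(w, fun = mean)  # the mean of 1, 10, 2 is 13/3
#> [1]  1.000000  4.333333 10.000000  2.000000
```

Because the function works on a copy of its argument, w itself still contains the NA afterwards, in the same way that x is unchanged after imputing x_impute.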
d. Replace non-standard missing values with NA in the vector
Sometimes, the data we collect may not use NA to indicate missingness. For example, in the following vector x3, the value 999 indicates that the corresponding element is missing.
x3 <- c(4, 999, 1, 999, 3, 999, 999)
It is highly recommended to convert these values into NA before carrying out any analysis.
x3[x3 == 999] <- NA
x3
#> [1]  4 NA  1 NA  3 NA NA
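To see why the conversion matters, the following sketch compares the mean before and after recoding. The object names x3_raw and x3_clean are introduced here only for illustration.

```r
x3_raw <- c(4, 999, 1, 999, 3, 999, 999)

# Treating 999 as a real measurement inflates the mean
mean(x3_raw)
#> [1] 572

# After recoding 999 as NA, the mean uses only the observed values 4, 1, 3
x3_clean <- x3_raw
x3_clean[x3_clean == 999] <- NA
mean(x3_clean, na.rm = TRUE)
#> [1] 2.666667
```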
Let’s see another example where both 999 and -999 indicate that the value is missing. You can convert all 999 and -999 values into NA by using the operations introduced in Section 2.8.1.
x4 <- c(4, 999, 1, -999, 3, -999, 999)
##999 and -999 are the values indicating missingness
x4[x4 %in% c(999, -999)] <- NA
x4
#> [1]  4 NA  1 NA  3 NA NA
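One practical difference between == and %in% shows up when the vector already contains some NA values. As a small sketch (the vector x5 is made up for illustration), == propagates the NA into the logical index, whereas %in% always returns TRUE or FALSE.

```r
x5 <- c(4, NA, 999, -999)   # illustrative vector that already has an NA

# == returns NA where the element itself is NA
x5 == 999
#> [1] FALSE    NA  TRUE FALSE

# %in% never returns NA, so the index is always TRUE or FALSE
x5 %in% c(999, -999)
#> [1] FALSE FALSE  TRUE  TRUE

x5[x5 %in% c(999, -999)] <- NA
x5
#> [1]  4 NA NA NA
```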
2.10.3 Exercises
- For the vector x <- rep(c(1, 2, NA), 3:5),
  - verify each value of summary(x) by using other functions;
  - find the indices with missing values;
  - create a vector x_no_na containing the non-missing values in x;
  - replace those missing values by the median of the non-missing values in x.
- For the vector y <- rep(c("N", 2, "A"), 5:3), the values of both "N" and "A" indicate missingness. Convert non-standard missing values to NA, then find the indices of y that correspond to missing values.