## 2.10 Missing Values (NA)

In applications, you may encounter data sets in which some values are missing. In this scenario, R uses `NA` to represent those values, indicating that they are **not available**. Let’s look at the following example.

```
a <- 1:10
a[11]
#> [1] NA
```

Since you have defined `a` as a vector of length 10, there are 10 values in `a`. If you try to access the 11th element of `a`, it is not available, hence you will see `NA` as the result.
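As a small sketch of the same behavior, indexing several positions at once returns `NA` for every position beyond the end of the vector. The index vector and the name `out` are chosen here only for illustration.

```r
a <- 1:10
length(a)
#> [1] 10
# Positions 11 and 12 do not exist, so they come back as NA
out <- a[c(10, 11, 12)]
out
#> [1] 10 NA NA
```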

Sometimes the values of some elements in a vector are missing, and you can use `NA` for these elements. Here is an example of a vector containing `NA`s. If you extract the 2nd and 4th elements of `b`, you will get `NA NA` as the result.

```
b <- c(1, NA, 2, NA, 3)
b
#> [1]  1 NA  2 NA  3
b[c(2, 4)]
#> [1] NA NA
```

Now that you have a basic understanding of `NA`, the rest of this section introduces several of its properties. Let’s start with the fact that `NA` is contagious.

### 2.10.1 `NA` is contagious

`NA` implies that the underlying value is not available; in other words, there is uncertainty about the value. As a result, for most operations involving `NA`, the results will also be `NA`, showing that `NA` is **contagious**.

```
y <- NA
y + 3
#> [1] NA
y == 3
#> [1] NA
```

As you can see here, `y` is `NA`, indicating that the value of `y` is not available. When you try operations like `y + 3` or `y == 3`, the answers are not available either, so both take the value `NA`.

What if we create another `NA` object and compare it with `y`?

```
z <- NA
y == z
#> [1] NA
```

It is again `NA`, which may be confusing at first. However, keep in mind that since both `y` and `z` are not available, there is no way to tell whether they are the same. Hence `y == z` is also `NA`.

Think about the value of `1^NA` in R, then run it. Does the result agree with your expectation? Now do the same for `0^NA`.

- Answer 1. For any \(x\), we have \(1^x = \exp(\log (1^x)) = \exp(x \log 1) = \exp(x \cdot 0) = 1\). Since there is no uncertainty regarding the expression, the value of `1^NA` is 1.
- Answer 2. We have \(0^0 = \lim_{x\to 0^+}x^x =\exp[\lim_{x\to 0^+} x\log(x)]=\exp[0] = 1\), and \(0^1 = 0\). Since `NA` represents an unknown value, it could be 0, 1, or any other number. So `0^NA` is not determined, because it takes different values depending on the exponent. Hence, the value of `0^NA` is also `NA`.
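The same reasoning extends beyond powers. The following sketch collects a few base R expressions whose result does not depend on the unknown value, so R returns a definite answer, next to two expressions that stay `NA`.

```r
# The result is the same whatever the missing value is, so it is not NA
1^NA
#> [1] 1
NA^0
#> [1] 1
NA & FALSE
#> [1] FALSE
NA | TRUE
#> [1] TRUE
# Here the result depends on the missing value, so it stays NA
0^NA
#> [1] NA
NA & TRUE
#> [1] NA
```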

Now, let’s look at the impact of `NA` values when we apply statistical functions. Let’s create a vector containing `NA` values and apply some functions to it.

```
x <- c(1, NA, 3, 4, NA, 2)
x
#> [1]  1 NA  3  4 NA  2
sum(x)
#> [1] NA
mean(x)
#> [1] NA
sd(x)
#> [1] NA
```

As you can see, for many statistical functions on vectors, as long as the vector contains at least one `NA` value, the result cannot be determined, so it is `NA` as well.

As a result, you may want to ignore the `NA` values during the function evaluation. Fortunately, most statistical functions on vectors provide an optional argument `na.rm`, which takes a logical value indicating whether to remove `NA` values before applying the function. Let’s look at the following examples.

```
sum(x, na.rm = TRUE)
#> [1] 10
mean(x, na.rm = TRUE)
#> [1] 2.5
sd(x, na.rm = TRUE)
#> [1] 1.290994
```

It is easy to verify that these are the results we would expect once the `NA` values are removed. Feel free to try the following code, which applies the same functions to the subvector of non-missing values.

```
x_no_na <- c(1, 3, 4, 2)
sum(x_no_na)
mean(x_no_na)
sd(x_no_na)
```
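To make this check explicit, one possible sketch compares the two sets of results directly with `all.equal()`, which tests numeric equality up to a small tolerance.

```r
x <- c(1, NA, 3, 4, NA, 2)
x_no_na <- c(1, 3, 4, 2)

# Each comparison is TRUE when na.rm = TRUE matches the NA-free vector
all.equal(sum(x, na.rm = TRUE), sum(x_no_na))
#> [1] TRUE
all.equal(mean(x, na.rm = TRUE), mean(x_no_na))
#> [1] TRUE
all.equal(sd(x, na.rm = TRUE), sd(x_no_na))
#> [1] TRUE
```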

Interestingly, the `summary()` function handles `NA` values automatically by removing them before computing the five percentiles and the mean. In addition, its output includes an `NA's` entry showing the number of `NA`s in `x`.

```
summary(x)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
#>    1.00    1.75    2.50    2.50    3.25    4.00       2
```

### 2.10.2 Work with `NA` values

When there are `NA` values in a vector, there is nothing to be afraid of, as there are many useful tools we can use.

Let’s use `x <- c(1, NA, 3, 4, NA, 2)` throughout this part.

*a. Find indices with missing values*

First, we will introduce how to find the indices of missing values. To find them, you may be tempted to use the comparison operator introduced in Section 2.6. Let’s try to compare `x` with `NA` using the following code.

`x == NA`

You get a vector in which every value is `NA`! This is actually not surprising, for the following reason. Given that the `NA` you are comparing against can take any unknown value, any comparison with it results in `NA` due to the lack of information.
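A short sketch of this pitfall, using the same `x`: the comparison yields only `NA`s, and even comparing `NA` with itself does not give `TRUE`.

```r
x <- c(1, NA, 3, 4, NA, 2)
x == NA
#> [1] NA NA NA NA NA NA
# Even comparing NA with itself is not TRUE
NA == NA
#> [1] NA
# No element is TRUE, so the comparison cannot locate the missing values
any(x == NA, na.rm = TRUE)
#> [1] FALSE
```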

Instead of using `==` to find missing values, the correct way is to use the `is.na()` function, which returns a logical vector (stored as `x_na` below) indicating whether each element is missing. You can then use the `which()` function, which returns the positions of all `TRUE` values, to find the indices of the `NA` values. In addition, applying `sum()` to the logical vector `x_na` returns the number of `NA` values in the vector, following the coercion rule described in Section 2.5.3.

```
x_na <- is.na(x)  # logical vector
x_na
#> [1] FALSE  TRUE FALSE FALSE  TRUE FALSE
which(x_na)  # integer vector of positions
#> [1] 2 5
sum(x_na)  # the number of NAs in x
#> [1] 2
sum(!x_na)  # the number of non-NAs in x
#> [1] 4
```

If you only want to detect whether there are any `NA` values in the vector, you can use the `anyNA()` function.

```
anyNA(c(NA, 1))
#> [1] TRUE
anyNA(1:3)
#> [1] FALSE
```
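Because `TRUE` is coerced to 1 and `FALSE` to 0, the same logical vector also gives the proportion of missing values. A minimal sketch using `mean()`:

```r
x <- c(1, NA, 3, 4, NA, 2)
x_na <- is.na(x)
# Proportion of missing values: 2 out of 6
mean(x_na)
#> [1] 0.3333333
```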

*b. Remove missing values*

Sometimes, you may simply want to remove the missing values. To do that, you can use a logical vector for vector subsetting, as introduced in Section 2.6.3. The logical vector you want to use is the negation (`!`) of the logical vector that marks the missing values. You will then get a subvector of `x` that keeps all values except `NA`.

```
x2 <- x[!x_na]
x2
#> [1] 1 3 4 2
```
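Base R also provides `na.omit()`, which drops missing values in one step. As a sketch, note that its result carries an extra `na.action` attribute recording the removed positions, so `as.vector()` is used here to obtain a plain vector. The name `x_omit` is chosen only for illustration.

```r
x <- c(1, NA, 3, 4, NA, 2)
x_omit <- as.vector(na.omit(x))
x_omit
#> [1] 1 3 4 2
# Same values as subsetting with the negated logical vector
identical(x_omit, x[!is.na(x)])
#> [1] TRUE
```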

*c. Impute missing values*

In many applications, naively removing the missing values before doing the analysis may lead to incorrect inference. It is often useful to make the data complete by **imputing** the missing values. For example, you can use mean imputation or median imputation, which replaces the missing values with the mean or median of the non-missing values.

```
x_impute <- x
mean(x, na.rm = TRUE)
#> [1] 2.5
x_impute[x_na] <- mean(x, na.rm = TRUE)
x_impute
#> [1] 1.0 2.5 3.0 4.0 2.5 2.0
x
#> [1]  1 NA  3  4 NA  2
```

If you want to compare the values of an object before and after some operations, you can create a new object with the same value as the original (here, we create `x_impute`, which has the same value as `x`), then perform the operations on `x_impute` without changing the value of `x`.

In `x_impute`, the values of the 2nd and 5th elements are replaced by 2.5, the average of the non-missing values of `x`.
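The same steps can be wrapped in a small helper so that the summary function is interchangeable. The function `impute_na()` and its argument `fun` are hypothetical names introduced only for this sketch; they are not part of base R.

```r
# Hypothetical helper: replace NAs with a summary of the non-missing values
impute_na <- function(v, fun = mean) {
  v[is.na(v)] <- fun(v, na.rm = TRUE)
  v
}

x <- c(1, NA, 3, 4, NA, 2)
impute_na(x)                # mean imputation
#> [1] 1.0 2.5 3.0 4.0 2.5 2.0
impute_na(x, fun = median)  # median imputation (the median here is also 2.5)
#> [1] 1.0 2.5 3.0 4.0 2.5 2.0
x                           # the original vector is unchanged
#> [1]  1 NA  3  4 NA  2
```

Because the helper works on a copy of its argument, it follows the same idea as creating `x_impute` before modifying it.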

*d. Replace non-standard missing values with NA in the vector*

Sometimes, the data we collect may not use `NA` to indicate missingness. For example, in the following vector `x3`, the value `999` indicates that the corresponding element is missing.

`x3 <- c(4, 999, 1, 999, 3, 999, 999)`

It is highly recommended to convert these values into `NA` before carrying out any analysis.

```
x3[x3 == 999] <- NA
x3
#> [1]  4 NA  1 NA  3 NA NA
```

Let’s look at another example where both `999` and `-999` indicate that the value is missing. You can convert all `999` and `-999` values into `NA` by using the operations introduced in Section 2.8.1.

```
x4 <- c(4, 999, 1, -999, 3, -999, 999)
## 999 and -999 are the values indicating missingness
x4[x4 %in% c(999, -999)] <- NA
x4
#> [1]  4 NA  1 NA  3 NA NA
```
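If you prefer to keep the original vector untouched, one alternative sketch uses the base R function `replace()`, which returns a modified copy. The names `codes` and `x4_clean` are chosen here only for illustration.

```r
x4 <- c(4, 999, 1, -999, 3, -999, 999)
codes <- c(999, -999)  # values that indicate missingness
# replace() returns a copy with the matching elements set to NA
x4_clean <- replace(x4, x4 %in% codes, NA)
x4_clean
#> [1]  4 NA  1 NA  3 NA NA
x4  # unchanged
#> [1]    4  999    1 -999    3 -999  999
```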

### 2.10.3 Exercises

- For the vector `x <- rep(c(1, 2, NA), 3:5)`,
  - verify each value of `summary(x)` by using other functions;
  - find the indices with missing values;
  - create a vector `x_no_na` containing the non-missing values in `x`;
  - replace those missing values by the median of the non-missing values in `x`.

- For the vector `y <- rep(c("N", 2, "A"), 5:3)`, the values of both `"N"` and `"A"` indicate missingness. Convert non-standard missing values to `NA`, then find the indices of `y` that correspond to missing values.