8.3 More on Missing Values
In Section 2.15, we have introduced the concept of missing values and how to detect them. In this section, we will discuss missingness related to tidy data.
8.3.1 Missing Values in Tidy Data
In tidy data, missing values are represented as NA
. In R, NA
stands for “Not Available” and is used to represent missing values. When working with tidy data, it is important to understand how missing values are handled in R. Let’s revisit the following dataset from Section 8.1.
library(r02pro)
library(tidyverse)
gm_tidy <- gm %>%
filter(country %in% c("United States", "China", "Russia")) %>%
select(country, year, life_expectancy) %>%
filter(year >= 2004 & year <= 2006)
Let’s remove the 3rd and 4th rows to prepare the dataset for this section.
gm_tidy <- gm_tidy[-c(3, 4), ]
gm_tidy
#> # A tibble: 7 × 3
#> country year life_expectancy
#> <chr> <dbl> <dbl>
#> 1 China 2004 73
#> 2 Russia 2004 65
#> 3 China 2006 74.2
#> 4 Russia 2005 66.5
#> 5 Russia 2006 67.5
#> 6 United States 2005 78
#> 7 United States 2006 78.2
In this tidy format, we actually don’t see any missing values. Let’s try to convert it into a wide format and see what happens.
gm_wide <- gm_tidy %>%
pivot_wider(names_from = year, values_from = life_expectancy)
gm_wide
#> # A tibble: 3 × 4
#> country `2004` `2006` `2005`
#> <chr> <dbl> <dbl> <dbl>
#> 1 China 73 74.2 NA
#> 2 Russia 65 67.5 66.5
#> 3 United States NA 78.2 78
From this output, we can see that the NA
values are introduced when we convert the tidy data into a wide format. This is because the gm_tidy
dataset does not have data for China in the year 2005 or for United States in the year 2004. When we convert it into a wide format, the NA
values are introduced to represent the missing data.
Let’s try to convert gm_wide
back to a tidy format.
gm_tidy2 <- gm_wide %>%
pivot_longer(cols = -1, names_to = "year", values_to = "life_expectancy")
gm_tidy2
#> # A tibble: 9 × 3
#> country year life_expectancy
#> <chr> <chr> <dbl>
#> 1 China 2004 73
#> 2 China 2006 74.2
#> 3 China 2005 NA
#> 4 Russia 2004 65
#> 5 Russia 2006 67.5
#> 6 Russia 2005 66.5
#> 7 United States 2004 NA
#> 8 United States 2006 78.2
#> 9 United States 2005 78
Now, we have the missing values in the tidy format.
8.3.2 Filling Missing Values
When working with missing values, it is important to decide how to handle them. One common approach is to fill the missing values with a specific value. In R, we can use the replace_na()
function from the tidyr
package to fill missing values. Let’s say we fill in all the missing values with the average of all the non-missing values.
gm_tidy2 %>%
mutate(life_expectancy = replace_na(life_expectancy, mean(life_expectancy, na.rm = TRUE)))
#> # A tibble: 9 × 3
#> country year life_expectancy
#> <chr> <chr> <dbl>
#> 1 China 2004 73
#> 2 China 2006 74.2
#> 3 China 2005 71.8
#> 4 Russia 2004 65
#> 5 Russia 2006 67.5
#> 6 Russia 2005 66.5
#> 7 United States 2004 71.8
#> 8 United States 2006 78.2
#> 9 United States 2005 78
Clearly, we can see the life_expectancy
for China in 2005 and for United States in 2005 are both replaced as the overall mean 71.77143
.
Sometimes, you may want to replace the missing values by the average value within a particular group. In this case, it is reasonable to use the average value for a particular country. To achieve this, you just need to add a group_by()
step before the mutate operation.
gm_tidy2 %>%
group_by(country) %>%
mutate(life_expectancy = replace_na(life_expectancy, mean(life_expectancy, na.rm = TRUE)))
#> # A tibble: 9 × 3
#> # Groups: country [3]
#> country year life_expectancy
#> <chr> <chr> <dbl>
#> 1 China 2004 73
#> 2 China 2006 74.2
#> 3 China 2005 73.6
#> 4 Russia 2004 65
#> 5 Russia 2006 67.5
#> 6 Russia 2005 66.5
#> 7 United States 2004 78.1
#> 8 United States 2006 78.2
#> 9 United States 2005 78
In some datasets with longitudinal measurements, the values are only recorded when they change. In this case, it is helpful to use fill()
to fill in missing values with previous or next value.
Let’s first try to fill in the missing values with the last non-missing value.
gm_tidy2 %>%
fill(life_expectancy, .direction = "down")
#> # A tibble: 9 × 3
#> country year life_expectancy
#> <chr> <chr> <dbl>
#> 1 China 2004 73
#> 2 China 2006 74.2
#> 3 China 2005 74.2
#> 4 Russia 2004 65
#> 5 Russia 2006 67.5
#> 6 Russia 2005 66.5
#> 7 United States 2004 66.5
#> 8 United States 2006 78.2
#> 9 United States 2005 78
From the output, you can see the life_expectancy
for China in 2005 is 74.2, the same value as the preceding value (the life_expectancy
for China in 2004). The life_expectancy
for United States in 2004 is set to be the same as the preceding value (Russia in 2005). Again, it may be more reasonable to group by the country first, arrange by country
and year
, and fill in the value with the preceding value for a particular country.
gm_tidy2 %>%
group_by(country) %>%
arrange(country, year) %>%
fill(life_expectancy, .direction = "down")
#> # A tibble: 9 × 3
#> # Groups: country [3]
#> country year life_expectancy
#> <chr> <chr> <dbl>
#> 1 China 2004 73
#> 2 China 2005 73
#> 3 China 2006 74.2
#> 4 Russia 2004 65
#> 5 Russia 2005 66.5
#> 6 Russia 2006 67.5
#> 7 United States 2004 NA
#> 8 United States 2005 78
#> 9 United States 2006 78.2
Perhaps it comes as a surprise that the life_expectancy
for United States for 2004 is still missing. The underlying reason is that for United States, there is no preceding observations that are complete. In this case, you can use .direction = "downup"
in the fill()
function, which will fill in the missing values in a downward fashion and then in a upward fashion.
gm_tidy2 %>%
group_by(country) %>%
arrange(country, year) %>%
fill(life_expectancy, .direction = "downup")
#> # A tibble: 9 × 3
#> # Groups: country [3]
#> country year life_expectancy
#> <chr> <chr> <dbl>
#> 1 China 2004 73
#> 2 China 2005 73
#> 3 China 2006 74.2
#> 4 Russia 2004 65
#> 5 Russia 2005 66.5
#> 6 Russia 2006 67.5
#> 7 United States 2004 78
#> 8 United States 2005 78
#> 9 United States 2006 78.2
In the downward pass, we fill in the life_expectancy
value for China in 2005 with the value for 2004. Then in the upward pass, we fill in the life_expectancy
value for United States with the value for 2005.