8.3 More on Missing Values
In Section 2.15, we introduced the concept of missing values and how to detect them. In this section, we discuss how missing values arise and behave in tidy data.
8.3.1 Missing Values in Tidy Data
In tidy data, missing values are represented as NA, which in R stands for “Not Available”. When working with tidy data, it is important to understand how R handles these values. Let’s revisit the following dataset from Section 8.1.
library(r02pro)
library(tidyverse)
gm_tidy <- gm %>%
  filter(country %in% c("United States", "China", "Russia")) %>%
  select(country, year, life_expectancy) %>%
  filter(year >= 2004 & year <= 2006)
Let’s remove the 3rd and 4th rows to prepare the dataset for this section.
gm_tidy <- gm_tidy[-c(3, 4), ]
gm_tidy
#> # A tibble: 7 × 3
#> country year life_expectancy
#> <chr> <dbl> <dbl>
#> 1 China 2004 73
#> 2 Russia 2004 65
#> 3 China 2006 74.2
#> 4 Russia 2005 66.5
#> 5 Russia 2006 67.5
#> 6 United States 2005 78
#> 7 United States 2006 78.2
In this tidy format, we don’t see any NA values, yet some country-year combinations are simply absent from the table. In other words, the missing values are implicit. Let’s convert the data into a wide format and see what happens.
gm_wide <- gm_tidy %>%
  pivot_wider(names_from = year, values_from = life_expectancy)
gm_wide
#> # A tibble: 3 × 4
#> country `2004` `2006` `2005`
#> <chr> <dbl> <dbl> <dbl>
#> 1 China 73 74.2 NA
#> 2 Russia 65 67.5 66.5
#> 3 United States NA 78.2 78
From this output, we can see that NA values are introduced when we convert the tidy data into a wide format. This is because the gm_tidy dataset has no rows for China in 2005 or for the United States in 2004, so pivot_wider() fills the corresponding cells with NA. Note also that the year columns appear in the order in which the years first occur in gm_tidy, which is why 2006 comes before 2005.
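The same gaps can be found without reshaping the data. The following is a minimal sketch, not part of the original example: it retypes the seven rows of gm_tidy by hand into a tibble we call gm_tidy_demo (a name made up for this illustration) so that it runs on its own, counts the rows per country, and then lists the country-year combinations that never appear.

```r
library(dplyr)
library(tidyr)
library(tibble)

# The seven rows of gm_tidy shown above, typed in by hand for illustration
gm_tidy_demo <- tribble(
  ~country,        ~year, ~life_expectancy,
  "China",          2004, 73,
  "Russia",         2004, 65,
  "China",          2006, 74.2,
  "Russia",         2005, 66.5,
  "Russia",         2006, 67.5,
  "United States",  2005, 78,
  "United States",  2006, 78.2
)

# With three years of data, a complete country would have 3 rows
rows_per_country <- gm_tidy_demo %>%
  count(country, name = "n_years")
rows_per_country

# Build every country-year combination, then keep those absent from the data
missing_combos <- gm_tidy_demo %>%
  expand(country, year) %>%
  anti_join(gm_tidy_demo, by = c("country", "year"))
missing_combos
```

Here expand() generates all combinations of the observed countries and years, and anti_join() keeps only the combinations that have no matching row in the data.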
Let’s try to convert gm_wide back to a tidy format.
gm_tidy2 <- gm_wide %>%
  pivot_longer(cols = -1, names_to = "year", values_to = "life_expectancy")
gm_tidy2
#> # A tibble: 9 × 3
#> country year life_expectancy
#> <chr> <chr> <dbl>
#> 1 China 2004 73
#> 2 China 2006 74.2
#> 3 China 2005 NA
#> 4 Russia 2004 65
#> 5 Russia 2006 67.5
#> 6 Russia 2005 66.5
#> 7 United States 2004 NA
#> 8 United States 2006 78.2
#> 9 United States 2005 78
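Now the missing values are explicit in the tidy format: China in 2005 and the United States in 2004 each have a row whose life_expectancy is NA. Note also that year is now a character column (<chr>), because pivot_longer() builds it from the column names of gm_wide.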
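The round trip through a wide format is not the only way to make missing values explicit. As a hedged alternative sketch, the complete() function from tidyr adds a row for every combination of the listed columns and fills the new cells with NA. The tibble gm_tidy_demo below is a hand-typed copy of gm_tidy, introduced only so the example is self-contained.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hand-typed copy of the seven-row gm_tidy, for illustration only
gm_tidy_demo <- tribble(
  ~country,        ~year, ~life_expectancy,
  "China",          2004, 73,
  "Russia",         2004, 65,
  "China",          2006, 74.2,
  "Russia",         2005, 66.5,
  "Russia",         2006, 67.5,
  "United States",  2005, 78,
  "United States",  2006, 78.2
)

# Add a row for every country-year pair; absent pairs get NA
gm_complete <- gm_tidy_demo %>%
  complete(country, year)
gm_complete

# Two cells should now be explicitly missing
sum(is.na(gm_complete$life_expectancy))
```

Unlike the pivot_wider() and pivot_longer() round trip, this approach never turns year into column names, so year keeps its numeric type.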
8.3.2 Filling Missing Values
When working with missing values, it is important to decide how to handle them. One common approach is to fill them with a specific value. In R, we can use the replace_na() function from the tidyr package to do this. Let’s fill the missing values in the gm_tidy2 dataset with the average life expectancy for each country, computed from that country’s non-missing years.
gm_tidy2_filled <- gm_tidy2 %>%
  group_by(country) %>%
  mutate(
    life_expectancy = replace_na(
      life_expectancy,
      mean(life_expectancy, na.rm = TRUE)
    )
  )
gm_tidy2_filled
#> # A tibble: 9 × 3
#> # Groups: country [3]
#> country year life_expectancy
#> <chr> <chr> <dbl>
#> 1 China 2004 73
#> 2 China 2006 74.2
#> 3 China 2005 73.6
#> 4 Russia 2004 65
#> 5 Russia 2006 67.5
#> 6 Russia 2005 66.5
#> 7 United States 2004 78.1
#> 8 United States 2006 78.2
#> 9 United States 2005 78
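Filling with the group mean is only one possible choice. As a rough sketch of a different technique, the fill() function from tidyr copies the nearest observed value within each group instead of averaging. The tibble gm_tidy2_demo below is a hand-typed copy of gm_tidy2, and both it and the name gm_filled_nearest are introduced here purely for illustration.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hand-typed copy of gm_tidy2 (year is character, as in the output above)
gm_tidy2_demo <- tribble(
  ~country,        ~year,  ~life_expectancy,
  "China",         "2004", 73,
  "China",         "2006", 74.2,
  "China",         "2005", NA_real_,
  "Russia",        "2004", 65,
  "Russia",        "2006", 67.5,
  "Russia",        "2005", 66.5,
  "United States", "2004", NA_real_,
  "United States", "2006", 78.2,
  "United States", "2005", 78
)

gm_filled_nearest <- gm_tidy2_demo %>%
  arrange(country, year) %>%              # put years in order within country
  group_by(country) %>%
  # "downup": carry the previous year's value forward first,
  # then fill any leading gaps from the following year
  fill(life_expectancy, .direction = "downup") %>%
  ungroup()                               # drop grouping for later steps
gm_filled_nearest
```

With this approach, China’s 2005 value is taken from 2004, and the United States’ 2004 value is taken from 2005. The final ungroup() removes the grouping so that later operations are not silently applied per country.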