8.3 More on Missing Values

In Section 2.15, we have introduced the concept of missing values and how to detect them. In this section, we will discuss missingness related to tidy data.

8.3.1 Missing Values in Tidy Data

In tidy data, missing values are represented as NA. In R, NA stands for “Not Available” and is used to represent missing values. When working with tidy data, it is important to understand how missing values are handled in R. Let’s revisit the following dataset from Section 8.1.

library(r02pro)
library(tidyverse)
gm_tidy <- gm %>%
    filter(country %in% c("United States", "China", "Russia")) %>%
    select(country, year, life_expectancy) %>%
    filter(year >= 2004 & year <= 2006)

Let’s remove the 3rd and 4th rows to prepare the dataset for this section.

gm_tidy <- gm_tidy[-c(3, 4), ]
gm_tidy
#> # A tibble: 7 × 3
#>   country        year life_expectancy
#>   <chr>         <dbl>           <dbl>
#> 1 China          2004            73  
#> 2 Russia         2004            65  
#> 3 China          2006            74.2
#> 4 Russia         2005            66.5
#> 5 Russia         2006            67.5
#> 6 United States  2005            78  
#> 7 United States  2006            78.2

In this tidy format, we actually don’t see any missing values. Let’s try to convert it into a wide format and see what happens.

gm_wide <- gm_tidy %>%
    pivot_wider(names_from = year, values_from = life_expectancy)
gm_wide
#> # A tibble: 3 × 4
#>   country       `2004` `2006` `2005`
#>   <chr>          <dbl>  <dbl>  <dbl>
#> 1 China             73   74.2   NA  
#> 2 Russia            65   67.5   66.5
#> 3 United States     NA   78.2   78

From this output, we can see that the NA values are introduced when we convert the tidy data into a wide format. This is because the gm_tidy dataset does not have data for China in the year 2005 or for United States in the year 2004. When we convert it into a wide format, the NA values are introduced to represent the missing data.

Let’s try to convert gm_wide back to a tidy format.

gm_tidy2 <- gm_wide %>%
    pivot_longer(cols = -1, names_to = "year", values_to = "life_expectancy")
gm_tidy2
#> # A tibble: 9 × 3
#>   country       year  life_expectancy
#>   <chr>         <chr>           <dbl>
#> 1 China         2004             73  
#> 2 China         2006             74.2
#> 3 China         2005             NA  
#> 4 Russia        2004             65  
#> 5 Russia        2006             67.5
#> 6 Russia        2005             66.5
#> 7 United States 2004             NA  
#> 8 United States 2006             78.2
#> 9 United States 2005             78

Now, we have the missing values in the tidy format.

8.3.2 Filling Missing Values

When working with missing values, it is important to decide how to handle them. One common approach is to fill the missing values with a specific value. In R, we can use the replace_na() function from the tidyr package to fill missing values. Let’s fill the missing values in the gm_tidy2 dataset with the average life expectancy for each country.

gm_tidy2_filled <- gm_tidy2 %>%
    group_by(country) %>%
    mutate(life_expectancy = replace_na(life_expectancy, mean(life_expectancy, na.rm = TRUE)))
gm_tidy2_filled
#> # A tibble: 9 × 3
#> # Groups:   country [3]
#>   country       year  life_expectancy
#>   <chr>         <chr>           <dbl>
#> 1 China         2004             73  
#> 2 China         2006             74.2
#> 3 China         2005             73.6
#> 4 Russia        2004             65  
#> 5 Russia        2006             67.5
#> 6 Russia        2005             66.5
#> 7 United States 2004             78.1
#> 8 United States 2006             78.2
#> 9 United States 2005             78