8.1 Convert Between Names and Values

First, let’s prepare a subset of the gm data in the r02pro package.

library(r02pro)
library(tidyverse)
gm_tidy <- gm %>%
    filter(country %in% c("United States", "China", "Russia")) %>%
    select(country, year, life_expectancy) %>%
    filter(year >= 2004 & year <= 2006)
gm_tidy
#> # A tibble: 9 × 3
#>   country        year life_expectancy
#>   <chr>         <dbl>           <dbl>
#> 1 China          2004            73  
#> 2 Russia         2004            65  
#> 3 United States  2004            77.7
#> 4 China          2005            73.7
#> 5 China          2006            74.2
#> 6 Russia         2005            66.5
#> 7 Russia         2006            67.5
#> 8 United States  2005            78  
#> 9 United States  2006            78.2

From the code, you can see that gm_tidy is a tibble that contains the life expectancy of the United States, China, and Russia from 2004 to 2006. This is a tidy format.

8.1.1 Convert Values into Column Names

In gm_tidy, each row contains the year when the life expectancy measurement was taken. Suppose we want to convert the year value into column names. You can use the pivot_wider() function.

In pivot_wider(), you need to specify two arguments:

  • names_from denotes which column in the original tibble contains the values of the new column names,
  • values_from denotes which column in the original tibble contains the values for each cell in the new tibble.
gm_wide <- gm_tidy %>%
    pivot_wider(names_from = year, values_from = life_expectancy)
gm_wide
#> # A tibble: 3 × 4
#>   country       `2004` `2005` `2006`
#>   <chr>          <dbl>  <dbl>  <dbl>
#> 1 China           73     73.7   74.2
#> 2 Russia          65     66.5   67.5
#> 3 United States   77.7   78     78.2

From the resulting tibble gm_wide, you can see that the year variable is converted into column names. The life_expectancy variable contains the corresponding values. The gm_wide is now in a wide format. During the process, the year information is lost, which is not desirable. Fortunately, you can add a prefix “year” to the column names via an argument names_prefix in the pivot_wider() function.

gm_wide_year <- gm_tidy %>%
    pivot_wider(names_from = year, names_prefix = "year", values_from = life_expectancy)
gm_wide_year
#> # A tibble: 3 × 4
#>   country       year2004 year2005 year2006
#>   <chr>            <dbl>    <dbl>    <dbl>
#> 1 China             73       73.7     74.2
#> 2 Russia            65       66.5     67.5
#> 3 United States     77.7     78       78.2

In the resulting tibble gm_wide_year, you can see that the column names are prefixed with “year”. The year information is now reflected in the names of the tibble.

The reason why we call the function pivot_wider() is that it will create a wider dataset than the original one, containing more columns.

8.1.2 Convert Column Names into Values

Apparently, both gm_wide and gm_wide_year are not tidy. Now, let’s make them tidy using the pivot_longer() function.

gm_wide %>%
    pivot_longer(cols = -1, names_to = "year", values_to = "life_expectancy")
#> # A tibble: 9 × 3
#>   country       year  life_expectancy
#>   <chr>         <chr>           <dbl>
#> 1 China         2004             73  
#> 2 China         2005             73.7
#> 3 China         2006             74.2
#> 4 Russia        2004             65  
#> 5 Russia        2005             66.5
#> 6 Russia        2006             67.5
#> 7 United States 2004             77.7
#> 8 United States 2005             78  
#> 9 United States 2006             78.2

In the pivot_longer() function,

  • cols specifies the column names that you want to convert from, which accept the same format as that in dplyr::select() introduced in Section 7.2.
  • names_to specifies the variable name you want to use for the column names.
  • values_to specifies the variable name for holding the values in the selected columns.

In the resulting tibble, col = -1 means that all columns except the first column are selected. The column names 2004, 2005, and 2006 are converted into values of the variable year. The life_expectancy variable contains the corresponding values. We have recovered the tidy gm_tidy through the tidy process.

The reason why we call the function pivot_longer() is that it will create a longer dataset than the original one, containing more rows.

Now, let’s tidy gm_wide_year using the pivot_longer() function.

gm_wide_year %>%
    pivot_longer(cols = -1, names_to = "year", names_prefix = "year", values_to = "life_expectancy")
#> # A tibble: 9 × 3
#>   country       year  life_expectancy
#>   <chr>         <chr>           <dbl>
#> 1 China         2004             73  
#> 2 China         2005             73.7
#> 3 China         2006             74.2
#> 4 Russia        2004             65  
#> 5 Russia        2005             66.5
#> 6 Russia        2006             67.5
#> 7 United States 2004             77.7
#> 8 United States 2005             78  
#> 9 United States 2006             78.2

Note that we have an extra argument names_prefix in the pivot_longer() function. This argument is used to remove the prefix “year” from the column names.

Comparing the pivot_wider() and pivot_longer() functions, you can see that they are complementary. The pivot_wider() function converts values into column names, while the pivot_longer() function converts column names into values.

8.1.3 Multiple Variables

In the gm dataset, there are multiple variables that you may want to convert into column names. You can specify multiple variables in the names_from argument in the pivot_wider() function.

gm_tidy <- gm %>%
    filter(country %in% c("United States", "China", "Russia")) %>%
    select(country, year, life_expectancy, GDP_per_capita) %>%
    filter(year >= 2004 & year <= 2006)
gm_tidy
#> # A tibble: 9 × 4
#>   country        year life_expectancy GDP_per_capita
#>   <chr>         <dbl>           <dbl>          <dbl>
#> 1 China          2004            73             3.39
#> 2 Russia         2004            65             8.93
#> 3 United States  2004            77.7          52.8 
#> 4 China          2005            73.7           3.8 
#> 5 China          2006            74.2           4.32
#> 6 Russia         2005            66.5           9.69
#> 7 Russia         2006            67.5          10.5 
#> 8 United States  2005            78            53.8 
#> 9 United States  2006            78.2          54.3
gm_wide <- gm_tidy %>%
    pivot_wider(names_from = c(country, year), values_from = c(life_expectancy, GDP_per_capita),
        names_sep = "-")
gm_wide
#> # A tibble: 1 × 18
#>   life_expecta…¹ life_…² life_…³ life_…⁴ life_…⁵ life_…⁶ life_…⁷ life_…⁸ life_…⁹
#>            <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
#> 1             73      65    77.7    73.7    74.2    66.5    67.5      78    78.2
#> # … with 9 more variables: `GDP_per_capita-China-2004` <dbl>,
#> #   `GDP_per_capita-Russia-2004` <dbl>,
#> #   `GDP_per_capita-United States-2004` <dbl>,
#> #   `GDP_per_capita-China-2005` <dbl>, `GDP_per_capita-China-2006` <dbl>,
#> #   `GDP_per_capita-Russia-2005` <dbl>, `GDP_per_capita-Russia-2006` <dbl>,
#> #   `GDP_per_capita-United States-2005` <dbl>,
#> #   `GDP_per_capita-United States-2006` <dbl>, and abbreviated variable names …

In the pivot_wider() function, you can specify multiple variables in the names_from and values_from arguments. The resulting tibble gm_wide contains the life expectancy and GDP per capita of the United States, China, and Russia from 2004 to 2006. The column names are in the format of country_year_variable. Note that we have an extra argument names_sep in the pivot_wider() function. This argument is used to separate the column names with a hyphen "-". We didn’t use the separator "_" because it is already used in the column names of the original tibble.

To make the tibble tidy, you can use the pivot_longer() function.

gm_wide %>%
    pivot_longer(cols = everything(), names_to = c(".value", "country", "year"),
        names_sep = "-")
#> # A tibble: 9 × 4
#>   country       year  life_expectancy GDP_per_capita
#>   <chr>         <chr>           <dbl>          <dbl>
#> 1 China         2004             73             3.39
#> 2 Russia        2004             65             8.93
#> 3 United States 2004             77.7          52.8 
#> 4 China         2005             73.7           3.8 
#> 5 China         2006             74.2           4.32
#> 6 Russia        2005             66.5           9.69
#> 7 Russia        2006             67.5          10.5 
#> 8 United States 2005             78            53.8 
#> 9 United States 2006             78.2          54.3
  • Here, we use the everything() function to select all columns.
  • The names_to argument specifies the variable names for the column names. The ".value" is a special value that tells the function to keep the values in the selected columns. The country and year variables are used to store the country and year values in the column names.
  • The names_sep argument specifies the separator used in the column names.

Now, let’s see another example.

gm_gender <- gm %>%
    filter(country %in% c("United States", "China", "Russia")) %>%
    filter(year == 2004) %>%
    select(country, ends_with("male"))
gm_gender
#> # A tibble: 3 × 15
#>   country       smokin…¹ smoki…² lungc…³ lungc…⁴ alcoh…⁵ alcoh…⁶ liver…⁷ liver…⁸
#>   <chr>            <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
#> 1 China              3.7    59.5   23       57.5    6.32    1.63   48.3    17.2 
#> 2 Russia            26.5    70.1    7.69    69.1   28.5     7.97    7.34    3.45
#> 3 United States     21.5    26.3   44.1     67.5   14.7     4.46    8.59    3.07
#> # … with 6 more variables: mortality_male <dbl>, mortality_female <dbl>,
#> #   cholesterol_fat_in_blood_male <dbl>, cholesterol_fat_in_blood_female <dbl>,
#> #   BMI_female <dbl>, BMI_male <dbl>, and abbreviated variable names
#> #   ¹​smoking_female, ²​smoking_male, ³​lungcancer_newcases_female,
#> #   ⁴​lungcancer_newcases_male, ⁵​alcohol_male, ⁶​alcohol_female,
#> #   ⁷​livercancer_newcases_male, ⁸​livercancer_newcases_female

Here, we have selected the columns that end with male. One tricky part is that some columns contain more than one underscores, making the pattern a bit more complicated.

We want to convert the column names into values.

gm_gender %>%
    pivot_longer(cols = -1, names_to = c(".value", "gender"), names_pattern = "(.+)_(?!.*_)(.+)")
#> # A tibble: 6 × 9
#>   country       gender smoking lungcance…¹ alcohol liver…² morta…³ chole…⁴   BMI
#>   <chr>         <chr>    <dbl>       <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <dbl>
#> 1 China         female     3.7       23       1.63   17.2     16.9    4.53  22.8
#> 2 China         male      59.5       57.5     6.32   48.3     26.1    4.44  22.8
#> 3 Russia        female    26.5        7.69    7.97    3.45    24.2    5.11  26.9
#> 4 Russia        male      70.1       69.1    28.5     7.34    51.9    4.93  25.6
#> 5 United States female    21.5       44.1     4.46    3.07    13.1    5.22  28.1
#> 6 United States male      26.3       67.5    14.7     8.59    19.3    5.16  28.2
#> # … with abbreviated variable names ¹​lungcancer_newcases,
#> #   ²​livercancer_newcases, ³​mortality, ⁴​cholesterol_fat_in_blood

In the pivot_longer() function, the names_to argument specifies the variable names for the column names. Here, the ".value" is a special value that tells the function to keep the values in the selected columns. The gender variable is used to store the values of the column names. The names_pattern argument specifies the pattern of the column names. The (.+)_(?!.*_)(.+) regular expression uses a negative lookahead assertion to identify the last underscore. In particular,

  • (.+): Captures everything before the last underscore.
  • _(?!.*_): Matches an underscore only if it’s not followed by another underscore.
  • (.+): Captures everything after the last underscore.

8.1.4 Exercises

  1. You are given the following dataset.
df <- tibble(country = c("A", "A", "B", "B"), year = c(2000, 2001, 2000, 2001), life_expectancy = c(70,
    71, 75, 76), infant_mortality = c(5, 4.5, 3, 2.8))

Create a wide-format dataset where:

  • Each row represents a country.
  • Columns represent combinations of year and a demographic variable (e.g., ‘2000_life_expectancy’, ‘2000_infant_mortality’, ‘2001_life_expectancy’, etc.).
  1. Use the wide-format dataset created in Exercise 1, transform the data back into a tidy format.

  2. You are given the following dataset.

df <- tibble(country = c("A", "B", "C"), life_expectancy_male = c(72, 78, 75), life_expectancy_female = c(78,
    83, 80))

Tidy the dataset so that it contains the following columns: country, gender, and life_expectancy.

  1. You are given the following dataset.
df <- tibble(group = c("A", "B"), q1_pre = c(10, 12), q1_post = c(15, 18), q2_pre = c(7,
    9), q2_post = c(11, 13))

Tidy the dataset so that it contains the following columns: group, question, time, and score, where question is either q1 or q2, time is either pre or post, and score is the corresponding value.