8.1 Convert Between Names and Values
First, let’s prepare a subset of the gm data in the r02pro package.
library(r02pro)
library(tidyverse)
gm_tidy <- gm %>%
filter(country %in% c("United States", "China", "Russia")) %>%
select(country, year, life_expectancy) %>%
filter(year >= 2004 & year <= 2006)
gm_tidy
#> # A tibble: 9 × 3
#> country year life_expectancy
#> <chr> <dbl> <dbl>
#> 1 China 2004 73
#> 2 Russia 2004 65
#> 3 United States 2004 77.7
#> 4 China 2005 73.7
#> 5 China 2006 74.2
#> 6 Russia 2005 66.5
#> 7 Russia 2006 67.5
#> 8 United States 2005 78
#> 9 United States 2006 78.2From the code, you can see that gm_tidy is a tibble that contains the life expectancy of the United States, China, and Russia from 2004 to 2006. This is a tidy format.
8.1.1 Convert Values into Column Names
In gm_tidy, each row contains the year when the life expectancy measurement was taken. Suppose we want to convert the year value into column names. You can use the pivot_wider() function.
In pivot_wider(), you need to specify two arguments:
names_fromdenotes which column in the original tibble contains the values of the new column names,values_fromdenotes which column in the original tibble contains the values for each cell in the new tibble.
gm_wide <- gm_tidy %>%
pivot_wider(names_from = year, values_from = life_expectancy)
gm_wide
#> # A tibble: 3 × 4
#> country `2004` `2005` `2006`
#> <chr> <dbl> <dbl> <dbl>
#> 1 China 73 73.7 74.2
#> 2 Russia 65 66.5 67.5
#> 3 United States 77.7 78 78.2From the resulting tibble gm_wide, you can see that the year variable is converted into column names. The life_expectancy variable contains the corresponding values. The gm_wide is now in a wide format. During the process, the year information is lost, which is not desirable. Fortunately, you can add a prefix “year” to the column names via an argument names_prefix in the pivot_wider() function.
gm_wide_year <- gm_tidy %>%
pivot_wider(names_from = year, names_prefix = "year", values_from = life_expectancy)
gm_wide_year
#> # A tibble: 3 × 4
#> country year2004 year2005 year2006
#> <chr> <dbl> <dbl> <dbl>
#> 1 China 73 73.7 74.2
#> 2 Russia 65 66.5 67.5
#> 3 United States 77.7 78 78.2In the resulting tibble gm_wide_year, you can see that the column names are prefixed with “year”. The year information is now reflected in the names of the tibble.
The reason why we call the function pivot_wider() is that it will create a wider dataset than the original one, containing more columns.
8.1.2 Convert Column Names into Values
Apparently, both gm_wide and gm_wide_year are not tidy. Now, let’s make them tidy using the pivot_longer() function.
gm_wide %>%
pivot_longer(cols = -1, names_to = "year", values_to = "life_expectancy")
#> # A tibble: 9 × 3
#> country year life_expectancy
#> <chr> <chr> <dbl>
#> 1 China 2004 73
#> 2 China 2005 73.7
#> 3 China 2006 74.2
#> 4 Russia 2004 65
#> 5 Russia 2005 66.5
#> 6 Russia 2006 67.5
#> 7 United States 2004 77.7
#> 8 United States 2005 78
#> 9 United States 2006 78.2In the pivot_longer() function,
colsspecifies the column names that you want to convert from, which accept the same format as that indplyr::select()introduced in Section 7.2.names_tospecifies the variable name you want to use for the column names.values_tospecifies the variable name for holding the values in the selected columns.
In the resulting tibble, col = -1 means that all columns except the first column are selected. The column names 2004, 2005, and 2006 are converted into values of the variable year. The life_expectancy variable contains the corresponding values. We have recovered the tidy gm_tidy through the tidy process.
The reason why we call the function pivot_longer() is that it will create a longer dataset than the original one, containing more rows.
Now, let’s tidy gm_wide_year using the pivot_longer() function.
gm_wide_year %>%
pivot_longer(cols = -1, names_to = "year", names_prefix = "year", values_to = "life_expectancy")
#> # A tibble: 9 × 3
#> country year life_expectancy
#> <chr> <chr> <dbl>
#> 1 China 2004 73
#> 2 China 2005 73.7
#> 3 China 2006 74.2
#> 4 Russia 2004 65
#> 5 Russia 2005 66.5
#> 6 Russia 2006 67.5
#> 7 United States 2004 77.7
#> 8 United States 2005 78
#> 9 United States 2006 78.2Note that we have an extra argument names_prefix in the pivot_longer() function. This argument is used to remove the prefix “year” from the column names.
Comparing the pivot_wider() and pivot_longer() functions, you can see that they are complementary. The pivot_wider() function converts values into column names, while the pivot_longer() function converts column names into values.
8.1.3 Multiple Variables
In the gm dataset, there are multiple variables that you may want to convert into column names. You can specify multiple variables in the names_from argument in the pivot_wider() function.
gm_tidy <- gm %>%
filter(country %in% c("United States", "China", "Russia")) %>%
select(country, year, life_expectancy, GDP_per_capita) %>%
filter(year >= 2004 & year <= 2006)
gm_tidy
#> # A tibble: 9 × 4
#> country year life_expectancy GDP_per_capita
#> <chr> <dbl> <dbl> <dbl>
#> 1 China 2004 73 3.39
#> 2 Russia 2004 65 8.93
#> 3 United States 2004 77.7 52.8
#> 4 China 2005 73.7 3.8
#> 5 China 2006 74.2 4.32
#> 6 Russia 2005 66.5 9.69
#> 7 Russia 2006 67.5 10.5
#> 8 United States 2005 78 53.8
#> 9 United States 2006 78.2 54.3
gm_wide <- gm_tidy %>%
pivot_wider(names_from = c(country, year), values_from = c(life_expectancy, GDP_per_capita),
names_sep = "-")
gm_wide
#> # A tibble: 1 × 18
#> `life_expectancy-China-2004` life_expectancy-Russia-2…¹ life_expectancy-Unit…²
#> <dbl> <dbl> <dbl>
#> 1 73 65 77.7
#> # ℹ abbreviated names: ¹`life_expectancy-Russia-2004`,
#> # ²`life_expectancy-United States-2004`
#> # ℹ 15 more variables: `life_expectancy-China-2005` <dbl>,
#> # `life_expectancy-China-2006` <dbl>, `life_expectancy-Russia-2005` <dbl>,
#> # `life_expectancy-Russia-2006` <dbl>,
#> # `life_expectancy-United States-2005` <dbl>,
#> # `life_expectancy-United States-2006` <dbl>, …In the pivot_wider() function, you can specify multiple variables in the names_from and values_from arguments. The resulting tibble gm_wide contains the life expectancy and GDP per capita of the United States, China, and Russia from 2004 to 2006. The column names are in the format of country_year_variable. Note that we have an extra argument names_sep in the pivot_wider() function. This argument is used to separate the column names with a hyphen "-". We didn’t use the separator "_" because it is already used in the column names of the original tibble.
To make the tibble tidy, you can use the pivot_longer() function.
gm_wide %>%
pivot_longer(cols = everything(), names_to = c(".value", "country", "year"),
names_sep = "-")
#> # A tibble: 9 × 4
#> country year life_expectancy GDP_per_capita
#> <chr> <chr> <dbl> <dbl>
#> 1 China 2004 73 3.39
#> 2 Russia 2004 65 8.93
#> 3 United States 2004 77.7 52.8
#> 4 China 2005 73.7 3.8
#> 5 China 2006 74.2 4.32
#> 6 Russia 2005 66.5 9.69
#> 7 Russia 2006 67.5 10.5
#> 8 United States 2005 78 53.8
#> 9 United States 2006 78.2 54.3- Here, we use the
everything()function to select all columns. - The
names_toargument specifies the variable names for the column names. The".value"is a special value that tells the function to keep the values in the selected columns. Thecountryandyearvariables are used to store the country and year values in the column names. - The
names_separgument specifies the separator used in the column names.
Now, let’s see another example.
gm_gender <- gm %>%
filter(country %in% c("United States", "China", "Russia")) %>%
filter(year == 2004) %>%
select(country, ends_with("male"))
gm_gender
#> # A tibble: 3 × 15
#> country smoking_female smoking_male lungcancer_newcases_female
#> <chr> <dbl> <dbl> <dbl>
#> 1 China 3.7 59.5 23
#> 2 Russia 26.5 70.1 7.69
#> 3 United States 21.5 26.3 44.1
#> # ℹ 11 more variables: lungcancer_newcases_male <dbl>, alcohol_male <dbl>,
#> # alcohol_female <dbl>, livercancer_newcases_male <dbl>,
#> # livercancer_newcases_female <dbl>, mortality_male <dbl>,
#> # mortality_female <dbl>, cholesterol_fat_in_blood_male <dbl>,
#> # cholesterol_fat_in_blood_female <dbl>, BMI_female <dbl>, BMI_male <dbl>Here, we have selected the columns that end with male.
One tricky part is that some columns contain more than one underscores, making the pattern a bit more complicated.
We want to convert the column names into values.
gm_gender %>%
pivot_longer(cols = -1, names_to = c(".value", "gender"), names_pattern = "(.+)_(?!.*_)(.+)")
#> # A tibble: 6 × 9
#> country gender smoking lungcancer_newcases alcohol livercancer_newcases
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 China female 3.7 23 1.63 17.2
#> 2 China male 59.5 57.5 6.32 48.3
#> 3 Russia female 26.5 7.69 7.97 3.45
#> 4 Russia male 70.1 69.1 28.5 7.34
#> 5 United States female 21.5 44.1 4.46 3.07
#> 6 United States male 26.3 67.5 14.7 8.59
#> # ℹ 3 more variables: mortality <dbl>, cholesterol_fat_in_blood <dbl>,
#> # BMI <dbl>In the pivot_longer() function, the names_to argument specifies the variable names for the column names. Here, the ".value" is a special value that tells the function to keep the values in the selected columns. The gender variable is used to store the values of the column names. The names_pattern argument specifies the pattern of the column names. The (.+)_(?!.*_)(.+) regular expression uses a negative lookahead assertion to identify the last underscore. In particular,
(.+): Captures everything before the last underscore._(?!.*_): Matches an underscore only if it’s not followed by another underscore.(.+): Captures everything after the last underscore.
8.1.4 Exercises
- You are given the following dataset.
df <- tibble(country = c("A", "A", "B", "B"), year = c(2000, 2001, 2000, 2001), life_expectancy = c(70,
71, 75, 76), infant_mortality = c(5, 4.5, 3, 2.8))Create a wide-format dataset where:
- Each row represents a country.
- Columns represent combinations of year and a demographic variable (e.g., ‘2000_life_expectancy’, ‘2000_infant_mortality’, ‘2001_life_expectancy’, etc.).
Use the wide-format dataset created in Exercise 1, transform the data back into a tidy format.
You are given the following dataset.
df <- tibble(country = c("A", "B", "C"), life_expectancy_male = c(72, 78, 75), life_expectancy_female = c(78,
83, 80))Tidy the dataset so that it contains the following columns: country, gender, and life_expectancy.
- You are given the following dataset.
df <- tibble(group = c("A", "B"), q1_pre = c(10, 12), q1_post = c(15, 18), q2_pre = c(7,
9), q2_post = c(11, 13))Tidy the dataset so that it contains the following columns: group, question, time, and score, where question is either q1 or q2, time is either pre or post, and score is the corresponding value.