8.1 Convert Between Names and Values
First, let’s prepare a subset of the gm
data in the r02pro package.
library(r02pro)
library(tidyverse)
gm_tidy <- gm %>%
filter(country %in% c("United States", "China", "Russia")) %>%
select(country, year, life_expectancy) %>%
filter(year >= 2004 & year <= 2006)
gm_tidy
#> # A tibble: 9 × 3
#> country year life_expectancy
#> <chr> <dbl> <dbl>
#> 1 China 2004 73
#> 2 Russia 2004 65
#> 3 United States 2004 77.7
#> 4 China 2005 73.7
#> 5 China 2006 74.2
#> 6 Russia 2005 66.5
#> 7 Russia 2006 67.5
#> 8 United States 2005 78
#> 9 United States 2006 78.2
From the code, you can see that gm_tidy
is a tibble that contains the life expectancy of the United States, China, and Russia from 2004 to 2006. This is a tidy format.
8.1.1 Convert Values into Column Names
In gm_tidy
, each row contains the year when the life expectancy measurement was taken. Suppose we want to convert the year value into column names. You can use the pivot_wider()
function.
In pivot_wider()
, you need to specify two arguments:
names_from
denotes which column in the original tibble contains the values of the new column names,values_from
denotes which column in the original tibble contains the values for each cell in the new tibble.
gm_wide <- gm_tidy %>%
pivot_wider(names_from = year, values_from = life_expectancy)
gm_wide
#> # A tibble: 3 × 4
#> country `2004` `2005` `2006`
#> <chr> <dbl> <dbl> <dbl>
#> 1 China 73 73.7 74.2
#> 2 Russia 65 66.5 67.5
#> 3 United States 77.7 78 78.2
From the resulting tibble gm_wide
, you can see that the year
variable is converted into column names. The life_expectancy
variable contains the corresponding values. The gm_wide
is now in a wide format. During the process, the year
information is lost, which is not desirable. Fortunately, you can add a prefix “year” to the column names via an argument names_prefix
in the pivot_wider()
function.
gm_wide_year <- gm_tidy %>%
pivot_wider(names_from = year, names_prefix = "year", values_from = life_expectancy)
gm_wide_year
#> # A tibble: 3 × 4
#> country year2004 year2005 year2006
#> <chr> <dbl> <dbl> <dbl>
#> 1 China 73 73.7 74.2
#> 2 Russia 65 66.5 67.5
#> 3 United States 77.7 78 78.2
In the resulting tibble gm_wide_year
, you can see that the column names are prefixed with “year”. The year
information is now reflected in the names of the tibble.
The reason why we call the function pivot_wider()
is that it will create a wider dataset than the original one, containing more columns.
8.1.2 Convert Column Names into Values
Apparently, both gm_wide
and gm_wide_year
are not tidy. Now, let’s make them tidy using the pivot_longer()
function.
gm_wide %>%
pivot_longer(cols = -1, names_to = "year", values_to = "life_expectancy")
#> # A tibble: 9 × 3
#> country year life_expectancy
#> <chr> <chr> <dbl>
#> 1 China 2004 73
#> 2 China 2005 73.7
#> 3 China 2006 74.2
#> 4 Russia 2004 65
#> 5 Russia 2005 66.5
#> 6 Russia 2006 67.5
#> 7 United States 2004 77.7
#> 8 United States 2005 78
#> 9 United States 2006 78.2
In the pivot_longer()
function,
cols
specifies the column names that you want to convert from, which accept the same format as that indplyr::select()
introduced in Section 7.2.names_to
specifies the variable name you want to use for the column names.values_to
specifies the variable name for holding the values in the selected columns.
In the resulting tibble, col = -1
means that all columns except the first column are selected. The column names 2004
, 2005
, and 2006
are converted into values of the variable year
. The life_expectancy
variable contains the corresponding values. We have recovered the tidy gm_tidy
through the tidy process.
The reason why we call the function pivot_longer()
is that it will create a longer dataset than the original one, containing more rows.
Now, let’s tidy gm_wide_year
using the pivot_longer()
function.
gm_wide_year %>%
pivot_longer(cols = -1, names_to = "year", names_prefix = "year", values_to = "life_expectancy")
#> # A tibble: 9 × 3
#> country year life_expectancy
#> <chr> <chr> <dbl>
#> 1 China 2004 73
#> 2 China 2005 73.7
#> 3 China 2006 74.2
#> 4 Russia 2004 65
#> 5 Russia 2005 66.5
#> 6 Russia 2006 67.5
#> 7 United States 2004 77.7
#> 8 United States 2005 78
#> 9 United States 2006 78.2
Note that we have an extra argument names_prefix
in the pivot_longer()
function. This argument is used to remove the prefix “year” from the column names.
Comparing the pivot_wider()
and pivot_longer()
functions, you can see that they are complementary. The pivot_wider()
function converts values into column names, while the pivot_longer()
function converts column names into values.
8.1.3 Multiple Variables
In the gm
dataset, there are multiple variables that you may want to convert into column names. You can specify multiple variables in the names_from
argument in the pivot_wider()
function.
gm_tidy <- gm %>%
filter(country %in% c("United States", "China", "Russia")) %>%
select(country, year, life_expectancy, GDP_per_capita) %>%
filter(year >= 2004 & year <= 2006)
gm_tidy
#> # A tibble: 9 × 4
#> country year life_expectancy GDP_per_capita
#> <chr> <dbl> <dbl> <dbl>
#> 1 China 2004 73 3.39
#> 2 Russia 2004 65 8.93
#> 3 United States 2004 77.7 52.8
#> 4 China 2005 73.7 3.8
#> 5 China 2006 74.2 4.32
#> 6 Russia 2005 66.5 9.69
#> 7 Russia 2006 67.5 10.5
#> 8 United States 2005 78 53.8
#> 9 United States 2006 78.2 54.3
gm_wide <- gm_tidy %>%
pivot_wider(names_from = c(country, year), values_from = c(life_expectancy, GDP_per_capita),
names_sep = "-")
gm_wide
#> # A tibble: 1 × 18
#> life_expecta…¹ life_…² life_…³ life_…⁴ life_…⁵ life_…⁶ life_…⁷ life_…⁸ life_…⁹
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 73 65 77.7 73.7 74.2 66.5 67.5 78 78.2
#> # … with 9 more variables: `GDP_per_capita-China-2004` <dbl>,
#> # `GDP_per_capita-Russia-2004` <dbl>,
#> # `GDP_per_capita-United States-2004` <dbl>,
#> # `GDP_per_capita-China-2005` <dbl>, `GDP_per_capita-China-2006` <dbl>,
#> # `GDP_per_capita-Russia-2005` <dbl>, `GDP_per_capita-Russia-2006` <dbl>,
#> # `GDP_per_capita-United States-2005` <dbl>,
#> # `GDP_per_capita-United States-2006` <dbl>, and abbreviated variable names …
In the pivot_wider()
function, you can specify multiple variables in the names_from
and values_from
arguments. The resulting tibble gm_wide
contains the life expectancy and GDP per capita of the United States, China, and Russia from 2004 to 2006. The column names are in the format of country_year_variable
. Note that we have an extra argument names_sep
in the pivot_wider()
function. This argument is used to separate the column names with a hyphen "-"
. We didn’t use the separator "_"
because it is already used in the column names of the original tibble.
To make the tibble tidy, you can use the pivot_longer()
function.
gm_wide %>%
pivot_longer(cols = everything(), names_to = c(".value", "country", "year"),
names_sep = "-")
#> # A tibble: 9 × 4
#> country year life_expectancy GDP_per_capita
#> <chr> <chr> <dbl> <dbl>
#> 1 China 2004 73 3.39
#> 2 Russia 2004 65 8.93
#> 3 United States 2004 77.7 52.8
#> 4 China 2005 73.7 3.8
#> 5 China 2006 74.2 4.32
#> 6 Russia 2005 66.5 9.69
#> 7 Russia 2006 67.5 10.5
#> 8 United States 2005 78 53.8
#> 9 United States 2006 78.2 54.3
- Here, we use the
everything()
function to select all columns. - The
names_to
argument specifies the variable names for the column names. The".value"
is a special value that tells the function to keep the values in the selected columns. Thecountry
andyear
variables are used to store the country and year values in the column names. - The
names_sep
argument specifies the separator used in the column names.
Now, let’s see another example.
gm_gender <- gm %>%
filter(country %in% c("United States", "China", "Russia")) %>%
filter(year == 2004) %>%
select(country, ends_with("male"))
gm_gender
#> # A tibble: 3 × 15
#> country smokin…¹ smoki…² lungc…³ lungc…⁴ alcoh…⁵ alcoh…⁶ liver…⁷ liver…⁸
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 China 3.7 59.5 23 57.5 6.32 1.63 48.3 17.2
#> 2 Russia 26.5 70.1 7.69 69.1 28.5 7.97 7.34 3.45
#> 3 United States 21.5 26.3 44.1 67.5 14.7 4.46 8.59 3.07
#> # … with 6 more variables: mortality_male <dbl>, mortality_female <dbl>,
#> # cholesterol_fat_in_blood_male <dbl>, cholesterol_fat_in_blood_female <dbl>,
#> # BMI_female <dbl>, BMI_male <dbl>, and abbreviated variable names
#> # ¹smoking_female, ²smoking_male, ³lungcancer_newcases_female,
#> # ⁴lungcancer_newcases_male, ⁵alcohol_male, ⁶alcohol_female,
#> # ⁷livercancer_newcases_male, ⁸livercancer_newcases_female
Here, we have selected the columns that end with male
.
One tricky part is that some columns contain more than one underscores, making the pattern a bit more complicated.
We want to convert the column names into values.
gm_gender %>%
pivot_longer(cols = -1, names_to = c(".value", "gender"), names_pattern = "(.+)_(?!.*_)(.+)")
#> # A tibble: 6 × 9
#> country gender smoking lungcance…¹ alcohol liver…² morta…³ chole…⁴ BMI
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 China female 3.7 23 1.63 17.2 16.9 4.53 22.8
#> 2 China male 59.5 57.5 6.32 48.3 26.1 4.44 22.8
#> 3 Russia female 26.5 7.69 7.97 3.45 24.2 5.11 26.9
#> 4 Russia male 70.1 69.1 28.5 7.34 51.9 4.93 25.6
#> 5 United States female 21.5 44.1 4.46 3.07 13.1 5.22 28.1
#> 6 United States male 26.3 67.5 14.7 8.59 19.3 5.16 28.2
#> # … with abbreviated variable names ¹lungcancer_newcases,
#> # ²livercancer_newcases, ³mortality, ⁴cholesterol_fat_in_blood
In the pivot_longer()
function, the names_to
argument specifies the variable names for the column names. Here, the ".value"
is a special value that tells the function to keep the values in the selected columns. The gender
variable is used to store the values of the column names. The names_pattern
argument specifies the pattern of the column names. The (.+)_(?!.*_)(.+)
regular expression uses a negative lookahead assertion to identify the last underscore. In particular,
(.+)
: Captures everything before the last underscore._(?!.*_)
: Matches an underscore only if it’s not followed by another underscore.(.+)
: Captures everything after the last underscore.
8.1.4 Exercises
- You are given the following dataset.
df <- tibble(country = c("A", "A", "B", "B"), year = c(2000, 2001, 2000, 2001), life_expectancy = c(70,
71, 75, 76), infant_mortality = c(5, 4.5, 3, 2.8))
Create a wide-format dataset where:
- Each row represents a country.
- Columns represent combinations of year and a demographic variable (e.g., ‘2000_life_expectancy’, ‘2000_infant_mortality’, ‘2001_life_expectancy’, etc.).
Use the wide-format dataset created in Exercise 1, transform the data back into a tidy format.
You are given the following dataset.
df <- tibble(country = c("A", "B", "C"), life_expectancy_male = c(72, 78, 75), life_expectancy_female = c(78,
83, 80))
Tidy the dataset so that it contains the following columns: country
, gender
, and life_expectancy
.
- You are given the following dataset.
df <- tibble(group = c("A", "B"), q1_pre = c(10, 12), q1_post = c(15, 18), q2_pre = c(7,
9), q2_post = c(11, 13))
Tidy the dataset so that it contains the following columns: group
, question
, time
, and score
, where question
is either q1
or q2
, time
is either pre
or post
, and score
is the corresponding value.