8.2 Separate and Combine Columns via separate()
and unite()
Real-world data is often messy. Sometimes, valuable information is crammed into a single column, making it difficult to analyze. Other times, separate columns might need to be combined for specific tasks. The separate()
and unite()
functions in tidyr provide powerful tools to reshape your data and make it tidy for analysis. Imagine a dataset with a column like ‘City, State’. separate()
allows you to split this into separate 'City'
and 'State'
columns. Conversely, unite()
can combine 'First Name'
and 'Last Name'
columns into a single 'Full Name'
column.
To introduce the separate and and unite operation, let’s again consider the following tidy data set.
library(r02pro)
library(tidyverse)
gm_tidy <- gm %>%
filter(country %in% c("United States", "China", "Russia")) %>%
select(country, year, life_expectancy) %>%
filter(year >= 2004 & year <= 2006)
gm_tidy
#> # A tibble: 9 × 3
#> country year life_expectancy
#> <chr> <dbl> <dbl>
#> 1 China 2004 73
#> 2 Russia 2004 65
#> 3 United States 2004 77.7
#> 4 China 2005 73.7
#> 5 China 2006 74.2
#> 6 Russia 2005 66.5
#> 7 Russia 2006 67.5
#> 8 United States 2005 78
#> 9 United States 2006 78.2
8.2.1 Combine Columns via unite()
Sometimes, we may want to combine multiple columns into a single one. To do this, you can use the unite()
function in the tidyr package.
In the unite()
function, you need to specify the name of the to-be created column (col
), the columns to be combined, and the separator between the columns (sep
).
gm_year <- gm_tidy %>%
unite(col = "country_year", country, year, sep = "_")
gm_year
#> # A tibble: 9 × 2
#> country_year life_expectancy
#> <chr> <dbl>
#> 1 China_2004 73
#> 2 Russia_2004 65
#> 3 United States_2004 77.7
#> 4 China_2005 73.7
#> 5 China_2006 74.2
#> 6 Russia_2005 66.5
#> 7 Russia_2006 67.5
#> 8 United States_2005 78
#> 9 United States_2006 78.2
Now, the gm_year
tibble contains a new column country_year
, which is the combination of the country
and year
columns. The separator between the two columns is an underscore _
.
8.2.2 Separate One Column into Multiple Columns Using separate()
We know the unite()
function combine multiple columns into a single column. Sometimes, we need to do the opposite operation, namely separate one column into multiple ones.
For example, gm_year
is not a tidy dataset since the country_year
column contains two variables.
To make it tidy, you can use the separate()
function. In the separate()
function, you need to specify
col
: the name of the column to be separatedinto
: the desired names of the to-be created columns after separationsep
: the separator during the separation process
For the country_year
column in the gm_year
tibble, we can separate it into two columns: country
and year
.
gm_year %>%
separate(col = "country_year", into = c("country", "year"), sep = "_")
#> # A tibble: 9 × 3
#> country year life_expectancy
#> <chr> <chr> <dbl>
#> 1 China 2004 73
#> 2 Russia 2004 65
#> 3 United States 2004 77.7
#> 4 China 2005 73.7
#> 5 China 2006 74.2
#> 6 Russia 2005 66.5
#> 7 Russia 2006 67.5
#> 8 United States 2005 78
#> 9 United States 2006 78.2
From the result, you can see that the original tidy tibble is recovered via the separate()
function. After a careful look, you may notice that the column types of country
and year
are both character
, which may not be what we want. To convert the types automatically, you can add the convert = TRUE
argument in the separate()
function.
gm_year %>%
separate(col = "country_year", into = c("country", "year"), sep = "_", convert = TRUE)
#> # A tibble: 9 × 3
#> country year life_expectancy
#> <chr> <int> <dbl>
#> 1 China 2004 73
#> 2 Russia 2004 65
#> 3 United States 2004 77.7
#> 4 China 2005 73.7
#> 5 China 2006 74.2
#> 6 Russia 2005 66.5
#> 7 Russia 2006 67.5
#> 8 United States 2005 78
#> 9 United States 2006 78.2
Now, the country
and year
columns are converted to the appropriate types.
In addition to separating via the separator, you can also separate by the position. For example, to separate the year into century (first two digits) and 2-digit year (last 2 digits), you can separate the year
after position 2.
gm_tidy %>%
separate(col = "year", into = c("y12", "y34"), sep = 2, convert = TRUE)
#> # A tibble: 9 × 4
#> country y12 y34 life_expectancy
#> <chr> <int> <int> <dbl>
#> 1 China 20 4 73
#> 2 Russia 20 4 65
#> 3 United States 20 4 77.7
#> 4 China 20 5 73.7
#> 5 China 20 6 74.2
#> 6 Russia 20 5 66.5
#> 7 Russia 20 6 67.5
#> 8 United States 20 5 78
#> 9 United States 20 6 78.2
From the result, you can see that the year
column is separated into two columns: y12
and y34
. The y12
column contains the first two digits of the year, while the y34
column contains the last two digits of the year.
You can also separate at multiple locations. For example, to separate the year into four columns, representing four digits, you can use multiple positions to separate.
gm_digits <- gm_tidy %>%
separate(col = "year", into = c("y1", "y2", "y3", "y4"), sep = c(1, 2, 3), convert = TRUE)
gm_digits
#> # A tibble: 9 × 6
#> country y1 y2 y3 y4 life_expectancy
#> <chr> <int> <int> <int> <int> <dbl>
#> 1 China 2 0 0 4 73
#> 2 Russia 2 0 0 4 65
#> 3 United States 2 0 0 4 77.7
#> 4 China 2 0 0 5 73.7
#> 5 China 2 0 0 6 74.2
#> 6 Russia 2 0 0 5 66.5
#> 7 Russia 2 0 0 6 67.5
#> 8 United States 2 0 0 5 78
#> 9 United States 2 0 0 6 78.2
Using the gm_digits
, you can also combine the four columns into a single column named year
.
gm_digits %>%
unite(col = "year", y1:y4, sep = "")
#> # A tibble: 9 × 3
#> country year life_expectancy
#> <chr> <chr> <dbl>
#> 1 China 2004 73
#> 2 Russia 2004 65
#> 3 United States 2004 77.7
#> 4 China 2005 73.7
#> 5 China 2006 74.2
#> 6 Russia 2005 66.5
#> 7 Russia 2006 67.5
#> 8 United States 2005 78
#> 9 United States 2006 78.2
Note that the sep
argument in the unite()
function is an empty string ""
. This means that there is no separator between the columns.
8.2.3 Exercises
- You have a tibble containing data on disease outbreaks, but the location column combines the country and city.
outbreaks <- tibble(outbreak_id = c(1, 2, 3), disease = c("Measles", "Influenza",
"Cholera"), location = c("USA_New York City", "Brazil_Sao Paulo", "India_Mumbai"))
outbreaks
#> # A tibble: 3 × 3
#> outbreak_id disease location
#> <dbl> <chr> <chr>
#> 1 1 Measles USA_New York City
#> 2 2 Influenza Brazil_Sao Paulo
#> 3 3 Cholera India_Mumbai
Separate the location
column into country
and city
columns.
- You have a tibble with separate columns for patient medical record numbers and visit dates.
patients <- tibble(patient_id = c(1, 2, 3), record_number = c("A123", "B456", "C789"),
visit_date = c("2024-10-15", "2024-09-22", "2024-10-16"))
patients
#> # A tibble: 3 × 3
#> patient_id record_number visit_date
#> <dbl> <chr> <chr>
#> 1 1 A123 2024-10-15
#> 2 2 B456 2024-09-22
#> 3 3 C789 2024-10-16
Combine the record_number
and visit_date
columns into a single column named record_date
with the format “record_number - visit_date”.
- You have a tibble with a vaccine_date column in the format “YYYY-MM-DD”.
vaccinations <- tibble(patient_id = c(1, 2, 3), vaccine_name = c("MMR", "Flu", "Tdap"),
vaccine_date = c("2023-08-10", "2024-10-05", "2022-06-15"))
vaccinations
#> # A tibble: 3 × 3
#> patient_id vaccine_name vaccine_date
#> <dbl> <chr> <chr>
#> 1 1 MMR 2023-08-10
#> 2 2 Flu 2024-10-05
#> 3 3 Tdap 2022-06-15
- Split the
vaccine_date
column into three columns: year, month, and day. - Create a new column called
month_year
in the format “MM/YYYY”.
- You have a tibble with responses to a health survey, but some responses in the symptoms column are separated by commas, while others are separated by semicolons.
survey_data <- tibble(respondent_id = c(1, 2, 3), symptoms = c("Headache, Fatigue",
"Nausea; Vomiting", "Cough"))
survey_data
#> # A tibble: 3 × 2
#> respondent_id symptoms
#> <dbl> <chr>
#> 1 1 Headache, Fatigue
#> 2 2 Nausea; Vomiting
#> 3 3 Cough
Split the symptoms column into symptom1
and symptom2
columns, handling the different separators correctly.