7.4 Create New Variables via mutate() and transmute()
You are now an expert in filtering observations (Section 7.1), selecting, renaming & reordering variables (Section 7.2), and reordering observations (Section 7.3). In many applications, you may want to create new variables as functions of the existing ones. In this section, we will learn how to do this using the dplyr package.
Let’s say you want to compute the total GDP for each country in the gm data set. To highlight the useful columns, we first use select() to select country, year, population, and GDP_per_capita. Then, we use the mutate() function to add a new variable named total_GDP, with the value GDP_per_capita * population, to the end.
library(r02pro)
library(tidyverse)
gm %>%
  select(country, year, population, GDP_per_capita) %>%
  mutate(total_GDP = GDP_per_capita * population)
#> # A tibble: 65,531 × 5
#> country year population GDP_per_capita total_GDP
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Albania 1999 3130 1.96 6135.
#> 2 Albania 2000 3130 2.14 6698.
#> 3 Albania 2001 3130 2.25 7042.
#> 4 Albania 2002 3120 2.38 7426.
#> 5 Albania 2003 3100 2.52 7812
#> 6 Albania 2004 3090 2.68 8281.
#> 7 Andorra 1999 65.4 34.3 2243.
#> 8 Andorra 2000 67.3 36 2423.
#> 9 Andorra 2001 70 36.2 2534
#> 10 Andorra 2002 73.2 37.6 2752.
#> # … with 65,521 more rows
From the result, you can check that the resulting tibble has 5 columns, with the last column being the newly created variable total_GDP. You can use mutate() to create multiple variables at the same time by separating the definitions with commas.
gm %>%
  select(country, year, population, GDP_per_capita,
         livercancer_newcases_male, livercancer_newcases_female) %>%
  mutate(
    total_GDP = GDP_per_capita * population,
    livercancer_newcases = livercancer_newcases_male + livercancer_newcases_female
  )
#> # A tibble: 65,531 × 8
#> country year population GDP_per_capita livercancer…¹ liver…² total…³ liver…⁴
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Albania 1999 3130 1.96 17.8 8.28 6135. 26.1
#> 2 Albania 2000 3130 2.14 17.1 7.66 6698. 24.8
#> 3 Albania 2001 3130 2.25 16.4 7.1 7042. 23.5
#> 4 Albania 2002 3120 2.38 15.7 6.58 7426. 22.3
#> 5 Albania 2003 3100 2.52 15.1 6.1 7812 21.2
#> 6 Albania 2004 3090 2.68 14.4 5.66 8281. 20.1
#> 7 Andorra 1999 65.4 34.3 5.99 2.35 2243. 8.34
#> 8 Andorra 2000 67.3 36 6.05 2.35 2423. 8.4
#> 9 Andorra 2001 70 36.2 6.12 2.36 2534 8.48
#> 10 Andorra 2002 73.2 37.6 6.19 2.37 2752. 8.56
#> # … with 65,521 more rows, and abbreviated variable names
#> # ¹livercancer_newcases_male, ²livercancer_newcases_female, ³total_GDP,
#> # ⁴livercancer_newcases
This operation adds two new columns, total_GDP and livercancer_newcases, to the existing tibble. Note that mutate() can only use variables that are present in the tibble it receives, which here means the columns kept by select(). The following code throws an error since livercancer_newcases_male is not included in the select() call.
gm %>%
  select(country, year, population, GDP_per_capita) %>%
  mutate(
    total_GDP = GDP_per_capita * population,
    livercancer_newcases = livercancer_newcases_male + livercancer_newcases_female
  )
#> Error in `mutate()`:
#> ℹ In argument:
#> `livercancer_newcases =
#> livercancer_newcases_male +
#> livercancer_newcases_female`.
#> Caused by error:
#> ! object 'livercancer_newcases_male' not found
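One way around this error is to call mutate() before select(), so that every column the new variable needs is still present, and then keep only the columns you want. The sketch below uses a small made-up tibble; the name toy, its columns, and its values are hypothetical and not taken from gm.

```r
library(dplyr)

# Hypothetical toy data; the values are made up for illustration
toy <- tibble(
  country = c("A", "B"),
  cases_male = c(10, 20),
  cases_female = c(4, 6)
)

# Create the new variable first, while both inputs are still available,
# then drop the columns that are no longer needed
toy_total <- toy %>%
  mutate(cases_total = cases_male + cases_female) %>%
  select(country, cases_total)

toy_total
```

Alternatively, you can simply add the required columns to the select() call, as in the earlier example.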
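Note that you are free to use any function that operates on a vector, including all the arithmetic operations. For example, for the year 2008, mutate(GDP_order = order(GDP_per_capita)) stores the output of order(), namely the row indices that would sort GDP_per_capita in increasing order. Be aware that this is not the rank of each country: in the output below, the first value of GDP_order is 13, meaning that row 13 holds the smallest GDP per capita, not that Afghanistan is ranked 13th. To get each country’s rank, use rank() or min_rank() instead. To create a variable as the mean of the GDP_per_capita of all countries, you can add GDP_per_capita_ave = mean(GDP_per_capita, na.rm = TRUE) as an argument in the mutate() function; the single mean value is repeated on every row.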
gm %>%
  filter(year == 2008) %>%
  select(country, year, population, GDP_per_capita) %>%
  mutate(
    GDP_order = order(GDP_per_capita),
    GDP_per_capita_ave = mean(GDP_per_capita, na.rm = TRUE)
  )
#> # A tibble: 236 × 6
#> country year population GDP_per_capita GDP_order GDP_per_capi…¹
#> <chr> <dbl> <dbl> <dbl> <int> <dbl>
#> 1 Afghanistan 2008 28400 0.492 13 14.8
#> 2 Angola 2008 22500 3.96 129 14.8
#> 3 Albania 2008 2970 3.43 38 14.8
#> 4 Andorra 2008 84.5 35.4 59 14.8
#> 5 United Arab Emirates 2008 7920 34.6 132 14.8
#> 6 Argentina 2008 40500 12.4 126 14.8
#> 7 Armenia 2008 2890 2.88 180 14.8
#> 8 American Samoa 2008 NA 12.1 115 14.8
#> 9 Antigua and Barbuda 2008 86.7 15.4 1 14.8
#> 10 Australia 2008 21800 53.3 31 14.8
#> # … with 226 more rows, and abbreviated variable name ¹GDP_per_capita_ave
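Because order() and rank-type functions are easy to mix up, the following sketch compares them on a short made-up vector. The tibble toy and its values are hypothetical; min_rank() is the dplyr ranking helper. The sketch also shows that a single summary value such as the result of mean() is repeated on every row.

```r
library(dplyr)

# Hypothetical toy data; the values are made up for illustration
toy <- tibble(
  country = c("A", "B", "C", "D"),
  gdp = c(30, 10, 40, 20)
)

toy_ranked <- toy %>%
  mutate(
    gdp_order = order(gdp),    # row indices that would sort gdp increasingly
    gdp_rank = min_rank(gdp),  # rank of each row's own value (1 = smallest)
    gdp_ave = mean(gdp)        # one number, repeated on every row
  )

toy_ranked
```

Here gdp_order is 2, 4, 1, 3 (row 2 has the smallest value), whereas gdp_rank is 3, 1, 4, 2 (country A has the third-smallest value). If you want a column that answers "where does this country stand?", the rank is usually what you need.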
The mutate() function is very powerful in creating new variables. However, if you only want to keep the newly created variables, you can use the transmute() function.
gm %>%
  filter(year == 2008) %>%
  select(country, year, population, GDP_per_capita) %>%
  transmute(
    GDP_order = order(GDP_per_capita),
    GDP_per_capita_ave = mean(GDP_per_capita, na.rm = TRUE)
  )
#> # A tibble: 236 × 2
#> GDP_order GDP_per_capita_ave
#> <int> <dbl>
#> 1 13 14.8
#> 2 129 14.8
#> 3 38 14.8
#> 4 59 14.8
#> 5 132 14.8
#> 6 126 14.8
#> 7 180 14.8
#> 8 115 14.8
#> 9 1 14.8
#> 10 31 14.8
#> # … with 226 more rows
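In recent dplyr releases, the documentation marks transmute() as superseded and suggests mutate() with .keep = "none" instead. The sketch below uses a hypothetical toy tibble and assumes dplyr 1.0.0 or later, where the .keep argument is available; it is meant only to illustrate that the two calls keep the same column in this simple, ungrouped case.

```r
library(dplyr)

# Hypothetical toy data; the values are made up for illustration
toy <- tibble(
  country = c("A", "B", "C"),
  gdp_per_capita = c(2, 5, 11),
  population = c(100, 40, 10)
)

# Keep only the newly created variable with transmute()
with_transmute <- toy %>%
  transmute(total_gdp = gdp_per_capita * population)

# Same idea with mutate(); .keep = "none" drops the columns
# that were not created in this call
with_keep_none <- toy %>%
  mutate(total_gdp = gdp_per_capita * population, .keep = "none")

with_transmute
with_keep_none
```

Both results contain the single column total_gdp. Existing code that uses transmute() continues to work, so either form is fine for the examples in this section.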
7.4.1 Exercises
Use the ahp dataset and the pipe operator for the following exercises.
- Create a new variable named age, the age of the house in years when it was sold (the number of years from when the house was built to when it was sold). Then, select the variables age, sale_price, and kit_qual. Finally, generate a scatterplot of sale_price (y-axis) against age (x-axis), with different colors representing different values of kit_qual. Explain the findings from the figure.