7.4 Create New Variables via mutate() and transmute()

You are now an expert in filtering observations (Section 7.1), selecting, renaming & reordering variables (Section 7.2), and reordering observations (Section 7.3). In many applications, you may want to create new variables as functions of the existing ones. In this section, we will learn how to do this using the dplyr package.

Let’s say you want to compute the total GDP for each country in the gm data set. To highlight the useful columns, we first use select() to select the country, year, population, and GDP_per_capita. Then, use the mutate() function to add a new variable named total_GDP with the value GDP_per_capita * population to the end.

library(r02pro)
library(tidyverse)
gm %>%
    select(country, year, population, GDP_per_capita) %>%
    mutate(total_GDP = GDP_per_capita * population)
#> # A tibble: 65,531 × 5
#>    country  year population GDP_per_capita total_GDP
#>    <chr>   <dbl>      <dbl>          <dbl>     <dbl>
#>  1 Albania  1999     3130             1.96     6135.
#>  2 Albania  2000     3130             2.14     6698.
#>  3 Albania  2001     3130             2.25     7042.
#>  4 Albania  2002     3120             2.38     7426.
#>  5 Albania  2003     3100             2.52     7812 
#>  6 Albania  2004     3090             2.68     8281.
#>  7 Andorra  1999       65.4          34.3      2243.
#>  8 Andorra  2000       67.3          36        2423.
#>  9 Andorra  2001       70            36.2      2534 
#> 10 Andorra  2002       73.2          37.6      2752.
#> # … with 65,521 more rows

From the result, you can check that the resulting tibble has 5 columns, with the last column being the newly created variable total_GDP. You can use mutate() to create multiple variables at the same time following the same format.

gm %>%
    select(country, year, population, GDP_per_capita, livercancer_newcases_male,
        livercancer_newcases_female) %>%
    mutate(total_GDP = GDP_per_capita * population, livercancer_newcases = livercancer_newcases_male +
        livercancer_newcases_female)
#> # A tibble: 65,531 × 8
#>    country  year population GDP_per_capita livercancer…¹ liver…² total…³ liver…⁴
#>    <chr>   <dbl>      <dbl>          <dbl>         <dbl>   <dbl>   <dbl>   <dbl>
#>  1 Albania  1999     3130             1.96         17.8     8.28   6135.   26.1 
#>  2 Albania  2000     3130             2.14         17.1     7.66   6698.   24.8 
#>  3 Albania  2001     3130             2.25         16.4     7.1    7042.   23.5 
#>  4 Albania  2002     3120             2.38         15.7     6.58   7426.   22.3 
#>  5 Albania  2003     3100             2.52         15.1     6.1    7812    21.2 
#>  6 Albania  2004     3090             2.68         14.4     5.66   8281.   20.1 
#>  7 Andorra  1999       65.4          34.3           5.99    2.35   2243.    8.34
#>  8 Andorra  2000       67.3          36             6.05    2.35   2423.    8.4 
#>  9 Andorra  2001       70            36.2           6.12    2.36   2534     8.48
#> 10 Andorra  2002       73.2          37.6           6.19    2.37   2752.    8.56
#> # … with 65,521 more rows, and abbreviated variable names
#> #   ¹​livercancer_newcases_male, ²​livercancer_newcases_female, ³​total_GDP,
#> #   ⁴​livercancer_newcases

This operation adds two new columns total_GDP and livercancer_newcases to the existing tibble. Note that the mutate() function can only use the variables inside the select() function. The following code will show an error since livercancer_newcases_male is not included in the select() function.

gm %>%
    select(country, year, population, GDP_per_capita) %>%
    mutate(total_GDP = GDP_per_capita * population, livercancer_newcases = livercancer_newcases_male +
        livercancer_newcases_female)
#> Error in `mutate()`:
#> ℹ In argument:
#>   `livercancer_newcases =
#>   livercancer_newcases_male +
#>   livercancer_newcases_female`.
#> Caused by error:
#> ! object 'livercancer_newcases_male' not found

Note that you are free to use any functions on a vector, including all the arithmetic operations and various functions. For example, to include the order of each country in terms of GDP per capita in the year 2008, you can use mutate(GDP_order = order(GDP_per_capita)). To create a variable as the mean of the GDP_per_capita of all countries, you can add GDP_per_capita_ave = mean(GDP_per_capita, na.rm = TRUE) as an argument in the mutate() function.

gm %>%
    filter(year == 2008) %>%
    select(country, year, population, GDP_per_capita) %>%
    mutate(GDP_order = order(GDP_per_capita), GDP_per_capita_ave = mean(GDP_per_capita,
        na.rm = TRUE))
#> # A tibble: 236 × 6
#>    country               year population GDP_per_capita GDP_order GDP_per_capi…¹
#>    <chr>                <dbl>      <dbl>          <dbl>     <int>          <dbl>
#>  1 Afghanistan           2008    28400            0.492        13           14.8
#>  2 Angola                2008    22500            3.96        129           14.8
#>  3 Albania               2008     2970            3.43         38           14.8
#>  4 Andorra               2008       84.5         35.4          59           14.8
#>  5 United Arab Emirates  2008     7920           34.6         132           14.8
#>  6 Argentina             2008    40500           12.4         126           14.8
#>  7 Armenia               2008     2890            2.88        180           14.8
#>  8 American Samoa        2008       NA           12.1         115           14.8
#>  9 Antigua and Barbuda   2008       86.7         15.4           1           14.8
#> 10 Australia             2008    21800           53.3          31           14.8
#> # … with 226 more rows, and abbreviated variable name ¹​GDP_per_capita_ave

The mutate() function is very powerful in creating new variables. However, if you only want to keep the newly created variables, you can use the transmute() function.

gm %>%
    filter(year == 2008) %>%
    select(country, year, population, GDP_per_capita) %>%
    transmute(GDP_order = order(GDP_per_capita), GDP_per_capita_ave = mean(GDP_per_capita,
        na.rm = TRUE))
#> # A tibble: 236 × 2
#>    GDP_order GDP_per_capita_ave
#>        <int>              <dbl>
#>  1        13               14.8
#>  2       129               14.8
#>  3        38               14.8
#>  4        59               14.8
#>  5       132               14.8
#>  6       126               14.8
#>  7       180               14.8
#>  8       115               14.8
#>  9         1               14.8
#> 10        31               14.8
#> # … with 226 more rows

7.4.1 Exercises

Using the ahp dataset and the pipe operator for the following exercises.

  1. Create a new variables named age being the age of the house when it was sold in terms of years (the number of years from when the house was built to when the house was sold). Then, select the variables age, sale_price, and kit_qual. Finally, generate a scatterplot between age (x-axis) and sale_price (y-axis), with different colors representing different kit_qual. Explain the findings from the figure.