7.4 Create New Variables via mutate() and transmute()

You are now an expert in filtering observations (Section 7.1), selecting, renaming & reordering variables (Section 7.2), and reordering observations (Section 7.3). In many applications, you may want to create new variables as functions of the existing ones. In this section, we will learn how to do this using the dplyr package.

Let’s say you want to compute the total GDP for each country in the gm data set. To highlight the useful columns, we first use select() to select the country, year, population, and GDP_per_capita. Then, use the mutate() function to add a new variable named total_GDP with the value GDP_per_capita * population to the end.

library(r02pro)
library(tidyverse)
gm %>%
    select(country, year, population, GDP_per_capita) %>%
    mutate(total_GDP = GDP_per_capita * population)
#> # A tibble: 65,531 × 5
#>    country  year population GDP_per_capita total_GDP
#>    <chr>   <dbl>      <dbl>          <dbl>     <dbl>
#>  1 Albania  1999     3130             1.96     6135.
#>  2 Albania  2000     3130             2.14     6698.
#>  3 Albania  2001     3130             2.25     7042.
#>  4 Albania  2002     3120             2.38     7426.
#>  5 Albania  2003     3100             2.52     7812 
#>  6 Albania  2004     3090             2.68     8281.
#>  7 Andorra  1999       65.4          34.3      2243.
#>  8 Andorra  2000       67.3          36        2423.
#>  9 Andorra  2001       70            36.2      2534 
#> 10 Andorra  2002       73.2          37.6      2752.
#> # ℹ 65,521 more rows

From the result, you can check that the resulting tibble has 5 columns, with the last column being the newly created variable total_GDP. You can use mutate() to create multiple variables at the same time following the same format.

gm %>%
    select(country, year, population, GDP_per_capita, livercancer_newcases_male,
        livercancer_newcases_female) %>%
    mutate(total_GDP = GDP_per_capita * population, livercancer_newcases = livercancer_newcases_male +
        livercancer_newcases_female)
#> # A tibble: 65,531 × 8
#>    country  year population GDP_per_capita livercancer_newcases_male
#>    <chr>   <dbl>      <dbl>          <dbl>                     <dbl>
#>  1 Albania  1999     3130             1.96                     17.8 
#>  2 Albania  2000     3130             2.14                     17.1 
#>  3 Albania  2001     3130             2.25                     16.4 
#>  4 Albania  2002     3120             2.38                     15.7 
#>  5 Albania  2003     3100             2.52                     15.1 
#>  6 Albania  2004     3090             2.68                     14.4 
#>  7 Andorra  1999       65.4          34.3                       5.99
#>  8 Andorra  2000       67.3          36                         6.05
#>  9 Andorra  2001       70            36.2                       6.12
#> 10 Andorra  2002       73.2          37.6                       6.19
#> # ℹ 65,521 more rows
#> # ℹ 3 more variables: livercancer_newcases_female <dbl>, total_GDP <dbl>,
#> #   livercancer_newcases <dbl>

This operation adds two new columns total_GDP and livercancer_newcases to the existing tibble. Note that the mutate() function can only use the variables inside the select() function. The following code will show an error since livercancer_newcases_male is not included in the select() function.

gm %>%
    select(country, year, population, GDP_per_capita) %>%
    mutate(total_GDP = GDP_per_capita * population, livercancer_newcases = livercancer_newcases_male +
        livercancer_newcases_female)
#> Error in `mutate()`:
#> ℹ In argument: `livercancer_newcases =
#>   livercancer_newcases_male + livercancer_newcases_female`.
#> Caused by error:
#> ! object 'livercancer_newcases_male' not found

Note that you are free to use any functions on a vector, including all the arithmetic operations and various functions. For example, to include the order of each country in terms of GDP per capita in the year 2008, you can use mutate(GDP_order = order(GDP_per_capita)). To create a variable as the mean of the GDP_per_capita of all countries, you can add GDP_per_capita_ave = mean(GDP_per_capita, na.rm = TRUE) as an argument in the mutate() function.

gm %>%
    filter(year == 2008) %>%
    select(country, year, population, GDP_per_capita) %>%
    mutate(GDP_order = order(GDP_per_capita), GDP_per_capita_ave = mean(GDP_per_capita,
        na.rm = TRUE))
#> # A tibble: 236 × 6
#>    country           year population GDP_per_capita GDP_order GDP_per_capita_ave
#>    <chr>            <dbl>      <dbl>          <dbl>     <int>              <dbl>
#>  1 Afghanistan       2008    28400            0.492        13               14.8
#>  2 Angola            2008    22500            3.96        129               14.8
#>  3 Albania           2008     2970            3.43         38               14.8
#>  4 Andorra           2008       84.5         35.4          59               14.8
#>  5 United Arab Emi…  2008     7920           34.6         132               14.8
#>  6 Argentina         2008    40500           12.4         126               14.8
#>  7 Armenia           2008     2890            2.88        180               14.8
#>  8 American Samoa    2008       NA           12.1         115               14.8
#>  9 Antigua and Bar…  2008       86.7         15.4           1               14.8
#> 10 Australia         2008    21800           53.3          31               14.8
#> # ℹ 226 more rows

The mutate() function is very powerful in creating new variables. However, if you only want to keep the newly created variables, you can use the transmute() function.

gm %>%
    filter(year == 2008) %>%
    select(country, year, population, GDP_per_capita) %>%
    transmute(GDP_order = order(GDP_per_capita), GDP_per_capita_ave = mean(GDP_per_capita,
        na.rm = TRUE))
#> # A tibble: 236 × 2
#>    GDP_order GDP_per_capita_ave
#>        <int>              <dbl>
#>  1        13               14.8
#>  2       129               14.8
#>  3        38               14.8
#>  4        59               14.8
#>  5       132               14.8
#>  6       126               14.8
#>  7       180               14.8
#>  8       115               14.8
#>  9         1               14.8
#> 10        31               14.8
#> # ℹ 226 more rows

7.4.1 Exercises

Using the ahp dataset and the pipe operator for the following exercises.

Create a new variables named age being the age of the house when it was sold in terms of years (the number of years from when the house was built to when the house was sold). Then, select the variables age, sale_price, and kit_qual. Finally, generate a scatterplot between age (x-axis) and sale_price (y-axis), with different colors representing different kit_qual. Explain the findings from the figure.