6.5 Create New Variables via mutate() and transmute()

You are now an expert in filtering observations (Section 6.1), reordering observations (Section 6.2), and selecting, renaming & reordering variables (Section 6.4). In many applications, you may want to create new variables as functions of the existing ones. In this section, we will learn how to do this using the dplyr package.

Let’s say you want to compute the average sale price per square foot for each house in the sahp data set. To highlight the useful columns, we first use select() to select the oa_qual, liv_area, and sale_price. Then, use the mutate() function to add a new variable named ave_price with the value sale_price/liv_area to the end.

library(r02pro)
library(tidyverse)
sahp %>% 
  select(oa_qual, liv_area, sale_price) %>%
  mutate(ave_price = sale_price/liv_area) 
#> # A tibble: 165 × 4
#>    oa_qual liv_area sale_price ave_price
#>      <dbl>    <dbl>      <dbl>     <dbl>
#>  1       6     1479       130.    0.0882
#>  2       7     2122        NA    NA     
#>  3       5     1057       109     0.103 
#>  4       5     1444       174     0.120 
#>  5       6     1445       138.    0.0958
#>  6       6     1888       190     0.101 
#>  7       6     1072       140     0.131 
#>  8       5     1188       142     0.120 
#>  9       5      924       112.    0.122 
#> 10       5     2080       135     0.0649
#> # … with 155 more rows

From the result, you can check that the resulting tibble has 4 columns, with the last column being the newly created variable ave_price. You can use mutate() to create multiple variables at the same time following the same format.

sahp %>% 
  select(oa_qual, liv_area, sale_price, lot_area) %>%
  mutate(ave_price = sale_price/liv_area, area = lot_area + liv_area)
#> # A tibble: 165 × 6
#>    oa_qual liv_area sale_price lot_area ave_price  area
#>      <dbl>    <dbl>      <dbl>    <dbl>     <dbl> <dbl>
#>  1       6     1479       130.    13517    0.0882 14996
#>  2       7     2122        NA     11492   NA      13614
#>  3       5     1057       109      7922    0.103   8979
#>  4       5     1444       174      9802    0.120  11246
#>  5       6     1445       138.    14235    0.0958 15680
#>  6       6     1888       190     16492    0.101  18380
#>  7       6     1072       140      3675    0.131   4747
#>  8       5     1188       142     12160    0.120  13348
#>  9       5      924       112.    15783    0.122  16707
#> 10       5     2080       135     11606    0.0649 13686
#> # … with 155 more rows

This operation adds two new columns ave_price and area to the existing tibble. Note that the mutate() function can only use the variables inside the select() function. The following code will show an error since lot_area is not included in the select() function.

sahp %>% 
  select(oa_qual, liv_area, sale_price) %>%
  mutate(ave_price = sale_price/liv_area, area = lot_area + liv_area)
#> Error in `mutate()`:
#> ! Problem while computing `area = lot_area + liv_area`.
#> Caused by error:
#> ! object 'lot_area' not found

Note that you are free to use any functions on a vector, including all the arithmetic operations and various functions. For example, to include the order of each house in terms of sale price, you can use mutate(sale_price_order = order(sale_price)). To create a variable as the mean of the sale prices of all houses, you can add sale_price_ave = mean(sale_price) as an argument in the mutate() function.

sahp %>%
  filter(sale_price > 400) %>%
  select(sale_price) %>% 
  mutate(sale_price_order = order(sale_price),
         sale_price_ave = mean(sale_price))
#> # A tibble: 5 × 3
#>   sale_price sale_price_order sale_price_ave
#>        <dbl>            <int>          <dbl>
#> 1       465                 5           466.
#> 2       545.                4           466.
#> 3       457.                3           466.
#> 4       442.                1           466.
#> 5       421.                2           466.

From the result, you can see that sale_price_order represent the order for each house in terms of sale price while the sale_price_ave has a constant value for all houses. The different behavior lies in the fact that the function order() returns another vector of the same length as the input, while mean() always returns a vector of length 1, which is recycled to fill in the new variable sale_price_ave.

Once creating the new variables, you can immediately use them to do visualization. Let’s say we want to create a scatterplot between area and ave_price, using different colors to represent whether oa_qual > 5.

sahp %>% 
  mutate(ave_price = sale_price/liv_area, area = lot_area + liv_area) %>%
  ggplot() +
  geom_point(mapping = aes(x = area, y = ave_price, color = oa_qual > 5))

This plots shows that the average sale price does not have a trend when the total area changes. In addition, the “good” quality houses (oa_qual > 5) has a higher ave_price in general when the total area is the same.

Using the mutate() function, we add newly created variable while keeping all the existing ones. If you want to only keep the newly created variables, you can use the transmute() function.

sahp %>% 
  transmute(ave_price = sale_price/liv_area, area = lot_area + liv_area)
#> # A tibble: 165 × 2
#>    ave_price  area
#>        <dbl> <dbl>
#>  1    0.0882 14996
#>  2   NA      13614
#>  3    0.103   8979
#>  4    0.120  11246
#>  5    0.0958 15680
#>  6    0.101  18380
#>  7    0.131   4747
#>  8    0.120  13348
#>  9    0.122  16707
#> 10    0.0649 13686
#> # … with 155 more rows

6.5.1 Exercises

Using the ahp dataset and the pipe operator for the following exercises.

  1. Create a new variables named age being the age of the house when it was sold in terms of years (the number of years from when the house was built to when the house was sold). Then, select the variables age, sale_price, and kit_qual. Finally, generate a scatterplot between age (x-axis) and sale_price (y-axis), with different colors representing different kit_qual. Explain the findings from the figure.