6.5 Create New Variables via mutate() and transmute()
You are now an expert in filtering observations (Section 6.1), reordering observations (Section 6.2), and selecting, renaming & reordering variables (Section 6.4). In many applications, you may want to create new variables as functions of the existing ones. In this section, we will learn how to do this using the dplyr package.
Let’s say you want to compute the average sale price per square foot for each house in the sahp
data set. To highlight the useful columns, we first use select()
to select the oa_qual
, liv_area
, and sale_price
. Then, use the mutate()
function to add a new variable named ave_price
with the value sale_price/liv_area
to the end.
library(r02pro)
library(tidyverse)
%>%
sahp select(oa_qual, liv_area, sale_price) %>%
mutate(ave_price = sale_price/liv_area)
#> # A tibble: 165 × 4
#> oa_qual liv_area sale_price ave_price
#> <dbl> <dbl> <dbl> <dbl>
#> 1 6 1479 130. 0.0882
#> 2 7 2122 NA NA
#> 3 5 1057 109 0.103
#> 4 5 1444 174 0.120
#> 5 6 1445 138. 0.0958
#> 6 6 1888 190 0.101
#> 7 6 1072 140 0.131
#> 8 5 1188 142 0.120
#> 9 5 924 112. 0.122
#> 10 5 2080 135 0.0649
#> # … with 155 more rows
From the result, you can check that the resulting tibble has 4 columns, with the last column being the newly created variable ave_price
. You can use mutate()
to create multiple variables at the same time following the same format.
%>%
sahp select(oa_qual, liv_area, sale_price, lot_area) %>%
mutate(ave_price = sale_price/liv_area, area = lot_area + liv_area)
#> # A tibble: 165 × 6
#> oa_qual liv_area sale_price lot_area ave_price area
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 6 1479 130. 13517 0.0882 14996
#> 2 7 2122 NA 11492 NA 13614
#> 3 5 1057 109 7922 0.103 8979
#> 4 5 1444 174 9802 0.120 11246
#> 5 6 1445 138. 14235 0.0958 15680
#> 6 6 1888 190 16492 0.101 18380
#> 7 6 1072 140 3675 0.131 4747
#> 8 5 1188 142 12160 0.120 13348
#> 9 5 924 112. 15783 0.122 16707
#> 10 5 2080 135 11606 0.0649 13686
#> # … with 155 more rows
This operation adds two new columns ave_price
and area
to the existing tibble. Note that the mutate()
function can only use the variables inside the select()
function. The following code will show an error since lot_area
is not included in the select()
function.
%>%
sahp select(oa_qual, liv_area, sale_price) %>%
mutate(ave_price = sale_price/liv_area, area = lot_area + liv_area)
#> Error in `mutate()`:
#> ! Problem while computing `area = lot_area + liv_area`.
#> Caused by error:
#> ! object 'lot_area' not found
Note that you are free to use any functions on a vector, including all the arithmetic operations and various functions. For example, to include the order of each house in terms of sale price, you can use mutate(sale_price_order = order(sale_price))
. To create a variable as the mean of the sale prices of all houses, you can add sale_price_ave = mean(sale_price)
as an argument in the mutate()
function.
%>%
sahp filter(sale_price > 400) %>%
select(sale_price) %>%
mutate(sale_price_order = order(sale_price),
sale_price_ave = mean(sale_price))
#> # A tibble: 5 × 3
#> sale_price sale_price_order sale_price_ave
#> <dbl> <int> <dbl>
#> 1 465 5 466.
#> 2 545. 4 466.
#> 3 457. 3 466.
#> 4 442. 1 466.
#> 5 421. 2 466.
From the result, you can see that sale_price_order
represent the order for each house in terms of sale price while the sale_price_ave
has a constant value for all houses. The different behavior lies in the fact that the function order()
returns another vector of the same length as the input, while mean()
always returns a vector of length 1, which is recycled to fill in the new variable sale_price_ave
.
Once creating the new variables, you can immediately use them to do visualization. Let’s say we want to create a scatterplot between area
and ave_price
, using different colors to represent whether oa_qual > 5
.
%>%
sahp mutate(ave_price = sale_price/liv_area, area = lot_area + liv_area) %>%
ggplot() +
geom_point(mapping = aes(x = area, y = ave_price, color = oa_qual > 5))
This plots shows that the average sale price does not have a trend when the total area changes. In addition, the “good” quality houses (oa_qual > 5
) has a higher ave_price
in general when the total area is the same.
Using the mutate()
function, we add newly created variable while keeping all the existing ones. If you want to only keep the newly created variables, you can use the transmute()
function.
%>%
sahp transmute(ave_price = sale_price/liv_area, area = lot_area + liv_area)
#> # A tibble: 165 × 2
#> ave_price area
#> <dbl> <dbl>
#> 1 0.0882 14996
#> 2 NA 13614
#> 3 0.103 8979
#> 4 0.120 11246
#> 5 0.0958 15680
#> 6 0.101 18380
#> 7 0.131 4747
#> 8 0.120 13348
#> 9 0.122 16707
#> 10 0.0649 13686
#> # … with 155 more rows
6.5.1 Exercises
Using the ahp
dataset and the pipe operator for the following exercises.
- Create a new variables named
age
being the age of the house when it was sold in terms of years (the number of years from when the house was built to when the house was sold). Then, select the variablesage
,sale_price
, andkit_qual
. Finally, generate a scatterplot betweenage
(x-axis) andsale_price
(y-axis), with different colors representing differentkit_qual
. Explain the findings from the figure.