6.6 Create Grouped Summaries via group_by() and summarize()
In Section 6.5, you learned how to create new variables as functions of the existing ones. For example, we created a variable representing the average sale price of all houses. Perhaps you want to know the average sale price for houses of a particular overall condition. The dplyr package provides two useful functions to achieve this: group_by(), which groups the observations according to the specified variables, and summarize(), which creates summaries for each group.
6.6.1 Create Summaries
To create summaries for a variable, you can use the summarize() function. Let’s compute the number of houses, the average living area, and the 1st and 3rd quartiles of the sale price over all houses.
library(r02pro)
library(tidyverse)
ahp %>%
  summarize(n_houses = n(),
            ave_liv_area = mean(liv_area),
            prob = c(0.25, 0.75),
            q_price = quantile(sale_price, c(0.25, 0.75), na.rm = TRUE))
#> # A tibble: 2 × 4
#>   n_houses ave_liv_area  prob q_price
#>      <int>        <dbl> <dbl>   <dbl>
#> 1     2048        1500.  0.25    130.
#> 2     2048        1500.  0.75    214
Note that we use the n() function to count the number of houses. The prob column is included to record the quantile level that each row of q_price corresponds to.
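As a small self-contained illustration of the same ideas, the sketch below uses a made-up tibble named toy (hypothetical data, not the ahp dataset). It also shows reframe() as one possible way to produce the two-row quantile summary, on the assumption that dplyr 1.1.0 or later is installed; in those versions, returning more than one row from summarize() is reported as deprecated.

```r
library(dplyr)

# Hypothetical toy data standing in for ahp (all values are made up)
toy <- tibble(
  liv_area   = c(900, 1200, 1500, 1800, 2100),
  sale_price = c(100, 130, NA, 210, 260)
)

# One-row summary: n() counts the rows, and na.rm = TRUE
# makes median() skip the missing sale price
toy_sum <- toy %>%
  summarize(n_houses = n(),
            ave_liv_area = mean(liv_area),
            med_price = median(sale_price, na.rm = TRUE))

# Two-row summary, one row per quantile level
# (reframe() assumes dplyr >= 1.1.0)
toy_q <- toy %>%
  reframe(prob = c(0.25, 0.75),
          q_price = quantile(sale_price, c(0.25, 0.75), na.rm = TRUE))

toy_sum
toy_q
```

Without na.rm = TRUE, median() would return NA and quantile() would stop with an error, because one sale price is missing.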
6.6.2 Create Grouped Summaries
So far, we have used summarize() to create summaries over all observations. In practical applications, it is often more useful to compute the summaries after grouping the observations according to certain criteria. Let’s say we want to create the same summaries for each value of overall quality (oa_qual). Then, we need to add a group_by() call before applying the summarize() function.
ahp %>%
  group_by(oa_qual) %>%
  summarize(n_houses = n(),
            ave_liv_area = mean(liv_area),
            prob = c(0.25, 0.75),
            q_price = quantile(sale_price, c(0.25, 0.75), na.rm = TRUE))
#> `summarise()` has grouped output by 'oa_qual'. You can override using the
#> `.groups` argument.
From the output, you can see that for each value of oa_qual, we have two rows, one for each quantile level. It is interesting to visualize the relationship between oa_qual and the quartile price (q_price), with different colors for the 1st and 3rd quartiles.
ahp %>%
  group_by(oa_qual) %>%
  summarize(n_houses = n(),
            ave_liv_area = mean(liv_area),
            prob = c(0.25, 0.75),
            q_price = quantile(sale_price, c(0.25, 0.75), na.rm = TRUE)) %>%
  ggplot(mapping = aes(x = oa_qual, y = q_price, color = factor(prob))) +
  geom_point() +
  geom_smooth()
#> `summarise()` has grouped output by 'oa_qual'. You can override using the
#> `.groups` argument.
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'
This figure is very informative, showing that both the 1st and 3rd quartiles of the sale price increase as the overall quality increases.
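The message about the `.groups` argument printed with the outputs above appears because the summarized result is still grouped. The following minimal sketch shows one way to inspect and control this, using a made-up tibble named toy in which has_garage is a purely illustrative column (it is not assumed to exist in ahp).

```r
library(dplyr)

# Hypothetical toy data (made-up values); has_garage is an illustrative column
toy <- tibble(
  oa_qual    = c(5, 5, 5, 5, 7, 7, 7, 7),
  has_garage = c(TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE),
  sale_price = c(130, 150, 100, 120, 240, 260, 200, 220)
)

# Default behavior: only the last grouping variable is dropped, so the
# result stays grouped by oa_qual and summarize() prints the message
kept <- toy %>%
  group_by(oa_qual, has_garage) %>%
  summarize(med_price = median(sale_price))

# .groups = "drop" returns an ungrouped tibble and silences the message
dropped <- toy %>%
  group_by(oa_qual, has_garage) %>%
  summarize(med_price = median(sale_price), .groups = "drop")

group_vars(kept)     # grouping variables that remain
group_vars(dropped)  # no grouping variables remain
```

Dropping the groups explicitly is a reasonable default when the summary is passed on to further steps, such as plotting, that should treat all rows as one table.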
6.6.3 Exercises
Use the ahp dataset and the pipe operator for the following exercises.
For each month when the house was sold, compute the 1st and 3rd quartiles of the sale price. Then, create a scatterplot of the month (x-axis) against the quartile of the sale price, with different colors for the 1st and 3rd quartiles. Explain the findings from the figure.
Someone has a conjecture that when sold, the houses that are less than or equal to 30 years old have a higher sale price than the houses that are more than 30 years old. Show whether this is true in terms of maximum price, median price, and the minimum price for the houses in each group.