## 6.6 Create Grouped Summaries via group_by() and summarize()

In Section 6.5, you learned how to create new variables as functions of existing ones. For example, we created a variable representing the average sale price of all houses. But perhaps you want to know the average sale price for houses of a particular overall condition. The dplyr package provides two useful functions for this: group_by(), which groups the observations according to the specified variables, and summarize(), which creates summaries for each group.
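As a quick preview of how the two functions fit together, here is a minimal sketch on a small made-up tibble. The `houses` data and its `condition` column are hypothetical stand-ins, not the ahp dataset. Note that group_by() does not change the rows; it only records the grouping, which summarize() then uses.

```r
library(dplyr)

# Hypothetical toy data (made-up values, not the ahp dataset)
houses <- tibble(
  condition  = c(5, 5, 7, 7),
  sale_price = c(100, 140, 200, 260)
)

# group_by() keeps every row and only records the grouping variable
grouped <- houses %>% group_by(condition)
group_vars(grouped)

# summarize() then computes one summary row per group
avg_by_cond <- grouped %>%
  summarize(ave_price = mean(sale_price))
avg_by_cond
```

The result has one row per value of `condition`, holding the average price of the houses in that group.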

### 6.6.1 Create Summaries

To create summaries of a variable, you can use the summarize() function. Let’s compute the number of houses, the average living area, and the 1st and 3rd quartiles of the sale price for all houses.

library(r02pro)
library(tidyverse)
ahp %>%
  summarize(n_houses = n(), ave_liv_area = mean(liv_area),
            prob = c(0.25, 0.75),
            q_price = quantile(sale_price, c(0.25, 0.75), na.rm = TRUE))
#> # A tibble: 2 × 4
#>   n_houses ave_liv_area  prob q_price
#>      <int>        <dbl> <dbl>   <dbl>
#> 1     2048        1500.  0.25    130.
#> 2     2048        1500.  0.75    214

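Note that here we use the n() function to count the number of houses, and we include the prob column to record the quantile level of each row. Because quantile() returns two values, the result has two rows, and the single-valued summaries n_houses and ave_liv_area are repeated in each row.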
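In dplyr 1.1.0 and later, returning more than one row per group from summarize() is deprecated, so the code above may produce a warning on recent versions. The suggested replacement is reframe(). The following is a minimal sketch of the same kind of summary written with reframe(), using a small made-up tibble (`houses` and its values are hypothetical, not the ahp dataset).

```r
library(dplyr)

# Hypothetical toy data (made-up values, not the ahp dataset)
houses <- tibble(
  liv_area   = c(1200, 1500, 1800, 2100),
  sale_price = c(100, 150, 200, 250)
)

probs <- c(0.25, 0.75)

# reframe() (dplyr >= 1.1.0) allows any number of result rows;
# single-valued summaries are recycled to match the two quantiles
q_tbl <- houses %>%
  reframe(
    n_houses     = n(),
    ave_liv_area = mean(liv_area),
    prob         = probs,
    q_price      = quantile(sale_price, probs, na.rm = TRUE)
  )
q_tbl
```

As with the summarize() version, the result has one row per quantile level, with `n_houses` and `ave_liv_area` repeated in each row.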

### 6.6.2 Create Grouped Summaries

So far, we have learned to use summarize() to create summaries of all observations. In practice, it is often more useful to compute summaries after grouping the observations according to certain criteria. Let’s say we want to create the same summaries for each value of overall quality (oa_qual). To do so, we add a call to group_by() before applying summarize().

ahp %>%
  group_by(oa_qual) %>%
  summarize(n_houses = n(), ave_liv_area = mean(liv_area),
            prob = c(0.25, 0.75),
            q_price = quantile(sale_price, c(0.25, 0.75), na.rm = TRUE))
#> summarise() has grouped output by 'oa_qual'. You can override using the
#> .groups argument.

From the output, you can see that for each value of oa_qual, we have two rows, one for each quartile. It is interesting to visualize the relationship between oa_qual and the quartile price (q_price), with different colors for the 1st and 3rd quartiles.

ahp %>%
  group_by(oa_qual) %>%
  summarize(n_houses = n(), ave_liv_area = mean(liv_area),
            prob = c(0.25, 0.75),
            q_price = quantile(sale_price, c(0.25, 0.75), na.rm = TRUE)) %>%
  ggplot(mapping = aes(x = oa_qual, y = q_price, color = factor(prob))) +
  geom_point() +
  geom_smooth()
#> summarise() has grouped output by 'oa_qual'. You can override using the
#> .groups argument.
#> geom_smooth() using method = 'loess' and formula 'y ~ x'

This figure is very informative: it shows that both the 1st and 3rd quartiles of the sale price increase as the overall quality increases.
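The message that summarise() prints in the grouped examples above says that the result is still grouped by oa_qual. One way to avoid it, sketched here on made-up data (`houses` and its values are hypothetical, not the ahp dataset), is to put each quartile in its own column so that every group yields a single row, and to pass .groups = "drop" so that the result is an ungrouped tibble.

```r
library(dplyr)

# Hypothetical toy data (made-up values, not the ahp dataset)
houses <- tibble(
  oa_qual    = c(5, 5, 5, 7, 7, 7),
  sale_price = c(100, 120, 140, 200, 230, 260)
)

# One row per group: each quartile gets its own column,
# and .groups = "drop" removes the grouping from the result
by_qual <- houses %>%
  group_by(oa_qual) %>%
  summarize(
    n_houses = n(),
    q1_price = quantile(sale_price, 0.25, na.rm = TRUE),
    q3_price = quantile(sale_price, 0.75, na.rm = TRUE),
    .groups  = "drop"
  )
by_qual
```

This wide layout is convenient for tables. The long layout used in this section, with one row per quantile level, is the more convenient one for mapping the level to color in ggplot().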

### 6.6.3 Exercises

Use the ahp dataset and the pipe operator for the following exercises.

1. For each month in which houses were sold, compute the 1st and 3rd quartiles of the sale price. Then, create a scatterplot of the quartile of the sale price (y-axis) against the month (x-axis), with different colors for the 1st and 3rd quartiles. Explain the findings from the figure.

2. Someone conjectures that houses that are at most 30 years old when sold have a higher sale price than houses that are more than 30 years old. Check whether this is true in terms of the maximum, median, and minimum sale price of the houses in each group.