6.7 Grouped Mutate and Filter via group_by()

In Section 6.6, you have seen the power of group_by() and summarize() which can help to create grouped summaries. In turns out the group_by() can be combined with filter() and mutate() to filter observations by group and create new variables by group.

Let’s say we have to focus on the overall conditions where the number of corresponding houses are at least 30. Then, summarize the number of observations and the average sale price for each overall condition.

library(r02pro)
library(dplyr)
ahp %>% 
  group_by(oa_cond) %>%
  filter(n() >= 30) %>%
  summarize(count = n(), ave_price = mean(sale_price, na.rm = TRUE))
#> # A tibble: 6 × 3
#>   oa_cond count ave_price
#>     <dbl> <int>     <dbl>
#> 1       3    35      99.8
#> 2       4    70     114. 
#> 3       5  1165     207. 
#> 4       6   367     149. 
#> 5       7   270     155. 
#> 6       8   101     156.

The code works in the following steps.

  1. Group the observations according to oa_cond.
  2. Filter the observations that has a oa_cond which the number of houses having such value is at least 30.
  3. For the filtered groups, create two new variables represents the number of houses, and the average sale price, respectively.

Next, let’s try to find the top two expensive houses that was remodel each year, show only the remodel year and sale price, and arrange the result in the ascending order of remodel year. To achieve this task, we need to do the following.

  1. Group the observations by the remodel year.
  2. Compute the rank of each observation in terms of decreasing sale price within each remodel year.
  3. Filter the observations the has a rank less than or equal to 2.
  4. Select the remodel year and sale price.
  5. Arrange the observations according to the ascending order of remodel year.
ahp %>% 
  group_by(yr_remodel) %>%
  mutate(r = rank(desc(sale_price), ties.method = "first")) %>%
  filter(r <= 2 ) %>%
  select(yr_remodel, sale_price) %>%
  arrange(yr_remodel)
#> # A tibble: 122 × 2
#> # Groups:   yr_remodel [61]
#>    yr_remodel sale_price
#>         <dbl>      <dbl>
#>  1       1950       257.
#>  2       1950       256 
#>  3       1951       155 
#>  4       1951       141 
#>  5       1952       146.
#>  6       1952       166 
#>  7       1953       217 
#>  8       1953       225 
#>  9       1954       156.
#> 10       1954       150.
#> # … with 112 more rows

The final example is to group the houses by the combination of kitchen quality (kit_qual) and central AC (central_air), find the two houses with the minimum and maximum sale prices for each group, and compute the IQR of the sale prices for each group. Finally, show the kit_qual, central_air and the three values and arrange the observations in the descending order of maximum sale price for each group. The following is the sequence of operations necessary.

  1. Group the houses by kit_qual and central_air.
  2. Create three new variables representing the minimum, maximum, and IQR of the sale prices of each group.
  3. Select variables kit_qual and central_air and the three newly created ones.
  4. Arrange the observations in the descending order of maximum sale price for each group.
ahp %>% 
  group_by(kit_qual, central_air) %>%
  summarize(max_price = max(sale_price, na.rm = TRUE), min_price = min(sale_price, na.rm = TRUE), IQR_price = IQR(sale_price, na.rm = TRUE)) %>%
  select(kit_qual, central_air, max_price, min_price, IQR_price) %>%
  arrange(desc(max_price))
#> `summarise()` has grouped output by 'kit_qual'. You can override using the
#> `.groups` argument.