7.6 Grouped Mutate and Filter via group_by()

7.6 Grouped Mutate and Filter via `group_by()`

In Section 7.5, you have seen the power of group_by() and summarize() which can help to create grouped summaries. In turns out the group_by() can be combined with filter() and mutate() to filter observations by group and create new variables by group.

7.6.1 Grouped Filter

For the gm dataset, let’s say we want to find the countries that have a higher average GDP per capita than the average in the corresponding continent in the year 2008. We can use the following code.

library(r02pro)
library(dplyr)
gm %>%
    filter(year == 2008) %>%
    group_by(continent) %>%
    mutate(ave_GDP = mean(GDP_per_capita, na.rm = TRUE)) %>%
    filter(GDP_per_capita > ave_GDP) %>%
    select(country, continent, GDP_per_capita, ave_GDP)
#> # A tibble: 63 × 4
#> # Groups:   continent [6]
#>    country              continent GDP_per_capita ave_GDP
#>    <chr>                <chr>              <dbl>   <dbl>
#>  1 Angola               Africa              3.96    2.41
#>  2 Andorra              <NA>               35.4    27.6 
#>  3 United Arab Emirates Asia               34.6    13.0 
#>  4 Antigua and Barbuda  Americas           15.4    12.6 
#>  5 Australia            Oceania            53.3    11.2 
#>  6 Austria              Europe             42.6    26.7 
#>  7 Belgium              Europe             39      26.7 
#>  8 Bahrain              Asia               21.1    13.0 
#>  9 Bahamas              Americas           32.3    12.6 
#> 10 Bermuda              <NA>              118      27.6 
#> # … with 53 more rows

The code works in the following steps. 1. Filter the observations for the year 2008. 2. Group the observations by continent. 3. Compute the average GDP per capita for each continent. 4. Filter the observations that have a GDP per capita higher than the average GDP per capita of the corresponding continent. 5. Select the variables country, continent, GDP_per_capita, and ave_GDP.

7.6.2 Grouped Mutate

Now, let’s say we want to find the top two countries with the highest GDP per capita for each continent in the year 2008. We can use the following code.

gm %>%
    filter(year == 2008 & !is.na(continent)) %>%
    group_by(continent) %>%
    mutate(r = rank(desc(GDP_per_capita), ties.method = "first")) %>%
    filter(r <= 2) %>%
    select(country, continent, GDP_per_capita) %>%
    arrange(continent, desc(GDP_per_capita))
#> # A tibble: 10 × 3
#> # Groups:   continent [5]
#>    country           continent GDP_per_capita
#>    <chr>             <chr>              <dbl>
#>  1 Equatorial Guinea Africa              15.9
#>  2 Seychelles        Africa              11.7
#>  3 United States     Americas            51.9
#>  4 Greenland         Americas            42.9
#>  5 Macao, China      Asia                60.5
#>  6 Qatar             Asia                59.1
#>  7 Luxembourg        Europe              96  
#>  8 Switzerland       Europe              80.3
#>  9 Australia         Oceania             53.3
#> 10 New Zealand       Oceania             35.3

Here, we first filter the observations for the year 2008 and remove the missing values in the continent variable. Then, we group the observations by continent. We use the mutate() function to create a new variable r which represents the rank of the GDP per capita in descending order. The ties.method = "first" argument is used to break the ties by giving the same rank to the same values. We then filter the observations that have a rank less than or equal to 2. Finally, we select the variables country, continent, and GDP_per_capita, and arrange the observations according to the continent and the GDP per capita in descending order.

7.6.3 Exercises

Using the ahp dataset and the pipe operator for the following exercises.

Create a logical variable high_sale_price, which equals TRUE if the house’s sale price exceeds the average sale price for the corresponding month, and FALSE otherwise. Next, create a scatterplot to visualize the relationship between the month and the sale price, using different colors based on whether high_sale_price is TRUE or FALSE. Explain the insights from the plot.
Focus on the overall conditions where the number of corresponding houses are at least 30. Then, summarize the number of observations and the average sale price for each overall condition.
Find the top two expensive houses that was remodeled each year, show only the remodel year and sale price, and arrange the result in the ascending order of remodel year. Note: if there are ties in the sale price ranking, please include all the tied observations.
Group the houses by the combination of kitchen quality (kit_qual) and central AC (central_air), find the two houses with the minimum and maximum sale prices for each group. Then, arrange the groups in the descending order of maximum sale price within each group.