7.5 Create Grouped Summaries via group_by() and summarize()

In Section 7.4, you learned how to create new variables as functions of the existing ones. For example, we created a variable representing the order of GDP per capita for each country in the year 2008. In this section, we will learn how to create summaries for each group of observations.

The dplyr package provides two useful functions to achieve this: namely group_by() which can group the observations according to the specified variables, and summarize() which create summaries for each group.

7.5.1 Create Summaries

To create summaries for a variable, you can use the summarize() function. Let’s compute the number of countries, the average population, and the median of the GDP per capita for all observations in the gm dataset for the year 2008.

library(r02pro)
library(tidyverse)
gm %>%
    filter(year == 2008) %>%
    summarize(n_countries = n(), ave_population = mean(population, na.rm = TRUE),
        q_GDP = median(GDP_per_capita, na.rm = TRUE))
#> # A tibble: 1 × 3
#>   n_countries ave_population q_GDP
#>         <int>          <dbl> <dbl>
#> 1         236         34848.  5.12

Note that here we use the n() function to count the number of countries. The mean() function computes the average of the population, and the median() function computes the median of the GDP per capita. The na.rm = TRUE argument is used to remove the missing values when computing the average and median.

7.5.2 Create Grouped Summaries

So far, we have learned to use summarize() to create summaries for all observations. In practical applications, it is often more useful to compute the summaries when the observations are grouped according to certain criteria. Let’s say we want to create the same summaries for each continent in the gm dataset.

gm %>%
    group_by(continent) %>%
    summarize(n_countries = n(), ave_population = mean(population, na.rm = TRUE),
        q_GDP = median(GDP_per_capita, na.rm = TRUE))
#> # A tibble: 6 × 4
#>   continent n_countries ave_population  q_GDP
#>   <chr>           <int>          <dbl>  <dbl>
#> 1 Africa          15051         19103.  0.984
#> 2 Americas        10578         16009.  5.26 
#> 3 Asia            13764         53912.  3.41 
#> 4 Europe          11439         13590. 19.1  
#> 5 Oceania          3446          2419.  2.87 
#> 6 <NA>            11253          5008. 11.8

From the output, you can see that the summaries are created for each continent. The group_by() function is used to group the observations according to the continent variable. Then, the summarize() function is used to create the summaries for each group.

Let’s now do another example. We want to find the top 5 countries with the highest life expectancy in the year 2006 for each continent.

gm %>%
    filter(year == 2006, !is.na(continent)) %>%
    group_by(continent) %>%
    top_n(5, life_expectancy) %>%
    select(country, life_expectancy) %>%
    arrange(continent, desc(life_expectancy))
#> Adding missing grouping variables:
#> `continent`
#> # A tibble: 25 × 3
#> # Groups:   continent [5]
#>    continent country    life_expectancy
#>    <chr>     <chr>                <dbl>
#>  1 Africa    Libya                 76  
#>  2 Africa    Tunisia               75.7
#>  3 Africa    Mauritius             73.8
#>  4 Africa    Algeria               73.6
#>  5 Africa    Cape Verde            73.2
#>  6 Americas  Canada                80.9
#>  7 Americas  Costa Rica            80.3
#>  8 Americas  Cuba                  78.5
#>  9 Americas  Panama                78.5
#> 10 Americas  Chile                 78.4
#> # … with 15 more rows

In this example, we first filter the observations for the year 2006 and remove the missing values in the continent variable. Then, we group the observations according to the continent variable. The top_n() function is used to select the top 5 countries with the highest life expectancy for each continent. Finally, we select the country and life_expectancy variables and arrange the observations according to the continent and the life expectancy in descending order.

Finally, let’s say we want to visualize the trend of the average GDP per capita for each continent from 1998 to 2018.

gm %>%
    filter(year >= 1998 & year <= 2018 & !is.na(continent)) %>%
    group_by(continent, year) %>%
    summarize(ave_GDP = median(GDP_per_capita, na.rm = TRUE)) %>%
    ggplot(aes(x = year, y = ave_GDP, color = continent)) + geom_smooth(se = FALSE) +
    labs(title = "Trend of Average GDP per Capita by Continent", x = "Year", y = "Average GDP per Capita")
#> `summarise()` has grouped output by
#> 'continent'. You can override using the
#> `.groups` argument.
#> `geom_smooth()` using method = 'loess'
#> and formula = 'y ~ x'

Here, we first filter the observations for the years from 1998 to 2018 and remove the missing values in the continent variable. Then, we group the observations according to the continent and year variables. The summarize() function is used to compute the median of the GDP per capita for each group. Finally, we create a smoothline fit to visualize the trend of the average GDP per capita for each continent from 1998 to 2018.

7.5.3 Exercises

Using the ahp dataset and the pipe operator for the following exercises.

  1. For each month when the house was sold, summarize the 1st and 3rd quartile of the sale price. Then, create a scatterplot between the month (x-axis) and the quartile of the sale price with different colors for 1st and 3rd quantile. Explain the findings from the figure.

  2. Someone has a conjecture that when sold, the houses that are less than or equal to 30 years old have a higher sale price than the houses that are more than 30 years old. Show whether this is true in terms of maximum price, median price, and the minimum price for the houses in each group.