7.7 Joins

So far, we have been working on a single data set. In applications, however, we often have the need to combine multiple data sets into a single one before doing data analysis. Let’s prepare two data sets to show the various ways we can join two data sets.

library(tidyverse)
library(r02pro)
top_pop <- gm %>%
    filter(year == 2008 & !is.na(continent)) %>%
    mutate(r = rank(desc(population), ties.method = "first")) %>%
    filter(r <= 5) %>%
    select(country, population, HDI_category) %>%
    arrange(desc(population))
HDI_cate_dict <- tibble(HDI_category = c("low", "medium", "high"), HDI_description = c("below 0.549",
    "between 0.550–0.699", "between 0.700–0.799"))

Here, top_pop is a tibble of the top 3 countries with the highest population in 2008. HDI_cate_dict is a tibble that provides the description of the Human Development Index (HDI) categories.

It is clear that the two tibbles share a variable named HDI_category. It could be interesting to join the two tibbles into a single one by including all variables while matching the values of HDI_category. The bridging variable HDI_category is usually called the key to the joining operation.

Next, we will discuss several different join operations between top_pop and HDI_cate_dict, with the key being HDI_category.

7.7.1 Inner Join

The first type of join operation on two tables with a set of keys is called inner join. You can use the inner_join(tb1, tb2) function, which returns all rows in tb1 such that there are matching values of the corresponding keys in tb2.

inner <- inner_join(top_pop, HDI_cate_dict, by = "HDI_category")
inner
#> # A tibble: 4 × 4
#>   country   population HDI_category HDI_description    
#>   <chr>          <dbl> <chr>        <chr>              
#> 1 China        1360000 medium       between 0.550–0.699
#> 2 India        1220000 medium       between 0.550–0.699
#> 3 Indonesia     239000 medium       between 0.550–0.699
#> 4 Brazil        194000 high         between 0.700–0.799

From the result, you can see that there are 4 observations. Each observation corresponds to one row in top_pop that has a matching value in HDI_cate_dict for HDI_category. The observation corresponding to United States in top_pop is dropped in the result since its HDI_category value is "very high", which doesn’t have a matched row in HDI_cate_dict.

7.7.2 Left Join

The second type of join operation on two tables with a set of keys is called left join. You can use the left_join(tb1, tb2) function, which returns all rows in tb1, no matter whether there are matching values of the corresponding keys in tb2.

left <- left_join(top_pop, HDI_cate_dict, by = "HDI_category")
left
#> # A tibble: 5 × 4
#>   country       population HDI_category HDI_description    
#>   <chr>              <dbl> <chr>        <chr>              
#> 1 China            1360000 medium       between 0.550–0.699
#> 2 India            1220000 medium       between 0.550–0.699
#> 3 United States     306000 very high    <NA>               
#> 4 Indonesia         239000 medium       between 0.550–0.699
#> 5 Brazil            194000 high         between 0.700–0.799

As you can see from the result, all rows in top_pop is kept, even though United States doesn’t have a matching observation in HDI_cate_dict for the variable HDI_category. In this case, the corresponding value for HDI_description is NA.

7.7.3 Right Join

The third type of join operation on two tables with a set of keys is called right join, which works in a similar fashion as left join. You can use the right_join(tb1, tb2) function, which returns all rows in tb2, no matter whether there are matching values of the corresponding keys in tb1 or not.

right <- right_join(top_pop, HDI_cate_dict, by = "HDI_category")
right
#> # A tibble: 5 × 4
#>   country   population HDI_category HDI_description    
#>   <chr>          <dbl> <chr>        <chr>              
#> 1 China        1360000 medium       between 0.550–0.699
#> 2 India        1220000 medium       between 0.550–0.699
#> 3 Indonesia     239000 medium       between 0.550–0.699
#> 4 Brazil        194000 high         between 0.700–0.799
#> 5 <NA>              NA low          below 0.549

As you can see from the result, all rows in HDI_cate_dict is kept, even though there aren’t any matching observations in top_pop for HDI_category == "low". In this case, the corresponding values for country and population are NA.

7.7.4 Full Join

The fourth type of join operation on two tables with a set of keys is called full join. You can use the full_join(tb1, tb2) function, which returns all rows in both tb1 and tb2, no matter whether there are matching values of the keys in either tb1 or tb2.

full <- full_join(top_pop, HDI_cate_dict, by = "HDI_category")
full
#> # A tibble: 6 × 4
#>   country       population HDI_category HDI_description    
#>   <chr>              <dbl> <chr>        <chr>              
#> 1 China            1360000 medium       between 0.550–0.699
#> 2 India            1220000 medium       between 0.550–0.699
#> 3 United States     306000 very high    <NA>               
#> 4 Indonesia         239000 medium       between 0.550–0.699
#> 5 Brazil            194000 high         between 0.700–0.799
#> 6 <NA>                  NA low          below 0.549

As you can see from the result, all rows in both top_pop and HDI_cate_dict are kept. The rows in top_pop that don’t have matching values in HDI_cate_dict are filled with NA for the variables in HDI_cate_dict, and vice versa.

7.7.5 Filtering Joins

For the previous four types of joins, the resulting table contains columns from both input tables. Sometimes, you may want to only keep the columns from one of the input tables. This is called filtering joins, meaning that it is actually doing the filtering operations like the filter() function we learned in Section 7.1.

a. Semi Join

The first type of filtering joins is called semi join, in which semi_join(tb1, tb2) returns all rows from tb1 where there are matching values in tb2 for the keys.

semi_join(top_pop, HDI_cate_dict, by = "HDI_category")
#> # A tibble: 4 × 3
#>   country   population HDI_category
#>   <chr>          <dbl> <chr>       
#> 1 China        1360000 medium      
#> 2 India        1220000 medium      
#> 3 Indonesia     239000 medium      
#> 4 Brazil        194000 high

From the result, you can see that only the rows in top_pop that have matching values in HDI_cate_dict for HDI_category are kept. Again, the columns from HDI_cate_dict are not included in the output, making it a filtering operation.

b. Anti Join

Another type of filtering joins is called anti join, in which anti_join(tb1, tb2) returns all rows from tb1 where there don’t exist matching values in tb2 on the keys.

anti_join(top_pop, HDI_cate_dict, by = "HDI_category")
#> # A tibble: 1 × 3
#>   country       population HDI_category
#>   <chr>              <dbl> <chr>       
#> 1 United States     306000 very high

From the result, you can see that only the rows in top_pop that don’t have matching values in HDI_cate_dict for HDI_category are kept. The columns from HDI_cate_dict are not included in the output, making it a filtering operation.

7.7.6 A Summary of All Joins

Finally, we summarize all the six joins in the following table, in terms of which rows and columns are selected in the final output.

Code Rows Columns
inner_join(tb1, tb2, by) Rows in both tb1 and tb2 Columns in either tb1 or tb2
left_join(tb1, tb2, by) Rows in tb1 Columns in either tb1 or tb2
right_join(tb1, tb2, by) Rows in tb2 Columns in either tb1 or tb2
full_join(tb1, tb2, by) All rows in tb1 or tb2 Columns in either tb1 or tb2
semi_join(tb1, tb2, by) Rows in tb1 with matching keys in tb2 Only columns in tb1
anti_join(tb1, tb2, by) Rows in tb1 without matching keys in tb2 Only columns in tb1

7.7.7 Exercises

First, let’s prepare two data sets for the exercises.

library(tidyverse)
library(r02pro)
d1 <- ahp %>%
    select(dt_sold, bsmt_area, bsmt_ht) %>%
    head(n = 5)
d2 <- tibble(bsmt_ht = c("Excellent", "Good", "Average", "Poor"), height = c("100+ inches",
    "90-99 inches", "80-89 inches", "<70 inches"))
d1
#> # A tibble: 5 × 3
#>   dt_sold    bsmt_area bsmt_ht
#>   <date>         <dbl> <chr>  
#> 1 2010-03-25       725 Average
#> 2 2009-04-10       913 Good   
#> 3 2010-01-15      1057 Average
#> 4 2010-04-19       384 Good   
#> 5 2010-03-22       676 Fair
d2
#> # A tibble: 4 × 2
#>   bsmt_ht   height      
#>   <chr>     <chr>       
#> 1 Excellent 100+ inches 
#> 2 Good      90-99 inches
#> 3 Average   80-89 inches
#> 4 Poor      <70 inches
  1. Perform an inner join between d1 and d2 by the key bsmt_ht. What are the rows and columns in the final output?
  2. Perform a left join between d1 and d2 by the key bsmt_ht. What are the rows and columns in the final output?
  3. Perform a right join between d1 and d2 by the key bsmt_ht. What are the rows and columns in the final output?
  4. Perform a full join between d1 and d2 by the key bsmt_ht. What are the rows and columns in the final output?
  5. Perform a semi join between d1 and d2 by the key bsmt_ht. What are the rows and columns in the final output?
  6. Perform an anti join between d1 and d2 by the key bsmt_ht. What are the rows and columns in the final output?
  7. Define d2_new by adding a new column to d2 called height_code with values 1, 2, 3, and 4 corresponding to "Excellent", "Good", "Average", and "Poor" in bsmt_ht, respectively. Perform an inner join between d1 and d2_new by bsmt_ht. Does adding this new column affect the join results?
  8. Define d1_filter by filtering d1 to only include rows where bsmt_area is between 600 and 800. Perform a left join between d1_filter and d2 by bsmt_ht. How many rows in the joined dataset come from d2?
  9. Defined d1_na by replacing bsmt_ht in the first row of d1 with NA. Perform a full join between d1_na and d2 by bsmt_ht. How does this missing value affect the join result?
  10. Add a row in d2 with bsmt_ht = "Very Good" and height = "95-99 inches". Perform an anti join between d1 and the updated d2 by bsmt_ht. Explain which rows are retained in the output and why.