7.7 Joins
So far, we have been working on a single data set. In applications, however, we often have the need to combine multiple data sets into a single one before doing data analysis. Let’s prepare two data sets to show the various ways we can join two data sets.
library(tidyverse)
library(r02pro)
top_pop <- gm %>%
filter(year == 2008 & !is.na(continent)) %>%
mutate(r = rank(desc(population), ties.method = "first")) %>%
filter(r <= 5) %>%
select(country, population, HDI_category) %>%
arrange(desc(population))
HDI_cate_dict <- tibble(HDI_category = c("low", "medium", "high"), HDI_description = c("below 0.549",
"between 0.550–0.699", "between 0.700–0.799"))Here, top_pop is a tibble of the top 3 countries with the highest population in 2008. HDI_cate_dict is a tibble that provides the description of the Human Development Index (HDI) categories.
It is clear that the two tibbles share a variable named HDI_category. It could be interesting to join the two tibbles into a single one by including all variables while matching the values of HDI_category. The bridging variable HDI_category is usually called the key to the joining operation.
Next, we will discuss several different join operations between top_pop and HDI_cate_dict, with the key being HDI_category.
7.7.1 Inner Join
The first type of join operation on two tables with a set of keys is called inner join. You can use the inner_join(tb1, tb2) function, which returns all rows in tb1 such that there are matching values of the corresponding keys in tb2.
inner <- inner_join(top_pop, HDI_cate_dict, by = "HDI_category")
inner
#> # A tibble: 4 × 4
#> country population HDI_category HDI_description
#> <chr> <dbl> <chr> <chr>
#> 1 China 1360000 medium between 0.550–0.699
#> 2 India 1220000 medium between 0.550–0.699
#> 3 Indonesia 239000 medium between 0.550–0.699
#> 4 Brazil 194000 high between 0.700–0.799From the result, you can see that there are 4 observations. Each observation corresponds to one row in top_pop that has a matching value in HDI_cate_dict for HDI_category. The observation corresponding to United States in top_pop is dropped in the result since its HDI_category value is "very high", which doesn’t have a matched row in HDI_cate_dict.
7.7.2 Left Join
The second type of join operation on two tables with a set of keys is called left join. You can use the left_join(tb1, tb2) function, which returns all rows in tb1, no matter whether there are matching values of the corresponding keys in tb2.
left <- left_join(top_pop, HDI_cate_dict, by = "HDI_category")
left
#> # A tibble: 5 × 4
#> country population HDI_category HDI_description
#> <chr> <dbl> <chr> <chr>
#> 1 China 1360000 medium between 0.550–0.699
#> 2 India 1220000 medium between 0.550–0.699
#> 3 United States 306000 very high <NA>
#> 4 Indonesia 239000 medium between 0.550–0.699
#> 5 Brazil 194000 high between 0.700–0.799As you can see from the result, all rows in top_pop is kept, even though United States doesn’t have a matching observation in HDI_cate_dict for the variable HDI_category. In this case, the corresponding value for HDI_description is NA.
7.7.3 Right Join
The third type of join operation on two tables with a set of keys is called right join, which works in a similar fashion as left join. You can use the right_join(tb1, tb2) function, which returns all rows in tb2, no matter whether there are matching values of the corresponding keys in tb1 or not.
right <- right_join(top_pop, HDI_cate_dict, by = "HDI_category")
right
#> # A tibble: 5 × 4
#> country population HDI_category HDI_description
#> <chr> <dbl> <chr> <chr>
#> 1 China 1360000 medium between 0.550–0.699
#> 2 India 1220000 medium between 0.550–0.699
#> 3 Indonesia 239000 medium between 0.550–0.699
#> 4 Brazil 194000 high between 0.700–0.799
#> 5 <NA> NA low below 0.549As you can see from the result, all rows in HDI_cate_dict is kept, even though there aren’t any matching observations in top_pop for HDI_category == "low". In this case, the corresponding values for country and population are NA.
7.7.4 Full Join
The fourth type of join operation on two tables with a set of keys is called full join. You can use the full_join(tb1, tb2) function, which returns all rows in both tb1 and tb2, no matter whether there are matching values of the keys in either tb1 or tb2.
full <- full_join(top_pop, HDI_cate_dict, by = "HDI_category")
full
#> # A tibble: 6 × 4
#> country population HDI_category HDI_description
#> <chr> <dbl> <chr> <chr>
#> 1 China 1360000 medium between 0.550–0.699
#> 2 India 1220000 medium between 0.550–0.699
#> 3 United States 306000 very high <NA>
#> 4 Indonesia 239000 medium between 0.550–0.699
#> 5 Brazil 194000 high between 0.700–0.799
#> 6 <NA> NA low below 0.549As you can see from the result, all rows in both top_pop and HDI_cate_dict are kept. The rows in top_pop that don’t have matching values in HDI_cate_dict are filled with NA for the variables in HDI_cate_dict, and vice versa.
7.7.5 Filtering Joins
For the previous four types of joins, the resulting table contains columns from both input tables. Sometimes, you may want to only keep the columns from one of the input tables. This is called filtering joins, meaning that it is actually doing the filtering operations like the filter() function we learned in Section 7.1.
a. Semi Join
The first type of filtering joins is called semi join, in which semi_join(tb1, tb2) returns all rows from tb1 where there are matching values in tb2 for the keys.
semi_join(top_pop, HDI_cate_dict, by = "HDI_category")
#> # A tibble: 4 × 3
#> country population HDI_category
#> <chr> <dbl> <chr>
#> 1 China 1360000 medium
#> 2 India 1220000 medium
#> 3 Indonesia 239000 medium
#> 4 Brazil 194000 highFrom the result, you can see that only the rows in top_pop that have matching values in HDI_cate_dict for HDI_category are kept. Again, the columns from HDI_cate_dict are not included in the output, making it a filtering operation.
b. Anti Join
Another type of filtering joins is called anti join, in which anti_join(tb1, tb2) returns all rows from tb1 where there don’t exist matching values in tb2 on the keys.
anti_join(top_pop, HDI_cate_dict, by = "HDI_category")
#> # A tibble: 1 × 3
#> country population HDI_category
#> <chr> <dbl> <chr>
#> 1 United States 306000 very highFrom the result, you can see that only the rows in top_pop that don’t have matching values in HDI_cate_dict for HDI_category are kept. The columns from HDI_cate_dict are not included in the output, making it a filtering operation.
7.7.6 A Summary of All Joins
Finally, we summarize all the six joins in the following table, in terms of which rows and columns are selected in the final output.
| Code | Rows | Columns |
|---|---|---|
inner_join(tb1, tb2, by)
|
Rows in both tb1 and tb2
|
Columns in either tb1 or tb2
|
left_join(tb1, tb2, by)
|
Rows in tb1
|
Columns in either tb1 or tb2
|
right_join(tb1, tb2, by)
|
Rows in tb2
|
Columns in either tb1 or tb2
|
full_join(tb1, tb2, by)
|
All rows in tb1 or tb2
|
Columns in either tb1 or tb2
|
semi_join(tb1, tb2, by)
|
Rows in tb1 with matching keys in tb2
|
Only columns in tb1
|
anti_join(tb1, tb2, by)
|
Rows in tb1 without matching keys in tb2
|
Only columns in tb1
|
7.7.7 Exercises
First, let’s prepare two data sets for the exercises.
library(tidyverse)
library(r02pro)
d1 <- ahp %>%
select(dt_sold, bsmt_area, bsmt_ht) %>%
head(n = 5)
d2 <- tibble(bsmt_ht = c("Excellent", "Good", "Average", "Poor"), height = c("100+ inches",
"90-99 inches", "80-89 inches", "<70 inches"))
d1
#> # A tibble: 5 × 3
#> dt_sold bsmt_area bsmt_ht
#> <date> <dbl> <chr>
#> 1 2010-03-25 725 Average
#> 2 2009-04-10 913 Good
#> 3 2010-01-15 1057 Average
#> 4 2010-04-19 384 Good
#> 5 2010-03-22 676 Fair
d2
#> # A tibble: 4 × 2
#> bsmt_ht height
#> <chr> <chr>
#> 1 Excellent 100+ inches
#> 2 Good 90-99 inches
#> 3 Average 80-89 inches
#> 4 Poor <70 inches- Perform an inner join between
d1andd2by the keybsmt_ht. What are the rows and columns in the final output? - Perform a left join between
d1andd2by the keybsmt_ht. What are the rows and columns in the final output? - Perform a right join between
d1andd2by the keybsmt_ht. What are the rows and columns in the final output? - Perform a full join between
d1andd2by the keybsmt_ht. What are the rows and columns in the final output? - Perform a semi join between
d1andd2by the keybsmt_ht. What are the rows and columns in the final output? - Perform an anti join between
d1andd2by the keybsmt_ht. What are the rows and columns in the final output? - Define
d2_newby adding a new column tod2calledheight_codewith values 1, 2, 3, and 4 corresponding to"Excellent","Good","Average", and"Poor"inbsmt_ht, respectively. Perform an inner join betweend1andd2_newbybsmt_ht. Does adding this new column affect the join results? - Define
d1_filterby filteringd1to only include rows wherebsmt_areais between 600 and 800. Perform a left join betweend1_filterandd2bybsmt_ht. How many rows in the joined dataset come fromd2? - Defined
d1_naby replacingbsmt_htin the first row ofd1withNA. Perform a full join betweend1_naandd2bybsmt_ht. How does this missing value affect the join result? - Add a row in
d2withbsmt_ht="Very Good"andheight = "95-99 inches". Perform an anti join betweend1and the updatedd2bybsmt_ht. Explain which rows are retained in the output and why.