7.7 Joins
So far, we have been working on a single data set. In applications, however, we often have the need to combine multiple data sets into a single one before doing data analysis. Let’s prepare two data sets to show the various ways we can join two data sets.
library(tidyverse)
library(r02pro)
top_pop <- gm %>%
filter(year == 2008 & !is.na(continent)) %>%
mutate(r = rank(desc(population), ties.method = "first")) %>%
filter(r <= 5) %>%
select(country, population, HDI_category) %>%
arrange(desc(population))
HDI_cate_dict <- tibble(HDI_category = c("low", "medium", "high"), HDI_description = c("below 0.549",
"between 0.550–0.699", "between 0.700–0.799"))
Here, top_pop
is a tibble of the top 3 countries with the highest population in 2008. HDI_cate_dict
is a tibble that provides the description of the Human Development Index (HDI) categories.
It is clear that the two tibbles share a variable named HDI_category
. It could be interesting to join the two tibbles into a single one by including all variables while matching the values of HDI_category
. The bridging variable HDI_category
is usually called the key to the joining operation.
Next, we will discuss several different join operations between top_pop
and HDI_cate_dict
, with the key being HDI_category
.
7.7.1 Inner Join
The first type of join operation on two tables with a set of keys is called inner join. You can use the inner_join(tb1, tb2)
function, which returns all rows in tb1
such that there are matching values of the corresponding keys in tb2
.
inner <- inner_join(top_pop, HDI_cate_dict, by = "HDI_category")
inner
#> # A tibble: 4 × 4
#> country population HDI_category HDI_description
#> <chr> <dbl> <chr> <chr>
#> 1 China 1360000 medium between 0.550–0.699
#> 2 India 1220000 medium between 0.550–0.699
#> 3 Indonesia 239000 medium between 0.550–0.699
#> 4 Brazil 194000 high between 0.700–0.799
From the result, you can see that there are 4 observations. Each observation corresponds to one row in top_pop
that has a matching value in HDI_cate_dict
for HDI_category
. The observation corresponding to United States in top_pop
is dropped in the result since its HDI_category
value is "very high"
, which doesn’t have a matched row in HDI_cate_dict
.
7.7.2 Left Join
The second type of join operation on two tables with a set of keys is called left join. You can use the left_join(tb1, tb2)
function, which returns all rows in tb1
, no matter whether there are matching values of the corresponding keys in tb2
.
left <- left_join(top_pop, HDI_cate_dict, by = "HDI_category")
left
#> # A tibble: 5 × 4
#> country population HDI_category HDI_description
#> <chr> <dbl> <chr> <chr>
#> 1 China 1360000 medium between 0.550–0.699
#> 2 India 1220000 medium between 0.550–0.699
#> 3 United States 306000 very high <NA>
#> 4 Indonesia 239000 medium between 0.550–0.699
#> 5 Brazil 194000 high between 0.700–0.799
As you can see from the result, all rows in top_pop
is kept, even though United States doesn’t have a matching observation in HDI_cate_dict
for the variable HDI_category
. In this case, the corresponding value for HDI_description
is NA
.
7.7.3 Right Join
The third type of join operation on two tables with a set of keys is called right join, which works in a similar fashion as left join. You can use the right_join(tb1, tb2)
function, which returns all rows in tb2
, no matter whether there are matching values of the corresponding keys in tb1
or not.
right <- right_join(top_pop, HDI_cate_dict, by = "HDI_category")
right
#> # A tibble: 5 × 4
#> country population HDI_category HDI_description
#> <chr> <dbl> <chr> <chr>
#> 1 China 1360000 medium between 0.550–0.699
#> 2 India 1220000 medium between 0.550–0.699
#> 3 Indonesia 239000 medium between 0.550–0.699
#> 4 Brazil 194000 high between 0.700–0.799
#> 5 <NA> NA low below 0.549
As you can see from the result, all rows in HDI_cate_dict
is kept, even though there aren’t any matching observations in top_pop
for HDI_category == "low"
. In this case, the corresponding values for country
and population
are NA
.
7.7.4 Full Join
The fourth type of join operation on two tables with a set of keys is called full join. You can use the full_join(tb1, tb2)
function, which returns all rows in both tb1
and tb2
, no matter whether there are matching values of the keys in either tb1
or tb2
.
full <- full_join(top_pop, HDI_cate_dict, by = "HDI_category")
full
#> # A tibble: 6 × 4
#> country population HDI_category HDI_description
#> <chr> <dbl> <chr> <chr>
#> 1 China 1360000 medium between 0.550–0.699
#> 2 India 1220000 medium between 0.550–0.699
#> 3 United States 306000 very high <NA>
#> 4 Indonesia 239000 medium between 0.550–0.699
#> 5 Brazil 194000 high between 0.700–0.799
#> 6 <NA> NA low below 0.549
As you can see from the result, all rows in both top_pop
and HDI_cate_dict
are kept. The rows in top_pop
that don’t have matching values in HDI_cate_dict
are filled with NA
for the variables in HDI_cate_dict
, and vice versa.
7.7.5 Filtering Joins
For the previous four types of joins, the resulting table contains columns from both input tables. Sometimes, you may want to only keep the columns from one of the input tables. This is called filtering joins, meaning that it is actually doing the filtering operations like the filter()
function we learned in Section 7.1.
a. Semi Join
The first type of filtering joins is called semi join, in which semi_join(tb1, tb2)
returns all rows from tb1
where there are matching values in tb2
for the keys.
semi_join(top_pop, HDI_cate_dict, by = "HDI_category")
#> # A tibble: 4 × 3
#> country population HDI_category
#> <chr> <dbl> <chr>
#> 1 China 1360000 medium
#> 2 India 1220000 medium
#> 3 Indonesia 239000 medium
#> 4 Brazil 194000 high
From the result, you can see that only the rows in top_pop
that have matching values in HDI_cate_dict
for HDI_category
are kept. Again, the columns from HDI_cate_dict
are not included in the output, making it a filtering operation.
b. Anti Join
Another type of filtering joins is called anti join, in which anti_join(tb1, tb2)
returns all rows from tb1
where there don’t exist matching values in tb2
on the keys.
anti_join(top_pop, HDI_cate_dict, by = "HDI_category")
#> # A tibble: 1 × 3
#> country population HDI_category
#> <chr> <dbl> <chr>
#> 1 United States 306000 very high
From the result, you can see that only the rows in top_pop
that don’t have matching values in HDI_cate_dict
for HDI_category
are kept. The columns from HDI_cate_dict
are not included in the output, making it a filtering operation.
7.7.6 A Summary of All Joins
Finally, we summarize all the six joins in the following table, in terms of which rows and columns are selected in the final output.
Code | Rows | Columns |
---|---|---|
inner_join(tb1, tb2, by)
|
Rows in both tb1 and tb2
|
Columns in either tb1 or tb2
|
left_join(tb1, tb2, by)
|
Rows in tb1
|
Columns in either tb1 or tb2
|
right_join(tb1, tb2, by)
|
Rows in tb2
|
Columns in either tb1 or tb2
|
full_join(tb1, tb2, by)
|
All rows in tb1 or tb2
|
Columns in either tb1 or tb2
|
semi_join(tb1, tb2, by)
|
Rows in tb1 with matching keys in tb2
|
Only columns in tb1
|
anti_join(tb1, tb2, by)
|
Rows in tb1 without matching keys in tb2
|
Only columns in tb1
|
7.7.7 Exercises
First, let’s prepare two data sets for the exercises.
library(tidyverse)
library(r02pro)
d1 <- ahp %>%
select(dt_sold, bsmt_area, bsmt_ht) %>%
head(n = 5)
d2 <- tibble(bsmt_ht = c("Excellent", "Good", "Average", "Poor"), height = c("100+ inches",
"90-99 inches", "80-89 inches", "<70 inches"))
d1
#> # A tibble: 5 × 3
#> dt_sold bsmt_area bsmt_ht
#> <date> <dbl> <chr>
#> 1 2010-03-25 725 Average
#> 2 2009-04-10 913 Good
#> 3 2010-01-15 1057 Average
#> 4 2010-04-19 384 Good
#> 5 2010-03-22 676 Fair
d2
#> # A tibble: 4 × 2
#> bsmt_ht height
#> <chr> <chr>
#> 1 Excellent 100+ inches
#> 2 Good 90-99 inches
#> 3 Average 80-89 inches
#> 4 Poor <70 inches
- Perform an inner join between
d1
andd2
by the keybsmt_ht
. What are the rows and columns in the final output? - Perform a left join between
d1
andd2
by the keybsmt_ht
. What are the rows and columns in the final output? - Perform a right join between
d1
andd2
by the keybsmt_ht
. What are the rows and columns in the final output? - Perform a full join between
d1
andd2
by the keybsmt_ht
. What are the rows and columns in the final output? - Perform a semi join between
d1
andd2
by the keybsmt_ht
. What are the rows and columns in the final output? - Perform an anti join between
d1
andd2
by the keybsmt_ht
. What are the rows and columns in the final output? - Define
d2_new
by adding a new column tod2
calledheight_code
with values 1, 2, 3, and 4 corresponding to"Excellent"
,"Good"
,"Average"
, and"Poor"
inbsmt_ht
, respectively. Perform an inner join betweend1
andd2_new
bybsmt_ht
. Does adding this new column affect the join results? - Define
d1_filter
by filteringd1
to only include rows wherebsmt_area
is between 600 and 800. Perform a left join betweend1_filter
andd2
bybsmt_ht
. How many rows in the joined dataset come fromd2
? - Defined
d1_na
by replacingbsmt_ht
in the first row ofd1
withNA
. Perform a full join betweend1_na
andd2
bybsmt_ht
. How does this missing value affect the join result? - Add a row in
d2
withbsmt_ht
="Very Good"
andheight = "95-99 inches"
. Perform an anti join betweend1
and the updatedd2
bybsmt_ht
. Explain which rows are retained in the output and why.