6.8 Joins
So far, we have been working on a single data set. In applications, however, we often have the need to combine multiple data sets into a single one before doing data analysis. Let’s prepare two data sets to show the various ways we can join two data sets.
library(tidyverse)
library(r02pro)
<- ahp %>%
d1 select(dt_sold, bsmt_area, bsmt_ht) %>%
head(n = 5)
<- tibble(bsmt_ht = c("Excellent", "Good", "Average", "Poor"), height = c("100+ inches", "90-99 inches", "80-89 inches", "<70 inches"))
d2
d1 d2
From the results, we know d1
is a tibble of
5 observations and 3 variables including dt_sold
, bsmt_area
, and bsmt_ht
. d2
is another tibble of 4 observations and 2 variales including bsmt_ht
and height
.
It is clear that the two tibbles share a variable named bsmt_ht
. It could be interesting to join the two tibbles into a single one by including all variables while matching the values of bsmt_ht
. The bridging variable bsmt_ht
is usually called the key to the joining operation.
Next, we will discuss several different join operations between d1
and d2
, with the key being bsmt_ht
.
6.8.1 Inner Join
The first type of join operation on two tables (d1
and d2
) with a set of keys is called inner join. You can use the inner_join()
function, which returns all rows in d1
such that there are matching values of all keys in d2
.
<- inner_join(d1, d2, by = "bsmt_ht")
inner
inner#> # A tibble: 4 × 4
#> dt_sold bsmt_area bsmt_ht height
#> <date> <dbl> <chr> <chr>
#> 1 2010-03-25 725 Average 80-89 inches
#> 2 2009-04-10 913 Good 90-99 inches
#> 3 2010-01-15 1057 Average 80-89 inches
#> 4 2010-04-19 384 Good 90-99 inches
From the result, you can see that there are 4 observations. Each observation corresponds to one row in d1
that has a matching value in d2
for bsmt_ht
. The 5th observation in d1
is dropped in the result since its bsmt_ht
value is "Fair"
, which doesn’t have a matched observation in d2
.
6.8.2 Left Join
The second type of join operation on two tables (d1
and d2
) with a set of keys is called left join. You can use the left_join()
function, which returns all rows in d1
, no matter whether there are matching values of all keys in d2
or not.
<- left_join(d1, d2, by = "bsmt_ht")
left
left#> # A tibble: 5 × 4
#> dt_sold bsmt_area bsmt_ht height
#> <date> <dbl> <chr> <chr>
#> 1 2010-03-25 725 Average 80-89 inches
#> 2 2009-04-10 913 Good 90-99 inches
#> 3 2010-01-15 1057 Average 80-89 inches
#> 4 2010-04-19 384 Good 90-99 inches
#> 5 2010-03-22 676 Fair <NA>
As you can see from the result, all rows in d1
is kept, even though the last row doesn’t have a matching observation in d2
for the variable bsmt_ht
. In this case, the corresponding value for height
is NA
.
6.8.3 Right Join
The third type of join operation on two tables (d1
and d2
) with a set of keys is called right join, which works in a similar fashion as left join. You can use the right_join()
function, which returns all rows in d2
, no matter whether there are matching values of all keys in d1
or not.
<- right_join(d1, d2, by = "bsmt_ht")
right
right#> # A tibble: 6 × 4
#> dt_sold bsmt_area bsmt_ht height
#> <date> <dbl> <chr> <chr>
#> 1 2010-03-25 725 Average 80-89 inches
#> 2 2009-04-10 913 Good 90-99 inches
#> 3 2010-01-15 1057 Average 80-89 inches
#> 4 2010-04-19 384 Good 90-99 inches
#> 5 NA NA Excellent 100+ inches
#> 6 NA NA Poor <70 inches
As you can see from the result, all rows in d2
is kept, even though the last two rows don’t have a matching observation in d1
for the variable bsmt_ht
. In this case, the corresponding values for dt_sold
and bsmt_area
are NA
.
6.8.4 Full Join
The fourth type of join operation on two tables (d1
and d2
) with a set of keys is called full join. You can use the full_join()
function, which returns all rows in both d1
and d2
, no matter whether there are matching values of all keys in either d1
or d2
.
<- full_join(d1, d2, by = "bsmt_ht")
full
full#> # A tibble: 7 × 4
#> dt_sold bsmt_area bsmt_ht height
#> <date> <dbl> <chr> <chr>
#> 1 2010-03-25 725 Average 80-89 inches
#> 2 2009-04-10 913 Good 90-99 inches
#> 3 2010-01-15 1057 Average 80-89 inches
#> 4 2010-04-19 384 Good 90-99 inches
#> 5 2010-03-22 676 Fair <NA>
#> 6 NA NA Excellent 100+ inches
#> 7 NA NA Poor <70 inches
As you can tell from the result, it includes all observations from both d1
and d2
. You can view it as the union of left_join()
and right_join()
.
6.8.5 Filtering Joins
For the previous four types of joins, the resulting table contains columns from both input tables. Sometimes, you may want to only keep the columns from one of the input tables. This is called filtering joins, meaning that it is actually doing the filtering operations like the filter()
function we learned in Section 6.1.
a. Semi Join
The first type of filtering joins is called semi join, which returns all rows from d1
where there are matching values in d2
on the keys.
<- semi_join(d1, d2, by = "bsmt_ht")
semi
semi#> # A tibble: 4 × 3
#> dt_sold bsmt_area bsmt_ht
#> <date> <dbl> <chr>
#> 1 2010-03-25 725 Average
#> 2 2009-04-10 913 Good
#> 3 2010-01-15 1057 Average
#> 4 2010-04-19 384 Good
As expected, the output is the first four rows of d1
where there are matching values of bsmt_area
in d2
. Note that only the columns of d1
are kept in the output, making it a filtering operation.
b. Anti Join
Another type of filtering joins is called anti join, which returns all rows from d1
where there don’t exist matching values in d2
on the keys.
<- anti_join(d1, d2, by = "bsmt_ht")
anti
anti#> # A tibble: 1 × 3
#> dt_sold bsmt_area bsmt_ht
#> <date> <dbl> <chr>
#> 1 2010-03-22 676 Fair
The output only contains rows of d1
where there are no matching values of bsmt_area
in d2
, which contains only the last row of d1
. Again, only the columns of d1
are kept in the output, making it another filtering operation.
6.8.6 A Summary of All Joins
Finally, we summarize all the six joins in the following table, in terms of which rows and columns are selected in the final output.
Code | Rows | Columns |
---|---|---|
inner_join(d1, d2, by)
|
All in d1 and d2
|
All columns in d1 and d2
|
left_join(d1, d2, by)
|
All in d1
|
All columns in d1 and d2
|
right_join(d1, d2, by)
|
All in d2
|
All columns in d1 and d2
|
full_join(d1, d2, by)
|
All in d1 or d2
|
All columns in d1 and d2
|
semi_join(d1, d2, by)
|
Only in d1 with matching keys in d2
|
Only columns in d1
|
anti_join(d1, d2, by)
|
Only in d1 with no matching keys in d2
|
Only columns in d1
|