6.8 Joins

So far, we have been working on a single data set. In applications, however, we often have the need to combine multiple data sets into a single one before doing data analysis. Let’s prepare two data sets to show the various ways we can join two data sets.

library(tidyverse)
library(r02pro)
d1 <- ahp %>% 
  select(dt_sold, bsmt_area, bsmt_ht) %>%
  head(n = 5)
d2 <- tibble(bsmt_ht = c("Excellent", "Good", "Average", "Poor"), height = c("100+ inches", "90-99 inches", "80-89 inches", "<70 inches"))
d1
d2

From the results, we know d1 is a tibble of 5 observations and 3 variables including dt_sold, bsmt_area, and bsmt_ht. d2 is another tibble of 4 observations and 2 variales including bsmt_ht and height.

It is clear that the two tibbles share a variable named bsmt_ht. It could be interesting to join the two tibbles into a single one by including all variables while matching the values of bsmt_ht. The bridging variable bsmt_ht is usually called the key to the joining operation.

Next, we will discuss several different join operations between d1 and d2, with the key being bsmt_ht.

6.8.1 Inner Join

The first type of join operation on two tables (d1 and d2) with a set of keys is called inner join. You can use the inner_join() function, which returns all rows in d1 such that there are matching values of all keys in d2.

inner <- inner_join(d1, d2, by = "bsmt_ht")
inner
#> # A tibble: 4 × 4
#>   dt_sold    bsmt_area bsmt_ht height      
#>   <date>         <dbl> <chr>   <chr>       
#> 1 2010-03-25       725 Average 80-89 inches
#> 2 2009-04-10       913 Good    90-99 inches
#> 3 2010-01-15      1057 Average 80-89 inches
#> 4 2010-04-19       384 Good    90-99 inches

From the result, you can see that there are 4 observations. Each observation corresponds to one row in d1 that has a matching value in d2 for bsmt_ht. The 5th observation in d1 is dropped in the result since its bsmt_ht value is "Fair", which doesn’t have a matched observation in d2.

6.8.2 Left Join

The second type of join operation on two tables (d1 and d2) with a set of keys is called left join. You can use the left_join() function, which returns all rows in d1, no matter whether there are matching values of all keys in d2 or not.

left <- left_join(d1, d2, by = "bsmt_ht")
left
#> # A tibble: 5 × 4
#>   dt_sold    bsmt_area bsmt_ht height      
#>   <date>         <dbl> <chr>   <chr>       
#> 1 2010-03-25       725 Average 80-89 inches
#> 2 2009-04-10       913 Good    90-99 inches
#> 3 2010-01-15      1057 Average 80-89 inches
#> 4 2010-04-19       384 Good    90-99 inches
#> 5 2010-03-22       676 Fair    <NA>

As you can see from the result, all rows in d1 is kept, even though the last row doesn’t have a matching observation in d2 for the variable bsmt_ht. In this case, the corresponding value for height is NA.

6.8.3 Right Join

The third type of join operation on two tables (d1 and d2) with a set of keys is called right join, which works in a similar fashion as left join. You can use the right_join() function, which returns all rows in d2, no matter whether there are matching values of all keys in d1 or not.

right <- right_join(d1, d2, by = "bsmt_ht")
right
#> # A tibble: 6 × 4
#>   dt_sold    bsmt_area bsmt_ht   height      
#>   <date>         <dbl> <chr>     <chr>       
#> 1 2010-03-25       725 Average   80-89 inches
#> 2 2009-04-10       913 Good      90-99 inches
#> 3 2010-01-15      1057 Average   80-89 inches
#> 4 2010-04-19       384 Good      90-99 inches
#> 5 NA                NA Excellent 100+ inches 
#> 6 NA                NA Poor      <70 inches

As you can see from the result, all rows in d2 is kept, even though the last two rows don’t have a matching observation in d1 for the variable bsmt_ht. In this case, the corresponding values for dt_sold and bsmt_area are NA.

6.8.4 Full Join

The fourth type of join operation on two tables (d1 and d2) with a set of keys is called full join. You can use the full_join() function, which returns all rows in both d1 and d2, no matter whether there are matching values of all keys in either d1 or d2.

full <- full_join(d1, d2, by = "bsmt_ht")
full
#> # A tibble: 7 × 4
#>   dt_sold    bsmt_area bsmt_ht   height      
#>   <date>         <dbl> <chr>     <chr>       
#> 1 2010-03-25       725 Average   80-89 inches
#> 2 2009-04-10       913 Good      90-99 inches
#> 3 2010-01-15      1057 Average   80-89 inches
#> 4 2010-04-19       384 Good      90-99 inches
#> 5 2010-03-22       676 Fair      <NA>        
#> 6 NA                NA Excellent 100+ inches 
#> 7 NA                NA Poor      <70 inches

As you can tell from the result, it includes all observations from both d1 and d2. You can view it as the union of left_join() and right_join().

6.8.5 Filtering Joins

For the previous four types of joins, the resulting table contains columns from both input tables. Sometimes, you may want to only keep the columns from one of the input tables. This is called filtering joins, meaning that it is actually doing the filtering operations like the filter() function we learned in Section 6.1.

a. Semi Join

The first type of filtering joins is called semi join, which returns all rows from d1 where there are matching values in d2 on the keys.

semi <- semi_join(d1, d2, by = "bsmt_ht")
semi
#> # A tibble: 4 × 3
#>   dt_sold    bsmt_area bsmt_ht
#>   <date>         <dbl> <chr>  
#> 1 2010-03-25       725 Average
#> 2 2009-04-10       913 Good   
#> 3 2010-01-15      1057 Average
#> 4 2010-04-19       384 Good

As expected, the output is the first four rows of d1 where there are matching values of bsmt_area in d2. Note that only the columns of d1 are kept in the output, making it a filtering operation.

b. Anti Join

Another type of filtering joins is called anti join, which returns all rows from d1 where there don’t exist matching values in d2 on the keys.

anti <- anti_join(d1, d2, by = "bsmt_ht")
anti
#> # A tibble: 1 × 3
#>   dt_sold    bsmt_area bsmt_ht
#>   <date>         <dbl> <chr>  
#> 1 2010-03-22       676 Fair

The output only contains rows of d1 where there are no matching values of bsmt_area in d2, which contains only the last row of d1. Again, only the columns of d1 are kept in the output, making it another filtering operation.

6.8.6 A Summary of All Joins

Finally, we summarize all the six joins in the following table, in terms of which rows and columns are selected in the final output.

Code Rows Columns
inner_join(d1, d2, by) All in d1 and d2 All columns in d1 and d2
left_join(d1, d2, by) All in d1 All columns in d1 and d2
right_join(d1, d2, by) All in d2 All columns in d1 and d2
full_join(d1, d2, by) All in d1 or d2 All columns in d1 and d2
semi_join(d1, d2, by) Only in d1 with matching keys in d2 Only columns in d1
anti_join(d1, d2, by) Only in d1 with no matching keys in d2 Only columns in d1