7.8 Set Operations

In Section 7.7, you learned how to join two data sets via a shared key. In some applications, you may want to perform set operations on two data frames or tibbles. Set operations allow you to compare two data frames and return results based on whether rows exist in both, one, or neither of the data frames. Here, we’ll explore key set operations in the dplyr package: intersect(), union(), union_all(), setdiff(), symdiff(), and setequal().

Let’s first load the dplyr package and create two tibbles to use throughout this section.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
df1 <- tibble(id = c(1, 2, 3), value = c("A", "B", "C"))
df2 <- tibble(id = c(2, 3, 4), value = c("B", "C", "D"))

7.8.1 intersect(x, y)

The intersect(x, y) function finds all rows that appear in both x and y. This is useful for identifying the common elements between two data frames.

# Intersection of df1 and df2
intersect(df1, df2)
## # A tibble: 2 × 2
##      id value
##   <dbl> <chr>
## 1     2 B    
## 2     3 C

In this example, intersect(df1, df2) returns the rows that are common to both df1 and df2.

7.8.2 union(x, y)

The union(x, y) function returns all unique rows from x and y, excluding duplicates. This operation is similar to combining two sets in mathematics.

# Example of union()
union(df1, df2)
## # A tibble: 4 × 2
##      id value
##   <dbl> <chr>
## 1     1 A    
## 2     2 B    
## 3     3 C    
## 4     4 D

The result includes all rows from both data frames, without any duplicates.

7.8.3 union_all(x, y)

Unlike union(), the union_all(x, y) function includes all rows from both x and y, including duplicates. This can be useful if you want to keep repeated values.

# Example of union_all()
union_all(df1, df2)
## # A tibble: 6 × 2
##      id value
##   <dbl> <chr>
## 1     1 A    
## 2     2 B    
## 3     3 C    
## 4     2 B    
## 5     3 C    
## 6     4 D

The output includes all rows from both df1 and df2, and duplicate rows are preserved.

7.8.4 setdiff(x, y)

The setdiff(x, y) function finds all rows that are in x but not in y. This is useful for identifying elements that are unique to x.

# Example of setdiff()
setdiff(df1, df2)
## # A tibble: 1 × 2
##      id value
##   <dbl> <chr>
## 1     1 A

This returns rows from df1 that do not appear in df2.

7.8.5 symdiff(x, y)

The symdiff(x, y) function computes the symmetric difference, which includes all rows that are in x but not in y and all rows that are in y but not in x. Currently, dplyr does not have a built-in symdiff function, but we can achieve this by combining setdiff() and union():

# Example of symmetric difference using setdiff() and union()
symdiff <- function(x, y) {
  dplyr::union(setdiff(x, y), setdiff(y, x))
}
symdiff(df1, df2)
## # A tibble: 2 × 2
##      id value
##   <dbl> <chr>
## 1     1 A    
## 2     4 D

This custom function returns rows that are unique to each data frame.

7.8.6 setequal(x, y)

The setequal(x, y) function checks if x and y contain the same rows, ignoring the order of the rows. It returns TRUE if they are equivalent and FALSE otherwise.

# Example of setequal()
setequal(df1, df2)
## [1] FALSE

This will return FALSE in this example because df1 and df2 do not contain the same rows.

7.8.7 Summary

Set operations in dplyr are a powerful way to compare data frames. They allow you to extract or compare unique and overlapping rows, enabling deeper insights and control over your data analysis.