7.8 Set Operations
In Section 7.7, you learned how to join two data sets via a shared key. In some applications, you may want to perform set operations on two data frames or tibbles. Set operations allow you to compare two data frames and return results based on whether rows exist in both, one, or neither of the data frames. Here, we’ll explore key set operations in the dplyr
package: intersect()
, union()
, union_all()
, setdiff()
, symdiff()
, and setequal()
.
Let’s first load the dplyr package and create two tibbles to use throughout this section.
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
df1 <- tibble(id = c(1, 2, 3), value = c("A", "B", "C"))
df2 <- tibble(id = c(2, 3, 4), value = c("B", "C", "D"))
7.8.1 intersect(x, y)
The intersect(x, y)
function finds all rows that appear in both x
and y
. This is useful for identifying the common elements between two data frames.
## # A tibble: 2 × 2
## id value
## <dbl> <chr>
## 1 2 B
## 2 3 C
In this example, intersect(df1, df2)
returns the rows that are common to both df1
and df2
.
7.8.2 union(x, y)
The union(x, y)
function returns all unique rows from x
and y
, excluding duplicates. This operation is similar to combining two sets in mathematics.
## # A tibble: 4 × 2
## id value
## <dbl> <chr>
## 1 1 A
## 2 2 B
## 3 3 C
## 4 4 D
The result includes all rows from both data frames, without any duplicates.
7.8.3 union_all(x, y)
Unlike union()
, the union_all(x, y)
function includes all rows from both x
and y
, including duplicates. This can be useful if you want to keep repeated values.
## # A tibble: 6 × 2
## id value
## <dbl> <chr>
## 1 1 A
## 2 2 B
## 3 3 C
## 4 2 B
## 5 3 C
## 6 4 D
The output includes all rows from both df1
and df2
, and duplicate rows are preserved.
7.8.4 setdiff(x, y)
The setdiff(x, y)
function finds all rows that are in x
but not in y
. This is useful for identifying elements that are unique to x
.
## # A tibble: 1 × 2
## id value
## <dbl> <chr>
## 1 1 A
This returns rows from df1
that do not appear in df2
.
7.8.5 symdiff(x, y)
The symdiff(x, y)
function computes the symmetric difference, which includes all rows that are in x
but not in y
and all rows that are in y
but not in x
. Currently, dplyr
does not have a built-in symdiff
function, but we can achieve this by combining setdiff()
and union()
:
# Example of symmetric difference using setdiff() and union()
symdiff <- function(x, y) {
dplyr::union(setdiff(x, y), setdiff(y, x))
}
symdiff(df1, df2)
## # A tibble: 2 × 2
## id value
## <dbl> <chr>
## 1 1 A
## 2 4 D
This custom function returns rows that are unique to each data frame.