7.8 Set Operations on Datasets
In Section 7.7, you learned how to join two data sets via a shared key. And in Section 2.14, you learned how to conduct set operations on vectors. In some applications, you may want to perform set operations on two data frames or tibbles. Set operations allow you to compare two data frames and return results based on whether rows exist in both, one, or neither of the data frames. Here, we’ll explore key set operations in the dplyr
package: intersect()
, union()
, union_all()
, setdiff()
, symdiff()
, and setequal()
.
Let’s first load the dplyr package and create two tibbles to use throughout this section.
library(dplyr)
df1 <- tibble(id = c(1, 2, 3), value = c("A", "B", "C"))
df2 <- tibble(id = c(2, 3, 4), value = c("B", "C", "D"))
7.8.1 intersect(x, y)
The intersect(x, y)
function finds all rows that appear in both x
and y
. This is useful for identifying the common elements between two data frames.
# Intersection of df1 and df2
intersect(df1, df2)
#> # A tibble: 2 × 2
#> id value
#> <dbl> <chr>
#> 1 2 B
#> 2 3 C
In this example, intersect(df1, df2)
returns the rows that are common to both df1
and df2
.
7.8.2 union(x, y)
The union(x, y)
function returns all unique rows from x
and y
, excluding duplicates. This operation is similar to combining two sets in mathematics.
# Example of union()
union(df1, df2)
#> # A tibble: 4 × 2
#> id value
#> <dbl> <chr>
#> 1 1 A
#> 2 2 B
#> 3 3 C
#> 4 4 D
The result includes all rows from both data frames, without any duplicates.
7.8.3 union_all(x, y)
Unlike union()
, the union_all(x, y)
function includes all rows from both x
and y
, including duplicates. This can be useful if you want to keep repeated values.
# Example of union_all()
union_all(df1, df2)
#> # A tibble: 6 × 2
#> id value
#> <dbl> <chr>
#> 1 1 A
#> 2 2 B
#> 3 3 C
#> 4 2 B
#> 5 3 C
#> 6 4 D
The output includes all rows from both df1
and df2
, and duplicate rows are preserved.
7.8.4 setdiff(x, y)
The setdiff(x, y)
function finds all rows that are in x
but not in y
. This is useful for identifying elements that are unique to x
.
This returns rows from df1
that do not appear in df2
.
7.8.5 symdiff(x, y)
The symdiff(x, y)
function computes the symmetric difference, which includes all rows that are in x
but not in y
and all rows that are in y
but not in x
.
# Example of symmetric difference
symdiff(df1, df2)
#> # A tibble: 2 × 2
#> id value
#> <dbl> <chr>
#> 1 1 A
#> 2 4 D
This custom function returns rows that are unique to each data frame.
7.8.6 setequal(x, y)
The setequal(x, y)
function checks if x
and y
contain the same rows, ignoring the order of the rows. It returns TRUE
if they are equivalent and FALSE
otherwise.
This will return FALSE
in this example because df1
and df2
do not contain the same rows.
7.8.7 Summary
Set operations in dplyr
are a powerful way to compare data frames. They allow you to extract or compare unique and overlapping rows, enabling deeper insights and control over your data analysis.
7.8.8 Exercises
Create the following two tibbles.
df_a <- tibble(id = c(1, 2, 3), value = c("X", "Y", "Z"))
df_b <- tibble(id = c(2, 3, 4), value = c("Y", "Z", "W"))
Intersection Exercise: Use
intersect()
to find common rows ofdf_a
anddf_b
.Union Exercise: Find the union of
df_a
anddf_b
. Compare the results ofunion()
andunion_all()
.Difference Exercise: Use
setdiff()
to find rows that are indf_a
but not indf_b
. What about rows that are indf_b
but not indf_a
?Symmetric Difference Exercise: Compute the symmetric difference of
df_a
anddf_b
without usingsymdiff()
function. Verify that your result is the same as the one from thesymdiff()
function.Set Equality Exercise: Use
setequal()
to verify if two data frames with reordered rows are still considered equal: create two data frames with the same rows but in a different order and check if they are equal usingsetequal()
.