7.8 Set Operations on Datasets
In Section 7.7, you learned how to join two data sets via a shared key. And in Section 2.14, you learned how to conduct set operations on vectors. In some applications, you may want to perform set operations on two data frames or tibbles. Set operations allow you to compare two data frames and return results based on whether rows exist in both, one, or neither of the data frames. Here, we’ll explore key set operations in the dplyr package: intersect(), union(), union_all(), setdiff(), symdiff(), and setequal().
Let’s first load the dplyr package and create two tibbles to use throughout this section.
library(dplyr)
df1 <- tibble(id = c(1, 2, 3), value = c("A", "B", "C"))
df2 <- tibble(id = c(2, 3, 4), value = c("B", "C", "D"))After loading the dplyr package, you can see a warning message “The following objects are masked from ‘package:base’: intersect, setdiff, setequal, union”. Indeed, we have the same functions in base R and it is important to distinguish them from the ones in the dplyr package. To ensure you are using the desired version, you can use dplyr::intersect() and base::intersect(), respectively.
7.8.1 intersect(x, y)
The intersect(x, y) function finds all rows that appear in both x and y. This is useful for identifying the common elements between two data frames.
# Intersection of df1 and df2
intersect(df1, df2)
#> # A tibble: 2 × 2
#> id value
#> <dbl> <chr>
#> 1 2 B
#> 2 3 CIn this example, intersect(df1, df2) returns the rows that are common to both df1 and df2.
7.8.2 union(x, y)
The union(x, y) function returns all unique rows from x and y, excluding duplicates. This operation is similar to combining two sets in mathematics.
# Example of union()
union(df1, df2)
#> # A tibble: 4 × 2
#> id value
#> <dbl> <chr>
#> 1 1 A
#> 2 2 B
#> 3 3 C
#> 4 4 DThe result includes all rows from both data frames, without any duplicates.
7.8.3 union_all(x, y)
Unlike union(), the union_all(x, y) function includes all rows from both x and y, including duplicates. This can be useful if you want to keep repeated values.
# Example of union_all()
union_all(df1, df2)
#> # A tibble: 6 × 2
#> id value
#> <dbl> <chr>
#> 1 1 A
#> 2 2 B
#> 3 3 C
#> 4 2 B
#> 5 3 C
#> 6 4 DThe output includes all rows from both df1 and df2, and duplicate rows are preserved.
7.8.4 setdiff(x, y)
The setdiff(x, y) function finds all rows that are in x but not in y. This is useful for identifying elements that are unique to x.
This returns rows from df1 that do not appear in df2.
7.8.5 symdiff(x, y)
The symdiff(x, y) function computes the symmetric difference, which includes all rows that are in x but not in y and all rows that are in y but not in x.
# Example of symmetric difference
symdiff(df1, df2)
#> # A tibble: 2 × 2
#> id value
#> <dbl> <chr>
#> 1 1 A
#> 2 4 DThis custom function returns rows that are unique to each data frame.
7.8.6 setequal(x, y)
The setequal(x, y) function checks if x and y contain the same rows, ignoring the order of the rows. It returns TRUE if they are equivalent and FALSE otherwise.
This will return FALSE in this example because df1 and df2 do not contain the same rows.
7.8.7 Summary
Set operations in dplyr are a powerful way to compare data frames. They allow you to extract or compare unique and overlapping rows, enabling deeper insights and control over your data analysis.
7.8.8 Exercises
Create the following two tibbles.
df_a <- tibble(id = c(1, 2, 3), value = c("X", "Y", "Z"))
df_b <- tibble(id = c(2, 3, 4), value = c("Y", "Z", "W"))Intersection Exercise: Use
intersect()to find common rows ofdf_aanddf_b.Union Exercise: Find the union of
df_aanddf_b. Compare the results ofunion()andunion_all().Difference Exercise: Use
setdiff()to find rows that are indf_abut not indf_b. What about rows that are indf_bbut not indf_a?Symmetric Difference Exercise: Compute the symmetric difference of
df_aanddf_bwithout usingsymdiff()function. Verify that your result is the same as the one from thesymdiff()function.Set Equality Exercise: Use
setequal()to verify if two data frames with reordered rows are still considered equal: create two data frames with the same rows but in a different order and check if they are equal usingsetequal().