6.1 Filter Observations and Objects Masking
Let’s start with the first task outlined at the beginning of this chapter. Suppose we want to find the houses that were sold in Jan 2009. You can use the function filter() in the dplyr package, a member of the tidyverse. If you haven’t installed the tidyverse package, you need to install it first. Let’s load the dplyr package.
library(dplyr)
6.1.1 Objects Masking
After loading the package dplyr, you can see the following message:
The following objects are masked from ‘package:stats’:
filter, lag
The message appears because dplyr contains the functions filter() and lag(), which are already defined in the R package stats, a package that R attaches at startup. As a result, the original functions are masked by the new definitions in dplyr.
When the same function name is shared by multiple packages, we can add the package name as a prefix to the function name with a double colon (::). For example, stats::filter() refers to the filter() function in the stats package, while dplyr::filter() refers to the filter() function in the dplyr package. You can also look at their documentation.
?stats::filter
?dplyr::filter
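As a minimal sketch of the double-colon prefix in action, the two functions can be called side by side. The toy data frame and its values are made up for illustration and are not part of the ahp data.

```r
# A small toy data frame (hypothetical values, not the ahp data).
toy <- data.frame(id = 1:5, price = c(10, 20, 30, 40, 50))

# dplyr::filter() keeps the rows that satisfy the condition.
# The :: prefix works whether or not dplyr has been attached.
kept <- dplyr::filter(toy, price > 25)
kept

# stats::filter() does something entirely different: it applies a
# linear filter to a numeric series. Here it is a centered moving
# average over a window of 3 values.
smoothed <- stats::filter(toy$price, rep(1 / 3, 3))
smoothed
```

The two calls share nothing but a name, which is why the prefix is useful whenever there is any doubt about which function will be called.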
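It is helpful to verify which version of filter() you are using by typing the function name filter without parentheses. The last line of the printed output names the namespace the function comes from.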
filter
Usually, R will use the function from the package that was loaded most recently. To verify the search path, you can use the search() function. R will show a list of attached packages and R objects, in the order in which they are searched.
search()
#> [1] ".GlobalEnv" "package:haven" "package:readxl"
#> [4] "package:kableExtra" "package:forcats" "package:stringr"
#> [7] "package:purrr" "package:readr" "package:tidyr"
#> [10] "package:tidyverse" "package:ggplot2" "package:tibble"
#> [13] "package:r02pro" "package:dplyr" "package:stats"
#> [16] "package:graphics" "package:grDevices" "package:utils"
#> [19] "package:datasets" "package:methods" "Autoloads"
#> [22] "package:base"
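The search path can also be queried for a specific name. The sketch below assumes only that dplyr has been attached in a fresh R session; find() comes from the utils package, and environment() and environmentName() come from base R.

```r
library(dplyr)

# All attached packages that define an object named "filter",
# listed in search-path order. The first entry is the one R uses
# when the name is typed without a prefix.
where_defined <- find("filter")
where_defined

# The namespace that the unqualified name currently resolves to.
resolved <- environmentName(environment(filter))
resolved
```

In a session where dplyr was attached after stats, the first entry should be package:dplyr, which matches the masking message shown earlier.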
6.1.2 Filter Observations
Now, let’s introduce how to use filter() to get the subset of ahp that consists of houses sold in Jan 2009. To use the filter() function, you put the dataset in the first argument and the logical statements as individual arguments after that. We know the ahp dataset has the year information in yr_sold and the month information in mo_sold. Jan 2009 therefore corresponds to yr_sold == 2009 and mo_sold == 1.
library(r02pro)
filter(ahp, yr_sold == 2009, mo_sold == 1)
#> # A tibble: 10 × 56
#> dt_sold yr_sold mo_sold yr_built yr_remodel bldg_class bldg_type
#> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 2009-01-07 2009 1 1979 1998 20 1Fam
#> 2 2009-01-09 2009 1 1920 1950 30 1Fam
#> 3 2009-01-04 2009 1 1958 1958 20 1Fam
#> 4 2009-01-17 2009 1 2004 2004 20 1Fam
#> 5 2009-01-16 2009 1 NA 2007 20 1Fam
#> 6 2009-01-18 2009 1 2008 2008 20 1Fam
#> 7 2009-01-07 2009 1 2008 2009 60 1Fam
#> 8 2009-01-28 2009 1 2004 2004 60 1Fam
#> 9 2009-01-12 2009 1 1926 2004 45 1Fam
#> 10 2009-01-07 2009 1 2004 2005 20 1Fam
#> # … with 49 more variables: house_style <chr>, zoning <chr>, neighborhd <chr>,
#> # oa_cond <dbl>, oa_qual <dbl>, func <chr>, liv_area <dbl>, `1fl_area` <dbl>,
#> # `2fl_area` <dbl>, tot_rms <dbl>, bedroom <dbl>, bathroom <dbl>, kit <dbl>,
#> # kit_qual <chr>, central_air <chr>, elect <chr>, bsmt_area <dbl>,
#> # bsmt_cond <chr>, bsmt_exp <chr>, bsmt_fin_qual <chr>, bsmt_ht <chr>,
#> # ext_cond <chr>, ext_cover <chr>, ext_qual <chr>, fdn <chr>, fence <chr>,
#> # fp <dbl>, fp_qual <chr>, gar_area <dbl>, gar_car <dbl>, gar_cond <chr>, …
In the filter() function, each logical statement is evaluated, which produces a logical vector with one element per observation. Then, only the observations that have TRUE values in all logical vectors are kept.
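To make these logical vectors visible, here is a small sketch on a toy data frame. The data frame sales and its values are hypothetical stand-ins for ahp, chosen so that the vectors are short enough to read.

```r
library(dplyr)

# Toy data (hypothetical values standing in for ahp).
sales <- data.frame(
  yr_sold = c(2009, 2009, 2008, 2009),
  mo_sold = c(1, 4, 1, 1)
)

# Each condition yields one logical value per row.
cond_year  <- sales$yr_sold == 2009   # TRUE  TRUE FALSE  TRUE
cond_month <- sales$mo_sold == 1      # TRUE FALSE  TRUE  TRUE

# A row is kept only if every condition is TRUE for it.
keep <- cond_year & cond_month        # TRUE FALSE FALSE  TRUE

# Keeping rows by hand and with filter() selects the same observations.
by_hand   <- sales[keep, ]
by_filter <- filter(sales, yr_sold == 2009, mo_sold == 1)
by_filter
```

Separating the conditions with commas inside filter() therefore plays the same role as joining them with & in the hand-written version.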
It is helpful to learn the mechanism of filter() by reproducing the results using what we learned on data frame subsetting in Section @ref(subset_df).
ahp[ahp$yr_sold == 2009 & ahp$mo_sold == 1, ]
Although we got the same answer, we hope you agree with us that the filter() function provides simpler and more intuitive code than raw data frame subsetting. For example, the tibble name ahp appears three times in the data frame subsetting, while it appears only once in the filter() call.
This is an example of the power of creating new R functions and R packages. They usually enable us to do tasks that couldn’t be done using the existing functions in base R, or make coding easier than it would be with the existing functions alone. Recall that the same thing happened in visualization, where we compared the visualization functions in base R with the ggplot() function. No matter how complicated the figure we want to create is, we only need to put the data set name once in the ggplot() function if all the layers use the same data set.
It is worth noting that the filter() function only returns the observations for which all conditions are TRUE. Observations for which a condition evaluates to NA are dropped, just like those for which it is FALSE.
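The treatment of NA is one place where filter() and base subsetting differ. The following sketch uses a hypothetical toy data frame with one missing yr_built value to show the difference, and one way to keep the missing rows when they are wanted.

```r
library(dplyr)

# Toy data with one missing value (hypothetical, not the ahp data).
toy <- data.frame(id = 1:4, yr_built = c(2004, NA, 1990, 2008))

# The comparison is NA wherever yr_built is missing.
toy$yr_built > 2000            # TRUE    NA FALSE  TRUE

# filter() drops the NA row along with the FALSE row.
with_filter <- filter(toy, yr_built > 2000)

# Base subsetting instead returns a row filled with NA for the NA condition.
with_bracket <- toy[toy$yr_built > 2000, ]

# To keep the rows with missing values, ask for them explicitly.
with_na <- filter(toy, yr_built > 2000 | is.na(yr_built))
```

Here with_filter has two rows, with_bracket has three (one of them all NA), and with_na keeps the house whose construction year is unknown.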
The filter() function leaves the original tibble unchanged, which is a feature of many functions we will learn in this chapter. To save the filtered tibble, you can either assign the result to the original tibble name, which will overwrite it, or assign the result to a new name, which will create a new tibble. Let’s save all houses that were sold in Apr 2009 in a new tibble.
apr09 <- filter(ahp, yr_sold == 2009, mo_sold == 4)
apr09
In addition to using separate logical statements, you can also use logical operations between multiple logical vectors inside each statement. This makes the filter() function very flexible in expressing different kinds of filtering operations. Let’s say we want to find all houses that were sold in 2007, remodeled in 2006, and with house style either “1Story” or “2Story”.
filter(ahp, dt_sold >= "2007-01-01", dt_sold < "2008-01-01", yr_remodel == 2006, house_style == "1Story" | house_style == "2Story")
From the above example, you can see that a variable can be used multiple times in different logical statements.
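When a single variable is compared against several allowed values, the chain of | comparisons can also be written with the base R operator %in%. The sketch below uses a hypothetical toy data frame to check that both forms select the same rows; it is an alternative spelling, not a different mechanism.

```r
library(dplyr)

# Toy data (hypothetical values, not the ahp data).
toy <- data.frame(
  yr_remodel  = c(2006, 2006, 2006, 2005),
  house_style = c("1Story", "2Story", "SLvl", "1Story")
)

# Two comparisons joined by | inside one statement.
with_or <- filter(toy, yr_remodel == 2006,
                  house_style == "1Story" | house_style == "2Story")

# The same condition written with %in%.
with_in <- filter(toy, yr_remodel == 2006,
                  house_style %in% c("1Story", "2Story"))
```

Both calls keep the first two rows. The %in% form tends to stay readable as the list of allowed values grows.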