Let’s start with the first task outlined at the beginning of this chapter. Suppose we want to find the houses that were sold in Jan 2009. You can use the function
filter() in the dplyr package, which is part of the tidyverse. If you haven’t installed the tidyverse, you need to install it first. Let’s first load the dplyr package.
After loading the package dplyr, you can see the following message
The following objects are masked from ‘package:stats’: filter, lag
The message appears because dplyr contains the functions
filter() and lag(), which are already defined and preloaded in the R package stats. As a result, the original functions are masked by the new definitions in dplyr.
When the same function name is shared by multiple packages, we can prefix the function name with the package name and a double colon (
::). For example,
stats::filter() represents the
filter() function in the stats package, while
dplyr::filter() represents the
filter() function in the dplyr package. You can also look at their documentation.
It is helpful to verify which version of
filter() you are using by typing the function name without parentheses.
Usually, R will use the function in the package that is loaded at a later time. To verify the search path, you can use the
search() function. R will show a list of attached packages and R objects.
search()
#>  [1] ".GlobalEnv"         "package:haven"      "package:readxl"
#>  [4] "package:kableExtra" "package:forcats"    "package:stringr"
#>  [7] "package:purrr"      "package:readr"      "package:tidyr"
#> [10] "package:tidyverse"  "package:ggplot2"    "package:tibble"
#> [13] "package:r02pro"     "package:dplyr"      "package:stats"
#> [16] "package:graphics"   "package:grDevices"  "package:utils"
#> [19] "package:datasets"   "package:methods"    "Autoloads"
#> [22] "package:base"
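Another way to confirm which definition a bare filter() resolves to is to ask for the environment in which the function was defined. The following is a minimal sketch, assuming dplyr is installed; environment(), environmentName(), and identical() are base R functions, and the names which_pkg and same_fn are only illustrative.

```r
# Explicit prefixes always pick a specific package, regardless of load order.
environmentName(environment(stats::filter))
environmentName(environment(dplyr::filter))

# After attaching dplyr, the bare name filter resolves to the dplyr version.
library(dplyr)
which_pkg <- environmentName(environment(filter))
which_pkg

# The bare name and the prefixed name refer to the same function object.
same_fn <- identical(filter, dplyr::filter)
same_fn
```

If a package that also defines filter() were attached after dplyr, the bare name would resolve to that package instead, which is why the explicit dplyr:: prefix is the safer choice in scripts.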
Now, let’s introduce how to use
filter() to get the subset of
ahp which consists of houses that are sold in Jan 2009. To use the
filter() function, you put the dataset in the first argument, and put the logical statements as individual arguments after that. We know the
ahp dataset has the year information in
yr_sold and the month information in
mo_sold. Jan 2009 corresponds to
yr_sold == 2009 and
mo_sold == 1.
library(r02pro)
filter(ahp, yr_sold == 2009, mo_sold == 1)
#> # A tibble: 10 × 56
#>    dt_sold    yr_sold mo_sold yr_built yr_remodel bldg_class bldg_type
#>    <date>       <dbl>   <dbl>    <dbl>      <dbl>      <dbl> <chr>
#>  1 2009-01-07    2009       1     1979       1998         20 1Fam
#>  2 2009-01-09    2009       1     1920       1950         30 1Fam
#>  3 2009-01-04    2009       1     1958       1958         20 1Fam
#>  4 2009-01-17    2009       1     2004       2004         20 1Fam
#>  5 2009-01-16    2009       1       NA       2007         20 1Fam
#>  6 2009-01-18    2009       1     2008       2008         20 1Fam
#>  7 2009-01-07    2009       1     2008       2009         60 1Fam
#>  8 2009-01-28    2009       1     2004       2004         60 1Fam
#>  9 2009-01-12    2009       1     1926       2004         45 1Fam
#> 10 2009-01-07    2009       1     2004       2005         20 1Fam
#> # … with 49 more variables: house_style <chr>, zoning <chr>, neighborhd <chr>,
#> #   oa_cond <dbl>, oa_qual <dbl>, func <chr>, liv_area <dbl>, `1fl_area` <dbl>,
#> #   `2fl_area` <dbl>, tot_rms <dbl>, bedroom <dbl>, bathroom <dbl>, kit <dbl>,
#> #   kit_qual <chr>, central_air <chr>, elect <chr>, bsmt_area <dbl>,
#> #   bsmt_cond <chr>, bsmt_exp <chr>, bsmt_fin_qual <chr>, bsmt_ht <chr>,
#> #   ext_cond <chr>, ext_cover <chr>, ext_qual <chr>, fdn <chr>, fence <chr>,
#> #   fp <dbl>, fp_qual <chr>, gar_area <dbl>, gar_car <dbl>, gar_cond <chr>, …
Inside the filter() function, each logical statement is evaluated, producing a logical vector with one element per observation. Then, only the observations that have
TRUE values in all logical vectors are kept.
It is helpful to learn the mechanism of
filter() by reproducing the results using what we learned on data frame subsetting in Section @ref(subset_df).
ahp[ahp$yr_sold == 2009 & ahp$mo_sold == 1, ]
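To make the intermediate logical vectors visible, here is a small sketch on a toy data frame. The data frame sales and its values are hypothetical and are not taken from ahp; only the column names yr_sold and mo_sold mirror the real dataset. The sketch assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical toy data, not the real ahp dataset.
sales <- data.frame(
  id      = 1:5,
  yr_sold = c(2009, 2009, 2008, 2009, 2010),
  mo_sold = c(1, 4, 1, 1, 1)
)

# Each logical statement gives one logical value per observation.
cond_year  <- sales$yr_sold == 2009   # TRUE  TRUE  FALSE TRUE  FALSE
cond_month <- sales$mo_sold == 1      # TRUE  FALSE TRUE  TRUE  TRUE

# An observation is kept only if every condition is TRUE.
keep <- cond_year & cond_month        # TRUE  FALSE FALSE TRUE  FALSE

manual   <- sales[keep, ]
filtered <- filter(sales, yr_sold == 2009, mo_sold == 1)

# Both approaches keep the observations with id 1 and 4.
manual$id
filtered$id
```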
Although we got the same answer, we hope you agree with us that the
filter() function provides simpler and more intuitive code than raw data frame subsetting. For example, the tibble name
ahp appears three times in the data frame subsetting, while it appears only once in the filter() call.
This is an example of the power of creating new R functions and R packages. They usually enable us to do tasks that couldn’t be done using the existing functions in base R, or make coding easier than just using the existing functions. Recall that the same thing happened in visualization, where we compared the visualization functions in base R with the
ggplot() function. No matter how complicated the figure we want to create is, we only need to put the data set name once in the
ggplot() function if all the layers are using the same data set.
It is worth noting that the
filter() function only returns the observations for which all conditions are
TRUE; observations for which a condition evaluates to FALSE or NA are not included. After applying the
filter() function, the original tibble is unchanged, which is a feature of many functions we will learn in this chapter. To save the filtered tibble, you can either assign the result to the original tibble name, which will overwrite it, or assign it to a new name, which will create a new tibble. Let’s save all houses that were sold in Apr 2009 in a new tibble.
apr09 <- filter(ahp, yr_sold == 2009, mo_sold == 4)
apr09
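The two behaviors mentioned above, dropping observations whose condition is NA and leaving the original data untouched, can be checked on a small example. This is only a sketch on hypothetical toy data (the data frame sales is not part of r02pro), assuming dplyr is installed. It also shows one place where filter() and base subsetting differ: with `[`, an NA in the logical index produces a row of NA values instead of dropping the observation.

```r
library(dplyr)

# Hypothetical toy data with one missing sale year.
sales <- data.frame(
  id      = 1:4,
  yr_sold = c(2009, NA, 2008, 2009)
)

# filter() keeps only rows where the condition is TRUE;
# the row whose condition is NA is dropped.
kept <- filter(sales, yr_sold == 2009)
nrow(kept)

# Base subsetting turns the NA condition into an extra row of NA values.
base_kept <- sales[sales$yr_sold == 2009, ]
nrow(base_kept)

# The original data frame still has all four rows,
# because the result of filter() was assigned to a new name.
nrow(sales)
```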
In addition to using separate logical statements, you can also have logical operations between multiple logical vectors inside each statement. This makes the
filter() function very flexible in expressing different kinds of filtering operations. Let’s say we want to find all houses that were sold in 2007, remodeled in 2006, and have a house style of either “1Story” or “2Story”.
filter(ahp,
       dt_sold >= "2007-01-01",
       dt_sold < "2008-01-01",
       yr_remodel == 2006,
       house_style == "1Story" | house_style == "2Story")
From the above example, you can see that a variable can be used multiple times in different logical statements.
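Since separate statements are combined with a logical AND, the same query can also be written as a single statement joined by `&`. The sketch below illustrates this on hypothetical toy data (the data frame houses is made up and only borrows column names from ahp), assuming dplyr is installed. Note the parentheses around the `|` expression: in R, `&` binds more tightly than `|`, so they are needed in the single-statement form.

```r
library(dplyr)

# Hypothetical toy data, not the real ahp dataset.
houses <- data.frame(
  id          = 1:5,
  dt_sold     = as.Date(c("2007-03-15", "2007-11-02", "2008-01-10",
                          "2007-06-30", "2006-12-31")),
  yr_remodel  = c(2006, 2006, 2006, 2005, 2006),
  house_style = c("1Story", "SLvl", "2Story", "2Story", "1Story")
)

# Version 1: one logical statement per argument.
by_commas <- filter(
  houses,
  dt_sold >= as.Date("2007-01-01"),
  dt_sold < as.Date("2008-01-01"),
  yr_remodel == 2006,
  house_style == "1Story" | house_style == "2Story"
)

# Version 2: a single statement, with conditions joined by &.
by_and <- filter(
  houses,
  dt_sold >= as.Date("2007-01-01") &
    dt_sold < as.Date("2008-01-01") &
    yr_remodel == 2006 &
    (house_style == "1Story" | house_style == "2Story")
)

# Both versions keep the same observation.
by_commas$id
by_and$id
```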