6.4 Select, Rename, and Reorder Variables
So far, we have only been manipulating observations, including filtering and arranging the observations. In many applications, there are some variables which may not be the focus of the research problem. As a result, learning how to select variables is very useful in data analysis.
6.4.1 Select Variables via Names
Let’s again consider the ahp
data set, which contains 56 variables in total. If we want to focus on the sold date, house style, living area, and sale price, we can use the select()
function in the dplyr package. We will be using the pipe introduced in Section 6.3 throughout the rest of this Chapter. Inside the select()
function, you can include the variables to select as different arguments. Note that the quotation marks are not necessary.
library(r02pro)
library(tidyverse)
%>% select(dt_sold, house_style, liv_area, sale_price)
ahp #> # A tibble: 2,048 × 4
#> dt_sold house_style liv_area sale_price
#> <date> <chr> <dbl> <dbl>
#> 1 2010-03-25 2Story 1479 130.
#> 2 2009-04-10 2Story 2122 NA
#> 3 2010-01-15 1Story 1057 109
#> 4 2010-04-19 2Story 1444 174
#> 5 2010-03-22 1.5Fin 1445 138.
#> 6 2010-06-06 1Story 1888 190
#> 7 2006-06-14 SFoyer 1072 140
#> 8 2010-05-08 1Story 1188 142
#> 9 2007-06-14 1Story 924 112.
#> 10 2007-09-01 2Story 2080 135
#> # … with 2,038 more rows
As you can see from the output, the select()
function returns another tibble with all the observations and the selected variables.
6.4.2 Rename Variables
In the select()
function, you can also rename the variables in the form of new_name = old_name
as the arguments. Let’s see the following example where we want to rename dt_sold
as sold_date
and liv_area
as living_area
.
%>% select(sold_date = dt_sold, living_area = liv_area, sale_price)
ahp #> # A tibble: 2,048 × 3
#> sold_date living_area sale_price
#> <date> <dbl> <dbl>
#> 1 2010-03-25 1479 130.
#> 2 2009-04-10 2122 NA
#> 3 2010-01-15 1057 109
#> 4 2010-04-19 1444 174
#> 5 2010-03-22 1445 138.
#> 6 2010-06-06 1888 190
#> 7 2006-06-14 1072 140
#> 8 2010-05-08 1188 142
#> 9 2007-06-14 924 112.
#> 10 2007-09-01 2080 135
#> # … with 2,038 more rows
Note that in this renaming operations, only the variables mentioned in the select()
function are kept. If you want to generate a tibble with some variables renamed while keeping all variables, you can use the rename()
function.
%>% rename(sold_date = dt_sold, living_area = liv_area)
ahp #> # A tibble: 2,048 × 56
#> sold_date yr_sold mo_sold yr_built yr_remodel bldg_class bldg_type
#> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 2010-03-25 2010 3 1976 2005 60 1Fam
#> 2 2009-04-10 2009 4 1996 1997 60 1Fam
#> 3 2010-01-15 2010 1 1953 2007 20 1Fam
#> 4 2010-04-19 2010 4 2006 2007 60 1Fam
#> 5 2010-03-22 2010 3 1900 1993 50 1Fam
#> 6 2010-06-06 2010 6 1966 2002 20 1Fam
#> 7 2006-06-14 2006 6 2005 2006 180 TwnhsE
#> 8 2010-05-08 2010 5 1959 1959 20 1Fam
#> 9 2007-06-14 2007 6 1952 1952 20 1Fam
#> 10 2007-09-01 2007 9 1969 1969 60 1Fam
#> # … with 2,038 more rows, and 49 more variables: house_style <chr>,
#> # zoning <chr>, neighborhd <chr>, oa_cond <dbl>, oa_qual <dbl>, func <chr>,
#> # living_area <dbl>, `1fl_area` <dbl>, `2fl_area` <dbl>, tot_rms <dbl>,
#> # bedroom <dbl>, bathroom <dbl>, kit <dbl>, kit_qual <chr>,
#> # central_air <chr>, elect <chr>, bsmt_area <dbl>, bsmt_cond <chr>,
#> # bsmt_exp <chr>, bsmt_fin_qual <chr>, bsmt_ht <chr>, ext_cond <chr>,
#> # ext_cover <chr>, ext_qual <chr>, fdn <chr>, fence <chr>, fp <dbl>, …
From this example, you can see the different behavior between select()
and rename()
when renaming variables.
6.4.3 Select Variables with Patterns
Sometimes, you may want to select many variables that follows a particular pattern.
a. select consecutive variables
The first type of variable patterns is consecutive variables in a tibble.
To get started, you can use select(1:m)
to get the first m
variables. To get the last column, you can use select(last_col())
. To get the last m
variables, you can use select(last_col(m-1):last_col())
.
%>% select(1:3) # the first 3 columns
ahp #> # A tibble: 2,048 × 3
#> dt_sold yr_sold mo_sold
#> <date> <dbl> <dbl>
#> 1 2010-03-25 2010 3
#> 2 2009-04-10 2009 4
#> 3 2010-01-15 2010 1
#> 4 2010-04-19 2010 4
#> 5 2010-03-22 2010 3
#> 6 2010-06-06 2010 6
#> 7 2006-06-14 2006 6
#> 8 2010-05-08 2010 5
#> 9 2007-06-14 2007 6
#> 10 2007-09-01 2007 9
#> # … with 2,038 more rows
%>% select(last_col()) # the last column
ahp #> # A tibble: 2,048 × 1
#> sale_price
#> <dbl>
#> 1 130.
#> 2 NA
#> 3 109
#> 4 174
#> 5 138.
#> 6 190
#> 7 140
#> 8 142
#> 9 112.
#> 10 135
#> # … with 2,038 more rows
%>% select(last_col(1)) # the second to last column
ahp #> # A tibble: 2,048 × 1
#> wd_area
#> <dbl>
#> 1 0
#> 2 0
#> 3 0
#> 4 100
#> 5 0
#> 6 0
#> 7 0
#> 8 0
#> 9 0
#> 10 335
#> # … with 2,038 more rows
%>% select(last_col(2):last_col()) # the last 3 columns
ahp #> # A tibble: 2,048 × 3
#> ep_area wd_area sale_price
#> <dbl> <dbl> <dbl>
#> 1 0 0 130.
#> 2 0 0 NA
#> 3 0 0 109
#> 4 0 100 174
#> 5 0 0 138.
#> 6 0 0 190
#> 7 0 0 140
#> 8 0 0 142
#> 9 0 0 112.
#> 10 0 335 135
#> # … with 2,038 more rows
If you want to select all variables from yr_sold
to yr_remodel
in the tibble, you can use yr_sold:yr_remodel
as a single argument in the select()
function. Note that the result highly depends on the particular ordering of variables in the tibble.
%>% select(yr_sold:yr_remodel, sale_price)
ahp #> # A tibble: 2,048 × 5
#> yr_sold mo_sold yr_built yr_remodel sale_price
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2010 3 1976 2005 130.
#> 2 2009 4 1996 1997 NA
#> 3 2010 1 1953 2007 109
#> 4 2010 4 2006 2007 174
#> 5 2010 3 1900 1993 138.
#> 6 2010 6 1966 2002 190
#> 7 2006 6 2005 2006 140
#> 8 2010 5 1959 1959 142
#> 9 2007 6 1952 1952 112.
#> 10 2007 9 1969 1969 135
#> # … with 2,038 more rows
b. select variables with names that start with, end with, or contain a string
Another commonly used pattern is to select all variables with names that start with, end with, or contain a string. To select all variables that begin with "yr"
(representing year information) in the ahp
data set, you can use the starts_with()
function in the select()
function.
%>% select(starts_with("yr"), sale_price)
ahp #> # A tibble: 2,048 × 4
#> yr_sold yr_built yr_remodel sale_price
#> <dbl> <dbl> <dbl> <dbl>
#> 1 2010 1976 2005 130.
#> 2 2009 1996 1997 NA
#> 3 2010 1953 2007 109
#> 4 2010 2006 2007 174
#> 5 2010 1900 1993 138.
#> 6 2010 1966 2002 190
#> 7 2006 2005 2006 140
#> 8 2010 1959 1959 142
#> 9 2007 1952 1952 112.
#> 10 2007 1969 1969 135
#> # … with 2,038 more rows
In the output, we can see three variables that begins with "yr"
including the sold year, built year, and remodel year. We also include the sale_price
in the select()
function.
You can also use ends_with()
to select all variables with names that end with a particular string.
%>% select(ends_with("qual")) #ending with qual
ahp #> # A tibble: 2,048 × 7
#> oa_qual kit_qual bsmt_fin_qual ext_qual fp_qual gar_qual heat_qual
#> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 6 Good GLQ Good <NA> Average Excellent
#> 2 7 Good GLQ Good Average Average Excellent
#> 3 5 Good GLQ Average <NA> Average Average
#> 4 5 Average Unf Average <NA> Average Good
#> 5 6 Average Unf Average <NA> Average Average
#> 6 6 Good ALQ Good Good Average Excellent
#> 7 6 Average GLQ Average <NA> Average Good
#> 8 5 Average Rec Average <NA> Average Fair
#> 9 5 Average Rec Average <NA> Average Average
#> 10 5 Fair Rec Average Average Average Average
#> # … with 2,038 more rows
This produces a tibble with all quality related variables (ending with qual
).
If you want to select all variables with names containing a string (not necessarily starts with or ends with), you can use the contains()
function with the string as its argument.
%>% select(contains("mo"))
ahp #> # A tibble: 2,048 × 2
#> mo_sold yr_remodel
#> <dbl> <dbl>
#> 1 3 2005
#> 2 4 1997
#> 3 1 2007
#> 4 4 2007
#> 5 3 1993
#> 6 6 2002
#> 7 6 2006
#> 8 5 1959
#> 9 6 1952
#> 10 9 1969
#> # … with 2,038 more rows
This outputs two variables named mo_sold
and yr_remodel
, both containing mo
.
You can also combine multiple conditions using logical operators. First of all, the multiple arguments in the select()
function are combined with **or operations (|
). For example, to get all variables that begins with b
or ends with m
, you can do the following.
%>% select(starts_with("b"), ends_with("m"))
ahp #> # A tibble: 2,048 × 9
#> bldg_class bldg_type bedroom bathroom bsmt_area bsmt_cond bsmt_exp
#> <dbl> <chr> <dbl> <dbl> <dbl> <chr> <chr>
#> 1 60 1Fam 3 2.5 725 Average No
#> 2 60 1Fam 4 3.5 913 Average No
#> 3 20 1Fam 3 2 1057 Average No
#> 4 60 1Fam 3 2.5 384 Average No
#> 5 50 1Fam 3 2 676 Good No
#> 6 20 1Fam 2 2.5 1517 Average No
#> 7 180 TwnhsE 2 3 547 Average Gd
#> 8 20 1Fam 3 2 1188 Average No
#> 9 20 1Fam 2 1 924 Average No
#> 10 60 1Fam 5 2.5 1040 Average Av
#> # … with 2,038 more rows, and 2 more variables: bsmt_fin_qual <chr>,
#> # bsmt_ht <chr>
#ahp %>% select(starts_with("b") | ends_with("m")) #equivalent
For example, to get all variables that begins with b
and ends with m
, you can do the following.
%>% select(starts_with("b") & ends_with("m"))
ahp #> # A tibble: 2,048 × 2
#> bedroom bathroom
#> <dbl> <dbl>
#> 1 3 2.5
#> 2 4 3.5
#> 3 3 2
#> 4 3 2.5
#> 5 3 2
#> 6 2 2.5
#> 7 2 3
#> 8 3 2
#> 9 2 1
#> 10 5 2.5
#> # … with 2,038 more rows
c. select variables except for the specified ones
Sometimes, instead of specifying the variables to select, you want to keep all the variables except a few given ones. It works similar to the vector subsetting where you can add the negation sign !
before the indices. For example, to keep all variables except the ones of names ending with "room"
in the sahp
data set, you can use select(!ends_with("room"))
.
%>% select(!ends_with("room"))
sahp #> # A tibble: 165 × 10
#> dt_sold gar_car oa_qual liv_area lot_area house_style kit_qual heat_qual
#> <date> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr>
#> 1 2010-03-25 2 6 1479 13517 2Story Good Excellent
#> 2 2009-04-10 2 7 2122 11492 2Story Good Excellent
#> 3 2010-01-15 1 5 1057 7922 1Story Good Average
#> 4 2010-04-19 2 5 1444 9802 2Story Average Good
#> 5 2010-03-22 2 6 1445 14235 1.5Fin Average Average
#> 6 2010-06-06 2 6 1888 16492 1Story Good Excellent
#> 7 2006-06-14 2 6 1072 3675 SFoyer Average Good
#> 8 2010-05-08 2 5 1188 12160 1Story Average Fair
#> 9 2007-06-14 1 5 924 15783 1Story Average Average
#> 10 2007-09-01 2 5 2080 11606 2Story Fair Average
#> # … with 155 more rows, and 2 more variables: central_air <chr>,
#> # sale_price <dbl>
6.4.4 Reorder Variables via Names
Knowing how to select and rename variables, let’s learn how to reorder variables in a tibble. The first approach to reordering variables is to use the select()
function by listing all the variables in the desired order. However, you may realized that it would be time consuming if there are many variables in the tibble and we only want to move a few to the beginning. It turns out we can utilize the everything()
function as an argument in the select()
function, which will select all the remaining variables. Let’s say we want to move yr_sold
and sale_price
to the beginning of the tibble.
%>% select(yr_sold, sale_price, everything())
ahp #> # A tibble: 2,048 × 56
#> yr_sold sale_price dt_sold mo_sold yr_built yr_remodel bldg_class
#> <dbl> <dbl> <date> <dbl> <dbl> <dbl> <dbl>
#> 1 2010 130. 2010-03-25 3 1976 2005 60
#> 2 2009 NA 2009-04-10 4 1996 1997 60
#> 3 2010 109 2010-01-15 1 1953 2007 20
#> 4 2010 174 2010-04-19 4 2006 2007 60
#> 5 2010 138. 2010-03-22 3 1900 1993 50
#> 6 2010 190 2010-06-06 6 1966 2002 20
#> 7 2006 140 2006-06-14 6 2005 2006 180
#> 8 2010 142 2010-05-08 5 1959 1959 20
#> 9 2007 112. 2007-06-14 6 1952 1952 20
#> 10 2007 135 2007-09-01 9 1969 1969 60
#> # … with 2,038 more rows, and 49 more variables: bldg_type <chr>,
#> # house_style <chr>, zoning <chr>, neighborhd <chr>, oa_cond <dbl>,
#> # oa_qual <dbl>, func <chr>, liv_area <dbl>, `1fl_area` <dbl>,
#> # `2fl_area` <dbl>, tot_rms <dbl>, bedroom <dbl>, bathroom <dbl>, kit <dbl>,
#> # kit_qual <chr>, central_air <chr>, elect <chr>, bsmt_area <dbl>,
#> # bsmt_cond <chr>, bsmt_exp <chr>, bsmt_fin_qual <chr>, bsmt_ht <chr>,
#> # ext_cond <chr>, ext_cover <chr>, ext_qual <chr>, fdn <chr>, fence <chr>, …
You can see the result is another tibble of 56 columns, with the first two columns being yr_sold
and sale_price
, followed by the remaining variables in the same order as in the original tibble ahp
.
6.4.5 Exercise
Using the ahp
dataset and the pipe operator for the following exercises.
Select all variables with names ends with
qual
orarea
, then give the first three variables the new namesa
,b
andc
respectively. Then assign the new dataset to a nameQual
.In
Qual
, moveliv_area
to be the first column andheat_qual
to be the second column while keeping the same order for the remaining columns.