6.4 Select, Rename, and Reorder Variables

So far, we have only been manipulating observations, including filtering and arranging the observations. In many applications, there are some variables which may not be the focus of the research problem. As a result, learning how to select variables is very useful in data analysis.

6.4.1 Select Variables via Names

Let’s again consider the ahp data set, which contains 56 variables in total. If we want to focus on the sold date, house style, living area, and sale price, we can use the select() function in the dplyr package. We will be using the pipe introduced in Section 6.3 throughout the rest of this Chapter. Inside the select() function, you can include the variables to select as different arguments. Note that the quotation marks are not necessary.

library(r02pro)
library(tidyverse)
ahp %>% select(dt_sold, house_style, liv_area, sale_price)
#> # A tibble: 2,048 × 4
#>    dt_sold    house_style liv_area sale_price
#>    <date>     <chr>          <dbl>      <dbl>
#>  1 2010-03-25 2Story          1479       130.
#>  2 2009-04-10 2Story          2122        NA 
#>  3 2010-01-15 1Story          1057       109 
#>  4 2010-04-19 2Story          1444       174 
#>  5 2010-03-22 1.5Fin          1445       138.
#>  6 2010-06-06 1Story          1888       190 
#>  7 2006-06-14 SFoyer          1072       140 
#>  8 2010-05-08 1Story          1188       142 
#>  9 2007-06-14 1Story           924       112.
#> 10 2007-09-01 2Story          2080       135 
#> # … with 2,038 more rows

As you can see from the output, the select() function returns another tibble with all the observations and the selected variables.

6.4.2 Rename Variables

In the select() function, you can also rename the variables in the form of new_name = old_name as the arguments. Let’s see the following example where we want to rename dt_sold as sold_date and liv_area as living_area.

ahp %>% select(sold_date = dt_sold, living_area = liv_area, sale_price)
#> # A tibble: 2,048 × 3
#>    sold_date  living_area sale_price
#>    <date>           <dbl>      <dbl>
#>  1 2010-03-25        1479       130.
#>  2 2009-04-10        2122        NA 
#>  3 2010-01-15        1057       109 
#>  4 2010-04-19        1444       174 
#>  5 2010-03-22        1445       138.
#>  6 2010-06-06        1888       190 
#>  7 2006-06-14        1072       140 
#>  8 2010-05-08        1188       142 
#>  9 2007-06-14         924       112.
#> 10 2007-09-01        2080       135 
#> # … with 2,038 more rows

Note that in this renaming operations, only the variables mentioned in the select() function are kept. If you want to generate a tibble with some variables renamed while keeping all variables, you can use the rename() function.

ahp %>% rename(sold_date = dt_sold, living_area = liv_area)
#> # A tibble: 2,048 × 56
#>    sold_date  yr_sold mo_sold yr_built yr_remodel bldg_class bldg_type
#>    <date>       <dbl>   <dbl>    <dbl>      <dbl>      <dbl> <chr>    
#>  1 2010-03-25    2010       3     1976       2005         60 1Fam     
#>  2 2009-04-10    2009       4     1996       1997         60 1Fam     
#>  3 2010-01-15    2010       1     1953       2007         20 1Fam     
#>  4 2010-04-19    2010       4     2006       2007         60 1Fam     
#>  5 2010-03-22    2010       3     1900       1993         50 1Fam     
#>  6 2010-06-06    2010       6     1966       2002         20 1Fam     
#>  7 2006-06-14    2006       6     2005       2006        180 TwnhsE   
#>  8 2010-05-08    2010       5     1959       1959         20 1Fam     
#>  9 2007-06-14    2007       6     1952       1952         20 1Fam     
#> 10 2007-09-01    2007       9     1969       1969         60 1Fam     
#> # … with 2,038 more rows, and 49 more variables: house_style <chr>,
#> #   zoning <chr>, neighborhd <chr>, oa_cond <dbl>, oa_qual <dbl>, func <chr>,
#> #   living_area <dbl>, `1fl_area` <dbl>, `2fl_area` <dbl>, tot_rms <dbl>,
#> #   bedroom <dbl>, bathroom <dbl>, kit <dbl>, kit_qual <chr>,
#> #   central_air <chr>, elect <chr>, bsmt_area <dbl>, bsmt_cond <chr>,
#> #   bsmt_exp <chr>, bsmt_fin_qual <chr>, bsmt_ht <chr>, ext_cond <chr>,
#> #   ext_cover <chr>, ext_qual <chr>, fdn <chr>, fence <chr>, fp <dbl>, …

From this example, you can see the different behavior between select() and rename() when renaming variables.

6.4.3 Select Variables with Patterns

Sometimes, you may want to select many variables that follows a particular pattern.

a. select consecutive variables

The first type of variable patterns is consecutive variables in a tibble.

To get started, you can use select(1:m) to get the first m variables. To get the last column, you can use select(last_col()). To get the last m variables, you can use select(last_col(m-1):last_col()).

ahp %>% select(1:3)                    # the first 3 columns
#> # A tibble: 2,048 × 3
#>    dt_sold    yr_sold mo_sold
#>    <date>       <dbl>   <dbl>
#>  1 2010-03-25    2010       3
#>  2 2009-04-10    2009       4
#>  3 2010-01-15    2010       1
#>  4 2010-04-19    2010       4
#>  5 2010-03-22    2010       3
#>  6 2010-06-06    2010       6
#>  7 2006-06-14    2006       6
#>  8 2010-05-08    2010       5
#>  9 2007-06-14    2007       6
#> 10 2007-09-01    2007       9
#> # … with 2,038 more rows
ahp %>% select(last_col())             # the last column
#> # A tibble: 2,048 × 1
#>    sale_price
#>         <dbl>
#>  1       130.
#>  2        NA 
#>  3       109 
#>  4       174 
#>  5       138.
#>  6       190 
#>  7       140 
#>  8       142 
#>  9       112.
#> 10       135 
#> # … with 2,038 more rows
ahp %>% select(last_col(1))            # the second to last column
#> # A tibble: 2,048 × 1
#>    wd_area
#>      <dbl>
#>  1       0
#>  2       0
#>  3       0
#>  4     100
#>  5       0
#>  6       0
#>  7       0
#>  8       0
#>  9       0
#> 10     335
#> # … with 2,038 more rows
ahp %>% select(last_col(2):last_col()) # the last 3 columns
#> # A tibble: 2,048 × 3
#>    ep_area wd_area sale_price
#>      <dbl>   <dbl>      <dbl>
#>  1       0       0       130.
#>  2       0       0        NA 
#>  3       0       0       109 
#>  4       0     100       174 
#>  5       0       0       138.
#>  6       0       0       190 
#>  7       0       0       140 
#>  8       0       0       142 
#>  9       0       0       112.
#> 10       0     335       135 
#> # … with 2,038 more rows

If you want to select all variables from yr_sold to yr_remodel in the tibble, you can use yr_sold:yr_remodel as a single argument in the select() function. Note that the result highly depends on the particular ordering of variables in the tibble.

ahp %>% select(yr_sold:yr_remodel, sale_price)
#> # A tibble: 2,048 × 5
#>    yr_sold mo_sold yr_built yr_remodel sale_price
#>      <dbl>   <dbl>    <dbl>      <dbl>      <dbl>
#>  1    2010       3     1976       2005       130.
#>  2    2009       4     1996       1997        NA 
#>  3    2010       1     1953       2007       109 
#>  4    2010       4     2006       2007       174 
#>  5    2010       3     1900       1993       138.
#>  6    2010       6     1966       2002       190 
#>  7    2006       6     2005       2006       140 
#>  8    2010       5     1959       1959       142 
#>  9    2007       6     1952       1952       112.
#> 10    2007       9     1969       1969       135 
#> # … with 2,038 more rows

b. select variables with names that start with, end with, or contain a string Another commonly used pattern is to select all variables with names that start with, end with, or contain a string. To select all variables that begin with "yr" (representing year information) in the ahp data set, you can use the starts_with() function in the select() function.

ahp %>% select(starts_with("yr"), sale_price)
#> # A tibble: 2,048 × 4
#>    yr_sold yr_built yr_remodel sale_price
#>      <dbl>    <dbl>      <dbl>      <dbl>
#>  1    2010     1976       2005       130.
#>  2    2009     1996       1997        NA 
#>  3    2010     1953       2007       109 
#>  4    2010     2006       2007       174 
#>  5    2010     1900       1993       138.
#>  6    2010     1966       2002       190 
#>  7    2006     2005       2006       140 
#>  8    2010     1959       1959       142 
#>  9    2007     1952       1952       112.
#> 10    2007     1969       1969       135 
#> # … with 2,038 more rows

In the output, we can see three variables that begins with "yr" including the sold year, built year, and remodel year. We also include the sale_price in the select() function.

You can also use ends_with() to select all variables with names that end with a particular string.

ahp %>% select(ends_with("qual"))   #ending with qual
#> # A tibble: 2,048 × 7
#>    oa_qual kit_qual bsmt_fin_qual ext_qual fp_qual gar_qual heat_qual
#>      <dbl> <chr>    <chr>         <chr>    <chr>   <chr>    <chr>    
#>  1       6 Good     GLQ           Good     <NA>    Average  Excellent
#>  2       7 Good     GLQ           Good     Average Average  Excellent
#>  3       5 Good     GLQ           Average  <NA>    Average  Average  
#>  4       5 Average  Unf           Average  <NA>    Average  Good     
#>  5       6 Average  Unf           Average  <NA>    Average  Average  
#>  6       6 Good     ALQ           Good     Good    Average  Excellent
#>  7       6 Average  GLQ           Average  <NA>    Average  Good     
#>  8       5 Average  Rec           Average  <NA>    Average  Fair     
#>  9       5 Average  Rec           Average  <NA>    Average  Average  
#> 10       5 Fair     Rec           Average  Average Average  Average  
#> # … with 2,038 more rows

This produces a tibble with all quality related variables (ending with qual).

If you want to select all variables with names containing a string (not necessarily starts with or ends with), you can use the contains() function with the string as its argument.

ahp %>% select(contains("mo"))
#> # A tibble: 2,048 × 2
#>    mo_sold yr_remodel
#>      <dbl>      <dbl>
#>  1       3       2005
#>  2       4       1997
#>  3       1       2007
#>  4       4       2007
#>  5       3       1993
#>  6       6       2002
#>  7       6       2006
#>  8       5       1959
#>  9       6       1952
#> 10       9       1969
#> # … with 2,038 more rows

This outputs two variables named mo_sold and yr_remodel, both containing mo.

You can also combine multiple conditions using logical operators. First of all, the multiple arguments in the select() function are combined with **or operations (|). For example, to get all variables that begins with b or ends with m, you can do the following.

ahp %>% select(starts_with("b"), ends_with("m"))
#> # A tibble: 2,048 × 9
#>    bldg_class bldg_type bedroom bathroom bsmt_area bsmt_cond bsmt_exp
#>         <dbl> <chr>       <dbl>    <dbl>     <dbl> <chr>     <chr>   
#>  1         60 1Fam            3      2.5       725 Average   No      
#>  2         60 1Fam            4      3.5       913 Average   No      
#>  3         20 1Fam            3      2        1057 Average   No      
#>  4         60 1Fam            3      2.5       384 Average   No      
#>  5         50 1Fam            3      2         676 Good      No      
#>  6         20 1Fam            2      2.5      1517 Average   No      
#>  7        180 TwnhsE          2      3         547 Average   Gd      
#>  8         20 1Fam            3      2        1188 Average   No      
#>  9         20 1Fam            2      1         924 Average   No      
#> 10         60 1Fam            5      2.5      1040 Average   Av      
#> # … with 2,038 more rows, and 2 more variables: bsmt_fin_qual <chr>,
#> #   bsmt_ht <chr>
#ahp %>% select(starts_with("b") | ends_with("m")) #equivalent

For example, to get all variables that begins with b and ends with m, you can do the following.

ahp %>% select(starts_with("b") & ends_with("m"))
#> # A tibble: 2,048 × 2
#>    bedroom bathroom
#>      <dbl>    <dbl>
#>  1       3      2.5
#>  2       4      3.5
#>  3       3      2  
#>  4       3      2.5
#>  5       3      2  
#>  6       2      2.5
#>  7       2      3  
#>  8       3      2  
#>  9       2      1  
#> 10       5      2.5
#> # … with 2,038 more rows

c. select variables except for the specified ones

Sometimes, instead of specifying the variables to select, you want to keep all the variables except a few given ones. It works similar to the vector subsetting where you can add the negation sign ! before the indices. For example, to keep all variables except the ones of names ending with "room" in the sahp data set, you can use select(!ends_with("room")).

sahp %>% select(!ends_with("room"))
#> # A tibble: 165 × 10
#>    dt_sold    gar_car oa_qual liv_area lot_area house_style kit_qual heat_qual
#>    <date>       <dbl>   <dbl>    <dbl>    <dbl> <chr>       <chr>    <chr>    
#>  1 2010-03-25       2       6     1479    13517 2Story      Good     Excellent
#>  2 2009-04-10       2       7     2122    11492 2Story      Good     Excellent
#>  3 2010-01-15       1       5     1057     7922 1Story      Good     Average  
#>  4 2010-04-19       2       5     1444     9802 2Story      Average  Good     
#>  5 2010-03-22       2       6     1445    14235 1.5Fin      Average  Average  
#>  6 2010-06-06       2       6     1888    16492 1Story      Good     Excellent
#>  7 2006-06-14       2       6     1072     3675 SFoyer      Average  Good     
#>  8 2010-05-08       2       5     1188    12160 1Story      Average  Fair     
#>  9 2007-06-14       1       5      924    15783 1Story      Average  Average  
#> 10 2007-09-01       2       5     2080    11606 2Story      Fair     Average  
#> # … with 155 more rows, and 2 more variables: central_air <chr>,
#> #   sale_price <dbl>

6.4.4 Reorder Variables via Names

Knowing how to select and rename variables, let’s learn how to reorder variables in a tibble. The first approach to reordering variables is to use the select() function by listing all the variables in the desired order. However, you may realized that it would be time consuming if there are many variables in the tibble and we only want to move a few to the beginning. It turns out we can utilize the everything() function as an argument in the select() function, which will select all the remaining variables. Let’s say we want to move yr_sold and sale_price to the beginning of the tibble.

ahp %>% select(yr_sold, sale_price, everything())
#> # A tibble: 2,048 × 56
#>    yr_sold sale_price dt_sold    mo_sold yr_built yr_remodel bldg_class
#>      <dbl>      <dbl> <date>       <dbl>    <dbl>      <dbl>      <dbl>
#>  1    2010       130. 2010-03-25       3     1976       2005         60
#>  2    2009        NA  2009-04-10       4     1996       1997         60
#>  3    2010       109  2010-01-15       1     1953       2007         20
#>  4    2010       174  2010-04-19       4     2006       2007         60
#>  5    2010       138. 2010-03-22       3     1900       1993         50
#>  6    2010       190  2010-06-06       6     1966       2002         20
#>  7    2006       140  2006-06-14       6     2005       2006        180
#>  8    2010       142  2010-05-08       5     1959       1959         20
#>  9    2007       112. 2007-06-14       6     1952       1952         20
#> 10    2007       135  2007-09-01       9     1969       1969         60
#> # … with 2,038 more rows, and 49 more variables: bldg_type <chr>,
#> #   house_style <chr>, zoning <chr>, neighborhd <chr>, oa_cond <dbl>,
#> #   oa_qual <dbl>, func <chr>, liv_area <dbl>, `1fl_area` <dbl>,
#> #   `2fl_area` <dbl>, tot_rms <dbl>, bedroom <dbl>, bathroom <dbl>, kit <dbl>,
#> #   kit_qual <chr>, central_air <chr>, elect <chr>, bsmt_area <dbl>,
#> #   bsmt_cond <chr>, bsmt_exp <chr>, bsmt_fin_qual <chr>, bsmt_ht <chr>,
#> #   ext_cond <chr>, ext_cover <chr>, ext_qual <chr>, fdn <chr>, fence <chr>, …

You can see the result is another tibble of 56 columns, with the first two columns being yr_sold and sale_price, followed by the remaining variables in the same order as in the original tibble ahp.

6.4.5 Exercise

Using the ahp dataset and the pipe operator for the following exercises.

  1. Select all variables with names ends with qual or area, then give the first three variables the new names a, b and c respectively. Then assign the new dataset to a name Qual.

  2. In Qual, move liv_area to be the first column and heat_qual to be the second column while keeping the same order for the remaining columns.