7.2 Select, Rename, and Reorder Variables

In Section 7.1, we have learned how to filter observations according to a set of conditions. In many applications, there are some variables which may not be the focus of the research problem. As a result, learning how to select variables is very useful in data analysis.

7.2.1 Select Variables via Names

Let’s consider the gm data set in the r02pro package, which contains 33 variables. If we want to focus on the country, continent, year, and Human Development Index (HDI) category, we can use the select() function in the dplyr package. We will be using the pipe introduced in Section 7.1.3 throughout the rest of this Chapter. Inside the select() function, you can include the variables to select as different arguments. Note that the quotation marks are not necessary.

library(r02pro)
library(dplyr)
gm %>%
    select(country, continent, year, HDI_category)
#> # A tibble: 65,531 × 4
#>    country continent  year HDI_category
#>    <chr>   <chr>     <dbl> <chr>       
#>  1 Albania Europe     1999 medium      
#>  2 Albania Europe     2000 medium      
#>  3 Albania Europe     2001 medium      
#>  4 Albania Europe     2002 medium      
#>  5 Albania Europe     2003 medium      
#>  6 Albania Europe     2004 high        
#>  7 Andorra <NA>       1999 very high   
#>  8 Andorra <NA>       2000 very high   
#>  9 Andorra <NA>       2001 very high   
#> 10 Andorra <NA>       2002 very high   
#> # ℹ 65,521 more rows

As you can see from the output, the select() function returns a tibble with only the selected variables.

Now, let’s say we only want the observations for the year 2006. We can use the filter() function to get the subset of the gm data set with only the year 2006, and then use the select() function to keep only the variables of interest.

gm %>%
    filter(year == 2006) %>%
    select(country, continent, year, HDI_category)
#> # A tibble: 236 × 4
#>    country              continent  year HDI_category
#>    <chr>                <chr>     <dbl> <chr>       
#>  1 Afghanistan          <NA>       2006 low         
#>  2 Angola               Africa     2006 low         
#>  3 Albania              Europe     2006 high        
#>  4 Andorra              <NA>       2006 very high   
#>  5 United Arab Emirates Asia       2006 very high   
#>  6 Argentina            Americas   2006 very high   
#>  7 Armenia              Asia       2006 high        
#>  8 American Samoa       <NA>       2006 <NA>        
#>  9 Antigua and Barbuda  Americas   2006 high        
#> 10 Australia            Oceania    2006 very high   
#> # ℹ 226 more rows

7.2.2 Rename Variables

In the select() function, you can also rename the variables in the form of new_name = old_name as the arguments. Let’s see the following example where we want to rename the country to nation and HDI_category to HDI_cat.

gm %>%
    select(nation = country, continent, year, HDI_cat = HDI_category)
#> # A tibble: 65,531 × 4
#>    nation  continent  year HDI_cat  
#>    <chr>   <chr>     <dbl> <chr>    
#>  1 Albania Europe     1999 medium   
#>  2 Albania Europe     2000 medium   
#>  3 Albania Europe     2001 medium   
#>  4 Albania Europe     2002 medium   
#>  5 Albania Europe     2003 medium   
#>  6 Albania Europe     2004 high     
#>  7 Andorra <NA>       1999 very high
#>  8 Andorra <NA>       2000 very high
#>  9 Andorra <NA>       2001 very high
#> 10 Andorra <NA>       2002 very high
#> # ℹ 65,521 more rows

Note that in this renaming operations, only the variables mentioned in the select() function are kept. If you want to generate a tibble with some variables renamed while keeping all variables, you can use the rename() function.

gm %>%
    rename(nation = country, HDI_cat = HDI_category)
#> # A tibble: 65,531 × 33
#>    nation   year smoking_female smoking_male lungcancer_newcases_female
#>    <chr>   <dbl>          <dbl>        <dbl>                      <dbl>
#>  1 Albania  1999            4           NA                         11.1
#>  2 Albania  2000           NA           NA                         10.8
#>  3 Albania  2001            4           40.5                       11.2
#>  4 Albania  2002           NA           NA                         11.5
#>  5 Albania  2003           NA           NA                         11.2
#>  6 Albania  2004            4           40.5                       11  
#>  7 Andorra  1999           29.2         NA                         14.3
#>  8 Andorra  2000           NA           NA                         14.3
#>  9 Andorra  2001           29.2         36.5                       14.4
#> 10 Andorra  2002           NA           NA                         14.4
#> # ℹ 65,521 more rows
#> # ℹ 28 more variables: lungcancer_newcases_male <dbl>, owid_edu_idx <dbl>,
#> #   food_supply <dbl>, average_daily_income <dbl>, sanitation <dbl>,
#> #   child_mortality <dbl>, income_per_person <dbl>, HDI <dbl>,
#> #   alcohol_male <dbl>, alcohol_female <dbl>, livercancer_newcases_male <dbl>,
#> #   livercancer_newcases_female <dbl>, mortality_male <dbl>,
#> #   mortality_female <dbl>, cholesterol_fat_in_blood_male <dbl>, …

From this example, you can see the different behavior between select() and rename() when renaming variables.

7.2.3 Select Variables with Patterns

Sometimes, you may want to select many variables that follows a particular pattern.

a. select consecutive variables

The first type of variable patterns is consecutive variables in a tibble.

To get started, you can use select(1:m) to get the first m variables. To get the last column, you can use select(last_col()). To get the last m variables, you can use select(last_col(m-1):last_col()).

gm %>%
    select(1:3)  # the first 3 columns
#> # A tibble: 65,531 × 3
#>    country  year smoking_female
#>    <chr>   <dbl>          <dbl>
#>  1 Albania  1999            4  
#>  2 Albania  2000           NA  
#>  3 Albania  2001            4  
#>  4 Albania  2002           NA  
#>  5 Albania  2003           NA  
#>  6 Albania  2004            4  
#>  7 Andorra  1999           29.2
#>  8 Andorra  2000           NA  
#>  9 Andorra  2001           29.2
#> 10 Andorra  2002           NA  
#> # ℹ 65,521 more rows
gm %>%
    select(last_col())  # the last column
#> # A tibble: 65,531 × 1
#>    HDI_category
#>    <chr>       
#>  1 medium      
#>  2 medium      
#>  3 medium      
#>  4 medium      
#>  5 medium      
#>  6 high        
#>  7 very high   
#>  8 very high   
#>  9 very high   
#> 10 very high   
#> # ℹ 65,521 more rows
gm %>%
    select(last_col(1))  # the second to last column
#> # A tibble: 65,531 × 1
#>    GDP_per_capita
#>             <dbl>
#>  1           1.96
#>  2           2.14
#>  3           2.25
#>  4           2.38
#>  5           2.52
#>  6           2.68
#>  7          34.3 
#>  8          36   
#>  9          36.2 
#> 10          37.6 
#> # ℹ 65,521 more rows
gm %>%
    select(last_col(2):last_col())  # the last 3 columns
#> # A tibble: 65,531 × 3
#>    health_spending GDP_per_capita HDI_category
#>              <dbl>          <dbl> <chr>       
#>  1            6.35           1.96 medium      
#>  2            6.04           2.14 medium      
#>  3            6.28           2.25 medium      
#>  4            6.16           2.38 medium      
#>  5            6.88           2.52 medium      
#>  6            6.84           2.68 high        
#>  7            7.58          34.3  very high   
#>  8            6.79          36    very high   
#>  9            7.04          36.2  very high   
#> 10            7.12          37.6  very high   
#> # ℹ 65,521 more rows

If you want to select the variable country and all variables from HDI to continent in the tibble, you can use HDI:continent as a single argument in the select() function. Note that the result highly depends on the particular ordering of variables in the tibble.

gm %>%
    select(country, HDI:continent)
#> # A tibble: 65,531 × 11
#>    country   HDI alcohol_male alcohol_female livercancer_newcases_male
#>    <chr>   <dbl>        <dbl>          <dbl>                     <dbl>
#>  1 Albania 0.671         10.6           2.5                      17.8 
#>  2 Albania 0.678         11.5           2.74                     17.1 
#>  3 Albania 0.684         11.5           2.74                     16.4 
#>  4 Albania 0.691         11.5           2.74                     15.7 
#>  5 Albania 0.696         11.5           2.74                     15.1 
#>  6 Albania 0.706         12.4           2.96                     14.4 
#>  7 Andorra 0.813         20.4           6.57                      5.99
#>  8 Andorra 0.815         19.8           6.41                      6.05
#>  9 Andorra 0.82          19.6           6.33                      6.12
#> 10 Andorra 0.827         19.5           6.24                      6.19
#> # ℹ 65,521 more rows
#> # ℹ 6 more variables: livercancer_newcases_female <dbl>, mortality_male <dbl>,
#> #   mortality_female <dbl>, cholesterol_fat_in_blood_male <dbl>,
#> #   cholesterol_fat_in_blood_female <dbl>, continent <chr>

b. select variables with names that start with, end with, or contain a string

Another commonly used pattern is to select all variables with names that start with, end with, or contain a string. To select all variables that end with "female" (representing the female related measures) in the gm data set, you can use the ends_with() function in the select() function.

gm %>%
    select(ends_with("female"))
#> # A tibble: 65,531 × 7
#>    smoking_female lungcancer_newcases_fe…¹ alcohol_female livercancer_newcases…²
#>             <dbl>                    <dbl>          <dbl>                  <dbl>
#>  1            4                       11.1           2.5                    8.28
#>  2           NA                       10.8           2.74                   7.66
#>  3            4                       11.2           2.74                   7.1 
#>  4           NA                       11.5           2.74                   6.58
#>  5           NA                       11.2           2.74                   6.1 
#>  6            4                       11             2.96                   5.66
#>  7           29.2                     14.3           6.57                   2.35
#>  8           NA                       14.3           6.41                   2.35
#>  9           29.2                     14.4           6.33                   2.36
#> 10           NA                       14.4           6.24                   2.37
#> # ℹ 65,521 more rows
#> # ℹ abbreviated names: ¹​lungcancer_newcases_female,
#> #   ²​livercancer_newcases_female
#> # ℹ 3 more variables: mortality_female <dbl>,
#> #   cholesterol_fat_in_blood_female <dbl>, BMI_female <dbl>

In the output, you can see the tibble with all variables ending with "female".

You can also use starts_with() to select all variables with names that end with a particular string.

gm %>%
    select(starts_with("BMI"))  #starts with BMI
#> # A tibble: 65,531 × 4
#>    BMI_female BMI_female_group BMI_male BMI_male_group
#>         <dbl> <chr>               <dbl> <chr>         
#>  1       25.1 pre_obesity          25.6 pre_obesity   
#>  2       25.2 pre_obesity          25.7 pre_obesity   
#>  3       25.3 pre_obesity          25.8 pre_obesity   
#>  4       25.3 pre_obesity          25.9 pre_obesity   
#>  5       25.4 pre_obesity          26   pre_obesity   
#>  6       25.5 pre_obesity          26.1 pre_obesity   
#>  7       26.1 pre_obesity          26.8 pre_obesity   
#>  8       26.1 pre_obesity          26.9 pre_obesity   
#>  9       26.2 pre_obesity          27   pre_obesity   
#> 10       26.2 pre_obesity          27.1 pre_obesity   
#> # ℹ 65,521 more rows

This produces a tibble with all variables starting with BMI.

If you want to select all variables with names containing a string (not necessarily starts with or ends with), you can use the contains() function with the string as its argument.

gm %>%
    select(contains("newcases"))
#> # A tibble: 65,531 × 4
#>    lungcancer_newcases_female lungcancer_newcases_male livercancer_newcases_male
#>                         <dbl>                    <dbl>                     <dbl>
#>  1                       11.1                     46.1                     17.8 
#>  2                       10.8                     43.8                     17.1 
#>  3                       11.2                     44.1                     16.4 
#>  4                       11.5                     45.2                     15.7 
#>  5                       11.2                     44.8                     15.1 
#>  6                       11                       42.8                     14.4 
#>  7                       14.3                     70.4                      5.99
#>  8                       14.3                     70                        6.05
#>  9                       14.4                     69.7                      6.12
#> 10                       14.4                     69.6                      6.19
#> # ℹ 65,521 more rows
#> # ℹ 1 more variable: livercancer_newcases_female <dbl>

This produces a tibble with all variables containing the string "newcases".

You can also combine multiple conditions using logical operators. First of all, the multiple arguments in the select() function are combined with or operations (|). For example, to get all variables that begins with b or ends with m, you can do the following.

gm %>%
    select(starts_with("smoking"), ends_with("female"))
#> # A tibble: 65,531 × 8
#>    smoking_female smoking_male lungcancer_newcases_female alcohol_female
#>             <dbl>        <dbl>                      <dbl>          <dbl>
#>  1            4           NA                         11.1           2.5 
#>  2           NA           NA                         10.8           2.74
#>  3            4           40.5                       11.2           2.74
#>  4           NA           NA                         11.5           2.74
#>  5           NA           NA                         11.2           2.74
#>  6            4           40.5                       11             2.96
#>  7           29.2         NA                         14.3           6.57
#>  8           NA           NA                         14.3           6.41
#>  9           29.2         36.5                       14.4           6.33
#> 10           NA           NA                         14.4           6.24
#> # ℹ 65,521 more rows
#> # ℹ 4 more variables: livercancer_newcases_female <dbl>,
#> #   mortality_female <dbl>, cholesterol_fat_in_blood_female <dbl>,
#> #   BMI_female <dbl>

In the output, you can see the tibble with all variables starting with "smoking" or ending with "female".

7.2.4 Reorder Variables via Names

Knowing how to select and rename variables, let’s learn how to reorder variables in a tibble. The first approach to reordering variables is to use the select() function by listing all the variables in the desired order. However, you may realized that it would be time consuming if there are many variables in the tibble and we only want to move a few to the beginning. It turns out we can utilize the everything() function as an argument in the select() function, which will select all the remaining variables. Let’s say we want to move country and continent to the beginning of the tibble.

gm %>%
    select(country, continent, everything())
#> # A tibble: 65,531 × 33
#>    country continent  year smoking_female smoking_male lungcancer_newcases_fem…¹
#>    <chr>   <chr>     <dbl>          <dbl>        <dbl>                     <dbl>
#>  1 Albania Europe     1999            4           NA                        11.1
#>  2 Albania Europe     2000           NA           NA                        10.8
#>  3 Albania Europe     2001            4           40.5                      11.2
#>  4 Albania Europe     2002           NA           NA                        11.5
#>  5 Albania Europe     2003           NA           NA                        11.2
#>  6 Albania Europe     2004            4           40.5                      11  
#>  7 Andorra <NA>       1999           29.2         NA                        14.3
#>  8 Andorra <NA>       2000           NA           NA                        14.3
#>  9 Andorra <NA>       2001           29.2         36.5                      14.4
#> 10 Andorra <NA>       2002           NA           NA                        14.4
#> # ℹ 65,521 more rows
#> # ℹ abbreviated name: ¹​lungcancer_newcases_female
#> # ℹ 27 more variables: lungcancer_newcases_male <dbl>, owid_edu_idx <dbl>,
#> #   food_supply <dbl>, average_daily_income <dbl>, sanitation <dbl>,
#> #   child_mortality <dbl>, income_per_person <dbl>, HDI <dbl>,
#> #   alcohol_male <dbl>, alcohol_female <dbl>, livercancer_newcases_male <dbl>,
#> #   livercancer_newcases_female <dbl>, mortality_male <dbl>, …

This produces a tibble with country and continent as the first two columns, followed by the remaining variables in the same order as in the original tibble.

7.2.5 Exercise

Using the ahp dataset and the pipe operator for the following exercises.

  1. Select the variables dt_sold, house_style, liv_area, and sale_price, while renaming dt_sold to date_sold and house_style to house_type.

  2. Select the variables that starts with yr or ends with qual.