7.2 Select, Rename, and Reorder Variables

In Section 7.1, we have learned how to filter observations according to a set of conditions. In many applications, there are some variables which may not be the focus of the research problem. As a result, learning how to select variables is very useful in data analysis.

7.2.1 Select Variables via Names

Let’s consider the gm data set in the r02pro package, which contains 33 variables. If we want to focus on the country, continent, year, and Human Development Index (HDI) category, we can use the select() function in the dplyr package. We will be using the pipe introduced in Section 7.1.3 throughout the rest of this Chapter. Inside the select() function, you can include the variables to select as different arguments. Note that the quotation marks are not necessary.

library(r02pro)
library(dplyr)
gm %>%
    select(country, continent, year, HDI_category)
#> # A tibble: 65,531 × 4
#>    country continent  year HDI_category
#>    <chr>   <chr>     <dbl> <chr>       
#>  1 Albania Europe     1999 medium      
#>  2 Albania Europe     2000 medium      
#>  3 Albania Europe     2001 medium      
#>  4 Albania Europe     2002 medium      
#>  5 Albania Europe     2003 medium      
#>  6 Albania Europe     2004 high        
#>  7 Andorra <NA>       1999 very high   
#>  8 Andorra <NA>       2000 very high   
#>  9 Andorra <NA>       2001 very high   
#> 10 Andorra <NA>       2002 very high   
#> # … with 65,521 more rows

As you can see from the output, the select() function returns a tibble with only the selected variables.

Now, let’s say we only want the observations for the year 2006. We can use the filter() function to get the subset of the gm data set with only the year 2006, and then use the select() function to keep only the variables of interest.

gm %>%
    filter(year == 2006) %>%
    select(country, continent, year, HDI_category)
#> # A tibble: 236 × 4
#>    country              continent  year HDI_category
#>    <chr>                <chr>     <dbl> <chr>       
#>  1 Afghanistan          <NA>       2006 low         
#>  2 Angola               Africa     2006 low         
#>  3 Albania              Europe     2006 high        
#>  4 Andorra              <NA>       2006 very high   
#>  5 United Arab Emirates Asia       2006 very high   
#>  6 Argentina            Americas   2006 very high   
#>  7 Armenia              Asia       2006 high        
#>  8 American Samoa       <NA>       2006 <NA>        
#>  9 Antigua and Barbuda  Americas   2006 high        
#> 10 Australia            Oceania    2006 very high   
#> # … with 226 more rows

7.2.2 Rename Variables

In the select() function, you can also rename the variables in the form of new_name = old_name as the arguments. Let’s see the following example where we want to rename the country to nation and HDI_category to HDI_cat.

gm %>%
    select(nation = country, continent, year, HDI_cat = HDI_category)
#> # A tibble: 65,531 × 4
#>    nation  continent  year HDI_cat  
#>    <chr>   <chr>     <dbl> <chr>    
#>  1 Albania Europe     1999 medium   
#>  2 Albania Europe     2000 medium   
#>  3 Albania Europe     2001 medium   
#>  4 Albania Europe     2002 medium   
#>  5 Albania Europe     2003 medium   
#>  6 Albania Europe     2004 high     
#>  7 Andorra <NA>       1999 very high
#>  8 Andorra <NA>       2000 very high
#>  9 Andorra <NA>       2001 very high
#> 10 Andorra <NA>       2002 very high
#> # … with 65,521 more rows

Note that in this renaming operations, only the variables mentioned in the select() function are kept. If you want to generate a tibble with some variables renamed while keeping all variables, you can use the rename() function.

gm %>%
    rename(nation = country, HDI_cat = HDI_category)
#> # A tibble: 65,531 × 33
#>    nation   year smoki…¹ smoki…² lungc…³ lungc…⁴ owid_…⁵ food_…⁶ avera…⁷ sanit…⁸
#>    <chr>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
#>  1 Albania  1999     4      NA      11.1    46.1    58.7    2730    5.96    89.5
#>  2 Albania  2000    NA      NA      10.8    43.8    58      2800    6.3     90  
#>  3 Albania  2001     4      40.5    11.2    44.1    60      2860    6.81    90.6
#>  4 Albania  2002    NA      NA      11.5    45.2    60      2770    7.22    91.2
#>  5 Albania  2003    NA      NA      11.2    44.8    60.7    2790    7.39    91.7
#>  6 Albania  2004     4      40.5    11      42.8    60.7    2870    7.7     92.3
#>  7 Andorra  1999    29.2    NA      14.3    70.4    44.7      NA   39.2    100  
#>  8 Andorra  2000    NA      NA      14.3    70      47.3      NA   39.4    100  
#>  9 Andorra  2001    29.2    36.5    14.4    69.7    50.7      NA   39.5    100  
#> 10 Andorra  2002    NA      NA      14.4    69.6    67.3      NA   42.7    100  
#> # … with 65,521 more rows, 23 more variables: child_mortality <dbl>,
#> #   income_per_person <dbl>, HDI <dbl>, alcohol_male <dbl>,
#> #   alcohol_female <dbl>, livercancer_newcases_male <dbl>,
#> #   livercancer_newcases_female <dbl>, mortality_male <dbl>,
#> #   mortality_female <dbl>, cholesterol_fat_in_blood_male <dbl>,
#> #   cholesterol_fat_in_blood_female <dbl>, continent <chr>, region <chr>,
#> #   population <dbl>, life_expectancy <dbl>, sugar <dbl>, BMI_female <dbl>, …

From this example, you can see the different behavior between select() and rename() when renaming variables.

7.2.3 Select Variables with Patterns

Sometimes, you may want to select many variables that follows a particular pattern.

a. select consecutive variables

The first type of variable patterns is consecutive variables in a tibble.

To get started, you can use select(1:m) to get the first m variables. To get the last column, you can use select(last_col()). To get the last m variables, you can use select(last_col(m-1):last_col()).

gm %>%
    select(1:3)  # the first 3 columns
#> # A tibble: 65,531 × 3
#>    country  year smoking_female
#>    <chr>   <dbl>          <dbl>
#>  1 Albania  1999            4  
#>  2 Albania  2000           NA  
#>  3 Albania  2001            4  
#>  4 Albania  2002           NA  
#>  5 Albania  2003           NA  
#>  6 Albania  2004            4  
#>  7 Andorra  1999           29.2
#>  8 Andorra  2000           NA  
#>  9 Andorra  2001           29.2
#> 10 Andorra  2002           NA  
#> # … with 65,521 more rows
gm %>%
    select(last_col())  # the last column
#> # A tibble: 65,531 × 1
#>    HDI_category
#>    <chr>       
#>  1 medium      
#>  2 medium      
#>  3 medium      
#>  4 medium      
#>  5 medium      
#>  6 high        
#>  7 very high   
#>  8 very high   
#>  9 very high   
#> 10 very high   
#> # … with 65,521 more rows
gm %>%
    select(last_col(1))  # the second to last column
#> # A tibble: 65,531 × 1
#>    GDP_per_capita
#>             <dbl>
#>  1           1.96
#>  2           2.14
#>  3           2.25
#>  4           2.38
#>  5           2.52
#>  6           2.68
#>  7          34.3 
#>  8          36   
#>  9          36.2 
#> 10          37.6 
#> # … with 65,521 more rows
gm %>%
    select(last_col(2):last_col())  # the last 3 columns
#> # A tibble: 65,531 × 3
#>    health_spending GDP_per_capita HDI_category
#>              <dbl>          <dbl> <chr>       
#>  1            6.35           1.96 medium      
#>  2            6.04           2.14 medium      
#>  3            6.28           2.25 medium      
#>  4            6.16           2.38 medium      
#>  5            6.88           2.52 medium      
#>  6            6.84           2.68 high        
#>  7            7.58          34.3  very high   
#>  8            6.79          36    very high   
#>  9            7.04          36.2  very high   
#> 10            7.12          37.6  very high   
#> # … with 65,521 more rows

If you want to select the variable country and all variables from HDI to continent in the tibble, you can use HDI:continent as a single argument in the select() function. Note that the result highly depends on the particular ordering of variables in the tibble.

gm %>%
    select(country, HDI:continent)
#> # A tibble: 65,531 × 11
#>    country   HDI alcoh…¹ alcoh…² liver…³ liver…⁴ morta…⁵ morta…⁶ chole…⁷ chole…⁸
#>    <chr>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
#>  1 Albania 0.671    10.6    2.5    17.8     8.28    23.5    12.9    4.9     4.92
#>  2 Albania 0.678    11.5    2.74   17.1     7.66    19.6    11      4.9     4.92
#>  3 Albania 0.684    11.5    2.74   16.4     7.1     20.2    12.3    4.91    4.92
#>  4 Albania 0.691    11.5    2.74   15.7     6.58    21.2    12.6    4.91    4.92
#>  5 Albania 0.696    11.5    2.74   15.1     6.1     20.6    12      4.91    4.91
#>  6 Albania 0.706    12.4    2.96   14.4     5.66    25.2    14.5    4.9     4.9 
#>  7 Andorra 0.813    20.4    6.57    5.99    2.35    NA      NA      5.67    5.57
#>  8 Andorra 0.815    19.8    6.41    6.05    2.35    NA      NA      5.65    5.55
#>  9 Andorra 0.82     19.6    6.33    6.12    2.36    NA      NA      5.64    5.53
#> 10 Andorra 0.827    19.5    6.24    6.19    2.37    NA      NA      5.62    5.51
#> # … with 65,521 more rows, 1 more variable: continent <chr>, and abbreviated
#> #   variable names ¹​alcohol_male, ²​alcohol_female, ³​livercancer_newcases_male,
#> #   ⁴​livercancer_newcases_female, ⁵​mortality_male, ⁶​mortality_female,
#> #   ⁷​cholesterol_fat_in_blood_male, ⁸​cholesterol_fat_in_blood_female

b. select variables with names that start with, end with, or contain a string

Another commonly used pattern is to select all variables with names that start with, end with, or contain a string. To select all variables that end with "female" (representing the female related measures) in the gm data set, you can use the ends_with() function in the select() function.

gm %>%
    select(ends_with("female"))
#> # A tibble: 65,531 × 7
#>    smoking_female lungcancer_newcases_…¹ alcoh…² liver…³ morta…⁴ chole…⁵ BMI_f…⁶
#>             <dbl>                  <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
#>  1            4                     11.1    2.5     8.28    12.9    4.92    25.1
#>  2           NA                     10.8    2.74    7.66    11      4.92    25.2
#>  3            4                     11.2    2.74    7.1     12.3    4.92    25.3
#>  4           NA                     11.5    2.74    6.58    12.6    4.92    25.3
#>  5           NA                     11.2    2.74    6.1     12      4.91    25.4
#>  6            4                     11      2.96    5.66    14.5    4.9     25.5
#>  7           29.2                   14.3    6.57    2.35    NA      5.57    26.1
#>  8           NA                     14.3    6.41    2.35    NA      5.55    26.1
#>  9           29.2                   14.4    6.33    2.36    NA      5.53    26.2
#> 10           NA                     14.4    6.24    2.37    NA      5.51    26.2
#> # … with 65,521 more rows, and abbreviated variable names
#> #   ¹​lungcancer_newcases_female, ²​alcohol_female, ³​livercancer_newcases_female,
#> #   ⁴​mortality_female, ⁵​cholesterol_fat_in_blood_female, ⁶​BMI_female

In the output, you can see the tibble with all variables ending with "female".

You can also use starts_with() to select all variables with names that end with a particular string.

gm %>%
    select(starts_with("BMI"))  #starts with BMI
#> # A tibble: 65,531 × 4
#>    BMI_female BMI_female_group BMI_male BMI_male_group
#>         <dbl> <chr>               <dbl> <chr>         
#>  1       25.1 pre_obesity          25.6 pre_obesity   
#>  2       25.2 pre_obesity          25.7 pre_obesity   
#>  3       25.3 pre_obesity          25.8 pre_obesity   
#>  4       25.3 pre_obesity          25.9 pre_obesity   
#>  5       25.4 pre_obesity          26   pre_obesity   
#>  6       25.5 pre_obesity          26.1 pre_obesity   
#>  7       26.1 pre_obesity          26.8 pre_obesity   
#>  8       26.1 pre_obesity          26.9 pre_obesity   
#>  9       26.2 pre_obesity          27   pre_obesity   
#> 10       26.2 pre_obesity          27.1 pre_obesity   
#> # … with 65,521 more rows

This produces a tibble with all variables starting with BMI.

If you want to select all variables with names containing a string (not necessarily starts with or ends with), you can use the contains() function with the string as its argument.

gm %>%
    select(contains("newcases"))
#> # A tibble: 65,531 × 4
#>    lungcancer_newcases_female lungcancer_newcases_male livercancer_new…¹ liver…²
#>                         <dbl>                    <dbl>             <dbl>   <dbl>
#>  1                       11.1                     46.1             17.8     8.28
#>  2                       10.8                     43.8             17.1     7.66
#>  3                       11.2                     44.1             16.4     7.1 
#>  4                       11.5                     45.2             15.7     6.58
#>  5                       11.2                     44.8             15.1     6.1 
#>  6                       11                       42.8             14.4     5.66
#>  7                       14.3                     70.4              5.99    2.35
#>  8                       14.3                     70                6.05    2.35
#>  9                       14.4                     69.7              6.12    2.36
#> 10                       14.4                     69.6              6.19    2.37
#> # … with 65,521 more rows, and abbreviated variable names
#> #   ¹​livercancer_newcases_male, ²​livercancer_newcases_female

This produces a tibble with all variables containing the string "newcases".

You can also combine multiple conditions using logical operators. First of all, the multiple arguments in the select() function are combined with or operations (|). For example, to get all variables that begins with b or ends with m, you can do the following.

gm %>%
    select(starts_with("smoking"), ends_with("female"))
#> # A tibble: 65,531 × 8
#>    smoking_female smoking_male lungcan…¹ alcoh…² liver…³ morta…⁴ chole…⁵ BMI_f…⁶
#>             <dbl>        <dbl>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
#>  1            4           NA        11.1    2.5     8.28    12.9    4.92    25.1
#>  2           NA           NA        10.8    2.74    7.66    11      4.92    25.2
#>  3            4           40.5      11.2    2.74    7.1     12.3    4.92    25.3
#>  4           NA           NA        11.5    2.74    6.58    12.6    4.92    25.3
#>  5           NA           NA        11.2    2.74    6.1     12      4.91    25.4
#>  6            4           40.5      11      2.96    5.66    14.5    4.9     25.5
#>  7           29.2         NA        14.3    6.57    2.35    NA      5.57    26.1
#>  8           NA           NA        14.3    6.41    2.35    NA      5.55    26.1
#>  9           29.2         36.5      14.4    6.33    2.36    NA      5.53    26.2
#> 10           NA           NA        14.4    6.24    2.37    NA      5.51    26.2
#> # … with 65,521 more rows, and abbreviated variable names
#> #   ¹​lungcancer_newcases_female, ²​alcohol_female, ³​livercancer_newcases_female,
#> #   ⁴​mortality_female, ⁵​cholesterol_fat_in_blood_female, ⁶​BMI_female

In the output, you can see the tibble with all variables starting with "smoking" or ending with "female".

7.2.4 Reorder Variables via Names

Knowing how to select and rename variables, let’s learn how to reorder variables in a tibble. The first approach to reordering variables is to use the select() function by listing all the variables in the desired order. However, you may realized that it would be time consuming if there are many variables in the tibble and we only want to move a few to the beginning. It turns out we can utilize the everything() function as an argument in the select() function, which will select all the remaining variables. Let’s say we want to move country and continent to the beginning of the tibble.

gm %>%
    select(country, continent, everything())
#> # A tibble: 65,531 × 33
#>    country conti…¹  year smoki…² smoki…³ lungc…⁴ lungc…⁵ owid_…⁶ food_…⁷ avera…⁸
#>    <chr>   <chr>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
#>  1 Albania Europe   1999     4      NA      11.1    46.1    58.7    2730    5.96
#>  2 Albania Europe   2000    NA      NA      10.8    43.8    58      2800    6.3 
#>  3 Albania Europe   2001     4      40.5    11.2    44.1    60      2860    6.81
#>  4 Albania Europe   2002    NA      NA      11.5    45.2    60      2770    7.22
#>  5 Albania Europe   2003    NA      NA      11.2    44.8    60.7    2790    7.39
#>  6 Albania Europe   2004     4      40.5    11      42.8    60.7    2870    7.7 
#>  7 Andorra <NA>     1999    29.2    NA      14.3    70.4    44.7      NA   39.2 
#>  8 Andorra <NA>     2000    NA      NA      14.3    70      47.3      NA   39.4 
#>  9 Andorra <NA>     2001    29.2    36.5    14.4    69.7    50.7      NA   39.5 
#> 10 Andorra <NA>     2002    NA      NA      14.4    69.6    67.3      NA   42.7 
#> # … with 65,521 more rows, 23 more variables: sanitation <dbl>,
#> #   child_mortality <dbl>, income_per_person <dbl>, HDI <dbl>,
#> #   alcohol_male <dbl>, alcohol_female <dbl>, livercancer_newcases_male <dbl>,
#> #   livercancer_newcases_female <dbl>, mortality_male <dbl>,
#> #   mortality_female <dbl>, cholesterol_fat_in_blood_male <dbl>,
#> #   cholesterol_fat_in_blood_female <dbl>, region <chr>, population <dbl>,
#> #   life_expectancy <dbl>, sugar <dbl>, BMI_female <dbl>, …

This produces a tibble with country and continent as the first two columns, followed by the remaining variables in the same order as in the original tibble.

7.2.5 Exercise

Using the ahp dataset and the pipe operator for the following exercises.

  1. Select the variables dt_sold, house_style, liv_area, and sale_price, while renaming dt_sold to date_sold and house_style to house_type.

  2. Select the variables that starts with yr or ends with qual.