7.2 Select, Rename, and Reorder Variables
In Section 7.1, we have learned how to filter observations according to a set of conditions. In many applications, there are some variables which may not be the focus of the research problem. As a result, learning how to select variables is very useful in data analysis.
7.2.1 Select Variables via Names
Let’s consider the gm
data set in the r02pro package, which contains 33 variables. If we want to focus on the country, continent, year, and Human Development Index (HDI) category, we can use the select()
function in the dplyr package. We will be using the pipe introduced in Section 7.1.3 throughout the rest of this Chapter. Inside the select()
function, you can include the variables to select as different arguments. Note that the quotation marks are not necessary.
library(r02pro)
library(dplyr)
gm %>%
select(country, continent, year, HDI_category)
#> # A tibble: 65,531 × 4
#> country continent year HDI_category
#> <chr> <chr> <dbl> <chr>
#> 1 Albania Europe 1999 medium
#> 2 Albania Europe 2000 medium
#> 3 Albania Europe 2001 medium
#> 4 Albania Europe 2002 medium
#> 5 Albania Europe 2003 medium
#> 6 Albania Europe 2004 high
#> 7 Andorra <NA> 1999 very high
#> 8 Andorra <NA> 2000 very high
#> 9 Andorra <NA> 2001 very high
#> 10 Andorra <NA> 2002 very high
#> # … with 65,521 more rows
As you can see from the output, the select()
function returns a tibble with only the selected variables.
Now, let’s say we only want the observations for the year 2006. We can use the filter()
function to get the subset of the gm
data set with only the year 2006, and then use the select()
function to keep only the variables of interest.
gm %>%
filter(year == 2006) %>%
select(country, continent, year, HDI_category)
#> # A tibble: 236 × 4
#> country continent year HDI_category
#> <chr> <chr> <dbl> <chr>
#> 1 Afghanistan <NA> 2006 low
#> 2 Angola Africa 2006 low
#> 3 Albania Europe 2006 high
#> 4 Andorra <NA> 2006 very high
#> 5 United Arab Emirates Asia 2006 very high
#> 6 Argentina Americas 2006 very high
#> 7 Armenia Asia 2006 high
#> 8 American Samoa <NA> 2006 <NA>
#> 9 Antigua and Barbuda Americas 2006 high
#> 10 Australia Oceania 2006 very high
#> # … with 226 more rows
7.2.2 Rename Variables
In the select()
function, you can also rename the variables in the form of new_name = old_name
as the arguments. Let’s see the following example where we want to rename the country
to nation
and HDI_category
to HDI_cat
.
gm %>%
select(nation = country, continent, year, HDI_cat = HDI_category)
#> # A tibble: 65,531 × 4
#> nation continent year HDI_cat
#> <chr> <chr> <dbl> <chr>
#> 1 Albania Europe 1999 medium
#> 2 Albania Europe 2000 medium
#> 3 Albania Europe 2001 medium
#> 4 Albania Europe 2002 medium
#> 5 Albania Europe 2003 medium
#> 6 Albania Europe 2004 high
#> 7 Andorra <NA> 1999 very high
#> 8 Andorra <NA> 2000 very high
#> 9 Andorra <NA> 2001 very high
#> 10 Andorra <NA> 2002 very high
#> # … with 65,521 more rows
Note that in this renaming operations, only the variables mentioned in the select()
function are kept. If you want to generate a tibble with some variables renamed while keeping all variables, you can use the rename()
function.
gm %>%
rename(nation = country, HDI_cat = HDI_category)
#> # A tibble: 65,531 × 33
#> nation year smoki…¹ smoki…² lungc…³ lungc…⁴ owid_…⁵ food_…⁶ avera…⁷ sanit…⁸
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Albania 1999 4 NA 11.1 46.1 58.7 2730 5.96 89.5
#> 2 Albania 2000 NA NA 10.8 43.8 58 2800 6.3 90
#> 3 Albania 2001 4 40.5 11.2 44.1 60 2860 6.81 90.6
#> 4 Albania 2002 NA NA 11.5 45.2 60 2770 7.22 91.2
#> 5 Albania 2003 NA NA 11.2 44.8 60.7 2790 7.39 91.7
#> 6 Albania 2004 4 40.5 11 42.8 60.7 2870 7.7 92.3
#> 7 Andorra 1999 29.2 NA 14.3 70.4 44.7 NA 39.2 100
#> 8 Andorra 2000 NA NA 14.3 70 47.3 NA 39.4 100
#> 9 Andorra 2001 29.2 36.5 14.4 69.7 50.7 NA 39.5 100
#> 10 Andorra 2002 NA NA 14.4 69.6 67.3 NA 42.7 100
#> # … with 65,521 more rows, 23 more variables: child_mortality <dbl>,
#> # income_per_person <dbl>, HDI <dbl>, alcohol_male <dbl>,
#> # alcohol_female <dbl>, livercancer_newcases_male <dbl>,
#> # livercancer_newcases_female <dbl>, mortality_male <dbl>,
#> # mortality_female <dbl>, cholesterol_fat_in_blood_male <dbl>,
#> # cholesterol_fat_in_blood_female <dbl>, continent <chr>, region <chr>,
#> # population <dbl>, life_expectancy <dbl>, sugar <dbl>, BMI_female <dbl>, …
From this example, you can see the different behavior between select()
and rename()
when renaming variables.
7.2.3 Select Variables with Patterns
Sometimes, you may want to select many variables that follows a particular pattern.
a. select consecutive variables
The first type of variable patterns is consecutive variables in a tibble.
To get started, you can use select(1:m)
to get the first m
variables. To get the last column, you can use select(last_col())
. To get the last m
variables, you can use select(last_col(m-1):last_col())
.
gm %>%
select(1:3) # the first 3 columns
#> # A tibble: 65,531 × 3
#> country year smoking_female
#> <chr> <dbl> <dbl>
#> 1 Albania 1999 4
#> 2 Albania 2000 NA
#> 3 Albania 2001 4
#> 4 Albania 2002 NA
#> 5 Albania 2003 NA
#> 6 Albania 2004 4
#> 7 Andorra 1999 29.2
#> 8 Andorra 2000 NA
#> 9 Andorra 2001 29.2
#> 10 Andorra 2002 NA
#> # … with 65,521 more rows
gm %>%
select(last_col()) # the last column
#> # A tibble: 65,531 × 1
#> HDI_category
#> <chr>
#> 1 medium
#> 2 medium
#> 3 medium
#> 4 medium
#> 5 medium
#> 6 high
#> 7 very high
#> 8 very high
#> 9 very high
#> 10 very high
#> # … with 65,521 more rows
gm %>%
select(last_col(1)) # the second to last column
#> # A tibble: 65,531 × 1
#> GDP_per_capita
#> <dbl>
#> 1 1.96
#> 2 2.14
#> 3 2.25
#> 4 2.38
#> 5 2.52
#> 6 2.68
#> 7 34.3
#> 8 36
#> 9 36.2
#> 10 37.6
#> # … with 65,521 more rows
gm %>%
select(last_col(2):last_col()) # the last 3 columns
#> # A tibble: 65,531 × 3
#> health_spending GDP_per_capita HDI_category
#> <dbl> <dbl> <chr>
#> 1 6.35 1.96 medium
#> 2 6.04 2.14 medium
#> 3 6.28 2.25 medium
#> 4 6.16 2.38 medium
#> 5 6.88 2.52 medium
#> 6 6.84 2.68 high
#> 7 7.58 34.3 very high
#> 8 6.79 36 very high
#> 9 7.04 36.2 very high
#> 10 7.12 37.6 very high
#> # … with 65,521 more rows
If you want to select the variable country
and all variables from HDI
to continent
in the tibble, you can use HDI:continent
as a single argument in the select()
function. Note that the result highly depends on the particular ordering of variables in the tibble.
gm %>%
select(country, HDI:continent)
#> # A tibble: 65,531 × 11
#> country HDI alcoh…¹ alcoh…² liver…³ liver…⁴ morta…⁵ morta…⁶ chole…⁷ chole…⁸
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Albania 0.671 10.6 2.5 17.8 8.28 23.5 12.9 4.9 4.92
#> 2 Albania 0.678 11.5 2.74 17.1 7.66 19.6 11 4.9 4.92
#> 3 Albania 0.684 11.5 2.74 16.4 7.1 20.2 12.3 4.91 4.92
#> 4 Albania 0.691 11.5 2.74 15.7 6.58 21.2 12.6 4.91 4.92
#> 5 Albania 0.696 11.5 2.74 15.1 6.1 20.6 12 4.91 4.91
#> 6 Albania 0.706 12.4 2.96 14.4 5.66 25.2 14.5 4.9 4.9
#> 7 Andorra 0.813 20.4 6.57 5.99 2.35 NA NA 5.67 5.57
#> 8 Andorra 0.815 19.8 6.41 6.05 2.35 NA NA 5.65 5.55
#> 9 Andorra 0.82 19.6 6.33 6.12 2.36 NA NA 5.64 5.53
#> 10 Andorra 0.827 19.5 6.24 6.19 2.37 NA NA 5.62 5.51
#> # … with 65,521 more rows, 1 more variable: continent <chr>, and abbreviated
#> # variable names ¹alcohol_male, ²alcohol_female, ³livercancer_newcases_male,
#> # ⁴livercancer_newcases_female, ⁵mortality_male, ⁶mortality_female,
#> # ⁷cholesterol_fat_in_blood_male, ⁸cholesterol_fat_in_blood_female
b. select variables with names that start with, end with, or contain a string
Another commonly used pattern is to select all variables with names that start with, end with, or contain a string. To select all variables that end with "female"
(representing the female related measures) in the gm
data set, you can use the ends_with()
function in the select()
function.
gm %>%
select(ends_with("female"))
#> # A tibble: 65,531 × 7
#> smoking_female lungcancer_newcases_…¹ alcoh…² liver…³ morta…⁴ chole…⁵ BMI_f…⁶
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 4 11.1 2.5 8.28 12.9 4.92 25.1
#> 2 NA 10.8 2.74 7.66 11 4.92 25.2
#> 3 4 11.2 2.74 7.1 12.3 4.92 25.3
#> 4 NA 11.5 2.74 6.58 12.6 4.92 25.3
#> 5 NA 11.2 2.74 6.1 12 4.91 25.4
#> 6 4 11 2.96 5.66 14.5 4.9 25.5
#> 7 29.2 14.3 6.57 2.35 NA 5.57 26.1
#> 8 NA 14.3 6.41 2.35 NA 5.55 26.1
#> 9 29.2 14.4 6.33 2.36 NA 5.53 26.2
#> 10 NA 14.4 6.24 2.37 NA 5.51 26.2
#> # … with 65,521 more rows, and abbreviated variable names
#> # ¹lungcancer_newcases_female, ²alcohol_female, ³livercancer_newcases_female,
#> # ⁴mortality_female, ⁵cholesterol_fat_in_blood_female, ⁶BMI_female
In the output, you can see the tibble with all variables ending with "female"
.
You can also use starts_with()
to select all variables with names that end with a particular string.
gm %>%
select(starts_with("BMI")) #starts with BMI
#> # A tibble: 65,531 × 4
#> BMI_female BMI_female_group BMI_male BMI_male_group
#> <dbl> <chr> <dbl> <chr>
#> 1 25.1 pre_obesity 25.6 pre_obesity
#> 2 25.2 pre_obesity 25.7 pre_obesity
#> 3 25.3 pre_obesity 25.8 pre_obesity
#> 4 25.3 pre_obesity 25.9 pre_obesity
#> 5 25.4 pre_obesity 26 pre_obesity
#> 6 25.5 pre_obesity 26.1 pre_obesity
#> 7 26.1 pre_obesity 26.8 pre_obesity
#> 8 26.1 pre_obesity 26.9 pre_obesity
#> 9 26.2 pre_obesity 27 pre_obesity
#> 10 26.2 pre_obesity 27.1 pre_obesity
#> # … with 65,521 more rows
This produces a tibble with all variables starting with BMI
.
If you want to select all variables with names containing a string (not necessarily starts with or ends with), you can use the contains()
function with the string as its argument.
gm %>%
select(contains("newcases"))
#> # A tibble: 65,531 × 4
#> lungcancer_newcases_female lungcancer_newcases_male livercancer_new…¹ liver…²
#> <dbl> <dbl> <dbl> <dbl>
#> 1 11.1 46.1 17.8 8.28
#> 2 10.8 43.8 17.1 7.66
#> 3 11.2 44.1 16.4 7.1
#> 4 11.5 45.2 15.7 6.58
#> 5 11.2 44.8 15.1 6.1
#> 6 11 42.8 14.4 5.66
#> 7 14.3 70.4 5.99 2.35
#> 8 14.3 70 6.05 2.35
#> 9 14.4 69.7 6.12 2.36
#> 10 14.4 69.6 6.19 2.37
#> # … with 65,521 more rows, and abbreviated variable names
#> # ¹livercancer_newcases_male, ²livercancer_newcases_female
This produces a tibble with all variables containing the string "newcases"
.
You can also combine multiple conditions using logical operators. First of all, the multiple arguments in the select()
function are combined with or operations (|
). For example, to get all variables that begins with b
or ends with m
, you can do the following.
gm %>%
select(starts_with("smoking"), ends_with("female"))
#> # A tibble: 65,531 × 8
#> smoking_female smoking_male lungcan…¹ alcoh…² liver…³ morta…⁴ chole…⁵ BMI_f…⁶
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 4 NA 11.1 2.5 8.28 12.9 4.92 25.1
#> 2 NA NA 10.8 2.74 7.66 11 4.92 25.2
#> 3 4 40.5 11.2 2.74 7.1 12.3 4.92 25.3
#> 4 NA NA 11.5 2.74 6.58 12.6 4.92 25.3
#> 5 NA NA 11.2 2.74 6.1 12 4.91 25.4
#> 6 4 40.5 11 2.96 5.66 14.5 4.9 25.5
#> 7 29.2 NA 14.3 6.57 2.35 NA 5.57 26.1
#> 8 NA NA 14.3 6.41 2.35 NA 5.55 26.1
#> 9 29.2 36.5 14.4 6.33 2.36 NA 5.53 26.2
#> 10 NA NA 14.4 6.24 2.37 NA 5.51 26.2
#> # … with 65,521 more rows, and abbreviated variable names
#> # ¹lungcancer_newcases_female, ²alcohol_female, ³livercancer_newcases_female,
#> # ⁴mortality_female, ⁵cholesterol_fat_in_blood_female, ⁶BMI_female
In the output, you can see the tibble with all variables starting with "smoking"
or ending with "female"
.
7.2.4 Reorder Variables via Names
Knowing how to select and rename variables, let’s learn how to reorder variables in a tibble. The first approach to reordering variables is to use the select()
function by listing all the variables in the desired order. However, you may realized that it would be time consuming if there are many variables in the tibble and we only want to move a few to the beginning. It turns out we can utilize the everything()
function as an argument in the select()
function, which will select all the remaining variables. Let’s say we want to move country
and continent
to the beginning of the tibble.
gm %>%
select(country, continent, everything())
#> # A tibble: 65,531 × 33
#> country conti…¹ year smoki…² smoki…³ lungc…⁴ lungc…⁵ owid_…⁶ food_…⁷ avera…⁸
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Albania Europe 1999 4 NA 11.1 46.1 58.7 2730 5.96
#> 2 Albania Europe 2000 NA NA 10.8 43.8 58 2800 6.3
#> 3 Albania Europe 2001 4 40.5 11.2 44.1 60 2860 6.81
#> 4 Albania Europe 2002 NA NA 11.5 45.2 60 2770 7.22
#> 5 Albania Europe 2003 NA NA 11.2 44.8 60.7 2790 7.39
#> 6 Albania Europe 2004 4 40.5 11 42.8 60.7 2870 7.7
#> 7 Andorra <NA> 1999 29.2 NA 14.3 70.4 44.7 NA 39.2
#> 8 Andorra <NA> 2000 NA NA 14.3 70 47.3 NA 39.4
#> 9 Andorra <NA> 2001 29.2 36.5 14.4 69.7 50.7 NA 39.5
#> 10 Andorra <NA> 2002 NA NA 14.4 69.6 67.3 NA 42.7
#> # … with 65,521 more rows, 23 more variables: sanitation <dbl>,
#> # child_mortality <dbl>, income_per_person <dbl>, HDI <dbl>,
#> # alcohol_male <dbl>, alcohol_female <dbl>, livercancer_newcases_male <dbl>,
#> # livercancer_newcases_female <dbl>, mortality_male <dbl>,
#> # mortality_female <dbl>, cholesterol_fat_in_blood_male <dbl>,
#> # cholesterol_fat_in_blood_female <dbl>, region <chr>, population <dbl>,
#> # life_expectancy <dbl>, sugar <dbl>, BMI_female <dbl>, …
This produces a tibble with country
and continent
as the first two columns, followed by the remaining variables in the same order as in the original tibble.