7.2 Select, Rename, and Reorder Variables
In Section 7.1, we have learned how to filter observations according to a set of conditions. In many applications, there are some variables which may not be the focus of the research problem. As a result, learning how to select variables is very useful in data analysis.
7.2.1 Select Variables via Names
Let’s consider the gm
data set in the r02pro package, which contains 33 variables. If we want to focus on the country, continent, year, and Human Development Index (HDI) category, we can use the select()
function in the dplyr package. We will be using the pipe introduced in Section 7.1.3 throughout the rest of this Chapter. Inside the select()
function, you can include the variables to select as different arguments. Note that the quotation marks are not necessary.
library(r02pro)
library(dplyr)
gm %>%
select(country, continent, year, HDI_category)
#> # A tibble: 65,531 × 4
#> country continent year HDI_category
#> <chr> <chr> <dbl> <chr>
#> 1 Albania Europe 1999 medium
#> 2 Albania Europe 2000 medium
#> 3 Albania Europe 2001 medium
#> 4 Albania Europe 2002 medium
#> 5 Albania Europe 2003 medium
#> 6 Albania Europe 2004 high
#> 7 Andorra <NA> 1999 very high
#> 8 Andorra <NA> 2000 very high
#> 9 Andorra <NA> 2001 very high
#> 10 Andorra <NA> 2002 very high
#> # ℹ 65,521 more rows
As you can see from the output, the select()
function returns a tibble with only the selected variables.
Now, let’s say we only want the observations for the year 2006. We can use the filter()
function to get the subset of the gm
data set with only the year 2006, and then use the select()
function to keep only the variables of interest.
gm %>%
filter(year == 2006) %>%
select(country, continent, year, HDI_category)
#> # A tibble: 236 × 4
#> country continent year HDI_category
#> <chr> <chr> <dbl> <chr>
#> 1 Afghanistan <NA> 2006 low
#> 2 Angola Africa 2006 low
#> 3 Albania Europe 2006 high
#> 4 Andorra <NA> 2006 very high
#> 5 United Arab Emirates Asia 2006 very high
#> 6 Argentina Americas 2006 very high
#> 7 Armenia Asia 2006 high
#> 8 American Samoa <NA> 2006 <NA>
#> 9 Antigua and Barbuda Americas 2006 high
#> 10 Australia Oceania 2006 very high
#> # ℹ 226 more rows
7.2.2 Rename Variables
In the select()
function, you can also rename the variables in the form of new_name = old_name
as the arguments. Let’s see the following example where we want to rename the country
to nation
and HDI_category
to HDI_cat
.
gm %>%
select(nation = country, continent, year, HDI_cat = HDI_category)
#> # A tibble: 65,531 × 4
#> nation continent year HDI_cat
#> <chr> <chr> <dbl> <chr>
#> 1 Albania Europe 1999 medium
#> 2 Albania Europe 2000 medium
#> 3 Albania Europe 2001 medium
#> 4 Albania Europe 2002 medium
#> 5 Albania Europe 2003 medium
#> 6 Albania Europe 2004 high
#> 7 Andorra <NA> 1999 very high
#> 8 Andorra <NA> 2000 very high
#> 9 Andorra <NA> 2001 very high
#> 10 Andorra <NA> 2002 very high
#> # ℹ 65,521 more rows
Note that in this renaming operations, only the variables mentioned in the select()
function are kept. If you want to generate a tibble with some variables renamed while keeping all variables, you can use the rename()
function.
gm %>%
rename(nation = country, HDI_cat = HDI_category)
#> # A tibble: 65,531 × 33
#> nation year smoking_female smoking_male lungcancer_newcases_female
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Albania 1999 4 NA 11.1
#> 2 Albania 2000 NA NA 10.8
#> 3 Albania 2001 4 40.5 11.2
#> 4 Albania 2002 NA NA 11.5
#> 5 Albania 2003 NA NA 11.2
#> 6 Albania 2004 4 40.5 11
#> 7 Andorra 1999 29.2 NA 14.3
#> 8 Andorra 2000 NA NA 14.3
#> 9 Andorra 2001 29.2 36.5 14.4
#> 10 Andorra 2002 NA NA 14.4
#> # ℹ 65,521 more rows
#> # ℹ 28 more variables: lungcancer_newcases_male <dbl>, owid_edu_idx <dbl>,
#> # food_supply <dbl>, average_daily_income <dbl>, sanitation <dbl>,
#> # child_mortality <dbl>, income_per_person <dbl>, HDI <dbl>,
#> # alcohol_male <dbl>, alcohol_female <dbl>, livercancer_newcases_male <dbl>,
#> # livercancer_newcases_female <dbl>, mortality_male <dbl>,
#> # mortality_female <dbl>, cholesterol_fat_in_blood_male <dbl>, …
From this example, you can see the different behavior between select()
and rename()
when renaming variables.
7.2.3 Select Variables with Patterns
Sometimes, you may want to select many variables that follows a particular pattern.
a. select consecutive variables
The first type of variable patterns is consecutive variables in a tibble.
To get started, you can use select(1:m)
to get the first m
variables. To get the last column, you can use select(last_col())
. To get the last m
variables, you can use select(last_col(m-1):last_col())
.
gm %>%
select(1:3) # the first 3 columns
#> # A tibble: 65,531 × 3
#> country year smoking_female
#> <chr> <dbl> <dbl>
#> 1 Albania 1999 4
#> 2 Albania 2000 NA
#> 3 Albania 2001 4
#> 4 Albania 2002 NA
#> 5 Albania 2003 NA
#> 6 Albania 2004 4
#> 7 Andorra 1999 29.2
#> 8 Andorra 2000 NA
#> 9 Andorra 2001 29.2
#> 10 Andorra 2002 NA
#> # ℹ 65,521 more rows
gm %>%
select(last_col()) # the last column
#> # A tibble: 65,531 × 1
#> HDI_category
#> <chr>
#> 1 medium
#> 2 medium
#> 3 medium
#> 4 medium
#> 5 medium
#> 6 high
#> 7 very high
#> 8 very high
#> 9 very high
#> 10 very high
#> # ℹ 65,521 more rows
gm %>%
select(last_col(1)) # the second to last column
#> # A tibble: 65,531 × 1
#> GDP_per_capita
#> <dbl>
#> 1 1.96
#> 2 2.14
#> 3 2.25
#> 4 2.38
#> 5 2.52
#> 6 2.68
#> 7 34.3
#> 8 36
#> 9 36.2
#> 10 37.6
#> # ℹ 65,521 more rows
gm %>%
select(last_col(2):last_col()) # the last 3 columns
#> # A tibble: 65,531 × 3
#> health_spending GDP_per_capita HDI_category
#> <dbl> <dbl> <chr>
#> 1 6.35 1.96 medium
#> 2 6.04 2.14 medium
#> 3 6.28 2.25 medium
#> 4 6.16 2.38 medium
#> 5 6.88 2.52 medium
#> 6 6.84 2.68 high
#> 7 7.58 34.3 very high
#> 8 6.79 36 very high
#> 9 7.04 36.2 very high
#> 10 7.12 37.6 very high
#> # ℹ 65,521 more rows
If you want to select the variable country
and all variables from HDI
to continent
in the tibble, you can use HDI:continent
as a single argument in the select()
function. Note that the result highly depends on the particular ordering of variables in the tibble.
gm %>%
select(country, HDI:continent)
#> # A tibble: 65,531 × 11
#> country HDI alcohol_male alcohol_female livercancer_newcases_male
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Albania 0.671 10.6 2.5 17.8
#> 2 Albania 0.678 11.5 2.74 17.1
#> 3 Albania 0.684 11.5 2.74 16.4
#> 4 Albania 0.691 11.5 2.74 15.7
#> 5 Albania 0.696 11.5 2.74 15.1
#> 6 Albania 0.706 12.4 2.96 14.4
#> 7 Andorra 0.813 20.4 6.57 5.99
#> 8 Andorra 0.815 19.8 6.41 6.05
#> 9 Andorra 0.82 19.6 6.33 6.12
#> 10 Andorra 0.827 19.5 6.24 6.19
#> # ℹ 65,521 more rows
#> # ℹ 6 more variables: livercancer_newcases_female <dbl>, mortality_male <dbl>,
#> # mortality_female <dbl>, cholesterol_fat_in_blood_male <dbl>,
#> # cholesterol_fat_in_blood_female <dbl>, continent <chr>
b. select variables with names that start with, end with, or contain a string
Another commonly used pattern is to select all variables with names that start with, end with, or contain a string. To select all variables that end with "female"
(representing the female related measures) in the gm
data set, you can use the ends_with()
function in the select()
function.
gm %>%
select(ends_with("female"))
#> # A tibble: 65,531 × 7
#> smoking_female lungcancer_newcases_fe…¹ alcohol_female livercancer_newcases…²
#> <dbl> <dbl> <dbl> <dbl>
#> 1 4 11.1 2.5 8.28
#> 2 NA 10.8 2.74 7.66
#> 3 4 11.2 2.74 7.1
#> 4 NA 11.5 2.74 6.58
#> 5 NA 11.2 2.74 6.1
#> 6 4 11 2.96 5.66
#> 7 29.2 14.3 6.57 2.35
#> 8 NA 14.3 6.41 2.35
#> 9 29.2 14.4 6.33 2.36
#> 10 NA 14.4 6.24 2.37
#> # ℹ 65,521 more rows
#> # ℹ abbreviated names: ¹lungcancer_newcases_female,
#> # ²livercancer_newcases_female
#> # ℹ 3 more variables: mortality_female <dbl>,
#> # cholesterol_fat_in_blood_female <dbl>, BMI_female <dbl>
In the output, you can see the tibble with all variables ending with "female"
.
You can also use starts_with()
to select all variables with names that end with a particular string.
gm %>%
select(starts_with("BMI")) #starts with BMI
#> # A tibble: 65,531 × 4
#> BMI_female BMI_female_group BMI_male BMI_male_group
#> <dbl> <chr> <dbl> <chr>
#> 1 25.1 pre_obesity 25.6 pre_obesity
#> 2 25.2 pre_obesity 25.7 pre_obesity
#> 3 25.3 pre_obesity 25.8 pre_obesity
#> 4 25.3 pre_obesity 25.9 pre_obesity
#> 5 25.4 pre_obesity 26 pre_obesity
#> 6 25.5 pre_obesity 26.1 pre_obesity
#> 7 26.1 pre_obesity 26.8 pre_obesity
#> 8 26.1 pre_obesity 26.9 pre_obesity
#> 9 26.2 pre_obesity 27 pre_obesity
#> 10 26.2 pre_obesity 27.1 pre_obesity
#> # ℹ 65,521 more rows
This produces a tibble with all variables starting with BMI
.
If you want to select all variables with names containing a string (not necessarily starts with or ends with), you can use the contains()
function with the string as its argument.
gm %>%
select(contains("newcases"))
#> # A tibble: 65,531 × 4
#> lungcancer_newcases_female lungcancer_newcases_male livercancer_newcases_male
#> <dbl> <dbl> <dbl>
#> 1 11.1 46.1 17.8
#> 2 10.8 43.8 17.1
#> 3 11.2 44.1 16.4
#> 4 11.5 45.2 15.7
#> 5 11.2 44.8 15.1
#> 6 11 42.8 14.4
#> 7 14.3 70.4 5.99
#> 8 14.3 70 6.05
#> 9 14.4 69.7 6.12
#> 10 14.4 69.6 6.19
#> # ℹ 65,521 more rows
#> # ℹ 1 more variable: livercancer_newcases_female <dbl>
This produces a tibble with all variables containing the string "newcases"
.
You can also combine multiple conditions using logical operators. First of all, the multiple arguments in the select()
function are combined with or operations (|
). For example, to get all variables that begins with b
or ends with m
, you can do the following.
gm %>%
select(starts_with("smoking"), ends_with("female"))
#> # A tibble: 65,531 × 8
#> smoking_female smoking_male lungcancer_newcases_female alcohol_female
#> <dbl> <dbl> <dbl> <dbl>
#> 1 4 NA 11.1 2.5
#> 2 NA NA 10.8 2.74
#> 3 4 40.5 11.2 2.74
#> 4 NA NA 11.5 2.74
#> 5 NA NA 11.2 2.74
#> 6 4 40.5 11 2.96
#> 7 29.2 NA 14.3 6.57
#> 8 NA NA 14.3 6.41
#> 9 29.2 36.5 14.4 6.33
#> 10 NA NA 14.4 6.24
#> # ℹ 65,521 more rows
#> # ℹ 4 more variables: livercancer_newcases_female <dbl>,
#> # mortality_female <dbl>, cholesterol_fat_in_blood_female <dbl>,
#> # BMI_female <dbl>
In the output, you can see the tibble with all variables starting with "smoking"
or ending with "female"
.
7.2.4 Reorder Variables via Names
Knowing how to select and rename variables, let’s learn how to reorder variables in a tibble. The first approach to reordering variables is to use the select()
function by listing all the variables in the desired order. However, you may realized that it would be time consuming if there are many variables in the tibble and we only want to move a few to the beginning. It turns out we can utilize the everything()
function as an argument in the select()
function, which will select all the remaining variables. Let’s say we want to move country
and continent
to the beginning of the tibble.
gm %>%
select(country, continent, everything())
#> # A tibble: 65,531 × 33
#> country continent year smoking_female smoking_male lungcancer_newcases_fem…¹
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Albania Europe 1999 4 NA 11.1
#> 2 Albania Europe 2000 NA NA 10.8
#> 3 Albania Europe 2001 4 40.5 11.2
#> 4 Albania Europe 2002 NA NA 11.5
#> 5 Albania Europe 2003 NA NA 11.2
#> 6 Albania Europe 2004 4 40.5 11
#> 7 Andorra <NA> 1999 29.2 NA 14.3
#> 8 Andorra <NA> 2000 NA NA 14.3
#> 9 Andorra <NA> 2001 29.2 36.5 14.4
#> 10 Andorra <NA> 2002 NA NA 14.4
#> # ℹ 65,521 more rows
#> # ℹ abbreviated name: ¹lungcancer_newcases_female
#> # ℹ 27 more variables: lungcancer_newcases_male <dbl>, owid_edu_idx <dbl>,
#> # food_supply <dbl>, average_daily_income <dbl>, sanitation <dbl>,
#> # child_mortality <dbl>, income_per_person <dbl>, HDI <dbl>,
#> # alcohol_male <dbl>, alcohol_female <dbl>, livercancer_newcases_male <dbl>,
#> # livercancer_newcases_female <dbl>, mortality_male <dbl>, …
This produces a tibble with country
and continent
as the first two columns, followed by the remaining variables in the same order as in the original tibble.