Chapter 7 Data Manipulation
Data analysis often requires various kinds of data manipulation. We will use the gm
data set in the r02pro package throughout this chapter. Let’s first look at the data set.
library(r02pro)
gm
#> # A tibble: 65,531 × 33
#> country year smoki…¹ smoki…² lungc…³ lungc…⁴ owid_…⁵ food_…⁶ avera…⁷ sanit…⁸
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Albania 1999 4 NA 11.1 46.1 58.7 2730 5.96 89.5
#> 2 Albania 2000 NA NA 10.8 43.8 58 2800 6.3 90
#> 3 Albania 2001 4 40.5 11.2 44.1 60 2860 6.81 90.6
#> 4 Albania 2002 NA NA 11.5 45.2 60 2770 7.22 91.2
#> 5 Albania 2003 NA NA 11.2 44.8 60.7 2790 7.39 91.7
#> 6 Albania 2004 4 40.5 11 42.8 60.7 2870 7.7 92.3
#> 7 Andorra 1999 29.2 NA 14.3 70.4 44.7 NA 39.2 100
#> 8 Andorra 2000 NA NA 14.3 70 47.3 NA 39.4 100
#> 9 Andorra 2001 29.2 36.5 14.4 69.7 50.7 NA 39.5 100
#> 10 Andorra 2002 NA NA 14.4 69.6 67.3 NA 42.7 100
#> # … with 65,521 more rows, 23 more variables: child_mortality <dbl>,
#> # income_per_person <dbl>, HDI <dbl>, alcohol_male <dbl>,
#> # alcohol_female <dbl>, livercancer_newcases_male <dbl>,
#> # livercancer_newcases_female <dbl>, mortality_male <dbl>,
#> # mortality_female <dbl>, cholesterol_fat_in_blood_male <dbl>,
#> # cholesterol_fat_in_blood_female <dbl>, continent <chr>, region <chr>,
#> # population <dbl>, life_expectancy <dbl>, sugar <dbl>, BMI_female <dbl>, …
gm is a data set of 65,531 country-year pairs with 33 variables, among which are many sociodemographic and public health features. To learn more about each variable, you can look at its documentation.
To view the entire data set, you can use the View() function, which will open the data set in a new window.
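As a minimal sketch of these inspection steps, assuming the r02pro package is installed, the calls below open the help page for gm and report its size and column names without printing the data itself. The use of dim() and names() is an addition for illustration; both are base R functions that work on any data frame or tibble.

```r
library(r02pro)

# Open the help page that documents each variable in gm
# (equivalent to help("gm")).
?gm

# Number of rows and columns, without printing the data.
dim(gm)

# Names of all columns, including those the tibble printout truncates.
names(gm)
```

Calling View(gm) in an interactive session then opens the full table for browsing.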
To get the first 6 rows of gm, you can use the head() function, which also takes an optional argument n if you want a different number of top rows.
head(gm)
#> # A tibble: 6 × 33
#> country year smokin…¹ smoki…² lungc…³ lungc…⁴ owid_…⁵ food_…⁶ avera…⁷ sanit…⁸
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Albania 1999 4 NA 11.1 46.1 58.7 2730 5.96 89.5
#> 2 Albania 2000 NA NA 10.8 43.8 58 2800 6.3 90
#> 3 Albania 2001 4 40.5 11.2 44.1 60 2860 6.81 90.6
#> 4 Albania 2002 NA NA 11.5 45.2 60 2770 7.22 91.2
#> 5 Albania 2003 NA NA 11.2 44.8 60.7 2790 7.39 91.7
#> 6 Albania 2004 4 40.5 11 42.8 60.7 2870 7.7 92.3
#> # … with 23 more variables: child_mortality <dbl>, income_per_person <dbl>,
#> # HDI <dbl>, alcohol_male <dbl>, alcohol_female <dbl>,
#> # livercancer_newcases_male <dbl>, livercancer_newcases_female <dbl>,
#> # mortality_male <dbl>, mortality_female <dbl>,
#> # cholesterol_fat_in_blood_male <dbl>, cholesterol_fat_in_blood_female <dbl>,
#> # continent <chr>, region <chr>, population <dbl>, life_expectancy <dbl>,
#> # sugar <dbl>, BMI_female <dbl>, BMI_female_group <chr>, BMI_male <dbl>, …
head(gm, n = 10) #the first 10 rows of gm
#> # A tibble: 10 × 33
#> country year smoki…¹ smoki…² lungc…³ lungc…⁴ owid_…⁵ food_…⁶ avera…⁷ sanit…⁸
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Albania 1999 4 NA 11.1 46.1 58.7 2730 5.96 89.5
#> 2 Albania 2000 NA NA 10.8 43.8 58 2800 6.3 90
#> 3 Albania 2001 4 40.5 11.2 44.1 60 2860 6.81 90.6
#> 4 Albania 2002 NA NA 11.5 45.2 60 2770 7.22 91.2
#> 5 Albania 2003 NA NA 11.2 44.8 60.7 2790 7.39 91.7
#> 6 Albania 2004 4 40.5 11 42.8 60.7 2870 7.7 92.3
#> 7 Andorra 1999 29.2 NA 14.3 70.4 44.7 NA 39.2 100
#> 8 Andorra 2000 NA NA 14.3 70 47.3 NA 39.4 100
#> 9 Andorra 2001 29.2 36.5 14.4 69.7 50.7 NA 39.5 100
#> 10 Andorra 2002 NA NA 14.4 69.6 67.3 NA 42.7 100
#> # … with 23 more variables: child_mortality <dbl>, income_per_person <dbl>,
#> # HDI <dbl>, alcohol_male <dbl>, alcohol_female <dbl>,
#> # livercancer_newcases_male <dbl>, livercancer_newcases_female <dbl>,
#> # mortality_male <dbl>, mortality_female <dbl>,
#> # cholesterol_fat_in_blood_male <dbl>, cholesterol_fat_in_blood_female <dbl>,
#> # continent <chr>, region <chr>, population <dbl>, life_expectancy <dbl>,
#> # sugar <dbl>, BMI_female <dbl>, BMI_female_group <chr>, BMI_male <dbl>, …
The following are some possible questions we may want to explore.
- (Filter observations by their values) Find the observations that represent countries in Europe (continent), years between 2006 and 2010 (year), and a low Human Development Index (HDI_category).
You will learn how to filter observations in Section 7.1.
- (Select variables by their names) We see there are 33 columns. For a particular data analysis question, perhaps we want to focus on a subset of the columns.
You will learn how to select variables in Section 7.2.
- (Reorder the observations) For the year 2008, find the 10 countries with the highest life expectancy.
You will learn how to reorder observations in Section 7.3.
- (Create new variables as functions of existing ones) From the existing variables, perhaps we want to create new ones, for instance, the total number of new liver cancer cases for each country in the year 2008.
You will learn how to create new variables in Section 7.4.
- (Create various summary statistics) We may want to create certain summary statistics. For example, what are the top 5 countries with the highest average life expectancy in each continent for the year 2006?
You will learn how to group observations and create summary statistics for each group in Section 7.5.
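To make the five kinds of questions above concrete, here is a rough preview on a tiny data frame. Everything in it is an assumption for illustration: the data frame toy and its values are made up, only a few column names are borrowed from gm, and the code uses plain base R, which is not necessarily the approach taken in Sections 7.1 to 7.5.

```r
# Hypothetical toy data; all values are invented for illustration only.
toy <- data.frame(
  country   = c("A", "B", "C", "D", "E", "F"),
  continent = c("Europe", "Europe", "Asia", "Asia", "Europe", "Asia"),
  year      = c(2008, 2008, 2008, 2008, 2006, 2006),
  life_expectancy = c(80.1, 75.3, 82.4, 68.9, 79.0, 70.2),
  livercancer_newcases_male   = c(120, 300, 450, 90, 110, 95),
  livercancer_newcases_female = c(60, 140, 200, 50, 55, 40),
  stringsAsFactors = FALSE
)

# 1. Filter observations by their values: European rows from 2008.
europe_2008 <- toy[toy$continent == "Europe" & toy$year == 2008, ]

# 2. Select variables by their names.
subset_cols <- toy[, c("country", "year", "life_expectancy")]

# 3. Reorder observations: among 2008 rows, sort by life expectancy
#    (highest first) and keep the top 2.
toy_2008 <- toy[toy$year == 2008, ]
sorted_2008 <- toy_2008[order(toy_2008$life_expectancy, decreasing = TRUE), ]
top2 <- head(sorted_2008, n = 2)

# 4. Create a new variable as a function of existing ones.
toy$livercancer_newcases_total <-
  toy$livercancer_newcases_male + toy$livercancer_newcases_female

# 5. Summary statistic per group: mean life expectancy in each continent.
mean_by_continent <- aggregate(life_expectancy ~ continent, data = toy, FUN = mean)
```

Each numbered step corresponds to one of the questions in the list, in the same order; the same ideas apply to gm once the column names and conditions are swapped in.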