Chapter 7 Data Manipulation

Data analysis often requires various kinds of data manipulation. We will use the gm dataset from the r02pro package throughout this chapter. Let’s first look at the dataset.

library(r02pro)
gm
#> # A tibble: 65,531 × 33
#>    country  year smoki…¹ smoki…² lungc…³ lungc…⁴ owid_…⁵ food_…⁶ avera…⁷ sanit…⁸
#>    <chr>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
#>  1 Albania  1999     4      NA      11.1    46.1    58.7    2730    5.96    89.5
#>  2 Albania  2000    NA      NA      10.8    43.8    58      2800    6.3     90  
#>  3 Albania  2001     4      40.5    11.2    44.1    60      2860    6.81    90.6
#>  4 Albania  2002    NA      NA      11.5    45.2    60      2770    7.22    91.2
#>  5 Albania  2003    NA      NA      11.2    44.8    60.7    2790    7.39    91.7
#>  6 Albania  2004     4      40.5    11      42.8    60.7    2870    7.7     92.3
#>  7 Andorra  1999    29.2    NA      14.3    70.4    44.7      NA   39.2    100  
#>  8 Andorra  2000    NA      NA      14.3    70      47.3      NA   39.4    100  
#>  9 Andorra  2001    29.2    36.5    14.4    69.7    50.7      NA   39.5    100  
#> 10 Andorra  2002    NA      NA      14.4    69.6    67.3      NA   42.7    100  
#> # … with 65,521 more rows, 23 more variables: child_mortality <dbl>,
#> #   income_per_person <dbl>, HDI <dbl>, alcohol_male <dbl>,
#> #   alcohol_female <dbl>, livercancer_newcases_male <dbl>,
#> #   livercancer_newcases_female <dbl>, mortality_male <dbl>,
#> #   mortality_female <dbl>, cholesterol_fat_in_blood_male <dbl>,
#> #   cholesterol_fat_in_blood_female <dbl>, continent <chr>, region <chr>,
#> #   population <dbl>, life_expectancy <dbl>, sugar <dbl>, BMI_female <dbl>, …

gm is a dataset of 65,531 country-year pairs with 33 variables, among which are many sociodemographic and public health features. To learn more about each variable, you can look at the documentation of gm.

?gm

To view the entire dataset, you can use the View() function, which opens the dataset in a new window.

View(gm)
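Because the printed tibble shows only the first few columns, it can also help to check the dimensions and the full list of variable names from the console. The following is a minimal sketch using the base R functions dim() and names(); it assumes the r02pro package is installed.

```r
library(r02pro)

dim(gm)    # number of rows and number of columns
names(gm)  # names of all variables, including those hidden in the printout
```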

To get the first 6 rows of gm, you can use the head() function, which also has an optional argument n for requesting a different number of rows from the top.

head(gm)
#> # A tibble: 6 × 33
#>   country  year smokin…¹ smoki…² lungc…³ lungc…⁴ owid_…⁵ food_…⁶ avera…⁷ sanit…⁸
#>   <chr>   <dbl>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
#> 1 Albania  1999        4    NA      11.1    46.1    58.7    2730    5.96    89.5
#> 2 Albania  2000       NA    NA      10.8    43.8    58      2800    6.3     90  
#> 3 Albania  2001        4    40.5    11.2    44.1    60      2860    6.81    90.6
#> 4 Albania  2002       NA    NA      11.5    45.2    60      2770    7.22    91.2
#> 5 Albania  2003       NA    NA      11.2    44.8    60.7    2790    7.39    91.7
#> 6 Albania  2004        4    40.5    11      42.8    60.7    2870    7.7     92.3
#> # … with 23 more variables: child_mortality <dbl>, income_per_person <dbl>,
#> #   HDI <dbl>, alcohol_male <dbl>, alcohol_female <dbl>,
#> #   livercancer_newcases_male <dbl>, livercancer_newcases_female <dbl>,
#> #   mortality_male <dbl>, mortality_female <dbl>,
#> #   cholesterol_fat_in_blood_male <dbl>, cholesterol_fat_in_blood_female <dbl>,
#> #   continent <chr>, region <chr>, population <dbl>, life_expectancy <dbl>,
#> #   sugar <dbl>, BMI_female <dbl>, BMI_female_group <chr>, BMI_male <dbl>, …
head(gm, n = 10)  # the first 10 rows of gm
#> # A tibble: 10 × 33
#>    country  year smoki…¹ smoki…² lungc…³ lungc…⁴ owid_…⁵ food_…⁶ avera…⁷ sanit…⁸
#>    <chr>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
#>  1 Albania  1999     4      NA      11.1    46.1    58.7    2730    5.96    89.5
#>  2 Albania  2000    NA      NA      10.8    43.8    58      2800    6.3     90  
#>  3 Albania  2001     4      40.5    11.2    44.1    60      2860    6.81    90.6
#>  4 Albania  2002    NA      NA      11.5    45.2    60      2770    7.22    91.2
#>  5 Albania  2003    NA      NA      11.2    44.8    60.7    2790    7.39    91.7
#>  6 Albania  2004     4      40.5    11      42.8    60.7    2870    7.7     92.3
#>  7 Andorra  1999    29.2    NA      14.3    70.4    44.7      NA   39.2    100  
#>  8 Andorra  2000    NA      NA      14.3    70      47.3      NA   39.4    100  
#>  9 Andorra  2001    29.2    36.5    14.4    69.7    50.7      NA   39.5    100  
#> 10 Andorra  2002    NA      NA      14.4    69.6    67.3      NA   42.7    100  
#> # … with 23 more variables: child_mortality <dbl>, income_per_person <dbl>,
#> #   HDI <dbl>, alcohol_male <dbl>, alcohol_female <dbl>,
#> #   livercancer_newcases_male <dbl>, livercancer_newcases_female <dbl>,
#> #   mortality_male <dbl>, mortality_female <dbl>,
#> #   cholesterol_fat_in_blood_male <dbl>, cholesterol_fat_in_blood_female <dbl>,
#> #   continent <chr>, region <chr>, population <dbl>, life_expectancy <dbl>,
#> #   sugar <dbl>, BMI_female <dbl>, BMI_female_group <chr>, BMI_male <dbl>, …

The following are some possible questions we may want to explore.

  1. (Filter observations by their values) Find the observations for countries in Europe (continent), in years between 2006 and 2010 (year), with a low Human Development Index (HDI_category).

You will learn how to filter observations in Section 7.1.

  2. (Select variables by their names) We see there are 33 columns. For a particular data analysis question, perhaps we want to focus on a subset of the columns.

You will learn how to select variables in Section 7.2.

  3. (Reorder the observations) In year 2008, find the 10 countries with the highest life expectancy.

You will learn how to reorder observations in Section 7.3.

  4. (Create new variables as functions of existing ones) From the existing variables, perhaps we want to create new ones, for instance, the total number of new liver cancer cases for each country in the year 2008.

You will learn how to create new variables in Section 7.4.

  5. (Create various summary statistics) We may want to create certain summary statistics. For example, which 5 countries have the highest average life expectancy in each continent for the year 2006?

You will learn how to group observations and create summary statistics for each group in Section 7.5.
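As a rough preview of the five kinds of tasks listed above, the sketch below applies one operation of each kind to a tiny made-up data frame. It assumes the dplyr package is installed, which this overview does not itself name. The data frame toy and all its values are invented for illustration and only borrow a few column names from gm. The sections that follow may use different functions or arguments.

```r
library(dplyr)

# Hypothetical toy data: values are made up and are not taken from gm
toy <- data.frame(
  country = c("A", "B", "C", "D"),
  continent = c("Europe", "Europe", "Asia", "Asia"),
  year = c(2008, 2008, 2008, 2006),
  life_expectancy = c(80, 75, 70, 65),
  livercancer_newcases_male = c(10, 20, 30, 40),
  livercancer_newcases_female = c(5, 10, 15, 20)
)

# Filter observations by their values
europe_2008 <- filter(toy, continent == "Europe", year == 2008)

# Select variables by their names
small <- select(toy, country, year, life_expectancy)

# Reorder the observations, highest life expectancy first
sorted <- arrange(toy, desc(life_expectancy))

# Create a new variable as a function of existing ones
with_total <- mutate(
  toy,
  livercancer_newcases_total =
    livercancer_newcases_male + livercancer_newcases_female
)

# Create a summary statistic for each group
by_continent <- summarize(
  group_by(toy, continent),
  mean_life_expectancy = mean(life_expectancy)
)
```

Each call takes a data frame as its first argument and returns a new data frame, leaving toy unchanged, so the results can be inspected one at a time.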