Chapter 7 Data Manipulation

Data analysis often requires various kinds of data manipulation. We will use the gm data set from the r02pro package throughout this chapter. Let’s first look at the data set.

library(r02pro)
gm
#> # A tibble: 65,531 × 33
#>    country  year smoking_female smoking_male lungcancer_newcases_female
#>    <chr>   <dbl>          <dbl>        <dbl>                      <dbl>
#>  1 Albania  1999            4           NA                         11.1
#>  2 Albania  2000           NA           NA                         10.8
#>  3 Albania  2001            4           40.5                       11.2
#>  4 Albania  2002           NA           NA                         11.5
#>  5 Albania  2003           NA           NA                         11.2
#>  6 Albania  2004            4           40.5                       11  
#>  7 Andorra  1999           29.2         NA                         14.3
#>  8 Andorra  2000           NA           NA                         14.3
#>  9 Andorra  2001           29.2         36.5                       14.4
#> 10 Andorra  2002           NA           NA                         14.4
#> # ℹ 65,521 more rows
#> # ℹ 28 more variables: lungcancer_newcases_male <dbl>, owid_edu_idx <dbl>,
#> #   food_supply <dbl>, average_daily_income <dbl>, sanitation <dbl>,
#> #   child_mortality <dbl>, income_per_person <dbl>, HDI <dbl>,
#> #   alcohol_male <dbl>, alcohol_female <dbl>, livercancer_newcases_male <dbl>,
#> #   livercancer_newcases_female <dbl>, mortality_male <dbl>,
#> #   mortality_female <dbl>, cholesterol_fat_in_blood_male <dbl>, …

gm is a data set of 65,531 country-year pairs with 33 variables, among which are many sociodemographic and public health features. To learn more about each variable, you can look at its documentation.

?gm

To view the entire data set, you can use the View() function, which opens it in a new window.

View(gm)

To get the first 6 rows of gm, you can use the head() function. Its optional argument n lets you ask for a different number of rows.

head(gm)
#> # A tibble: 6 × 33
#>   country  year smoking_female smoking_male lungcancer_newcases_female
#>   <chr>   <dbl>          <dbl>        <dbl>                      <dbl>
#> 1 Albania  1999              4         NA                         11.1
#> 2 Albania  2000             NA         NA                         10.8
#> 3 Albania  2001              4         40.5                       11.2
#> 4 Albania  2002             NA         NA                         11.5
#> 5 Albania  2003             NA         NA                         11.2
#> 6 Albania  2004              4         40.5                       11  
#> # ℹ 28 more variables: lungcancer_newcases_male <dbl>, owid_edu_idx <dbl>,
#> #   food_supply <dbl>, average_daily_income <dbl>, sanitation <dbl>,
#> #   child_mortality <dbl>, income_per_person <dbl>, HDI <dbl>,
#> #   alcohol_male <dbl>, alcohol_female <dbl>, livercancer_newcases_male <dbl>,
#> #   livercancer_newcases_female <dbl>, mortality_male <dbl>,
#> #   mortality_female <dbl>, cholesterol_fat_in_blood_male <dbl>,
#> #   cholesterol_fat_in_blood_female <dbl>, continent <chr>, region <chr>, …
head(gm, n = 10)  # the first 10 rows of gm
#> # A tibble: 10 × 33
#>    country  year smoking_female smoking_male lungcancer_newcases_female
#>    <chr>   <dbl>          <dbl>        <dbl>                      <dbl>
#>  1 Albania  1999            4           NA                         11.1
#>  2 Albania  2000           NA           NA                         10.8
#>  3 Albania  2001            4           40.5                       11.2
#>  4 Albania  2002           NA           NA                         11.5
#>  5 Albania  2003           NA           NA                         11.2
#>  6 Albania  2004            4           40.5                       11  
#>  7 Andorra  1999           29.2         NA                         14.3
#>  8 Andorra  2000           NA           NA                         14.3
#>  9 Andorra  2001           29.2         36.5                       14.4
#> 10 Andorra  2002           NA           NA                         14.4
#> # ℹ 28 more variables: lungcancer_newcases_male <dbl>, owid_edu_idx <dbl>,
#> #   food_supply <dbl>, average_daily_income <dbl>, sanitation <dbl>,
#> #   child_mortality <dbl>, income_per_person <dbl>, HDI <dbl>,
#> #   alcohol_male <dbl>, alcohol_female <dbl>, livercancer_newcases_male <dbl>,
#> #   livercancer_newcases_female <dbl>, mortality_male <dbl>,
#> #   mortality_female <dbl>, cholesterol_fat_in_blood_male <dbl>,
#> #   cholesterol_fat_in_blood_female <dbl>, continent <chr>, region <chr>, …

The following are some possible questions we may want to explore.

  1. (Filter observations by their values) Find the observations for countries in Europe (continent), in years between 2006 and 2010 (year), with a low Human Development Index (HDI_category).

You will learn how to filter observations in Section 7.1.
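As a rough preview of the filtering question above, here is a minimal sketch. It assumes the dplyr package is installed and uses a small made-up data frame instead of gm. The country names, years, and the "low"/"high" labels in HDI_category are invented for illustration, so the actual labels in gm may differ.

```r
library(dplyr)

# Made-up stand-in for gm; all values and category labels are illustrative
toy <- data.frame(
  country      = c("A", "B", "C", "D"),
  continent    = c("Europe", "Europe", "Asia", "Europe"),
  year         = c(2007, 2012, 2008, 2009),
  HDI_category = c("low", "low", "low", "high")
)

# Keep only the rows that satisfy all conditions at once
europe_low <- filter(
  toy,
  continent == "Europe",
  year >= 2006, year <= 2010,
  HDI_category == "low"
)
europe_low
```

Conditions separated by commas are combined with "and", so only the row for country A remains.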

  2. (Select variables by their names) The data set has 33 columns. For a particular data analysis question, we may want to focus on a subset of them.

You will learn how to select variables in Section 7.2.
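To make the idea of focusing on a few columns concrete, the following sketch again assumes dplyr and a made-up data frame. The column names mirror some of those printed for gm, but the values (in particular the HDI numbers) are invented.

```r
library(dplyr)

# Made-up stand-in for gm with only five columns
toy <- data.frame(
  country        = c("A", "B"),
  year           = c(2001, 2001),
  smoking_female = c(4, 29.2),
  smoking_male   = c(40.5, 36.5),
  HDI            = c(0.7, 0.8)
)

# Pick columns by listing their names
by_name <- select(toy, country, year, HDI)

# Pick columns by a name pattern: every column starting with "smoking"
smoking <- select(toy, country, starts_with("smoking"))
names(smoking)
```

Selecting by pattern is convenient when related variables share a prefix, as the smoking, alcohol, and cancer variables in gm appear to.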

  3. (Reorder the observations) Find the 10 countries with the highest life expectancy in the year 2008.

You will learn how to reorder observations in Section 7.3.
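One possible way to approach the ranking question above is to sort the rows and keep the first few. The sketch below assumes dplyr and a made-up data frame. The column name life_expectancy is hypothetical (check ?gm for the actual name), and we take the top 2 rather than the top 10 because the toy data is tiny.

```r
library(dplyr)

# Made-up stand-in for gm; life_expectancy is a hypothetical column name
toy <- data.frame(
  country         = c("A", "B", "C", "D", "A"),
  year            = c(2008, 2008, 2008, 2008, 2007),
  life_expectancy = c(71.2, 80.5, 65.0, 78.3, 90.0)
)

top2 <- toy %>%
  filter(year == 2008) %>%            # keep the year of interest
  arrange(desc(life_expectancy)) %>%  # sort from highest to lowest
  head(2)                             # keep the first two rows
top2$country
```

The 2007 row for country A has the largest value overall, but it is dropped before sorting because it belongs to a different year.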

  4. (Create new variables as functions of existing ones) We may want to create new variables from existing ones, for instance, the total number of new liver cancer cases for each country in the year 2008.

You will learn how to create new variables in Section 7.4.
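The liver cancer example above could be sketched as follows, assuming dplyr and a made-up data frame. The two input column names come from the gm printout, while the values and the name livercancer_newcases_total are invented. Whether adding the two columns gives a meaningful total depends on their units, which you can check in the documentation for gm.

```r
library(dplyr)

# Made-up stand-in for gm; the numbers are illustrative only
toy <- data.frame(
  country                     = c("A", "B", "C"),
  year                        = c(2008, 2008, 2007),
  livercancer_newcases_male   = c(10.2, 5.1, 7.0),
  livercancer_newcases_female = c(4.3, NA, 3.0)
)

liver_2008 <- toy %>%
  filter(year == 2008) %>%
  # New column computed row by row from two existing columns
  mutate(livercancer_newcases_total =
           livercancer_newcases_male + livercancer_newcases_female)
liver_2008
```

Note that a missing value in either input makes the sum missing as well, so country B gets NA for the new variable.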

  5. (Create various summary statistics) We may want to create certain summary statistics. For example, which 5 countries in each continent had the highest average life expectancy in the year 2006?

You will learn how to group observations and create summary statistics for each group in Section 7.5.
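As a small illustration of grouped summaries, the sketch below assumes dplyr (version 1.0.0 or later for slice_max()) and a made-up data frame. The column name life_expectancy is hypothetical, and we keep the top 2 countries per continent instead of the top 5 because the toy data has only three countries per group.

```r
library(dplyr)

# Made-up stand-in for gm; life_expectancy is a hypothetical column name
toy <- data.frame(
  country         = c("A", "B", "C", "D", "E", "F"),
  continent       = c("Europe", "Europe", "Europe", "Asia", "Asia", "Asia"),
  year            = 2006,
  life_expectancy = c(79, 81, 75, 70, 83, 68)
)

by_continent <- toy %>%
  filter(year == 2006) %>%
  group_by(continent)

# One summary row per continent
avg <- summarize(by_continent, mean_life = mean(life_expectancy, na.rm = TRUE))

# The 2 countries with the highest value within each continent
top2 <- by_continent %>%
  slice_max(life_expectancy, n = 2) %>%
  ungroup()
top2
```

Once the data is grouped, both the summary and the ranking are computed separately within each continent.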