4.1 Introduction to Data Sets

In this section, we introduce two datasets that are used extensively throughout the rest of the book to illustrate essential tasks such as data import and export, visualization, and data manipulation. Working through this chapter will help you integrate the knowledge and skills from the previous chapters and apply them independently to your own projects.

4.1.1 Gapminder Data Set

We first introduce the dataset gm2004, located in the R package r02pro, the companion package of this book. The gm2004 dataset was compiled from the Gapminder (https://www.gapminder.org/) website and covers a wide range of public health related topics. In particular, gm2004 contains 472 observations and 23 variables collected in the year 2004. Each observation (row) corresponds to a specific country and gender, with columns representing features like mortality, health spending, and other demographic information.

Let’s begin by loading the r02pro and tibble packages since gm2004 is in a tibble format.

library(r02pro)
library(tibble)

After loading the two packages, you can type gm2004 to take a quick look at the dataset.

gm2004
#> # A tibble: 472 × 23
#>    country    year gender continent region population   BMI livercancer_newcases
#>    <chr>     <dbl> <chr>  <chr>     <chr>       <dbl> <dbl>                <dbl>
#>  1 Albania    2004 female Europe    South…     3090    25.5                 5.66
#>  2 Andorra    2004 female <NA>      <NA>         78.9  26.3                 2.39
#>  3 United A…  2004 female Asia      Weste…     4590    29.3                 4.29
#>  4 Argentina  2004 female Americas  South…    38900    27.1                 3.36
#>  5 Armenia    2004 female Asia      Weste…     2980    26.7                 8.23
#>  6 Australia  2004 female Oceania   Austr…    20200    26.6                 2.28
#>  7 Austria    2004 female Europe    Weste…     8250    25                   4.12
#>  8 Azerbaij…  2004 female Asia      Weste…     8540    27                   7.09
#>  9 Belgium    2004 female Europe    Weste…    10500    25.1                 2.65
#> 10 Burkina …  2004 female Africa    Weste…    13400    21.4                10.1 
#> # ℹ 462 more rows
#> # ℹ 15 more variables: lungcancer_newcases <dbl>, cholesterol <dbl>,
#> #   life_expectancy <dbl>, sugar <dbl>, health_spending <dbl>,
#> #   GDP_per_capita <dbl>, HDI <dbl>, HDI_category <chr>, smoking <dbl>,
#> #   food_supply <dbl>, owid_edu_idx <dbl>, average_daily_income <dbl>,
#> #   income_per_person <dbl>, sanitation <dbl>, child_mortality <dbl>

You can see that gm2004 is a tibble with 472 observations and 23 variables. By default, the output only gives a compact view of the first 10 observations in the tibble along with the first few variables that can fit the window. To view the full dataset, you can use the View() function, which will open the dataset in a new window.

View(gm2004)

To view the first few rows of the dataset, you can use the head() function, which produces the first 6 observations by default. You can also provide an optional second argument to display any given number of observations.

head(gm2004)
#> # A tibble: 6 × 23
#>   country     year gender continent region population   BMI livercancer_newcases
#>   <chr>      <dbl> <chr>  <chr>     <chr>       <dbl> <dbl>                <dbl>
#> 1 Albania     2004 female Europe    South…     3090    25.5                 5.66
#> 2 Andorra     2004 female <NA>      <NA>         78.9  26.3                 2.39
#> 3 United Ar…  2004 female Asia      Weste…     4590    29.3                 4.29
#> 4 Argentina   2004 female Americas  South…    38900    27.1                 3.36
#> 5 Armenia     2004 female Asia      Weste…     2980    26.7                 8.23
#> 6 Australia   2004 female Oceania   Austr…    20200    26.6                 2.28
#> # ℹ 15 more variables: lungcancer_newcases <dbl>, cholesterol <dbl>,
#> #   life_expectancy <dbl>, sugar <dbl>, health_spending <dbl>,
#> #   GDP_per_capita <dbl>, HDI <dbl>, HDI_category <chr>, smoking <dbl>,
#> #   food_supply <dbl>, owid_edu_idx <dbl>, average_daily_income <dbl>,
#> #   income_per_person <dbl>, sanitation <dbl>, child_mortality <dbl>
head(gm2004, 15)
#> # A tibble: 15 × 23
#>    country    year gender continent region population   BMI livercancer_newcases
#>    <chr>     <dbl> <chr>  <chr>     <chr>       <dbl> <dbl>                <dbl>
#>  1 Albania    2004 female Europe    South…     3090    25.5                 5.66
#>  2 Andorra    2004 female <NA>      <NA>         78.9  26.3                 2.39
#>  3 United A…  2004 female Asia      Weste…     4590    29.3                 4.29
#>  4 Argentina  2004 female Americas  South…    38900    27.1                 3.36
#>  5 Armenia    2004 female Asia      Weste…     2980    26.7                 8.23
#>  6 Australia  2004 female Oceania   Austr…    20200    26.6                 2.28
#>  7 Austria    2004 female Europe    Weste…     8250    25                   4.12
#>  8 Azerbaij…  2004 female Asia      Weste…     8540    27                   7.09
#>  9 Belgium    2004 female Europe    Weste…    10500    25.1                 2.65
#> 10 Burkina …  2004 female Africa    Weste…    13400    21.4                10.1 
#> 11 Banglade…  2004 female Asia      South…   139000    20.1                 1.97
#> 12 Bulgaria   2004 female Europe    Easte…     7690    25.4                 4.27
#> 13 Bahrain    2004 female Asia      Weste…      889    28.4                 4.25
#> 14 Bosnia a…  2004 female Europe    South…     3770    26.2                 5.71
#> 15 Belarus    2004 female Europe    Easte…     9560    26.3                 1.88
#> # ℹ 15 more variables: lungcancer_newcases <dbl>, cholesterol <dbl>,
#> #   life_expectancy <dbl>, sugar <dbl>, health_spending <dbl>,
#> #   GDP_per_capita <dbl>, HDI <dbl>, HDI_category <chr>, smoking <dbl>,
#> #   food_supply <dbl>, owid_edu_idx <dbl>, average_daily_income <dbl>,
#> #   income_per_person <dbl>, sanitation <dbl>, child_mortality <dbl>
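
Besides head(), a few other base R functions give a quick sense of a dataset's size and layout. The following is a minimal sketch on a small made-up data frame named toy (a hypothetical object, not part of r02pro), so that it runs without any package; the same calls should work on gm2004 as well.

```r
# A small made-up data frame (hypothetical values, for illustration only)
toy <- data.frame(
  country = c("A", "B", "C", "D", "E", "F", "G", "H"),
  BMI = c(25.5, 26.3, 29.3, 27.1, 26.7, 26.6, 25.0, 27.0),
  stringsAsFactors = FALSE
)

dim(toy)      # number of rows and number of columns
names(toy)    # column names
head(toy, 3)  # first 3 rows
tail(toy, 3)  # last 3 rows
```

Here tail() mirrors head(): it returns the last 6 observations by default and accepts the same optional second argument.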

To get a general idea of the dataset, you can use the summary() function introduced in Section 2.8.5.

summary(gm2004)
#>    country               year         gender           continent        
#>  Length:472         Min.   :2004   Length:472         Length:472        
#>  Class :character   1st Qu.:2004   Class :character   Class :character  
#>  Mode  :character   Median :2004   Mode  :character   Mode  :character  
#>                     Mean   :2004                                        
#>                     3rd Qu.:2004                                        
#>                     Max.   :2004                                        
#>                                                                         
#>     region            population             BMI        livercancer_newcases
#>  Length:472         Min.   :      0.8   Min.   :19.80   Min.   :  1.280     
#>  Class :character   1st Qu.:   1390.0   1st Qu.:23.00   1st Qu.:  4.438     
#>  Mode  :character   Median :   6770.0   Median :25.50   Median :  6.485     
#>                     Mean   :  33169.4   Mean   :25.21   Mean   :  9.288     
#>                     3rd Qu.:  21400.0   3rd Qu.:26.80   3rd Qu.:  9.893     
#>                     Max.   :1330000.0   Max.   :34.30   Max.   :121.000     
#>                     NA's   :78          NA's   :74      NA's   :84          
#>  lungcancer_newcases  cholesterol    life_expectancy     sugar       
#>  Min.   :  1.830     Min.   :3.760   Min.   :43.30   Min.   :  5.21  
#>  1st Qu.:  7.885     1st Qu.:4.420   1st Qu.:62.20   1st Qu.: 46.40  
#>  Median : 15.200     Median :4.750   Median :71.20   Median : 84.45  
#>  Mean   : 23.282     Mean   :4.747   Mean   :68.82   Mean   : 83.23  
#>  3rd Qu.: 32.625     3rd Qu.:5.090   3rd Qu.:75.60   3rd Qu.:115.00  
#>  Max.   :109.000     Max.   :5.720   Max.   :82.50   Max.   :193.00  
#>  NA's   :64          NA's   :74      NA's   :82      NA's   :124     
#>  health_spending  GDP_per_capita         HDI        HDI_category      
#>  Min.   : 1.700   Min.   :  0.289   Min.   :0.294   Length:472        
#>  1st Qu.: 4.460   1st Qu.:  1.688   1st Qu.:0.512   Class :character  
#>  Median : 6.165   Median :  4.570   Median :0.689   Mode  :character  
#>  Mean   : 6.534   Mean   : 14.063   Mean   :0.659                     
#>  3rd Qu.: 8.150   3rd Qu.: 17.825   3rd Qu.:0.791                     
#>  Max.   :17.600   Max.   :126.000   Max.   :0.931                     
#>  NA's   :92       NA's   :74        NA's   :102                       
#>     smoking        food_supply    owid_edu_idx   average_daily_income
#>  Min.   : 0.300   Min.   :1870   Min.   : 8.67   Min.   :  0.764     
#>  1st Qu.: 9.725   1st Qu.:2370   1st Qu.:32.00   1st Qu.:  3.890     
#>  Median :25.450   Median :2750   Median :50.70   Median :  9.250     
#>  Mean   :24.727   Mean   :2748   Mean   :49.82   Mean   : 17.509     
#>  3rd Qu.:36.000   3rd Qu.:3100   3rd Qu.:67.83   3rd Qu.: 18.850     
#>  Max.   :70.100   Max.   :3830   Max.   :88.70   Max.   :187.000     
#>  NA's   :210      NA's   :124    NA's   :106     NA's   :82          
#>  income_per_person   sanitation     child_mortality 
#>  Min.   :   779    Min.   :  4.39   Min.   :  2.96  
#>  1st Qu.:  3360    1st Qu.: 46.30   1st Qu.:  9.73  
#>  Median :  9350    Median : 83.90   Median : 24.90  
#>  Mean   : 17217    Mean   : 71.31   Mean   : 45.70  
#>  3rd Qu.: 21850    3rd Qu.: 97.80   3rd Qu.: 73.10  
#>  Max.   :109000    Max.   :100.00   Max.   :204.00  
#>  NA's   :82        NA's   :52       NA's   :78

In the output, we get the summary statistics for each variable. For numeric variables, we get the minimum, 1st quartile, median, mean, 3rd quartile, and maximum, along with the number of NA’s when the variable contains missing values. For character variables, we only get the length of the vector, the class, and the mode.
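
To see the difference between the two kinds of output in isolation, here is a small sketch using two made-up vectors (bmi and continent are hypothetical objects with invented values, not columns taken from gm2004). It also shows one way to count missing values directly, which is handy because summary() does not report them for character vectors.

```r
# Made-up vectors (hypothetical values, for illustration only)
bmi <- c(25.5, 26.3, NA, 27.1, 26.7)
continent <- c("Europe", NA, "Asia", "Americas", "Asia")

num_summary <- summary(bmi)        # six statistics plus an NA count
chr_summary <- summary(continent)  # only length, class, and mode

num_summary
chr_summary

# Count missing values directly; this works for any vector type
sum(is.na(bmi))
sum(is.na(continent))
```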

Although the type of each variable is shown in the output when typing gm2004, a more detailed listing can be obtained with the function str().

str(gm2004)
#> tibble [472 × 23] (S3: tbl_df/tbl/data.frame)
#>  $ country             : chr [1:472] "Albania" "Andorra" "United Arab Emirates" "Argentina" ...
#>  $ year                : num [1:472] 2004 2004 2004 2004 2004 ...
#>  $ gender              : chr [1:472] "female" "female" "female" "female" ...
#>  $ continent           : chr [1:472] "Europe" NA "Asia" "Americas" ...
#>  $ region              : chr [1:472] "Southern Europe" NA "Western Asia" "South America" ...
#>  $ population          : num [1:472] 3090 78.9 4590 38900 2980 20200 8250 8540 10500 13400 ...
#>  $ BMI                 : num [1:472] 25.5 26.3 29.3 27.1 26.7 26.6 25 27 25.1 21.4 ...
#>  $ livercancer_newcases: num [1:472] 5.66 2.39 4.29 3.36 8.23 2.28 4.12 7.09 2.65 10.1 ...
#>  $ lungcancer_newcases : num [1:472] 11 14.6 17.2 12.7 10.5 24 19.4 7.21 18.4 3.43 ...
#>  $ cholesterol         : num [1:472] 4.9 5.48 5.31 5.09 4.79 5.26 5.32 4.65 5.38 4.1 ...
#>  $ life_expectancy     : num [1:472] 76.2 81.4 69.2 75.3 73 81.2 79.8 67.3 79.3 55.3 ...
#>  $ sugar               : num [1:472] 58.6 NA 110 136 94.1 128 136 48.8 146 16.6 ...
#>  $ health_spending     : num [1:472] 6.84 7.22 2.32 8.45 4.86 8.43 10.3 7.8 10 6.69 ...
#>  $ GDP_per_capita      : num [1:472] 2.68 39.8 53.9 11.2 2.36 50.3 41.3 2.67 38.4 0.518 ...
#>  $ HDI                 : num [1:472] 0.706 0.827 0.809 0.788 0.712 0.906 0.863 0.674 0.897 0.332 ...
#>  $ HDI_category        : chr [1:472] "high" "very high" "very high" "high" ...
#>  $ smoking             : num [1:472] 4 29.2 2.6 25.4 3.7 21.8 40.1 0.9 24.1 11.2 ...
#>  $ food_supply         : num [1:472] 2870 NA 3210 3110 2670 3100 3640 2840 3720 2460 ...
#>  $ owid_edu_idx        : num [1:472] 60.7 65.3 60.7 60.7 72.7 78 66 71.3 71.3 8.67 ...
#>  $ average_daily_income: num [1:472] 7.7 49.3 187 15.6 6.34 46.8 49.9 10.7 47.2 2.4 ...
#>  $ income_per_person   : Named num [1:472] 8040 45000 80800 19400 7420 42200 49400 7110 46300 1530 ...
#>   ..- attr(*, "names")= chr [1:472] NA "k" "k" "k" ...
#>  $ sanitation          : num [1:472] 92.3 100 97.4 89.9 88.9 100 100 72.3 99.5 14.2 ...
#>  $ child_mortality     : num [1:472] 19.2 5.27 9.73 16.5 23.9 5.72 4.9 52.5 4.94 153 ...

The str() function lists every variable in the dataset along with its type, its length, and its first several values.

4.1.2 Small Ames Housing Price Data Set

Next, we introduce the sahp dataset, which is a subset of the Ames Housing Price data. For your convenience, we have also included sahp in the R package r02pro. As before, you can type sahp to take a quick look at the dataset.

sahp
#> # A tibble: 165 × 12
#>    dt_sold    bedroom bathroom gar_car oa_qual liv_area lot_area house_style
#>    <date>       <dbl>    <dbl>   <dbl>   <dbl>    <dbl>    <dbl> <chr>      
#>  1 2010-03-25       3      2.5       2       6     1479    13517 2Story     
#>  2 2009-04-10       4      3.5       2       7     2122    11492 2Story     
#>  3 2010-01-15       3      2         1       5     1057     7922 1Story     
#>  4 2010-04-19       3      2.5       2       5     1444     9802 2Story     
#>  5 2010-03-22       3      2         2       6     1445    14235 1.5Fin     
#>  6 2010-06-06       2      2.5       2       6     1888    16492 1Story     
#>  7 2006-06-14       2      3         2       6     1072     3675 SFoyer     
#>  8 2010-05-08       3      2         2       5     1188    12160 1Story     
#>  9 2007-06-14       2      1         1       5      924    15783 1Story     
#> 10 2007-09-01       5      2.5       2       5     2080    11606 2Story     
#> # ℹ 155 more rows
#> # ℹ 4 more variables: kit_qual <chr>, heat_qual <chr>, central_air <chr>,
#> #   sale_price <dbl>

You can see that sahp is a tibble with 165 observations and 12 housing variables, including the sale date, price, and other property quality measurements. Let’s again use the summary() function on the dataset.

summary(sahp)
#>     dt_sold              bedroom         bathroom        gar_car     
#>  Min.   :2006-01-11   Min.   :0.000   Min.   :1.000   Min.   :0.000  
#>  1st Qu.:2007-02-27   1st Qu.:2.000   1st Qu.:1.500   1st Qu.:1.000  
#>  Median :2008-04-09   Median :3.000   Median :2.000   Median :2.000  
#>  Mean   :2008-04-13   Mean   :2.764   Mean   :2.197   Mean   :1.774  
#>  3rd Qu.:2009-06-03   3rd Qu.:3.000   3rd Qu.:2.500   3rd Qu.:2.000  
#>  Max.   :2010-07-24   Max.   :5.000   Max.   :4.500   Max.   :4.000  
#>                                                       NA's   :1      
#>     oa_qual          liv_area       lot_area     house_style       
#>  Min.   : 2.000   Min.   : 438   Min.   : 1533   Length:165        
#>  1st Qu.: 5.000   1st Qu.:1116   1st Qu.: 7288   Class :character  
#>  Median : 6.000   Median :1450   Median : 9260   Mode  :character  
#>  Mean   : 6.128   Mean   :1481   Mean   : 9832                     
#>  3rd Qu.: 7.000   3rd Qu.:1707   3rd Qu.:11645                     
#>  Max.   :10.000   Max.   :3390   Max.   :39384                     
#>  NA's   :1                                                         
#>    kit_qual          heat_qual         central_air          sale_price   
#>  Length:165         Length:165         Length:165         Min.   : 44.0  
#>  Class :character   Class :character   Class :character   1st Qu.:130.0  
#>  Mode  :character   Mode  :character   Mode  :character   Median :157.9  
#>                                                           Mean   :179.9  
#>                                                           3rd Qu.:201.6  
#>                                                           Max.   :545.2  
#>                                                           NA's   :1

Again, we can use the str() function to get a detailed list of each variable along with its type.

str(sahp)
#> tibble [165 × 12] (S3: tbl_df/tbl/data.frame)
#>  $ dt_sold    : Date[1:165], format: "2010-03-25" "2009-04-10" ...
#>  $ bedroom    : num [1:165] 3 4 3 3 3 2 2 3 2 5 ...
#>  $ bathroom   : num [1:165] 2.5 3.5 2 2.5 2 2.5 3 2 1 2.5 ...
#>  $ gar_car    : num [1:165] 2 2 1 2 2 2 2 2 1 2 ...
#>  $ oa_qual    : num [1:165] 6 7 5 5 6 6 6 5 5 5 ...
#>  $ liv_area   : num [1:165] 1479 2122 1057 1444 1445 ...
#>  $ lot_area   : num [1:165] 13517 11492 7922 9802 14235 ...
#>  $ house_style: chr [1:165] "2Story" "2Story" "1Story" "2Story" ...
#>  $ kit_qual   : chr [1:165] "Good" "Good" "Good" "Average" ...
#>  $ heat_qual  : chr [1:165] "Excellent" "Excellent" "Average" "Good" ...
#>  $ central_air: chr [1:165] "Y" "Y" "Y" "Y" ...
#>  $ sale_price : num [1:165] 130 NA 109 174 138 ...

Now, let’s try to answer a few questions using the sahp dataset.

4.1.2.1 Sample Analysis: are two-story houses more expensive than one-story ones?

Let’s try to answer this question by doing some analysis. First, let’s create the logical vectors corresponding to two-story and one-story houses.

story_2 <- sahp$house_style == "2Story"
story_1 <- sahp$house_style == "1Story"

Then, we create two vectors containing the prices of the two groups, respectively.

sale_price_2 <- sahp$sale_price[story_2]
sale_price_1 <- sahp$sale_price[story_1]

Finally, we can run the summary() function on both vectors.

summary(sale_price_2)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
#>    55.0   137.9   174.0   197.8   231.5   545.2       1
summary(sale_price_1)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    44.0   129.0   160.0   183.0   224.2   465.0

The output of summary() shows that all six statistics (minimum, 1st quartile, median, mean, 3rd quartile, and maximum) are larger for two-story houses than for one-story ones. In this sample, therefore, two-story houses tend to have higher sale prices than one-story ones. Note that one two-story house has a missing sale price, which summary() reports under NA's.
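
The same kind of group comparison can be written more compactly with the base R function tapply(), which applies a function to a vector separately within each group. The sketch below uses made-up vectors (style and price are hypothetical objects with invented values, not taken from sahp) and compares group medians while ignoring missing prices.

```r
# Made-up house styles and prices (hypothetical values, for illustration only)
style <- c("2Story", "1Story", "2Story", "1Story", "2Story", "1.5Fin")
price <- c(200, 150, NA, 170, 240, 120)

# Keep only the two groups of interest
keep <- style %in% c("1Story", "2Story")

# Median price per house style, ignoring missing prices
med_by_style <- tapply(price[keep], style[keep], median, na.rm = TRUE)
med_by_style
```

The extra argument na.rm = TRUE is passed on to median(); without it, any group containing a missing price would have a median of NA.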

4.1.3 Converting Data Types

When you import a dataset into R, some variables may not have the desired types. In this case, it would be useful to convert them into the types you want before conducting further data analyses.

a. Converting a character vector to an unordered factor

Let’s look at the variable house_style in sahp. We can see from the output of str(sahp) that it is of chr type. Let’s check it and confirm the structure with summary().

is.character(sahp$house_style)
#> [1] TRUE
summary(sahp$house_style)
#>    Length     Class      Mode 
#>       165 character character

As briefly mentioned before, using the summary() function on a character vector doesn’t provide us much useful information. Let’s find the unique values of this vector and get the frequency table.

unique(sahp$house_style)
#> [1] "2Story" "1Story" "1.5Fin" "SFoyer" "SLvl"
table(sahp$house_style)
#> 
#> 1.5Fin 1Story 2Story SFoyer   SLvl 
#>     21     81     50      5      8

We can see that there are five house styles along with their frequencies. It is particularly useful to convert this kind of variable into a factor. Let’s use the function factor() to perform the conversion and is.factor() to verify that the conversion was successful.

house_style_factor <- factor(sahp$house_style)
is.factor(house_style_factor)
#> [1] TRUE
summary(house_style_factor)
#> 1.5Fin 1Story 2Story SFoyer   SLvl 
#>     21     81     50      5      8

Once the variable is a factor, summary() directly reports the frequency of each level, so there is no need to call table() separately.
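
To look a little further into what factor() produces, the following sketch uses a short made-up character vector (style is a hypothetical object with invented values, not the column from sahp). It shows the levels of the factor and the integer codes R stores for each element.

```r
# Made-up house styles (hypothetical values, for illustration only)
style <- c("2Story", "1Story", "1Story", "SLvl", "2Story", "1Story")

style_factor <- factor(style)

levels(style_factor)      # distinct categories, in sorted order by default
nlevels(style_factor)     # number of categories
as.integer(style_factor)  # integer codes that index into the levels
summary(style_factor)     # frequency of each level
```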

b. Converting a character vector to an ordered factor

Now, let’s take a look at another variable called kit_qual, measuring the kitchen quality. Again, let’s check the unique values.

unique(sahp$kit_qual)
#> [1] "Good"      "Average"   "Fair"      "Excellent"

The variable takes four different quality values, and these values have an intrinsic ordering. In particular, we know Fair < Average < Good < Excellent. To reflect this order, you can convert the variable into an ordered factor using the same factor() function, setting ordered = TRUE and specifying the levels in ascending order.

kit_qual_ordered_factor <- factor(sahp$kit_qual, ordered = TRUE, levels = c("Fair",
    "Average", "Good", "Excellent"))  #convert to ordered factor
summary(kit_qual_ordered_factor)
#>      Fair   Average      Good Excellent 
#>         9        85        57        14
str(kit_qual_ordered_factor)
#>  Ord.factor w/ 4 levels "Fair"<"Average"<..: 3 3 3 2 2 3 2 2 2 1 ...
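
One practical benefit of an ordered factor is that comparison operators respect the level order. Here is a minimal sketch on a made-up vector of ratings (qual and qual_ord are hypothetical objects with invented values, not taken from sahp).

```r
# Made-up kitchen quality ratings (hypothetical values, for illustration only)
qual <- c("Good", "Average", "Fair", "Excellent", "Average")

qual_ord <- factor(qual, ordered = TRUE,
                   levels = c("Fair", "Average", "Good", "Excellent"))

qual_ord >= "Good"       # element-wise comparison using the level order
sum(qual_ord >= "Good")  # number of ratings that are Good or better
max(qual_ord)            # highest rating present
sort(qual_ord)           # sorted from lowest to highest level
```

These comparisons would not be meaningful on an unordered factor, which is why setting ordered = TRUE matters for variables like this one.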

c. Converting a character vector to a logical vector

Lastly, let’s look at the variable central_air, which indicates whether a house has central air conditioning. As before, let’s get the unique elements.

unique(sahp$central_air)
#> [1] "Y" "N"

Intuitively, you can create a logical vector representing whether the house has central AC or not.

central_air_logi <- sahp$central_air == "Y"
summary(central_air_logi)
#>    Mode   FALSE    TRUE 
#> logical      16     149
str(central_air_logi)
#>  logi [1:165] TRUE TRUE TRUE TRUE TRUE TRUE ...

Another scenario would be creating an additional variable from the existing ones. For example, we know the overall quality (oa_qual) of the house ranges from 2 to 10.

table(sahp$oa_qual)
#> 
#>  2  3  4  5  6  7  8  9 10 
#>  2  1 10 47 44 31 21  6  2

If we want to create a new variable indicating houses of good quality, defined as those with an oa_qual greater than 5, we can create a new logical variable named good_qual as shown below.

good_qual <- sahp$oa_qual > 5
summary(good_qual)
#>    Mode   FALSE    TRUE    NA's 
#> logical      60     104       1
str(good_qual)
#>  logi [1:165] TRUE TRUE FALSE FALSE TRUE TRUE ...
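
Logical vectors are also convenient for counting, because TRUE is treated as 1 and FALSE as 0 in arithmetic. The sketch below uses a made-up vector of quality scores (this oa_qual is a hypothetical object with invented values, not the column from sahp) to compute the number and the proportion of TRUE values while skipping missing ones.

```r
# Made-up overall quality scores (hypothetical values, for illustration only)
oa_qual <- c(6, 7, 5, NA, 8, 4)

good_qual <- oa_qual > 5  # an NA in the input stays NA in the result

sum(good_qual, na.rm = TRUE)   # number of TRUE values
mean(good_qual, na.rm = TRUE)  # proportion of TRUE among non-missing values
```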

4.1.4 Recover Modified Values

When you are working with a dataset provided by a package, you may accidentally modify some values in the original dataset.

In this situation, there’s no need to panic: the mistake can be easily undone by restoring the data to its “factory” setting (i.e., the original version inside the package).

To do this, you just need to use the data() function with the dataset name as its argument.

That said, it is strongly recommended to develop the habit of saving an independent copy of a dataset under a different object name, especially for datasets provided by a loaded package, and working on that copy instead. We also encourage you to annotate your code with clear comments as you go. Both habits make your code easier to follow and help you avoid unnecessary errors.

Let’s try to modify one value of sahp and recover the dataset afterward.

sahp[1, 2]  #get the original value
#> # A tibble: 1 × 1
#>   bedroom
#>     <dbl>
#> 1       3
sahp[1, 2] <- 5  #modify the value
sahp[1, 2]  #verify the modified value
#> # A tibble: 1 × 1
#>   bedroom
#>     <dbl>
#> 1       5
data(sahp)  #recover the data
sahp[1, 2]  #verify the value is recovered
#> # A tibble: 1 × 1
#>   bedroom
#>     <dbl>
#> 1       3
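
The recommended habit of working on an independent copy can be sketched as follows. To keep the example runnable without r02pro, it uses mtcars, a dataset that ships with base R in the datasets package, as a stand-in for sahp; the object name my_cars is hypothetical.

```r
# Work on a copy so the packaged dataset stays untouched
my_cars <- mtcars        # independent copy under a different name
my_cars[1, "cyl"] <- 99  # modify the copy only
mtcars[1, "cyl"]         # the original value is unchanged

# If the original name is overwritten by accident, data() restores it
mtcars[1, "cyl"] <- 99
data(mtcars)
mtcars[1, "cyl"]         # back to the packaged value
```

With the copy in place, any modification is confined to my_cars, and the original dataset remains available for comparison at any time.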