4.1 Introduction to Data Sets
In this section, we introduce two datasets that are used extensively throughout the rest of the book to illustrate key concepts such as data import and export, visualization, and data manipulation. This chapter will help you integrate the knowledge and skills from the previous chapters and prepare you to work independently on your own projects.
4.1.1 Gapminder Data Set
We will first introduce the dataset gm2004, located in the R package r02pro, the companion package of this book. The gm2004 dataset is compiled from the Gapminder (https://www.gapminder.org/) website and covers a wide range of public-health-related topics. In particular, gm2004 contains 472 observations and 23 health-related variables collected in the year 2004. Each observation (row) corresponds to a specific country and gender, with columns representing features like mortality, health spending, and other demographic information.
Let’s begin by loading the r02pro and tibble packages, since gm2004 is stored as a tibble.
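The loading step described above can be sketched as follows, assuming both packages are already installed. The is_tibble() check at the end is an optional addition to confirm the format.

```r
# Load the companion package (provides gm2004) and tibble (for tibble printing)
library(r02pro)
library(tibble)

# Optional check: confirm that gm2004 is stored as a tibble
is_tibble(gm2004)
```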
After loading the two packages, you can type gm2004 to take a quick look at the dataset.
gm2004
#> # A tibble: 472 × 23
#> country year gender conti…¹ region popul…² BMI liver…³ lungc…⁴ chole…⁵
#> <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Albania 2004 female Europe South… 3090 25.5 5.66 11 4.9
#> 2 Andorra 2004 female <NA> <NA> 78.9 26.3 2.39 14.6 5.48
#> 3 United Ara… 2004 female Asia Weste… 4590 29.3 4.29 17.2 5.31
#> 4 Argentina 2004 female Americ… South… 38900 27.1 3.36 12.7 5.09
#> 5 Armenia 2004 female Asia Weste… 2980 26.7 8.23 10.5 4.79
#> 6 Australia 2004 female Oceania Austr… 20200 26.6 2.28 24 5.26
#> 7 Austria 2004 female Europe Weste… 8250 25 4.12 19.4 5.32
#> 8 Azerbaijan 2004 female Asia Weste… 8540 27 7.09 7.21 4.65
#> 9 Belgium 2004 female Europe Weste… 10500 25.1 2.65 18.4 5.38
#> 10 Burkina Fa… 2004 female Africa Weste… 13400 21.4 10.1 3.43 4.1
#> # … with 462 more rows, 13 more variables: life_expectancy <dbl>, sugar <dbl>,
#> # health_spending <dbl>, GDP_per_capita <dbl>, HDI <dbl>, HDI_category <chr>,
#> # smoking <dbl>, food_supply <dbl>, owid_edu_idx <dbl>,
#> # average_daily_income <dbl>, income_per_person <dbl>, sanitation <dbl>,
#> # child_mortality <dbl>, and abbreviated variable names ¹continent,
#> # ²population, ³livercancer_newcases, ⁴lungcancer_newcases, ⁵cholesterol
You can see that gm2004 is a tibble with 472 observations and 23 variables. By default, the output gives only a compact view: the first 10 observations, along with as many variables as fit in the console window. To view the full dataset, you can use the View() function, which opens the dataset in a new window.
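As a minimal sketch of both viewing options: View() only makes sense in an interactive session such as RStudio, so it is guarded with interactive() here. The n and width arguments of print() are one way (not necessarily the one used elsewhere in this book) to show more rows and all columns of a tibble in the console.

```r
library(r02pro)
library(tibble)

# Open the spreadsheet-style viewer (only in an interactive session)
if (interactive()) View(gm2004)

# Print the first 20 rows and all 23 columns in the console
print(gm2004, n = 20, width = Inf)
```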
To view the first few rows of the dataset, you can use the head() function, which returns the first 6 observations by default. You can also provide an optional second argument to display a different number of observations.
head(gm2004)
#> # A tibble: 6 × 23
#> country year gender conti…¹ region popul…² BMI liver…³ lungc…⁴ chole…⁵
#> <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Albania 2004 female Europe South… 3090 25.5 5.66 11 4.9
#> 2 Andorra 2004 female <NA> <NA> 78.9 26.3 2.39 14.6 5.48
#> 3 United Arab… 2004 female Asia Weste… 4590 29.3 4.29 17.2 5.31
#> 4 Argentina 2004 female Americ… South… 38900 27.1 3.36 12.7 5.09
#> 5 Armenia 2004 female Asia Weste… 2980 26.7 8.23 10.5 4.79
#> 6 Australia 2004 female Oceania Austr… 20200 26.6 2.28 24 5.26
#> # … with 13 more variables: life_expectancy <dbl>, sugar <dbl>,
#> # health_spending <dbl>, GDP_per_capita <dbl>, HDI <dbl>, HDI_category <chr>,
#> # smoking <dbl>, food_supply <dbl>, owid_edu_idx <dbl>,
#> # average_daily_income <dbl>, income_per_person <dbl>, sanitation <dbl>,
#> # child_mortality <dbl>, and abbreviated variable names ¹continent,
#> # ²population, ³livercancer_newcases, ⁴lungcancer_newcases, ⁵cholesterol
head(gm2004, 15)
#> # A tibble: 15 × 23
#> country year gender conti…¹ region popul…² BMI liver…³ lungc…⁴ chole…⁵
#> <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Albania 2004 female Europe South… 3.09e3 25.5 5.66 11 4.9
#> 2 Andorra 2004 female <NA> <NA> 7.89e1 26.3 2.39 14.6 5.48
#> 3 United Ara… 2004 female Asia Weste… 4.59e3 29.3 4.29 17.2 5.31
#> 4 Argentina 2004 female Americ… South… 3.89e4 27.1 3.36 12.7 5.09
#> 5 Armenia 2004 female Asia Weste… 2.98e3 26.7 8.23 10.5 4.79
#> 6 Australia 2004 female Oceania Austr… 2.02e4 26.6 2.28 24 5.26
#> 7 Austria 2004 female Europe Weste… 8.25e3 25 4.12 19.4 5.32
#> 8 Azerbaijan 2004 female Asia Weste… 8.54e3 27 7.09 7.21 4.65
#> 9 Belgium 2004 female Europe Weste… 1.05e4 25.1 2.65 18.4 5.38
#> 10 Burkina Fa… 2004 female Africa Weste… 1.34e4 21.4 10.1 3.43 4.1
#> 11 Bangladesh 2004 female Asia South… 1.39e5 20.1 1.97 2.76 4.38
#> 12 Bulgaria 2004 female Europe Easte… 7.69e3 25.4 4.27 9.68 5.06
#> 13 Bahrain 2004 female Asia Weste… 8.89e2 28.4 4.25 17.6 5.18
#> 14 Bosnia and… 2004 female Europe South… 3.77e3 26.2 5.71 15.2 4.76
#> 15 Belarus 2004 female Europe Easte… 9.56e3 26.3 1.88 5.71 5.1
#> # … with 13 more variables: life_expectancy <dbl>, sugar <dbl>,
#> # health_spending <dbl>, GDP_per_capita <dbl>, HDI <dbl>, HDI_category <chr>,
#> # smoking <dbl>, food_supply <dbl>, owid_edu_idx <dbl>,
#> # average_daily_income <dbl>, income_per_person <dbl>, sanitation <dbl>,
#> # child_mortality <dbl>, and abbreviated variable names ¹continent,
#> # ²population, ³livercancer_newcases, ⁴lungcancer_newcases, ⁵cholesterol
To get a general idea of the dataset, you can use the summary() function introduced in Section 2.8.5.
summary(gm2004)
#> country year gender continent
#> Length:472 Min. :2004 Length:472 Length:472
#> Class :character 1st Qu.:2004 Class :character Class :character
#> Mode :character Median :2004 Mode :character Mode :character
#> Mean :2004
#> 3rd Qu.:2004
#> Max. :2004
#>
#> region population BMI livercancer_newcases
#> Length:472 Min. : 0.8 Min. :19.80 Min. : 1.280
#> Class :character 1st Qu.: 1390.0 1st Qu.:23.00 1st Qu.: 4.438
#> Mode :character Median : 6770.0 Median :25.50 Median : 6.485
#> Mean : 33169.4 Mean :25.21 Mean : 9.288
#> 3rd Qu.: 21400.0 3rd Qu.:26.80 3rd Qu.: 9.893
#> Max. :1330000.0 Max. :34.30 Max. :121.000
#> NA's :78 NA's :74 NA's :84
#> lungcancer_newcases cholesterol life_expectancy sugar
#> Min. : 1.830 Min. :3.760 Min. :43.30 Min. : 5.21
#> 1st Qu.: 7.885 1st Qu.:4.420 1st Qu.:62.20 1st Qu.: 46.40
#> Median : 15.200 Median :4.750 Median :71.20 Median : 84.45
#> Mean : 23.282 Mean :4.747 Mean :68.82 Mean : 83.23
#> 3rd Qu.: 32.625 3rd Qu.:5.090 3rd Qu.:75.60 3rd Qu.:115.00
#> Max. :109.000 Max. :5.720 Max. :82.50 Max. :193.00
#> NA's :64 NA's :74 NA's :82 NA's :124
#> health_spending GDP_per_capita HDI HDI_category
#> Min. : 1.700 Min. : 0.289 Min. :0.294 Length:472
#> 1st Qu.: 4.460 1st Qu.: 1.688 1st Qu.:0.512 Class :character
#> Median : 6.165 Median : 4.570 Median :0.689 Mode :character
#> Mean : 6.534 Mean : 14.063 Mean :0.659
#> 3rd Qu.: 8.150 3rd Qu.: 17.825 3rd Qu.:0.791
#> Max. :17.600 Max. :126.000 Max. :0.931
#> NA's :92 NA's :74 NA's :102
#> smoking food_supply owid_edu_idx average_daily_income
#> Min. : 0.300 Min. :1870 Min. : 8.67 Min. : 0.764
#> 1st Qu.: 9.725 1st Qu.:2370 1st Qu.:32.00 1st Qu.: 3.890
#> Median :25.450 Median :2750 Median :50.70 Median : 9.250
#> Mean :24.727 Mean :2748 Mean :49.82 Mean : 17.509
#> 3rd Qu.:36.000 3rd Qu.:3100 3rd Qu.:67.83 3rd Qu.: 18.850
#> Max. :70.100 Max. :3830 Max. :88.70 Max. :187.000
#> NA's :210 NA's :124 NA's :106 NA's :82
#> income_per_person sanitation child_mortality
#> Min. : 779 Min. : 4.39 Min. : 2.96
#> 1st Qu.: 3360 1st Qu.: 46.30 1st Qu.: 9.73
#> Median : 9350 Median : 83.90 Median : 24.90
#> Mean : 17217 Mean : 71.31 Mean : 45.70
#> 3rd Qu.: 21850 3rd Qu.: 97.80 3rd Qu.: 73.10
#> Max. :109000 Max. :100.00 Max. :204.00
#> NA's :82 NA's :52 NA's :78
In the output, we get the summary statistics for each variable. For numeric variables, we get the minimum, 1st quartile, median, mean, 3rd quartile, and maximum, along with the number of NA’s for that variable. For character variables, we only get the length of the vector, the class, and the mode.
Although the type of each variable is shown when you type gm2004, a more detailed list can be obtained with the function str().
str(gm2004)
#> tibble [472 × 23] (S3: tbl_df/tbl/data.frame)
#> $ country : chr [1:472] "Albania" "Andorra" "United Arab Emirates" "Argentina" ...
#> $ year : num [1:472] 2004 2004 2004 2004 2004 ...
#> $ gender : chr [1:472] "female" "female" "female" "female" ...
#> $ continent : chr [1:472] "Europe" NA "Asia" "Americas" ...
#> $ region : chr [1:472] "Southern Europe" NA "Western Asia" "South America" ...
#> $ population : num [1:472] 3090 78.9 4590 38900 2980 20200 8250 8540 10500 13400 ...
#> $ BMI : num [1:472] 25.5 26.3 29.3 27.1 26.7 26.6 25 27 25.1 21.4 ...
#> $ livercancer_newcases: num [1:472] 5.66 2.39 4.29 3.36 8.23 2.28 4.12 7.09 2.65 10.1 ...
#> $ lungcancer_newcases : num [1:472] 11 14.6 17.2 12.7 10.5 24 19.4 7.21 18.4 3.43 ...
#> $ cholesterol : num [1:472] 4.9 5.48 5.31 5.09 4.79 5.26 5.32 4.65 5.38 4.1 ...
#> $ life_expectancy : num [1:472] 76.2 81.4 69.2 75.3 73 81.2 79.8 67.3 79.3 55.3 ...
#> $ sugar : num [1:472] 58.6 NA 110 136 94.1 128 136 48.8 146 16.6 ...
#> $ health_spending : num [1:472] 6.84 7.22 2.32 8.45 4.86 8.43 10.3 7.8 10 6.69 ...
#> $ GDP_per_capita : num [1:472] 2.68 39.8 53.9 11.2 2.36 50.3 41.3 2.67 38.4 0.518 ...
#> $ HDI : num [1:472] 0.706 0.827 0.809 0.788 0.712 0.906 0.863 0.674 0.897 0.332 ...
#> $ HDI_category : chr [1:472] "high" "very high" "very high" "high" ...
#> $ smoking : num [1:472] 4 29.2 2.6 25.4 3.7 21.8 40.1 0.9 24.1 11.2 ...
#> $ food_supply : num [1:472] 2870 NA 3210 3110 2670 3100 3640 2840 3720 2460 ...
#> $ owid_edu_idx : num [1:472] 60.7 65.3 60.7 60.7 72.7 78 66 71.3 71.3 8.67 ...
#> $ average_daily_income: num [1:472] 7.7 49.3 187 15.6 6.34 46.8 49.9 10.7 47.2 2.4 ...
#> $ income_per_person : Named num [1:472] 8040 45000 80800 19400 7420 42200 49400 7110 46300 1530 ...
#> ..- attr(*, "names")= chr [1:472] NA "k" "k" "k" ...
#> $ sanitation : num [1:472] 92.3 100 97.4 89.9 88.9 100 100 72.3 99.5 14.2 ...
#> $ child_mortality : num [1:472] 19.2 5.27 9.73 16.5 23.9 5.72 4.9 52.5 4.94 153 ...
The str() function lists all variables in the dataset with their corresponding type, length, and first several values.
4.1.2 Small Ames Housing Price Data Set
Next, we will introduce the sahp dataset, which is part of the Ames Housing Price data. For your convenience, we have also included sahp in the R package r02pro. Similarly, you can type sahp to take a quick look at the dataset.
sahp
#> # A tibble: 165 × 12
#> dt_sold bedroom bathroom gar_car oa_qual liv_area lot_area house…¹ kit_q…²
#> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
#> 1 2010-03-25 3 2.5 2 6 1479 13517 2Story Good
#> 2 2009-04-10 4 3.5 2 7 2122 11492 2Story Good
#> 3 2010-01-15 3 2 1 5 1057 7922 1Story Good
#> 4 2010-04-19 3 2.5 2 5 1444 9802 2Story Average
#> 5 2010-03-22 3 2 2 6 1445 14235 1.5Fin Average
#> 6 2010-06-06 2 2.5 2 6 1888 16492 1Story Good
#> 7 2006-06-14 2 3 2 6 1072 3675 SFoyer Average
#> 8 2010-05-08 3 2 2 5 1188 12160 1Story Average
#> 9 2007-06-14 2 1 1 5 924 15783 1Story Average
#> 10 2007-09-01 5 2.5 2 5 2080 11606 2Story Fair
#> # … with 155 more rows, 3 more variables: heat_qual <chr>, central_air <chr>,
#> # sale_price <dbl>, and abbreviated variable names ¹house_style, ²kit_qual
You can see that sahp is a tibble with 165 observations and 12 housing variables, including the sale date, price, and other property quality measurements. Let’s again use the summary() function on the dataset.
summary(sahp)
#> dt_sold bedroom bathroom gar_car
#> Min. :2006-01-11 Min. :0.000 Min. :1.000 Min. :0.000
#> 1st Qu.:2007-02-27 1st Qu.:2.000 1st Qu.:1.500 1st Qu.:1.000
#> Median :2008-04-09 Median :3.000 Median :2.000 Median :2.000
#> Mean :2008-04-13 Mean :2.764 Mean :2.197 Mean :1.774
#> 3rd Qu.:2009-06-03 3rd Qu.:3.000 3rd Qu.:2.500 3rd Qu.:2.000
#> Max. :2010-07-24 Max. :5.000 Max. :4.500 Max. :4.000
#> NA's :1
#> oa_qual liv_area lot_area house_style
#> Min. : 2.000 Min. : 438 Min. : 1533 Length:165
#> 1st Qu.: 5.000 1st Qu.:1116 1st Qu.: 7288 Class :character
#> Median : 6.000 Median :1450 Median : 9260 Mode :character
#> Mean : 6.128 Mean :1481 Mean : 9832
#> 3rd Qu.: 7.000 3rd Qu.:1707 3rd Qu.:11645
#> Max. :10.000 Max. :3390 Max. :39384
#> NA's :1
#> kit_qual heat_qual central_air sale_price
#> Length:165 Length:165 Length:165 Min. : 44.0
#> Class :character Class :character Class :character 1st Qu.:130.0
#> Mode :character Mode :character Mode :character Median :157.9
#> Mean :179.9
#> 3rd Qu.:201.6
#> Max. :545.2
#> NA's :1
Again, we can use the str() function to get a detailed list of the variables along with their types.
str(sahp)
#> tibble [165 × 12] (S3: tbl_df/tbl/data.frame)
#> $ dt_sold : Date[1:165], format: "2010-03-25" "2009-04-10" ...
#> $ bedroom : num [1:165] 3 4 3 3 3 2 2 3 2 5 ...
#> $ bathroom : num [1:165] 2.5 3.5 2 2.5 2 2.5 3 2 1 2.5 ...
#> $ gar_car : num [1:165] 2 2 1 2 2 2 2 2 1 2 ...
#> $ oa_qual : num [1:165] 6 7 5 5 6 6 6 5 5 5 ...
#> $ liv_area : num [1:165] 1479 2122 1057 1444 1445 ...
#> $ lot_area : num [1:165] 13517 11492 7922 9802 14235 ...
#> $ house_style: chr [1:165] "2Story" "2Story" "1Story" "2Story" ...
#> $ kit_qual : chr [1:165] "Good" "Good" "Good" "Average" ...
#> $ heat_qual : chr [1:165] "Excellent" "Excellent" "Average" "Good" ...
#> $ central_air: chr [1:165] "Y" "Y" "Y" "Y" ...
#> $ sale_price : num [1:165] 130 NA 109 174 138 ...
Now, let’s try to answer a few questions using the sahp dataset.
4.1.2.1 Sample Analysis: are two-story houses more expensive than one-story ones?
Let’s try to answer this question by doing some analysis. First, let’s create the logical vectors corresponding to two-story and one-story houses.
Then, we create two vectors containing the prices of the two groups, respectively.
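The two steps just described can be sketched as follows. The names sale_price_2 and sale_price_1 match the summary() calls in this section; the helper names story_2 and story_1 are assumptions made for illustration, as is the choice to match the house_style values "2Story" and "1Story" exactly.

```r
library(r02pro)

# Logical vectors flagging two-story and one-story houses
# (story_2 and story_1 are illustrative names)
story_2 <- sahp$house_style == "2Story"
story_1 <- sahp$house_style == "1Story"

# Sale prices of the two groups, selected by logical indexing
sale_price_2 <- sahp$sale_price[story_2]
sale_price_1 <- sahp$sale_price[story_1]
```

Under this definition, split-level or one-and-a-half-story houses belong to neither group.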
Finally, we can run the summary() function on both vectors.
summary(sale_price_2)
#> Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
#> 55.0 137.9 174.0 197.8 231.5 545.2 1
summary(sale_price_1)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 44.0 129.0 160.0 183.0 224.2 465.0
The results from the summary() function show that all six summary statistics are larger for two-story houses than for one-story ones. Based on this sample, we can conclude that two-story houses tend to have higher sale prices than one-story ones.
4.1.3 Converting Data Types
When you import a dataset into R, some variables may not have the desired types. In this case, it would be useful to convert them into the types you want before conducting further data analyses.
a. Converting a character vector to an unordered factor
Let’s look at the variable house_style in sahp. We can see from the output of str(sahp) that it is of chr type. Let’s confirm this with is.character() and check what summary() reports.
is.character(sahp$house_style)
#> [1] TRUE
summary(sahp$house_style)
#> Length Class Mode
#> 165 character character
As briefly mentioned before, using the summary() function on a character vector doesn’t provide much useful information. Let’s find the unique values of this vector and get the frequency table.
unique(sahp$house_style)
#> [1] "2Story" "1Story" "1.5Fin" "SFoyer" "SLvl"
table(sahp$house_style)
#>
#> 1.5Fin 1Story 2Story SFoyer SLvl
#> 21 81 50 5 8
We can see that there are five house styles along with their frequencies. It is often useful to convert this type of variable into a factor. Let’s use the function factor() to perform the conversion and is.factor() to verify that the conversion succeeded.
house_style_factor <- factor(sahp$house_style)
is.factor(house_style_factor)
#> [1] TRUE
summary(house_style_factor)
#> 1.5Fin 1Story 2Story SFoyer SLvl
#> 21 81 50 5 8
For a factor, summary() returns the frequencies directly, so we no longer need to call table() to see them.
b. Converting a character vector to an ordered factor
Now, let’s take a look at another variable called kit_qual, which measures the kitchen quality. Again, let’s check the unique values.
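A minimal sketch of this check, mirroring the unique() call used for house_style earlier:

```r
library(r02pro)

# Distinct kitchen quality values that appear in sahp
unique(sahp$kit_qual)
```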
In addition to taking four different quality values, the variable has an intrinsic ordering among them. In particular, we know Fair < Average < Good < Excellent. To reflect this intrinsic order, you can convert this variable into an ordered factor using the same factor() function, setting ordered = TRUE and specifying the levels in ascending order.
kit_qual_ordered_factor <- factor(sahp$kit_qual, ordered = TRUE, levels = c("Fair",
    "Average", "Good", "Excellent")) #convert to ordered factor
summary(kit_qual_ordered_factor)
#> Fair Average Good Excellent
#> 9 85 57 14
str(kit_qual_ordered_factor)
#> Ord.factor w/ 4 levels "Fair"<"Average"<..: 3 3 3 2 2 3 2 2 2 1 ...
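One practical benefit of an ordered factor, sketched here as an illustration: comparison operators respect the level order, so you can directly select kitchens rated at least Good. The name at_least_good is hypothetical.

```r
library(r02pro)

kit_qual_ordered_factor <- factor(sahp$kit_qual, ordered = TRUE,
                                  levels = c("Fair", "Average", "Good", "Excellent"))

# Comparisons follow the level order Fair < Average < Good < Excellent
at_least_good <- kit_qual_ordered_factor >= "Good"

# Number of kitchens rated Good or Excellent
sum(at_least_good)
```

Given the frequencies reported by summary() above (57 Good and 14 Excellent), this count should be 71. The same comparison on an unordered factor would not be meaningful, which is why the ordered = TRUE setting matters.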
c. Converting a character vector to a logical vector
Lastly, let’s look at the variable central_air, which indicates whether a house has central air conditioning. As before, let’s get the unique elements.
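A minimal sketch of this check, using the same unique() call as before:

```r
library(r02pro)

# Distinct values of the central air indicator in sahp
unique(sahp$central_air)
```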
Intuitively, you can create a logical vector representing whether the house has central AC or not.
central_air_logi <- sahp$central_air == "Y"
summary(central_air_logi)
#> Mode FALSE TRUE
#> logical 16 149
str(central_air_logi)
#> logi [1:165] TRUE TRUE TRUE TRUE TRUE TRUE ...
Another scenario is creating an additional variable from existing ones. For example, we know that the overall quality (oa_qual) of a house ranges from 2 to 10. If we want to create a new variable representing houses of good quality, with an oa_qual greater than 5, we can do so by creating a new logical variable named good_qual.
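A minimal sketch of this step, assuming good_qual is kept as a standalone vector rather than added as a column of sahp:

```r
library(r02pro)

# Check the range of the overall quality, ignoring missing values
range(sahp$oa_qual, na.rm = TRUE)

# TRUE for houses whose overall quality is greater than 5
good_qual <- sahp$oa_qual > 5
summary(good_qual)
```

Any house with a missing oa_qual gets NA in good_qual, and summary() reports such cases separately.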
4.1.4 Recover Modified Values
When you are working with a dataset provided by a package, you may accidentally modify some values in the original dataset.
In this situation, there’s no need to panic: you can easily restore the dataset to its “factory” setting (i.e., the original version inside the package).
To do this, you just need to call the data() function with the dataset name as its argument. Let’s modify one value of sahp and then recover the dataset.
That said, we strongly recommend developing the habit of saving an independent copy of a dataset under a different object name before modifying it, especially for datasets provided by a loaded package. We also encourage you to annotate your code with clear comments as you go. These habits make your code easier to follow and help you avoid unnecessary errors.
sahp[1, 2] #get the original value
#> # A tibble: 1 × 1
#> bedroom
#> <dbl>
#> 1 3
sahp[1, 2] <- 5 #modify the value
sahp[1, 2] #verify the modified value
#> # A tibble: 1 × 1
#> bedroom
#> <dbl>
#> 1 5
data(sahp) #recover the data
sahp[1, 2] #verify the value is recovered
#> # A tibble: 1 × 1
#> bedroom
#> <dbl>
#> 1 3
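The recommendation to work on an independent copy can be sketched as follows. The object name sahp_copy is hypothetical; any name different from sahp would do.

```r
library(r02pro)

# Work on an independent copy rather than on the packaged dataset
sahp_copy <- sahp
sahp_copy[1, 2] <- 5   # modify the copy only

sahp_copy[1, 2]        # the copy holds the modified value
sahp[1, 2]             # the original dataset is unchanged
```

With this approach the original sahp never needs to be recovered, because only the copy is altered.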