3.3 Data Frame
So far, we have learned about vectors (Chapter 2), matrices (Section 3.1), and arrays (Section 3.2). These three object types share an important feature: they all consist of elements of the same type, namely numeric, character, or logical. In real applications, it is common to have mixed variable types. To accommodate this, let’s introduce a new object type, namely the data frame.
3.3.1 Introduction to Data Frames
To create a data frame, you can use the data.frame() function with a collection of vectors of the same length. Let’s see an example of some health conditions of a sheep and a pig over the years 2019, 2020, and 2021.
animal <- rep(c("sheep", "pig"), c(3,3))
year <- rep(2019:2021, 2)
weight <- c(110, 120, 140, NA, 300, 800)
height <- c(2.2, 2.4, 2.7, 2, 2.1, 2.3)
condition <- c("excellent", "good", NA, "excellent", "good", "average")
condition <- factor(condition, ordered = TRUE, levels = c("average", "good", "excellent"))
healthy <- c(rep(TRUE, 5), FALSE)
my_data_frame <- data.frame(animal, year, weight, height, condition, healthy)
my_data_frame
#> animal year weight height condition healthy
#> 1 sheep 2019 110 2.2 excellent TRUE
#> 2 sheep 2020 120 2.4 good TRUE
#> 3 sheep 2021 140 2.7 <NA> TRUE
#> 4 pig 2019 NA 2.0 excellent TRUE
#> 5 pig 2020 300 2.1 good TRUE
#> 6 pig 2021 800 2.3 average FALSE
The data frame my_data_frame has 6 columns, each of which represents one variable. The variables are of different types: animal is a character vector, year is an integer, both weight and height are doubles, condition is an ordered factor, and healthy is logical.
This kind of data representation is impossible using matrices since the coercion rule will apply, converting everything into characters. Let’s combine everything into a matrix and check its value.
my_mat <- cbind(animal, year, weight, height, condition, healthy)
my_mat
#> animal year weight height condition healthy
#> [1,] "sheep" "2019" "110" "2.2" "3" "TRUE"
#> [2,] "sheep" "2020" "120" "2.4" "2" "TRUE"
#> [3,] "sheep" "2021" "140" "2.7" NA "TRUE"
#> [4,] "pig" "2019" NA "2" "3" "TRUE"
#> [5,] "pig" "2020" "300" "2.1" "2" "TRUE"
#> [6,] "pig" "2021" "800" "2.3" "1" "FALSE"
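To confirm the coercion, you can compare the storage type of the matrix with the class of each column in the data frame. The following is a minimal sketch that rebuilds the objects from this section; it assumes R 4.0 or later, where data.frame() keeps strings as character vectors by default.

```r
# Rebuild the example vectors from this section
animal <- rep(c("sheep", "pig"), c(3, 3))
year <- rep(2019:2021, 2)
weight <- c(110, 120, 140, NA, 300, 800)
height <- c(2.2, 2.4, 2.7, 2, 2.1, 2.3)
condition <- factor(
  c("excellent", "good", NA, "excellent", "good", "average"),
  ordered = TRUE, levels = c("average", "good", "excellent")
)
healthy <- c(rep(TRUE, 5), FALSE)

my_data_frame <- data.frame(animal, year, weight, height, condition, healthy)
my_mat <- cbind(animal, year, weight, height, condition, healthy)

# A matrix has one type shared by all of its elements
typeof(my_mat)
#> [1] "character"

# A data frame keeps a separate class for each column
# (class() may return several values, so keep only the first one)
sapply(my_data_frame, function(col) class(col)[1])
```

Note that the matrix stores the ordered factor condition as its integer codes converted to strings, so the level labels are lost as well.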
In the process of creating data frames, you can also name each column.
my_data_frame2 <- data.frame(ani = animal, y = year, w = weight, h = height, con = condition, hea = healthy)
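To check the column names that were assigned, you can use the names() function. The sketch below rebuilds the example vectors so that it runs on its own.

```r
# Rebuild the example vectors from this section
animal <- rep(c("sheep", "pig"), c(3, 3))
year <- rep(2019:2021, 2)
weight <- c(110, 120, 140, NA, 300, 800)
height <- c(2.2, 2.4, 2.7, 2, 2.1, 2.3)
condition <- factor(
  c("excellent", "good", NA, "excellent", "good", "average"),
  ordered = TRUE, levels = c("average", "good", "excellent")
)
healthy <- c(rep(TRUE, 5), FALSE)

# Name each column while creating the data frame
my_data_frame2 <- data.frame(ani = animal, y = year, w = weight,
                             h = height, con = condition, hea = healthy)

# Inspect the column names
names(my_data_frame2)
#> [1] "ani" "y"   "w"   "h"   "con" "hea"
```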
After creating the data frame, it is useful to examine its class using the class() function and its structure using the str() function.
class(my_data_frame)
#> [1] "data.frame"
str(my_data_frame)
#> 'data.frame': 6 obs. of 6 variables:
#> $ animal : chr "sheep" "sheep" "sheep" "pig" ...
#> $ year : int 2019 2020 2021 2019 2020 2021
#> $ weight : num 110 120 140 NA 300 800
#> $ height : num 2.2 2.4 2.7 2 2.1 2.3
#> $ condition: Ord.factor w/ 3 levels "average"<"good"<..: 3 2 NA 3 2 1
#> $ healthy : logi TRUE TRUE TRUE TRUE TRUE FALSE
The str() function tells us the data frame has 6 observations and 6 variables, along with the type and the first few values of each variable. From the output, you may be puzzled by the $ symbol before each variable name. In fact, you can use $ followed by a variable name to extract the corresponding column.
my_data_frame$animal
#> [1] "sheep" "sheep" "sheep" "pig" "pig" "pig"
my_data_frame$weight
#> [1] 110 120 140 NA 300 800
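A column extracted with $ is an ordinary vector, so you can pass it to any function that works on vectors. The sketch below illustrates this; the [[ ]] form is shown only as an equivalent way to extract a column by name and is not otherwise used in this section.

```r
# Rebuild the example data frame from this section
my_data_frame <- data.frame(
  animal = rep(c("sheep", "pig"), c(3, 3)),
  year = rep(2019:2021, 2),
  weight = c(110, 120, 140, NA, 300, 800),
  height = c(2.2, 2.4, 2.7, 2, 2.1, 2.3),
  condition = factor(c("excellent", "good", NA, "excellent", "good", "average"),
                     ordered = TRUE, levels = c("average", "good", "excellent")),
  healthy = c(rep(TRUE, 5), FALSE)
)

# The extracted column is a plain numeric vector
weight_col <- my_data_frame$weight
is.numeric(weight_col)
#> [1] TRUE

# Vector functions apply directly; na.rm = TRUE skips the missing value
mean(my_data_frame$weight, na.rm = TRUE)
#> [1] 294

# Extracting by name with [[ ]] gives the same vector
identical(my_data_frame$weight, my_data_frame[["weight"]])
#> [1] TRUE
```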
In Section 2.5, we introduced the very useful function summary(), which gives us important summary statistics for a vector. Using summary() on a data frame, you get the summary statistics for each variable.
summary(my_data_frame)
#> animal year weight height condition
#> Length:6 Min. :2019 Min. :110 Min. :2.000 average :1
#> Class :character 1st Qu.:2019 1st Qu.:120 1st Qu.:2.125 good :2
#> Mode :character Median :2020 Median :140 Median :2.250 excellent:2
#> Mean :2020 Mean :294 Mean :2.283 NA's :1
#> 3rd Qu.:2021 3rd Qu.:300 3rd Qu.:2.375
#> Max. :2021 Max. :800 Max. :2.700
#> NA's :1
#> healthy
#> Mode :logical
#> FALSE:1
#> TRUE :5
#>
#>
#>
#>
From the results, you can see that depending on the variable type, you get different forms of summary.
In the object my_data_frame, there are two missing values represented by NA. To remove the observations (rows) with NA values, you can use the na.omit() function on the data frame.
my_df_nona <- na.omit(my_data_frame)
my_df_nona
#> animal year weight height condition healthy
#> 1 sheep 2019 110 2.2 excellent TRUE
#> 2 sheep 2020 120 2.4 good TRUE
#> 5 pig 2020 300 2.1 good TRUE
#> 6 pig 2021 800 2.3 average FALSE
You can see that the 3rd and 4th rows are removed since each of them contains a missing value.
An alternative approach to remove all rows with missing observations is to first use the complete.cases() function to get a logical vector of whether a row has missing elements, and then use data frame subsetting.
complete_ind <- complete.cases(my_data_frame)
my_data_frame[complete_ind, ]
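To see that the two approaches keep the same rows, you can inspect the logical vector and compare the row names of the two results. This is a minimal sketch that rebuilds the example data frame so that it runs on its own.

```r
# Rebuild the example data frame from this section
my_data_frame <- data.frame(
  animal = rep(c("sheep", "pig"), c(3, 3)),
  year = rep(2019:2021, 2),
  weight = c(110, 120, 140, NA, 300, 800),
  height = c(2.2, 2.4, 2.7, 2, 2.1, 2.3),
  condition = factor(c("excellent", "good", NA, "excellent", "good", "average"),
                     ordered = TRUE, levels = c("average", "good", "excellent")),
  healthy = c(rep(TRUE, 5), FALSE)
)

# TRUE for rows without any missing value, FALSE otherwise
complete_ind <- complete.cases(my_data_frame)
complete_ind
#> [1]  TRUE  TRUE FALSE FALSE  TRUE  TRUE

# Both approaches keep rows 1, 2, 5, and 6
by_subsetting <- my_data_frame[complete_ind, ]
by_na_omit <- na.omit(my_data_frame)
identical(rownames(by_subsetting), rownames(by_na_omit))
#> [1] TRUE
```

One practical difference is that complete_ind can be reused, for example with !complete_ind to look at the incomplete rows instead.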
3.3.2 Adding Observations or Variables in Data Frames
Sometimes, you may want to add observations or variables to an existing data frame.
To add observations, put them into a new data frame with the same column names, and then combine the two data frames using the rbind() function.
new_obs <- data.frame(animal = "pig", year = 2018, weight = 200, height = 1.9, condition = "excellent", healthy = TRUE)
rbind(my_data_frame, new_obs)
#> animal year weight height condition healthy
#> 1 sheep 2019 110 2.2 excellent TRUE
#> 2 sheep 2020 120 2.4 good TRUE
#> 3 sheep 2021 140 2.7 <NA> TRUE
#> 4 pig 2019 NA 2.0 excellent TRUE
#> 5 pig 2020 300 2.1 good TRUE
#> 6 pig 2021 800 2.3 average FALSE
#> 7 pig 2018 200 1.9 excellent TRUE
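Note that rbind() returns a new data frame and leaves my_data_frame unchanged, so you need to assign the result if you want to keep it. The sketch below illustrates this; the name my_data_frame_ext is hypothetical and chosen only for this example.

```r
# Rebuild the example data frame from this section
my_data_frame <- data.frame(
  animal = rep(c("sheep", "pig"), c(3, 3)),
  year = rep(2019:2021, 2),
  weight = c(110, 120, 140, NA, 300, 800),
  height = c(2.2, 2.4, 2.7, 2, 2.1, 2.3),
  condition = factor(c("excellent", "good", NA, "excellent", "good", "average"),
                     ordered = TRUE, levels = c("average", "good", "excellent")),
  healthy = c(rep(TRUE, 5), FALSE)
)

# The new observation uses the same column names as my_data_frame
new_obs <- data.frame(animal = "pig", year = 2018, weight = 200,
                      height = 1.9, condition = "excellent", healthy = TRUE)

# Store the combined result under a new (hypothetical) name
my_data_frame_ext <- rbind(my_data_frame, new_obs)

nrow(my_data_frame)      # the original still has 6 rows
nrow(my_data_frame_ext)  # the combined data frame has 7 rows
```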
To add an additional variable to the existing data frame, you just need to add a new field to the data frame.
my_data_frame$age <- my_data_frame$year - 2015
my_data_frame
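As a column-wise counterpart of rbind(), the cbind() function can also attach a new variable to a data frame. The following sketch compares the two approaches on copies of the example data; the names df_dollar and df_cbind are hypothetical.

```r
# Rebuild the example data frame from this section
my_data_frame <- data.frame(
  animal = rep(c("sheep", "pig"), c(3, 3)),
  year = rep(2019:2021, 2),
  weight = c(110, 120, 140, NA, 300, 800),
  height = c(2.2, 2.4, 2.7, 2, 2.1, 2.3),
  condition = factor(c("excellent", "good", NA, "excellent", "good", "average"),
                     ordered = TRUE, levels = c("average", "good", "excellent")),
  healthy = c(rep(TRUE, 5), FALSE)
)

# Approach 1: assign a new column with $ (modifies the copy in place)
df_dollar <- my_data_frame
df_dollar$age <- df_dollar$year - 2015

# Approach 2: bind a new named column with cbind() (returns a new data frame)
df_cbind <- cbind(my_data_frame, age = my_data_frame$year - 2015)

# Both results have the same columns and the same age values
names(df_cbind)
identical(df_dollar$age, df_cbind$age)
#> [1] TRUE
```

In both cases, the new vector should have one value per row of the data frame.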
3.3.3 Subsetting Data Frames
Since a data frame is a two-dimensional object, subsetting a data frame is very similar to subsetting a matrix.
a. using indices to do data frame subsetting
The first method for data frame subsetting is to specify the desired row indices and column indices, separated by a comma. For example, we can extract the (1, 3) and (2, 4) elements of my_data_frame using the following code.
my_data_frame[1, 3]
#> [1] 110
my_data_frame[2, 4]
#> [1] 2.4
Similar to matrix subsetting, if you omit the indices of one dimension, R will keep everything along that dimension. You can also use negative indices to keep everything except the provided indices. Let’s see some examples.
my_data_frame[2, ]
#> animal year weight height condition healthy age
#> 2 sheep 2020 120 2.4 good TRUE 5
my_data_frame[, 2]
#> [1] 2019 2020 2021 2019 2020 2021
my_data_frame[2] #a single index corresponds to the columns
#> year
#> 1 2019
#> 2 2020
#> 3 2021
#> 4 2019
#> 5 2020
#> 6 2021
my_data_frame[c(1,3), -c(3,4)]
#> animal year condition healthy age
#> 1 sheep 2019 excellent TRUE 4
#> 3 sheep 2021 <NA> TRUE 6
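In the examples above, selecting a single column with my_data_frame[, 2] returns a vector, while my_data_frame[2] returns a data frame. If you prefer the two-index form but want to keep the data frame structure, one option is the drop = FALSE argument. The sketch below rebuilds the example data frame (without the age column) so that it runs on its own.

```r
# Rebuild the example data frame from this section
my_data_frame <- data.frame(
  animal = rep(c("sheep", "pig"), c(3, 3)),
  year = rep(2019:2021, 2),
  weight = c(110, 120, 140, NA, 300, 800),
  height = c(2.2, 2.4, 2.7, 2, 2.1, 2.3),
  condition = factor(c("excellent", "good", NA, "excellent", "good", "average"),
                     ordered = TRUE, levels = c("average", "good", "excellent")),
  healthy = c(rep(TRUE, 5), FALSE)
)

# Two-index form: a single column is simplified to a vector
year_vec <- my_data_frame[, 2]
is.data.frame(year_vec)
#> [1] FALSE

# drop = FALSE keeps the result as a one-column data frame
year_df <- my_data_frame[, 2, drop = FALSE]
is.data.frame(year_df)
#> [1] TRUE
names(year_df)
#> [1] "year"
```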
b. using column names to do data frame subsetting
Since data frames usually have column names, you can do subsetting using multiple column names.
my_data_frame[, c("animal", "weight")]
#> animal weight
#> 1 sheep 110
#> 2 sheep 120
#> 3 sheep 140
#> 4 pig NA
#> 5 pig 300
#> 6 pig 800
c. using logical vectors to do data frame subsetting
Using logical vectors to do data frame subsetting is very useful. Suppose we want to find the condition of the pig in year 2021.
is_2021 <- my_data_frame$year == 2021
is_pig <- my_data_frame$animal == "pig"
my_data_frame$condition[is_2021 & is_pig]
#> [1] average
#> Levels: average < good < excellent
Now, let’s say we want to extract all the observations with an excellent condition.
my_data_frame[my_data_frame$condition == "excellent", ]
#> animal year weight height condition healthy age
#> 1 sheep 2019 110 2.2 excellent TRUE 4
#> NA <NA> NA NA NA <NA> NA NA
#> 4 pig 2019 NA 2.0 excellent TRUE 4
my_data_frame[which(my_data_frame$condition == "excellent"), ] #remove the NA row
#> animal year weight height condition healthy age
#> 1 sheep 2019 110 2.2 excellent TRUE 4
#> 4 pig 2019 NA 2.0 excellent TRUE 4
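The extra NA row in the first result appears because comparing a missing value with "excellent" gives NA rather than FALSE, and subsetting with an NA index produces a row of NA values. The which() function returns only the positions that are TRUE, which is why the NA row disappears. The following sketch shows each step; it rebuilds the example data frame (without the age column) so that it runs on its own.

```r
# Rebuild the example data frame from this section
my_data_frame <- data.frame(
  animal = rep(c("sheep", "pig"), c(3, 3)),
  year = rep(2019:2021, 2),
  weight = c(110, 120, 140, NA, 300, 800),
  height = c(2.2, 2.4, 2.7, 2, 2.1, 2.3),
  condition = factor(c("excellent", "good", NA, "excellent", "good", "average"),
                     ordered = TRUE, levels = c("average", "good", "excellent")),
  healthy = c(rep(TRUE, 5), FALSE)
)

# The third condition is missing, so the comparison gives NA there
is_excellent <- my_data_frame$condition == "excellent"
is_excellent
#> [1]  TRUE FALSE    NA  TRUE FALSE FALSE

# which() keeps only the positions that are TRUE
which(is_excellent)
#> [1] 1 4

# An equivalent logical vector that treats NA as FALSE
keep <- is_excellent & !is.na(is_excellent)
my_data_frame[keep, ]
```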
3.3.4 Exercises
Consider the following data frame,
animal <- rep(c("sheep", "pig"), c(3,3))
weight <- c(110, NA, 140, NA, 300, 800)
condition <- c("excellent", "good", NA, "excellent", "good", "average")
healthy <- c(rep(TRUE, 5), FALSE)
my_data_frame <- data.frame(animal, weight, condition, healthy)
my_data_frame
#> animal weight condition healthy
#> 1 sheep 110 excellent TRUE
#> 2 sheep NA good TRUE
#> 3 sheep 140 <NA> TRUE
#> 4 pig NA excellent TRUE
#> 5 pig 300 good TRUE
#> 6 pig 800 average FALSE
- Generate a data frame with the rows of my_data_frame containing complete observations.
- In my_data_frame, fill in the missing values in weight by the median of its non-missing values.
- Add the following observation to my_data_frame: animal = "pig", weight = 900, condition = "average", and healthy = FALSE.
- Extract the sub-data-frame of my_data_frame that contains the columns animal and healthy and the rows that have weight less than 400 and condition equal to "good" or "excellent".