3.3 Data Frame
So far, we have learned atomic vectors (Chapter 2), matrices (Section 3.1), and arrays (Section 3.2). Regardless of their dimensionality, these three object types share an important feature: they all contain elements of the same type (i.e., numeric, character, or logical). In real applications, it is common to have mixed variable types. To accommodate this, let’s introduce a new two-dimensional object type: the data frame.
3.3.1 Introduction to Data Frames
To create a data frame, you can use the data.frame() function to combine several vectors of the same length into a single object. Let’s see an example of some health conditions of a sheep and a pig over the years 2019, 2020, and 2021.
animal <- rep(c("sheep", "pig"), c(3, 3))
year <- rep(2019:2021, 2)
weight <- c(110, 120, 140, NA, 300, 800)
height <- c(2.2, 2.4, 2.7, 2, 2.1, 2.3)
condition <- c("excellent", "good", NA, "excellent", "good", "average")
condition <- factor(condition, ordered = TRUE, levels = c("average", "good", "excellent"))
healthy <- c(rep(TRUE, 5), FALSE)
my_data_frame <- data.frame(animal, year, weight, height, condition, healthy)
my_data_frame
#> animal year weight height condition healthy
#> 1 sheep 2019 110 2.2 excellent TRUE
#> 2 sheep 2020 120 2.4 good TRUE
#> 3 sheep 2021 140 2.7 <NA> TRUE
#> 4 pig 2019 NA 2.0 excellent TRUE
#> 5 pig 2020 300 2.1 good TRUE
#> 6 pig 2021 800 2.3 average FALSE
The data frame my_data_frame has 6 columns, each of which represents one variable. The variables are of different types: animal is character (since R 4.0.0, data.frame() no longer converts character vectors to factors by default), year is integer, both weight and height are doubles, condition is an ordered factor, and healthy is logical.
After creating the data frame, it is useful to examine its class using the class() function and its structure using the str() function.
class(my_data_frame)
#> [1] "data.frame"
str(my_data_frame)
#> 'data.frame': 6 obs. of 6 variables:
#> $ animal : chr "sheep" "sheep" "sheep" "pig" ...
#> $ year : int 2019 2020 2021 2019 2020 2021
#> $ weight : num 110 120 140 NA 300 800
#> $ height : num 2.2 2.4 2.7 2 2.1 2.3
#> $ condition: Ord.factor w/ 3 levels "average"<"good"<..: 3 2 NA 3 2 1
#> $ healthy : logi TRUE TRUE TRUE TRUE TRUE FALSE
The str() function tells us the data frame has 6 observations and 6 variables, along with the type and the first few values of each variable. From the output, you may be puzzled by the $ symbol before each variable name. In fact, you can easily extract the column corresponding to a variable with $ followed by its name.
my_data_frame$animal
#> [1] "sheep" "sheep" "sheep" "pig" "pig" "pig"
my_data_frame$weight
#> [1] 110 120 140 NA 300 800
This kind of data representation is impossible with matrices, since the coercion rule applies and converts everything into character strings. Let’s combine everything into a matrix and check its values. Note that the factor condition is reduced to its integer codes in the process.
my_mat <- cbind(animal, year, weight, height, condition, healthy)
my_mat
#> animal year weight height condition healthy
#> [1,] "sheep" "2019" "110" "2.2" "3" "TRUE"
#> [2,] "sheep" "2020" "120" "2.4" "2" "TRUE"
#> [3,] "sheep" "2021" "140" "2.7" NA "TRUE"
#> [4,] "pig" "2019" NA "2" "3" "TRUE"
#> [5,] "pig" "2020" "300" "2.1" "2" "TRUE"
#> [6,] "pig" "2021" "800" "2.3" "1" "FALSE"
In the process of creating data frames, you can also name each column.
my_data_frame2 <- data.frame(ani = animal, y = year, w = weight, h = height, con = condition,
hea = healthy)
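A quick way to confirm which column names a data frame ended up with is the names() function. The following is a minimal sketch that rebuilds only three of the example vectors so that it runs on its own; default_names and custom_names are illustrative object names, not part of the original example.

```r
# Rebuild three of the example vectors so the block is self-contained
animal <- rep(c("sheep", "pig"), c(3, 3))
year <- rep(2019:2021, 2)
weight <- c(110, 120, 140, NA, 300, 800)

# Without explicit names, data.frame() uses the names of the vectors
default_names <- data.frame(animal, year, weight)
names(default_names)

# With name = value pairs, the supplied names are used instead
custom_names <- data.frame(ani = animal, y = year, w = weight)
names(custom_names)
```

The columns of custom_names are then accessed with the new names, for example custom_names$w.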
In Section 2.8, we introduced the very useful function summary(), which returns important summary statistics for a vector. Using summary() on a data frame, you get the summary statistics for each variable.
summary(my_data_frame)
#> animal year weight height condition
#> Length:6 Min. :2019 Min. :110 Min. :2.000 average :1
#> Class :character 1st Qu.:2019 1st Qu.:120 1st Qu.:2.125 good :2
#> Mode :character Median :2020 Median :140 Median :2.250 excellent:2
#> Mean :2020 Mean :294 Mean :2.283 NA's :1
#> 3rd Qu.:2021 3rd Qu.:300 3rd Qu.:2.375
#> Max. :2021 Max. :800 Max. :2.700
#> NA's :1
#> healthy
#> Mode :logical
#> FALSE:1
#> TRUE :5
#>
#>
#>
#>
From the results, you can see that depending on the variable type, you get different forms of summary.
In the real world, it is very common to encounter missing values, and you may want to discard the observations that contain them. In the object my_data_frame, there are two missing values represented by NA. To remove the observations (rows) with NA values, you can use the na.omit() function on the data frame.
my_df_nona <- na.omit(my_data_frame)
my_df_nona
#> animal year weight height condition healthy
#> 1 sheep 2019 110 2.2 excellent TRUE
#> 2 sheep 2020 120 2.4 good TRUE
#> 5 pig 2020 300 2.1 good TRUE
#> 6 pig 2021 800 2.3 average FALSE
You can see that the 3rd and 4th rows are removed since each of them has a missing value.
An alternative approach to removing all rows with missing observations is to first use the complete.cases() function to get a logical vector indicating whether each row is complete (i.e., has no missing elements), and then use data frame subsetting.
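The complete.cases() approach can be sketched as follows. The block rebuilds the example data frame so that it runs on its own; is_complete and my_df_complete are illustrative names introduced here.

```r
# Rebuild the example data frame so the block is self-contained
animal <- rep(c("sheep", "pig"), c(3, 3))
year <- rep(2019:2021, 2)
weight <- c(110, 120, 140, NA, 300, 800)
height <- c(2.2, 2.4, 2.7, 2, 2.1, 2.3)
condition <- factor(c("excellent", "good", NA, "excellent", "good", "average"),
                    ordered = TRUE, levels = c("average", "good", "excellent"))
healthy <- c(rep(TRUE, 5), FALSE)
my_data_frame <- data.frame(animal, year, weight, height, condition, healthy)

# TRUE for rows without any NA, FALSE for rows with at least one NA
is_complete <- complete.cases(my_data_frame)
is_complete

# Keep only the complete rows; leaving the column index empty keeps all columns
my_df_complete <- my_data_frame[is_complete, ]
my_df_complete
```

This keeps the same rows as na.omit(). One difference is that na.omit() also records the removed rows in an attribute of its result, while plain subsetting does not.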
3.3.2 Adding Observations or Variables in Data Frames
Sometimes, you may want to add entries along the 1st dimension (i.e., rows/observations) or the 2nd dimension (i.e., columns/variables) of an existing data frame.
To add observations, you need to put them into a new data frame with the same column names, and then use the rbind() function.
new_obs <- data.frame(animal = "pig", year = 2018, weight = 200, height = 1.9, condition = "excellent",
healthy = TRUE)
rbind(my_data_frame, new_obs)
#> animal year weight height condition healthy
#> 1 sheep 2019 110 2.2 excellent TRUE
#> 2 sheep 2020 120 2.4 good TRUE
#> 3 sheep 2021 140 2.7 <NA> TRUE
#> 4 pig 2019 NA 2.0 excellent TRUE
#> 5 pig 2020 300 2.1 good TRUE
#> 6 pig 2021 800 2.3 average FALSE
#> 7 pig 2018 200 1.9 excellent TRUE
To add an additional variable to the existing data frame, you just need to use the $ operator to name the new variable and use the <- operator to assign values to the new variable.
my_data_frame$age <- my_data_frame$year - 2015
my_data_frame
#> animal year weight height condition healthy age
#> 1 sheep 2019 110 2.2 excellent TRUE 4
#> 2 sheep 2020 120 2.4 good TRUE 5
#> 3 sheep 2021 140 2.7 <NA> TRUE 6
#> 4 pig 2019 NA 2.0 excellent TRUE 4
#> 5 pig 2020 300 2.1 good TRUE 5
#> 6 pig 2021 800 2.3 average FALSE 6
3.3.3 Subsetting and Modifying Data Frames
Because a data frame is a two-dimensional object, subsetting and modifying it is very similar to the corresponding operations on matrices.
a. using indices to do data frame subsetting and modifying
The first method for data frame subsetting is to specify the desired row indices and column indices, separated by a comma. For example, my_data_frame[1, 3] and my_data_frame[2, 4] extract the (1, 3) and (2, 4) elements of my_data_frame.
To modify a specific element of my_data_frame, we can assign the desired value to it, for example my_data_frame[1, 3] <- 115.
Printing my_data_frame afterwards confirms that the corresponding value of the data frame has been modified.
Similar to matrix subsetting, if you omit the indices of one dimension, R will keep everything along that dimension. You can also use negative indices to keep everything except the provided indices. Let’s see some examples.
my_data_frame[2, ]
#> animal year weight height condition healthy age
#> 2 sheep 2020 120 2.4 good TRUE 5
my_data_frame[, 3]
#> [1] 115 120 140 NA 300 800
my_data_frame[3] #a single index corresponds to the columns
#> weight
#> 1 115
#> 2 120
#> 3 140
#> 4 NA
#> 5 300
#> 6 800
my_data_frame[c(1, 3), -c(2, 3, 4)]
#> animal condition healthy age
#> 1 sheep excellent TRUE 4
#> 3 sheep <NA> TRUE 6
Similarly, we can modify multiple values at the same time.
my_data_frame[1:3, 3] <- c(120, 123, 141)
my_data_frame
#> animal year weight height condition healthy age
#> 1 sheep 2019 120 2.2 excellent TRUE 4
#> 2 sheep 2020 123 2.4 good TRUE 5
#> 3 sheep 2021 141 2.7 <NA> TRUE 6
#> 4 pig 2019 NA 2.0 excellent TRUE 4
#> 5 pig 2020 300 2.1 good TRUE 5
#> 6 pig 2021 800 2.3 average FALSE 6
Note that you generally want to supply a vector with the same length as the number of elements to be modified, or the recycling rule will apply.
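To see what recycling means here, consider the following minimal sketch. It uses a small stand-alone data frame with made-up values; df, id, w, and result are hypothetical names used only for illustration.

```r
# A small stand-alone data frame (hypothetical values, for illustration only)
df <- data.frame(id = 1:4, w = c(10, 20, 30, 40))

# One value for four cells: the value is recycled to every selected row
df[1:4, "w"] <- 0
df$w

# Two values for four cells: the values are recycled in order (1, 2, 1, 2)
df[1:4, "w"] <- c(1, 2)
df$w

# Two values for three cells: 3 is not a multiple of 2, so R stops with an error
result <- tryCatch({
  df[1:3, "w"] <- c(1, 2)
  "no error"
}, error = function(e) "error")
result
```

Recycling therefore succeeds silently when the number of cells is a multiple of the number of supplied values, which can hide a mistake. Supplying exactly one value per cell avoids this.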
b. using column names to do data frame subsetting and modifying
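Since data frames usually have column names, you can do subsetting using multiple column names. For example, my_data_frame[1:2, c("animal", "weight")] extracts the animal and weight columns of the first two rows.
Let’s try to modify this sub data frame.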
my_data_frame[1:2, c("animal", "weight")] <- data.frame(animal = c("monkey", "bear"),
weight = c(110, 300))
my_data_frame
#> animal year weight height condition healthy age
#> 1 monkey 2019 110 2.2 excellent TRUE 4
#> 2 bear 2020 300 2.4 good TRUE 5
#> 3 sheep 2021 141 2.7 <NA> TRUE 6
#> 4 pig 2019 NA 2.0 excellent TRUE 4
#> 5 pig 2020 300 2.1 good TRUE 5
#> 6 pig 2021 800 2.3 average FALSE 6
c. using logical vectors to do data frame subsetting and modifying
Using logical vectors to do data frame subsetting can come in handy. Suppose we want to find the condition of the pig in year 2021.
is_2021 <- my_data_frame$year == 2021
is_pig <- my_data_frame$animal == "pig"
my_data_frame[is_2021 & is_pig, "condition"]
#> [1] average
#> Levels: average < good < excellent
Now, let’s say we want to extract all the observations with an excellent condition.
my_data_frame[my_data_frame$condition == "excellent", ]
#> animal year weight height condition healthy age
#> 1 monkey 2019 110 2.2 excellent TRUE 4
#> NA <NA> NA NA NA <NA> NA NA
#> 4 pig 2019 NA 2.0 excellent TRUE 4
my_data_frame[which(my_data_frame$condition == "excellent"), ] #remove the NA row
#> animal year weight height condition healthy age
#> 1 monkey 2019 110 2.2 excellent TRUE 4
#> 4 pig 2019 NA 2.0 excellent TRUE 4
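The extra NA row appears because comparing a missing value with == yields NA rather than FALSE, and an NA in a logical row index produces a row of NA values. The sketch below shows this on a small stand-alone data frame (df, id, and is_excellent are hypothetical names). It also shows the %in% operator as an alternative to which(); this is a swapped-in technique, not the one used in the text above.

```r
# A small stand-alone data frame with one missing condition
condition <- factor(c("excellent", "good", NA, "excellent"),
                    ordered = TRUE, levels = c("average", "good", "excellent"))
df <- data.frame(id = 1:4, condition)

# == propagates NA: the third element is NA rather than FALSE
df$condition == "excellent"

# which() returns only the positions that are TRUE, so the NA is dropped
which(df$condition == "excellent")

# %in% never returns NA, so its result can be used directly as a row index
is_excellent <- df$condition %in% "excellent"
df[is_excellent, ]
```

Both which() and %in% give the same two rows here. Which one reads better is largely a matter of taste.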
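Let’s try to modify the condition of the pig in year 2021 to “excellent”. Assigning to the same logical subset does this: my_data_frame[is_2021 & is_pig, "condition"] <- "excellent". The assignment works because "excellent" is one of the existing levels of the factor condition.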
3.3.4 Exercises
Consider the following data frame,
animal <- rep(c("sheep", "pig"), c(3, 3))
weight <- c(110, NA, 140, NA, 300, 800)
condition <- c("excellent", "good", NA, "excellent", "good", "average")
healthy <- c(rep(TRUE, 5), FALSE)
my_data_frame <- data.frame(animal, weight, condition, healthy)
my_data_frame
#> animal weight condition healthy
#> 1 sheep 110 excellent TRUE
#> 2 sheep NA good TRUE
#> 3 sheep 140 <NA> TRUE
#> 4 pig NA excellent TRUE
#> 5 pig 300 good TRUE
#> 6 pig 800 average FALSE
- Generate a data frame with the rows of my_data_frame containing complete observations.
- In my_data_frame, fill in the missing values in weight by the median of its non-missing values.
- Add the following observation to my_data_frame: animal = "pig", weight = 900, condition = "average", and healthy = FALSE.
- Extract the sub-data-frame of my_data_frame that contains the columns animal and healthy and the rows that have weight less than 400 and condition equal to "good" or "excellent".