3.3 Data Frame

So far, we have learned about atomic vectors (Chapter 2), matrices (Section 3.1), and arrays (Section 3.2). Regardless of their dimensionality, these three object types share an important feature: all of their elements must be of the same type (i.e., numeric, character, or logical). In real applications, however, it is common to have variables of mixed types. To accommodate this, let’s introduce a new two-dimensional object type: the data frame.

3.3.1 Introduction to Data Frames

To create a data frame, you can use the data.frame() function to combine several vectors of the same length into a single object. Let’s look at an example that records the health condition of a sheep and a pig in the years 2019, 2020, and 2021.

animal <- rep(c("sheep", "pig"), c(3, 3))
year <- rep(2019:2021, 2)
weight <- c(110, 120, 140, NA, 300, 800)
height <- c(2.2, 2.4, 2.7, 2, 2.1, 2.3)
condition <- c("excellent", "good", NA, "excellent", "good", "average")
condition <- factor(condition, ordered = TRUE, levels = c("average", "good", "excellent"))
healthy <- c(rep(TRUE, 5), FALSE)
my_data_frame <- data.frame(animal, year, weight, height, condition, healthy)
my_data_frame
#>   animal year weight height condition healthy
#> 1  sheep 2019    110    2.2 excellent    TRUE
#> 2  sheep 2020    120    2.4      good    TRUE
#> 3  sheep 2021    140    2.7      <NA>    TRUE
#> 4    pig 2019     NA    2.0 excellent    TRUE
#> 5    pig 2020    300    2.1      good    TRUE
#> 6    pig 2021    800    2.3   average   FALSE

The data frame my_data_frame has 6 columns, each of which represents one variable, and the variables are of different types: animal is character, year is integer, weight and height are both double, condition is an ordered factor, and healthy is logical.

After creating the data frame, it is useful to examine its class using the class() function and structure using the str() function.

class(my_data_frame)
#> [1] "data.frame"
str(my_data_frame)
#> 'data.frame':    6 obs. of  6 variables:
#>  $ animal   : chr  "sheep" "sheep" "sheep" "pig" ...
#>  $ year     : int  2019 2020 2021 2019 2020 2021
#>  $ weight   : num  110 120 140 NA 300 800
#>  $ height   : num  2.2 2.4 2.7 2 2.1 2.3
#>  $ condition: Ord.factor w/ 3 levels "average"<"good"<..: 3 2 NA 3 2 1
#>  $ healthy  : logi  TRUE TRUE TRUE TRUE TRUE FALSE

The output of str() tells us that the data frame has 6 observations of 6 variables, along with the type and the first few values of each variable. You may be puzzled by the $ symbol before each variable name. In fact, you can extract the column corresponding to a variable by writing the data frame name, followed by $ and the variable name.

my_data_frame$animal
#> [1] "sheep" "sheep" "sheep" "pig"   "pig"   "pig"
my_data_frame$weight
#> [1] 110 120 140  NA 300 800
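
Because $ returns an ordinary vector, the extracted column can be passed directly to the vector functions from Chapter 2. The following is a minimal sketch; demo_df is a hypothetical data frame used only for illustration and is not part of the running example.

```r
# demo_df is a hypothetical data frame, used only for illustration
demo_df <- data.frame(animal = c("sheep", "pig", "pig"),
                      weight = c(110, NA, 300))

# A column extracted with $ is an ordinary atomic vector
is.vector(demo_df$weight)
#> [1] TRUE
typeof(demo_df$weight)
#> [1] "double"

# Vector functions apply directly; na.rm = TRUE skips the missing value
mean_weight <- mean(demo_df$weight, na.rm = TRUE)
mean_weight
#> [1] 205
```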

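This kind of data representation is impossible with matrices, since the coercion rule applies and converts everything into characters. Let’s combine the same vectors into a matrix with cbind() and check the result. Note that the ordered factor condition is reduced to its integer codes (1, 2, 3) before being converted to character.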

my_mat <- cbind(animal, year, weight, height, condition, healthy)
my_mat
#>      animal  year   weight height condition healthy
#> [1,] "sheep" "2019" "110"  "2.2"  "3"       "TRUE" 
#> [2,] "sheep" "2020" "120"  "2.4"  "2"       "TRUE" 
#> [3,] "sheep" "2021" "140"  "2.7"  NA        "TRUE" 
#> [4,] "pig"   "2019" NA     "2"    "3"       "TRUE" 
#> [5,] "pig"   "2020" "300"  "2.1"  "2"       "TRUE" 
#> [6,] "pig"   "2021" "800"  "2.3"  "1"       "FALSE"
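
To check the coercion programmatically rather than by reading the quotes in the printed output, you can compare the types directly. This is a small sketch with hypothetical demo_ vectors; stringsAsFactors = FALSE is passed explicitly so that the result does not depend on the R version.

```r
# Hypothetical vectors of three different types, used only for illustration
demo_name <- c("sheep", "pig")
demo_weight <- c(110, 300)
demo_healthy <- c(TRUE, FALSE)

# cbind() builds a matrix, so all elements are coerced to one common type
demo_mat <- cbind(demo_name, demo_weight, demo_healthy)
typeof(demo_mat)
#> [1] "character"

# data.frame() stores each column separately, so each keeps its own type
demo_df <- data.frame(demo_name, demo_weight, demo_healthy,
                      stringsAsFactors = FALSE)
demo_types <- sapply(demo_df, typeof)  # character, double, logical
```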

When creating a data frame, you can also give each column a name of your choice.

my_data_frame2 <- data.frame(ani = animal, y = year, w = weight, h = height, con = condition,
    hea = healthy)
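
The column names can be inspected, and changed later, with the names() function. The sketch below uses a hypothetical two-column data frame demo_df to keep the output short.

```r
# demo_df is a hypothetical data frame, used only for illustration
demo_df <- data.frame(ani = c("sheep", "pig"), wgt = c(110, 300))

# names() returns the column names as a character vector
names(demo_df)
#> [1] "ani" "wgt"

# Assigning to names() renames the columns in place
names(demo_df) <- c("animal", "weight")
names(demo_df)
#> [1] "animal" "weight"
```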

In Section 2.8, we introduced the very useful function summary(), which returns important summary statistics for a vector. Using summary() on a data frame, you get the summary statistics for each variable.

summary(my_data_frame)
#>     animal               year          weight        height          condition
#>  Length:6           Min.   :2019   Min.   :110   Min.   :2.000   average  :1  
#>  Class :character   1st Qu.:2019   1st Qu.:120   1st Qu.:2.125   good     :2  
#>  Mode  :character   Median :2020   Median :140   Median :2.250   excellent:2  
#>                     Mean   :2020   Mean   :294   Mean   :2.283   NA's     :1  
#>                     3rd Qu.:2021   3rd Qu.:300   3rd Qu.:2.375                
#>                     Max.   :2021   Max.   :800   Max.   :2.700                
#>                                    NA's   :1                                  
#>   healthy       
#>  Mode :logical  
#>  FALSE:1        
#>  TRUE :5        
#>                 
#>                 
#>                 
#>

From the results, you can see that depending on the variable type, you get different forms of summary.

In the real world, it is very common to encounter missing values, and you may want to discard the observations that contain them. In the object my_data_frame, there are two missing values, each represented by NA. To remove the observations (rows) with NA values, you can use the na.omit() function on the data frame.

my_df_nona <- na.omit(my_data_frame)
my_df_nona
#>   animal year weight height condition healthy
#> 1  sheep 2019    110    2.2 excellent    TRUE
#> 2  sheep 2020    120    2.4      good    TRUE
#> 5    pig 2020    300    2.1      good    TRUE
#> 6    pig 2021    800    2.3   average   FALSE

You can see that the 3rd and 4th rows are removed since each of them contains a missing value.
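
Before discarding rows, it can help to check how many values are missing in each variable and how many rows would be lost. One possible way to do this is sketched below, using a hypothetical data frame demo_df; is.na() and colSums() are base R functions.

```r
# demo_df is a hypothetical data frame, used only for illustration
demo_df <- data.frame(animal = c("sheep", "sheep", "pig", "pig"),
                      weight = c(110, NA, 300, 800),
                      condition = c("good", "good", NA, "average"))

# is.na() marks each missing cell; colSums() counts them per column
na_per_column <- colSums(is.na(demo_df))
na_per_column  # 0 for animal, 1 for weight, 1 for condition

# Compare the number of rows before and after na.omit()
demo_clean <- na.omit(demo_df)
nrow(demo_df)
#> [1] 4
nrow(demo_clean)
#> [1] 2
```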

An alternative way to remove all rows with missing values is to first use the complete.cases() function, which returns a logical vector that is TRUE for rows without any missing values, and then use this vector to subset the data frame.

complete_ind <- complete.cases(my_data_frame)
my_data_frame[complete_ind, ]
#>   animal year weight height condition healthy
#> 1  sheep 2019    110    2.2 excellent    TRUE
#> 2  sheep 2020    120    2.4      good    TRUE
#> 5    pig 2020    300    2.1      good    TRUE
#> 6    pig 2021    800    2.3   average   FALSE
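
One advantage of complete.cases() over na.omit() is that you can apply it to a subset of the columns, so that only missing values in the variables you care about cause a row to be dropped. A minimal sketch with a hypothetical data frame demo_df:

```r
# demo_df is a hypothetical data frame, used only for illustration
demo_df <- data.frame(animal = c("sheep", "sheep", "pig", "pig"),
                      weight = c(110, NA, 300, 800),
                      condition = c("good", "good", NA, "average"))

# Require only animal and weight to be present; a missing condition is tolerated
has_weight <- complete.cases(demo_df[, c("animal", "weight")])
has_weight
#> [1]  TRUE FALSE  TRUE  TRUE

# Row 2 is dropped, while row 3 (missing condition) is kept
demo_kept <- demo_df[has_weight, ]
nrow(demo_kept)
#> [1] 3
```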

3.3.2 Adding Observations or Variables in Data Frames

Sometimes, you may want to add additional entries to the 1st dimension (i.e., rows/observations) or the 2nd dimension (i.e., columns/variables) to an existing data frame.

To add observations, put them into a new data frame with the same column names as the existing one, and then combine the two data frames with the rbind() function.

new_obs <- data.frame(animal = "pig", year = 2018, weight = 200, height = 1.9, condition = "excellent",
    healthy = TRUE)
rbind(my_data_frame, new_obs)
#>   animal year weight height condition healthy
#> 1  sheep 2019    110    2.2 excellent    TRUE
#> 2  sheep 2020    120    2.4      good    TRUE
#> 3  sheep 2021    140    2.7      <NA>    TRUE
#> 4    pig 2019     NA    2.0 excellent    TRUE
#> 5    pig 2020    300    2.1      good    TRUE
#> 6    pig 2021    800    2.3   average   FALSE
#> 7    pig 2018    200    1.9 excellent    TRUE
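
Two details of rbind() are worth keeping in mind: it returns a new data frame without changing the original, so the result has to be assigned if you want to keep it, and it matches the columns of data frames by name rather than by position. The sketch below illustrates both points with hypothetical objects prefixed by demo_.

```r
# Hypothetical data frames, used only for illustration
demo_df <- data.frame(animal = c("sheep", "pig"), weight = c(110, 300))
demo_new <- data.frame(animal = "pig", weight = 200)

# rbind() returns a new data frame and leaves demo_df untouched
demo_all <- rbind(demo_df, demo_new)
nrow(demo_df)
#> [1] 2
nrow(demo_all)
#> [1] 3

# Columns are matched by name, so their order in demo_new2 does not matter
demo_new2 <- data.frame(weight = 250, animal = "sheep")
demo_all2 <- rbind(demo_all, demo_new2)
demo_all2$weight
#> [1] 110 300 200 250
```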

To add an additional variable to the existing data frame, you just need to use the $ operator to name the new variable and use the <- operator to assign values to the new variable.

my_data_frame$age <- my_data_frame$year - 2015
my_data_frame
#>   animal year weight height condition healthy age
#> 1  sheep 2019    110    2.2 excellent    TRUE   4
#> 2  sheep 2020    120    2.4      good    TRUE   5
#> 3  sheep 2021    140    2.7      <NA>    TRUE   6
#> 4    pig 2019     NA    2.0 excellent    TRUE   4
#> 5    pig 2020    300    2.1      good    TRUE   5
#> 6    pig 2021    800    2.3   average   FALSE   6
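
As a counterpart to rbind() for rows, the cbind() function can also be used to add a variable. Unlike the $ assignment, it returns a new data frame and leaves the original unchanged. A small sketch, assuming a hypothetical data frame demo_df:

```r
# demo_df is a hypothetical data frame, used only for illustration
demo_df <- data.frame(animal = c("sheep", "pig"), year = c(2019L, 2020L))

# cbind() returns a new data frame with the extra column named age
demo_df2 <- cbind(demo_df, age = demo_df$year - 2015L)
demo_df2$age
#> [1] 4 5

# The original data frame still has two columns
ncol(demo_df)
#> [1] 2
ncol(demo_df2)
#> [1] 3
```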

3.3.3 Subsetting and Modifying Data Frames

Since a data frame is a two-dimensional object, subsetting and modifying it works much like the corresponding operations on matrices.

a. using indices to do data frame subsetting and modifying

The first method for data frame subsetting is to specify the desired row indices and column indices, separated by a comma. For example, we can extract the (1, 3) and (2, 4) elements of my_data_frame using the following code.

my_data_frame[1, 3]
#> [1] 110
my_data_frame[2, 4]
#> [1] 2.4

To modify a specific element of my_data_frame, we can assign the desired value to it.

my_data_frame[1, 3] <- 115

You can check that the corresponding value in the data frame has been modified.

my_data_frame

Similar to matrix subsetting, if you omit the indices of one dimension, R will keep everything along that dimension. You can also use negative indices to keep everything except the provided indices. Let’s see some examples.

my_data_frame[2, ]
#>   animal year weight height condition healthy age
#> 2  sheep 2020    120    2.4      good    TRUE   5
my_data_frame[, 3]
#> [1] 115 120 140  NA 300 800
my_data_frame[3]  # a single index selects columns and returns a data frame
#>   weight
#> 1    115
#> 2    120
#> 3    140
#> 4     NA
#> 5    300
#> 6    800
my_data_frame[c(1, 3), -c(2, 3, 4)]
#>   animal condition healthy age
#> 1  sheep excellent    TRUE   4
#> 3  sheep      <NA>    TRUE   6
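
The examples above show that selecting one column with two indices gives a vector, while a single index gives a one-column data frame. If you want to keep the data frame structure with the two-index form, you can add the drop = FALSE argument. A minimal sketch with a hypothetical data frame demo_df:

```r
# demo_df is a hypothetical data frame, used only for illustration
demo_df <- data.frame(animal = c("sheep", "pig"), weight = c(110, 300))

# Two indices with a single column: the result is simplified to a vector
class(demo_df[, 2])
#> [1] "numeric"

# A single index: the result stays a data frame
class(demo_df[2])
#> [1] "data.frame"

# drop = FALSE keeps the data frame structure with two indices
class(demo_df[, 2, drop = FALSE])
#> [1] "data.frame"
```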

Similarly, we can modify multiple values at the same time.

my_data_frame[1:3, 3] <- c(120, 123, 141)
my_data_frame
#>   animal year weight height condition healthy age
#> 1  sheep 2019    120    2.2 excellent    TRUE   4
#> 2  sheep 2020    123    2.4      good    TRUE   5
#> 3  sheep 2021    141    2.7      <NA>    TRUE   6
#> 4    pig 2019     NA    2.0 excellent    TRUE   4
#> 5    pig 2020    300    2.1      good    TRUE   5
#> 6    pig 2021    800    2.3   average   FALSE   6

Note that you generally want to supply a vector whose length equals the number of elements to be modified. A shorter vector is recycled when its length divides that number evenly; otherwise R stops with an error.
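
The behavior of the recycling rule in data frame assignment can be sketched as follows. The data frame demo_df and the variable demo_result are hypothetical names used only for illustration; tryCatch() is used so that the failing assignment does not interrupt the script.

```r
# demo_df is a hypothetical data frame, used only for illustration
demo_df <- data.frame(id = 1:4, weight = c(110, 120, 140, 300))

# A single value is recycled to fill all 4 selected elements
demo_df[1:4, "weight"] <- 0
demo_df$weight
#> [1] 0 0 0 0

# A length-2 vector is recycled twice to fill 4 elements
demo_df[1:4, "weight"] <- c(1, 2)
demo_df$weight
#> [1] 1 2 1 2

# 3 is not a multiple of 2, so this assignment fails with an error
demo_result <- tryCatch({
  demo_df[1:3, "weight"] <- c(8, 9)
  "assigned"
}, error = function(e) "error")
demo_result
#> [1] "error"
```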

b. using column names to do data frame subsetting and modifying

Since the columns of a data frame have names, you can also subset a data frame using one or more column names.

my_data_frame[1:2, c("animal", "weight")]
#>   animal weight
#> 1  sheep    120
#> 2  sheep    123

Let’s try to modify this sub data frame.

my_data_frame[1:2, c("animal", "weight")] <- data.frame(animal = c("monkey", "bear"),
    weight = c(110, 300))
my_data_frame
#>   animal year weight height condition healthy age
#> 1 monkey 2019    110    2.2 excellent    TRUE   4
#> 2   bear 2020    300    2.4      good    TRUE   5
#> 3  sheep 2021    141    2.7      <NA>    TRUE   6
#> 4    pig 2019     NA    2.0 excellent    TRUE   4
#> 5    pig 2020    300    2.1      good    TRUE   5
#> 6    pig 2021    800    2.3   average   FALSE   6

c. using logical vectors to do data frame subsetting and modifying

Using logical vectors to do data frame subsetting can come in handy. Suppose we want to find the condition of the pig in year 2021.

is_2021 <- my_data_frame$year == 2021
is_pig <- my_data_frame$animal == "pig"
my_data_frame[is_2021 & is_pig, "condition"]
#> [1] average
#> Levels: average < good < excellent

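Now, let’s say we want to extract all the observations with an excellent condition. Because condition is missing in the 3rd row, the comparison returns NA there, and subsetting with an NA index produces a row filled with NA. Wrapping the comparison in which() keeps only the indices where it is TRUE, which removes that row.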

my_data_frame[my_data_frame$condition == "excellent", ]
#>    animal year weight height condition healthy age
#> 1  monkey 2019    110    2.2 excellent    TRUE   4
#> NA   <NA>   NA     NA     NA      <NA>      NA  NA
#> 4     pig 2019     NA    2.0 excellent    TRUE   4
my_data_frame[which(my_data_frame$condition == "excellent"), ]  #remove the NA row
#>   animal year weight height condition healthy age
#> 1 monkey 2019    110    2.2 excellent    TRUE   4
#> 4    pig 2019     NA    2.0 excellent    TRUE   4
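
Another way to avoid the NA row, besides which(), is base R's %in% operator, which returns FALSE rather than NA when the left-hand value is missing. The sketch below compares the two on a hypothetical data frame demo_df.

```r
# demo_df is a hypothetical data frame, used only for illustration
demo_df <- data.frame(animal = c("sheep", "sheep", "pig"),
                      condition = c("excellent", NA, "good"))

# == propagates the missing value into the logical vector
demo_df$condition == "excellent"
#> [1]  TRUE    NA FALSE

# An NA index yields an extra all-NA row, so 2 rows are returned
nrow(demo_df[demo_df$condition == "excellent", ])
#> [1] 2

# %in% treats the missing value as "no match"
demo_df$condition %in% "excellent"
#> [1]  TRUE FALSE FALSE
nrow(demo_df[demo_df$condition %in% "excellent", ])
#> [1] 1
```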

Let’s try to modify the condition of the pig in year 2021 to “excellent”.

my_data_frame[is_2021 & is_pig, "condition"] <- "excellent"
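
The assignment above works because "excellent" is one of the levels of the factor condition. Assigning a value that is not an existing level produces a warning and stores NA instead. The following sketch shows both cases on a hypothetical data frame demo_df; suppressWarnings() is used only to keep the output clean.

```r
# Hypothetical objects, used only for illustration
demo_cond <- factor(c("good", "average"), ordered = TRUE,
                    levels = c("average", "good", "excellent"))
demo_df <- data.frame(animal = c("sheep", "pig"), condition = demo_cond)
demo_is_pig <- demo_df$animal == "pig"

# "excellent" is an existing level, so the assignment succeeds
demo_df[demo_is_pig, "condition"] <- "excellent"
demo_df$condition[2] == "excellent"
#> [1] TRUE

# "poor" is not a level: R warns and stores NA instead
suppressWarnings(demo_df[demo_is_pig, "condition"] <- "poor")
is.na(demo_df$condition[2])
#> [1] TRUE
```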

3.3.4 Exercises

Consider the following data frame,

animal <- rep(c("sheep", "pig"), c(3, 3))
weight <- c(110, NA, 140, NA, 300, 800)
condition <- c("excellent", "good", NA, "excellent", "good", "average")
healthy <- c(rep(TRUE, 5), FALSE)
my_data_frame <- data.frame(animal, weight, condition, healthy)
my_data_frame
#>   animal weight condition healthy
#> 1  sheep    110 excellent    TRUE
#> 2  sheep     NA      good    TRUE
#> 3  sheep    140      <NA>    TRUE
#> 4    pig     NA excellent    TRUE
#> 5    pig    300      good    TRUE
#> 6    pig    800   average   FALSE

Generate a data frame with rows of my_data_frame containing complete observations.
In my_data_frame, fill in the missing values in weight with the median of its non-missing values.
Add the following observation to my_data_frame: animal = "pig", weight = 900, condition = "average", and healthy = FALSE.
Extract the sub-data-frame of my_data_frame that contains the columns animal and healthy and the rows that have weight less than 400 and condition equal to "good" or "excellent".