3.3 Data Frame

So far, we have learned about vectors (Chapter 2), matrices (Section 3.1), and arrays (Section 3.2). These three object types share an important feature: all elements of an object must be of the same type, namely numeric, character, or logical. In real applications, it is common to have variables of mixed types. To accommodate this, let’s introduce a new object type, the data frame.

3.3.1 Introduction to Data Frames

To create a data frame, you can use the data.frame() function with a collection of vectors of the same length. Let’s see an example of some health conditions of a sheep and a pig over the years 2019, 2020 and 2021.

animal <- rep(c("sheep", "pig"), c(3,3))
year <- rep(2019:2021, 2)
weight <- c(110, 120, 140, NA, 300, 800)
height <- c(2.2, 2.4, 2.7, 2, 2.1, 2.3)
condition <- c("excellent", "good", NA, "excellent", "good", "average")
condition <- factor(condition, ordered = TRUE, levels = c("average", "good", "excellent"))
healthy <- c(rep(TRUE, 5), FALSE)
my_data_frame <- data.frame(animal, year, weight, height, condition, healthy)
my_data_frame
#>   animal year weight height condition healthy
#> 1  sheep 2019    110    2.2 excellent    TRUE
#> 2  sheep 2020    120    2.4      good    TRUE
#> 3  sheep 2021    140    2.7      <NA>    TRUE
#> 4    pig 2019     NA    2.0 excellent    TRUE
#> 5    pig 2020    300    2.1      good    TRUE
#> 6    pig 2021    800    2.3   average   FALSE

The data frame my_data_frame has 6 columns, each of which represents one variable, and the variables are of different types: animal is character, year is integer, weight and height are both double, condition is an ordered factor, and healthy is logical. (In versions of R before 4.0.0, data.frame() converted character vectors such as animal into factors by default.)
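
The column types can also be checked programmatically. The following is a minimal sketch rather than part of the original example: it rebuilds a few of the vectors above under the hypothetical name df, passes stringsAsFactors = FALSE explicitly so that the result does not depend on the R version, and shows that data.frame() rejects vectors whose lengths cannot be matched.

```r
animal <- rep(c("sheep", "pig"), c(3, 3))
year <- rep(2019:2021, 2)
weight <- c(110, 120, 140, NA, 300, 800)
healthy <- c(rep(TRUE, 5), FALSE)

# df is a hypothetical name used only in this sketch
df <- data.frame(animal, year, weight, healthy, stringsAsFactors = FALSE)

# sapply() applies class() to every column and returns a named character vector
col_classes <- sapply(df, class)
col_classes

# Vectors of length 3 and 2 cannot be aligned, so data.frame() signals an error;
# tryCatch() turns that error into NULL so the script keeps running
bad <- tryCatch(data.frame(a = 1:3, b = 1:2), error = function(e) NULL)
is.null(bad)
```

The ordered factor condition is left out of this sketch because class() returns two values for it ("ordered" and "factor"), which would make sapply() return a list instead of a character vector.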

This kind of data representation is impossible with matrices, since the coercion rule applies and converts everything into character. Let’s combine everything into a matrix and check its value. Note that condition, being a factor, shows up as its integer level codes rather than its labels.

my_mat <- cbind(animal, year, weight, height, condition, healthy)
my_mat
#>      animal  year   weight height condition healthy
#> [1,] "sheep" "2019" "110"  "2.2"  "3"       "TRUE" 
#> [2,] "sheep" "2020" "120"  "2.4"  "2"       "TRUE" 
#> [3,] "sheep" "2021" "140"  "2.7"  NA        "TRUE" 
#> [4,] "pig"   "2019" NA     "2"    "3"       "TRUE" 
#> [5,] "pig"   "2020" "300"  "2.1"  "2"       "TRUE" 
#> [6,] "pig"   "2021" "800"  "2.3"  "1"       "FALSE"
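
To make the contrast concrete, here is a small sketch with made-up two-row data (the names m and df are hypothetical). It checks that the matrix stores a single type while the data frame keeps each column's own type.

```r
animal <- c("sheep", "pig")
weight <- c(110, 300)
healthy <- c(TRUE, FALSE)

# cbind() builds a matrix, so all values are coerced to one common type
m <- cbind(animal, weight, healthy)
typeof(m)                      # the whole matrix is character
is.character(m[, "weight"])    # the weights are now strings

# data.frame() keeps the original type of each column
df <- data.frame(animal, weight, healthy)
is.numeric(df$weight)
is.logical(df$healthy)
```

With the matrix, arithmetic on the weights would first require converting them back with as.numeric(); with the data frame, no conversion is needed.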

When creating a data frame, you can also give each column a name of your choice.

my_data_frame2 <- data.frame(ani = animal, y = year, w = weight, h = height, con = condition, hea = healthy)

After creating the data frame, it is useful to examine its class using the class() function and structure using the str() function.

class(my_data_frame)
#> [1] "data.frame"
str(my_data_frame)
#> 'data.frame':    6 obs. of  6 variables:
#>  $ animal   : chr  "sheep" "sheep" "sheep" "pig" ...
#>  $ year     : int  2019 2020 2021 2019 2020 2021
#>  $ weight   : num  110 120 140 NA 300 800
#>  $ height   : num  2.2 2.4 2.7 2 2.1 2.3
#>  $ condition: Ord.factor w/ 3 levels "average"<"good"<..: 3 2 NA 3 2 1
#>  $ healthy  : logi  TRUE TRUE TRUE TRUE TRUE FALSE

The output of str() tells us that the data frame has 6 observations and 6 variables, along with the type and the first few values of each variable. You may be puzzled by the $ symbol before each variable name. It is the operator used to extract a column: write the name of the data frame, then $, then the name of the variable.

my_data_frame$animal
#> [1] "sheep" "sheep" "sheep" "pig"   "pig"   "pig"
my_data_frame$weight
#> [1] 110 120 140  NA 300 800
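
Besides $, a column can be extracted with double square brackets, which is convenient when the column name is stored in a variable. The sketch below uses a small made-up data frame; the names df and col are hypothetical.

```r
df <- data.frame(animal = c("sheep", "pig"), weight = c(110, 300))

# $ takes the column name literally
by_dollar <- df$weight

# [[ ]] takes the column name as a string
by_name <- df[["weight"]]

# so the name can also come from a variable
col <- "weight"
by_variable <- df[[col]]

identical(by_dollar, by_name)
identical(by_dollar, by_variable)
```

All three expressions return the same numeric vector.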

In Section 2.5, we introduced the very useful function summary() which gives us important summary statistics for a vector. Using summary() on a data frame, you get the summary statistics for each variable.

summary(my_data_frame)
#>     animal               year          weight        height          condition
#>  Length:6           Min.   :2019   Min.   :110   Min.   :2.000   average  :1  
#>  Class :character   1st Qu.:2019   1st Qu.:120   1st Qu.:2.125   good     :2  
#>  Mode  :character   Median :2020   Median :140   Median :2.250   excellent:2  
#>                     Mean   :2020   Mean   :294   Mean   :2.283   NA's     :1  
#>                     3rd Qu.:2021   3rd Qu.:300   3rd Qu.:2.375                
#>                     Max.   :2021   Max.   :800   Max.   :2.700                
#>                                    NA's   :1                                  
#>   healthy       
#>  Mode :logical  
#>  FALSE:1        
#>  TRUE :5        
#>                 
#>                 
#>                 
#> 

From the results, you can see that depending on the variable type, you get different forms of summary.

In the object my_data_frame, there are two missing values, represented by NA. To remove the observations (rows) that contain NA values, you can apply na.omit() to the data frame.

my_df_nona <- na.omit(my_data_frame)
my_df_nona
#>   animal year weight height condition healthy
#> 1  sheep 2019    110    2.2 excellent    TRUE
#> 2  sheep 2020    120    2.4      good    TRUE
#> 5    pig 2020    300    2.1      good    TRUE
#> 6    pig 2021    800    2.3   average   FALSE

You can see that the 3rd and 4th rows are removed since each of them has a missing value. The remaining rows keep their original row names (1, 2, 5, and 6).

An alternative approach to remove all rows with missing observations is to first use the complete.cases() function to get a logical vector of whether a row has missing elements, and then use data frame subsetting.

complete_ind <- complete.cases(my_data_frame)
my_data_frame[complete_ind, ]
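
To see what complete.cases() returns, consider this minimal sketch on a made-up data frame (df is a hypothetical name). It also shows that the check can be restricted to selected columns, which na.omit() applied to the whole data frame does not do.

```r
df <- data.frame(
  weight = c(110, NA, 140, 300),
  condition = c("good", "good", NA, "average")
)

# One logical value per row: TRUE when the row has no NA
complete_ind <- complete.cases(df)
complete_ind

# Number of rows with at least one missing value
n_incomplete <- sum(!complete_ind)

# Keep only the complete rows
kept <- df[complete_ind, ]

# Only require weight to be present; a missing condition is tolerated
kept_weight <- df[complete.cases(df["weight"]), ]
```

Here rows 2 and 3 are incomplete, so kept has two rows, while kept_weight drops only row 2.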

3.3.2 Adding Observations or Variables in Data Frames

Sometimes, you may want to add additional observations or variables to an existing data frame.

To add observations, put them into a new data frame with the same column names as the existing one, and combine the two with the rbind() function.

new_obs <- data.frame(animal = "pig", year = 2018, weight = 200, height = 1.9, condition = "excellent", healthy = TRUE)
rbind(my_data_frame, new_obs)
#>   animal year weight height condition healthy
#> 1  sheep 2019    110    2.2 excellent    TRUE
#> 2  sheep 2020    120    2.4      good    TRUE
#> 3  sheep 2021    140    2.7      <NA>    TRUE
#> 4    pig 2019     NA    2.0 excellent    TRUE
#> 5    pig 2020    300    2.1      good    TRUE
#> 6    pig 2021    800    2.3   average   FALSE
#> 7    pig 2018    200    1.9 excellent    TRUE
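
Two details are worth checking: rbind() returns a new data frame and leaves its inputs unchanged, and the column names of the two data frames have to match. The sketch below illustrates both with made-up data; df, new_obs, combined, and the misnamed column wt are all hypothetical.

```r
df <- data.frame(animal = c("sheep", "pig"), weight = c(110, 300))
new_obs <- data.frame(animal = "pig", weight = 200)

# The result must be assigned to keep it; df itself is not modified
combined <- rbind(df, new_obs)
nrow(df)
nrow(combined)

# A column named wt instead of weight does not match, so rbind() signals
# an error; tryCatch() turns it into NULL so the script keeps running
mismatch <- tryCatch(
  rbind(df, data.frame(animal = "pig", wt = 200)),
  error = function(e) NULL
)
is.null(mismatch)
```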

To add a variable to an existing data frame, assign the new values to a new column name using $.

my_data_frame$age <- my_data_frame$year - 2015
my_data_frame
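
As a minimal sketch of the same idea on made-up data (df and the column heavy are hypothetical names), a new column can be created with either $ or double square brackets, and it is appended after the existing columns.

```r
df <- data.frame(year = 2019:2021, weight = c(110, 120, 140))

# A derived numeric column
df$age <- df$year - 2015

# [[ ]] works the same way; here the new column is logical
df[["heavy"]] <- df$weight > 115

names(df)
df
```

The new vector needs one value per row (a single value is recycled to every row).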

3.3.3 Subsetting Data Frames

Since a data frame is a two-dimensional object, subsetting a data frame is very similar to subsetting a matrix.

a. using indices to do data frame subsetting

The first method for data frame subsetting is to specify the desired row indices and column indices, separated by a comma. For example, we can extract the (1, 3) and (2, 4) elements of my_data_frame using the following code.

my_data_frame[1, 3]
#> [1] 110
my_data_frame[2, 4]
#> [1] 2.4

As with matrix subsetting, if you omit the indices of one dimension, R will keep everything along that dimension. You can also use negative indices to keep everything except the provided indices. Let’s see some examples.

my_data_frame[2, ]
#>   animal year weight height condition healthy age
#> 2  sheep 2020    120    2.4      good    TRUE   5
my_data_frame[, 2]
#> [1] 2019 2020 2021 2019 2020 2021
my_data_frame[2]       # a single index selects columns and returns a data frame
#>   year
#> 1 2019
#> 2 2020
#> 3 2021
#> 4 2019
#> 5 2020
#> 6 2021
my_data_frame[c(1,3), -c(3,4)]
#>   animal year condition healthy age
#> 1  sheep 2019 excellent    TRUE   4
#> 3  sheep 2021      <NA>    TRUE   6
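
Notice that selecting a single column with a comma returns a vector, while a single index without a comma returns a one-column data frame. The sketch below, on a made-up data frame df, shows how the drop argument of the subsetting operator controls this.

```r
df <- data.frame(animal = c("sheep", "pig"), year = c(2019L, 2020L))

# With a comma, one column is simplified to a vector by default
as_vector <- df[, 2]
is.data.frame(as_vector)

# drop = FALSE keeps the result as a one-column data frame
as_frame <- df[, 2, drop = FALSE]
is.data.frame(as_frame)

# A single index without a comma also returns a data frame
also_frame <- df[2]
identical(as_frame, also_frame)
```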

b. using column names to do data frame subsetting

Since data frames usually have column names, you can do subsetting using multiple column names.

my_data_frame[, c("animal", "weight")]
#>   animal weight
#> 1  sheep    110
#> 2  sheep    120
#> 3  sheep    140
#> 4    pig     NA
#> 5    pig    300
#> 6    pig    800

c. using logical vectors to do data frame subsetting

Using logical vectors to do data frame subsetting is very useful. Suppose we want to find the condition of the pig in year 2021.

is_2021 <- my_data_frame$year == 2021
is_pig <- my_data_frame$animal == "pig"
my_data_frame$condition[is_2021 & is_pig]
#> [1] average
#> Levels: average < good < excellent
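
Logical vectors can be combined with & (and), | (or), and ! (not) before being used as an index. Here is a small sketch on made-up data without missing values; df is a hypothetical name.

```r
df <- data.frame(
  animal = c("sheep", "sheep", "pig", "pig"),
  year = c(2020, 2021, 2020, 2021),
  weight = c(120, 140, 300, 800)
)

is_2021 <- df$year == 2021
is_pig <- df$animal == "pig"

# Both conditions must hold: the pig in 2021
pig_2021_weight <- df$weight[is_2021 & is_pig]

# At least one condition holds: any pig, or any observation from 2021
either <- df[is_2021 | is_pig, ]

# Negation: everything that is not a pig
not_pig <- df[!is_pig, ]
```

Since a logical vector is TRUE or FALSE for each row, sum(is_2021 & is_pig) counts how many rows satisfy both conditions.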

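Now, let’s say we want to extract all the observations with an excellent condition. Because condition contains an NA, the comparison returns NA for that observation, and indexing rows with a logical NA produces a row filled with NA. Wrapping the comparison in which() keeps only the indices where the comparison is TRUE.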

my_data_frame[my_data_frame$condition == "excellent", ]
#>    animal year weight height condition healthy age
#> 1   sheep 2019    110    2.2 excellent    TRUE   4
#> NA   <NA>   NA     NA     NA      <NA>      NA  NA
#> 4     pig 2019     NA    2.0 excellent    TRUE   4
my_data_frame[which(my_data_frame$condition == "excellent"), ]    # which() drops the NA row
#>   animal year weight height condition healthy age
#> 1  sheep 2019    110    2.2 excellent    TRUE   4
#> 4    pig 2019     NA    2.0 excellent    TRUE   4
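
The behavior of NA in a logical index can be examined step by step. The following sketch uses a made-up data frame (df, with a hypothetical id column) and also shows %in% as an alternative to which(), since %in% returns FALSE rather than NA for a missing value.

```r
condition <- c("excellent", "good", NA, "excellent")
df <- data.frame(id = 1:4, condition)

# Comparing NA with a string gives NA, not FALSE
is_excellent <- df$condition == "excellent"
is_excellent

# The NA in the index yields an extra row filled with NA
with_na_row <- df[is_excellent, ]
nrow(with_na_row)

# which() returns only the positions that are TRUE
which(is_excellent)
by_which <- df[which(is_excellent), ]

# %in% never returns NA, so it can be used as an index directly
by_in <- df[df$condition %in% "excellent", ]
```

Both by_which and by_in contain only the two observations whose condition is known to be excellent.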

3.3.4 Exercises

Consider the following data frame.

animal <- rep(c("sheep", "pig"), c(3,3))
weight <- c(110, NA, 140, NA, 300, 800)
condition <- c("excellent", "good", NA, "excellent", "good", "average")
healthy <- c(rep(TRUE, 5), FALSE)
my_data_frame <- data.frame(animal, weight, condition, healthy)
my_data_frame
#>   animal weight condition healthy
#> 1  sheep    110 excellent    TRUE
#> 2  sheep     NA      good    TRUE
#> 3  sheep    140      <NA>    TRUE
#> 4    pig     NA excellent    TRUE
#> 5    pig    300      good    TRUE
#> 6    pig    800   average   FALSE
  1. Generate a data frame that contains only the rows of my_data_frame with complete observations.
  2. In my_data_frame, fill in the missing values in weight with the median of its non-missing values.
  3. Add the following observation to my_data_frame: animal = "pig", weight = 900, condition = "average", and healthy = FALSE.
  4. Extract the sub-data-frame of my_data_frame that contains the columns animal and healthy and the rows in which weight is less than 400 and condition is "good" or "excellent".