## 3.3 Data Frame

So far, we have learned about vectors (Chapter 2), matrices (Section 3.1), and arrays (Section 3.2). These three object types share an important feature: all elements of an object must be of the same type, namely numeric, character, or logical. In real applications, it is common to have variables of mixed types. To accommodate this, let’s introduce a new object type, the data frame.

### 3.3.1 Introduction to Data Frames

To create a data frame, you can use the data.frame() function with a collection of vectors of the same length. Let’s see an example of some health conditions of a sheep and a pig over the years 2019, 2020 and 2021.

animal <- rep(c("sheep", "pig"), c(3,3))
year <- rep(2019:2021, 2)
weight <- c(110, 120, 140, NA, 300, 800)
height <- c(2.2, 2.4, 2.7, 2, 2.1, 2.3)
condition <- c("excellent", "good", NA, "excellent", "good", "average")
condition <- factor(condition, ordered = TRUE, levels = c("average", "good", "excellent"))
healthy <- c(rep(TRUE, 5), FALSE)
my_data_frame <- data.frame(animal, year, weight, height, condition, healthy)
my_data_frame
#>   animal year weight height condition healthy
#> 1  sheep 2019    110    2.2 excellent    TRUE
#> 2  sheep 2020    120    2.4      good    TRUE
#> 3  sheep 2021    140    2.7      <NA>    TRUE
#> 4    pig 2019     NA    2.0 excellent    TRUE
#> 5    pig 2020    300    2.1      good    TRUE
#> 6    pig 2021    800    2.3   average   FALSE

The data frame my_data_frame has 6 columns, each of which represents one variable. The variables are of different types: animal is character, year is integer, weight and height are both double, condition is an ordered factor, and healthy is logical.
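One way to check these types programmatically is to apply typeof() and is.factor() to every column with sapply(). The sketch below rebuilds the example data so that it runs on its own. It spells out stringsAsFactors = FALSE as an assumption (this is the default from R 4.0.0 onward), so that animal stays a character vector; the names col_types and is_fac are introduced only for illustration.

```r
# Rebuild the example data so that this snippet runs on its own
animal <- rep(c("sheep", "pig"), c(3, 3))
year <- rep(2019:2021, 2)
weight <- c(110, 120, 140, NA, 300, 800)
height <- c(2.2, 2.4, 2.7, 2, 2.1, 2.3)
condition <- factor(
  c("excellent", "good", NA, "excellent", "good", "average"),
  ordered = TRUE, levels = c("average", "good", "excellent")
)
healthy <- c(rep(TRUE, 5), FALSE)
# Assumption: stringsAsFactors = FALSE (the default since R 4.0.0)
my_data_frame <- data.frame(animal, year, weight, height, condition, healthy,
                            stringsAsFactors = FALSE)

# Storage type of each column, returned as a named character vector
col_types <- sapply(my_data_frame, typeof)
col_types

# Which columns are factors?
is_fac <- sapply(my_data_frame, is.factor)
is_fac
```

The column condition reports integer here because a factor is stored as integer codes together with a set of level labels; is.factor() distinguishes it from an ordinary integer column such as year.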

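This kind of data representation is impossible with a matrix, since the coercion rule applies and converts every element to character. Let’s combine everything into a matrix and inspect the result. Note that the ordered factor condition appears as its underlying integer codes (1 for average, 2 for good, 3 for excellent).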

my_mat <- cbind(animal, year, weight, height, condition, healthy)
my_mat
#>      animal  year   weight height condition healthy
#> [1,] "sheep" "2019" "110"  "2.2"  "3"       "TRUE"
#> [2,] "sheep" "2020" "120"  "2.4"  "2"       "TRUE"
#> [3,] "sheep" "2021" "140"  "2.7"  NA        "TRUE"
#> [4,] "pig"   "2019" NA     "2"    "3"       "TRUE"
#> [5,] "pig"   "2020" "300"  "2.1"  "2"       "TRUE"
#> [6,] "pig"   "2021" "800"  "2.3"  "1"       "FALSE"
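Because a matrix holds a single type, numeric values stored in it have to be converted back before any arithmetic. A minimal sketch with a reduced two-row version of the data; the names small_mat, weight_back, and small_df are hypothetical and used only here.

```r
# Reduced version of the example: one character and one numeric vector
animal <- c("sheep", "pig")
weight <- c(110, 300)

small_mat <- cbind(animal, weight)   # weight is coerced to character
typeof(small_mat)                    # storage type of the whole matrix

# Numbers must be converted back explicitly before computing with them
weight_back <- as.numeric(small_mat[, "weight"])
mean(weight_back)

# A data frame keeps the original type of each column
small_df <- data.frame(animal, weight)
is.numeric(small_df$weight)
```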

In the process of creating data frames, you can also name each column.

my_data_frame2 <- data.frame(ani = animal, y = year, w = weight, h = height, con = condition, hea = healthy)
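The column names chosen this way can be read back, or changed later, with names(). A small sketch using a two-column data frame in the same style; df_named and the replacement name weight are hypothetical, introduced only for illustration.

```r
# A small data frame with custom column names
df_named <- data.frame(ani = c("sheep", "pig"), w = c(110, 300))

# Read the column names
names(df_named)

# Rename the second column in place
names(df_named)[2] <- "weight"
names(df_named)
```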

After creating the data frame, it is useful to examine its class using the class() function and structure using the str() function.

class(my_data_frame)
#> [1] "data.frame"
str(my_data_frame)
#> 'data.frame':    6 obs. of  6 variables:
#>  $ animal   : chr  "sheep" "sheep" "sheep" "pig" ...
#>  $ year     : int  2019 2020 2021 2019 2020 2021
#>  $ weight   : num  110 120 140 NA 300 800
#>  $ height   : num  2.2 2.4 2.7 2 2.1 2.3
#>  $ condition: Ord.factor w/ 3 levels "average"<"good"<..: 3 2 NA 3 2 1
#>  $ healthy  : logi  TRUE TRUE TRUE TRUE TRUE FALSE
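The counts of observations and variables shown in the first line of the str() output can also be queried directly with nrow(), ncol(), and dim(). A brief sketch on a reduced three-column copy of the data; df_small is a hypothetical name used only here.

```r
# Reduced copy of the example data (three of the six variables)
df_small <- data.frame(
  animal = rep(c("sheep", "pig"), c(3, 3)),
  year = rep(2019:2021, 2),
  weight = c(110, 120, 140, NA, 300, 800)
)

nrow(df_small)   # number of observations (rows)
ncol(df_small)   # number of variables (columns)
dim(df_small)    # both at once: rows, then columns
```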

str() tells us that the data frame has 6 observations and 6 variables, along with the type and the first few values of each variable. From the output, you may be puzzled by the $ symbol before each variable name. In fact, you can extract the column corresponding to a variable by writing the name of the data frame, then $, then the variable name.

my_data_frame$animal
#> [1] "sheep" "sheep" "sheep" "pig"   "pig"   "pig"
my_data_frame$weight
#> [1] 110 120 140  NA 300 800

In Section 2.5, we introduced the very useful function summary(), which gives us important summary statistics for a vector. Using summary() on a data frame, you get the summary statistics for each variable.

summary(my_data_frame)
#>     animal               year          weight        height          condition
#>  Length:6           Min.   :2019   Min.   :110   Min.   :2.000   average  :1
#>  Class :character   1st Qu.:2019   1st Qu.:120   1st Qu.:2.125   good     :2
#>  Mode  :character   Median :2020   Median :140   Median :2.250   excellent:2
#>                     Mean   :2020   Mean   :294   Mean   :2.283   NA's     :1
#>                     3rd Qu.:2021   3rd Qu.:300   3rd Qu.:2.375
#>                     Max.   :2021   Max.   :800   Max.   :2.700
#>                                    NA's   :1
#>   healthy
#>  Mode :logical
#>  FALSE:1
#>  TRUE :5

From the results, you can see that the form of the summary depends on the variable type.

In the object my_data_frame, there are two missing values, represented by NA. To remove the observations (rows) with NA values, you can use na.omit() on the data frame.

my_df_nona <- na.omit(my_data_frame)
my_df_nona
#>   animal year weight height condition healthy
#> 1  sheep 2019    110    2.2 excellent    TRUE
#> 2  sheep 2020    120    2.4      good    TRUE
#> 5    pig 2020    300    2.1      good    TRUE
#> 6    pig 2021    800    2.3   average   FALSE

You can see that the 3rd and 4th rows are removed since each of them has a missing value. An alternative way to remove all rows with missing values is to first use the complete.cases() function to get a logical vector indicating whether each row is complete (has no missing values), and then use data frame subsetting.

complete_ind <- complete.cases(my_data_frame)
my_data_frame[complete_ind, ]

### 3.3.2 Adding Observations or Variables in Data Frames

Sometimes, you may want to add additional observations or variables to an existing data frame. To add additional observations, you need to put them into a new data frame and use the rbind() function.

new_obs <- data.frame(animal = "pig", year = 2018, weight = 200, height = 1.9, condition = "excellent", healthy = TRUE)
rbind(my_data_frame, new_obs)
#>   animal year weight height condition healthy
#> 1  sheep 2019    110    2.2 excellent    TRUE
#> 2  sheep 2020    120    2.4      good    TRUE
#> 3  sheep 2021    140    2.7      <NA>    TRUE
#> 4    pig 2019     NA    2.0 excellent    TRUE
#> 5    pig 2020    300    2.1      good    TRUE
#> 6    pig 2021    800    2.3   average   FALSE
#> 7    pig 2018    200    1.9 excellent    TRUE

To add an additional variable to the existing data frame, you just need to assign a new column to the data frame using $.

my_data_frame$age <- my_data_frame$year - 2015
my_data_frame

### 3.3.3 Subsetting Data Frames

Because a data frame is a two-dimensional object, subsetting a data frame is very similar to subsetting a matrix.

a. using indices to do data frame subsetting

The first method for data frame subsetting is to specify the desired row indices and column indices, separated by a comma. For example, we can extract the (1, 3) and (2, 4) elements of my_data_frame using the following code.

my_data_frame[1, 3]
#> [1] 110
my_data_frame[2, 4]
#> [1] 2.4

Similar to matrix subsetting, if you omit the indices of one dimension, R will keep everything along that dimension. You can also use negative indices to keep everything except the provided indices. Let’s see some examples.

my_data_frame[2, ]
#>   animal year weight height condition healthy age
#> 2  sheep 2020    120    2.4      good    TRUE   5
my_data_frame[, 2]
#> [1] 2019 2020 2021 2019 2020 2021
my_data_frame[2]     #a single index corresponds to the columns
#>   year
#> 1 2019
#> 2 2020
#> 3 2021
#> 4 2019
#> 5 2020
#> 6 2021
my_data_frame[c(1,3), -c(3,4)]
#>   animal year condition healthy age
#> 1  sheep 2019 excellent    TRUE   4
#> 3  sheep 2021      <NA>    TRUE   6

b. using column names to do data frame subsetting

Since data frames usually have column names, you can do subsetting using multiple column names.

my_data_frame[, c("animal", "weight")]
#>   animal weight
#> 1  sheep    110
#> 2  sheep    120
#> 3  sheep    140
#> 4    pig     NA
#> 5    pig    300
#> 6    pig    800

c. using logical vectors to do data frame subsetting

Using logical vectors to do data frame subsetting is very useful. Suppose we want to find the condition of the pig in year 2021.

is_2021 <- my_data_frame$year == 2021
is_pig <- my_data_frame$animal == "pig"
my_data_frame$condition[is_2021 & is_pig]
#> [1] average
#> Levels: average < good < excellent
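The two conditions can also be written inside a single bracket expression that selects rows and columns at once. A sketch on a reduced copy of the data; df_small and pig_2021 are hypothetical names used only here.

```r
# Reduced copy of the example data
df_small <- data.frame(
  animal = rep(c("sheep", "pig"), c(3, 3)),
  year = rep(2019:2021, 2),
  weight = c(110, 120, 140, NA, 300, 800)
)

# Rows: pig observations from 2021; columns: year and weight
pig_2021 <- df_small[df_small$animal == "pig" & df_small$year == 2021,
                     c("year", "weight")]
pig_2021
```

Selecting more than one column keeps the result as a data frame, whereas the $ form above returns a plain vector.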

Now, let’s say we want to extract all the observations with an excellent condition.

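my_data_frame[my_data_frame$condition == "excellent", ]
#>    animal year weight height condition healthy age
#> 1   sheep 2019    110    2.2 excellent    TRUE   4
#> NA   <NA>   NA     NA     NA      <NA>      NA  NA
#> 4     pig 2019     NA    2.0 excellent    TRUE   4

The comparison evaluates to NA for the observation whose condition is missing, and indexing with NA produces a row filled with NA. Wrapping the comparison in which() keeps only the indices where it is TRUE.

my_data_frame[which(my_data_frame$condition == "excellent"), ]    #remove the NA row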
#>   animal year weight height condition healthy age
#> 1  sheep 2019    110    2.2 excellent    TRUE   4
#> 4    pig 2019     NA    2.0 excellent    TRUE   4
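Besides which(), two other base R idioms avoid the NA row: %in% returns FALSE rather than NA for a missing value, and subset() drops rows for which the condition is NA. A sketch on a reduced copy of the data; df_small, its id column, keep, by_in, and by_subset are hypothetical names used only here.

```r
# Reduced copy of the example data with one missing condition
condition <- factor(
  c("excellent", "good", NA, "excellent"),
  ordered = TRUE, levels = c("average", "good", "excellent")
)
df_small <- data.frame(id = 1:4, condition)

# %in% yields FALSE (not NA) where condition is missing
keep <- df_small$condition %in% "excellent"
by_in <- df_small[keep, ]
by_in

# subset() treats an NA condition as FALSE and drops that row
by_subset <- subset(df_small, condition == "excellent")
by_subset
```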

### 3.3.4 Exercises

Consider the following data frame.

animal <- rep(c("sheep", "pig"), c(3,3))
weight <- c(110, NA, 140, NA, 300, 800)
condition <- c("excellent", "good", NA, "excellent", "good", "average")
healthy <- c(rep(TRUE, 5), FALSE)
my_data_frame <- data.frame(animal, weight, condition, healthy)
my_data_frame
#>   animal weight condition healthy
#> 1  sheep    110 excellent    TRUE
#> 2  sheep     NA      good    TRUE
#> 3  sheep    140      <NA>    TRUE
#> 4    pig     NA excellent    TRUE
#> 5    pig    300      good    TRUE
#> 6    pig    800   average   FALSE

1. Generate a data frame containing the rows of my_data_frame that are complete observations.
2. In my_data_frame, replace the missing values in weight with the median of its non-missing values.
3. Add the following observation to my_data_frame: animal = "pig", weight = 900, condition = "average", and healthy = FALSE.
4. Extract the sub-data-frame of my_data_frame that contains the columns animal and healthy and the rows that have weight less than 400 and condition equal to "good" or "excellent".