3.4 Tibble

Having learned about data frames in Section 3.3, we now introduce a modern version of the data frame, called the tibble. Tibbles are data frames, but they change some behaviors of data frames to make coding easier. To use the tibble class, you need to install the tibble package, which is part of the tidyverse collection of packages.

install.packages("tibble")

3.4.1 Introduction to tibbles

After installing the tibble package, you can load the package and create a tibble the same way as you create a data frame.

library(tibble)
animal <- rep(c("sheep", "pig"), c(3,3))
year <- rep(2019:2021, 2)
healthy <- c(rep(TRUE, 5), FALSE)
my_tibble <- tibble(animal, year, healthy)
my_tibble
#> # A tibble: 6 × 3
#>   animal  year healthy
#>   <chr>  <int> <lgl>  
#> 1 sheep   2019 TRUE   
#> 2 sheep   2020 TRUE   
#> 3 sheep   2021 TRUE   
#> 4 pig     2019 TRUE   
#> 5 pig     2020 TRUE   
#> 6 pig     2021 FALSE

Another way to create a tibble is to apply the as_tibble() function to a data frame.

my_data_frame <- data.frame(animal, year, healthy)
as_tibble(my_data_frame)

From the output, we can see that a tibble shows the type of each variable under its name, which is very helpful. Another useful feature of tibbles compared to data frames is that when a tibble has many rows, printing it shows only the first 10 rows and only as many columns as fit in the output window, which keeps the console from becoming overcrowded.

x <- 1:1e5
tibble(id = x, value = sin(x))
#> # A tibble: 100,000 × 2
#>       id  value
#>    <int>  <dbl>
#>  1     1  0.841
#>  2     2  0.909
#>  3     3  0.141
#>  4     4 -0.757
#>  5     5 -0.959
#>  6     6 -0.279
#>  7     7  0.657
#>  8     8  0.989
#>  9     9  0.412
#> 10    10 -0.544
#> # … with 99,990 more rows

Before running the following code, be prepared for your console to be flooded with numbers.

data.frame(id = x, value = sin(x)) 
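If ten rows are not enough, the print() method for tibbles accepts an n argument that sets how many rows to show. The following is a minimal sketch; the name big_tibble is introduced here only for illustration.

```r
library(tibble)

x <- 1:1e5
big_tibble <- tibble(id = x, value = sin(x))  # illustrative name, not from the text

# Show the first 15 rows instead of the default 10
print(big_tibble, n = 15)

# For a plain data frame, head() gives a similarly short preview
head(data.frame(id = x, value = sin(x)), 10)
```

Using head() on the data frame is one way to avoid flooding the console without converting it to a tibble.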

Once we have a tibble, let’s examine its class and structure.

class(my_tibble)
#> [1] "tbl_df"     "tbl"        "data.frame"
str(my_tibble)
#> tibble [6 × 3] (S3: tbl_df/tbl/data.frame)
#>  $ animal : chr [1:6] "sheep" "sheep" "sheep" "pig" ...
#>  $ year   : int [1:6] 2019 2020 2021 2019 2020 2021
#>  $ healthy: logi [1:6] TRUE TRUE TRUE TRUE TRUE FALSE

From the result, you can see that in addition to "data.frame", the tibble also has the classes "tbl_df" and "tbl". These extra classes are what give tibbles their special behavior, such as the compact printing shown above. We will be using tibbles extensively throughout the rest of the book because of their advantages over the original data frames.
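The class relationship can also be checked directly. The sketch below rebuilds my_tibble so that it runs on its own, then tests whether the object is a tibble and whether it is a data frame.

```r
library(tibble)

animal <- rep(c("sheep", "pig"), c(3, 3))
year <- rep(2019:2021, 2)
healthy <- c(rep(TRUE, 5), FALSE)
my_tibble <- tibble(animal, year, healthy)

is_tibble(my_tibble)                 # a tibble ...
#> [1] TRUE
is.data.frame(my_tibble)             # ... is also a data frame
#> [1] TRUE
is_tibble(as.data.frame(my_tibble))  # converting back drops the tibble classes
#> [1] FALSE
dim(my_tibble)                       # data frame functions work on tibbles
#> [1] 6 3
```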

Lastly, we summarize the different variable types of a tibble in the following table.

Type    Description
<chr>   character vector
<int>   integer
<dbl>   double
<ord>   ordered factor
<fct>   unordered factor
<lgl>   logical vector
<date>  dates
<dttm>  date-times
Because a tibble is also a data frame, the functions we learned for data frames, including adding observations or variables and subsetting, can be used with the same syntax. In addition, the tibble package offers functions that make some of these tasks easier.

3.4.2 Adding observations or variables in tibbles

Adding observations to a tibble is easier than adding them to a data frame, thanks to the add_row() function in the tibble package.

add_row(my_tibble, animal = "pig", year = c(2017, 2018), healthy = TRUE)
#> # A tibble: 8 × 3
#>   animal  year healthy
#>   <chr>  <dbl> <lgl>  
#> 1 sheep   2019 TRUE   
#> 2 sheep   2020 TRUE   
#> 3 sheep   2021 TRUE   
#> 4 pig     2019 TRUE   
#> 5 pig     2020 TRUE   
#> 6 pig     2021 FALSE  
#> 7 pig     2017 TRUE   
#> 8 pig     2018 TRUE

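From the results, we can see that multiple rows can be added at the same time by specifying the corresponding values for each variable name. Note that the recycling rule applies to variables for which only one value is specified: here, "pig" and TRUE are reused for both new rows. Also note that year is now of type <dbl>, because 2017 and 2018 were supplied as doubles rather than integers.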
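In the output above, year is shown as <dbl> because 2017 and 2018 are doubles. The sketch below shows one way to keep the integer type, by writing the new value with the L suffix, and uses the .before argument of add_row() to place the new row at the top. The name new_tibble is introduced here for illustration.

```r
library(tibble)

animal <- rep(c("sheep", "pig"), c(3, 3))
year <- rep(2019:2021, 2)
healthy <- c(rep(TRUE, 5), FALSE)
my_tibble <- tibble(animal, year, healthy)

# 2018L is an integer, so the year column stays <int>;
# .before = 1 inserts the new row before the current first row
new_tibble <- add_row(my_tibble,
                      animal = "sheep", year = 2018L, healthy = TRUE,
                      .before = 1)
new_tibble
```

As before, add_row() returns a new tibble and leaves my_tibble itself unchanged.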

To add a variable, you can use $ followed by a name, as in data frames, or you can use the add_column() function.

add_column(my_tibble, 
           weight = c(110, 120, 140, NA, 300, 800),
           height = c(2.2, 2.4, 2.7, 2, 2.1, 2.3)
           )
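Like add_row(), add_column() accepts .before and .after arguments to control where the new variable goes. The following sketch, in which the id column and the names with_id and with_weight are made up for illustration, compares it with the $ approach.

```r
library(tibble)

animal <- rep(c("sheep", "pig"), c(3, 3))
year <- rep(2019:2021, 2)
healthy <- c(rep(TRUE, 5), FALSE)
my_tibble <- tibble(animal, year, healthy)

# Place a new column in front of an existing one
with_id <- add_column(my_tibble, id = 1:6, .before = "animal")
names(with_id)
#> [1] "id"      "animal"  "year"    "healthy"

# The $ approach always appends the new column at the end
with_weight <- my_tibble
with_weight$weight <- c(110, 120, 140, NA, 300, 800)
names(with_weight)
#> [1] "animal"  "year"    "healthy" "weight"
```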

3.4.3 Tibble subsetting

Although subsetting a tibble is very similar to subsetting a data frame, we would like to point out a key difference.

When you subset a tibble with [ and ], the result is always a tibble by default, even if only one column is selected. This differs from subsetting a data frame with [ and ], where selecting a single column returns a vector. If you would like to get a vector when only one column is selected, you need to add drop = TRUE in the subsetting process. Note that drop = TRUE takes effect only when a single column is selected; a single row that spans several columns is still returned as a tibble.

my_tibble[, 1]               #6*1 tibble
#> # A tibble: 6 × 1
#>   animal
#>   <chr> 
#> 1 sheep 
#> 2 sheep 
#> 3 sheep 
#> 4 pig   
#> 5 pig   
#> 6 pig
my_data_frame[, 1]           #vector
#> [1] "sheep" "sheep" "sheep" "pig"   "pig"   "pig"
my_tibble[, 1, drop = TRUE]  #vector
#> [1] "sheep" "sheep" "sheep" "pig"   "pig"   "pig"
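Besides drop = TRUE, a single column can be extracted as a vector with [[ or $, just as with data frames. The sketch below checks that the three approaches agree and shows that a single row remains a tibble; the variable names on the left-hand sides are illustrative.

```r
library(tibble)

animal <- rep(c("sheep", "pig"), c(3, 3))
year <- rep(2019:2021, 2)
healthy <- c(rep(TRUE, 5), FALSE)
my_tibble <- tibble(animal, year, healthy)

col_by_drop   <- my_tibble[, 1, drop = TRUE]  # vector
col_by_name   <- my_tibble[["animal"]]        # vector
col_by_dollar <- my_tibble$animal             # vector
identical(col_by_drop, col_by_name)
#> [1] TRUE
identical(col_by_drop, col_by_dollar)
#> [1] TRUE

first_row <- my_tibble[1, ]  # still a tibble, with 1 row and 3 columns
as.list(first_row)           # one way to turn the row into a list
```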

3.4.4 Exercises

Consider the following tibble and data frame.

animal <- rep(c("sheep", "pig"), c(3,3))
weight <- c(110, NA, 140, NA, 300, 800)
condition <- c("excellent", "good", NA, "excellent", "good", "average")
healthy <- c(rep(TRUE, 5), FALSE)
my_tibble <- tibble(animal, weight, condition, healthy)
my_data_frame <- data.frame(animal, weight, condition, healthy)
my_tibble
#> # A tibble: 6 × 4
#>   animal weight condition healthy
#>   <chr>   <dbl> <chr>     <lgl>  
#> 1 sheep     110 excellent TRUE   
#> 2 sheep      NA good      TRUE   
#> 3 sheep     140 <NA>      TRUE   
#> 4 pig        NA excellent TRUE   
#> 5 pig       300 good      TRUE   
#> 6 pig       800 average   FALSE

  1. Use the add_row() function to add the following observation to my_tibble: animal = "pig", weight = 900, condition = "average", and healthy = FALSE.
  2. Without running the code in R, what do you think is the difference between my_tibble[, 1] and my_data_frame[, 1]? How can you reproduce my_data_frame[, 1] by subsetting my_tibble?