3.4 Tibble
Having learned about data frames in Section 3.3, we now introduce a modern version of the data frame, called the tibble. Tibbles are data frames, but they change some data frame behaviors to make coding easier. To use the tibble class, you need to install the tibble package, which is part of the tidyverse.
install.packages("tibble")
3.4.1 Introduction to tibbles
After installing the tibble package, you can load the package and create a tibble the same way as you create a data frame.
library(tibble)
animal <- rep(c("sheep", "pig"), c(3,3))
year <- rep(2019:2021, 2)
healthy <- c(rep(TRUE, 5), FALSE)
my_tibble <- tibble(animal, year, healthy)
my_tibble
#> # A tibble: 6 × 3
#> animal year healthy
#> <chr> <int> <lgl>
#> 1 sheep 2019 TRUE
#> 2 sheep 2020 TRUE
#> 3 sheep 2021 TRUE
#> 4 pig 2019 TRUE
#> 5 pig 2020 TRUE
#> 6 pig 2021 FALSE
Another way to create a tibble is to use the as_tibble() function on a data frame.
my_data_frame <- data.frame(animal, year, healthy)
as_tibble(my_data_frame)
From the output, we can see that a tibble shows the type of each variable under its name, which is very helpful. Another useful feature of tibbles compared to data frames is that printing one shows at most the first 10 rows and only as many columns as fit in the output window, which keeps the console from becoming overcrowded.
x <- 1:1e5
tibble(id = x, value = sin(x))
#> # A tibble: 100,000 × 2
#> id value
#> <int> <dbl>
#> 1 1 0.841
#> 2 2 0.909
#> 3 3 0.141
#> 4 4 -0.757
#> 5 5 -0.959
#> 6 6 -0.279
#> 7 7 0.657
#> 8 8 0.989
#> 9 9 0.412
#> 10 10 -0.544
#> # … with 99,990 more rows
Before running the following code, be prepared for your console to be flooded with numbers.
data.frame(id = x, value = sin(x))
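If you do want to see more than the default number of rows of a tibble, one option is to ask for them explicitly. The following is a minimal sketch, assuming the n argument of the tibble print method and the base function head(); the names big_tibble and first_three are only used for illustration.

```r
library(tibble)

x <- 1:1e5
big_tibble <- tibble(id = x, value = sin(x))

# Print the first 15 rows instead of the default 10
print(big_tibble, n = 15)

# head() also works on tibbles and returns a smaller tibble
first_three <- head(big_tibble, 3)
first_three
```

Either way, the full tibble stays untouched; only the amount of output changes.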
Once we have a tibble, let’s examine its class and structure.
class(my_tibble)
#> [1] "tbl_df" "tbl" "data.frame"
str(my_tibble)
#> tibble [6 × 3] (S3: tbl_df/tbl/data.frame)
#> $ animal : chr [1:6] "sheep" "sheep" "sheep" "pig" ...
#> $ year : int [1:6] 2019 2020 2021 2019 2020 2021
#> $ healthy: logi [1:6] TRUE TRUE TRUE TRUE TRUE FALSE
From the result, you can see that in addition to "data.frame", the tibble also has the classes "tbl_df" and "tbl", which come with many useful functions. We will use tibbles extensively throughout the rest of the book because of their advantages over the original data frames.
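To check these classes in code rather than by reading the output of class(), you could use a few small tests. This is a sketch that assumes the is_tibble() function from the tibble package together with the base functions is.data.frame() and inherits(); the name converted is hypothetical.

```r
library(tibble)

animal <- rep(c("sheep", "pig"), c(3, 3))
year <- rep(2019:2021, 2)
healthy <- c(rep(TRUE, 5), FALSE)
my_tibble <- tibble(animal, year, healthy)
my_data_frame <- data.frame(animal, year, healthy)

is.data.frame(my_tibble)        # TRUE: a tibble is still a data frame
is_tibble(my_tibble)            # TRUE
is_tibble(my_data_frame)        # FALSE: a plain data frame lacks the tibble classes
inherits(my_tibble, "tbl_df")   # TRUE

# Converting the data frame adds the extra classes
converted <- as_tibble(my_data_frame)
class(converted)
```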
Lastly, we summarize the different variable types shown in tibble output in the following table.
Type | Description |
---|---|
<chr> | character vector |
<int> | integer |
<dbl> | double |
<ord> | ordered factor |
<fct> | unordered factor |
<lgl> | logical vector |
<date> | dates |
<dttm> | date-times |
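To see these abbreviations side by side, you can build a small tibble with one column of each type. The sketch below is only an illustration: the tibble type_demo and its column names are made up, and the comments show the abbreviation we expect under each column name.

```r
library(tibble)

type_demo <- tibble(
  name   = c("a", "b", "c"),                           # <chr>
  count  = 1:3,                                        # <int>
  ratio  = c(0.5, 1.5, 2.5),                           # <dbl>
  size   = factor(c("S", "M", "L"),
                  levels = c("S", "M", "L"),
                  ordered = TRUE),                     # <ord>
  group  = factor(c("x", "y", "x")),                   # <fct>
  flag   = c(TRUE, FALSE, NA),                         # <lgl>
  day    = as.Date(c("2021-01-01", "2021-01-02",
                     "2021-01-03")),                   # <date>
  moment = as.POSIXct(c("2021-01-01 08:00:00",
                        "2021-01-02 09:30:00",
                        "2021-01-03 17:45:00"),
                      tz = "UTC")                      # <dttm>
)
type_demo
```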
Since a tibble is a data frame, the functions we learned for data frames, including those for adding observations or variables and for subsetting, can be used in much the same way. However, the tibble class offers additional functions that make some tasks easier.
3.4.2 Adding Observations or Variables in Tibbles
Adding observations to a tibble is easier than adding them to a data frame, thanks to the add_row() function in the tibble package.
add_row(my_tibble, animal = "pig", year = c(2017, 2018), healthy = TRUE)
#> # A tibble: 8 × 3
#> animal year healthy
#> <chr> <dbl> <lgl>
#> 1 sheep 2019 TRUE
#> 2 sheep 2020 TRUE
#> 3 sheep 2021 TRUE
#> 4 pig 2019 TRUE
#> 5 pig 2020 TRUE
#> 6 pig 2021 FALSE
#> 7 pig 2017 TRUE
#> 8 pig 2018 TRUE
From the results, we can see that multiple rows can be added at once by specifying the values for each variable name. Note that the recycling rule applies to variables for which only one value is specified.
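Two details of add_row() are easy to miss, so here is a short sketch of them. It assumes the .before argument of add_row() and the behavior that unspecified variables are filled with NA; the names bigger, with_first, and partial, as well as the "cow" rows, are made up for illustration.

```r
library(tibble)

animal <- rep(c("sheep", "pig"), c(3, 3))
year <- rep(2019:2021, 2)
healthy <- c(rep(TRUE, 5), FALSE)
my_tibble <- tibble(animal, year, healthy)

# add_row() returns a new tibble; my_tibble itself is unchanged
bigger <- add_row(my_tibble, animal = "pig", year = c(2017, 2018), healthy = TRUE)
nrow(my_tibble)
nrow(bigger)

# .before places the new row at a given position instead of at the end
with_first <- add_row(my_tibble, animal = "cow", year = 2018L, healthy = TRUE,
                      .before = 1)
with_first$animal[1]

# Variables that are not specified are filled with NA
partial <- add_row(my_tibble, animal = "cow")
partial[7, ]
```

To keep the added rows, assign the result back to a name, as done with bigger above.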
To add a variable, in addition to using $ followed by a name as in data frames, you can also use the add_column() function.
add_column(my_tibble,
weight = c(110, 120, 140, NA, 300, 800),
height = c(2.2, 2.4, 2.7, 2, 2.1, 2.3)
)
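The two approaches differ in one practical way: add_column() returns a new tibble, while assigning with $ changes the object it is applied to. The following sketch compares them and also uses the .after argument of add_column() to control where the new column goes; the names with_weight, copy, and after_animal are hypothetical.

```r
library(tibble)

animal <- rep(c("sheep", "pig"), c(3, 3))
year <- rep(2019:2021, 2)
healthy <- c(rep(TRUE, 5), FALSE)
my_tibble <- tibble(animal, year, healthy)
weight <- c(110, 120, 140, NA, 300, 800)

# add_column() returns a new tibble and leaves my_tibble unchanged
with_weight <- add_column(my_tibble, weight = weight)
ncol(my_tibble)
ncol(with_weight)

# Assigning with $ modifies the object on the left-hand side
copy <- my_tibble
copy$weight <- weight
names(copy)

# .after inserts the new column right after an existing one
after_animal <- add_column(my_tibble, weight = weight, .after = "animal")
names(after_animal)
```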
3.4.3 Tibble subsetting
While tibble subsetting is very similar to data frame subsetting, we would like to point out a few key differences.
First of all, when you use [ and ] to subset a tibble, the result is always a tibble by default, even if only one column is selected. This behavior is different from subsetting data frames with [ and ], where selecting a single column with [, j] returns a vector. If you would like to get a vector when only one column is selected, you need to add drop = TRUE to the subsetting call. The same argument can be used when you select rows as well, as long as the result has a single column.
my_tibble[, 1] # a 6 × 1 tibble
#> # A tibble: 6 × 1
#> animal
#> <chr>
#> 1 sheep
#> 2 sheep
#> 3 sheep
#> 4 pig
#> 5 pig
#> 6 pig
my_data_frame[, 1] # vector
#> [1] "sheep" "sheep" "sheep" "pig" "pig" "pig"
my_tibble[, 1, drop = TRUE] # vector
#> [1] "sheep" "sheep" "sheep" "pig" "pig" "pig"
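Besides drop = TRUE, the standard [[ and $ operators also extract a single column as a vector, for tibbles just as for data frames. The sketch below puts the options next to each other; the names on the left-hand side are only for illustration.

```r
library(tibble)

animal <- rep(c("sheep", "pig"), c(3, 3))
year <- rep(2019:2021, 2)
healthy <- c(rep(TRUE, 5), FALSE)
my_tibble <- tibble(animal, year, healthy)

col_tibble <- my_tibble[, 1]                # still a tibble
col_drop   <- my_tibble[, 1, drop = TRUE]   # character vector
col_double <- my_tibble[["animal"]]         # character vector
col_dollar <- my_tibble$animal              # character vector

is_tibble(col_tibble)
is.character(col_drop)
identical(col_drop, col_double)
identical(col_drop, col_dollar)
```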
3.4.4 Exercises
Consider the following tibble and data frame.
animal <- rep(c("sheep", "pig"), c(3,3))
weight <- c(110, NA, 140, NA, 300, 800)
condition <- c("excellent", "good", NA, "excellent", "good", "average")
healthy <- c(rep(TRUE, 5), FALSE)
my_tibble <- tibble(animal, weight, condition, healthy)
my_data_frame <- data.frame(animal, weight, condition, healthy)
my_tibble
#> # A tibble: 6 × 4
#> animal weight condition healthy
#> <chr> <dbl> <chr> <lgl>
#> 1 sheep 110 excellent TRUE
#> 2 sheep NA good TRUE
#> 3 sheep 140 <NA> TRUE
#> 4 pig NA excellent TRUE
#> 5 pig 300 good TRUE
#> 6 pig 800 average FALSE
- Use the add_row() function to add the following observation to my_tibble: animal = "pig", weight = 900, condition = "average", and healthy = FALSE.
- Without running the code in R, what do you think is the difference between my_tibble[, 1] and my_data_frame[, 1]? How can you reproduce my_data_frame[, 1] by subsetting my_tibble?