2.11 Character Vectors, Factors & Ordered Factors
Having learned about character vectors in Section 2.1.2, we now introduce a very important data type: the factor. First, let’s create a character vector to be used in this section.
animals <- c("sheep", "pig", "monkey", "sheep", "sheep", "pig")
2.11.1 Create a factor from a vector
So, what exactly is a factor? It can be viewed as a special type of vector whose elements take on a fixed and known set of different values. You can create a factor from a vector using the factor() function. To understand the output of a factor, it is helpful to compare the results with the original vector animals.
animals_fac <- factor(animals)
animals_fac
#> [1] sheep pig monkey sheep sheep pig
#> Levels: monkey pig sheep
animals
#> [1] "sheep" "pig" "monkey" "sheep" "sheep" "pig"
First, note that the elements of the character vector are printed with quotation marks, while the elements of the corresponding factor are not. Second, the factor output has an additional row starting with “Levels:”. This row shows the unique elements of animals, ordered alphabetically.
If you use the class() function on animals_fac, you will see it is indeed a factor.
class(animals_fac)
class(animals)
To get the levels of a factor, you can use the function levels() on it.
levels(animals_fac)
#> [1] "monkey" "pig" "sheep"
To gain a deeper understanding of a factor, it is helpful to check its internal storage type using typeof().
typeof(animals_fac)
#> [1] "integer"
as.numeric(animals_fac)
#> [1] 3 2 1 3 3 2
Perhaps a bit surprisingly, a factor is stored as integers. Each integer gives the position of the corresponding element in the levels. For example, the first value of as.numeric(animals_fac) is 3, since the first element of animals_fac is "sheep", which is the third element in the levels. This storage mechanism is appealing because storing integers takes much less space than storing the same strings repeatedly in the original character vector. At the same time, you can easily reproduce the original character vector from the integers and the factor levels by subsetting the levels with the integers as indices.
levels(animals_fac)[as.numeric(animals_fac)]
#> [1] "sheep" "pig" "monkey" "sheep" "sheep" "pig"
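As a small check on the storage model described above, the following sketch pulls out the integer codes and the levels separately, then confirms that indexing the levels by the codes gives the same result as as.character(). The names pets, pets_fac, codes, lev, and rebuilt are hypothetical and used only for this illustration.

```r
# `pets` is a hypothetical vector used only for this illustration
pets <- c("sheep", "pig", "monkey", "sheep", "sheep", "pig")
pets_fac <- factor(pets)

codes <- as.integer(pets_fac)  # integer codes: position of each element in the levels
lev <- levels(pets_fac)        # the levels, sorted alphabetically by default

rebuilt <- lev[codes]          # index the levels by the codes

identical(rebuilt, pets)                    # TRUE
identical(rebuilt, as.character(pets_fac))  # TRUE
```

In everyday code, as.character() is the more direct way to get the strings back; the indexing form is shown only to make the storage mechanism visible.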
To show that factors can indeed take less memory than the corresponding character vectors when the levels are repeated many times, let’s look at the following example, where we use the object.size() function to estimate the memory used to store each object.
many_animals <- rep(c("sheep", "pig", "monkey"), c(100,200,300))
many_animals_fac <- factor(many_animals)
object.size(many_animals)
#> 5016 bytes
object.size(many_animals_fac)
#> 3032 bytes
From this example, we can see that storing the information as a factor can offer substantial memory savings (about 40% in this example) compared to storing it as a character vector.
Another advantage of factors over character vectors is that they detect any input that is outside the set of levels. Let’s try to assign the string "Tiger" to the first element of both animals_fac and animals.
animals_fac[1] <- "Tiger"
animals_fac
#> [1] <NA> pig monkey sheep sheep pig
#> Levels: monkey pig sheep
animals[1] <- "Tiger"
animals
#> [1] "Tiger" "pig" "monkey" "sheep" "sheep" "pig"
Since "Tiger" is not in the set of levels, we see a warning during the assignment and the value of the first element is changed to <NA>. When the same assignment is done on the vector animals, there is no warning and the first element of animals is changed to "Tiger" as instructed. This is an attractive feature of factors that can prevent input errors.
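If a new value is genuinely intended, one common approach is to extend the levels before assigning it. The sketch below uses a hypothetical factor named pets_fac (not the running example) and assumes "tiger" is the value to be added.

```r
# `pets_fac` is a hypothetical factor used only for this illustration
pets_fac <- factor(c("sheep", "pig", "monkey"))

# Extend the set of levels first, then assign the new value
levels(pets_fac) <- c(levels(pets_fac), "tiger")
pets_fac[1] <- "tiger"

as.character(pets_fac)  # "tiger" "pig" "monkey"
levels(pets_fac)        # "monkey" "pig" "sheep" "tiger"
anyNA(pets_fac)         # FALSE: no element was turned into NA
```

Note that "sheep" remains a level even though no element uses it anymore.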
In addition to creating factors from character vectors, you can also create them from numeric vectors as well as logical vectors.
x <- rep(3:1, 1:3)
x_fac <- factor(x)
y <- rep(c(T, F), c(5, 3))
y_fac <- factor(y)
It is worth noting that after we convert a numeric vector into a factor, the usual arithmetic operation can no longer be applied since the numbers become levels.
x_fac[1] + 1
#> Warning in Ops.factor(x_fac[1], 1): '+' not meaningful for factors
#> [1] NA
The result is NA with a warning message.
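To do arithmetic on such a factor, the original numbers have to be recovered first. A minimal sketch, using a hypothetical vector named scores: converting the factor directly with as.numeric() returns the integer codes, so the usual idiom goes through as.character() first.

```r
# `scores` is a hypothetical numeric vector used only for this illustration
scores <- c(10, 20, 20, 30)
scores_fac <- factor(scores)

codes <- as.numeric(scores_fac)                    # 1 2 2 3: integer codes, not the values
values <- as.numeric(as.character(scores_fac))     # 10 20 20 30: the original values

values[1] + 1                                      # 11
```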
2.11.2 Set the factor levels and labels
As we have seen, the factor() function extracts the unique elements from a vector and sorts them to form its levels. To manually specify the levels and their order, you can set the levels argument. For example, if you only want "sheep" and "pig" in the levels, you can use the following code.
factor(animals_fac, levels = c("pig", "sheep"))
#> [1] <NA> pig <NA> sheep sheep pig
#> Levels: pig sheep
As you can see, the third element becomes NA, since its corresponding element "monkey" in animals_fac is not in the set of levels.
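The levels argument also controls the order of the levels, not just which values are allowed. The sketch below, using a hypothetical vector named pets, keeps all three animals but lists them in a custom order; the integer codes and the output of table() follow that order.

```r
# `pets` is a hypothetical vector used only for this illustration
pets <- c("sheep", "pig", "monkey", "sheep", "sheep", "pig")
pets_fac <- factor(pets, levels = c("sheep", "pig", "monkey"))

levels(pets_fac)               # "sheep" "pig" "monkey"
as.integer(pets_fac)           # 1 2 3 1 1 2
counts <- table(pets_fac)      # counts are reported in level order
names(counts)                  # "sheep" "pig" "monkey"
as.vector(counts)              # 3 2 1
```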
You can also create labels to represent each level of the factor by setting the labels argument in the factor() function.
factor(animals_fac, levels = c("pig", "sheep"), labels = c("pretty_pig", "smart_sheep"))
#> [1] <NA> pretty_pig <NA> smart_sheep smart_sheep pretty_pig
#> Levels: pretty_pig smart_sheep
An alternative way to change the levels of a factor is to assign the desired vector of levels to levels() applied to the factor. For example, if you want to translate the animal names into Spanish, you can use
levels(animals_fac) <- c("mona", "cerda", "oveja")
animals_fac
#> [1] <NA> cerda mona oveja oveja cerda
#> Levels: mona cerda oveja
2.11.3 Ordered factors
By default, the function factor() creates an unordered factor, which is appropriate when there is no natural ordering among the levels. Sometimes, however, the levels do have a natural ordering. Let’s see an example.
conditions <- c("excellent", "good", "excellent", "good", "average")
factor(conditions)
Unlike the animals in animals, the conditions have a natural ordering. We know \(average < good < excellent\). To reflect this in a factor, you can create a so-called ordered factor by setting ordered = TRUE and specifying the levels in ascending order.
condition_ordered_fac <- factor(conditions, ordered = TRUE, levels = c("average", "good", "excellent"))
condition_ordered_fac
#> [1] excellent good excellent good average
#> Levels: average < good < excellent
We can see that there is an ordering shown in the “Levels.” You can also do comparisons on ordered factors.
condition_ordered_fac[1] < condition_ordered_fac[2]
#> [1] FALSE
The result is FALSE since \(excellent > good\). We will revisit the topic of factor ordering when generating bar charts in Section 4.10.2.
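Beyond pairwise comparisons, the ordering is also used by functions such as sort() and max(). The following sketch rebuilds the same kind of ordered factor under the hypothetical names ratings and ratings_fac so that it can be run on its own.

```r
# `ratings` is a hypothetical vector used only for this illustration
ratings <- c("excellent", "good", "excellent", "good", "average")
ratings_fac <- factor(ratings, ordered = TRUE,
                      levels = c("average", "good", "excellent"))

is.ordered(ratings_fac)                       # TRUE
sorted <- as.character(sort(ratings_fac))     # sorted by level order, not alphabetically
sorted                                        # "average" "good" "good" "excellent" "excellent"
best <- as.character(max(ratings_fac))        # "excellent"
n_good_or_better <- sum(ratings_fac >= "good")  # 4
```

The comparison with the string "good" works because it is matched against the levels of the ordered factor before comparing.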
2.11.4 Exercises
- What are the advantages of factors over vectors?
- Suppose we define x <- factor(1:5). What is the result of x[1] < x[2]? Please try to answer this question without R.
  - (a): TRUE
  - (b): FALSE
  - (c): NA
- Suppose we define x <- factor(1:5, ordered = TRUE). What is the result of x[1] < x[2]? Please try to answer this question without R.
  - (a): TRUE
  - (b): FALSE
  - (c): NA
- Suppose we define x <- factor(1:5, ordered = TRUE, levels = 5:1). What is the result of x[1] < x[2]? Please try to answer this question without R.
  - (a): TRUE
  - (b): FALSE
  - (c): NA
- Suppose size <- rep(c("big", "small", "medium"), 3:1). Convert it to an ordered factor with levels small < medium < big.