2.11 Character Vectors, Factors & Ordered Factors

Having learned about character vectors in Section 2.1.2, we now introduce a very important data type: the factor. First, let’s create a character vector to be used throughout this section.

animals <- c("sheep", "pig", "monkey", "sheep", "sheep", "pig")

2.11.1 Create a factor from a vector

So, what exactly is a factor? It can be viewed as a special type of vector whose elements take on a fixed and known set of different values. You can create a factor from a vector using the factor() function. To understand the output of a factor, it is helpful to compare the results with the original vector animals.

animals_fac <- factor(animals)  
animals_fac
#> [1] sheep  pig    monkey sheep  sheep  pig   
#> Levels: monkey pig sheep
animals
#> [1] "sheep"  "pig"    "monkey" "sheep"  "sheep"  "pig"

First, note that the elements of the character vector are printed with quotation marks, while the elements of the corresponding factor are not. Second, the factor output has an additional line starting with “Levels:”, which shows the unique elements of animals sorted alphabetically.

If you use the class() function on animals_fac, you will see it is indeed a factor.

class(animals_fac)
#> [1] "factor"
class(animals)
#> [1] "character"

To get the levels of a factor, you can use the levels() function on it.

levels(animals_fac)
#> [1] "monkey" "pig"    "sheep"

To gain a deeper understanding of factors, it is helpful to check their internal storage type using typeof().

typeof(animals_fac)
#> [1] "integer"
as.numeric(animals_fac)
#> [1] 3 2 1 3 3 2

Perhaps a bit surprisingly, a factor is stored as integers. Each integer gives the position of the corresponding element in the levels. For example, the first value of as.numeric(animals_fac) is 3, since the first element of animals_fac is "sheep", which is the third level. This storage mechanism is appealing because storing integers takes less space than storing the same strings repeatedly, as the original character vector does. At the same time, you can easily reproduce the original character vector from the integers and the factor levels using vector subsetting via indices.

levels(animals_fac)[as.numeric(animals_fac)]
#> [1] "sheep"  "pig"    "monkey" "sheep"  "sheep"  "pig"
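The round trip between codes and labels can also be sketched with as.integer() and as.character(). The vector pets below is a hypothetical copy of the example data, used so that animals_fac itself is left untouched.

```r
# Hypothetical example vector, mirroring the animals example.
pets <- c("sheep", "pig", "monkey", "sheep", "sheep", "pig")
pets_fac <- factor(pets)

# as.integer() returns the integer codes directly.
codes <- as.integer(pets_fac)
codes
#> [1] 3 2 1 3 3 2

# as.character() maps each code back to its level label.
restored <- as.character(pets_fac)
identical(restored, pets)
#> [1] TRUE
identical(restored, levels(pets_fac)[codes])
#> [1] TRUE
```

Here as.character() gives the same result as subsetting the levels by the codes, so it is a convenient shortcut for recovering the original strings.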

To show that factors can indeed take less memory than the corresponding character vectors when the levels are repeated many times, consider the following example, where we use the object.size() function to estimate the memory used to store each object.

many_animals <- rep(c("sheep", "pig", "monkey"), c(100,200,300))
many_animals_fac <- factor(many_animals)
object.size(many_animals)
#> 5016 bytes
object.size(many_animals_fac)
#> 3032 bytes

From this example, we can see that storing the information as a factor can offer substantial memory savings (about 40% in this example) compared to storing it as a character vector.

Another advantage of factors over character vectors is that they detect any input that is outside of the levels. Let’s try to assign the string “Tiger” to the first element of both animals_fac and animals.

animals_fac[1] <- "Tiger"
animals_fac
#> [1] <NA>   pig    monkey sheep  sheep  pig   
#> Levels: monkey pig sheep
animals[1] <- "Tiger"
animals
#> [1] "Tiger"  "pig"    "monkey" "sheep"  "sheep"  "pig"

Since "Tiger" is not in the set of levels, R issues a warning during the assignment and sets the first element to <NA>. When the same assignment is done on the character vector animals, there is no warning and the first element of animals is changed to “Tiger” as instructed. This is an attractive feature of factors that helps prevent input errors.
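If a new value is legitimate rather than a typo, one possible approach is to extend the levels before assigning it. The following is a minimal sketch on a hypothetical factor zoo_fac, not on animals_fac, so the running example is unchanged.

```r
# Hypothetical factor used only for this illustration.
zoo_fac <- factor(c("sheep", "pig", "monkey"))

# Append the new level first; existing values keep their labels.
levels(zoo_fac) <- c(levels(zoo_fac), "Tiger")

# The assignment now succeeds without a warning.
zoo_fac[1] <- "Tiger"
as.character(zoo_fac)
#> [1] "Tiger"  "pig"    "monkey"
levels(zoo_fac)
#> [1] "monkey" "pig"    "sheep"  "Tiger"
```

Note that "sheep" remains a level even though no element has that value any more.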

In addition to creating factors from character vectors, you can also create them from numeric vectors as well as logical vectors.

x <- rep(3:1, 1:3)
x_fac <- factor(x)
y <- rep(c(TRUE, FALSE), c(5, 3))
y_fac <- factor(y)

It is worth noting that after we convert a numeric vector into a factor, the usual arithmetic operations can no longer be applied, since the numbers become levels.

x_fac[1] + 1
#> Warning in Ops.factor(x_fac[1], 1): '+' not meaningful for factors
#> [1] NA

The result is NA with a warning message.
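If you need the numbers back, a common idiom is to convert the factor to character first and then to numeric. The sketch below uses a hypothetical vector z whose values differ from their level positions, which makes the difference between the two conversions visible.

```r
# Hypothetical numeric vector whose values differ from their level positions.
z <- c(10, 20, 20, 30)
z_fac <- factor(z)

# as.numeric() on the factor returns the level positions (codes).
as.numeric(z_fac)
#> [1] 1 2 2 3

# Going through the level labels recovers the original numbers.
z_back <- as.numeric(as.character(z_fac))
z_back
#> [1] 10 20 20 30

# Arithmetic works again on the recovered numeric vector.
z_back[1] + 1
#> [1] 11
```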

2.11.2 Set the factor levels and labels

As we have seen, the factor() function extracts the unique elements from a vector and sorts them to form its levels. To manually specify the levels and their order, you can set the levels argument. For example, if you only want "sheep" and "pig" as levels, you can use the following code.

factor(animals_fac, levels = c("pig", "sheep"))
#> [1] <NA>  pig   <NA>  sheep sheep pig  
#> Levels: pig sheep

As you can see, the third element becomes NA, since its corresponding element "monkey" in the original vector is not in the set of levels. The first element was already NA because of the "Tiger" assignment in Section 2.11.1.

You can also create labels to represent each level of the factor by setting the labels argument in the factor() function.

factor(animals_fac, levels = c("pig", "sheep"), labels = c("pretty_pig", "smart_sheep"))
#> [1] <NA>        pretty_pig  <NA>        smart_sheep smart_sheep pretty_pig 
#> Levels: pretty_pig smart_sheep

An alternative way to change the levels of a factor is to assign the desired level vector to the levels() function with the factor as its argument. For example, if you want to translate the animal names into Spanish, you can use

levels(animals_fac) <- c("mona", "cerda", "oveja")
animals_fac
#> [1] <NA>  cerda mona  oveja oveja cerda
#> Levels: mona cerda oveja
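The two approaches behave differently: the levels argument of factor() selects or reorders levels while keeping the values, whereas assigning to levels() renames the levels by position. A short sketch on a hypothetical factor f illustrates the difference.

```r
# Hypothetical factor used only for this comparison.
f <- factor(c("sheep", "pig", "monkey", "sheep"))

# Reordering: the values stay the same, only the level order changes.
f_reordered <- factor(f, levels = c("sheep", "pig", "monkey"))
as.character(f_reordered)
#> [1] "sheep"  "pig"    "monkey" "sheep"
levels(f_reordered)
#> [1] "sheep"  "pig"    "monkey"

# Renaming: each old level is replaced by the new label in the same position.
f_renamed <- f
levels(f_renamed) <- c("mona", "cerda", "oveja")
as.character(f_renamed)
#> [1] "oveja" "cerda" "mona"  "oveja"
```

Because renaming is positional, the new labels must be listed in the same order as the current levels (here monkey, pig, sheep).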

2.11.3 Ordered factors

By default, the function factor() creates an unordered factor, which is usually used when there is no natural ordering among the levels. Sometimes, however, the levels do have a natural ordering. Let’s see an example.

conditions <- c("excellent", "good", "excellent", "good", "average")
factor(conditions)
#> [1] excellent good      excellent good      average  
#> Levels: average excellent good

Unlike the animals in animals, the conditions have a natural ordering. We know \(average < good < excellent\). To reflect this in a factor, you can create a so-called ordered factor by setting ordered = TRUE and specifying the levels in ascending order.

condition_ordered_fac <- factor(conditions, ordered = TRUE, levels = c("average", "good", "excellent"))
condition_ordered_fac
#> [1] excellent good      excellent good      average  
#> Levels: average < good < excellent

We can see that the ordering is shown in the “Levels:” line. You can also do comparisons on ordered factors.

condition_ordered_fac[1] < condition_ordered_fac[2]
#> [1] FALSE

The result is FALSE since \(excellent > good\). We will revisit the topic of factor ordering when generating bar charts in Section 4.10.2.
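Beyond pairwise comparisons, ordered factors also work with functions that rely on an ordering. The following sketch uses a hypothetical factor ratings with the same values as conditions to show sorting, taking the maximum, and comparing against a level name.

```r
# Hypothetical ordered factor with the same values as conditions.
ratings <- factor(c("excellent", "good", "excellent", "good", "average"),
                  levels = c("average", "good", "excellent"),
                  ordered = TRUE)

# sort() follows the level order, not alphabetical order.
as.character(sort(ratings))
#> [1] "average"   "good"      "good"      "excellent" "excellent"

# max() returns the highest level that occurs in the data.
as.character(max(ratings))
#> [1] "excellent"

# An ordered factor can be compared against a level name.
ratings >= "good"
#> [1]  TRUE  TRUE  TRUE  TRUE FALSE
```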

2.11.4 Exercises

  1. What are the advantages of factors over vectors?
  2. Suppose we define x <- factor(1:5). What is the result of x[1] < x[2]? Please try to answer this question without R.
  • (a): TRUE
  • (b): FALSE
  • (c): NA
  3. Suppose we define x <- factor(1:5, ordered = TRUE). What is the result of x[1] < x[2]? Please try to answer this question without R.
  • (a): TRUE
  • (b): FALSE
  • (c): NA
  4. Suppose we define x <- factor(1:5, ordered = TRUE, levels = 5:1). What is the result of x[1] < x[2]? Please try to answer this question without R.
  • (a): TRUE
  • (b): FALSE
  • (c): NA
  5. Suppose size <- rep(c("big", "small", "medium"), 3:1). Convert it to an ordered factor with levels small < medium < big.