2.10 Character Vectors: Sort, Rank, Order

In Section 2.7, you learned sort(), rank(), and order() to sort numeric vectors and get their elements’ ranks and indices. These three functions work in a similar manner for character vectors. As with numeric vectors, let’s first prepare a character vector.

char_vec <- c("a", "A", "B", "b", "aB", "ac", "1c", ".a", "1a", "2a", ".a", "&u",
    "3", "_4")

2.10.1 Ordering rules

For character vectors, R uses lexicographical ordering, which is sometimes called dictionary order since it is the order used in a dictionary. Note that the strings in character vectors can contain letters, digits, or symbols. A few important ordering rules follow.

  • symbols < digits < letters: symbols appear first, followed by digits, and letters appear last.

  • symbols are ordered in a specific way as shown below.

syms <- c(" ", ",", ";", "_", "(", ")", "!", "[", "]", "{", "}", "-", "*", "/", "#",
    "$", "%", "^", "&", "`", "@", "+", "=", "|", "?", "<", ">", ".")
sort(syms)
#>  [1] " " "_" "-" "," ";" "!" "?" "." "(" ")" "[" "]" "{" "}" "@" "*" "/" "&" "#"
#> [20] "%" "`" "^" "+" "<" "=" ">" "|" "$"
  • digits are ordered ascendingly: the smaller digits appear earlier than the bigger ones.
nums <- as.character(0:9)
sort(nums)
#>  [1] "0" "1" "2" "3" "4" "5" "6" "7" "8" "9"
  • Letters have two sorting rules. In R, letters is a pre-created character vector with all 26 lowercase letters of the alphabet, and LETTERS is another character vector with all 26 uppercase letters. Case-wise, lowercase goes before uppercase. Letter-wise, letters are sorted alphabetically.
all_letters <- c(letters, LETTERS)
sort(all_letters)
#>  [1] "a" "A" "b" "B" "c" "C" "d" "D" "e" "E" "f" "F" "g" "G" "h" "H" "i" "I" "j"
#> [20] "J" "k" "K" "l" "L" "m" "M" "n" "N" "o" "O" "p" "P" "q" "Q" "r" "R" "s" "S"
#> [39] "t" "T" "u" "U" "v" "V" "w" "W" "x" "X" "y" "Y" "z" "Z"

The example below shows that the letter-wise rule takes priority over the case-wise rule.

x <- c("d", "c")
sort(x)
#> [1] "c" "d"
y <- c("d", "C")
sort(y)
#> [1] "C" "d"
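
The rules above reflect the collating sequence of the locale in which the examples were run, so other locales may order the same strings differently. As a hedged illustration (the vector z is a made-up example, not part of the original text), passing method = "radix" to sort() ignores the locale and compares characters by their byte values, which puts uppercase letters before lowercase ones.

```r
# z is a hypothetical example vector used only for this illustration.
z <- c("a", "B")

# Radix sort does not respect the locale's collation:
# "B" (code 66) comes before "a" (code 97).
sort(z, method = "radix")
#> [1] "B" "a"

# Sys.getlocale() reports the collation setting that the default sort() follows.
Sys.getlocale("LC_COLLATE")
```

Calling sort(z) without the method argument in a typical English locale usually gives "a" "B" instead, in line with the rules listed in this subsection.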

2.10.2 Sort vectors with sort()

You can apply sort() to character vectors as well. A character vector’s elements (i.e., strings) are compared by their first character. If two elements have the same first character, they are compared by their second character, and so on. The comparison continues position by position until the tie is broken or one of the strings runs out of characters, in which case the shorter string goes first.
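
The position-by-position rule can be checked directly with the < operator, which compares strings using the same locale-based ordering as sort(). The following is a minimal sketch using a few strings taken from char_vec.

```r
# Same first character "1"; the second characters decide: "a" goes before "c".
"1a" < "1c"
#> [1] TRUE

# "a" runs out of characters first, so it goes before the longer "aB".
"a" < "aB"
#> [1] TRUE

# Splitting a string shows the characters that are compared at each position.
strsplit("1a", split = "")[[1]]
#> [1] "1" "a"
strsplit("1c", split = "")[[1]]
#> [1] "1" "c"
```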

Let’s try to sort the character vector char_vec.

sort(char_vec)
#>  [1] "_4" ".a" ".a" "&u" "1a" "1c" "2a" "3"  "a"  "A"  "aB" "ac" "b"  "B"

We have the following observations.

  • Symbols appear first, followed by digits, and letters appear last.
  • According to the ordering rule for symbols, _4 goes first, followed by .a (two of them) and then &u.
  • 1a and 1c have the same first character. Comparing their second characters, a goes before c, so 1a goes before 1c.
  • aB and ac have the same first character. Since b goes before c letter-wise (although B is uppercase while c is lowercase), aB goes before ac.

Of course, we can also have the order reversed by adding the argument decreasing = TRUE inside sort().

sort(char_vec, decreasing = TRUE)
#>  [1] "B"  "b"  "ac" "aB" "A"  "a"  "3"  "2a" "1c" "1a" "&u" ".a" ".a" "_4"

2.10.3 Get ranks in vectors with rank()

The same ordering rules introduced in subsection 2.10.1 apply when R ranks a character vector’s strings. Here, the element with rank 1 is _4, and the two .a strings share rank 2.5. Just like numeric vectors, if a character vector contains identical strings, these elements receive the same rank (the average of the ranks they would otherwise occupy) by default.

rank(char_vec)
#>  [1]  9.0 10.0 14.0 13.0 11.0 12.0  6.0  2.5  5.0  7.0  2.5  4.0  8.0  1.0

As expected, you can set the ties.method argument in rank() to use other methods for breaking ties.

rank(char_vec, ties.method = "min")
#>  [1]  9 10 14 13 11 12  6  2  5  7  2  4  8  1
rank(char_vec, ties.method = "first")
#>  [1]  9 10 14 13 11 12  6  2  5  7  3  4  8  1
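
To see the tie-breaking methods side by side, here is a small sketch on a made-up vector (ties_vec is a hypothetical name, not from the text above) in which "b" appears twice.

```r
# ties_vec is a hypothetical example: the two "b" strings occupy
# sorted positions 2 and 3.
ties_vec <- c("b", "a", "b")

rank(ties_vec)                         # default "average": (2 + 3) / 2 = 2.5
#> [1] 2.5 1.0 2.5
rank(ties_vec, ties.method = "min")    # both ties get the smaller rank
#> [1] 2 1 2
rank(ties_vec, ties.method = "max")    # both ties get the larger rank
#> [1] 3 1 3
rank(ties_vec, ties.method = "first")  # the earlier occurrence gets the smaller rank
#> [1] 2 1 3
```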

2.10.4 Get the ordering permutation via order()

Again, you can use order() to get the indices that would sort a character vector’s strings. By default, order() breaks ties by order of appearance: the .a at position 8 comes before the .a at position 11.

order(char_vec)
#>  [1] 14  8 11 12  9  7 10 13  1  2  5  6  4  3

The decreasing argument still works for order():

order(char_vec, decreasing = TRUE)
#>  [1]  3  4  6  5  2  1 13 10  7  9 12  8 11 14
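
The indices returned by order() are most useful for subsetting. As a sketch of the idea, indexing char_vec with its own ordering permutation reproduces the sorted vector. char_vec is redefined here so that the block runs on its own, and idx is a name introduced only for this illustration.

```r
char_vec <- c("a", "A", "B", "b", "aB", "ac", "1c", ".a", "1a", "2a", ".a", "&u",
    "3", "_4")

# Indexing a vector by its ordering permutation sorts it.
idx <- order(char_vec)
identical(char_vec[idx], sort(char_vec))
#> [1] TRUE

# The same holds in decreasing order.
identical(char_vec[order(char_vec, decreasing = TRUE)],
    sort(char_vec, decreasing = TRUE))
#> [1] TRUE
```

The same index vector can also reorder any other vector of the same length, which sort() alone cannot do.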

2.10.5 Summary and comparisons

Let’s put all the sort-related functions together to clarify what each one does.

char_vec
#>  [1] "a"  "A"  "B"  "b"  "aB" "ac" "1c" ".a" "1a" "2a" ".a" "&u" "3"  "_4"
sort(char_vec)
#>  [1] "_4" ".a" ".a" "&u" "1a" "1c" "2a" "3"  "a"  "A"  "aB" "ac" "b"  "B"

The sort() function sorts char_vec according to the previously introduced ordering rules. _4 goes first because its first character is a symbol (symbols < digits < letters), and _ goes before “.” and “&”.

rank(char_vec)
#>  [1]  9.0 10.0 14.0 13.0 11.0 12.0  6.0  2.5  5.0  7.0  2.5  4.0  8.0  1.0

The rank() function gives the ranks of strings inside char_vec. The first returned rank is 9.0 because the first string in char_vec, namely “a”, is sorted in the 9th place in char_vec. The second returned rank is 10.0 because the second string in char_vec, namely “A”, is sorted in the 10th place in char_vec.

order(char_vec)
#>  [1] 14  8 11 12  9  7 10 13  1  2  5  6  4  3

The order() function gives the ordering permutation of char_vec. The first returned order is 14 because sorting char_vec will put this character vector’s 14th string, namely “_4”, in the first place. Similarly, the second returned order is 8 because sorting char_vec will put this character vector’s 8th string, namely “.a”, in the second place.
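
The two functions are linked: when ties are broken by order of appearance, the ranks are the ordering permutation of the ordering permutation. The sketch below checks this relationship on char_vec, which is redefined so the block runs on its own; ord and rnk are names introduced only for this illustration.

```r
char_vec <- c("a", "A", "B", "b", "aB", "ac", "1c", ".a", "1a", "2a", ".a", "&u",
    "3", "_4")

ord <- order(char_vec)
rnk <- rank(char_vec, ties.method = "first")

# Applying order() twice recovers the ranks (ties broken by appearance).
all(order(ord) == rnk)
#> [1] TRUE

# Picking the sorted strings by rank puts every string back in its original place.
identical(sort(char_vec)[rnk], char_vec)
#> [1] TRUE
```

In other words, order() answers "which element goes in each sorted position?", while rank() answers "which sorted position does each element go to?".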

2.10.6 Exercises

Suppose exercise <- c("&5", "Nd", "9iC", "3df", "df", "nd", "_5", "9ic")

  1. Sort exercise in ascending order. Explain why 1) 3df goes before 9ic; 2) &5 goes before 3df; 3) 9ic goes before 9iC.

  2. Apply both rank() and order() to the character vector exercise. The element 7 appears in both returned vectors; explain what it means in each of them.