2.10 Character Vectors: Sort, Rank, Order
In Section 2.7, you learned sort(), rank(), and order() to sort numeric vectors and get their elements’ ranks and indices. These three functions can be used in a similar manner for character vectors. Similar to numeric vectors, let’s first prepare a character vector.
2.10.1 Ordering rules
For character vectors, R uses lexicographical ordering, sometimes called dictionary order because it is the order used in a dictionary. The exact order depends on the collation rules of the current locale (see ?Comparison); the rules below describe what you will see in a typical English locale. Note that the strings in character vectors can contain letters, numbers, or symbols. A few important ordering rules follow.
- Symbols < digits < letters: symbols appear first, followed by digits, and letters appear last.
- Symbols are ordered in a specific way as shown below.
syms <- c(" ", ",", ";", "_", "(", ")", "!", "[", "]", "{", "}", "-", "*", "/", "#",
"$", "%", "^", "&", "`", "@", "+", "=", "|", "?", "<", ">", ".")
sort(syms)
#> [1] " " "_" "-" "," ";" "!" "?" "." "(" ")" "[" "]" "{" "}" "@" "*" "/" "&" "#"
#> [20] "%" "`" "^" "+" "<" "=" ">" "|" "$"
- Digits are sorted in ascending order: smaller digits appear before larger ones.
- Letters have two sorting rules. In R, letters is a built-in character vector with all 26 lowercase letters of the alphabet, and LETTERS is another built-in character vector with all 26 uppercase letters. Case-wise, lowercase goes before uppercase. Letter-wise, letters are sorted alphabetically.
all_letters <- c(letters, LETTERS)
sort(all_letters)
#> [1] "a" "A" "b" "B" "c" "C" "d" "D" "e" "E" "f" "F" "g" "G" "h" "H" "i" "I" "j"
#> [20] "J" "k" "K" "l" "L" "m" "M" "n" "N" "o" "O" "p" "P" "q" "Q" "r" "R" "s" "S"
#> [39] "t" "T" "u" "U" "v" "V" "w" "W" "x" "X" "y" "Y" "z" "Z"
The output above shows that the letter-wise rule takes priority over the case-wise rule: "A" comes before "b" even though "A" is uppercase and "b" is lowercase.
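As a small check of these rules, the sketch below sorts two illustrative vectors that are not part of this section's examples (mixed and cased are made-up names). Digits sort before letters in any common locale, whereas the case rule depends on the locale. Passing method = "radix" makes sort() use C-locale order, in which all uppercase letters come before all lowercase letters.

```r
# Illustrative vector (an assumption, not from the text): digits and letters.
mixed <- c("b", "2", "a", "1")
sorted_mixed <- sort(mixed)
sorted_mixed
#> [1] "1" "2" "a" "b"

# The case rule is locale-dependent.
cased <- c("b", "A", "a", "B")

# Locale-based sort: under the rules described in this section,
# the expected order is a, A, b, B.
sort(cased)

# method = "radix" ignores the locale and sorts in C-locale order,
# so uppercase letters come first.
sorted_c <- sort(cased, method = "radix")
sorted_c
#> [1] "A" "B" "a" "b"
```

If your results differ from the outputs printed in this section, comparing the default sort with the radix sort is a quick way to see whether the locale is the cause.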
2.10.2 Sort vectors with sort()
You can apply sort() to character vectors. A character vector’s elements (i.e., strings) are sorted by their first character. If two elements have the same first character, they are sorted by their second character. The comparison continues position by position until the tie is broken or one of the strings runs out of characters, in which case the shorter string goes first.
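The position-by-position comparison can be seen on a tiny vector. The strings below are illustrative only (tie_vec is a made-up name, not this section's char_vec): two strings tie on their first character and are separated by the second, and a string that runs out of characters is placed before a longer string that starts with it.

```r
# Illustrative strings (an assumption, not the section's char_vec).
tie_vec <- c("1c", "ab", "1a", "a")

# "1a" and "1c" tie on "1", so the second character decides.
# "a" is a prefix of "ab", so the shorter string goes first.
sorted_tie <- sort(tie_vec)
sorted_tie
#> [1] "1a" "1c" "a"  "ab"
```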
Let’s try to sort the character vector char_vec.
We have the following observations.
- Symbols appear first, followed by digits, and letters appear last.
- According to the ordering rule of symbols, _4 goes first, then .a (two of them) and &u follow.
- 1a and 1c have the same first character. Comparing their second characters, a goes before c, therefore 1a goes before 1c.
- aB and ac have the same first character. Because b goes before c letter-wise (although B is uppercase and c is lowercase), aB goes before ac.
Of course, we can also reverse the order by adding the argument decreasing = TRUE inside sort().
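A minimal sketch of the reversed sort, using an illustrative vector rather than char_vec (dec_vec is a made-up name):

```r
# Illustrative strings (an assumption, not the section's char_vec).
dec_vec <- c("1c", "ab", "1a", "a")

# decreasing = TRUE returns the strings from last to first.
sorted_dec <- sort(dec_vec, decreasing = TRUE)
sorted_dec
#> [1] "ab" "a"  "1c" "1a"
```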
2.10.3 Get ranks in vectors with rank()
The ordering rules introduced in subsection 2.10.1 also apply when R ranks a character vector’s strings. Here, the element with rank 1 is _4 and .a has rank 2. Just like numeric vectors, if a character vector contains identical strings, these elements receive the same rank (the average of the ranks they occupy) by default.
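To see the default averaging on its own, here is a short sketch with an illustrative vector (rank_vec is a made-up name, not char_vec). The two "b" strings occupy the 2nd and 3rd places, so each receives rank 2.5.

```r
# Illustrative strings with one tie (an assumption, not the section's char_vec).
rank_vec <- c("b", "a", "b", "c")

# Default ties.method is "average": the tied "b" strings share (2 + 3) / 2.
avg_ranks <- rank(rank_vec)
avg_ranks
#> [1] 2.5 1.0 2.5 4.0
```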
As expected, you can set the ties.method argument in rank() to use other methods for breaking ties.
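As a sketch of two of the alternatives, the example below applies ties.method = "min" and ties.method = "first" to an illustrative vector (ties_vec is a made-up name). With "min", tied strings all get the smallest rank in their group; with "first", ties are broken by order of appearance.

```r
# Illustrative strings with one tie (an assumption, not the section's char_vec).
ties_vec <- c("b", "a", "b", "c")

# "min": both "b" strings get the lower of the ranks 2 and 3.
min_ranks <- rank(ties_vec, ties.method = "min")
min_ranks
#> [1] 2 1 2 4

# "first": the "b" that appears earlier gets the smaller rank.
first_ranks <- rank(ties_vec, ties.method = "first")
first_ranks
#> [1] 2 1 3 4
```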
2.10.4 Get the ordering permutation via order()
Again, you can use the same order() function to get the indices that would sort a character vector’s strings. By default, order() breaks ties by order of appearance.
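The following sketch uses an illustrative vector (ord_vec is a made-up name, not char_vec) to show both points: the returned values are positions in the original vector, and the tied "b" strings keep their original relative order.

```r
# Illustrative strings with one tie (an assumption, not the section's char_vec).
ord_vec <- c("b", "a", "b", "c")

# "a" is at position 2; the two "b" strings are at positions 1 and 3
# and stay in that order; "c" is at position 4.
ord_idx <- order(ord_vec)
ord_idx
#> [1] 2 1 3 4

# Indexing with the permutation reproduces the sorted vector.
ord_vec[ord_idx]
#> [1] "a" "b" "b" "c"
```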
The decreasing argument still works for order():
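For instance, with an illustrative vector without ties (desc_vec is a made-up name, not char_vec), a minimal sketch looks like this:

```r
# Illustrative strings without ties (an assumption, not the section's char_vec).
desc_vec <- c("b", "a", "c")

# "c" (position 3) comes first, then "b" (position 1), then "a" (position 2).
desc_idx <- order(desc_vec, decreasing = TRUE)
desc_idx
#> [1] 3 1 2
```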
2.10.5 Summary and comparisons
Let’s put all sort-related functions together to clarify what each one does.
The sort() function sorts char_vec according to the previously introduced ordering rules. _4 goes first because its first character is a symbol (symbols < digits < letters), and “_” goes before “.” and “&”.
The rank() function gives the ranks of the strings inside char_vec. The first returned rank is 9.0 because the first string in char_vec, namely “a”, is sorted in the 9th place. The second returned rank is 10.0 because the second string in char_vec, namely “A”, is sorted in the 10th place.
The order() function gives the ordering permutation of char_vec. The first returned order is 14 because sorting char_vec will put this character vector’s 14th string, namely “_4”, in the first place. Similarly, the second returned order is 8 because sorting char_vec will put this character vector’s 8th string, namely “.a”, in the second place.
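The relationship among the three functions can be checked on a small vector. The sketch below uses illustrative strings without ties (cmp_vec is a made-up name, not char_vec): rank() answers "where does each string end up?", while order() answers "which string goes in each place?".

```r
# Illustrative strings without ties (an assumption, not the section's char_vec).
cmp_vec <- c("b", "d", "a", "c")

cmp_sorted <- sort(cmp_vec)
cmp_sorted
#> [1] "a" "b" "c" "d"

# "b" ends up 2nd, "d" 4th, "a" 1st, "c" 3rd.
cmp_rank <- rank(cmp_vec)
cmp_rank
#> [1] 2 4 1 3

# 1st place is taken by element 3 ("a"), 2nd place by element 1 ("b"), ...
cmp_order <- order(cmp_vec)
cmp_order
#> [1] 3 1 4 2

# Indexing by order() reproduces sort(); without ties,
# applying order() twice reproduces rank().
identical(cmp_vec[cmp_order], cmp_sorted)
#> [1] TRUE
all(order(cmp_order) == cmp_rank)
#> [1] TRUE
```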
2.10.6 Exercises
Suppose exercise <- c("&5", "Nd", "9iC", "3df", "df", "nd", "_5", "9ic")
- Sort exercise in ascending order. Explain why 1) 3df goes before 9ic; 2) &5 goes before 3df; 3) 9ic goes before 9iC.
- Apply both rank() and order() to the character vector exercise. Explain what the element 7 means in each of the two returned vectors.