8.2 Matching Pattern with Regular Expressions

In many applications, we may want to find out which strings match a certain pattern. To illustrate this, we will be using the variable names from the sahp data set from r02pro package.

library(r02pro)
library(stringr)
sahp_names <- colnames(sahp)

The stringr package provides a very useful function called str_view() to highlight the elements that match the given pattern. To create a logical vector that reflect this match, you can use the str_detect() function.

8.2.1 Basic Matches

Let’s describe a few commonly used patterns.

a. contain a given string

To find all strings that contain "room", you can run the following code.

str_view(sahp_names, "room")
str_detect(sahp_names, "room") #> [1] FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

You can see that the str_detect() function returns a logical vector that reflects whether this is a match. In the rest of this section, we will be focusing on demonstration with str_view().

From the result, we can see that "bedroom" and "bathroom" are highlighted, since they both contain the string "room".

b. including . in the pattern

In addition to specifying the exact string for matching, you can also use the wild symbol . to represent any character.

str_view(sahp_names, ".h")
str_view(sahp_names, ".s.")

The first pattern ".h" represents any character plus the letter h. Note that "house_style" is not a match since it doesn’t have any character before the letter "h".

The second patter ".s." represents a length-3 substring, with any two characters around "h". Again, here "sale_price" is not a match since "s" is a first character.

c. including anchors in the pattern

Sometimes, we may want to match the start of the string ("^"), or the end of the string ("$").

str_view(sahp_names, "^l")

The highlighted strings "liv_area" and "lot_area" both start with letter "l". Although "oa_qual" and "house_style" contain "l", they are not selected since "l" is not the start of the strings. Now, let’s find the strings the end with "l".

str_view(sahp_names, "l$")

We now see "oa_qual", "kit_qual", and "heat_qual" all selected.

8.2.2 Character Classes Matches

a. Single character match

Sometimes, we may want to match the characters with a specific class or a group of values instead of exact values. Here is a list of the commonly used classes.

  • \d: matches any digit. In other words, numbers from 0 to 9.
  • \s: matches any whitespace (e.g. space, tab, newline). For example, , \t, \n.
  • [xyz]: matches x, y, or z.
  • [^xyz]: matches anything except x, y, or z. (The opposite of [xyz])
  • [a-z] or [:lower:]: matches every character between a and z.
  • [A-Z] or [:upper:]: matches every character between A and Z.

Note that all of these are detecting a single character from the candidate ones.

Let’s see some examples.

my_char <- c("abc","a1","2b","33c","d 2", "d f  3")
str_view(my_char, "\\d")

First of all, we need to have "\\" since "\" is a special character that needs the espace character "\". Here, as long as the string contains at least one digits, it will be matched.

str_view(my_char, "\\s\\d")

Here, the pattern "\s\d" represents a pattern of a whitespace followed by a digit. Here, only "d 2" and "d f 3" match this pattern.

str_view(my_char, "[ac3]")

Here, "[ac3]" indicates that as long as the string contains "a", "c", or "3", it will be matched.

str_view(my_char, "[^abc3]")

Here, "[^abc3]" indicates that as long as the string contains any character that is not one of "a", "b", "c", and "3", it will be matched.

b. Repeated characters match Sometimes, you may want to match a pattern multiple times consecutively. While you can do it manually by repeating the pattern, there is an easier way to do this by using the {} with the number inside. Let’s create a string and see some examples.

my_str_rep <- c("bbbb", "1234a",
                "bbb", "abcdef",
                "bb ee", "123456789")

b{3} will match 3 b’s.

str_view(my_str_rep, "b{3}")

To match a 9-digit number, you can use [1-9]{9}.

str_view(my_str_rep, "[1-9]{9}")

[a-z]{4} matches all four-letter word.

str_view(my_str_rep, "[a-z]{4}")

[a-z1-9]{4,} will match all strings which has at least 4 lower case letters or numbers.

str_view(my_str_rep, "[a-z1-9]{4,}")