8.2 Matching Pattern with Regular Expressions
In many applications, we may want to find out which strings match a certain pattern. To illustrate this, we will use the variable names from the sahp data set in the r02pro package.
library(r02pro)
library(stringr)
sahp_names <- colnames(sahp)
The stringr package provides a very useful function called str_view() to highlight the elements that match a given pattern. To create a logical vector that reflects these matches, you can use the str_detect() function.
8.2.1 Basic Matches
Let’s describe a few commonly used patterns.
a. contain a given string
To find all strings that contain "room", you can run the following code.
str_view(sahp_names, "room")
str_detect(sahp_names, "room")
You can see that the str_detect() function returns a logical vector that indicates whether each element is a match. In the rest of this section, we will focus on demonstrations with str_view().
From the result, we can see that "bedroom" and "bathroom" are highlighted, since they both contain the string "room".
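As a minimal sketch of the difference between highlighting and detecting, the code below applies str_detect() and str_subset() to a small hand-typed vector. The vector some_names is an assumption made for illustration: it holds a few of the variable names mentioned in this section so that the example runs without the r02pro package.

```r
library(stringr)

# Hypothetical subset of variable names, typed by hand for illustration
some_names <- c("bedroom", "bathroom", "house_style", "sale_price", "liv_area")

# Logical vector: TRUE where the name contains "room"
has_room <- str_detect(some_names, "room")
has_room
# TRUE TRUE FALSE FALSE FALSE

# Keep only the names that match
room_names <- str_subset(some_names, "room")
room_names
# "bedroom" "bathroom"
```

The logical vector is convenient for subsetting, while str_subset() returns the matching strings directly.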
b. including . in the pattern
In addition to specifying the exact string for matching, you can also use the wildcard symbol . to represent any single character (except a newline).
str_view(sahp_names, ".h")
str_view(sahp_names, ".s.")
The first pattern ".h" represents any character followed by the letter h. Note that "house_style" is not a match since it doesn’t have any character before the letter "h".
The second pattern ".s." represents a length-3 substring with any character on each side of "s". Again, "sale_price" is not a match since its only "s" is the first character.
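To see exactly which substring the wildcard picks up, the following sketch uses str_extract(), which returns the first match in each string. The vector some_names is again a hand-typed, hypothetical subset of the variable names, not the full data set.

```r
library(stringr)

# Hypothetical subset of variable names, typed by hand for illustration
some_names <- c("house_style", "sale_price", "bathroom", "heat_qual")

# ".h" needs one character before "h", so a leading "h" alone does not match
dot_h <- str_extract(some_names, ".h")
dot_h
# NA NA "th" NA

# ".s." needs one character on each side of "s"
dot_s_dot <- str_extract(some_names, ".s.")
dot_s_dot
# "use" NA NA NA
```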
c. including anchors in the pattern
Sometimes, we may want to match the start of the string ("^") or the end of the string ("$").
str_view(sahp_names, "^l")
The highlighted strings "liv_area" and "lot_area" both start with the letter "l". Although "oa_qual" and "house_style" contain "l", they are not selected since "l" is not at the start of these strings. Now, let’s find the strings that end with "l".
str_view(sahp_names, "l$")
We now see that "oa_qual", "kit_qual", and "heat_qual" are all selected.
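The two anchors can also be combined to require that the whole string equals the pattern. The sketch below illustrates this on a hand-typed vector; some_names and the name "lot_area_2" are hypothetical and used only for demonstration.

```r
library(stringr)

# Hypothetical subset of variable names, typed by hand for illustration
some_names <- c("liv_area", "lot_area", "oa_qual", "house_style", "kit_qual")

# Names that start with "l"
starts_l <- str_subset(some_names, "^l")
starts_l
# "liv_area" "lot_area"

# Names that end with "l"
ends_l <- str_subset(some_names, "l$")
ends_l
# "oa_qual" "kit_qual"

# Both anchors together: the entire string must be "lot_area"
exact <- str_detect(c("lot_area", "lot_area_2"), "^lot_area$")
exact
# TRUE FALSE
```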
8.2.2 Character Classes Matches
a. Single character match
Sometimes, we may want to match characters from a specific class or set of values instead of an exact character. Here is a list of the commonly used classes.
\d: matches any digit. In other words, numbers from 0 to 9.
\s: matches any whitespace, for example a space, a tab (\t), or a newline (\n).
[xyz]: matches x, y, or z.
[^xyz]: matches anything except x, y, or z. (The opposite of [xyz])
[a-z] or [:lower:]: matches every character between a and z.
[A-Z] or [:upper:]: matches every character between A and Z.
Note that each of these matches a single character from the candidate set.
Let’s see some examples.
my_char <- c("abc","a1","2b","33c","d 2", "d f 3")
str_view(my_char, "\\d")
First of all, we need to write "\\" since "\" is a special character in R strings that must itself be escaped with another "\". Here, as long as the string contains at least one digit, it will be matched.
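The double backslash can be confusing, so here is a small sketch showing that the R string "\\d" actually contains only two characters, a backslash and the letter d, which is what the regular expression engine receives. The variable name pattern is chosen for illustration only.

```r
library(stringr)

# In R source code, "\\d" is a two-character string: a backslash and "d"
pattern <- "\\d"
n_chars <- nchar(pattern)
n_chars
# 2

# writeLines() prints the string as the regex engine sees it: \d
writeLines(pattern)

# The pattern matches strings that contain at least one digit
has_digit <- str_detect(c("abc", "a1"), pattern)
has_digit
# FALSE TRUE
```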
str_view(my_char, "\\s\\d")
Here, the pattern "\s\d" represents a whitespace character followed by a digit. Only "d 2" and "d f 3" match this pattern.
str_view(my_char, "[ac3]")
Here, "[ac3]" indicates that as long as the string contains "a", "c", or "3", it will be matched.
str_view(my_char, "[^abc3]")
Here, "[^abc3]" indicates that as long as the string contains any character that is not one of "a", "b", "c", and "3", it will be matched.
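To make the outcome of these character-class patterns explicit, the sketch below uses str_subset() on the same my_char vector to list which strings match each pattern. It is meant as a quick check rather than a replacement for str_view().

```r
library(stringr)

my_char <- c("abc", "a1", "2b", "33c", "d 2", "d f 3")

# Strings containing at least one of "a", "c", or "3"
in_set <- str_subset(my_char, "[ac3]")
in_set
# "abc" "a1" "33c" "d f 3"

# Strings containing at least one character outside "a", "b", "c", "3"
out_set <- str_subset(my_char, "[^abc3]")
out_set
# "a1" "2b" "d 2" "d f 3"

# Strings containing a whitespace character followed by a digit
space_digit <- str_subset(my_char, "\\s\\d")
space_digit
# "d 2" "d f 3"
```

Note that "abc" and "33c" are not matched by "[^abc3]" because every one of their characters belongs to the excluded set.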
b. Repeated characters match
Sometimes, you may want to match a pattern multiple times consecutively. While you can do this manually by repeating the pattern, an easier way is to use {} with the number of repetitions inside. Let’s create a character vector and see some examples.
my_str_rep <- c("bbbb", "1234a",
                "bbb", "abcdef",
                "bb ee", "123456789")
b{3} will match three consecutive b’s.
str_view(my_str_rep, "b{3}")
To match a number made of nine digits, each between 1 and 9, you can use [1-9]{9}.
str_view(my_str_rep, "[1-9]{9}")
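One caveat is that the class [1-9] does not include the digit 0. The following sketch compares it with [0-9] and \d on a hypothetical input, "100000000", which is not part of my_str_rep and is used only to illustrate the difference.

```r
library(stringr)

# Hypothetical nine-digit input that contains zeros
with_zeros <- "100000000"

# [1-9] excludes 0, so there is no run of nine such digits
match_1_9 <- str_detect(with_zeros, "[1-9]{9}")
match_1_9
# FALSE

# [0-9] and \d both include 0
match_0_9 <- str_detect(with_zeros, "[0-9]{9}")
match_d <- str_detect(with_zeros, "\\d{9}")
c(match_0_9, match_d)
# TRUE TRUE
```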
[a-z]{4} matches four consecutive lowercase letters.
str_view(my_str_rep, "[a-z]{4}")
[a-z1-9]{4,} will match any string that contains at least four consecutive characters that are lowercase letters or digits from 1 to 9.
str_view(my_str_rep, "[a-z1-9]{4,}")
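To check what each repetition pattern actually captures, the sketch below applies str_extract() to my_str_rep. It returns the first matched substring for each element and NA where there is no match.

```r
library(stringr)

my_str_rep <- c("bbbb", "1234a",
                "bbb", "abcdef",
                "bb ee", "123456789")

# Exactly three consecutive b's (the first three of "bbbb" are matched)
three_b <- str_extract(my_str_rep, "b{3}")
three_b
# "bbb" NA "bbb" NA NA NA

# Four consecutive lowercase letters
four_letters <- str_extract(my_str_rep, "[a-z]{4}")
four_letters
# "bbbb" NA NA "abcd" NA NA

# Four or more consecutive lowercase letters or digits 1 to 9
four_plus <- str_extract(my_str_rep, "[a-z1-9]{4,}")
four_plus
# "bbbb" "1234a" NA "abcdef" NA "123456789"
```

Note that {4} stops after four characters, so only "abcd" is extracted from "abcdef", while {4,} keeps going and extracts the whole run.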