Examples of using regular expressions in R

In this post I’ll be walking through some basic examples of regex, or regular expressions, and also providing a couple real-world examples.

To quote Hadley Wickham and Garrett Grolemund (2017), regex is a “language that allow[s] you to describe patterns in strings." When we have string/character data in R, we may want to extract information from it. And regex is a tool for doing this.

As a conceptual example, consider the sentence below:

“The user’s phone number is (555) 555-5555.”

This string of text could be something scraped from a PDF, or perhaps provided as input to a survey. Our end goal might be to extract and store the phone number in that sentence. One way to do this would be to write a regular expression that describes the pattern of a phone number, which R can then use to extract the phone number from the string.

Another common example is pulling metadata from filenames. If you have a folder of rasters with names like "farm_aerial_imagery_whitman_county_2021.tiff", you might want to filter them based on their contents. Using regex we can give R instructions to separate out the county name and year from the filenames, which allows us to determine the spatial and temporal coverage of our files.

Now I’ll go through a few examples. If you want to follow along, you can download the .zip folder from here. Make sure to unzip the folder and open the script it contains. We’ll fill it in below:

library(tidyverse)
library(stringi)

Note that tidyverse contains the stringr package, which contains a variety of useful functions for working with strings. We’ll barely scratch the surface of its capabilities here.

1. A simple example of regex

A common application of regex is to extract numbers from text data. Let’s say we have this filename:

filename <- "2021A.txt"

This is how we could confirm the presence of just the year from it using base R: We could search for the year verbatim, returning a logical T/F with grepl().

grepl(pattern = "2021", x = filename)
## [1] TRUE

Or we could generalize it to just identifying where a number occurs. We use a pattern to identifying numbers 0 through 9:

grepl(pattern = "[0-9]", x = filename)
## [1] TRUE

More helpful would be to find four numbers in a row, as indicated with our bracketed 4:

grepl(pattern = "[0-9]{4}", x = filename)
## [1] TRUE

And then finally to actually return those numbers to us using str_extract() so that we can use them:

?str_extract
str_extract(string = filename, pattern = "[0-9]{4}")
## [1] "2021"

Now you might imagine that we could take this one step further and pull the dates from an entire vector of filenames. This will be especially useful if we want to extract these data from an entire column, for example.

file_list <- c("2021A.txt", "1995A.txt", "2021B.txt", "1995B.txt")


str_extract(string = file_list, pattern = "[0-9]{4}")
## [1] "2021" "1995" "2021" "1995"

2. Separating and extracting

Pretend that we’re given a list of student names with grades. We want to use this for an analysis, but the names and grades are combined in a random order into a single string. We’ll need to separate them somehow. Luckily regex provides us the tools to do so.

grade_data <- read.csv(file = "grades.csv")

head(grade_data)
##           grades
## 1   COSTA.gradeB
## 2    gradeD.LOWE
## 3  gradeC.SEXTON
## 4 ELLIOTT.gradeD
## 5   MARIN.gradeD
## 6  GUERRA.gradeC

Let’s test out a potential regex pattern to find grades. We want a single capital letter, right?

str_extract(string = grade_data$grades,
            pattern = "[A-Z]{1}")
##   [1] "C" "D" "C" "E" "M" "G" "B" "D" "B" "F" "C" "B" "D" "C" "H" "S" "F" "B"
##  [19] "D" "G" "H" "C" "H" "H" "B" "F" "R" "F" "G" "B" "W" "R" "A" "S" "W" "D"
##  [37] "W" "D" "W" "R" "B" "M" "C" "C" "E" "M" "I" "B" "A" "Z" "K" "C" "B" "C"
##  [55] "B" "R" "C" "D" "F" "C" "D" "D" "D" "T" "C" "M" "K" "C" "B" "C" "C" "B"
##  [73] "B" "A" "A" "D" "F" "F" "A" "Q" "L" "A" "D" "F" "B" "D" "F" "R" "A" "A"
##  [91] "B" "D" "T" "F" "F" "V" "A" "D" "B" "C"

Hmm, that’s the general idea, but it’s clearly just extracting the first capital letter in the text string, which isn’t always a grade.

Instead, look for "grade" followed by a single capital letter.

str_extract(string = grade_data$grades,
            pattern = "grade[A-Z]{1}")
##   [1] "gradeB" "gradeD" "gradeC" "gradeD" "gradeD" "gradeC" "gradeB" "gradeD"
##   [9] "gradeB" "gradeF" "gradeD" "gradeB" "gradeD" "gradeF" "gradeD" "gradeA"
##  [17] "gradeF" "gradeB" "gradeD" "gradeA" "gradeC" "gradeC" "gradeA" "gradeA"
##  [25] "gradeB" "gradeF" "gradeC" "gradeF" "gradeF" "gradeB" "gradeB" "gradeB"
##  [33] "gradeA" "gradeD" "gradeA" "gradeD" "gradeF" "gradeD" "gradeF" "gradeB"
##  [41] "gradeB" "gradeF" "gradeB" "gradeC" "gradeD" "gradeC" "gradeD" "gradeA"
##  [49] "gradeA" "gradeA" "gradeD" "gradeC" "gradeA" "gradeC" "gradeB" "gradeA"
##  [57] "gradeC" "gradeD" "gradeF" "gradeC" "gradeD" "gradeD" "gradeD" "gradeC"
##  [65] "gradeC" "gradeD" "gradeC" "gradeC" "gradeB" "gradeC" "gradeD" "gradeB"
##  [73] "gradeB" "gradeA" "gradeA" "gradeB" "gradeF" "gradeF" "gradeA" "gradeC"
##  [81] "gradeC" "gradeA" "gradeD" "gradeF" "gradeB" "gradeD" "gradeF" "gradeF"
##  [89] "gradeC" "gradeB" "gradeB" "gradeD" "gradeB" "gradeF" "gradeF" "gradeA"
##  [97] "gradeA" "gradeD" "gradeA" "gradeC"

That works pretty much exactly as we intended…Save it for later:

grades <- str_extract(string = grade_data$grades,
                      pattern = "grade[A-Z]{1}")

# And remove "grade" from it
grades <- gsub(pattern = "grade", replacement = "", x = grades)

What about last names? Well, there are probably many ways to do this one. A “easy” way could be to first remove the grade info, then keep the rest:

grade_remove <- gsub(pattern = "grade[A-Z]{1}", replacement = "", x = grade_data$grades)

head(grade_remove)
## [1] "COSTA."   ".LOWE"    ".SEXTON"  "ELLIOTT." "MARIN."   "GUERRA."

Ok, so far so good. There’s also a pesky period in there…

gsub(pattern = ".", replacement = "", x = grade_remove)
##   [1] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
##  [26] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
##  [51] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
##  [76] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""

That doesn’t look right! The period . is a pattern that is used in regex as a wildcard, standing in for any character. We have to “escape” it in order to use it to signify a literal period.

We use the double backslash in R to “escape” these special characters. We use two instead of one because of the way that R uses strings. We’re not going to get into the detail here

head(gsub(pattern = "\\.", replacement = "", x = grade_remove))
## [1] "COSTA"   "LOWE"    "SEXTON"  "ELLIOTT" "MARIN"   "GUERRA"

Save it for later:

student_names <- gsub(pattern = "\\.", replacement = "", x = grade_remove)

We can now combine these tricks to make a new data frame!

grade_records <- tibble(
  student_names,
  grades
)

3. Kick it up a notch

In this section I want to share some regex routines I’ve used in my real-world work to speed up or simplify data cleaning.

They’re going to be a little higher level, but hopefully useful to show how these things can be applied day-to-day.

3a. Lookahead/Lookbehind

We have a file with a column of coordinates that needs to be reformatted. I’ll show how lookahead/lookbehind methods can help here.

coordinates <- read.csv(file = "coordinates.csv")

head(coordinates)
##                   coordinates count       date
## 1 N 46° 43.9214 W 117° 9.6333    28 10/10/2020
## 2    N 21° 20.04 W 158° 50.05    13 10/10/2020
## 3 N 53° 31.0101 E 108° 6.0512    47  10/5/2020
## 4 N 46° 43.9214 W 117° 9.6333    96 10/10/2020
## 5    N 21° 20.04 W 158° 50.05    95  10/5/2020
## 6    N 21° 20.04 W 158° 50.05    66 10/10/2020

We want to identify pieces of text in the coordinates column that are latitudes or longitudes. Latitudes = pieces of text followed by " W” or " E". Longitudes = pieces of text preceded by " W" or " E".

Extract Latitude: We want to use a positive lookahead. i.e., our piece of targeted text (latitude) exists BEFORE a specific textual pattern (“W” or “E”). The text pattern in the lookahead is NOT included in the matched text, so we won’t end up storing the W or E!


The regex syntax for this is: "(?=...)". We’ll include the leading space before the W/E as well. Note that the "|" stands for “OR” and ".*" signifies “any character zero or more times”. Here’s a good reference for the syntax.

coordinates %>%
  mutate(latitude = str_extract(string = coordinates,
                                pattern = ".*(?= E| W)"))
##                    coordinates count       date      latitude
## 1  N 46° 43.9214 W 117° 9.6333    28 10/10/2020 N 46° 43.9214
## 2     N 21° 20.04 W 158° 50.05    13 10/10/2020   N 21° 20.04
## 3  N 53° 31.0101 E 108° 6.0512    47  10/5/2020 N 53° 31.0101
## 4  N 46° 43.9214 W 117° 9.6333    96 10/10/2020 N 46° 43.9214
## 5     N 21° 20.04 W 158° 50.05    95  10/5/2020   N 21° 20.04
## 6     N 21° 20.04 W 158° 50.05    66 10/10/2020   N 21° 20.04
## 7  N 53° 31.0101 E 108° 6.0512    74  10/5/2020 N 53° 31.0101
## 8  N 53° 31.0101 E 108° 6.0512    62 10/10/2020 N 53° 31.0101
## 9  N 53° 31.0101 E 108° 6.0512    66 10/20/2020 N 53° 31.0101
## 10 N 46° 43.9214 W 117° 9.6333    58 10/20/2020 N 46° 43.9214
## 11 N 53° 31.0101 E 108° 6.0512    36 10/20/2020 N 53° 31.0101
## 12 N 46° 43.9214 W 117° 9.6333    49 10/20/2020 N 46° 43.9214
## 13 N 53° 31.0101 E 108° 6.0512    39  9/25/2020 N 53° 31.0101
## 14 N 53° 31.0101 E 108° 6.0512    15 10/20/2020 N 53° 31.0101
## 15 N 46° 43.9214 W 117° 9.6333    75  9/25/2020 N 46° 43.9214

Here’s a quick summary of the other lookahead/behind options:

Negative lookahead: Our piece of targeted text is NOT followed by a specific textual pattern. Lookahead not included in match. Syntax: "(?!...)"

Positive lookbehind: Our piece of targeted text exists AFTER a specific textual pattern. That pattern is NOT included in the matched text. Syntax: "(?<=...)"

Negative Lookbehind: Our piece of targeted text does NOT exist AFTER a specific textual pattern. That pattern is NOT included in the matched text. Negative Lookbehind: "(?<!...)"


Extract Longitude: To get the longitude from our coordinate column we can find the pieces of text that begin with either an E or W. It’s a little simpler than what we did for latitude:

coordinates %>%
  mutate(longitude = str_extract(string = coordinates,
                                 pattern = "E.*|W.*"))
##                    coordinates count       date     longitude
## 1  N 46° 43.9214 W 117° 9.6333    28 10/10/2020 W 117° 9.6333
## 2     N 21° 20.04 W 158° 50.05    13 10/10/2020  W 158° 50.05
## 3  N 53° 31.0101 E 108° 6.0512    47  10/5/2020 E 108° 6.0512
## 4  N 46° 43.9214 W 117° 9.6333    96 10/10/2020 W 117° 9.6333
## 5     N 21° 20.04 W 158° 50.05    95  10/5/2020  W 158° 50.05
## 6     N 21° 20.04 W 158° 50.05    66 10/10/2020  W 158° 50.05
## 7  N 53° 31.0101 E 108° 6.0512    74  10/5/2020 E 108° 6.0512
## 8  N 53° 31.0101 E 108° 6.0512    62 10/10/2020 E 108° 6.0512
## 9  N 53° 31.0101 E 108° 6.0512    66 10/20/2020 E 108° 6.0512
## 10 N 46° 43.9214 W 117° 9.6333    58 10/20/2020 W 117° 9.6333
## 11 N 53° 31.0101 E 108° 6.0512    36 10/20/2020 E 108° 6.0512
## 12 N 46° 43.9214 W 117° 9.6333    49 10/20/2020 W 117° 9.6333
## 13 N 53° 31.0101 E 108° 6.0512    39  9/25/2020 E 108° 6.0512
## 14 N 53° 31.0101 E 108° 6.0512    15 10/20/2020 E 108° 6.0512
## 15 N 46° 43.9214 W 117° 9.6333    75  9/25/2020 W 117° 9.6333

Now let’s combine these two approaches:

coordinates %>%
  mutate(latitude = str_extract(string = coordinates,
                                pattern = ".*(?= E| W)"),
         longitude = str_extract(string = coordinates,
                                 pattern = "E.*|W.*"))
##                    coordinates count       date      latitude     longitude
## 1  N 46° 43.9214 W 117° 9.6333    28 10/10/2020 N 46° 43.9214 W 117° 9.6333
## 2     N 21° 20.04 W 158° 50.05    13 10/10/2020   N 21° 20.04  W 158° 50.05
## 3  N 53° 31.0101 E 108° 6.0512    47  10/5/2020 N 53° 31.0101 E 108° 6.0512
## 4  N 46° 43.9214 W 117° 9.6333    96 10/10/2020 N 46° 43.9214 W 117° 9.6333
## 5     N 21° 20.04 W 158° 50.05    95  10/5/2020   N 21° 20.04  W 158° 50.05
## 6     N 21° 20.04 W 158° 50.05    66 10/10/2020   N 21° 20.04  W 158° 50.05
## 7  N 53° 31.0101 E 108° 6.0512    74  10/5/2020 N 53° 31.0101 E 108° 6.0512
## 8  N 53° 31.0101 E 108° 6.0512    62 10/10/2020 N 53° 31.0101 E 108° 6.0512
## 9  N 53° 31.0101 E 108° 6.0512    66 10/20/2020 N 53° 31.0101 E 108° 6.0512
## 10 N 46° 43.9214 W 117° 9.6333    58 10/20/2020 N 46° 43.9214 W 117° 9.6333
## 11 N 53° 31.0101 E 108° 6.0512    36 10/20/2020 N 53° 31.0101 E 108° 6.0512
## 12 N 46° 43.9214 W 117° 9.6333    49 10/20/2020 N 46° 43.9214 W 117° 9.6333
## 13 N 53° 31.0101 E 108° 6.0512    39  9/25/2020 N 53° 31.0101 E 108° 6.0512
## 14 N 53° 31.0101 E 108° 6.0512    15 10/20/2020 N 53° 31.0101 E 108° 6.0512
## 15 N 46° 43.9214 W 117° 9.6333    75  9/25/2020 N 46° 43.9214 W 117° 9.6333

Alternatively, we could use the separate() function along with regex for the column separator. Here I tell R to separate the two columns where there is a number followed by a space and then either W or E. The two parentheses enclose lookahead and lookbehind phrases that shouldn’t be included in the extracted text but which we do want to check for.

coordinates %>%
  separate(col = coordinates,
           into = c("latitude", "longitude"),
           sep = "(?<=[0-9]) (?=W|E)")
##         latitude     longitude count       date
## 1  N 46° 43.9214 W 117° 9.6333    28 10/10/2020
## 2    N 21° 20.04  W 158° 50.05    13 10/10/2020
## 3  N 53° 31.0101 E 108° 6.0512    47  10/5/2020
## 4  N 46° 43.9214 W 117° 9.6333    96 10/10/2020
## 5    N 21° 20.04  W 158° 50.05    95  10/5/2020
## 6    N 21° 20.04  W 158° 50.05    66 10/10/2020
## 7  N 53° 31.0101 E 108° 6.0512    74  10/5/2020
## 8  N 53° 31.0101 E 108° 6.0512    62 10/10/2020
## 9  N 53° 31.0101 E 108° 6.0512    66 10/20/2020
## 10 N 46° 43.9214 W 117° 9.6333    58 10/20/2020
## 11 N 53° 31.0101 E 108° 6.0512    36 10/20/2020
## 12 N 46° 43.9214 W 117° 9.6333    49 10/20/2020
## 13 N 53° 31.0101 E 108° 6.0512    39  9/25/2020
## 14 N 53° 31.0101 E 108° 6.0512    15 10/20/2020
## 15 N 46° 43.9214 W 117° 9.6333    75  9/25/2020

3b. Group capture

Looking at the date column in this dataset we might also want to reformat it for one reason or another. I actually wouldn’t typically do this with data, as there are functions to format date in the lubridate package. But where I have used this to reformat dates that are hard-coded into R scripts. In that situation I used the Find & Replace tool in RStudio along with the regex option.

head(coordinates)
##                   coordinates count       date
## 1 N 46° 43.9214 W 117° 9.6333    28 10/10/2020
## 2    N 21° 20.04 W 158° 50.05    13 10/10/2020
## 3 N 53° 31.0101 E 108° 6.0512    47  10/5/2020
## 4 N 46° 43.9214 W 117° 9.6333    96 10/10/2020
## 5    N 21° 20.04 W 158° 50.05    95  10/5/2020
## 6    N 21° 20.04 W 158° 50.05    66 10/10/2020

We have dates in this format: MM/DD/YYYY

coordinates$date
##  [1] "10/10/2020" "10/10/2020" "10/5/2020"  "10/10/2020" "10/5/2020" 
##  [6] "10/10/2020" "10/5/2020"  "10/10/2020" "10/20/2020" "10/20/2020"
## [11] "10/20/2020" "10/20/2020" "9/25/2020"  "10/20/2020" "9/25/2020"

Let’s manually convert the forward slashes to underscores, but do so using group captures. This syntax allows us to group parts of a text string, for example as a means of ensuring that the data we’re editing meets certain conditions (e.g., for quality control).

To do this we denote groups with parentheses: ( ). So, for example, we could use the following lines of code to “capture” the pieces of our dates: "([0-9]{1,2})/([0-9]{1,2})/([0-9]{4})"

Then we can refer to the date pieces by numbers, in the order they are written. Depending on the function we’re using, we refer to each group number with a dollar sign OR two backslashes ahead of its label: "$1_$2_$3" or "\\1_\\2_\\3"

So, if we put these two pieces together we can create a new date column like this:

With the stringr package:

coordinates %>%
  mutate(new_date = str_replace(string = date,
                                pattern = "([0-9]{1,2})/([0-9]{1,2})/([0-9]{4})",
                                replacement = "\\1_\\2_\\3"))
##                    coordinates count       date   new_date
## 1  N 46° 43.9214 W 117° 9.6333    28 10/10/2020 10_10_2020
## 2     N 21° 20.04 W 158° 50.05    13 10/10/2020 10_10_2020
## 3  N 53° 31.0101 E 108° 6.0512    47  10/5/2020  10_5_2020
## 4  N 46° 43.9214 W 117° 9.6333    96 10/10/2020 10_10_2020
## 5     N 21° 20.04 W 158° 50.05    95  10/5/2020  10_5_2020
## 6     N 21° 20.04 W 158° 50.05    66 10/10/2020 10_10_2020
## 7  N 53° 31.0101 E 108° 6.0512    74  10/5/2020  10_5_2020
## 8  N 53° 31.0101 E 108° 6.0512    62 10/10/2020 10_10_2020
## 9  N 53° 31.0101 E 108° 6.0512    66 10/20/2020 10_20_2020
## 10 N 46° 43.9214 W 117° 9.6333    58 10/20/2020 10_20_2020
## 11 N 53° 31.0101 E 108° 6.0512    36 10/20/2020 10_20_2020
## 12 N 46° 43.9214 W 117° 9.6333    49 10/20/2020 10_20_2020
## 13 N 53° 31.0101 E 108° 6.0512    39  9/25/2020  9_25_2020
## 14 N 53° 31.0101 E 108° 6.0512    15 10/20/2020 10_20_2020
## 15 N 46° 43.9214 W 117° 9.6333    75  9/25/2020  9_25_2020

Or with the stringi package:

coordinates %>%
  mutate(new_date = stri_replace(str = date,
                                 regex = "([0-9]{1,2})/([0-9]{1,2})/([0-9]{4})",
                                 replacement = "$1_$2_$3"))
##                    coordinates count       date   new_date
## 1  N 46° 43.9214 W 117° 9.6333    28 10/10/2020 10_10_2020
## 2     N 21° 20.04 W 158° 50.05    13 10/10/2020 10_10_2020
## 3  N 53° 31.0101 E 108° 6.0512    47  10/5/2020  10_5_2020
## 4  N 46° 43.9214 W 117° 9.6333    96 10/10/2020 10_10_2020
## 5     N 21° 20.04 W 158° 50.05    95  10/5/2020  10_5_2020
## 6     N 21° 20.04 W 158° 50.05    66 10/10/2020 10_10_2020
## 7  N 53° 31.0101 E 108° 6.0512    74  10/5/2020  10_5_2020
## 8  N 53° 31.0101 E 108° 6.0512    62 10/10/2020 10_10_2020
## 9  N 53° 31.0101 E 108° 6.0512    66 10/20/2020 10_20_2020
## 10 N 46° 43.9214 W 117° 9.6333    58 10/20/2020 10_20_2020
## 11 N 53° 31.0101 E 108° 6.0512    36 10/20/2020 10_20_2020
## 12 N 46° 43.9214 W 117° 9.6333    49 10/20/2020 10_20_2020
## 13 N 53° 31.0101 E 108° 6.0512    39  9/25/2020  9_25_2020
## 14 N 53° 31.0101 E 108° 6.0512    15 10/20/2020 10_20_2020
## 15 N 46° 43.9214 W 117° 9.6333    75  9/25/2020  9_25_2020

References:

  • Grolemund, G., & Wickham, H. (2017). R for Data Science. O’Reilly Media.
Matthew Brousil
Matthew Brousil
Data Scientist

Related