Regular Expressions (Regex)
Why Learn Regular Expressions in Data Analytics?
Imagine you are cleaning a survey dataset containing user-submitted telephone numbers. The entries are messy and inconsistent:
- "123-456-7890" (hyphens)
- "1234567890" (raw digits)
- "(123) 456-7890" (parentheses and space)
- "Call 123-456-7890" (text mixed with digits)
To perform text mining or clean this column for SMS alerts, you need to strip away all non-numeric characters, leaving only the clean 10-digit number.
Writing a custom string search function for each format is impractical because the number of possible formatting variations is effectively unlimited. This is the problem Regular Expressions (Regex) solve. Regex lets you specify a text search pattern (e.g., "find all non-digits") and match or replace it in a single function call.
1. R's Built-in Regex Functions
R includes several built-in regex functions:
- gsub(pattern, replacement, x): Replaces all occurrences of a pattern in the text x (equivalent to Python's re.sub()).
- sub(pattern, replacement, x): Replaces only the first occurrence of a pattern.
- grepl(pattern, x): Returns a logical vector (TRUE or FALSE) indicating if the pattern is found in each element of x.
- grep(pattern, x): Returns the numeric indices of elements in x that match the pattern.
raw_text <- "catorangecatdogapple"
# Replaces all occurrences of "cat" with "CAT"
result_all <- gsub("cat", "CAT", raw_text)
print(result_all) # "CATorangeCATdogapple"
# Replaces ONLY the first occurrence of "cat"
result_first <- sub("cat", "CAT", raw_text)
print(result_first) # "CATorangecatdogapple"
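The two matching functions from the list, grepl() and grep(), can be sketched the same way. The vector fruit_names is a made-up example used only for illustration:

```r
# Hypothetical example vector (an assumption for illustration)
fruit_names <- c("apple", "banana", "cherry")

# grepl() returns one TRUE/FALSE per element
has_an <- grepl("an", fruit_names)
print(has_an) # FALSE TRUE FALSE

# grep() returns the positions of the matching elements
match_idx <- grep("an", fruit_names)
print(match_idx) # 2

# value = TRUE returns the matching elements themselves instead of indices
match_values <- grep("an", fruit_names, value = TRUE)
print(match_values) # "banana"
```

The logical vector from grepl() is usually the more convenient form for subsetting, while grep() is handy when you need positions.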
2. Escaping and Raw Strings
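In a Python raw string or a JavaScript regex literal, you write \d to match a digit. In an ordinary R string literal, the backslash is itself the string escape character, and "\d" is not a valid string escape, so R reports an error when it parses it. You therefore have two choices for writing regex patterns: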
Option A: The Double Backslash Rule
You must double every backslash in the pattern: \\d, \\s, etc.
- \\d: Matches any digit (0-9).
- \\D: Matches any non-digit.
- \\s: Matches any whitespace character.
- \\w: Matches any word character (a letter, a digit, or an underscore).
phone_messy <- "Call (123) 456-7890"
# Replace all non-digits (\\D) with ""
phone_clean <- gsub("\\D", "", phone_messy)
print(phone_clean) # "1234567890"
Option B: Raw String Literals (R 4.0+)
To avoid the double backslash headache, R supports Raw String Literals using the r"(...)" or r"[...]" syntax. Inside a raw string literal, backslashes are treated as literal characters and do not need to be escaped:
# No double backslash needed!
phone_clean_raw <- gsub(r"(\D)", "", phone_messy)
print(phone_clean_raw) # "1234567890"
# Match a digit followed by a word character (letter, digit, or underscore) using raw string syntax
expression <- r"(\d\w)"
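One way to check that both options produce the same pattern is to compare the strings directly. This is a small sketch; the variable names are illustrative, and the raw-string line assumes R 4.0 or later:

```r
# Both literals describe the same two-character pattern: a backslash and "D"
escaped_pattern <- "\\D"
raw_pattern <- r"(\D)"

print(nchar(escaped_pattern)) # 2
print(identical(escaped_pattern, raw_pattern)) # TRUE

# writeLines() shows the characters the regex engine actually receives
writeLines(escaped_pattern) # \D
```

The doubled backslash exists only in the source code. The string stored in memory contains a single backslash either way.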
3. Common Regex Operators
| Operator | Meaning | Example | Result |
|---|---|---|---|
| \| | OR (matches the left or the right pattern) | gsub("cat\|dog", "-", "catdog") | "--" |
| + | Matches the preceding item one or more times | gsub("a+", "-", "caaat") | "c-t" |
| * | Matches the preceding item zero or more times | gsub("ab*", "-", "acabb") | "-c-" |
| ^ | Matches the start of a string | grepl("^cat", "catdog") | TRUE |
| $ | Matches the end of a string | grepl("dog$", "catdog") | TRUE |
| [ ] | Character set (matches any one character inside) | gsub("[aeiou]", "-", "apple") | "-ppl-" |
4. Tidyverse Regex: stringr
The tidyverse stringr package provides consistently named counterparts to base R's regex tools. Every function starts with str_ and takes the string as its first argument:
- str_detect(string, pattern): Equivalent to grepl(). Returns a logical vector.
- str_subset(string, pattern): Returns only elements that contain a match.
- str_extract(string, pattern) / str_extract_all(): Extracts the actual matched text rather than returning logical flags or indices.
- str_match(string, pattern): Extracts matched text AND individual capture groups as columns of a matrix.
- str_replace(string, pattern, replacement) / str_replace_all(): Equivalent to sub() and gsub().
library(stringr)
fruits <- c("apple", "banana", "pear")
# Detect elements starting with 'b'
print(str_detect(fruits, "^b")) # FALSE TRUE FALSE
# Subset elements containing 'a'
print(str_subset(fruits, "a")) # "apple" "banana" "pear"
# Extract characters matching a pattern
text <- "Error count: 42"
print(str_extract(text, r"(\d+)")) # "42"
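The replacement and extract-all variants from the list follow the same calling pattern. A brief sketch, using a made-up order_text string (an assumption for illustration):

```r
library(stringr)

# Hypothetical input string
order_text <- "Order 12 shipped, order 7 pending"

# str_extract_all() returns a list with one character vector per input string
all_numbers <- str_extract_all(order_text, "\\d+")[[1]]
print(all_numbers) # "12" "7"

# str_replace() changes only the first match, like sub()
first_masked <- str_replace(order_text, "\\d+", "#")
print(first_masked) # "Order # shipped, order 7 pending"

# str_replace_all() changes every match, like gsub()
all_masked <- str_replace_all(order_text, "\\d+", "#")
print(all_masked) # "Order # shipped, order # pending"
```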
5. Grouping and Backreferences
By placing a portion of a regex inside parentheses ( ), you create a capture group. This allows you to apply quantifiers to the entire group or refer to the group later in the expression.
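To illustrate the first point, here is a minimal sketch of a quantifier applied to a whole group versus a single character. The test strings are made up for illustration:

```r
# (ab)+ applies the + quantifier to the whole group "ab"
candidates <- c("ab", "ababab", "abba")
repeats_ab <- grepl("^(ab)+$", candidates)
print(repeats_ab) # TRUE TRUE FALSE

# Without the parentheses, + applies only to the "b"
repeats_b <- grepl("^ab+$", c("abbb", "abab"))
print(repeats_b) # TRUE FALSE
```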
Backreferences: Referencing Previous Groups
You can reference the exact string captured by a group earlier in the same pattern using backreferences (\1 for group 1, \2 for group 2, etc.):
# Match characters repeated twice in a row (e.g. "ee", "pp")
# Using raw string literal:
pattern_raw <- r"((.)\1)"
# Using standard string literal (requires double-escaping backslashes):
pattern_std <- "(.)\\1"
word <- "apple"
print(str_detect(word, pattern_raw)) # TRUE (matches "pp")
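Base R's sub() and gsub() also accept backreferences in the replacement string, which is a closely related use. A short sketch, assuming a hypothetical name stored in "Last, First" order:

```r
# Hypothetical input in "Last, First" format
full_name <- "Doe, Jane"

# Group 1 captures the last name, group 2 the first name;
# the replacement writes them back in swapped order
swapped <- sub("(\\w+), (\\w+)", "\\2 \\1", full_name)
print(swapped) # "Jane Doe"
```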
Group Matches with str_match()
str_match() extracts each capture group as a separate column:
log_entry <- "USER_125: Login Successful"
# Group 1 is the user ID, Group 2 is the status
pattern <- r"(USER_(\d+): (.*))"
matches <- str_match(log_entry, pattern)
print(matches)
# Column 1: Full match ("USER_125: Login Successful")
# Column 2: Group 1 ("125")
# Column 3: Group 2 ("Login Successful")
# Get user ID
user_id <- matches[1, 2]
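When str_match() receives a vector, the result has one row per input string. The sketch below uses hypothetical log lines, one of which deliberately does not fit the pattern:

```r
library(stringr)

# Hypothetical log lines; the second one does not fit the pattern
log_entries <- c(
  "USER_125: Login Successful",
  "SYSTEM: Restart",
  "USER_7: Login Failed"
)

# One row per input string; column 1 is the full match,
# the remaining columns hold the capture groups
log_matches <- str_match(log_entries, "USER_(\\d+): (.*)")
print(dim(log_matches)) # 3 3

# Rows without a match are filled with NA
user_ids <- log_matches[, 2]
print(user_ids) # "125" NA "7"
```

The NA entries make it easy to spot lines that did not follow the expected format before further processing.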
Hands-on Exercises
Exercise 1: Cleaning Transaction IDs
You have a vector of transaction logs: c("TX_1001", "ERROR_992", "TX_1002", "MISSING_88").
Write R code to:
- Use grepl() to build a logical vector identifying only the strings that start with "TX_".
- Subset the original vector to print only these valid transaction IDs.
Answer
logs <- c("TX_1001", "ERROR_992", "TX_1002", "MISSING_88")
# Find logs starting with TX_
valid_mask <- grepl("^TX_", logs)
print(valid_mask) # TRUE FALSE TRUE FALSE
# Subset vector
valid_logs <- logs[valid_mask]
print(valid_logs) # "TX_1001" "TX_1002"
Exercise 2: Sanitizing Price Tags
You scraped price inputs that contain currency symbols and comma separators: c("$1,200", "$25", "$450.50").
Write R code to:
- Remove both the dollar sign $ and the comma , from the strings using gsub(). (Hint: You can use gsub("\\$|,", "", x).) Note that $ is a special character meaning "end of string", so you must escape it with a double backslash: \\$.
- Convert the resulting clean strings to numeric data types using as.numeric().
- Print the final numeric prices.
Answer
scraped_prices <- c("$1,200", "$25", "$450.50")
# Clean dollar signs and commas
clean_strings <- gsub("\\$|,", "", scraped_prices)
print(clean_strings) # "1200" "25" "450.50"
# Convert to numeric
numeric_prices <- as.numeric(clean_strings)
print(numeric_prices) # 1200.0 25.0 450.5