Regular Expressions (Regex)

Why Learn Regular Expressions in Data Analytics?

Imagine you are cleaning a survey dataset containing user-submitted telephone numbers. The entries are messy and inconsistent:

"123-456-7890" (hyphens)
"1234567890" (raw digits)
"(123) 456-7890" (parentheses and space)
"Call 123-456-7890" (text mixed with digits)

To perform text mining or clean this column for SMS alerts, you need to strip away all non-numeric characters, leaving only the clean 10-digit number.

Writing a custom string-search function for every variation quickly becomes impractical, because the number of possible formats is effectively unlimited. Regular Expressions (Regex) solve this: a regex lets you describe a text search pattern (e.g., "find all non-digits") and then match or replace it in a single function call.

1. R's Built-in Regex Functions

R includes several built-in regex functions:

gsub(pattern, replacement, x): Replaces all occurrences of a pattern in the text x (equivalent to Python's re.sub()).
sub(pattern, replacement, x): Replaces only the first occurrence of a pattern.
grepl(pattern, x): Returns a logical vector (TRUE or FALSE) indicating if the pattern is found in each element of x.
grep(pattern, x): Returns the numeric indices of elements in x that match the pattern.

raw_text <- "catorangecatdogapple"

# Replaces all occurrences of "cat" with "CAT"
result_all <- gsub("cat", "CAT", raw_text)
print(result_all) # "CATorangeCATdogapple"

# Replaces ONLY the first occurrence of "cat"
result_first <- sub("cat", "CAT", raw_text)
print(result_first) # "CATorangecatdogapple"
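The example above covers gsub() and sub() only. As a minimal sketch of the other two functions, the snippet below runs grepl() and grep() on a small vector; the animals vector is hypothetical sample data, not part of the original lesson.

```r
# Hypothetical sample data for illustration
animals <- c("cat", "dog", "wildcat", "bird")

# grepl() returns one TRUE/FALSE per element
has_cat <- grepl("cat", animals)
print(has_cat) # TRUE FALSE  TRUE FALSE

# grep() returns the positions of the matching elements
cat_idx <- grep("cat", animals)
print(cat_idx) # 1 3

# With value = TRUE, grep() returns the matching elements themselves
cat_values <- grep("cat", animals, value = TRUE)
print(cat_values) # "cat" "wildcat"
```

The logical vector from grepl() is convenient for subsetting or filtering rows, while the indices from grep() are handy when you need to know where the matches sit.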

2. Escaping and Raw Strings

In a Python raw string or a JavaScript regex literal, you write \d to match a digit. In an R string, the single backslash \ is itself a string escape character, so "\d" is rejected as an unrecognized escape. You have two choices for writing regex patterns:

Option A: The Double Backslash Rule

You must double-escape every backslash: \\d, \\s, etc.

\\d: Matches any digit (0-9).
\\D: Matches any non-digit.
\\s: Matches any whitespace.
\\w: Matches any word character (a letter, a digit, or an underscore).

phone_messy <- "Call (123) 456-7890"
# Replace all non-digits (\\D) with ""
phone_clean <- gsub("\\D", "", phone_messy)
print(phone_clean) # "1234567890"
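The phone example exercises \\D only. The following sketch, using a made-up sample string, shows what each of the four character classes matches when passed to gsub().

```r
# Hypothetical sample string for illustration
sample_text <- "Room 101, Floor 3"

digits_masked <- gsub("\\d", "#", sample_text)
print(digits_masked) # "Room ###, Floor #"

digits_only <- gsub("\\D", "", sample_text)
print(digits_only) # "1013"

spaces_replaced <- gsub("\\s", "_", sample_text)
print(spaces_replaced) # "Room_101,_Floor_3"

# \\w removes letters, digits, AND underscores; the hyphen survives
non_word_left <- gsub("\\w", "", "a_b-c")
print(non_word_left) # "-"
```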

Option B: Raw String Literals (R 4.0+)

To avoid the double backslash headache, R supports Raw String Literals using the r"(...)" or r"[...]" syntax. Inside a raw string literal, backslashes are treated as literal characters and do not need to be escaped:

# No double backslash needed!
phone_clean_raw <- gsub(r"(\D)", "", phone_messy)
print(phone_clean_raw) # "1234567890"

# Match a digit followed by a word character (letter, digit, or underscore) using raw string syntax
expression <- r"(\d\w)"
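To check that the two options really produce the same pattern, the sketch below compares a raw string literal with its double-backslash form and then applies it with grepl(). It assumes R 4.0 or later; the codes vector is hypothetical.

```r
# Requires R >= 4.0 for the r"(...)" syntax
raw_pattern <- r"(\d\w)"
std_pattern <- "\\d\\w"

# Both literals produce the same four-character string: \d\w
same_pattern <- identical(raw_pattern, std_pattern)
print(same_pattern) # TRUE
print(nchar(raw_pattern)) # 4

# Hypothetical sample data: which elements contain a digit followed by a word character?
codes <- c("A1", "7b", "42", "x-9")
has_match <- grepl(raw_pattern, codes)
print(has_match) # FALSE  TRUE  TRUE FALSE
```

Note that "42" matches because \w also accepts digits, while "A1" and "x-9" do not because nothing follows their digit.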

3. Common Regex Operators

Operator	Meaning	Example	Result
`|`	OR (matches left or right pattern)	`gsub("cat|dog", "-", "catdog")`	`"--"`
`+`	Matches one or more times	`gsub("a+", "-", "caaat")`	`"c-t"`
`*`	Matches zero or more times	`gsub("ab*", "-", "acabb")`	`"-c-"`
`^`	Matches the start of a string	`grepl("^cat", "catdog")`	`TRUE`
`$`	Matches the end of a string	`grepl("dog$", "catdog")`	`TRUE`
`[ ]`	Character set (matches any character inside)	`gsub("[aeiou]", "-", "apple")`	`"-ppl-"`
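These operators are usually combined. The sketch below is one possible way to validate simple identifiers with anchors, character sets, and quantifiers together; the ids vector is hypothetical, and the ranges [a-z] and [0-9] inside the character sets are shorthand for "any lowercase letter" and "any digit".

```r
# Hypothetical sample data for illustration
ids <- c("cat1", "dog22", "bird3", "cat", "1cat")

# One or more letters, then ONE or more digits, spanning the whole string
letters_then_digits <- grepl("^[a-z]+[0-9]+$", ids)
print(letters_then_digits) # TRUE  TRUE  TRUE FALSE FALSE

# Same pattern, but ZERO or more digits, so "cat" now passes
digits_optional <- grepl("^[a-z]+[0-9]*$", ids)
print(digits_optional) # TRUE  TRUE  TRUE  TRUE FALSE

# Alternation with anchors: starts with "cat" OR starts with "dog"
cat_or_dog <- grepl("^cat|^dog", ids)
print(cat_or_dog) # TRUE  TRUE FALSE  TRUE FALSE
```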

4. Tidyverse Regex: stringr

The tidyverse stringr package offers consistently named alternatives to base R's regex functions. Every function starts with str_ and takes the string first and the pattern second:

str_detect(string, pattern): Equivalent to grepl(). Returns a logical vector.
str_subset(string, pattern): Returns only elements that contain a match.
str_extract(string, pattern) / str_extract_all: Extracts the actual matched text rather than returning logical flags or indices.
str_match(string, pattern): Extracts matched text AND individual capture groups as columns of a matrix.
str_replace(string, pattern, replacement) / str_replace_all: Equivalent to sub() and gsub().

library(stringr)
fruits <- c("apple", "banana", "pear")

# Detect elements starting with 'b'
print(str_detect(fruits, "^b")) # FALSE  TRUE FALSE

# Subset elements containing 'a'
print(str_subset(fruits, "a")) # "apple" "banana" "pear"

# Extract characters matching a pattern
text <- "Error count: 42"
print(str_extract(text, r"(\d+)")) # "42"
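The snippet above does not show the _all variants or the replace functions. The following sketch contrasts first-match and all-match behaviour; it assumes the stringr package is installed, and the notes vector is hypothetical.

```r
library(stringr)

# Hypothetical sample data for illustration
notes <- c("Order 12 shipped with 3 items", "No numbers here")

# First match per element; NA where nothing matches
first_numbers <- str_extract(notes, "\\d+")
print(first_numbers) # "12" NA

# Every match per element, returned as a list of character vectors
all_numbers <- str_extract_all(notes, "\\d+")
print(all_numbers[[1]]) # "12" "3"

# Replace only the first match vs. every match
first_replaced <- str_replace(notes[1], "\\d+", "#")
print(first_replaced) # "Order # shipped with 3 items"

all_replaced <- str_replace_all(notes[1], "\\d+", "#")
print(all_replaced) # "Order # shipped with # items"
```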

5. Grouping and Backreferences

By placing a portion of a regex inside parentheses ( ), you create a capture group. This allows you to apply quantifiers to the entire group or refer to the group later in the expression.
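To see what "applying a quantifier to the entire group" means in practice, this small sketch compares ab+ (the + applies to b only) with (ab)+ (the + applies to the whole group). The input string is made up for illustration.

```r
# Hypothetical sample string for illustration
input <- "ababab abbb"

# Without a group, + repeats only the preceding character "b"
no_group <- gsub("ab+", "-", input)
print(no_group) # "--- -"

# With a group, + repeats the whole sequence "ab"
with_group <- gsub("(ab)+", "-", input)
print(with_group) # "- -bb"
```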

Backreferences: Referencing Previous Groups

You can reference the exact string captured by a group earlier in the same pattern using backreferences (\1 for group 1, \2 for group 2, etc.):

# Match characters repeated twice in a row (e.g. "ee", "pp")
# Using raw string literal:
pattern_raw <- r"((.)\1)"

# Using standard string literal (requires double-escaping backslashes):
pattern_std <- "(.)\\1"

word <- "apple"
print(str_detect(word, pattern_raw)) # TRUE (matches "pp")
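The double-backslash form works the same way in base R. As a rough sketch, the snippet below uses it with grepl() on a hypothetical words vector, and also shows that base R's sub() accepts \\1 and \\2 in the replacement string to reuse captured text.

```r
# Hypothetical sample data for illustration
words <- c("apple", "banana", "pear", "coffee")

# Backreference inside the pattern: any character followed by itself
pattern_std <- "(.)\\1"
has_double <- grepl(pattern_std, words)
print(has_double) # TRUE FALSE FALSE  TRUE

# Backreferences inside the replacement: swap two captured words
swapped <- sub("(\\w+) (\\w+)", "\\2 \\1", "hello world")
print(swapped) # "world hello"
```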

Group Matches with str_match()

str_match() extracts each capture group as a separate column:

log_entry <- "USER_125: Login Successful"
# Group 1 is the user ID, Group 2 is the status
pattern <- r"(USER_(\d+): (.*))"

matches <- str_match(log_entry, pattern)
print(matches)
# Column 1: Full match ("USER_125: Login Successful")
# Column 2: Group 1 ("125")
# Column 3: Group 2 ("Login Successful")

# Get user ID
user_id <- matches[1, 2]
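str_match() is vectorised, so the same pattern can process a whole column of log lines at once. The sketch below assumes stringr is installed and R 4.0 or later for the raw string; the log_entries vector is hypothetical. Elements that do not match yield a row of NA values.

```r
library(stringr)

# Hypothetical sample data; the third entry does not fit the pattern
log_entries <- c(
  "USER_125: Login Successful",
  "USER_7: Password Reset",
  "SYSTEM: Restart"
)
pattern <- r"(USER_(\d+): (.*))"

# One row per input string, one column per capture group (plus the full match)
matches <- str_match(log_entries, pattern)

user_ids <- matches[, 2]
print(user_ids) # "125" "7" NA

statuses <- matches[, 3]
print(statuses) # "Login Successful" "Password Reset" NA
```

Checking for NA in the first column is a simple way to find the entries that failed to parse.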

Hands-on Exercises

Exercise 1: Cleaning Transaction IDs

You have a vector of transaction logs: c("TX_1001", "ERROR_992", "TX_1002", "MISSING_88"). Write R code to:

Use grepl() to build a logical vector that is TRUE only for strings that start with "TX_".
Subset the original vector to print only these valid transaction IDs.


Answer

logs <- c("TX_1001", "ERROR_992", "TX_1002", "MISSING_88")

# Find logs starting with TX_
valid_mask <- grepl("^TX_", logs)
print(valid_mask) # TRUE FALSE  TRUE FALSE

# Subset vector
valid_logs <- logs[valid_mask]
print(valid_logs) # "TX_1001" "TX_1002"

Exercise 2: Sanitizing Price Tags

You scraped price inputs that contain currency symbols and comma separators: c("$1,200", "$25", "$450.50"). Write R code to:

Remove both the dollar sign $ and the comma , from the strings using gsub(). (Hint: you can use gsub("\\$|,", "", x).) Note that $ is a special character meaning "end of string", so you must escape it with a double backslash: \\$.
Convert the resulting clean strings to numeric data types using as.numeric().
Print the final numeric prices.


Answer

scraped_prices <- c("$1,200", "$25", "$450.50")

# Clean dollar signs and commas
clean_strings <- gsub("\\$|,", "", scraped_prices)
print(clean_strings) # "1200" "25" "450.50"

# Convert to numeric
numeric_prices <- as.numeric(clean_strings)
print(numeric_prices) # 1200.0   25.0  450.5