Regular Expressions (Regex)

Why Learn Regular Expressions in Data Analytics?

Imagine you are cleaning a survey dataset containing user-submitted telephone numbers. The entries are messy and inconsistent:

  • "123-456-7890" (hyphens)
  • "1234567890" (raw digits)
  • "(123) 456-7890" (parentheses and space)
  • "Call 123-456-7890" (text mixed with digits)

To perform text mining or clean this column for SMS alerts, you need to strip away all non-numeric characters, leaving only the clean 10-digit number.

Hand-written string search functions do not scale to this task, because every new formatting variation needs its own special case. Regular Expressions (Regex) solve this: you describe a text pattern once (e.g., "every non-digit character"), then match or replace it across the whole column in a single call.


1. R's Built-in Regex Functions

R includes several built-in regex functions:

  • gsub(pattern, replacement, x): Replaces all occurrences of a pattern in the text x (equivalent to Python's re.sub()).
  • sub(pattern, replacement, x): Replaces only the first occurrence of a pattern.
  • grepl(pattern, x): Returns a logical vector (TRUE or FALSE) indicating if the pattern is found in each element of x.
  • grep(pattern, x): Returns the numeric indices of elements in x that match the pattern.

raw_text <- "catorangecatdogapple"

# Replaces all occurrences of "cat" with "CAT"
result_all <- gsub("cat", "CAT", raw_text)
print(result_all) # "CATorangeCATdogapple"

# Replaces ONLY the first occurrence of "cat"
result_first <- sub("cat", "CAT", raw_text)
print(result_first) # "CATorangecatdogapple"
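
The two matching functions from the list above, grepl() and grep(), can be sketched the same way. The animals vector is made up for illustration and is not part of the original lesson.

```r
# Hypothetical sample data, invented for this sketch
animals <- c("cat", "dog", "wildcat", "bird")

# grepl() returns one TRUE/FALSE per element
has_cat <- grepl("cat", animals)
print(has_cat)     # TRUE FALSE  TRUE FALSE

# grep() returns the positions of the matching elements
cat_idx <- grep("cat", animals)
print(cat_idx)     # 1 3

# grep(value = TRUE) returns the matching elements themselves
cat_values <- grep("cat", animals, value = TRUE)
print(cat_values)  # "cat" "wildcat"
```

Note that "wildcat" matches because the pattern is not anchored: it only has to occur somewhere in the string.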

2. Escaping and Raw Strings

In regex syntax, \d matches a digit. In an R string literal, however, a single backslash \ is itself an escape character, and R rejects "\d" as an unrecognized escape. You therefore have two choices for writing regex patterns:

Option A: The Double Backslash Rule

Double every backslash so that R passes a single backslash on to the regex engine: \\d, \\s, etc.

  • \\d: Matches any digit (0-9).
  • \\D: Matches any non-digit.
  • \\s: Matches any whitespace.
  • \\w: Matches any word character (letter, digit, or underscore).

phone_messy <- "Call (123) 456-7890"
# Replace all non-digits (\\D) with ""
phone_clean <- gsub("\\D", "", phone_messy)
print(phone_clean) # "1234567890"
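
Because gsub() is vectorised, the same call can clean a whole column at once. A minimal sketch using the four messy entries from the introduction (the variable names are illustrative):

```r
# The four messy telephone entries from the introduction
phones_messy <- c("123-456-7890",
                  "1234567890",
                  "(123) 456-7890",
                  "Call 123-456-7890")

# One call cleans every element of the vector
phones_clean <- gsub("\\D", "", phones_messy)
print(phones_clean)  # four copies of "1234567890"

# Sanity check: every cleaned entry should have exactly 10 characters
is_valid <- nchar(phones_clean) == 10
print(is_valid)      # TRUE TRUE TRUE TRUE
```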

Option B: Raw String Literals (R 4.0+)

To avoid the double backslash headache, R supports Raw String Literals using the r"(...)" or r"[...]" syntax. Inside a raw string literal, backslashes are treated as literal characters and do not need to be escaped:

# No double backslash needed!
phone_clean_raw <- gsub(r"(\D)", "", phone_messy)
print(phone_clean_raw) # "1234567890"

# Match a digit followed by a word character using raw string syntax
# (named digit_then_word to avoid masking base R's expression() function)
digit_then_word <- r"(\d\w)"
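
To see that the two options really produce the same pattern, the strings can be compared directly. This is a small sketch with illustrative variable names, and it needs R 4.0 or later for the raw string syntax.

```r
# Option A: double-escaped literal; Option B: raw string literal (R >= 4.0)
pattern_escaped <- "\\d"
pattern_raw <- r"(\d)"

# Both literals produce the same two-character string: a backslash and "d"
same_pattern <- identical(pattern_escaped, pattern_raw)
print(same_pattern)            # TRUE
print(nchar(pattern_escaped))  # 2

# cat() shows the string as the regex engine receives it
cat(pattern_escaped, "\n")     # \d
```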

3. Common Regex Operators

Operator   Meaning                                         Example                               Result
|          OR (matches left or right pattern)              gsub("cat|dog", "-", "catdog")        "--"
+          Matches one or more times                       gsub("a+", "-", "caaat")              "c-t"
*          Matches zero or more times                      gsub("ab*", "-", "acabb")             "-c-"
^          Matches the start of a string                   grepl("^cat", "catdog")               TRUE
$          Matches the end of a string                     grepl("dog$", "catdog")               TRUE
[ ]        Character set (matches any character inside)    gsub("[aeiou]", "-", "apple")         "-ppl-"
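
These operators are usually combined. The sketch below uses a made-up vector of ID strings to show anchors, a character set, and the difference between + and * working together.

```r
# Hypothetical ID strings, invented for illustration
ids <- c("42", "007", "4a2", "", "12 ")

# ^ and $ force the whole string to match; [0-9]+ needs at least one digit
digits_plus <- grepl("^[0-9]+$", ids)
print(digits_plus)  # TRUE  TRUE FALSE FALSE FALSE

# With * the digit run may be empty, so the empty string now matches too
digits_star <- grepl("^[0-9]*$", ids)
print(digits_star)  # TRUE  TRUE FALSE  TRUE FALSE
```

The last element fails in both cases because of its trailing space, which is the kind of invisible problem anchored patterns help to catch.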

4. Tidyverse Regex: stringr

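The tidyverse stringr package offers alternatives to base R's regex tools with a consistent interface: every function starts with str_ and takes the string as its first argument, followed by the pattern: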

  • str_detect(string, pattern): Equivalent to grepl(). Returns a logical vector.
  • str_subset(string, pattern): Returns only elements that contain a match.
  • str_extract(string, pattern) / str_extract_all: Extracts the actual matched text rather than returning logical flags or indices.
  • str_match(string, pattern): Extracts matched text AND individual capture groups as columns of a matrix.
  • str_replace(string, pattern, replacement) / str_replace_all: Equivalent to sub() and gsub().

library(stringr)
fruits <- c("apple", "banana", "pear")

# Detect elements starting with 'b'
print(str_detect(fruits, "^b")) # FALSE  TRUE FALSE

# Subset elements containing 'a'
print(str_subset(fruits, "a")) # "apple" "banana" "pear" (all three contain 'a')

# Extract characters matching a pattern
text <- "Error count: 42"
print(str_extract(text, r"(\d+)")) # "42"
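
The _all variants listed above are not shown in the example, so here is a short sketch of how they differ from their single-match counterparts. It assumes the stringr package is installed; the log line is invented for illustration.

```r
library(stringr)

# Hypothetical log line, invented for this sketch
line <- "Retries: 3, errors: 42, warnings: 7"

# str_extract() stops at the first match
first_num <- str_extract(line, r"(\d+)")
print(first_num)  # "3"

# str_extract_all() returns a list with one character vector per input string
all_nums <- str_extract_all(line, r"(\d+)")[[1]]
print(all_nums)   # "3"  "42" "7"

# str_replace() changes only the first match, str_replace_all() every match
masked_first <- str_replace(line, r"(\d+)", "N")
masked_all <- str_replace_all(line, r"(\d+)", "N")
print(masked_first)  # "Retries: N, errors: 42, warnings: 7"
print(masked_all)    # "Retries: N, errors: N, warnings: N"
```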

5. Grouping and Backreferences

By placing a portion of a regex inside parentheses ( ), you create a capture group. This allows you to apply quantifiers to the entire group or refer to the group later in the expression.
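
A minimal sketch of the first use, applying a quantifier to a whole group, using base R's grepl() and made-up test strings:

```r
# Without a group, + applies only to the preceding character "b"
plain_on_repeat <- grepl("^ab+$", "ababab")
plain_on_bs <- grepl("^ab+$", "abbb")
print(plain_on_repeat)  # FALSE
print(plain_on_bs)      # TRUE

# With a group, + applies to the whole sequence "ab"
group_on_repeat <- grepl("^(ab)+$", "ababab")
group_on_bs <- grepl("^(ab)+$", "abbb")
print(group_on_repeat)  # TRUE
print(group_on_bs)      # FALSE
```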

Backreferences: Referencing Previous Groups

You can reference the exact string captured by a group earlier in the same pattern using backreferences (\1 for group 1, \2 for group 2, etc.):

# Match characters repeated twice in a row (e.g. "ee", "pp")
# Using raw string literal:
pattern_raw <- r"((.)\1)"

# Using standard string literal (requires double-escaping backslashes):
pattern_std <- "(.)\\1"

word <- "apple"
print(str_detect(word, pattern_raw)) # TRUE (matches "pp")
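
The double-escaped form of the pattern also works with the base R functions from section 1. As a small extension that the lesson does not cover explicitly, a backreference can appear in the replacement string of gsub() as well, where it inserts the captured text. The word "bookkeeper" is just a convenient test input.

```r
pattern_std <- "(.)\\1"  # same pattern as above, double-escaped form

# Detect doubled characters with base R
has_double_apple <- grepl(pattern_std, "apple")
has_double_pear <- grepl(pattern_std, "pear")
print(has_double_apple)  # TRUE  ("pp")
print(has_double_pear)   # FALSE (no doubled character)

# In the replacement, \\1 inserts whatever group 1 captured,
# so each doubled character collapses to a single one
collapsed <- gsub(pattern_std, "\\1", "bookkeeper")
print(collapsed)         # "bokeper"
```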

Group Matches with str_match()

str_match() extracts each capture group as a separate column:

log_entry <- "USER_125: Login Successful"
# Group 1 is the user ID, Group 2 is the status
pattern <- r"(USER_(\d+): (.*))"

matches <- str_match(log_entry, pattern)
print(matches)
# Column 1: Full match ("USER_125: Login Successful")
# Column 2: Group 1 ("125")
# Column 3: Group 2 ("Login Successful")

# Get user ID
user_id <- matches[1, 2]
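
str_match() is vectorised like the other stringr functions, so a whole vector of log lines yields one matrix row per line. The sketch below assumes stringr is installed and uses invented log lines, including one that does not fit the pattern.

```r
library(stringr)

# Hypothetical log lines; the last one does not fit the pattern
log_entries <- c("USER_125: Login Successful",
                 "USER_7: Login Failed",
                 "system heartbeat")

all_matches <- str_match(log_entries, r"(USER_(\d+): (.*))")

# One row per input string; rows without a match are filled with NA
user_ids <- all_matches[, 2]
statuses <- all_matches[, 3]
print(user_ids)  # "125" "7"   NA
print(statuses)  # "Login Successful" "Login Failed"     NA
```

Checking for NA in the first column is a simple way to find the lines that failed to parse.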

Hands-on Exercises

Exercise 1: Cleaning Transaction IDs

You have a vector of transaction logs: c("TX_1001", "ERROR_992", "TX_1002", "MISSING_88"). Write R code to:

  1. Use grepl() to build a logical vector that is TRUE only for strings that start with "TX_".
  2. Subset the original vector to print only these valid transaction IDs.

Answer:

logs <- c("TX_1001", "ERROR_992", "TX_1002", "MISSING_88")

# Find logs starting with TX_
valid_mask <- grepl("^TX_", logs)
print(valid_mask) # TRUE FALSE  TRUE FALSE

# Subset vector
valid_logs <- logs[valid_mask]
print(valid_logs) # "TX_1001" "TX_1002"
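
For comparison, the same result can be reached with the stringr functions from section 4. This is an alternative sketch, not the exercise's reference answer, and the extraction of the numeric part is an extra step added for illustration.

```r
library(stringr)

logs <- c("TX_1001", "ERROR_992", "TX_1002", "MISSING_88")

# str_subset() combines the detect and subset steps
valid_logs_stringr <- str_subset(logs, "^TX_")
print(valid_logs_stringr)  # "TX_1001" "TX_1002"

# Extra step: pull out the numeric part of each valid ID
tx_numbers <- as.integer(str_extract(valid_logs_stringr, r"(\d+)"))
print(tx_numbers)          # 1001 1002
```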

Exercise 2: Sanitizing Price Tags

You scraped price inputs that contain currency symbols and comma separators: c("$1,200", "$25", "$450.50"). Write R code to:

  1. Remove every dollar sign $ and comma , from the strings using gsub(). (Hint: gsub("\\$|,", "", x) works.) In a regex, an unescaped $ means "end of string", so a literal dollar sign must be escaped, which is written \\$ in an R string.
  2. Convert the resulting clean strings to numeric data types using as.numeric().
  3. Print the final numeric prices.

Answer:

scraped_prices <- c("$1,200", "$25", "$450.50")

# Clean dollar signs and commas
clean_strings <- gsub("\\$|,", "", scraped_prices)
print(clean_strings) # "1200" "25" "450.50"

# Convert to numeric
numeric_prices <- as.numeric(clean_strings)
print(numeric_prices) # 1200.0   25.0  450.5
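
Two alternative ways of handling the special character $ are sketched below. Neither is part of the exercise's reference answer: the first uses a character set from section 3, the second uses gsub()'s fixed argument to switch regex matching off.

```r
scraped_prices <- c("$1,200", "$25", "$450.50")

# Inside [ ], $ is an ordinary character, so no backslashes are needed
clean_alt <- gsub("[$,]", "", scraped_prices)
print(clean_alt)  # "1200"   "25"     "450.50"

# fixed = TRUE treats the pattern as plain text instead of a regex
no_dollar <- gsub("$", "", scraped_prices, fixed = TRUE)
print(no_dollar)  # "1,200"  "25"     "450.50"
```

The character set version removes both symbols in one pass, while fixed = TRUE can only target one literal string per call.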