Strings and Factors (Categorical Data)

Why Learn Strings and Factors in Data Analytics?

Imagine you are analyzing feedback survey responses from a subscription app. You receive the following customer data:

Feedback comment: "Love the app, but it crashes on startup!"
Satisfaction ratings: "High", "Low", "Medium", "High"

To make sense of this dataset:

You want to measure the length of each comment using a function. If a comment is extremely long, it might contain a detailed bug report.
You want to merge the customer's name with their feedback into a single sentence.
You want R to understand that satisfaction ratings are categorical and have a logical order: "Low" < "Medium" < "High".

In R, text is stored as character strings. For categorical data, R provides a specialized type called a factor. Let's learn how to manipulate strings and create factors to handle categorical data.

1. Character Strings in R

A character string holds text and can be wrapped in double quotes (") or single quotes ('). Double quotes are the conventional choice; single quotes are handy when the text itself contains double quotes.

2. String Concatenation: paste() and paste0()

CRITICAL DIFFERENCE FROM PYTHON

In Python, you can join strings using the + operator ("a" + "b"). In R, using + on text will throw an error! Instead, we use R's built-in paste() or paste0() functions:

paste(..., sep = " "): Concatenates strings with a custom separator (default is a single space).
paste0(...): Concatenates strings with no separator (equivalent to paste(..., sep = "")).

first_name <- "Alice"
last_name <- "Smith"

# Using paste (adds a space by default)
full_name <- paste(first_name, last_name)
print(full_name)  # "Alice Smith"

# Using paste0 (no separator)
user_id <- paste0("ID_", 105)
print(user_id)    # "ID_105"
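
As a further sketch, the snippet below shows what happens when + is applied to text and how the sep argument changes the result. The customer name and the fallback error text are made-up values for illustration, not part of the survey data.

```r
# Hypothetical sample values for illustration
customer <- "Alice Smith"
comment  <- "Love the app, but it crashes on startup!"

# "+" is defined for numbers only; capture the error instead of stopping the script
plus_result <- tryCatch(
  customer + comment,
  error = function(e) "error: + does not work on character strings"
)
print(plus_result)

# Merge the name and the feedback into one sentence with a custom separator
sentence <- paste(customer, comment, sep = " says: ")
print(sentence)  # "Alice Smith says: Love the app, but it crashes on startup!"

# paste0() is vectorized: it returns one result per element
ids <- paste0("ID_", 101:103)
print(ids)       # "ID_101" "ID_102" "ID_103"
```

Wrapping the failing call in tryCatch() is only a convenience here so the rest of the script keeps running; in everyday code you would simply use paste() or paste0() from the start.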

3. String Length: nchar() vs length()

R STRING LENGTH PITFALL

To find the number of characters in a string, do not use length().

length("hello") returns 1 because it measures the number of elements in the vector (a single string is a vector of length 1).
nchar("hello") returns 5 because it measures the number of characters in the string.

feedback <- "Great app!"

print(length(feedback)) # 1 (vector length)
print(nchar(feedback))  # 10 (character count)
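
The difference matters most with a vector of several strings. A minimal sketch, assuming two sample comments (the variable names are illustrative):

```r
# Two sample feedback comments stored in one character vector
comments <- c("Great app!", "Love the app, but it crashes on startup!")

n_comments <- length(comments)       # number of elements in the vector
comment_lengths <- nchar(comments)   # character count of each element

print(n_comments)       # 2
print(comment_lengths)  # 10 40

# Pick out the longest comment, which may hold a detailed bug report
longest <- comments[which.max(comment_lengths)]
print(longest)          # "Love the app, but it crashes on startup!"
```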

4. Basic String Manipulation

R provides useful functions for text cleaning:

tolower(text): Converts string to lowercase.
toupper(text): Converts string to uppercase.
substring(text, first, last): Extracts a portion of the text from character position first to last.

code <- "PROD_9982"

# Extract just the numeric ID (characters 6 to 9)
num_id <- substring(code, 6, 9)
print(num_id)  # "9982"
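
Case conversion is useful when the same category was typed in different ways. The sketch below assumes a small vector of inconsistent responses (hypothetical values) and normalizes them before counting distinct categories:

```r
# Hypothetical survey responses with inconsistent casing
responses <- c("High", "HIGH", "high", "Low")

lower_responses <- tolower(responses)
print(lower_responses)            # "high" "high" "high" "low"
print(toupper(responses))         # "HIGH" "HIGH" "HIGH" "LOW"

# After normalizing, only two distinct categories remain
distinct_categories <- unique(lower_responses)
print(distinct_categories)        # "high" "low"

# substring() can also extract the text prefix of a product code
code <- "PROD_9982"
prefix <- substring(code, 1, 4)
print(prefix)                     # "PROD"
```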

4B. Tidyverse String Manipulation: stringr

While base R string functions are useful, the tidyverse includes the stringr package, which provides a highly consistent set of tools starting with the str_ prefix:

1. Concatenation: str_c() vs. paste()

str_c(...) works similarly to paste0(), but handles missing values differently:

paste("hello", NA) converts NA to the text "NA" and returns "hello NA".
str_c("hello", NA) propagates the missing value and returns NA, consistent with how NA behaves in arithmetic (for example, 1 + NA is NA).

library(stringr)
print(paste("Hello", NA))   # "Hello NA"
print(str_c("Hello", NA))   # NA

2. Flattening Vectors: str_flatten()

To combine a vector of multiple strings into a single string:

words <- c("R", "is", "awesome")
flat_sentence <- str_flatten(words, collapse = " ")
print(flat_sentence) # "R is awesome"

3. Smart Subsetting: str_sub()

Unlike base substring(), str_sub(string, start, end) supports negative indexes to count characters backwards from the end of a string:

code <- "PROD_9982"
# Get the last 4 characters using negative indexes
print(str_sub(code, -4, -1)) # "9982"
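
For comparison, here is a sketch of the same extraction in base R, where the start position has to be computed from nchar() by hand. This is one possible base R equivalent, shown only to illustrate what the negative indexes save you from writing.

```r
code <- "PROD_9982"

# Base R: compute the positions of the last 4 characters manually
n <- nchar(code)                         # 9
last_four <- substring(code, n - 3, n)   # characters 6 to 9
print(last_four)                         # "9982"
```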

4. Splitting Strings: str_split()

str_split(string, pattern) splits a string into parts. It returns a list of character vectors. You can use unlist() to flatten this list back into a simple vector:

phrase <- "apple,banana,orange"
parts_list <- str_split(phrase, ",")
print(parts_list) # Returns a list

# Flatten into a vector
parts_vector <- unlist(parts_list)
print(parts_vector) # "apple" "banana" "orange"
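
The reason a list comes back is that each input string can produce a different number of pieces. The sketch below uses base R's strsplit() as a stand-in so it runs without loading stringr; str_split(baskets, ",") returns the same list structure. The baskets vector is a made-up example.

```r
# Two input strings with a different number of comma-separated items
baskets <- c("apple,banana,orange", "kiwi,mango")

# One list element per input string
parts <- strsplit(baskets, ",", fixed = TRUE)
print(lengths(parts))   # 3 2

# Use [[ ]] to pull out the pieces that belong to a single input string
first_basket <- parts[[1]]
print(first_basket)     # "apple" "banana" "orange"
```

Note that unlist() on this result would merge both baskets into one vector of five items, so it is best reserved for the single-string case shown earlier.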

5. Pattern Occurrence & Location: str_count() and str_locate()

str_count(string, pattern): Counts how many times a pattern occurs.
str_locate(string, pattern): Returns a matrix with the start and end positions of the first match.

word <- "statistical"
print(str_count(word, "t")) # 3

# Find index range of "tistic" (s-t-a-t-i-s-t-i-c-a-l: the match begins at the second "t")
print(str_locate(word, "tistic")) # start 4, end 9

5. Factors (Categorical Data)

In data science, some columns represent categories rather than arbitrary text. Factors are used to store categorical data and can be ordered (ordinal data) or unordered (nominal data).

# Creating an unordered factor
genders <- factor(c("Male", "Female", "Female", "Male"))
print(genders) 
# Displays the levels: Levels: Female Male

# Creating an ordered factor (ordinal data)
satisfaction <- factor(
  c("High", "Low", "Medium", "High"),
  levels = c("Low", "Medium", "High"),
  ordered = TRUE
)
print(satisfaction)
# Displays levels with order: Levels: Low < Medium < High

R stores factors internally as integer codes paired with a set of level labels. Modeling functions such as lm() use these levels to build the dummy variables (contrasts) for a regression automatically.
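
To see the integer codes and the effect of ordering, here is a small base R sketch that rebuilds the satisfaction factor from above and inspects it (the helper variable names are illustrative):

```r
satisfaction <- factor(
  c("High", "Low", "Medium", "High"),
  levels = c("Low", "Medium", "High"),
  ordered = TRUE
)

# Integer codes: the position of each value within the levels
codes <- as.integer(satisfaction)
print(codes)                 # 3 1 2 3
print(levels(satisfaction))  # "Low" "Medium" "High"

# Ordered factors support comparisons against a level
at_least_medium <- satisfaction >= "Medium"
print(at_least_medium)       # TRUE FALSE TRUE TRUE

# table() counts the observations in each level, in level order
counts <- table(satisfaction)
print(counts)
```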

6. Manipulating Categorical Variables: forcats

When analyzing factors, we often need to change the order of their levels or rename/re-group the categories. The tidyverse includes the forcats package, which provides helpful utilities for factors (all starting with fct_):

1. Modifying Factor Order: fct_reorder()

By default, R sorts factor levels alphabetically. If you want to sort levels of a factor based on another numeric variable (e.g., ordering job titles by average salary), use fct_reorder():

library(forcats)

jobs <- factor(c("Developer", "Manager", "Developer", "Manager"))
salaries <- c(80000, 120000, 85000, 115000)

# Order the job levels by mean salary (.fun defaults to median if omitted)
ordered_jobs <- fct_reorder(jobs, salaries, .fun = mean)
print(levels(ordered_jobs)) # "Developer" "Manager"
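
In the example above the salary order happens to match the alphabetical order. The sketch below uses made-up job titles and salaries where the two differ, and also shows the .desc argument for sorting from highest to lowest.

```r
library(forcats)

# Hypothetical job titles and salaries for illustration
jobs <- factor(c("Engineer", "Analyst", "Director", "Engineer", "Analyst", "Director"))
salaries <- c(95000, 70000, 150000, 97000, 72000, 148000)

print(levels(jobs))            # alphabetical: "Analyst" "Director" "Engineer"

# Ascending by mean salary
by_salary <- fct_reorder(jobs, salaries, .fun = mean)
print(levels(by_salary))       # "Analyst" "Engineer" "Director"

# Descending by mean salary
by_salary_desc <- fct_reorder(jobs, salaries, .fun = mean, .desc = TRUE)
print(levels(by_salary_desc))  # "Director" "Engineer" "Analyst"
```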

2. Modifying Order by Frequency: fct_infreq() and fct_rev()

fct_infreq(f): Sorts factor levels by the frequency of each category (most common first).
fct_rev(f): Reverses the order of factor levels.

colors <- factor(c("red", "blue", "red", "red", "blue", "green"))
# Sorts levels by frequency: red, blue, green
freq_sorted <- fct_infreq(colors)
print(levels(freq_sorted)) # "red" "blue" "green"
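
The two functions combine naturally. A minimal sketch that puts the rarest category first by reversing the frequency order:

```r
library(forcats)

colors <- factor(c("red", "blue", "red", "red", "blue", "green"))

# Sort by frequency (most common first), then reverse the level order
rare_first <- fct_rev(fct_infreq(colors))
print(levels(rare_first))  # "green" "blue" "red"

# Only the level order changes; the underlying values stay the same
print(as.character(rare_first))
```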

3. Recoding Levels Manually: fct_recode()

To rename levels of a factor, use fct_recode(factor, new_name = "old_name"):

education <- factor(c("HS", "UG", "HS", "Grad"))
# Rename HS to High School, UG to Undergrad, Grad to Postgrad
cleaned_edu <- fct_recode(education,
  "High School" = "HS",
  "Undergrad"   = "UG",
  "Postgrad"    = "Grad"
)
# Renaming keeps the original level order (Grad, HS, UG); it does not re-sort alphabetically
print(levels(cleaned_edu)) # "Postgrad" "High School" "Undergrad"

4. Collapsing Levels: fct_collapse()

If you want to manually combine multiple categories into a single, broader category, use fct_collapse(factor, new_level = c("old1", "old2")):

party <- factor(c("Democrat", "Republican", "Green", "Libertarian"))
# Collapse into "Major" or "Minor"
collapsed_party <- fct_collapse(party,
  "Major" = c("Democrat", "Republican"),
  "Minor" = c("Green", "Libertarian")
)
print(levels(collapsed_party)) # "Major" "Minor"

5. Lumping Small Groups: fct_lump_min() and fct_lump_n()

If you have a categorical variable with many rare levels, you can automatically group rare levels into a single "Other" category:

fct_lump_min(f, min): Lumps all levels that appear fewer than min times.
fct_lump_n(f, n): Retains only the n most common levels and lumps the rest.

browser <- factor(c("Chrome", "Chrome", "Chrome", "Safari", "Safari", "Firefox", "Opera"))

# Example 1: Retain levels appearing at least 2 times
lumped_min <- fct_lump_min(browser, min = 2)
print(levels(lumped_min)) # "Chrome" "Safari" "Other"

# Example 2: Retain the single most popular level, lump the rest
lumped_n <- fct_lump_n(browser, n = 1)
print(levels(lumped_n)) # "Chrome" "Other"
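
Before choosing a value for min or n, it helps to look at how often each level occurs. A short base R sketch using table() on the same browser data (the helper variable names are illustrative):

```r
browser <- factor(c("Chrome", "Chrome", "Chrome", "Safari", "Safari", "Firefox", "Opera"))

# Frequency of each level, most common first
counts <- sort(table(browser), decreasing = TRUE)
print(counts)

# Levels that appear fewer than 2 times are the ones min = 2 would lump
rare_levels <- names(counts)[as.vector(counts) < 2]
print(rare_levels)  # Firefox and Opera each appear once
```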

6. Relabeling Levels Programmatically: fct_relabel()

If you need to rename levels programmatically (e.g. converting level strings to lowercase, replacing words, or appending text) using a custom function, use fct_relabel(factor, function):

OS <- factor(c("windows 11", "macOS high sierra", "windows 10"))

# Example 1: Convert all levels to uppercase using the built-in toupper function
capitalized_OS <- fct_relabel(OS, toupper)
print(levels(capitalized_OS)) # "MACOS HIGH SIERRA" "WINDOWS 10" "WINDOWS 11"

# Example 2: Use an anonymous function (the \(x) shorthand requires R 4.1 or later) to shorten "windows" to "Win"
cleaned_OS <- fct_relabel(OS, \(x) gsub("windows", "Win", x))
print(levels(cleaned_OS)) # "macOS high sierra" "Win 10" "Win 11"

Hands-on Exercises

Exercise 1: Dynamic Message Formatter

You need to print a message warning a user about an unauthorized activity. Write R code to:

Assign username <- "alice_dev".
Convert username to uppercase.
Concatenate the uppercase username with " is not authorized to access database " and the number 5 using paste0().
Print the final message and verify that it is 48 characters long using nchar().

Answer

username <- "alice_dev"
upper_user <- toupper(username)
warning_msg <- paste0(upper_user, " is not authorized to access database ", 5)

print(warning_msg)
# "ALICE_DEV is not authorized to access database 5"

print(nchar(warning_msg)) # 48 (9 + 38 + 1 characters)

Exercise 2: Ordinal Classification

You have a vector of patient risk profiles: c("Medium", "High", "Low", "Medium", "High"). Write R code to:

Convert this vector into an ordered factor where the levels are ordered from lowest risk to highest risk: "Low" < "Medium" < "High".
Print the factor variable and check the output to ensure the levels indicator shows Low < Medium < High.

Answer

risk_vector <- c("Medium", "High", "Low", "Medium", "High")

ordered_risk <- factor(
  risk_vector,
  levels = c("Low", "Medium", "High"),
  ordered = TRUE
)

print(ordered_risk)
# The levels printed should be: Levels: Low < Medium < High

Exercise 3: Factor Level Re-grouping and Cleaning

You are given a factor representing server operating systems in a network: servers <- factor(c("Ubuntu Linux", "Ubuntu Linux", "CentOS Linux", "RedHat Linux", "Windows Server 2022", "Windows Server 2019", "FreeBSD"))

Write R code to:

Load the forcats package.
Programmatically convert all level names to lowercase using fct_relabel() and tolower.
Manually collapse the levels: group "ubuntu linux", "centos linux", and "redhat linux" into a single "linux" level; group "windows server 2022" and "windows server 2019" into a "windows" level; and leave "freebsd" as is. Use fct_collapse().
Lump any levels appearing fewer than 2 times into an "other" level using fct_lump_min().
Print the resulting levels to verify that they are: "linux", "windows", and "other".

Answer

library(forcats)

servers <- factor(c("Ubuntu Linux", "Ubuntu Linux", "CentOS Linux", "RedHat Linux", "Windows Server 2022", "Windows Server 2019", "FreeBSD"))

# 1. Lowercase levels programmatically
servers_lower <- fct_relabel(servers, tolower)

# 2. Collapse into broader categories
servers_collapsed <- fct_collapse(servers_lower,
  "linux"   = c("ubuntu linux", "centos linux", "redhat linux"),
  "windows" = c("windows server 2022", "windows server 2019")
)

# 3. Lump categories with frequency < 2 (other_level overrides the default label "Other")
servers_lumped <- fct_lump_min(servers_collapsed, min = 2, other_level = "other")

print(levels(servers_lumped))
# Expected levels: "linux" "windows" "other"