Custom Functions
Why Learn Custom Functions in Data Analytics?
Imagine you are cleaning monthly user growth rates across several departments (e.g., Marketing, Sales, Engineering). The data cleaning pipeline requires:
- Converting negative rates to absolute values (representing absolute growth magnitude).
- Multiplying by 100 to convert to a percentage.
- Rounding the result to 2 decimal places.
If you copy and paste these three math steps for every department, your code becomes bloated and hard to maintain. If you decide to change the rounding to 3 decimal places later, you must find and update every single copy!
Instead, you want to package this calculation into a reusable box:
clean_growth <- function(rate) {
  rounded <- round(abs(rate) * 100, digits = 2)
  return(rounded)
}
In R, this is called a user-defined function. It lets you modularize your analysis, reduce duplication, and fix a bug in one place instead of in every copy.
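Because abs(), multiplication, and round() are all vectorized, the same function can clean a whole vector of rates in one call. A small sketch, assuming made-up department rates (the names and values below are illustrative, not from a real dataset):

```r
# clean_growth as defined above, repeated so this block runs on its own
clean_growth <- function(rate) {
  rounded <- round(abs(rate) * 100, digits = 2)
  return(rounded)
}

# Hypothetical monthly growth rates per department (illustrative values)
rates <- c(Marketing = -0.0523, Sales = 0.1234, Engineering = 0.07)

# One call cleans every element; the department names are carried along
cleaned <- clean_growth(rates)
print(cleaned)
```

Changing the rounding later means editing a single line inside clean_growth.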
1. Defining a Custom Function
In R, you create a function with the function keyword, an argument list, and a body in braces { }, and assign it to a name with the standard <- operator.
Syntax
function_name <- function(param1, param2) {
  # statements
  return(value)
}
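As a concrete instance of this template, here is a sketch of a two-parameter function. The name pct_change and its parameters are hypothetical, chosen only for illustration:

```r
# Hypothetical example: percentage change from old_value to new_value
pct_change <- function(old_value, new_value) {
  change <- (new_value - old_value) / old_value
  return(change * 100)
}

# Arguments can be matched by position...
print(pct_change(200, 250))  # 25

# ...or by name, in which case the order does not matter
print(pct_change(new_value = 250, old_value = 200))  # 25
```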
2. Implicit vs. Explicit Returns
In R, you can return values in two ways:
- Explicit: Using the return() function.
- Implicit: R automatically returns the value of the last statement evaluated inside the function body.
# Explicit Return
multiply_explicit <- function(x, y) {
  return(x * y)
}
# Implicit Return (Recommended R idiom for simple functions)
multiply_implicit <- function(x, y) {
  x * y # This is the last line, so it gets returned automatically!
}
print(multiply_implicit(5, 4)) # 20
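An explicit return() remains useful when a function needs to exit early, for example in a guard clause, while the normal path still relies on the implicit return. A minimal sketch; safe_ratio is a hypothetical name, and the check assumes a single (scalar) denominator:

```r
# Hypothetical helper: divide two numbers, guarding against a zero denominator
safe_ratio <- function(numerator, denominator) {
  # Guard clause: exit early instead of dividing by zero
  # (assumes denominator is a single number, not a vector)
  if (denominator == 0) {
    return(NA_real_)
  }
  # Last evaluated expression, returned implicitly
  numerator / denominator
}

print(safe_ratio(10, 4))  # 2.5
print(safe_ratio(10, 0))  # NA
```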
3. Parameter Defaults
You can set default values for arguments. If a default is set, you don't have to provide that argument when calling the function:
calculate_target <- function(base, growth_rate = 0.05) {
  base * (1 + growth_rate)
}
# Uses default growth_rate of 0.05
print(calculate_target(100)) # 105
# Overrides default with 0.10
print(calculate_target(100, 0.10)) # 110
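Arguments with defaults can also be overridden by name, which often reads more clearly than relying on position. The sketch below repeats calculate_target so it runs on its own; the input values are illustrative:

```r
calculate_target <- function(base, growth_rate = 0.05) {
  base * (1 + growth_rate)
}

# Override the default by naming the argument
q1 <- calculate_target(200, growth_rate = 0.5)
print(q1)  # 300

# Named arguments may be supplied in any order
q2 <- calculate_target(growth_rate = 0.25, base = 80)
print(q2)  # 100
```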
4. Modifying Outer Scope: The Super-Assignment Operator <<-
By default, variables created inside a function are local to that function and disappear when the function ends.
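If you want to modify a variable defined outside the function (similar in spirit to Python's global and nonlocal keywords), R provides the super-assignment operator <<-. It searches the enclosing environments for an existing variable with that name and reassigns it; if none is found, it creates the variable in the global environment: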
total_runs <- 0
log_run <- function() {
  # Modify variable in the outer parent environment
  total_runs <<- total_runs + 1
}
log_run()
log_run()
print(total_runs) # 2
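The target of <<- is the nearest enclosing environment that already holds the variable, which is not necessarily the global one. A sketch of a counter built as a closure, using the hypothetical name make_counter; each counter keeps its own count inside the function that created it:

```r
# Hypothetical counter factory: each call creates an independent counter
make_counter <- function() {
  count <- 0  # lives in this call's environment, not the global one
  function() {
    count <<- count + 1  # updates the enclosing count
    invisible(count)     # return the new value without auto-printing
  }
}

runs_a <- make_counter()
runs_b <- make_counter()

runs_a()
runs_a()
print(runs_a())  # 3
print(runs_b())  # 1
```

Keeping the state inside a closure avoids accidental clashes with other global variables.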
5. Variable Arguments: The Ellipsis ...
In R, a function can accept an arbitrary number of extra arguments (similar to Python's *args and **kwargs) using three dots ... (the ellipsis). This is commonly used to pass arguments through to another function called inside yours:
# A custom mean calculator that passes any extra settings (like na.rm) to mean()
custom_mean <- function(data, ...) {
  print("Calculating average...")
  mean(data, ...)
}
dataset <- c(10, 20, NA, 30)
# Pass na.rm = TRUE through the ellipsis to the inner mean() function
print(custom_mean(dataset, na.rm = TRUE)) # 20
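Inside a function, the contents of ... can also be captured with list(...), which is handy for inspecting what the caller passed. A minimal sketch; describe_args is a hypothetical helper:

```r
# Hypothetical helper: report how many extra arguments were passed and their names
describe_args <- function(...) {
  args <- list(...)  # collect all extra arguments into a list
  list(
    n_args = length(args),
    arg_names = names(args)  # NULL when no argument is named
  )
}

info <- describe_args(na.rm = TRUE, trim = 0.1)
print(info$n_args)     # 2
print(info$arg_names)  # "na.rm" "trim"
```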
6. Tidy Evaluation in Functions: Embracing {{ }}
If you write custom wrapper functions that interact with the tidyverse (especially packages like dplyr and ggplot2), you will run into a challenge known as indirection.
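Tidyverse functions use data masking, which allows you to refer to columns in a table without using quotes (e.g. typing Age instead of "Age"). However, if you pass an unquoted column name into your own function and forward the argument as-is, dplyr looks for a column literally named after your parameter (group_var), finds none, and then evaluates the argument as an ordinary variable. That fails because no object called model exists outside the data frame: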
library(dplyr)
library(ggplot2) # provides the mpg dataset used below
# This function will FAIL!
grouped_mean_fail <- function(df, group_var, mean_var) {
  df |>
    group_by(group_var) |>
    summarize(mean(mean_var))
}
# Throws error: object 'model' not found
# grouped_mean_fail(mpg, model, hwy)
The Fix: Embracing with curly-curly {{ }}
To pass unquoted column names as arguments to your function, you must embrace them using double curly braces {{ }}. This instructs R to evaluate the argument inside the context of the data frame:
# This function WORKS!
grouped_mean_success <- function(df, group_var, mean_var) {
  df |>
    group_by({{ group_var }}) |>
    summarize(mean({{ mean_var }}))
}
# Successfully groups by model and calculates average highway mileage!
result <- grouped_mean_success(mpg, model, hwy)
print(head(result))
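In the function above, the summary column gets a name derived from the expression itself. dplyr also accepts glue-style names on the left-hand side of :=, so the output column can be named after the embraced argument. A sketch under the assumption that dplyr (with a reasonably recent rlang) is installed; grouped_mean_named and the toy data frame are hypothetical:

```r
library(dplyr)

# Hypothetical variant that gives the summary column a readable name
grouped_mean_named <- function(df, group_var, mean_var) {
  df |>
    group_by({{ group_var }}) |>
    summarize(
      # "mean_{{ mean_var }}" expands to e.g. "mean_rate"
      "mean_{{ mean_var }}" := mean({{ mean_var }}, na.rm = TRUE),
      .groups = "drop"
    )
}

# Toy data, made up for illustration
dept_growth <- data.frame(
  dept = c("Sales", "Sales", "Marketing", "Marketing"),
  rate = c(0.10, 0.20, 0.05, 0.15)
)

result <- grouped_mean_named(dept_growth, dept, rate)
print(result)
```

The result has one row per department and the columns dept and mean_rate.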
Hands-on Exercises
Exercise 1: Clean and Standardize Metric
Write a function called normalize_metric that takes a numeric vector, finds the difference of each element from the mean of the vector, and divides it by the standard deviation (sd()).
- Define normalize_metric <- function(vec) { ... }
- The formula to return is (vec - mean(vec)) / sd(vec). Use implicit return.
- Test your function by calling it with c(10, 20, 30) and print the result.
Answer
normalize_metric <- function(vec) {
  (vec - mean(vec)) / sd(vec)
}
test_vec <- c(10, 20, 30)
print(normalize_metric(test_vec))
# Output: -1 0 1
Exercise 2: Discount Calculator with Defaults
Write a function calculate_price that takes a raw price and a discount rate expressed as a fraction (defaulting to 0.10).
Write R code to:
- Define the function. The formula is price * (1 - discount).
- Call it with a price of 100 and no discount argument, so the default applies (verify it returns 90).
- Call it with a price of 150 and a discount of 0.20 (verify it returns 120).
Answer
calculate_price <- function(price, discount = 0.10) {
  price * (1 - discount)
}
# Test 1
print(calculate_price(100)) # 90
# Test 2
print(calculate_price(150, 0.20)) # 120