Custom Functions

Why Learn Custom Functions in Data Analytics?

Imagine you are cleaning monthly user growth rates across several departments (e.g., Marketing, Sales, Engineering). The data cleaning pipeline requires:

Converting negative rates to absolute values (representing absolute growth magnitude).
Multiplying by 100 to convert to a percentage.
Rounding the result to 2 decimal places.

If you copy and paste these three math steps for every department, your code becomes bloated and hard to maintain. If you decide to change the rounding to 3 decimal places later, you must find and update every single copy!

Instead, you want to package this calculation into a reusable box:

clean_growth <- function(rate) {
  rounded <- round(abs(rate) * 100, digits = 2)
  return(rounded)
}

In R, this is a User-Defined Function. It allows you to modularize your analytics, reduce duplication, and debug errors easily.
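As a rough illustration of that reuse, the sketch below applies clean_growth() to two departments. The rate vectors (marketing_rates, sales_rates) are made-up values, not data from this lesson. Because abs() and round() are vectorized, a single call cleans a whole vector:

```r
# Same function as above, repeated so this block runs on its own
clean_growth <- function(rate) {
  rounded <- round(abs(rate) * 100, digits = 2)
  return(rounded)
}

# Hypothetical monthly growth rates, invented for illustration
marketing_rates <- c(0.0523, -0.0131, 0.0275)
sales_rates <- c(-0.0412, 0.0389)

marketing_clean <- clean_growth(marketing_rates)
sales_clean <- clean_growth(sales_rates)

print(marketing_clean) # 5.23 1.31 2.75
print(sales_clean)     # 4.12 3.89
```

If the rounding rule changes later, only the one line inside clean_growth() needs to be edited.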

1. Defining a Custom Function

In R, you create a function with the function keyword, followed by an argument list in parentheses and a body in braces { }. You then assign it to a name with the standard <- operator.

Syntax

function_name <- function(param1, param2) {
  # statements
  return(value)
}
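To make the template concrete, here is a minimal sketch that fills in each slot. The function name growth_between and its parameters are hypothetical, chosen only for illustration:

```r
# function_name -> growth_between; param1, param2 -> previous, current
growth_between <- function(previous, current) {
  # statements: compute the relative change from previous to current
  change <- (current - previous) / previous
  return(change)
}

print(growth_between(200, 250)) # 0.25
```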

2. Implicit vs. Explicit Returns

In R, you can return values in two ways:

Explicit: Using the return() function.
Implicit: R automatically returns the value of the last expression evaluated inside the function body.

# Explicit Return
multiply_explicit <- function(x, y) {
  return(x * y)
}

# Implicit Return (Recommended R idiom for simple functions)
multiply_implicit <- function(x, y) {
  x * y # This is the last line, so it gets returned automatically!
}

print(multiply_implicit(5, 4)) # 20
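An explicit return() is still useful when a function needs to exit early. The sketch below uses a hypothetical helper, safe_ratio, that stops with NA when the denominator is zero and otherwise falls through to an implicit return. It assumes both arguments are single numbers, since if() expects a length-one condition:

```r
safe_ratio <- function(numerator, denominator) {
  if (denominator == 0) {
    return(NA_real_) # Explicit return: leave the function immediately
  }
  numerator / denominator # Implicit return: last expression evaluated
}

print(safe_ratio(10, 4)) # 2.5
print(safe_ratio(10, 0)) # NA
```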

3. Parameter Defaults

You can set default values for arguments. If a default is set, you don't have to provide that argument when calling the function:

calculate_target <- function(base, growth_rate = 0.05) {
  base * (1 + growth_rate)
}

# Uses default growth_rate of 0.05
print(calculate_target(100))      # 105

# Overrides default with 0.10
print(calculate_target(100, 0.10)) # 110
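Arguments can also be passed by name, which makes calls easier to read and lets you supply them in any order. A short sketch using the same calculate_target function (redefined here so the block runs on its own):

```r
calculate_target <- function(base, growth_rate = 0.05) {
  base * (1 + growth_rate)
}

# Named arguments can appear in any order
by_name <- calculate_target(growth_rate = 0.20, base = 50)
print(by_name) # 60

# Mixing styles: base by position, growth_rate by name
mixed <- calculate_target(200, growth_rate = 0.10)
print(mixed) # 220
```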

4. Modifying Outer Scope: The Super-Assignment Operator <<-

By default, variables created inside a function are local to that function and disappear when the function ends.

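If you want to modify a variable defined outside the function, R provides the super-assignment operator <<-. It searches the enclosing environments for an existing variable with that name and reassigns it there; if none is found, it creates the variable in the global environment. This is roughly comparable to Python's global and nonlocal keywords: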

total_runs <- 0

log_run <- function() {
  # Modify total_runs in the enclosing environment (here, the global environment)
  total_runs <<- total_runs + 1
}

log_run()
log_run()
print(total_runs) # 2
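Because <<- modifies shared state, it can make functions harder to reason about. One commonly used alternative is to keep the state inside a closure, so that each counter has its own private variable. This is a sketch of that pattern, not part of the example above; make_counter is a hypothetical name:

```r
make_counter <- function() {
  count <- 0 # Lives in the environment created by this call to make_counter()
  function() {
    # <<- finds count in the enclosing function environment, not the global one
    count <<- count + 1
    count
  }
}

counter_a <- make_counter()
counter_b <- make_counter()

counter_a()
counter_a()
print(counter_a()) # 3
print(counter_b()) # 1, because counter_b keeps its own independent count
```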

5. Variable Arguments: The Ellipsis ...

In R, you can accept an arbitrary number of arguments (roughly analogous to Python's *args and **kwargs) using three dots ... (the ellipsis). This is commonly used to pass extra arguments through to another function that your function calls:

# A custom mean calculator that passes any extra settings (like na.rm) to mean()
custom_mean <- function(data, ...) {
  print("Calculating average...")
  mean(data, ...)
}

dataset <- c(10, 20, NA, 30)

# Pass na.rm = TRUE through the ellipsis to the inner mean() function
print(custom_mean(dataset, na.rm = TRUE)) # 20
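Inside a function, list(...) collects whatever was passed through the ellipsis, which is handy when you need to inspect the extra arguments rather than forward them. A minimal sketch; summarize_inputs is a hypothetical helper and the values are invented:

```r
summarize_inputs <- function(...) {
  inputs <- list(...) # Capture all extra arguments as a list
  list(
    count = length(inputs),
    # "" for arguments passed without a name (NULL if none are named)
    labels = names(inputs)
  )
}

info <- summarize_inputs(marketing = 0.05, sales = 0.03, 0.01)
print(info$count)  # 3
print(info$labels) # "marketing" "sales" ""
```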

6. Tidy Evaluation in Functions: Embracing {{ }}

If you write custom wrapper functions that interact with the tidyverse (especially packages like dplyr and ggplot2), you will run into a challenge known as indirection.

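Tidyverse functions use data masking, which allows you to refer to columns in a table without using quotes (e.g. typing Age instead of "Age"). However, if you pass an unquoted column name into your own function and hand the argument straight to dplyr, it will not work. dplyr first looks for a column literally named group_var; finding none, it evaluates the argument as an ordinary R variable outside the data frame, and fails because no such variable exists: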

library(dplyr)
library(ggplot2) # provides the mpg dataset used in these examples
# This function will FAIL!
grouped_mean_fail <- function(df, group_var, mean_var) {
  df |>
    group_by(group_var) |>
    summarize(mean(mean_var))
}

# Throws error: object 'model' not found
# grouped_mean_fail(mpg, model, hwy)

The Fix: Embracing with curly-curly {{ }}

To pass unquoted column names as arguments to your function, you must embrace them using double curly braces {{ }}. This tells dplyr to use the expression the caller supplied (for example, model) and evaluate it as a column of the data frame:

# This function WORKS!
grouped_mean_success <- function(df, group_var, mean_var) {
  df |>
    group_by({{ group_var }}) |>
    summarize(mean({{ mean_var }}))
}

# Successfully groups by model and calculates average highway mileage!
result <- grouped_mean_success(mpg, model, hwy)
print(head(result))
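To see the embraced version on data small enough to check by hand, the sketch below builds a tiny data frame inline. The growth data frame, its values, and the name grouped_mean_named are invented for illustration; the block assumes dplyr is installed and R is version 4.1 or later (for the |> pipe):

```r
library(dplyr)

# Hypothetical department data, invented for illustration
growth <- data.frame(
  department = c("Marketing", "Marketing", "Sales", "Sales"),
  rate = c(0.04, 0.06, 0.01, 0.03)
)

grouped_mean_named <- function(df, group_var, mean_var) {
  df |>
    group_by({{ group_var }}) |>
    # Give the summary column an explicit name instead of the default
    summarize(avg = mean({{ mean_var }}), .groups = "drop")
}

result <- grouped_mean_named(growth, department, rate)
print(result) # One row per department: Marketing 0.05, Sales 0.02
```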

Hands-on Exercises

Exercise 1: Clean and Standardize Metric

Write a function called normalize_metric that takes a numeric vector, subtracts the mean of the vector from each element, and divides the result by the standard deviation (sd()).

Define normalize_metric <- function(vec) { ... }
The formula to return is (vec - mean(vec)) / sd(vec). Use implicit return.
Test your function by calling it with c(10, 20, 30) and print the result.

Answer

normalize_metric <- function(vec) {
  (vec - mean(vec)) / sd(vec)
}

test_vec <- c(10, 20, 30)
print(normalize_metric(test_vec))
# Output: [1] -1  0  1

Exercise 2: Discount Calculator with Defaults

Write a function calculate_price that takes a raw price and a discount rate expressed as a fraction (defaulting to 0.10). Write R code to:

Define the function. The formula is price * (1 - discount).
Call it with a price of 100 and no discount argument, so the default applies (verify it returns 90).
Call it with a price of 150 and a discount of 0.20 (verify it returns 120).

Answer

calculate_price <- function(price, discount = 0.10) {
  price * (1 - discount)
}

# Test 1
print(calculate_price(100)) # 90

# Test 2
print(calculate_price(150, 0.20)) # 120