Data Frames (Tabular Structures)

Why Learn Data Frames in Data Analytics?

Imagine you are analyzing customer registration data. You have the following details for three customers:

Names: Alice, Bob, Charlie
Ages: 25, 32, 19
Subscription Status: TRUE, FALSE, TRUE

If you keep these as three separate vectors, filtering for "active subscribers over 20 years old" means applying the same condition to each vector by hand. Worse, if you sort or reorder one vector, it falls out of sync with the others.

You need a spreadsheet-like structure where columns have names and each row represents one connected record.

In R, this structure is a data frame (data.frame). It is R's built-in structure for tabular data: every column is a vector of a single type, and all columns share the same length. In this chapter, we will learn how to create data frames, inspect their structure, extract rows or columns, and filter for target subsets.

1. Creating a Data Frame

A data frame is constructed with the data.frame() function, which combines vectors of the same length into columns (a length-one value is recycled to fill every row):

# Create vectors
names <- c("Alice", "Bob", "Charlie")
ages <- c(25, 32, 19)
active <- c(TRUE, FALSE, TRUE)

# Combine into a data frame
customers <- data.frame(Name = names, Age = ages, ActiveStatus = active)
print(customers)
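
To illustrate the same-length requirement, here is a minimal sketch that uses base R only. The column names id, score, and source are made up for the illustration and do not come from the customer example.

```r
# Columns of equal length combine without complaint
ok <- data.frame(id = 1:3, score = c(90, 85, 70))
print(nrow(ok))

# A length-one value is recycled to fill every row
recycled <- data.frame(id = 1:3, source = "web")
print(recycled$source)

# Lengths 3 and 2 cannot be aligned, so data.frame() raises an error.
# tryCatch() captures the error message instead of stopping the script.
result <- tryCatch(
  data.frame(id = 1:3, score = c(90, 85)),
  error = function(e) conditionMessage(e)
)
print(result)
```

Capturing the message with tryCatch() keeps the example runnable from top to bottom; in interactive work you would simply see the error in the console.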

2. Inspecting Data Frames

When working with larger datasets, printing the whole table to the console is impractical. R provides helper functions to inspect tables:

dim(df): Returns dimensions (rows, columns).
nrow(df) / ncol(df): Returns the number of rows or columns.
head(df, n): Returns the first n rows (6 by default).
str(df): Displays the structure of the data frame (column types, dimensions).
summary(df): Provides summary statistics for each column.

# Inspect customer data frame structure
str(customers)
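
The other inspection functions listed above can be tried the same way. This is a small sketch that recreates the customers data frame so it runs on its own.

```r
# Recreate the customers data frame so this block is self-contained
customers <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 32, 19),
  ActiveStatus = c(TRUE, FALSE, TRUE)
)

print(dim(customers))      # number of rows, then number of columns
print(nrow(customers))     # rows only
print(ncol(customers))     # columns only
print(head(customers, 2))  # first two rows
print(summary(customers))  # per-column summaries
```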

3. Selecting Columns and Rows (Indexing)

Like vectors, data frames use square brackets [row, column] for indexing.

df[row, col]: Gets value at specific row and column.
df[row, ]: Returns the entire row as a sub-table.
df[, col]: Returns the entire column as a vector.
df$col: Shorthand to get a column, by name, as a vector.

# Get Alice's Age (Row 1, Column 2)
print(customers[1, 2]) # 25

# Get Bob's entire record (Row 2)
print(customers[2, ]) 

# Get all Names (Column 1)
print(customers[, 1]) # "Alice" "Bob" "Charlie"

# Alternative: Get all Ages using $
print(customers$Age)  # 25 32 19
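
Positions can also be replaced by column names, which tends to be more robust when columns are reordered. The sketch below recreates the customers data frame and also shows the drop = FALSE argument of the bracket operator, which keeps a single column as a table instead of collapsing it to a vector.

```r
# Recreate the customers data frame so this block is self-contained
customers <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 32, 19),
  ActiveStatus = c(TRUE, FALSE, TRUE)
)

# Select by column name instead of position
bob_age <- customers[2, "Age"]
print(bob_age)

# Several columns at once: the result is still a data frame
name_age <- customers[, c("Name", "Age")]
print(name_age)

# A single column collapses to a vector by default
ages_vector <- customers[, "Age"]
print(is.data.frame(ages_vector))

# drop = FALSE keeps the one-column result as a data frame
ages_table <- customers[, "Age", drop = FALSE]
print(is.data.frame(ages_table))
```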

4. Subsetting / Filtering Data

In data science, we constantly filter tables based on logical conditions. We can place a logical comparison inside the row position of df[row_condition, ]:

# Find all rows where Age is greater than 20
adult_rows <- customers[customers$Age > 20, ]
print(adult_rows)

# Find active customers (ActiveStatus is already logical, so "== TRUE" is not needed)
active_customers <- customers[customers$ActiveStatus, ]
print(active_customers)
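
The motivating question from the introduction, "active subscribers over 20 years old", needs two conditions at once. A minimal sketch, recreating the customers data frame and combining the conditions with the element-wise & operator:

```r
# Recreate the customers data frame so this block is self-contained
customers <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 32, 19),
  ActiveStatus = c(TRUE, FALSE, TRUE)
)

# Keep rows where both conditions hold (element-wise AND)
active_over_20 <- customers[customers$ActiveStatus & customers$Age > 20, ]
print(active_over_20)

# Base R's subset() expresses the same filter without repeating customers$
same_result <- subset(customers, ActiveStatus & Age > 20)
print(same_result)
```

Only Alice satisfies both conditions: Bob is inactive and Charlie is 19.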

5. Modern Tables: Tibble vs. data.frame

While base R provides data.frame, the tidyverse package tibble provides a modern alternative.

library(tibble)  # tibble() and glimpse() are also available after library(dplyr)
# Create a tibble
cust_tibble <- tibble(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 32, 19)
)
print(class(cust_tibble))
# Output displays three classes: "tbl_df" "tbl" "data.frame"

Why use Tibbles?

Painless Printing: By default, a tibble prints only the first 10 rows and lists column data types (e.g., <chr>, <dbl>), preventing console flooding.
Strict Subsetting: Single brackets [ ] always return another tibble, even when you select one column, so the result type never changes silently. To extract a single column as a vector, use [[ ]] or $.
glimpse(): A tidyverse function that prints one line per column, with the values running across (useful for very wide tables):

glimpse(cust_tibble)
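
The subsetting difference is easiest to see side by side. This sketch assumes the tibble package is installed; the object names cust_df and cust_tbl are chosen for the illustration.

```r
library(tibble)

cust_df <- data.frame(Name = c("Alice", "Bob"), Age = c(25, 32))
cust_tbl <- tibble(Name = c("Alice", "Bob"), Age = c(25, 32))

# Single brackets with one column:
# the data frame collapses to a vector, the tibble stays a tibble
print(class(cust_df[, "Age"]))
print(class(cust_tbl[, "Age"]))

# [[ ]] and $ return a plain vector from a tibble
print(cust_tbl[["Age"]])
print(cust_tbl$Age)
```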

6. Combining Tables: rbind() and cbind()

To combine vectors or tables, R provides two functions:

rbind(...): Binds data frames or matrices by row (adds observations). Data frames must have the same column names; columns are matched by name, so their order may differ.
cbind(...): Binds/joins data frames, matrices, or vectors by column (adds new variables). The items must have the exact same number of rows.

# rbind example (adding a row/observation)
t1 <- tibble(Name = "Alice", Age = 25)
t2 <- tibble(Name = "Bob", Age = 32)
combined_rows <- rbind(t1, t2)
print(combined_rows)

# cbind example (adding a column/variable)
new_info <- c("Boston", "Chicago")
combined_cols <- cbind(combined_rows, City = new_info)
print(combined_cols)
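
The two requirements above can be checked with a short experiment. This is a sketch using plain data frames; the objects first and second and the extra values ("Cara", "Denver", the Years column) are invented for the illustration.

```r
first <- data.frame(Name = "Alice", Age = 25)
second <- data.frame(Age = 32, Name = "Bob")  # same names, different order

# rbind() matches data frame columns by name, not by position
stacked <- rbind(first, second)
print(stacked)

# Different column names stop rbind() with an error
rbind_error <- tryCatch(
  rbind(first, data.frame(Name = "Cara", Years = 41)),
  error = function(e) conditionMessage(e)
)
print(rbind_error)

# cbind() needs matching row counts: 2 rows cannot take 3 new values
cbind_error <- tryCatch(
  cbind(stacked, City = c("Boston", "Chicago", "Denver")),
  error = function(e) conditionMessage(e)
)
print(cbind_error)
```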

7. Importing and Exporting Data (File I/O)

In real-world data science, you rarely type datasets by hand. The tidyverse package readr is a widely used way to load and save data; base R offers read.csv() and write.csv() for the same job.

Loading Delimited Files: read_csv()

read_csv(file): Reads a Comma-Separated Values file into a tibble.
read_tsv(file): Reads a Tab-Separated Values file.

library(readr)
# Load a CSV file
# df <- read_csv("sales_data.csv")

Handling Common CSV Issues

Sometimes CSV files contain metadata at the top, comments, or missing value indicators. read_csv() handles these through its arguments:

Skipping Rows: Use skip = n to skip metadata headers, or use comment = "#" to skip lines beginning with a comment symbol.
Missing Column Headers: If the CSV file has no header row, set col_names = FALSE (readr will assign names like X1, X2) or pass a character vector of names: col_names = c("Product", "Price", "Qty").
Specifying Missing (NA) values: If missing data is represented by special symbols like . or N/A, specify them using the na argument: na = c(".", "N/A").
Enforcing Column Types: By default, readr guesses column types from the first 1,000 rows. You can override the guess with the col_types argument:

# Load file and explicitly specify column types
# df <- read_csv("sales_data.csv",
#   col_types = cols(
#     Transaction_ID = col_integer(),
#     Price = col_double(),
#     Date = col_date(format = "%Y-%m-%d")
#   )
# )
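
Because the examples above are commented out, here is a runnable sketch that writes a small file to a temporary location and reads it back with several of these arguments together. The file contents and the metadata line are made up for the illustration, and the readr package is assumed to be installed.

```r
library(readr)

# Write a small sample file to a temporary location (contents are made up)
sample_path <- tempfile(fileext = ".csv")
writeLines(
  c(
    "Exported by a hypothetical sales system",
    "Product,Price,Qty",
    "Laptop,1200,2",
    "Mouse,N/A,10",
    "Keyboard,75,."
  ),
  sample_path
)

sales <- read_csv(
  sample_path,
  skip = 1,            # ignore the metadata line above the header
  na = c(".", "N/A"),  # treat these markers as missing values
  col_types = cols(
    Product = col_character(),
    Price = col_double(),
    Qty = col_integer()
  )
)
print(sales)
```

The Mouse price and the Keyboard quantity should come back as NA rather than as text, which keeps both columns numeric.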

Exporting / Saving Data

To write a data frame back to disk, use write_csv():

# write_csv(combined_cols, "output_data.csv")

Native R Binary Files: RDS

If you want to save an R object (such as a list, a model, or a data frame with factor columns) while preserving its exact classes and attributes (for example, factor levels), write it to an RDS binary file using write_rds() and read it back with read_rds(). These readr functions wrap base R's saveRDS() and readRDS():

# Save to RDS
# write_rds(combined_cols, "my_dataset.rds")

# Load from RDS
# my_data <- read_rds("my_dataset.rds")
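
To see what "preserving classes" means in practice, the following sketch round-trips a factor column through both formats. The orders data frame is invented for the illustration, files go to temporary paths, and the readr package is assumed to be installed.

```r
library(readr)

# A small table with a factor column whose levels have a deliberate order
orders <- data.frame(
  Item = c("Shirt", "Hat", "Coat"),
  Size = factor(c("M", "S", "L"), levels = c("S", "M", "L"))
)

csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

# CSV stores plain text, so the factor comes back as a character column
write_csv(orders, csv_path)
from_csv <- suppressMessages(read_csv(csv_path))
print(class(from_csv$Size))

# RDS stores the R object itself, so the factor and its levels survive
write_rds(orders, rds_path)
from_rds <- read_rds(rds_path)
print(class(from_rds$Size))
print(levels(from_rds$Size))
```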

Hands-on Exercises

Exercise 1: Building a Sales Ledger

Create a data frame representing sales transactions with three columns: Product (values "Laptop", "Mouse", "Keyboard"), Price (values 1200, 25, 75), and Quantity (values 2, 10, 5). Write R code to:

Create and print the data frame.
Calculate the average Price using the mean() function on the Price column.
Retrieve and print the dimensions of the data frame.

Answer

ledger <- data.frame(
  Product = c("Laptop", "Mouse", "Keyboard"),
  Price = c(1200, 25, 75),
  Quantity = c(2, 10, 5)
)

print(ledger)

avg_price <- mean(ledger$Price)
# paste0() avoids the space that paste() would insert after "$"
print(paste0("Average Price: $", round(avg_price, 2)))

print(dim(ledger)) # 3 rows, 3 columns

Exercise 2: Filtering High-Value Sales

Using the ledger data frame from Exercise 1, write R code to:

Re-create the ledger data frame.
Filter and print only the rows where Price * Quantity (total value) is greater than $200.

Answer

ledger <- data.frame(
  Product = c("Laptop", "Mouse", "Keyboard"),
  Price = c(1200, 25, 75),
  Quantity = c(2, 10, 5)
)

# Calculate total value and filter
high_value_sales <- ledger[(ledger$Price * ledger$Quantity) > 200, ]
print(high_value_sales)
# All three rows print, because every total exceeds 200:
# Laptop (2400), Mouse (250), and Keyboard (375)