Data Frames (Tabular Structures)
Why Learn Data Frames in Data Analytics?
Imagine you are analyzing customer registration data. You have the following details for three customers:
- Names: Alice, Bob, Charlie
- Ages: 25, 32, 19
- Subscription Status: TRUE, FALSE, TRUE
If you keep these as three separate vectors, filtering for "active subscribers over 20 years old" is challenging. If you sort or reorder one vector, it falls out of sync with the others.
You need a spreadsheet-like structure in which columns have names and each row represents one connected record.
In R, this structure is a Data Frame (data.frame). It is R's native 2D grid structure. In this chapter, we will learn how to create data frames, inspect their structures, extract rows or columns, and filter for target subsets.
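The synchronization problem with separate vectors can be sketched as follows. This is a minimal illustration using the three customers from the introduction; the variable names (cust_names, idx, and so on) are chosen here for the example only.

```r
# Three parallel vectors describing the same customers
cust_names <- c("Alice", "Bob", "Charlie")
ages <- c(25, 32, 19)

# Sorting one vector on its own breaks the pairing with the others
sorted_ages <- sort(ages)
mismatched <- paste(cust_names, sorted_ages)

# Staying in sync means reordering every vector with the same index
idx <- order(ages)
aligned <- paste(cust_names[idx], ages[idx])

print(mismatched)  # names are now paired with the wrong ages
print(aligned)     # names and ages still belong together
```

A data frame removes the need for this bookkeeping, because reordering its rows moves every column at once.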
1. Creating a Data Frame
A data frame is constructed using the data.frame() function, combining vectors of the exact same length as columns:
# Create vectors
names <- c("Alice", "Bob", "Charlie")
ages <- c(25, 32, 19)
active <- c(TRUE, FALSE, TRUE)
# Combine into a data frame
customers <- data.frame(Name = names, Age = ages, ActiveStatus = active)
print(customers)
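The length requirement can be checked directly. The sketch below assumes base R only and shows that data.frame() refuses vectors of incompatible lengths, while a single value is recycled to fill every row. The Plan column is a hypothetical addition used purely for illustration.

```r
# A 3-element vector and a 2-element vector cannot form a table
result <- tryCatch(
  data.frame(Name = c("Alice", "Bob", "Charlie"), Age = c(25, 32)),
  error = function(e) conditionMessage(e)
)
print(result)  # the error message, captured as a string

# A length-1 value is the exception: it is repeated for every row
recycled <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Plan = "Basic"  # hypothetical column for illustration
)
print(recycled)
```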
2. Inspecting Data Frames
When working with larger datasets, printing the whole table to the console is impractical. R provides helpful functions to inspect tables:
- dim(df): Returns dimensions (rows, columns).
- nrow(df) / ncol(df): Returns the number of rows or columns.
- head(df, n): Prints the first n rows.
- str(df): Displays the structure of the data frame (column types, dimensions).
- summary(df): Provides summary statistics for each column.
# Inspect customer data frame structure
str(customers)
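The other inspection functions listed above can be tried on the same table. The sketch re-creates the customers data frame so that it runs on its own; first_two is a name introduced here for illustration.

```r
customers <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 32, 19),
  ActiveStatus = c(TRUE, FALSE, TRUE)
)

print(dim(customers))   # number of rows, then number of columns
print(nrow(customers))  # rows only
print(ncol(customers))  # columns only

# head() returns a smaller data frame containing the first n rows
first_two <- head(customers, 2)
print(first_two)

# summary() reports per-column statistics (min, mean, max, counts)
print(summary(customers))
```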
3. Selecting Columns and Rows (Indexing)
Like vectors, data frames use square brackets [row, column] for indexing.
- df[row, col]: Gets value at specific row and column.
- df[row, ]: Returns the entire row as a sub-table.
- df[, col]: Returns the entire column as a vector.
- df$col: Fast syntax to get a column as a vector.
# Get Alice's Age (Row 1, Column 2)
print(customers[1, 2]) # 25
# Get Bob's entire record (Row 2)
print(customers[2, ])
# Get all Names (Column 1)
print(customers[, 1]) # "Alice" "Bob" "Charlie"
# Alternative: Get all Ages using $
print(customers$Age) # 25 32 19
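Indexing also accepts vectors of positions or column names, which is a natural extension of the forms above. The following sketch assumes the same customers table; the drop = FALSE argument is a base R option that keeps a single selected column as a data frame instead of a vector.

```r
customers <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 32, 19),
  ActiveStatus = c(TRUE, FALSE, TRUE)
)

# Several columns by name: the result is still a data frame
name_age <- customers[, c("Name", "Age")]

# One column by name: the result is simplified to a vector
age_vec <- customers[, "Age"]

# One column, kept as a one-column data frame
age_df <- customers[, "Age", drop = FALSE]

# Several rows by position
first_and_third <- customers[c(1, 3), ]

print(name_age)
print(age_vec)
print(age_df)
print(first_and_third)
```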
4. Subsetting / Filtering Data
In data science, we constantly filter tables based on logical conditions. We can place a logical comparison inside the row position of df[row_condition, ]:
# Find all rows where Age is greater than 20
adult_rows <- customers[customers$Age > 20, ]
print(adult_rows)
# Find active customers (a logical column can serve directly as the row condition)
active_customers <- customers[customers$ActiveStatus, ]
print(active_customers)
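Conditions can be combined with & (and) or | (or), which answers the question from the introduction: active subscribers over 20 years old. The sketch below assumes the same customers table and also shows base R's subset() as an alternative spelling of the same filter.

```r
customers <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 32, 19),
  ActiveStatus = c(TRUE, FALSE, TRUE)
)

# Both conditions must hold for a row to be kept
active_over_20 <- customers[customers$ActiveStatus & customers$Age > 20, ]
print(active_over_20)

# subset() lets you refer to columns without the customers$ prefix
same_result <- subset(customers, ActiveStatus & Age > 20)
print(same_result)
```

Only Alice satisfies both conditions: Bob is not active, and Charlie is under 20.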
5. Modern Tables: Tibble vs. data.frame
While base R provides data.frame, the tidyverse package tibble provides a modern alternative.
library(tibble) # tibble() and glimpse() are also available after library(dplyr)
# Create a tibble
cust_tibble <- tibble(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 32, 19)
)
print(class(cust_tibble))
# Output displays three classes: "tbl_df" "tbl" "data.frame"
Why use Tibbles?
- Painless Printing: A tibble only prints the first 10 rows and lists column data types (e.g., <chr>, <dbl>), preventing console flooding.
- Strict Subsetting: Single brackets [ ] always return a tibble, so you must use double brackets [[ ]] (or $) to extract a single column as a vector. This avoids the silent conversion from table to vector that data.frame performs.
glimpse(): A tidyverse function to view data columns transpose-style (useful for very wide tables):
glimpse(cust_tibble)
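The subsetting difference between the two table types can be seen side by side. This is a small sketch that assumes the tibble package is installed; the requireNamespace() guard is added here only so that the script still runs when it is not.

```r
# Base data frame: selecting one column with [ , ] drops to a vector
df <- data.frame(Name = c("Alice", "Bob"), Age = c(25, 32))
df_col <- df[, "Age"]
print(class(df_col))

if (requireNamespace("tibble", quietly = TRUE)) {
  tb <- tibble::tibble(Name = c("Alice", "Bob"), Age = c(25, 32))

  tb_col <- tb[, "Age"]     # stays a one-column tibble
  tb_vec <- tb[["Age"]]     # explicit extraction gives a vector
  tb_dollar <- tb$Age       # $ also gives a vector

  print(class(tb_col))
  print(class(tb_vec))
}
```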
6. Combining Tables: rbind() and cbind()
To merge different vectors or tables together, R provides two functions:
- rbind(...): Binds/joins data frames or matrices by row (adds observations). The tables must have the exact same column names.
- cbind(...): Binds/joins data frames, matrices, or vectors by column (adds new variables). The items must have the exact same number of rows.
# rbind example (adding a row/observation)
t1 <- tibble(Name = "Alice", Age = 25)
t2 <- tibble(Name = "Bob", Age = 32)
combined_rows <- rbind(t1, t2)
print(combined_rows)
# cbind example (adding a column/variable)
new_info <- c("Boston", "Chicago")
combined_cols <- cbind(combined_rows, City = new_info)
print(combined_cols)
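The two requirements (matching column names for rbind(), matching row counts for cbind()) can be probed with base data frames. The sketch below is illustrative only: the Years column and the third city are hypothetical values introduced to trigger the errors.

```r
a <- data.frame(Name = "Alice", Age = 25)
b <- data.frame(Age = 32, Name = "Bob")  # same names, different order

# For data frames, rbind() matches columns by name, not by position
stacked <- rbind(a, b)
print(stacked)

# Different column names: rbind() stops with an error
rbind_error <- tryCatch(
  rbind(a, data.frame(Name = "Charlie", Years = 19)),
  error = function(e) conditionMessage(e)
)
print(rbind_error)

# Three values for a two-row table: cbind() stops with an error
cbind_error <- tryCatch(
  cbind(stacked, City = c("Boston", "Chicago", "Denver")),
  error = function(e) conditionMessage(e)
)
print(cbind_error)
```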
7. Importing and Exporting Data (File I/O)
In real-world data science, you rarely type datasets by hand. The tidyverse package readr is a standard tool for loading and saving data.
Loading Delimited Files: read_csv()
- read_csv(file): Reads a Comma-Separated Values file into a tibble.
- read_tsv(file): Reads a Tab-Separated Values file.
library(readr)
# Load a CSV file
# df <- read_csv("sales_data.csv")
Handling Common CSV Issues
Sometimes CSV files contain metadata at the top, comments, or missing value indicators. read_csv() handles these cases via keyword arguments:
- Skipping Rows: Use skip = n to skip n lines of metadata at the top of the file, or use comment = "#" to ignore lines beginning with a comment symbol.
- Missing Column Headers: If the CSV file has no header row, set col_names = FALSE (readr will assign names like X1, X2) or pass a character vector of names: col_names = c("Product", "Price", "Qty").
- Specifying Missing (NA) Values: If missing data is represented by special symbols like . or N/A, specify them using the na argument: na = c(".", "N/A").
- Enforcing Column Types: readr guesses column types from the first 1,000 rows. You can override its guess using the col_types parameter:
# Load file and explicitly specify column types
# df <- read_csv("sales_data.csv",
#   col_types = cols(
#     Transaction_ID = col_integer(),
#     Price = col_double(),
#     Date = col_date(format = "%Y-%m-%d")
#   )
# )
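Several of these arguments can be tried together without an existing file by writing a small CSV to a temporary location first. The sketch below assumes the readr package is installed (the guard lets the script run otherwise); the file contents and the column names Product, Price, and Qty are made up for illustration.

```r
if (requireNamespace("readr", quietly = TRUE)) {
  library(readr)

  # Write a small, hypothetical CSV with a comment line and NA markers
  path <- tempfile(fileext = ".csv")
  writeLines(c(
    "# hypothetical export with a comment line on top",
    "Product,Price,Qty",
    "Laptop,1200,2",
    "Mouse,N/A,10",
    "Keyboard,75,."
  ), path)

  sales <- read_csv(
    path,
    comment = "#",          # ignore the comment line
    na = c(".", "N/A"),     # treat these markers as missing values
    col_types = cols(
      Product = col_character(),
      Price = col_double(),
      Qty = col_integer()
    )
  )
  print(sales)
}
```

Note that supplying na replaces readr's default missing-value markers, so include "" and "NA" in the vector if the file also uses them.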
Exporting / Saving Data
To write a data frame back to disk, use write_csv():
# write_csv(combined_cols, "output_data.csv")
Native R Binary Files: RDS
If you want to save a complex R object (like a list, factor, or model) while preserving its exact classes (e.g. maintaining factor levels), write it to an RDS binary file using write_rds() and read it with read_rds():
# Save to RDS
# write_rds(combined_cols, "my_dataset.rds")
# Load from RDS
# my_data <- read_rds("my_dataset.rds")
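The difference between RDS and CSV shows up when a column carries extra information such as factor levels. The sketch below uses base R's saveRDS() and readRDS(), which readr's write_rds() and read_rds() wrap, so that it runs without extra packages. The sizes table and its values are hypothetical.

```r
# A factor with a deliberate, non-alphabetical level order
sizes <- data.frame(
  Item = c("A", "B", "C"),
  Size = factor(c("small", "large", "medium"),
                levels = c("small", "medium", "large"))
)

rds_path <- tempfile(fileext = ".rds")
csv_path <- tempfile(fileext = ".csv")

# Save the same table in both formats
saveRDS(sizes, rds_path)
write.csv(sizes, csv_path, row.names = FALSE)

from_rds <- readRDS(rds_path)
from_csv <- read.csv(csv_path)

print(levels(from_rds$Size))  # level order survives the round trip
print(class(from_csv$Size))   # CSV stores plain text, so the level order is lost
```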
Hands-on Exercises
Exercise 1: Building a Sales Ledger
Create a data frame representing sales transactions with three columns: Product (values "Laptop", "Mouse", "Keyboard"), Price (values 1200, 25, 75), and Quantity (values 2, 10, 5).
Write R code to:
- Create and print the data frame.
- Calculate the average Price using the mean() function on the Price column.
- Retrieve and print the dimensions of the data frame.
Answer:
ledger <- data.frame(
  Product = c("Laptop", "Mouse", "Keyboard"),
  Price = c(1200, 25, 75),
  Quantity = c(2, 10, 5)
)
print(ledger)
avg_price <- mean(ledger$Price)
print(paste0("Average Price: $", round(avg_price, 2)))
print(dim(ledger)) # 3 rows, 3 columns
Exercise 2: Filtering High-Value Sales
Using the ledger data frame from Exercise 1:
Write R code to:
- Re-create the ledger data frame.
- Filter and print only the rows where Price * Quantity (total value) is greater than $200.
Answer:
ledger <- data.frame(
  Product = c("Laptop", "Mouse", "Keyboard"),
  Price = c(1200, 25, 75),
  Quantity = c(2, 10, 5)
)
# Calculate total value and filter
high_value_sales <- ledger[(ledger$Price * ledger$Quantity) > 200, ]
print(high_value_sales)
# All three rows print: Laptop (2400), Mouse (250), and Keyboard (375) each exceed 200