Interacting with SQL in R
Why Learn Database Queries inside R?
In professional enterprise environments, datasets are almost never stored in CSV flat files on your local machine. They are stored in secure, central relational databases.
If you have a database of 500 million records, downloading the entire table to R's memory will freeze your R session or crash your computer.
Instead, you want to:
- Connect to the database from R.
- Query the data using SQL (Structured Query Language) or let R translate your tidyverse code into SQL.
- Bring only the filtered, aggregated query result back to R for plotting.
In this chapter, we will learn how to interface with SQLite databases using the DBI package and harness the dbplyr translation engine to convert standard dplyr verbs to SQL queries automatically.
1. Connecting to SQL Databases: DBI
R connects to databases using driver packages (such as RSQLite for SQLite files, or RPostgres for PostgreSQL servers) combined with the unified interface DBI.
library(DBI)
# Connect to an in-memory SQLite database
con <- dbConnect(RSQLite::SQLite(), ":memory:")
# Write R's mpg table into a SQL table named "cars"
dbWriteTable(con, "cars", mpg)
# List all tables in the database
dbListTables(con) # "cars"
2. Executing Raw SQL Queries: dbGetQuery()
To run raw SQL SELECT queries and return the results as standard R data frames:
# Query the database
result <- dbGetQuery(con, "
SELECT manufacturer, model, hwy
FROM cars
WHERE hwy > 30
LIMIT 5
")
print(result)
Closing Connections
To prevent memory leaks and database locking, always disconnect from your database when done:
# Close the database connection
dbDisconnect(con)
3. Database Translation: dbplyr
Writing complex raw SQL strings inside R scripts can be tedious. The dbplyr package provides a translation engine that lets you write standard dplyr code (like filter, select, group_by), translates it to SQL, and executes it directly on the database!
Lazy Evaluation: tbl(), show_query(), & collect()
tbl(con, "table"): Creates a virtual table reference. R does not download the data yet.show_query(): Displays the SQL query that R generated.collect(): Executes the query on the database, downloads the results, and saves them as a standard local tibble in your R environment.
# 1. Re-establish connection and table
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "cars", mpg)
# Create virtual reference
db_cars <- tbl(con, "cars")
# 2. Write standard dplyr code on the database table
db_pipeline <- db_cars |>
select(manufacturer, model, hwy) |>
filter(hwy > 30)
# 3. View the translated SQL query
show_query(db_pipeline)
# Output prints: SELECT `manufacturer`, `model`, `hwy` FROM `cars` WHERE (`hwy` > 30.0)
# 4. Fetch the final results to your local R environment
local_result <- collect(db_pipeline)
print(local_result)
dbDisconnect(con)
Hands-on Exercises
Exercise 1: Querying Cylinders
Create an in-memory database, insert the mpg table, and write a raw SQL query.
Write R code to:
- Connect to an in-memory SQLite database named
my_dbusingdbConnect(). - Write R's
mpgtable to a SQL database table named"vehicle_log"usingdbWriteTable(). - Query the database using
dbGetQuery()to retrieve themanufacturer,model, andcylcolumn values for all vehicles with exactly 5 cylinders (cyl = 5). - Disconnect from the database.
# Write your code below and click Run Code
Click to view Answer
library(DBI)
# Connect
my_db <- dbConnect(RSQLite::SQLite(), ":memory:")
# Write table
dbWriteTable(my_db, "vehicle_log", mpg)
# Query
five_cyl_vehicles <- dbGetQuery(my_db, "
SELECT manufacturer, model, cyl
FROM vehicle_log
WHERE cyl = 5
")
print(five_cyl_vehicles)
# Disconnect
dbDisconnect(my_db)
Exercise 2: Tidyverse to SQL Translation
Connect to an in-memory database and use dbplyr to build and inspect a query.
Write R code using pipes to:
- Connect to an in-memory SQLite database named
temp_db. - Write R's
mpgtable to the database under the table name"cars_db". - Create a table reference using
tbl(temp_db, "cars_db")and pipe it. - Filter for cars of class
"suv". - Group by
manufacturerand calculate the average highway mileagemean_hwy = mean(hwy). - Use
show_query()to inspect the translated SQL code. - Use
collect()to run the query and download the final table to R. - Disconnect from
temp_db.
# Write your code below and click Run Code
Click to view Answer
library(DBI)
library(dplyr)
# Connect & write
temp_db <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(temp_db, "cars_db", mpg)
# Build pipeline on virtual table
db_ref <- tbl(temp_db, "cars_db")
query_pipeline <- db_ref |>
filter(class == "suv") |>
group_by(manufacturer) |>
summarize(mean_hwy = mean(hwy))
# Show query
show_query(query_pipeline)
# Fetch data
final_table <- collect(query_pipeline)
print(final_table)
# Close connection
dbDisconnect(temp_db)