Note: The contents of this post are adapted from a half-day presentation I gave at an R workshop targeted at graduate students. Around the time that the workshop took place, targets superseded drake. The process explained in the post below is geared mostly towards introducing graduate students to the concept of workflow management using drake and does not address targets. It’s roughly formatted for workshops styled after the Carpentries.
The dataset for this walkthrough comes from Fanaee-T and Gama (2013). More info here.
Goals:

- Learn what drake is and what it requires of us
- Build a drake workflow for analyzing our dataset
- See how drake responds to changes in data/scripts

Stretch goals:

Why drake?

“Ideally, the reproduction of your results is a one-button operation. And this is valuable not just for others, but also for yourself (or your future self). For example, if the primary data should change (and it often does), wouldn’t it be nice to have one command that re-runs everything?”
Workflow management software can coordinate the steps of your workflow (raw files > analyses > outputs). It’s more reliable than memorizing the order scripts are supposed to run in, or hoping humans follow your README file correctly. Some can coordinate multiple programs and file formats.
Workflows can get complicated! So it helps to have tools to wrangle them and to make them understandable by a broad audience.
What is drake?

drake can help us build, visualize, and manage workflows from raw files to outputs.

Why use drake?

There are a variety of different scripting philosophies out there! Some of the common ones I run into are shown here:
- Loose scripts: keeping a bunch of loosely related scripts together in a folder without much structure
- Simple numbered scripts: a simple project might allow you to have a few numbered scripts that are clearly related
- Complex numbered scripts: when projects become complex, with parallel steps, the numbering process can get out of hand or become impossible to maintain logically
These organization schemes may make sense for you or for specific projects. But there is also a place for something more formalized and centralized like drake!
What does drake require of us?

Our ingredients include:

- make.R
- packages.R
- functions.R
- plan.R
Plans are made up of “targets”. Targets are analysis steps.
A target is:
Note: If you have a big dataset, you are better off with one or two steps that require the full dataset, instead of many steps that require a lot of memory. (i.e., step 1: clean, step 2: model)
These guidelines were taken from the drake manual.
drake depends on functions

Why is that?

Functions save code for easy re-use later on and give a name to that code (e.g., mean()) that you can reference in the future. Functions also have inputs and outputs.
Anatomy of a function:
Sources: Wickham & Grolemund (2017), Wright & Zimmerman (2016)
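To make those labels concrete, here is a minimal made-up function (add_two is purely an illustration, not part of our workflow):

```r
# "add_two" is the function's name; "x" is its argument (the input)
add_two <- function(x) {
  result <- x + 2   # the body: the code that runs when the function is called
  return(result)    # the return value: the function's output
}

add_two(5)
## [1] 7
```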
library(tidyverse)
library(drake)
# Two years of daily data from the Capital Bikeshare system in D.C. paired with weather data.
bike_data <- read.csv(file = "daily_bike_data.csv", stringsAsFactors = FALSE)
We’re going to make an example function to illustrate what they do.
Suppose that you wanted to create summary stats for a column in your dataset, including the minimum, maximum, and median values for bike use by registered riders. You might start out like this:
col_min <- min(bike_data[, "registered"], na.rm = TRUE)
col_min
## [1] 20
col_max <- max(bike_data[, "registered"], na.rm = TRUE)
col_max
## [1] 6946
col_med <- median(bike_data[, "registered"], na.rm = TRUE)
col_med
## [1] 3662
We use the functions min(), max(), and median() to generate the individual summaries.
We now know that the registered column has a min of 20, a max of 6946, and a median of 3662. But if we wanted to determine this for multiple columns, we’d pretty soon be copying and pasting the same three lines of code over and over again (with 16 columns, that’s 16 × 3 = 48 lines of nearly identical code…).
We can combine these into a function that can be used with one line of code on any column using the following ingredients:
function() {
  # Contents
  return()
}
# A function that returns the min, max, and median of any column
summary_fun <- function(data, col_name){
  col_min <- min(data[, col_name], na.rm = TRUE)
  col_max <- max(data[, col_name], na.rm = TRUE)
  col_med <- median(data[, col_name], na.rm = TRUE)
  # Return all three values to the function user
  return(c(col_min, col_max, col_med))
}
summary_fun(data = bike_data, col_name = "registered")
## [1] 20 6946 3662
summary_fun(data = bike_data, col_name = "casual")
## [1] 2 3410 713
Now we have a function that greatly reduces the amount of code we have to write in order to get multiple summary values for any column.
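Once a task lives in a function, it is also easy to repeat it across columns. As a sketch, sapply() can call summary_fun() once per column name; a tiny made-up data frame stands in for bike_data here so the example is self-contained:

```r
summary_fun <- function(data, col_name){
  col_min <- min(data[, col_name], na.rm = TRUE)
  col_max <- max(data[, col_name], na.rm = TRUE)
  col_med <- median(data[, col_name], na.rm = TRUE)
  return(c(col_min, col_max, col_med))
}

# Made-up stand-in for bike_data
toy_data <- data.frame(registered = c(20, 3662, 6946),
                       casual     = c(2, 713, 3410))

# One call per column name; results are collected into a matrix
sapply(c("registered", "casual"), function(col) summary_fun(toy_data, col))
```

With the real dataset you would pass bike_data in place of toy_data.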
Take 5 minutes: Write a function named prop_reg that will return the proportion of ride counts that are by registered users for every row [i.e., date] in the dataset. Hint: proportion = registered / (registered + casual).
Answer:
# Two main options:

# Option 1
prop_reg <- function(data) {
  prop <- data$registered / (data$registered + data$casual)
  return(prop)
}

prop_reg(data = bike_data)

# Option 2
prop_reg_2 <- function(data, col_one, col_two) {
  prop <- data[, col_one] / (data[, col_one] + data[, col_two])
  return(prop)
}

prop_reg_2(data = bike_data, col_one = "registered", col_two = "casual")

# What would happen if you used sum() in this?
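To see the answer to that last comment for yourself, compare element-wise division against sum() on a tiny made-up data frame:

```r
toy_data <- data.frame(registered = c(90, 10),
                       casual     = c(10, 90))

# Element-wise division (what prop_reg() does): one proportion per row
toy_data$registered / (toy_data$registered + toy_data$casual)
## [1] 0.9 0.1

# With sum(): the columns collapse to single totals, so you get one
# dataset-wide proportion instead of one per date
sum(toy_data$registered) / (sum(toy_data$registered) + sum(toy_data$casual))
## [1] 0.5
```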
You can access the dataset and my folder/file structure template for the upcoming drake walkthrough by cloning this GitHub repo.
Building a drake workflow:

Now we’ve gotten a sense of how functions work and what the purpose of drake is. Let’s put our knowledge of these two pieces together and build an example workflow with drake.

We’ll be putting together a short but “full” analysis of our dataset, from cleaning through visualization and modeling, with drake.
Let’s say we’re setting out to investigate the effects of temperature and season on bike use in D.C.
Start by restarting R, closing any scripts, etc.
First, let’s review our file structure. Open up the example_drake_project folder you downloaded.
This is the file structure you should have:
## -- data
## |__daily_bike_data.csv
## -- documents
## -- figures
## -- make.R
## -- R
## |__functions.R
## |__packages.R
## |__plan.R
Quick explanation of the folders:
An explanation of the scripts:
- make.R: This will set everything up and run our project
- R/packages.R: This loads all of our packages
- R/functions.R: This stores all of the function definitions we will custom-write for our workflow
- R/plan.R: This is where we call all of our functions and define how our workflow is constructed

packages.R
We’ll ease into this slowly. First we start by filling out our packages script. The purpose of this script is to document and load the packages needed for the rest of the project.
# Packages needed for this project
library(drake)
library(tidyverse)
library(broom)
Now, run this script so that these packages are available to us as we write the rest of our workflow.
And that’s it! We list the package library() calls and then we can move on. This script will be referenced by make.R later on, and the packages will be loaded automatically.
Next let’s get into the meat of our analysis by writing the functions.R script. This is going to be our longest script, but we’ll work through it at a reasonable pace.

As a first step, we will want to set our working directory to example_drake_project (the main folder you downloaded).
Next, we should read in our dataset so that we can test out our code as we write it.
Just run this in your Console:
bike_data <- read.csv(file = "data/daily_bike_data.csv", stringsAsFactors = FALSE)
The plan is to make separate functions that perform each of the steps that we want to accomplish in the workflow. We can start by first naming the steps/functions:

- subset_data
- create_model_plot
- run_model
Each of these names will turn into a function that we’ll use as a target in our workflow plan. The first will subset the dataset to include only the data we want. The second will create a plot of our data. And the third will run a basic model of our data.
Let’s put together a definition for subset_data()! What I want to do is limit the dates we use to non-holidays and rename the cnt column for clarity.
Here’s how I’d do this normally:
bike_data %>%
  filter(holiday == "not holiday") %>%
  rename(total_riders = cnt)
For use with drake, things are mostly the same under the hood, but we need to wrap it up in a function.
# The shell of the function:
subset_data <- function(data){

}

# Change "bike_data" to "data" to work with the function argument
data %>%
  filter(holiday == "not holiday") %>%
  rename(total_riders = cnt)

# Then combine them
subset_data <- function(data){
  data %>%
    filter(holiday == "not holiday") %>%
    rename(total_riders = cnt)
}
A quick point: We replace bike_data with data because data is the name of our function’s argument. We want to be able to feed any dataset we want into the function.
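If the dplyr verbs are unfamiliar, here is the same subset-and-rename written in base R for comparison (the toy data frame is made up for illustration):

```r
# Made-up miniature version of the bike data
toy <- data.frame(holiday = c("not holiday", "holiday", "not holiday"),
                  cnt     = c(100, 50, 200))

subset_data_base <- function(data) {
  out <- data[data$holiday == "not holiday", ]       # like filter()
  names(out)[names(out) == "cnt"] <- "total_riders"  # like rename()
  out
}

subset_data_base(toy)
```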
Challenge question! In a small group, take 15 minutes to create the next function in our workflow: create_model_plot(). The function should create a plot with:

- temp on the x-axis and total_riders on the y-axis
- A scatterplot layer and a linear trendline
- Separate facets for each season

The function should also use ggsave() to export the plot. Provide an argument called out_path to use for filename = in ggsave().
An example answer:
create_model_plot <- function(data, out_path){
  bike_model_plot <- ggplot(data = data,
                            aes(x = temp, y = total_riders)) +
    geom_point(alpha = 0.3) +
    geom_smooth(method = "lm") +
    facet_wrap(. ~ season)

  ggsave(filename = out_path,
         plot = bike_model_plot,
         width = 8, height = 8, units = "in")
}
OK! Now we have our second function built. Let’s take care of the last one.
It will be a linear model of this format: total_riders ~ temperature * season
# Build an lm using "data"
lm(formula = as.formula(paste(resp_var, "~", exp_var1, "*", exp_var2)),
   data = data)

# This is the shell of the function
run_model <- function(data, resp_var, exp_var1, exp_var2) {

}

# Now combine them
run_model <- function(data, resp_var, exp_var1, exp_var2) {
  lm(formula = as.formula(paste(resp_var, "~", exp_var1, "*", exp_var2)),
     data = data)
}
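A quick look at what the paste()/as.formula() pair inside run_model() is doing:

```r
# paste() glues the variable names into a single string...
paste("total_riders", "~", "temp", "*", "season")
## [1] "total_riders ~ temp * season"

# ...and as.formula() converts that string into a formula object that lm() accepts
f <- as.formula(paste("total_riders", "~", "temp", "*", "season"))
class(f)
## [1] "formula"
```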
And that’s our full functions.R script!

Next up, we will build our drake “plan.”

The plan tells drake:
Let’s outline the functions in the order we want them to run:
plan <- drake_plan(
  # Read in our data
  read.csv(),
  # Subset the data
  subset_data(),
  # Plot data relationship
  create_model_plot(),
  # Make model
  run_model(),
  # Export model summaries
  write.csv(),
  write.csv()
)
Now, we need to add in the details here. We’ll first start by giving the function outputs names and then connecting their outputs to one another.
plan <- drake_plan(
  bike_data = read.csv(file = file_in("data/daily_bike_data.csv"),
                       stringsAsFactors = FALSE),
  bike_subset = subset_data(data = bike_data),
  model_plot = create_model_plot(data = bike_subset,
                                 out_path = file_out("figures/bike_model_plot.png")),
  bike_model = run_model(data = bike_subset,
                         resp_var = "total_riders",
                         exp_var1 = "temp",
                         exp_var2 = "season"),
  # NOTE: We write these verbatim here instead of in functions.R because they're small:
  write_coef_table = write.csv(x = tidy(bike_model),
                               file = file_out("documents/coef_table.csv")),
  write_sum_table = write.csv(x = glance(bike_model),
                              file = file_out("documents/sum_table.csv"))
)
See, for example, how the bike_subset target requires the output of bike_data? This shows drake where the connections between the steps of the workflow exist. We made sure while writing our functions script that the outputs from each of the targets are usable by whichever target needs them later on in the process.
Part of this is that we use file_in() and file_out() to tell drake that the files we are reading in or exporting (like our ggplot figure) are dependencies or objects that need to be tracked. These functions are not allowed inside of our functions.R script, though, so we put them here in plan.R.
Here we are! This is our last step on the way to a completed project: filling out our make.R script.

What does this script do? It basically sets the stage for our plan, then runs the entire workflow and reports back to us.

We want to create a make.R script that will:

- Source the scripts we’ve already written (packages, functions, and the plan)
- Use the make() function to actually build our project/targets

We’ll start by reading in the previous scripts we’ve written. We will use the source() function, which reads R code from files outside of our current script.
# Load packages
source("R/packages.R")
# Load functions
source("R/functions.R")
# Load plan
source("R/plan.R")
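If source() is new to you, here is a self-contained illustration using a throwaway temp file:

```r
# Write a one-line script to a temporary file...
tmp <- tempfile(fileext = ".R")
writeLines("answer <- 40 + 2", tmp)

# ...then source() runs that file's code as if it were typed here
source(tmp)
answer
## [1] 42
```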
Next, let’s write our make() call:
make(
  # Run the plan created in plan.R
  plan,
  # Show target-by-target messages
  verbose = 1
)
Note that because plan.R has been read using source(), the plan object already exists in our environment by the time R gets to this point in the current script.
To finish up, let’s add vis_drake_graph() before and after our make() call. This will show us which targets are outdated (i.e., have been edited), and then confirm that things look right after the plan is built/rebuilt.
# Load packages
source("R/packages.R")
# Load functions
source("R/functions.R")
# Load plan
source("R/plan.R")
vis_drake_graph(plan)
make(
  # Run the plan created in plan.R
  plan,
  # Show target-by-target messages
  verbose = 1
)
vis_drake_graph(plan)
What questions are there about this script, or any of the other scripts in this process?
If you run the script, R will produce a dependency graph to visualize your workflow. Before the make() call, your plan will look like this (if it’s the first time you’ve run the script):

After make() runs, it should look like this:
Lastly, if we want to pull out some of the objects that our workflow created, we can do so with readd() or loadd().
readd() will print/pull the contents of an object from your plan out of the drake cache.
readd(bike_model)
loadd() will actually load it into your R environment as an object.
loadd(bike_model)
This way you can work with the objects directly if you’d like.
Changing a drake workflow:

Congratulations! We’ve successfully put together a full workflow and seen that it will build.

Check out the dependency graph using vis_drake_graph(plan).
Now, let’s say that after some consideration, we’ve decided to change how we subset the data initially.
Let’s modify our subset_data function from filter(holiday == "not holiday") to filter(holiday == "holiday").
Now that we’ve made that change, go ahead and source the functions.R script again. Then run the following:
# Check for outdated components
outdated(plan)
# See how the outdated pieces are related to one another
vis_drake_graph(plan)