Note: The contents of this post are adapted from a half-day presentation I gave at an R workshop targeted at graduate students. Around the time that the workshop took place, targets superseded drake. The process explained in the post below is geared mostly towards introducing graduate students to the concept of workflow management using drake and does not address targets. It’s roughly formatted for workshops styled after the Carpentries.

The dataset for this walkthrough comes from Fanaee-T and Gama (2013). More info here.

Introduction to workflow management with drake

Topics and goals for today

  1. What is workflow management?
  2. What is drake and what does it require of us?
  3. Brief intro to functions
  4. Build a simple drake workflow for analyzing our dataset
  5. Discuss how drake responds to changes in data/scripts

Stretch goals:

  6. How might you do more with drake?


Workflow management and automation

“Ideally, the reproduction of your results is a one-button operation. And this is valuable not just for others, but also for yourself (or your future self). For example, if the primary data should change (and it often does), wouldn’t it be nice to have one command that re-runs everything?”

Workflow management software can coordinate the steps of your workflow (raw files > analyses > outputs). It’s more reliable than memorizing the order scripts are supposed to run in, or hoping humans follow your README file correctly. Some can coordinate multiple programs and file formats.

from Guo, 2012 (p.5)

Workflows can get complicated! So it helps to have tools to wrangle them and to make them understandable by a broad audience.


What is drake?

  • Package for R that analyzes and manages your workflow
  • Re-runs outdated analysis steps only when needed
  • Part of rOpenSci


drake can help us build, visualize, and manage workflows from raw files to outputs

image by Will Landau, https://github.com/ropensci/drake


A good source for info, examples, etc. for drake is the User Manual:

drake User Manual: https://books.ropensci.org/drake/


What are the benefits of drake?

  • Reproducibility
  • Allows you to automate your workflow
  • Don’t need to press run on ten scripts. Just one
  • Don’t need to keep track of which scripts to re-run, or in which order
  • Allows you to thoroughly document the plan of your analysis
  • Makes it easier to keep track of complex projects
  • Can integrate with high performance computers
  • It’s R-specific


Scripting philosophies

There are a variety of different scripting philosophies out there! Some of the common ones I run into are shown here:

Loose scripts: Keeping a bunch of loosely related scripts together in a folder without much structure
Simple numbered scripts: A simple project might allow you to have a few numbered scripts that are clearly related
Complex numbered scripts: When projects become complex, with parallel steps, the numbering process can get out of hand or become impossible to maintain logically

These organization schemes may make sense for you, specific projects, etc. But there is also a place for something more formalized and centralized like drake!


What do you need to get started with drake?

Our ingredients include:

  1. make.R
  2. packages.R
  3. functions.R
  4. plan.R

This is your directory structure on drake


First, let’s talk about “plans”

Plans are made up of “targets”. Targets are analysis steps.

The skeleton of a plan

An example plan with some detail filled in
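In code, the skeleton looks roughly like this (a hedged sketch only: my_data.csv, clean_fun(), and model_fun() are hypothetical placeholders; we will write the real plan for our dataset in step 4 below):

# A minimal sketch of a plan: each target is "name = command"
plan <- drake_plan(
  raw_data   = read.csv(file_in("data/my_data.csv")),  # hypothetical input file
  clean_data = clean_fun(raw_data),                     # clean_fun() is a placeholder
  model      = model_fun(clean_data)                    # model_fun() is a placeholder
)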


How to craft a target?

A target is:

  • A meaningful step in your workflow
  • A large enough process to take time to run
  • Small enough that it can often be skipped when it’s up to date
  • Compatible with saveRDS()

Note: If you have a big dataset, you are better off with one or two steps that require the full dataset, instead of many steps that require a lot of memory. (i.e., step 1: clean, step 2: model)

These guidelines were taken from the drake manual.


drake depends on functions

Why is that?

  • Targets (steps in the plan) are defined by functions
  • Functions keep the plan simple, easy to read
  • Functions are stored in the functions.R script


You might not be familiar with functions yet, so let’s talk more about them.

Functions save code for easy re-use later on, and give a name to that code (e.g., mean()) that you can reference in the future. Functions also have inputs and outputs.

Anatomy of a function:

  • Name
  • Arguments (inputs)
  • Body
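For example, a tiny function with each of those parts labeled:

# "add_two" is the name; "x" is the argument; the indented code is the body
add_two <- function(x) {
  x + 2
}

add_two(3)
## [1] 5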

Sources: Wickham & Grolemund (2017), Wright & Zimmerman (2016)


Time to code! Let’s practice working with functions briefly.

Artwork by @allison_horst



First, load the packages we’ll need

library(tidyverse)
library(drake)

Now load the dataset

# Two years of daily data from the Capital Bikeshare system in D.C. paired with weather data.
bike_data <- read.csv(file = "daily_bike_data.csv", stringsAsFactors = FALSE)

Intro to functions

We’re going to make an example function to illustrate what they do.

Suppose that you wanted to create summary stats for a column in your dataset, including the minimum, maximum, and median values for bike use by registered riders. You might start out like this:

col_min <- min(bike_data[, "registered"], na.rm = TRUE)
col_min
## [1] 20
col_max <- max(bike_data[, "registered"], na.rm = TRUE)
col_max
## [1] 6946
col_med <- median(bike_data[, "registered"], na.rm = TRUE)
col_med
## [1] 3662

We use the functions min(), max(), and median() to generate individual summaries.

We now know that the registered column has a min of 20, a max of 6946, and a median of 3662. But if we wanted to determine this for multiple columns, we’d pretty soon be copying and pasting the same three lines of code over and over again. (every column = 16 cols x 3 lines = 48 lines of code…)

We can combine these into a function that can be used with one line of code on any column using the following ingredients:

function() {
  
  # Contents
  
  return()
  
}
# A function that returns the min, max, and median of any column
summary_fun <- function(data, col_name){
  
  col_min <- min(data[, col_name], na.rm = TRUE)
  
  col_max <- max(data[, col_name], na.rm = TRUE)
  
  col_med <- median(data[, col_name], na.rm = TRUE)
  
  # Return all three values to the function user
  return(c(col_min, col_max, col_med))
  
}
summary_fun(data = bike_data, col_name = "registered")
## [1]   20 6946 3662
summary_fun(data = bike_data, col_name = "casual")
## [1]    2 3410  713

Now we have a function that greatly reduces the amount of code we have to write in order to get multiple summary values for any column.

Challenge question!

Take 5 minutes: Write a function named prop_reg that will return the proportion of ride counts that are by registered users for every row [i.e., date] in the dataset. Hint: proportion = registered / (registered + casual).


Answer:

# Two main options:

# Option 1
prop_reg <- function(data) {
  
  prop <- data$registered / (data$registered + data$casual)
  
  return(prop)
  
}

prop_reg(data = bike_data)

# Option 2
prop_reg_2 <- function(data, col_one, col_two) {
  
  prop <- data[, col_one] / (data[, col_one] + data[, col_two])
  
  return(prop)
  
}

prop_reg_2(data = bike_data, col_one = "registered", col_two = "casual")


# What would happen if you used sum() in this?
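One way to explore that question (a hedged sketch, not part of the original challenge answer): wrapping the columns in sum() collapses everything into a single overall proportion for the whole dataset rather than one proportion per row.

# Using sum() gives one dataset-wide proportion instead of a row-by-row vector
prop_reg_total <- function(data) {
  sum(data$registered) / (sum(data$registered) + sum(data$casual))
}

prop_reg_total(data = bike_data)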


Download the file structure

You can access the dataset and my folder/file structure template for the upcoming drake walkthrough by cloning this GitHub repo.



This is a good time to take a ten minute break. Get something to eat and come back!



Making our drake workflow:

Now we’ve gotten a sense for how functions work and what the purpose of drake is. Let’s put our knowledge of these two pieces together and build an example workflow with drake.

We’ll be putting together a short, but “full” analysis of our dataset from cleaning through visualization and modeling with drake.

Let’s say we’re setting out to investigate the effects of temperature and season on bike use in D.C.


Start by restarting R, closing any scripts, etc.


1. File structure

First, let’s review our file structure. Open up the example_drake_project folder you downloaded.

This is the file structure you should have:

## -- data
##    |__daily_bike_data.csv
## -- documents
## -- figures
## -- make.R
## -- R
##    |__functions.R
##    |__packages.R
##    |__plan.R

Quick explanation of the folders:

  • data: Where our dataset will be stored
  • documents: Where some exported files will go
  • figures: Where we will export figures
  • R: The home of most of the scripts we write

An explanation of the scripts:

  • make.R: This will set everything up and run our project
  • R/packages.R: This loads all of our packages
  • R/functions.R: This stores all of the function definitions we will custom write for our workflow
  • R/plan.R: This is where we call all of our functions, and define how our workflow is constructed


2. Let’s write our first script: packages.R

We’ll ease into this slowly. First we start by filling out our packages script. The purpose of this script is to document and load the packages needed for the rest of the project.

# Packages needed for this project

library(drake)
library(tidyverse)
library(broom)

Now, run this script so that these packages are available to us as we write the rest of our workflow.

And that’s it! We list the package library() calls and then we can move on. This script will be referenced by make.R later on and the packages will be loaded automatically.


3. Building our functions

Next let’s get into the meat of our analysis by writing the functions.R script. This is going to be our longest script, but we’ll work through it at a reasonable pace.

As a first step, we will want to set our working directory to example_drake_project (the main folder you downloaded).
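One way to do this (the path below is just a placeholder; point it at wherever you saved the folder, or use RStudio’s Session > Set Working Directory menu):

# Set the working directory to the project folder (placeholder path)
setwd("~/example_drake_project")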

Next, we should read in our dataset so that we can test out our code as we write it.
Just run this in your Console:

bike_data <- read.csv(file = "data/daily_bike_data.csv", stringsAsFactors = FALSE)

The plan is to make separate functions that perform each of the steps that we want to accomplish in the workflow. We can start by first naming the steps/functions:

subset_data

create_model_plot

run_model

Each of these names will turn into a function that we’ll use as a target in our workflow plan. The first will subset the dataset to include only the data we want. The second will create a plot of our data. And the third will run a basic model of our data.

Let’s put together a definition for subset_data()! What I want to do is limit the dates we use to non-holidays and rename the cnt column for clarity.

Here’s how I’d do this normally:

bike_data %>%
  filter(holiday == "not holiday") %>%
  rename(total_riders = cnt)

For use with drake, things are mostly the same under the hood, but we need to wrap this code up in a function.

# The shell of the function:

subset_data <- function(data){
  
}


# Change "bike_data" to "data" to work with the function argument
bike_data %>%
  filter(holiday == "not holiday") %>%
  rename(total_riders = cnt)


# Then combine them
subset_data <- function(data){
  data %>%
    filter(holiday == "not holiday") %>%
    rename(total_riders = cnt)
}

A quick point: We replace bike_data with data because data is the name of our function’s argument. We want to be able to feed any dataset we like into the function.
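A quick, optional check in the Console confirms the function behaves as expected:

# Holidays should be dropped and cnt should now be named total_riders
bike_subset <- subset_data(data = bike_data)
head(bike_subset$total_riders)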


Challenge question!

In a small group, take 15 minutes to create the next function in our workflow: create_model_plot(). The function should create a plot with:

  1. A scatterplot (geom_point())
  2. A line through the points on the plot (geom_smooth())
  3. x axis = temperature, y axis = total riders, and season represented by facets, color, or another dimension of your choice.

The function should also use ggsave() to export the plot. Provide an argument called “out_path” to use for “filename =” in ggsave.


An example answer:

create_model_plot <- function(data, out_path){
  
  bike_model_plot <- ggplot(data = data,
                            aes(x = temp, y = total_riders)) +
    geom_point(alpha = 0.3) +
    geom_smooth(method = "lm") +
    facet_wrap(. ~ season)
  
  ggsave(filename = out_path,
         plot = bike_model_plot,
         width = 8, height = 8, units = "in")
  
}


OK! Now we have our second function built. Let’s take care of the last one.

It will be a linear model of this format: total_riders ~ temp * season

# Build an lm using "data"
lm(formula = as.formula(paste(resp_var, "~", exp_var1, "*", exp_var2)),
   data = data)


# This is the shell of the function
run_model <- function(data, resp_var, exp_var1, exp_var2) {
  
  
}


# Now combine them
run_model <- function(data, resp_var, exp_var1, exp_var2) {
  
  lm(formula = as.formula(paste(resp_var, "~", exp_var1, "*", exp_var2)),
     data = data)
  
}
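If you’d like a quick sanity check in the Console (assuming bike_data is still loaded from earlier), the model can be run on the subsetted data:

# The model needs the renamed total_riders column, so subset first
test_model <- run_model(data = subset_data(bike_data),
                        resp_var = "total_riders",
                        exp_var1 = "temp",
                        exp_var2 = "season")

summary(test_model)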

And that’s our full functions.R script!


4. Let’s make a plan

Next up, we will build our drake “plan.”

The plan tells drake:

  • What order to execute our functions (workflow) in
  • What inputs the functions need
  • How to store the output of the functions

Let’s outline the functions in the order we want them to run:

plan <- drake_plan(
  
  # Read in our data
  read.csv(),
  
  # Subset the data
  subset_data(),
  
  # Plot data relationship
  create_model_plot(),
  
  # Make model
  run_model(),
  
  # Export model summaries
  write.csv(),
  
  write.csv()
)

Now, we need to add in the details here. We’ll first start by giving the function outputs names and then connecting their outputs to one another.

plan <- drake_plan(
  
  bike_data = read.csv(file = file_in("data/daily_bike_data.csv"),
                       stringsAsFactors = FALSE),
  
  bike_subset = subset_data(data = bike_data),
  
  model_plot = create_model_plot(data = bike_subset,
                                 out_path = file_out("figures/bike_model_plot.png")),  
  
  bike_model = run_model(data = bike_subset,
                         resp_var = "total_riders",
                         exp_var1 = "temp",
                         exp_var2 = "season"),
  
  # NOTE: We write these verbatim here instead of in functions.R because they're small:
  write_coef_table = write.csv(x = tidy(bike_model),
                               file = file_out("documents/coef_table.csv")),
  
  write_sum_table = write.csv(x = glance(bike_model),
                              file = file_out("documents/sum_table.csv"))
)

See, for example, how the bike_subset target requires the output of bike_data? This shows drake where the connections between steps of the workflow exist. While writing our functions script, we made sure that the output from each target is usable by whichever target needs it later on in the process.

Part of this is that we use file_in() and file_out() to tell drake that the files we read in or export (like our ggplot figure) are dependencies that need to be tracked. drake only detects these functions when they are written directly in the plan’s commands, not inside the function bodies in functions.R, so we put them here in plan.R.


5. “Make” the project

Here we are! This is our last step on the way to a completed project: filling out our make.R script.

What’s this script do? It basically sets the stage for our plan, and then runs the entire workflow and reports back to us.

We want to create a make.R script that will:

  1. Read the other scripts that we have made
  2. Run the make function to actually build our project/targets
  3. Visualize the dependencies of our workflow (i.e., which targets rely on which other ones)

We’ll start by reading the previous scripts we’ve written. We will use the source() function, which reads R code from files outside of our current script.

# Load packages
source("R/packages.R")
# Load functions
source("R/functions.R")
# Load plan
source("R/plan.R")

Next, let’s write our make() call:

make(
  # Run the plan created in plan.R
  plan,
  # Show target-by-target messages
  verbose = 1
)

Note that because plan.R has been read using source(), the plan object exists in our environment already when R gets to this point in the current script.

To finish up, let’s add in vis_drake_graph() before and after our make() call. This will show us which targets are outdated (i.e., whose code or inputs have changed since the last build), and then confirm that things make sense after the plan is built/rebuilt.

# Load packages
source("R/packages.R")
# Load functions
source("R/functions.R")
# Load plan
source("R/plan.R")

vis_drake_graph(plan)

make(
  # Run the plan created in plan.R
  plan,
  # Show target-by-target messages
  verbose = 1
)

vis_drake_graph(plan)

What questions are there about this script, or any of the other scripts in this process?

If you run the script, R will produce a dependency graph to visualize your workflow. Before the make() call, your graph will look like this (if it’s the first time you’ve run the script):

After make() runs, it should look like this:

Lastly, if we want to pull out some of the objects that our workflow created, we can do so with readd() or loadd().

readd() will print/return the value of a target from the drake cache, without assigning it in your environment.

readd(bike_model)

loadd() will actually load it into your R environment as an object.

loadd(bike_model)

This way you can work with the objects directly if you’d like.
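For example, once the workflow has been built, the fitted model can be pulled from the cache and used like any other lm object:

# Load the fitted model from the drake cache into the environment
loadd(bike_model)

# Then work with it directly
summary(bike_model)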


Take a 5 minute break!



Updating our drake workflow:

Congratulations! We’ve successfully put together a full workflow and seen that it will build.

Check out the dependency graph using vis_drake_graph(plan).

Now, let’s say that after some consideration, we’ve decided to change how we subset the data initially.

Let’s modify our subset_data function from filter(holiday == "not holiday") to filter(holiday == "holiday").
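The modified definition in functions.R would look like this:

subset_data <- function(data){
  data %>%
    # Changed from "not holiday" to "holiday"
    filter(holiday == "holiday") %>%
    rename(total_riders = cnt)
}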

Now that we’ve made that change, go ahead and source the functions.R script again. Then run the following:

# Check for outdated components
outdated(plan)

# See how the outdated pieces are related to one another
vis_drake_graph(plan)

Now THIS is cool! We can see:

  1. That drake has detected that by changing just one line of code we have outdated eight parts of our workflow.
  2. How all eight of these pieces are related to one another.

Why is this important? Often with a complicated project, it’s easy to forget which of your scripts or functions depend on other tasks you’ve done. drake will make sure you don’t forget any steps.

Let’s re-make the plan:

make(
  # Run the plan created in plan.R
  plan,
  # Show target-by-target messages
  verbose = 1
)

Notice that we get some output as drake re-runs the parts that need to be re-run. It won’t re-run anything that doesn’t need updating.

What happens if we run make() again?

make(
  # Run the plan created in plan.R
  plan,
  # Show target-by-target messages
  verbose = 1
)

We get the message, v All targets are already up to date.

Recovering

Ah, beans. Maybe we decided that change wasn’t something that we want to do after all.

It was a small change, and we can quickly undo it. But what if we were working on something that took 20 minutes or more to run each time? It would be time-consuming to redo pieces of the workflow. drake has an option to recover old parts of your workflow instead of re-running them.

First, change the function back to its previous definition:

subset_data <- function(data){
  data %>%
    # Revert to "not holiday"
    filter(holiday == "not holiday") %>%
    rename(total_riders = cnt)
}

Now, add recover = TRUE to the make() call and run.

make(
  # Run the plan created in plan.R
  plan,
  # Show target-by-target messages
  verbose = 1,
  
  recover = TRUE
)

Instead of running the whole workflow all over again, drake detects whether it can simply restore old values that would be identical to the results of re-running your functions.

Note:

  • Recovering old values is still experimental, and if your software, etc. has updated between builds of your workflow, recovering might not be all that reproducible.

Miscellaneous

The clean() function will force your targets to be considered out-of-date, invalidating them. You might do this, for example, if you receive a drake project from someone and want to check their results on your own computer. This will cause the entire process to be re-run from start to finish when you run make() next.

Steps:

  1. Restart R
  2. source() your drake scripts
  3. Run clean()
  4. Check vis_drake_graph() to confirm things are outdated.

You can use which_clean() to see which targets will be invalidated before running clean(). Be sure that you actually want to do this!
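Put together, that check-then-clean sequence might look like this:

# Preview which targets clean() would invalidate
which_clean()

# Invalidate everything, then confirm in the graph that targets are outdated
clean()
vis_drake_graph(plan)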


References

  • Fanaee-T, H., & Gama, J. (2013). Event labeling combining ensemble detectors and background knowledge. Progress in Artificial Intelligence, 1–15. Springer Berlin Heidelberg. Web Link.
  • Guo, P. (2012). Software tools to facilitate research programming (PhD thesis). Stanford University, Computer Science Department. https://purl.stanford.edu/mb510fs4943
  • Wickham, H., & Grolemund, G. (2017). R for data science: Import, tidy, transform, visualize, and model data. O’Reilly Media, Inc.
  • Landau, W., Müller, K., Axthelm, A., Clarkberg, J., Walthert, L., Hughes, E., & Strasiotto, M. (n.d.). The drake R package user manual. Retrieved November 6, 2020, from https://books.ropensci.org/drake/
  • Wright, T., & Zimmerman, N. (Eds.). (2016). Software Carpentry: R for reproducible scientific analysis. Version 2016.06, June 2016. https://github.com/swcarpentry/r-novice-gapminder, doi:10.5281/zenodo.57520