A quick-start guide to the `caret` R package

Last updated on Apr 19, 2022

Background

The caret R package has been a staple of machine learning (ML) methods in R for a long time. The name caret stands for “Classification and Regression Training” according to the authors. It provides methods for common ML steps, such as pre-processing, training, tuning, and evaluating predictive models.

In addition to caret, there is also a group of packages referred to as tidymodels that is currently in development and which is also available for use. Both caret and tidymodels have Max Kuhn as a main author, but tidymodels aims to streamline ML for use with tidyverse packages.

I’ll be focusing on caret in this intro. We’ll be working with data from the palmerpenguins package, and using the caret, and tidyverse packages.

Working with data

Start by loading the necessary packages:

library(tidyverse)
library(palmerpenguins)
library(caret)

We’ll use the penguins dataset from palmerpenguins. It will be loaded automatically when you load the palmerpenguins package:

head(penguins)

species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year
Adelie	Torgersen	39.1	18.7	181	3750	male	2007
Adelie	Torgersen	39.5	17.4	186	3800	female	2007
Adelie	Torgersen	40.3	18.0	195	3250	female	2007
Adelie	Torgersen	NA	NA	NA	NA	NA	2007
Adelie	Torgersen	36.7	19.3	193	3450	female	2007
Adelie	Torgersen	39.3	20.6	190	3650	male	2007

This package contains LTER data for three penguin species on islands in Antarctica.

Reviewing the data

caret provides a function called featurePlot(), which is used to visualize datasets. It runs off of the lattice package, so if you are familiar with this method of plotting you might find it familiar. As someone who primarily uses ggplot2 I found this function a bit difficult to use, but you may find it helpful still. Max Kuhn’s The caret` Package Bookdown document provides some interesting examples of its functionality. Here’s a basic plot with a couple of our variables:

featurePlot(x = penguins[, c("bill_length_mm", "bill_depth_mm", "flipper_length_mm")], 
            y = penguins$body_mass_g, 
            plot = "scatter", 
            layout = c(3, 1))

In the plot above, the x-axis corresponds to each predictor (by panel) and the y-axis is body_mass_g.

Pre-processing

caret has built-in functionality to help with pre-processing your data as well. There are some more sophisticated options, but here we’ll just take a look at one. For example, in some modeling situations you’ll want to check for correlated predictors (multicollinearity). The caret function for this is findCorrelation().

First we’ll want to make a correlation matrix, which will be fed into findCorrelation().

penguin_cor <- penguins %>%
  filter(!if_any(everything(), is.na)) %>%
  # Numeric columns only
  select(where(is.numeric)) %>%
  cor()

Then findCorrelation() will check for correlations above a cutoff value that we provide:

findCorrelation(x = penguin_cor,
                # Use 0.7 correlation cutoff
                cutoff = 0.7,
                # Provide more details
                verbose = TRUE,
                # Return column names instead of indices
                names = TRUE)

## Compare row 3  and column  4 with corr  0.873 
##   Means:  0.564 vs 0.334 so flagging column 3 
## All correlations <= 0.7

## [1] "flipper_length_mm"

It notes that row 3 and column 4 of our correlation object have a correlation of 0.87:

##                   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## bill_length_mm         1.0000000    -0.2286256         0.6530956  0.58945111
## bill_depth_mm         -0.2286256     1.0000000        -0.5777917 -0.47201566
## flipper_length_mm      0.6530956    -0.5777917         1.0000000  0.87297890
## body_mass_g            0.5894511    -0.4720157         0.8729789  1.00000000
## year                   0.0326569    -0.0481816         0.1510679  0.02186213
##                          year
## bill_length_mm     0.03265690
## bill_depth_mm     -0.04818160
## flipper_length_mm  0.15106792
## body_mass_g        0.02186213
## year               1.00000000

This means flipper_length_mm (row 3) and body_mass_g (column 4) are correlated. That’s fine since body mass is our response variable.

Splitting the data

We’ll want to make a training dataset with which to build our model. caret provides some methods for doing this, including some for balanced splits for classification datasets, splitting using maximum dissimilarity, splitting for time series, and more.

We’ll proceed by modeling body_mass_g. First we’ll remove NAs, then do a basic split into a training dataset.

model_data <- penguins %>%
  filter(!if_any(everything(), is.na))

# 80% of the data for training - these are row indices
set.seed(500)

training_indices <- createDataPartition(y = model_data$body_mass_g, 
                                        p = .8,
                                        # Do not return a list
                                        list = FALSE)

# Subset the rows selected above
training_sample <- model_data[training_indices, ]

Modeling

With our data in hand we now can progress to training a model. We’ll use a random forest for this example. Note that there is a lot to explore in caret that I won’t go over here. I highly recommend reading The caret` Package.

First we can use the trainControl() function to customize the model training process:

train_params <- trainControl(
  # Repeated K-fold cross-validation
  method = "repeatedcv",
  # 10 folds
  number = 10,
  # 3 repeats
  repeats = 3
)

Now we can train the random forest with the parameters above:

set.seed(500)

rf_model <- train(
  # Modeling formula indicating to model body_mass_g using all other vars
  form = body_mass_g ~ .,
  # Our custom parameters
  trControl = train_params,
  # The dataset
  data = training_sample,
  # Use a random forest
  method = "rf"
)

Here are our results:

# Note that mtry = the number of randomly selected predictors caret uses at each split
rf_model

## Random Forest 
## 
## 268 samples
##   7 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 242, 241, 242, 241, 241, 242, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE     
##   2     291.4107  0.8760304  233.9042
##   5     291.0023  0.8739321  234.9932
##   9     295.2779  0.8700284  240.0389
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 5.

The final model can be accessed from the trained object. But note that you probably shouldn’t rely on this R^2^ as your output.

rf_model$finalModel

## 
## Call:
##  randomForest(x = x, y = y, mtry = min(param$mtry, ncol(x))) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 5
## 
##           Mean of squared residuals: 86096.17
##                     % Var explained: 86.69

Its mtry value was 5 based on RMSE, which we can view using ggplot():

ggplot(rf_model)

Now we want to know how the model performs on our testing data subset:

# Pull out the rows for testing and add predictions from the RF
test_sample <- model_data[-training_indices, ] %>%
  mutate(predicted = predict(object = rf_model, newdata = .))

# Get a new R2
postResample(pred = test_sample$predicted,
             obs = test_sample$body_mass_g)

##       RMSE   Rsquared        MAE 
## 310.377285   0.864062 241.696781

Plot the relationship

ggplot(data = test_sample) +
  # A 1:1 line
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "gray65") +
  geom_point(aes(x = body_mass_g, y = predicted, color = species)) +
  xlim(c(2500, 6500)) +
  ylim(c(2500, 6500)) +
  theme_bw()

We can look at a scaled metric for variable importance using the varImp() function:

plot(varImp(rf_model))

References

Teaching Live coding