library(tidymodels)
library(stacks)
── Attaching packages ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidymodels 1.3.0 ──
 broom        1.0.7      recipes      1.1.1
 dials        1.4.0      rsample      1.2.1
 dplyr        1.1.4      tibble       3.2.1
 ggplot2      3.5.1      tidyr        1.3.1
 infer        1.0.7      tune         1.3.0
 modeldata    1.4.0      workflows    1.2.0
 parsnip      1.3.0      workflowsets 1.1.0
 purrr        1.0.4      yardstick    1.3.2
── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidymodels_conflicts() ──
 purrr::%||%()    masks base::%||%()
 purrr::discard() masks scales::discard()
 dplyr::filter()  masks stats::filter()
 dplyr::lag()     masks stats::lag()
 recipes::step()  masks stats::step()

Ensemble Learning#

What is ensembling and why does it work?

As we toured the most popular algorithms for classification and regression, we compared the performance of each technique and commented on which seemed “better” for the problem at hand. In doing so, we somewhat forced ourselves into a false dichotomy.

The “no free lunch theorem” essentially requires us to try out a variety of different models for each problem. Each of these approaches the problem from a slightly different angle, so it’s likely that each has “learned” something that none of the other models has. Why not combine the predictions from all the models to squeeze out every aspect of the relationships that was learned?

Combining models’ predictions is the basis of ensembling.

Back when we discussed bagging and random forests, we saw that there was strength in numbers.

Note

The average of many guesses typically has less error than any one of those guesses in isolation. The decrease in error from averaging is largest when the guesses are independent, but there is still usually a decrease when the guesses are somewhat correlated.
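This note can be illustrated with a quick simulation. The sketch below is not from the chapter; the numbers (25 guessers, noise with standard deviation 2, a 0.7/0.3 split between shared and individual error) are made up purely for demonstration.

```r
# Illustrative simulation: the average of many guesses has less error
# than a single guess, and the benefit shrinks as guesses become
# correlated. All constants here are arbitrary choices for the demo.
set.seed(1)
truth <- 10
n_guessers <- 25

# Independent guessers: each guess has its own N(0, 2) error
independent <- replicate(1000, mean(truth + rnorm(n_guessers, sd = 2)))

# Correlated guessers: a shared error component makes them agree
correlated <- replicate(1000, {
  shared <- rnorm(1, sd = 2)
  mean(truth + 0.7 * shared + 0.3 * rnorm(n_guessers, sd = 2))
})

# One guesser, no averaging
single <- truth + rnorm(1000, sd = 2)

c(single      = sd(single),
  correlated  = sd(correlated),
  independent = sd(independent))
```

Running this shows the spread (error) shrinking as we move from a single guess, to an average of correlated guesses, to an average of independent guesses.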

Bagging usually increased the performance of a model because of this “strength in numbers” approach. Although the guesses of each model during bagging are not independent (the bootstrapped versions of the training sample that the models use contain some of the same individuals), they were independent enough that averaging their predictions decreased error.

Random forests further decreased the correlation between trees by making the rules of each tree randomly consider only a subset of available predictors. The net result was yet a further decrease in error.

Basics of ensembling#

The key to creating a strong ensemble is diversity. Combining predictions from models that approach the problem in very different ways (and thus whose predictions are not strongly correlated) will often give a stronger model than any of its individual components. Combining predictions from models that are very similar (e.g., a vanilla partition model and a random forest) can actually give a weaker model than its strongest component.

There are a few approaches for creating an ensemble:

  • Average together the predictions of many different models (or let each model “vote” as to the class of the individual).

  • Make a weighted sum of the predictions (since some models will be “better” than others for a particular problem).

  • Bayesian model combination (a “smart” way of combining models).

  • Blending - use the predictions output from models B, C, and D as predictors when building model A.

  • Stacking - treat the predictions of different models as features (\(x\)-variables) that are combined in yet another model (like a neural network) to predict \(y\).
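The first three approaches above can be sketched in a few lines of base R. The probabilities and weights below are hypothetical, chosen only to make the arithmetic visible.

```r
# Hypothetical predicted probabilities of class "good" for one
# individual, from three different models A, B, and C.
p_A <- 0.80; p_B <- 0.65; p_C <- 0.90

# 1. Simple average of the predictions
avg <- mean(c(p_A, p_B, p_C))

# 2. Weighted sum: weights reflect how much we trust each model
w <- c(0.5, 0.2, 0.3)
wsum <- sum(w * c(p_A, p_B, p_C))

# 3. Majority vote at a 0.5 threshold
votes <- c(p_A, p_B, p_C) > 0.5
vote_class <- ifelse(mean(votes) > 0.5, "good", "bad")

c(average = avg, weighted = wsum, vote = vote_class)
```

Blending and stacking differ from these in that the combination itself is learned by another model rather than fixed in advance.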

When you interview for an important job, you typically go through multiple rounds of interviews with different people. A single interviewer may not be able to assess or test the candidate for each skill or trait required for the job. However, when you combine the input from many different interviewers, each of whom probed the candidate in a different way, you have a better view of that candidate’s suitability for the job.

It’s the same with ensembling. Each model develops a different “understanding” of the relationships, so if they are combined intelligently, you get to select the best parts of each model.

Stacking#

The predictions from logistic regression, support vector machine, Naive Bayes, boosted tree, and nearest-neighbor models could be used as predictor variables (the \(x\)’s) in a neural network that is used to predict \(y\). Although each model was built using the same individuals, their predictions are not perfectly correlated because they approach the problem differently.

A word about stacking#

When you use the predictions of models as the \(x\)-variables for your final model, you need to be careful to use the right sets of data.

  • Do not fit each model to the entirety of the training data, save its predictions, then use those as the \(x\) variables to (re-)predict \(y\) with another model-building algorithm.

  • Doing so means you have used the training data twice for two different tasks. You’ve “double-dipped”. It’s gross and data miners do not do it.

  • In this case, I have described response leakage. By fitting a model onto the entirety of the training data and using its predictions as new variables, those variables directly encode information about the values of \(y\) on the training sample (i.e., it’s memorizing the values somewhat).

  • Predicting with variables that have “response leakage” is a surefire way to end up with an overfit model!

  • Rather, during \(K\)-fold cross-validation, the predictions on the “holdout” fold (which the models don’t see that round during training) are saved, and it is those predictions that are used as the \(x\) variables to (re-)predict \(y\).

In round 1, a model’s form is derived using the \(K-1\) training folds and predictions are made on the “pseudo-holdout” fold. These predictions get stored into the relevant positions in the “derived feature” (the \(x\) variable we will use in the stacking model). In round 2, a different fold is held out, and so on, until every row of the training data has an out-of-fold prediction.
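The round-by-round process above can be sketched in base R. This is not the chapter’s code; it uses a toy `lm()` fit as a stand-in for an arbitrary learner, with made-up simulated data, just to show how the derived feature gets filled in without response leakage.

```r
# Base-R sketch: build one "derived feature" from out-of-fold
# predictions during K-fold cross-validation. (Toy data; lm() stands
# in for whatever model-building algorithm is actually used.)
set.seed(2)
n <- 100
x <- rnorm(n)
y <- 2 * x + rnorm(n)

K <- 5
fold <- sample(rep(1:K, length.out = n))  # assign each row to a fold

derived <- rep(NA_real_, n)
for (k in 1:K) {
  in_train <- fold != k
  fit <- lm(y ~ x, data = data.frame(x = x[in_train], y = y[in_train]))
  # predict ONLY on the held-out fold this model never saw, and store
  # those predictions in the matching positions of the derived feature
  derived[fold == k] <- predict(fit,
                                newdata = data.frame(x = x[fold == k]))
}

# After K rounds, every position is filled exactly once, and no
# prediction was made by a model that trained on that row.
stopifnot(!anyNA(derived))
```

The `derived` vector is what the stacking model would then use as one of its \(x\)-variables.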

Ensembling in practice#

In theory, ensembling sounds great: combining the strong points of different models that approach the relationship in different ways sounds foolproof. However, it doesn’t always work in practice (or give a LARGE decrease in error). Why?

  • The predictions of the models in an ensemble tend to be highly correlated. In other words, there isn’t much variation among the predictions. Ensembles are better than their components when they combine weakly correlated models.

  • Ensembling works best when you have lots and lots of data.

  • The boost in performance you get from ensembling isn’t ever going to be HUGE. Ensembling is for squeezing the last bit of predictive performance out of your models. It’s how you win data mining competitions, but it’s rarely needed for everyday predictive analytics.

  • Save ensembling for when you already have a few good models for the process you’re studying and you want to improve it just a little bit more!

Example#

data(credit_data, package = 'modeldata')
set.seed(474)
tt_split <- initial_split(credit_data, prop = 0.8)
TRAIN <- training(tt_split)
HOLDOUT <- testing(tt_split)
REC <- recipe(Status ~ ., TRAIN) %>%
    step_normalize(all_numeric_predictors()) %>%
    step_unknown(all_nominal_predictors()) %>%
    step_dummy(all_nominal_predictors()) %>%
    step_nzv(all_predictors()) %>%
    step_corr(all_predictors()) %>%
    step_impute_mean(all_numeric_predictors()) %>%
    step_lincomb(all_predictors())
set.seed(474)
CV <- vfold_cv(TRAIN, v = 5)
GRID <- control_stack_grid()
MODEL_ALL <- list(
    RF = rand_forest( mode = 'classification', mtry = tune() )
    ,
    SVM = svm_rbf( mode = 'classification', cost = tune(), rbf_sigma = tune() )
    ,
    MLP = mlp( mode = 'classification', hidden_units = tune(), penalty = tune() )
)
MODEL_ALL
$RF
Random Forest Model Specification (classification)

Main Arguments:
  mtry = tune()

Computational engine: ranger 


$SVM
Radial Basis Function Support Vector Machine Model Specification (classification)

Main Arguments:
  cost = tune()
  rbf_sigma = tune()

Computational engine: kernlab 


$MLP
Single Layer Neural Network Model Specification (classification)

Main Arguments:
  hidden_units = tune()
  penalty = tune()

Computational engine: nnet 
tune_model <- function(model){
    WF <- workflow(REC, model)

    RES <- WF %>%
        tune_grid(
            resamples = CV,
            grid = 10,
            control = GRID
        )

    return(RES)
}

RES_ALL <- lapply(MODEL_ALL, tune_model)
i Creating pre-processing data to finalize unknown parameter: mtry
test_model <- function(model, res){
    cv <- show_best(res, metric = 'roc_auc', n=1) %>% select(mean, std_err)

    best <- res %>% select_best(metric = 'roc_auc')

    final <- workflow(REC, model) %>%
        finalize_workflow(best) %>%
        last_fit(tt_split) %>%
        collect_metrics() %>%
        filter(.metric=='roc_auc')

    cv$final <- final$.estimate

    cv <- unlist(cv)

    return(cv)
}

FINAL_ALL <- as.data.frame(mapply(test_model, MODEL_ALL, RES_ALL))
FINAL_ALL
A data.frame: 3 × 3
                 RF         SVM         MLP
              <dbl>       <dbl>       <dbl>
mean    0.834983666 0.824191914 0.833073900
std_err 0.005168098 0.007689326 0.006122087
final   0.819336266 0.824323975 0.826277656
STACK <- stacks() %>%
  add_candidates(RES_ALL$RF, name='RF') %>%
  add_candidates(RES_ALL$SVM, name='SVM') %>%
  add_candidates(RES_ALL$MLP, name='MLP') %>%
  blend_predictions(metric = metric_set(roc_auc)) %>%
  fit_members()
STACK_metrics <- STACK$metrics %>%
    filter(.metric=='roc_auc') %>%
    arrange(desc(mean))
STACK_metrics
A tibble: 6 × 8
  penalty mixture .metric .estimator      mean     n     std_err .config
    <dbl>   <dbl> <chr>   <chr>          <dbl> <int>       <dbl> <chr>
1   1e-01       1 roc_auc binary     0.8430544    25 0.002039685 Preprocessor1_Model6
2   1e-02       1 roc_auc binary     0.8429849    25 0.002161517 Preprocessor1_Model5
3   1e-03       1 roc_auc binary     0.8427538    25 0.002170250 Preprocessor1_Model4
4   1e-04       1 roc_auc binary     0.8427310    25 0.002171834 Preprocessor1_Model3
5   1e-05       1 roc_auc binary     0.8427179    25 0.002172569 Preprocessor1_Model2
6   1e-06       1 roc_auc binary     0.8427143    25 0.002172698 Preprocessor1_Model1
PRED <- predict(STACK, HOLDOUT, type = 'prob') %>%
    bind_cols(select(HOLDOUT, Status))
PRED
A tibble: 891 × 3
   .pred_bad .pred_good Status
       <dbl>      <dbl> <fct>
 1 0.2855829  0.7144171 good
 2 0.3337921  0.6662079 good
 3 0.6826082  0.3173918 bad
 4 0.1480877  0.8519123 good
 5 0.5084013  0.4915987 bad
 6 0.1532305  0.8467695 good
 7 0.1514254  0.8485746 good
 8 0.1530654  0.8469346 bad
 9 0.1440314  0.8559686 good
10 0.3750470  0.6249530 bad
# … with 881 more rows
STACK_final <- roc_auc(PRED, Status, .pred_bad)
STACK_final
A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary     0.8301591
STACK_cv <- STACK_metrics %>% head(1) %>% select(mean, std_err)
STACK_cv$final <- STACK_final$.estimate
FINAL_ALL$STACK <- unlist(STACK_cv)
FINAL_ALL
A data.frame: 3 × 4
                 RF         SVM         MLP       STACK
              <dbl>       <dbl>       <dbl>       <dbl>
mean    0.834983666 0.824191914 0.833073900 0.843054360
std_err 0.005168098 0.007689326 0.006122087 0.002039685
final   0.819336266 0.824323975 0.826277656 0.830159141
FINAL_ALL <- FINAL_ALL %>%
    t() %>%
    as.data.frame() %>%
    rownames_to_column(var = 'model') %>%
    mutate(model = factor(model, levels=model))
FINAL_ALL
A data.frame: 4 × 4
  model      mean     std_err     final
  <fct>     <dbl>       <dbl>     <dbl>
1 RF    0.8349837 0.005168098 0.8193363
2 SVM   0.8241919 0.007689326 0.8243240
3 MLP   0.8330739 0.006122087 0.8262777
4 STACK 0.8430544 0.002039685 0.8301591
ggplot(FINAL_ALL, aes(model, mean)) + geom_point() + geom_errorbar(aes(ymin=mean-std_err, ymax=mean+std_err))
[Plot: cross-validated mean ROC AUC for each model, with ± 1 standard-error bars]