library(tidymodels)
library(stacks)
── Attaching packages ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidymodels 1.3.0 ──
 broom        1.0.7      recipes      1.1.1
 dials        1.4.0      rsample      1.2.1
 dplyr        1.1.4      tibble       3.2.1
 ggplot2      3.5.1      tidyr        1.3.1
 infer        1.0.7      tune         1.3.0
 modeldata    1.4.0      workflows    1.2.0
 parsnip      1.3.0      workflowsets 1.1.0
 purrr        1.0.4      yardstick    1.3.2
── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidymodels_conflicts() ──
 purrr::%||%()    masks base::%||%()
 purrr::discard() masks scales::discard()
 dplyr::filter()  masks stats::filter()
 dplyr::lag()     masks stats::lag()
 recipes::step()  masks stats::step()

Ensemble Learning#

What is ensembling and why does it work?

As we toured the most popular algorithms for classification and regression, we compared the performance of each technique and commented on which seemed “better” for the problem at hand. In doing so, we somewhat forced ourselves into a false dichotomy.

The “no free lunch theorem” essentially requires us to try out a variety of different models for each problem. Each of these approaches the problem from a slightly different angle, so it’s likely that each has “learned” something that none of the other models has. Why not combine the predictions from all the models to squeeze out every aspect of the relationships that was learned?

Combining models’ predictions is the basis of ensembling.

Back when we discussed bagging and random forests, we saw that there was strength in numbers.

Note

The average of many guesses typically has less error than any one of those guesses in isolation. The decrease in error from averaging is largest when the guesses are independent, but there is still usually a decrease when the guesses are somewhat correlated.
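This note can be illustrated with a quick simulation. The sketch below is not from the chapter; the numbers (25 guessers, noise with standard deviation 2, a 0.7/0.3 split between shared and individual error) are made up purely for demonstration.

```r
# Illustrative simulation: the average of many guesses has less error
# than a single guess, and the benefit shrinks as guesses become
# correlated. All constants here are arbitrary choices for the demo.
set.seed(1)
truth <- 10
n_guessers <- 25

# Independent guessers: each guess has its own N(0, 2) error
independent <- replicate(1000, mean(truth + rnorm(n_guessers, sd = 2)))

# Correlated guessers: a shared error component makes them agree
correlated <- replicate(1000, {
  shared <- rnorm(1, sd = 2)
  mean(truth + 0.7 * shared + 0.3 * rnorm(n_guessers, sd = 2))
})

# One guesser, no averaging
single <- truth + rnorm(1000, sd = 2)

c(single      = sd(single),
  correlated  = sd(correlated),
  independent = sd(independent))
```

Running this shows the spread (error) shrinking as we move from a single guess, to an average of correlated guesses, to an average of independent guesses.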

Bagging usually increased the performance of a model because of this “strength in numbers” approach. Although the guesses of each model during bagging are not independent (the bootstrapped versions of the training sample that the models use contain some of the same individuals), they were independent enough that averaging their predictions decreased error.

Random forests further decreased the correlation between trees by making the rules of each tree randomly consider only a subset of available predictors. The net result was yet a further decrease in error.

Basics of ensembling#

The key to creating a strong ensemble is diversity. Combining predictions from models that approach the problem in very different ways (and thus whose predictions are not strongly correlated) will often give a stronger model than any of its individual components. Combining predictions from models that are very similar (e.g., a vanilla partition model and a random forest) can actually give a weaker model than its strongest component.

There are a few approaches for creating an ensemble:

  • Average together the predictions of many different models (or let each model “vote” as to the class of the individual).

  • Make a weighted sum of the predictions (since some models will be “better” than others for a particular problem).

  • Bayesian model combination (a “smart” way of combining models).

  • Blending - use the predictions output from models B, C, and D as predictors when building model A.

  • Stacking - treat the predictions of different models as features (\(x\)-variables) that are combined in yet another model (like a neural network) to predict \(y\).
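The first three approaches above can be sketched in a few lines of base R. The probabilities and weights below are hypothetical, chosen only to make the arithmetic visible.

```r
# Hypothetical predicted probabilities of class "good" for one
# individual, from three different models A, B, and C.
p_A <- 0.80; p_B <- 0.65; p_C <- 0.90

# 1. Simple average of the predictions
avg <- mean(c(p_A, p_B, p_C))

# 2. Weighted sum: weights reflect how much we trust each model
w <- c(0.5, 0.2, 0.3)
wsum <- sum(w * c(p_A, p_B, p_C))

# 3. Majority vote at a 0.5 threshold
votes <- c(p_A, p_B, p_C) > 0.5
vote_class <- ifelse(mean(votes) > 0.5, "good", "bad")

c(average = avg, weighted = wsum, vote = vote_class)
```

Blending and stacking differ from these in that the combination itself is learned by another model rather than fixed in advance.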

When you interview for an important job, you typically go through multiple rounds of interviews with different people. A single interviewer may not be able to assess or test the candidate for each skill or trait required for the job. However, when you combine the input from many different interviewers, each of whom probed the candidate in a different way, you have a better view of that candidate’s suitability for the job.

It’s the same with ensembling. Each model develops a different “understanding” of the relationships, so if they are combined intelligently, you get to select the best parts of each model.

Stacking#

The predictions from logistic regression, support vector machine, Naive Bayes, boosted tree, and nearest-neighbor models could be used as predictor variables (the \(x\)’s) in a neural network that is used to predict \(y\). Although each model was built using the same individuals, their predictions are not perfectly correlated because they approach the problem differently.

A word about stacking#

When you use the predictions of models as the \(x\)-variables for your final model, you need to be careful to use the right sets of data.

  • Do not fit each model to the entirety of the training data, save its predictions, then use those as the \(x\) variables to (re-)predict \(y\) with another model-building algorithm.

  • Doing so means you have used the training data twice for two different tasks. You’ve “double-dipped”. It’s gross and data miners do not do it.

  • In this case, I have described response leakage. By fitting a model onto the entirety of the training data and using its predictions as new variables, those variables directly encode information about the values of \(y\) on the training sample (i.e., it’s memorizing the values somewhat).

  • Predicting with variables that have “response leakage” is a surefire way to end up with an overfit model!

  • Rather, during \(K\)-fold cross-validation, the predictions on the “holdout” fold (which the models don’t see that round during training) are saved, and it is those predictions that are used as the \(x\) variables to (re-)predict \(y\).

In round 1, a model’s form is derived using the \(K-1\) training folds and predictions are made on the “pseudo-holdout” fold. These predictions get stored into the relevant positions in the “derived feature” (the \(x\) variable we will use in the stacking model). In round 2, a different fold is held out, and so on, until every row of the training data has an out-of-fold prediction.
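The round-by-round process above can be sketched in base R. This is not the chapter’s code; it uses a toy `lm()` fit as a stand-in for an arbitrary learner, with made-up simulated data, just to show how the derived feature gets filled in without response leakage.

```r
# Base-R sketch: build one "derived feature" from out-of-fold
# predictions during K-fold cross-validation. (Toy data; lm() stands
# in for whatever model-building algorithm is actually used.)
set.seed(2)
n <- 100
x <- rnorm(n)
y <- 2 * x + rnorm(n)

K <- 5
fold <- sample(rep(1:K, length.out = n))  # assign each row to a fold

derived <- rep(NA_real_, n)
for (k in 1:K) {
  in_train <- fold != k
  fit <- lm(y ~ x, data = data.frame(x = x[in_train], y = y[in_train]))
  # predict ONLY on the held-out fold this model never saw, and store
  # those predictions in the matching positions of the derived feature
  derived[fold == k] <- predict(fit,
                                newdata = data.frame(x = x[fold == k]))
}

# After K rounds, every position is filled exactly once, and no
# prediction was made by a model that trained on that row.
stopifnot(!anyNA(derived))
```

The `derived` vector is what the stacking model would then use as one of its \(x\)-variables.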

Ensembling in practice#

In theory, ensembling sounds great: combining the strong points of different models that approach the relationship in different ways sounds foolproof. However, it doesn’t always work in practice (or give a LARGE decrease in error). Why?

  • The predictions of the models in an ensemble tend to be highly correlated. In other words, there isn’t much variation among the predictions. Ensembles are better than their components when they combine weakly correlated models.

  • Ensembling works best when you have lots and lots of data.

  • The boost in performance you get from ensembling isn’t ever going to be HUGE. Ensembling is for squeezing the last bit of predictive performance out of your models. It’s how you win data mining competitions, but it’s rarely needed for everyday predictive analytics.

  • Save ensembling for when you already have a few good models for the process you’re studying and you want to improve it just a little bit more!

Example#

data(credit_data, package = 'modeldata')
set.seed(474)
tt_split <- initial_split(credit_data, prop = 0.8)
TRAIN <- training(tt_split)
HOLDOUT <- testing(tt_split)
REC <- recipe(Status ~ ., TRAIN) %>%
    step_normalize(all_numeric_predictors()) %>%
    step_unknown(all_nominal_predictors()) %>%
    step_dummy(all_nominal_predictors()) %>%
    step_nzv(all_predictors()) %>%
    step_corr(all_predictors()) %>%
    step_impute_mean(all_numeric_predictors()) %>%
    step_lincomb(all_predictors())
set.seed(474)
CV <- vfold_cv(TRAIN, v = 5)
GRID <- control_stack_grid()
MODEL_ALL <- list(
    RF = rand_forest( mode = 'classification', mtry = tune() )
    ,
    SVM = svm_rbf( mode = 'classification', cost = tune(), rbf_sigma = tune() )
    ,
    MLP = mlp( mode = 'classification', hidden_units = tune(), penalty = tune() )
)
MODEL_ALL
$RF
Random Forest Model Specification (classification)

Main Arguments:
  mtry = tune()

Computational engine: ranger 


$SVM
Radial Basis Function Support Vector Machine Model Specification (classification)

Main Arguments:
  cost = tune()
  rbf_sigma = tune()

Computational engine: kernlab 


$MLP
Single Layer Neural Network Model Specification (classification)

Main Arguments:
  hidden_units = tune()
  penalty = tune()

Computational engine: nnet 
tune_model <- function(model){
    WF <- workflow(REC, model)

    RES <- WF %>%
        tune_grid(
            resamples = CV,
            grid = 10,
            control = GRID
        )

    return(RES)
}

RES_ALL <- lapply(MODEL_ALL, tune_model)
i Creating pre-processing data to finalize unknown parameter: mtry
test_model <- function(model, res){
    cv <- show_best(res, metric = 'roc_auc', n=1) %>% select(mean, std_err)

    best <- res %>% select_best(metric = 'roc_auc')

    final <- workflow(REC, model) %>%
        finalize_workflow(best) %>%
        last_fit(tt_split) %>%
        collect_metrics() %>%
        filter(.metric=='roc_auc')

    cv$final <- final$.estimate

    cv <- unlist(cv)

    return(cv)
}

FINAL_ALL <- as.data.frame(mapply(test_model, MODEL_ALL, RES_ALL))
FINAL_ALL
A data.frame: 3 × 3
                 RF         SVM         MLP
              <dbl>       <dbl>       <dbl>
mean    0.834983666 0.824191914 0.833073900
std_err 0.005168098 0.007689326 0.006122087
final   0.819336266 0.824323975 0.826277656
STACK <- stacks() %>%
  add_candidates(RES_ALL$RF, name='RF') %>%
  add_candidates(RES_ALL$SVM, name='SVM') %>%
  add_candidates(RES_ALL$MLP, name='MLP') %>%
  blend_predictions(metric = metric_set(roc_auc)) %>%
  fit_members()
STACK_metrics <- STACK$metrics %>%
    filter(.metric=='roc_auc') %>%
    arrange(desc(mean))
STACK_metrics
A tibble: 6 × 8
  penalty mixture .metric .estimator      mean     n     std_err .config
    <dbl>   <dbl> <chr>   <chr>          <dbl> <int>       <dbl> <chr>
1   1e-01       1 roc_auc binary     0.8430544    25 0.002039685 Preprocessor1_Model6
2   1e-02       1 roc_auc binary     0.8429849    25 0.002161517 Preprocessor1_Model5
3   1e-03       1 roc_auc binary     0.8427538    25 0.002170250 Preprocessor1_Model4
4   1e-04       1 roc_auc binary     0.8427310    25 0.002171834 Preprocessor1_Model3
5   1e-05       1 roc_auc binary     0.8427179    25 0.002172569 Preprocessor1_Model2
6   1e-06       1 roc_auc binary     0.8427143    25 0.002172698 Preprocessor1_Model1
PRED <- predict(STACK, HOLDOUT, type = 'prob') %>%
    bind_cols(select(HOLDOUT, Status))
PRED
A tibble: 891 × 3
   .pred_bad .pred_good Status
       <dbl>      <dbl> <fct>
 1 0.2855829  0.7144171 good
 2 0.3337921  0.6662079 good
 3 0.6826082  0.3173918 bad
 4 0.1480877  0.8519123 good
 5 0.5084013  0.4915987 bad
 6 0.1532305  0.8467695 good
 7 0.1514254  0.8485746 good
 8 0.1530654  0.8469346 bad
 9 0.1440314  0.8559686 good
10 0.3750470  0.6249530 bad
# … with 881 more rows
STACK_final <- roc_auc(PRED, Status, .pred_bad)
STACK_final
A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary     0.8301591
STACK_cv <- STACK_metrics %>% head(1) %>% select(mean, std_err)
STACK_cv$final <- STACK_final$.estimate
FINAL_ALL$STACK <- unlist(STACK_cv)
FINAL_ALL
A data.frame: 3 × 4
                 RF         SVM         MLP       STACK
              <dbl>       <dbl>       <dbl>       <dbl>
mean    0.834983666 0.824191914 0.833073900 0.843054360
std_err 0.005168098 0.007689326 0.006122087 0.002039685
final   0.819336266 0.824323975 0.826277656 0.830159141
FINAL_ALL <- FINAL_ALL %>%
    t() %>%
    as.data.frame() %>%
    rownames_to_column(var = 'model') %>%
    mutate(model = factor(model, levels=model))
FINAL_ALL
A data.frame: 4 × 4
  model      mean     std_err     final
  <fct>     <dbl>       <dbl>     <dbl>
1 RF    0.8349837 0.005168098 0.8193363
2 SVM   0.8241919 0.007689326 0.8243240
3 MLP   0.8330739 0.006122087 0.8262777
4 STACK 0.8430544 0.002039685 0.8301591
ggplot(FINAL_ALL, aes(model, mean)) + geom_point() + geom_errorbar(aes(ymin=mean-std_err, ymax=mean+std_err))
[Plot: cross-validated mean ROC AUC for each model, with ± 1 standard-error bars]