library(tidymodels)
library(e1071)
library(MASS)
library(ISLR)
library(robustbase)

Support Vector Machines#

Support Vector Machines (SVMs) are a relatively recent approach to classification and regression that, like random forests, tend to work well with little user input or tuning.

Contrary to the name, the technique has nothing to do with machines. The word “support” is used only in its mathematical sense, and the word “vector” has nothing to do with the vectors we create in R. But it sure sounds fancy.

With classification, the technique is concerned with finding pivotal individuals in the dataset that seem to play a key role in differentiating one class from another. With regression, it is concerned with finding pivotal individuals whose errors are the important ones to minimize.

Maximum Margin Classifier#

A separating hyperplane is simply a “flat” boundary that separates the data space (all possible combinations of the predictor variables) into two regions.

Maybe there’s a separating hyperplane that can split the data space so that all members of one class are on one side and all members of the other class are on the other side?

With one predictor, the “data space” is just the number line, so a “separating hyperplane” would just be some particular point on the line (or, if you like, a “line” that extends into a second dimension of “no man’s land”, a region outside the data space).

set.seed(474); x <- runif(20,1,3)
y <- rep(0,length(x))
plot(y~x,ylim=c(-.5,.5),col=ifelse(x<1.8,"red","blue"),pch=20,cex=1.4,axes=FALSE,ylab="",xlim=c(1,3))
abline(v=1.78,lwd=2)
abline(h=0)
axis(1,at=seq(1,3,by=0.5))
../_images/673e4fec72a2f1e282fda0a33d608b533b80a38f61a081bd106408ee957e2092.png

With two predictors, the data space is visualized as the standard scatterplot, and a “separating hyperplane” would be a line that bisects it. You can imagine this line extending out towards you and away from you into a third dimension (“no man’s land”; not part of the data space).

set.seed(474); x1 <- runif(30,1,3); x2 <- runif(30,3,5)
plot(x1~x2,col=ifelse(x1+x2<6,"red","blue"),pch=20,cex=1.3)
lines(c(3,5),c(3,1))
../_images/153d7b7ade84dffac5c96a1f191ca3c49476c565ccb6719d5bb1b5908f0fa5d7.png

With four predictor variables, the data space is 4-dimensional, and the separating hyperplane would extend into an “unseen” 5th dimension. There is no way to visualize this, but you get the idea.

Note

The maximum margin classifier seeks the separating hyperplane that puts the most distance between itself and the nearest example of each class (assuming, of course, that it is possible to do so).

Separating Hyperplane#

All four of these “separating hyperplanes” succeed in bisecting the data space into a “blue region” and a “red region”. Classification proceeds by figuring out which side a new individual is on: red or blue? But which of these separating hyperplanes is the “best”?

set.seed(474)
X <- matrix( rnorm(80,.2,.15),ncol=2 )
Y <- matrix( rnorm(80,.73,.17),ncol=2 )
DATA <- as.data.frame( rbind(X,Y) )
names(DATA) <- c("x1","x2")
DATA$class <- factor( c(rep("red",40),rep("blue",40) ))
plot(x2~x1,data=DATA,col=2*as.numeric(DATA$class),pch=20,cex=2)
abline(1,-1.5)
abline(0.9,-1.4)
abline(0.8,-1.2)
abline(3,-6)
../_images/753421c404cc2e137e2724f2359d8b24b4606ad703b5efb5b00a30e7d9362528.png

Maximal Margin Classifiers#

The maximal margin classifier takes the best separating hyperplane to be the one that gives the largest “margin” (i.e., spacing; black arrows) between it and the nearest point of each class (on the plot, the distance from the separating hyperplane to the dotted lines, measured perpendicularly). All points are not only on the correct side of the hyperplane, they are also on the correct side of the margin (red points are above the dashed red line; blue points are below the dashed blue line)!

set.seed(474)
X <- matrix( rnorm(80,.2,.15),ncol=2 )
Y <- matrix( rnorm(80,.73,.17),ncol=2 )
DATA <- as.data.frame( rbind(X,Y) )
names(DATA) <- c("x1","x2")
DATA$class <- factor( c(rep("red",40),rep("blue",40)) )
plot(x2~x1,data=DATA,col=2*as.numeric(DATA$class),pch=20,cex=2)
SVM <- svm(class~.,data=DATA,type='C-classification', kernel='linear',scale=FALSE,cost=1000)
W <- t(SVM$coefs) %*% SVM$SV
abline(SVM$rho/W[2],-W[1]/W[2],lwd=3 )
#points(SVM$SV,pch=8,cex=2)
margin <- mean( abs( (SVM$SV)[,2] - SVM$rho/W[2] + W[1]/W[2]*(SVM$SV)[,1]) )
abline(SVM$rho/W[2]+margin,-W[1]/W[2],lty=3,col="blue",lwd=2 )
abline(SVM$rho/W[2]-margin,-W[1]/W[2],lty=3,col="red",lwd=2 )
arrows(-0.01067408,0.4691493,0.14137917,0.8686653,col="blue",lwd=3)
arrows(0.7782815,0.4236349,0.6204904,-0.01128126,col="red",lwd=3)
arrows(0.1700685,0.5247781,0.2618742,0.7270647,col="black",lwd=2,code=3,length=.1)
../_images/1795bca158bd773c8054d5a199a886dc484b52b02b449599ccd28d2c104174e9.png

Support Vectors#

You’ll note that three of the points have been starred. These are the support vectors. They “support” the separating hyperplane because if their values changed, so would the hyperplane.

Notice that the rest of the data is essentially irrelevant at this point: it’s only the positions of the support vectors that matter. This means the procedure can be quite robust to outliers.
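As a small sketch of this sparsity (not part of the original figures; it reuses the simulated data from above), refitting the classifier on only its support vectors recovers essentially the same hyperplane:

```r
library(e1071)
# Same simulated, separable data as in the figures above
set.seed(474)
X <- matrix( rnorm(80,.2,.15), ncol=2 )
Y <- matrix( rnorm(80,.73,.17), ncol=2 )
DATA <- as.data.frame( rbind(X,Y) )
names(DATA) <- c("x1","x2")
DATA$class <- factor( c(rep("red",40),rep("blue",40)) )
SVM <- svm(class~., data=DATA, type='C-classification',
           kernel='linear', scale=FALSE, cost=1000)
W <- t(SVM$coefs) %*% SVM$SV      # normal vector of the hyperplane
# Refit using ONLY the support vectors (their row numbers are in SVM$index)
SVM2 <- svm(class~., data=DATA[SVM$index,], type='C-classification',
            kernel='linear', scale=FALSE, cost=1000)
W2 <- t(SVM2$coefs) %*% SVM2$SV
max(abs(W - W2))                  # essentially zero: same hyperplane
```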

DATA <- as.data.frame( rbind(X,Y) )
names(DATA) <- c("x1","x2")
DATA$class <- factor( c(rep("red",40),rep("blue",40) ))
plot(x2~x1,data=DATA,col=2*as.numeric(DATA$class),pch=20,cex=2)
SVM <- svm(class~.,data=DATA,type='C-classification', kernel='linear',scale=FALSE,cost=1000)
W <- t(SVM$coefs) %*% SVM$SV
abline(SVM$rho/W[2],-W[1]/W[2],lwd=3 )
points(SVM$SV,pch=8,cex=2)
margin <- mean( abs( (SVM$SV)[,2] - SVM$rho/W[2] + W[1]/W[2]*(SVM$SV)[,1]) )
abline(SVM$rho/W[2]+margin,-W[1]/W[2],lty=3 )
abline(SVM$rho/W[2]-margin,-W[1]/W[2],lty=3 )
../_images/4cbf528d72dcad743952262a384748e1aa4d512180fdff7bda2ab1fe1f4cd48c.png

What if not separable?#

What if a hyperplane doesn’t neatly divide the data space in two so that all of one class lives on one side and all of the other is on the other side?

set.seed(474)
X <- matrix( rnorm(80,.2,.15),ncol=2 )
Y <- matrix( rnorm(80,.5,.17),ncol=2 )
DATA <- as.data.frame( rbind(X,Y) )
names(DATA) <- c("x1","x2")
DATA$class <- factor( c(rep("red",40),rep("blue",40) ))
plot(x2~x1,data=DATA,col=2*as.numeric(DATA$class),pch=20,cex=2)
../_images/0e852024b38169b94360e6f95ac93e1ea767bb1d23f0cba20509bc20ba46ee94.png


Support vector classifiers (aka soft margin classifiers) allow some observations to be on the wrong side of the margin and even on the wrong side of the hyperplane.

set.seed(474)
X <- matrix( rnorm(80,.2,.15),ncol=2 )
Y <- matrix( rnorm(80,.5,.17),ncol=2 )
DATA <- as.data.frame( rbind(X,Y) )
names(DATA) <- c("x1","x2")
DATA$class <- factor( c(rep("red",40),rep("blue",40)) )
plot(x2~x1,data=DATA,col=2*as.numeric(DATA$class),cex=2)
SVM <- svm(class~.,data=DATA,type='C-classification', kernel='linear',scale=FALSE,cost=10)
W <- t(SVM$coefs) %*% SVM$SV
abline(SVM$rho/W[2],-W[1]/W[2],lwd=3 )
points(SVM$SV,pch=20,cex=2,col=as.character(DATA$class[ as.numeric(rownames( SVM$SV ) ) ]) )
margin <- max( abs( (SVM$SV)[,2] - SVM$rho/W[2] + W[1]/W[2]*(SVM$SV)[,1]) )
abline(SVM$rho/W[2]+margin,-W[1]/W[2],lty=3,col="blue",lwd=2 )
abline(SVM$rho/W[2]-margin,-W[1]/W[2],lty=3,col="red",lwd=2 )
arrows(-0.05221021,0.2874830,0.07263393,0.7178204,col="blue",lwd=2)
arrows(0.7569648,0.4267098,0.6321207,-0.1091025,col="red",lwd=2)
../_images/c98f96ae7847d6ac5e82ebe8842022487c333ffa5488eb5f36e9cc4d370c36db.png

Soft margin classifier details#

The soft margin classifier depends on a parameter \(C\). This is essentially a “budget” that allows for some misclassifications.

  • When \(C>0\), no more than \(C\) observations can be on the wrong side of the hyperplane.

  • Larger values of \(C\) allow more misclassifications (and more support vectors).

  • Smaller values of \(C\) allow fewer misclassifications (and fewer support vectors).

The “best” value of \(C\) is typically estimated with cross-validation, since its value will depend on how mixed together the classes are. Smaller values of \(C\) yield models that are more complex and fit the training data better (since few misclassifications are allowed). Larger values of \(C\) yield simpler models which may generalize better (at the cost of less accuracy). Be aware that the cost argument in software such as \(e1071\) penalizes violations rather than budgeting them, so it acts roughly as the inverse of this \(C\): a large cost corresponds to a small budget.
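A quick sketch with \(e1071\) (reusing the overlapping data from the previous figure): because the software’s cost argument penalizes violations, increasing it typically leaves fewer points inside the margin, and hence fewer support vectors.

```r
library(e1071)
# Overlapping two-class data, as in the soft margin example above
set.seed(474)
X <- matrix( rnorm(80,.2,.15), ncol=2 )
Y <- matrix( rnorm(80,.5,.17), ncol=2 )
DATA <- as.data.frame( rbind(X,Y) )
names(DATA) <- c("x1","x2")
DATA$class <- factor( c(rep("red",40),rep("blue",40)) )
# Larger cost = a smaller misclassification budget = fewer support vectors
for (cost in c(0.01, 1, 100)) {
  SVM <- svm(class~., data=DATA, type='C-classification',
             kernel='linear', cost=cost)
  cat("cost =", cost, "-> support vectors:", nrow(SVM$SV), "\n")
}
```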

Kernel Trick in Support Vector Machine#

Pictured is an extreme artificial case where no hyperplane would do a good job separating the classes, but there is still a distinct but nonlinear boundary.

set.seed(474)
plot(0,0,col="white",xlim=c(-1,1),ylim=c(-1,1),xlab="x1",ylab="x2")
INNER <- matrix(0,ncol=2,nrow=150)
OUTER <- INNER
i <- 1
while( i <= 150 ) {
  x <- runif(2,-1,1)
  if( x[1]^2 + x[2]^2 <= .5^2 ) { INNER[i,] <- x; i <- i+ 1}
}
i<-1
while( i <= 150 ) {
  x <- runif(2,-1,1)
  if( x[1]^2 + x[2]^2 <= 1 & x[1]^2 + x[2]^2 >= .5^2 ) { OUTER[i,] <- x; i <- i+ 1}
}
points(INNER,cex=1,pch=20,col="red")
points(OUTER,cex=1,pch=20,col="blue")
../_images/cd6c3d1794b084e7b8cbd6549c405a10c1c8449b31cdffa356e2a6317b67042a.png

Support vector machines are somewhat unique in the field of data mining and machine learning in that they take the philosophy that the model is good; it’s the data that needs to become more flexible.

The support vector machine believes that a linear regression (for numerical \(y\)) is a great model, it’s just that the original data space isn’t the place to fit it. Likewise, it believes that a flat boundary (separating hyperplane) is a great way to make classifications, it’s just the original data space isn’t the place to use it.

This is in stark contrast to other models, which take the observed values (or transformations or derived quantities) as given, then modify the model until there’s a good fit.

The support vector machine is an extension to the soft margin classifier that uses the “kernel trick” to discern non-linear boundaries between classes.

The mathematics of the kernel trick are beyond the scope of this course, but in essence it takes the original data and projects it into a data space of higher dimensionality, where it hopes it can find a separating hyperplane.

Usually, the “curse of dimensionality” means that tasks in high dimensions (many predictors) become difficult. Oddly enough, here projecting into a higher dimension sometimes makes the task easier.

Kernel Trick#

Imagine classifying individuals as red or blue depending on the value of \(x\). It is clear that no separating hyperplane exists. It is also clear that a non-linear boundary exists between the two classes. However, this “boundary” reaches outside the data space (the one-dimensional line).

plot(0,0,col="white",xlab="x",ylab="",xlim=c(0,1),ylim=c(-.5,.5),yaxt='n')
abline(h=0)
set.seed(474)
x <- runif(30)
x[1:2] <- c(.38,.52)
y <- factor( ifelse(x < 0.6 & x >0.37,"red","blue") )
points(x,rep(0,length(x)),col=as.character(y),pch=20,cex=1.5)
curve( 10*(x-0.35)*(x-0.55), add=TRUE )
../_images/aec46da479c34014c9dfbd866d77b605223f69bca96087b58fd851a1b5e7cec3.png

The problem is that figuring out a non-linear boundary is very difficult because there are so many possible shapes it could take. Lines and hyperplanes are much easier to work with, so instead of trying to figure out the non-linear boundary, the “kernel trick” takes the data and projects it into a higher dimensional space where it hopes a separating hyperplane exists.

What does “projecting into a higher dimensional space” even mean? It basically means creating new variables from ones that you already have. If \(x\) is the one predictor you have collected, then using \(x^2\) and \(\log x\) as additional predictors essentially makes the data space three-dimensional.

These “new” predictors don’t contain any new information about the classes of course, but oddly enough, they can make the boundary between classes linear.
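As a sketch of this idea on the 1D example above (same simulated data; the xsq column name is ours), adding \(x^2\) as a second predictor lets a plain linear SVM separate the classes:

```r
library(e1071)
# The 1D toy data from the figure above
set.seed(474)
x <- runif(30)
x[1:2] <- c(.38,.52)
y <- factor( ifelse(x < 0.6 & x > 0.37, "red", "blue") )
# No separating point exists on the number line, but in the (x, x^2)
# data space a linear boundary separates the two classes
D <- data.frame(x=x, xsq=x^2, y=y)
SVM <- svm(y~x+xsq, data=D, type='C-classification',
           kernel='linear', scale=FALSE, cost=1000)
mean(predict(SVM) == D$y)   # training accuracy
```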

Project from one to two dimensions#

There was no separating hyperplane in the 1D data space (just \(x\) as a predictor). But by projecting into a 2D data space by adding \(x^2\) as a predictor, all of a sudden there is a separating hyperplane.

plot(x,x^2,col=as.character(y),xlab="x",ylab="square of x",xlim=c(0,1),pch=20)
m <- (.52^2-.38^2)/(.52-.38)
b <- .38^2 - m*.38
abline(b+.005,m)
../_images/856bf6b911c193d2b460b75e52c10bdb3cd16cb607a2590b4110026b97986869.png

No separating hyperplane in two dimensions#

There is no separating hyperplane in the 2D dataspace using the predictors \(x_1\) and \(x_2\). But we do see that a curved boundary exists.

set.seed(474)
plot(0,0,col="white",xlim=c(-1,1),ylim=c(-1,1),xlab="x1",ylab="x2")
INNER <- matrix(0,ncol=2,nrow=150)
OUTER <- INNER
i <- 1
while( i <= 150 ) {
  x <- runif(2,-1,1)
  if( x[1]^2 + x[2]^2 <= .48^2 ) { INNER[i,] <- x; i <- i+ 1}
}
i<-1
while( i <= 150 ) {
  x <- runif(2,-1,1)
  if( x[1]^2 + x[2]^2 <= 1 & x[1]^2 + x[2]^2 >= .52^2 ) { OUTER[i,] <- x; i <- i+ 1}
}
points(INNER,col="red")
points(OUTER,col="blue")
../_images/81157d7d10314e359224c3a930626df55276bcc2d432b32c357e0e0e3e3c0e51.png

Projecting from two to three dimensions#

By adding a third “predictor” \(x_3 = x_1^2 + x_2^2\) (no new information about the problem), the data space becomes 3D. Now there is a separating hyperplane that neatly divides the space into a “red region” and “blue region”, achieving 100% accuracy.

DATA <- data.frame(x1=rbind(INNER,OUTER)[,1],x2=rbind(INNER,OUTER)[,2])
DATA$x3 <- DATA$x1^2 + DATA$x2^2
plot(0,0,col="white",xlim=c(-1,1),ylim=c(0,1),xlab="x1",ylab="x3")
points(DATA[,c('x1','x3')],col=c(rep("red",150),rep("blue",150)))
../_images/f6a24598a1973888bf345316dd62dbbe076f060be1435bfde773d645e60b5135.png

Adding additional predictors to project the data into a higher dimensional space often works, but the question is: which predictors? There are infinitely many possible combinations.

SVMs use the “kernel trick” to get around this problem. A kernel is just a matrix whose values record a somewhat abstract measure of similarity (a kind of closeness score) between individuals. The values of the kernel matrix are given by an equation we choose, like

\[K[i,j] = (1 + x_{1,i}x_{1,j} + x_{2,i}x_{2,j})^3\]

The amazing thing is that by defining the kernel matrix, we can implicitly define additional predictors without specifying exactly what they are or how to get them.
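For instance (a toy sketch, not from the original notes), this kernel matrix can be computed entirely from inner products of the original two predictors; no higher-dimensional coordinates are ever constructed:

```r
# Polynomial kernel K[i,j] = (1 + x1_i*x1_j + x2_i*x2_j)^3 for a tiny dataset
set.seed(474)
X <- cbind( x1=runif(5), x2=runif(5) )
K <- (1 + X %*% t(X))^3        # all pairwise "similarities" at once
K[1,2]                          # similarity between individuals 1 and 2
all.equal( K[1,2], (1 + sum(X[1,]*X[2,]))^3 )  # matches the formula
```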

Kernels#

Note

It turns out that the equation for the optimal separating hyperplane does not depend on the exact locations (coordinates) of the individuals in the higher-dimensional space we are projecting into. Rather, it depends only on the entries in the kernel matrix, i.e., on how we quantify the similarity between two individuals.

  • We don’t actually have to project each individual into the higher-dimensional space; we only have to compute the kernel/similarities (which is more computationally efficient).

  • The kernel implicitly captures the higher-dimensional space without explicitly having to construct it. This even allows projecting into infinite-dimensional spaces when exploring the existence of a separating hyperplane.

This is all very trippy stuff!

If you are pressed for a simple explanation, try out the following oversimplification.

Note

When doing classification, the support vector machine assigns a score to each individual, where the score is some number derived from transformations of the predictor variables. If that score exceeds a certain threshold, the individual is assigned class A. Otherwise it is assigned class B.
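You can peek at these scores in \(e1071\): predict() can return the raw decision values that get thresholded to produce the class labels. (A sketch on a two-species subset of the built-in iris data, not part of the original notes.)

```r
library(e1071)
# Two-class subset of iris: each flower gets a numeric score, and the sign
# of the score determines the predicted class
D <- droplevels( subset(iris, Species != "virginica") )
SVM <- svm(Species~Petal.Length+Petal.Width, data=D,
           type='C-classification', kernel='linear')
p <- predict(SVM, D, decision.values=TRUE)
scores <- attr(p, "decision.values")
head(scores)   # positive scores -> one class, negative -> the other
table(predicted=p, actual=D$Species)
```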

Radial Basis Kernel#

Like “magic”, the SVM with a radial basis kernel is able to recognize that the boundary between classes is non-linear, and it does a pretty good job of figuring it out. In the classification plot produced by plot.svm, the X’s mark the support vectors and the O’s mark the remaining data points.

DATA <- as.data.frame( rbind(INNER,OUTER) )
names(DATA) <- c("x1","x2")
DATA$y <- factor( c(rep("red",150),rep("blue",150)))
plot(DATA[,2:1],col=as.character(DATA[,3]),pch=20)
SVM <- svm(y~.,data=DATA,type='C-classification', kernel='radial',scale=FALSE,cost=50)
plot(SVM,DATA)
../_images/817c87c377c7fadc062e0fa7b9d1787895f42f174c18fedb484ba65379e31fb4.png ../_images/a89fd09957936450d20dd312e16985589969fff3ebbdf72d295eb9a4b109ee3a.png

Another example.

DATA <- as.data.frame(matrix(runif(300,-1,1),ncol=2))
DATA[,2] <- DATA[,2]+.5
names(DATA) <- c("x1","x2")
DATA$y <- factor( ifelse(DATA[,2] > DATA[,1]^2,"red","blue"))
plot(DATA[,2:1],col=as.character(DATA[,3]),pch=20)
curve(sqrt(abs(pmax(0,x))),add=TRUE,n=1001,from=-.01,to=1.7)
curve(-sqrt(abs(pmax(0,x))),add=TRUE,n=1001,from=-.01,to=1.7)
SVM <- svm(y~.,data=DATA,type='C-classification', kernel='radial',scale=FALSE,cost=15)
plot(SVM,DATA)
../_images/865a37c7facabc1e5621d8d31249f9692eedf686429d94322e3327c39c2021c6.png ../_images/0d10005b3e04abc95414282256dacb0ec08f2bc361484b1347bb33b121e20772.png

Support Vector Machines for classification summary#

While the whole procedure seems a bit wild and too good to be true, SVMs often work well in practice. The main considerations are:

  • What kernel to use? Popular choices are the polynomial kernel (new predictors are polynomial combinations of the original ones) and the radial basis kernel / Gaussian kernel (which projects into the infinite dimensional space by essentially using an infinite number of polynomials).

  • What parameters of the kernel should be used? The polynomial kernel needs a “degree” parameter (how many powers of the predictors should be used). The radial basis kernel needs a “spread” parameter (\(\gamma\)), which controls how quickly the values inside the kernel matrix decrease with the distance between individuals.

  • What is the “cost parameter”, i.e., the penalty to apply to a proposed separating hyperplane when it makes misclassifications? Larger costs make the model fit the training data better (lower bias) at the expense of increasing its variance (sensitivity to the training data) and its risk of overfitting.

The Support Vector Machine (SVM) approaches classification in a new and interesting way.

  • Find a “separating hyperplane” (basically a flat plane) that divides the data space into two parts where most (hopefully all) members of a class are on one side of the plane.

  • The location of this separating hyperplane depends on only a few pivotal individuals in the dataset (the support vectors), so the solution is “sparse”. Because so few points are needed, the prediction algorithm is extremely memory- and computationally-efficient (once the model is fit).

Problem: the boundary between classes may not be well-described by a straight edge!

Solution: project the data into a higher-dimensional data space (i.e., create new variables from the ones we already have) where the boundary is better described by a straight edge.

Support Vector Machines for Regression#

Support Vector Machines can also be used to predict numerical responses (an SVM just “won” a class competition in BAS 479). The procedure borrows heavily from that used for classification.

  • SVM assumes that there is a linear relationship between the \(y\) variable and some predictor variables (the \(x\)’s). However, these predictors may not be the raw values that have been collected.

  • We will project the data into a higher dimensional data space with the “kernel trick”, thereby implicitly creating new predictor variables. With the “right” kernel, relationships may be approximately linear.

  • A linear regression model is fit to the relationship in this higher-dimensional space but with a twist: not all errors made by the model count against it.

  • Instead of choosing the form of the linear regression by minimizing its sum of squared errors, we will minimize the sum of the absolute values of only those errors that are “large” (greater than some threshold).

  • The support vectors will be the points whose errors contributed to the form of the model, so once again the model is “sparse” (it depends on only a few pivotal individuals).
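The last two points can be sketched with \(e1071\)’s eps-regression (simulated data of our own choosing): only points with errors in excess of \(\epsilon\) become support vectors, so raising \(\epsilon\) makes the model sparser.

```r
library(e1071)
# Noisy linear data (similar in spirit to the figure below)
set.seed(479)
x <- seq(2, 10, length=50)
y <- 5*x + 20 + rnorm(length(x), 0, 10)
# Only points whose errors exceed epsilon become support vectors
for (eps in c(0.1, 5, 20)) {
  SVM <- svm(x, y, type='eps-regression', kernel='linear',
             scale=FALSE, epsilon=eps)
  cat("epsilon =", eps, "-> support vectors:", nrow(SVM$SV), "\n")
}
```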

How it works#

The errors (lines connecting the dots to the solid blue line) “don’t count” if they are smaller than some threshold \(\epsilon\) (thick vertical black line segment). SVM finds the line whose coefficients are closest to 0 (i.e., regularization is being performed) and whose errors (specifically the errors in excess of \(\epsilon\), given by the lengths of the red segments) are smallest. The support vectors (stars) are the points whose errors are in excess of \(\epsilon\).

set.seed(479)
x <- seq(2,10,length=10)
y <- 5*x + 20 + rnorm(length(x),0,10)
plot(y~x,pch=20,cex=1.5,xlim=c(1,11),ylim=c(10,90))
M <- lm(y~x)
abline(lsfit(x,y),col="blue",lwd=2)
slope <- coef(lm(y~x))[2]
intercept <- coef(M)[1]
abline(intercept-10,slope,lty=2,col="blue")
abline(intercept+10,slope,lty=2,col="blue")
for(i in 1:10) { lines( c(x[i],x[i]), c(y[i],fitted(M)[i]), col="green", lwd=ifelse( abs(y[i]-fitted(M)[i]) > 10, 2, 1 ), lty=1 ) }
lines( c(6.7,6.7), c( predict(M,newdata=data.frame(x=6.7)), predict(M,newdata=data.frame(x=6.7))+10 ),lwd=4)
text(6.5,63,expression(epsilon),cex=1.5)
oops <- which( abs( fitted(M) - y ) > 10 )
for (i in oops) {
  lines(  c(x[i],x[i]), c(y[i],fitted(M)[i] + 10*ifelse(y[i]>fitted(M)[i],1,-1)  ) ,col="red",lwd=3 )
}
points(x,y,pch=ifelse( abs(y-fitted(M))>10,8,1),cex=1.2)
../_images/67c05fd468e52aa948ff7e702ce671d585e524b5afc15aa3421e99875a091d74.png

Bias-variance tradeoff#

When the form of the SVM is being determined, it balances the bias and variance of the resulting model through the use of a cost parameter \(C\).

  • Large values of \(C\) focus the attention towards the errors made by the model. The result will be a model that has a low bias (it fits the training data well) but potentially high variance (poor generalizability due to sensitivity to the specific individuals in the training data).

  • Small values of \(C\) focus the attention towards the size of the coefficients. The result will be a model that makes a larger error on the training data (higher bias since it won’t have as much “wiggle room” to fit the data) but that has a lower variance (not as sensitive to the individuals in the data).

The optimal values of the cost parameter \(C\), the error threshold \(\epsilon\), and any kernel-specific parameters (the degree of a polynomial kernel and the \(\gamma\) parameter of a radial basis kernel) are found via cross-validation.

Larger cost = errors are weighted more. The model will fit the training data better, but could be too wiggly to generalize.

Lower cost = coefficients are weighted more. The model will not be as wiggly and will make larger errors on the training data, but at least it is not overfit.

set.seed(474)
x <- sort( runif(50,2,10) )
y <- x^2*sin(x) + 200 + rnorm(50,0,10)
plot(y~x,pch=20,cex=1.5)
SVM <- svm(x,y,scale=TRUE,cost=.1)
new <- data.frame(x=seq(2,10,length=500) )
pred.y <- predict(SVM,newdata=new,scale=TRUE)
lines(new$x,pred.y,col="black")
SVM <- svm(x,y,scale=TRUE,cost=10)
new <- data.frame(x=seq(2,10,length=500) )
pred.y <- predict(SVM,newdata=new,scale=TRUE)
lines(new$x,pred.y,col="red")
legend("topleft",c("Cost=0.1","Cost=10"),lty=1,col=c("black","red"))
../_images/c818a47f7d6d610c1a26adbe8e8e6b42c7a24646edd8ecf957b1a33c7d0c6075.png

Tuning parameters and making predictions#

Tuning the parameters in \(e1071\)#

The \(e1071\) package offers additional options that are not available through \(train\) and \(caret\) that you should be aware of. The cost and \(\epsilon\) (epsilon) parameters must always be tuned. The default kernel is the radial basis kernel, and this requires additional tuning of the \(\gamma\) (“gamma”) parameter (how quickly the entries of the kernel matrix go to 0 for points that are separated by larger distances). It’s possible to use \(tune\) from the \(e1071\) library.

# perform a grid search
SVMparams <- tune("svm", y~x,ranges=
                    list(epsilon=seq(0,1,0.2),cost=2^(2:8),gamma=10^(seq(-3,1,.5))))
SVMparams
Parameter tuning of ‘svm’:

- sampling method: 10-fold cross validation 

- best parameters:
 epsilon cost     gamma
     0.2  256 0.3162278

- best performance: 79.12614 

Plotting the selected model#

In this case, the defaults do a pretty good job (at least visually), but the selected parameters are designed to offer a reasonable tradeoff between the bias and variance of the model so that it generalizes well to new data.

plot(y~x,pch=20,cex=1.5)
SVM <- svm(x,y,scale=TRUE,cost=256,gamma=0.3162278,epsilon=0.2)
new <- data.frame(x=seq(2,10,length=500) )
pred.y <- predict(SVM,newdata=new,scale=TRUE)
lines(new$x,pred.y,col="red")
../_images/f0ed39f317d106eea936c150a25ec8a89f59f6dbf848a97390d0541ab444dc3c.png

Support vector machines with tidymodels#

Now we build SVMs with tidymodels. We try both linear and non-linear kernels:

  • svm_linear() defines a support vector machine model. For classification, the model tries to maximize the width of the margin between classes (using a linear class boundary). For regression, the model optimizes a robust loss function that is only affected by very large model residuals and uses a linear fit.

  • svm_poly() defines a support vector machine model. For classification, the model tries to maximize the width of the margin between classes using a polynomial class boundary. For regression, the model optimizes a robust loss function that is only affected by very large model residuals and uses polynomial functions of the predictors.

  • svm_rbf() defines a support vector machine model. For classification, the model tries to maximize the width of the margin between classes using a nonlinear class boundary. For regression, the model optimizes a robust loss function that is only affected by very large model residuals and uses nonlinear functions of the predictors.

An example with a regression problem#

data(BODYFAT, package='regclass')

# Using all of the data as the training set is not usually right
# if you want to investigate testing performance.
TRAIN <- BODYFAT

REC <- recipe(BodyFat ~ ., TRAIN) %>%
    step_normalize(all_numeric_predictors()) %>%
    step_dummy(all_nominal_predictors()) %>%
    step_nzv(all_predictors()) %>%
    step_corr(all_predictors()) %>%
    step_lincomb(all_predictors())

WF <- workflow() %>%
    add_recipe(REC) %>%
    add_model(svm_linear( mode = 'regression', cost = tune() ))

GRID <- expand.grid( cost = 10^seq(-1,2,by=0.5) )

RES <- WF %>%
    tune_grid(
        resamples = vfold_cv(TRAIN, v = 5),
        grid = GRID
    )

METRICS <- collect_metrics(RES)

METRICS
A tibble: 14 × 7

  cost        .metric .estimator  mean        n  std_err     .config
  <dbl>       <chr>   <chr>       <dbl>   <int>  <dbl>       <chr>
    0.1000000 rmse    standard    4.2334779   5  0.19911762  Preprocessor1_Model1
    0.1000000 rsq     standard    0.7075926   5  0.01175335  Preprocessor1_Model1
    0.3162278 rmse    standard    4.2334779   5  0.19911762  Preprocessor1_Model2
    0.3162278 rsq     standard    0.7075926   5  0.01175335  Preprocessor1_Model2
    1.0000000 rmse    standard    4.2334779   5  0.19911762  Preprocessor1_Model3
    1.0000000 rsq     standard    0.7075926   5  0.01175335  Preprocessor1_Model3
    3.1622777 rmse    standard    4.2334779   5  0.19911762  Preprocessor1_Model4
    3.1622777 rsq     standard    0.7075926   5  0.01175335  Preprocessor1_Model4
   10.0000000 rmse    standard    4.2334779   5  0.19911762  Preprocessor1_Model5
   10.0000000 rsq     standard    0.7075926   5  0.01175335  Preprocessor1_Model5
   31.6227766 rmse    standard    4.2334779   5  0.19911762  Preprocessor1_Model6
   31.6227766 rsq     standard    0.7075926   5  0.01175335  Preprocessor1_Model6
  100.0000000 rmse    standard    4.2334779   5  0.19911762  Preprocessor1_Model7
  100.0000000 rsq     standard    0.7075926   5  0.01175335  Preprocessor1_Model7
WF <- workflow() %>%
    add_recipe(REC) %>%
    add_model(svm_poly( mode = 'regression', cost = tune(), degree = tune() ))

GRID <- expand.grid( cost = 10^seq(-1,2,by=0.5), degree = seq(1,3) )

RES <- WF %>%
    tune_grid(
        resamples = vfold_cv(TRAIN, v = 5),
        grid = GRID
    )

METRICS <- collect_metrics(RES)

METRICS
A tibble: 42 × 8

  cost        degree  .metric .estimator  mean          n  std_err       .config
  <dbl>       <int>   <chr>   <chr>       <dbl>     <int>  <dbl>         <chr>
    0.1000000 1       rmse    standard      4.2772574   5    0.05810841  Preprocessor1_Model01
    0.1000000 1       rsq     standard      0.6884937   5    0.03654208  Preprocessor1_Model01
    0.3162278 1       rmse    standard      4.2528116   5    0.06796573  Preprocessor1_Model02
    0.3162278 1       rsq     standard      0.6907363   5    0.04110505  Preprocessor1_Model02
    1.0000000 1       rmse    standard      4.2453571   5    0.07287966  Preprocessor1_Model03
    1.0000000 1       rsq     standard      0.6919614   5    0.03914640  Preprocessor1_Model03
    3.1622777 1       rmse    standard      4.2320469   5    0.07855867  Preprocessor1_Model04
    3.1622777 1       rsq     standard      0.6932687   5    0.03972725  Preprocessor1_Model04
   10.0000000 1       rmse    standard      4.2358162   5    0.07803618  Preprocessor1_Model05
   10.0000000 1       rsq     standard      0.6929180   5    0.03957301  Preprocessor1_Model05
   31.6227766 1       rmse    standard      4.2366662   5    0.07725952  Preprocessor1_Model06
   31.6227766 1       rsq     standard      0.6927763   5    0.03956189  Preprocessor1_Model06
  100.0000000 1       rmse    standard      4.2386710   5    0.07667949  Preprocessor1_Model07
  100.0000000 1       rsq     standard      0.6924852   5    0.03955387  Preprocessor1_Model07
    0.1000000 2       rmse    standard      5.2153421   5    0.25317917  Preprocessor1_Model08
    0.1000000 2       rsq     standard      0.5758829   5    0.06429783  Preprocessor1_Model08
    0.3162278 2       rmse    standard      5.7140481   5    0.22669922  Preprocessor1_Model09
    0.3162278 2       rsq     standard      0.5296920   5    0.06855940  Preprocessor1_Model09
    1.0000000 2       rmse    standard      7.5380347   5    1.03636336  Preprocessor1_Model10
    1.0000000 2       rsq     standard      0.4173636   5    0.08193462  Preprocessor1_Model10
    3.1622777 2       rmse    standard      8.5160534   5    1.69340525  Preprocessor1_Model11
    3.1622777 2       rsq     standard      0.3901460   5    0.07930041  Preprocessor1_Model11
   10.0000000 2       rmse    standard      8.6303556   5    1.77936173  Preprocessor1_Model12
   10.0000000 2       rsq     standard      0.3867060   5    0.07652065  Preprocessor1_Model12
   31.6227766 2       rmse    standard      8.5630588   5    1.73669920  Preprocessor1_Model13
   31.6227766 2       rsq     standard      0.3954085   5    0.06915727  Preprocessor1_Model13
  100.0000000 2       rmse    standard      9.2302374   5    1.62033764  Preprocessor1_Model14
  100.0000000 2       rsq     standard      0.3393308   5    0.09226738  Preprocessor1_Model14
    0.1000000 3       rmse    standard     91.3880946   5   68.46649436  Preprocessor1_Model15
    0.1000000 3       rsq     standard      0.1263342   5    0.06111487  Preprocessor1_Model15
    0.3162278 3       rmse    standard    141.5058503   5  120.19382589  Preprocessor1_Model16
    0.3162278 3       rsq     standard      0.1503114   5    0.06539197  Preprocessor1_Model16
    1.0000000 3       rmse    standard    245.4066103   5  222.76878794  Preprocessor1_Model17
    1.0000000 3       rsq     standard      0.1660929   5    0.05626166  Preprocessor1_Model17
    3.1622777 3       rmse    standard    272.7828149   5  249.50410098  Preprocessor1_Model18
    3.1622777 3       rsq     standard      0.1605847   5    0.05414295  Preprocessor1_Model18
   10.0000000 3       rmse    standard    272.7828149   5  249.50410098  Preprocessor1_Model19
   10.0000000 3       rsq     standard      0.1605847   5    0.05414295  Preprocessor1_Model19
   31.6227766 3       rmse    standard    272.7828149   5  249.50410098  Preprocessor1_Model20
   31.6227766 3       rsq     standard      0.1605847   5    0.05414295  Preprocessor1_Model20
  100.0000000 3       rmse    standard    272.7828149   5  249.50410098  Preprocessor1_Model21
  100.0000000 3       rsq     standard      0.1605847   5    0.05414295  Preprocessor1_Model21
WF <- workflow() %>%
    add_recipe(REC) %>%
    add_model(svm_rbf( mode = 'regression', cost = tune(), rbf_sigma = tune() ))

GRID <- expand.grid( cost = 10^seq(-1,2,by=0.5), rbf_sigma = 10^seq(-3,0,by=0.5) )

RES <- WF %>%
    tune_grid(
        resamples = vfold_cv(TRAIN, v = 5),
        grid = GRID
    )

METRICS <- collect_metrics(RES)

METRICS
A tibble: 98 × 8
        cost   rbf_sigma .metric .estimator      mean     n     std_err .config
       <dbl>       <dbl> <chr>   <chr>          <dbl> <int>       <dbl> <chr>
   0.1000000 0.001000000 rmse    standard   7.2193444     5 0.380066743 Preprocessor1_Model01
   0.1000000 0.001000000 rsq     standard   0.4623490     5 0.026279555 Preprocessor1_Model01
   0.3162278 0.001000000 rmse    standard   6.5169477     5 0.391014745 Preprocessor1_Model02
   0.3162278 0.001000000 rsq     standard   0.4890917     5 0.028176366 Preprocessor1_Model02
   1.0000000 0.001000000 rmse    standard   5.6242385     5 0.361280294 Preprocessor1_Model03
   1.0000000 0.001000000 rsq     standard   0.5648684     5 0.033165114 Preprocessor1_Model03
   3.1622777 0.001000000 rmse    standard   4.8112776     5 0.319776846 Preprocessor1_Model04
   3.1622777 0.001000000 rsq     standard   0.6393887     5 0.026069230 Preprocessor1_Model04
  10.0000000 0.001000000 rmse    standard   4.2812841     5 0.287633312 Preprocessor1_Model05
  10.0000000 0.001000000 rsq     standard   0.6991226     5 0.021149283 Preprocessor1_Model05
  31.6227766 0.001000000 rmse    standard   4.0641470     5 0.193576958 Preprocessor1_Model06
  31.6227766 0.001000000 rsq     standard   0.7244006     5 0.009198177 Preprocessor1_Model06
 100.0000000 0.001000000 rmse    standard   4.0613777     5 0.126205181 Preprocessor1_Model07
 100.0000000 0.001000000 rsq     standard   0.7271831     5 0.014283924 Preprocessor1_Model07
   0.1000000 0.003162278 rmse    standard   6.6063137     5 0.396994794 Preprocessor1_Model08
   0.1000000 0.003162278 rsq     standard   0.4930286     5 0.023542752 Preprocessor1_Model08
   0.3162278 0.003162278 rmse    standard   5.6317618     5 0.369515614 Preprocessor1_Model09
   0.3162278 0.003162278 rsq     standard   0.5651315     5 0.026921486 Preprocessor1_Model09
   1.0000000 0.003162278 rmse    standard   4.7901056     5 0.307670897 Preprocessor1_Model10
   1.0000000 0.003162278 rsq     standard   0.6518101     5 0.023240692 Preprocessor1_Model10
   3.1622777 0.003162278 rmse    standard   4.2562479     5 0.261801485 Preprocessor1_Model11
   3.1622777 0.003162278 rsq     standard   0.7085381     5 0.017381385 Preprocessor1_Model11
  10.0000000 0.003162278 rmse    standard   4.0908816     5 0.169849564 Preprocessor1_Model12
  10.0000000 0.003162278 rsq     standard   0.7225774     5 0.013465151 Preprocessor1_Model12
  31.6227766 0.003162278 rmse    standard   4.0932864     5 0.106654798 Preprocessor1_Model13
  31.6227766 0.003162278 rsq     standard   0.7237566     5 0.014452649 Preprocessor1_Model13
 100.0000000 0.003162278 rmse    standard   4.1531276     5 0.105710772 Preprocessor1_Model14
 100.0000000 0.003162278 rsq     standard   0.7172330     5 0.011932147 Preprocessor1_Model14
   0.1000000 0.010000000 rmse    standard   5.8847100     5 0.386888762 Preprocessor1_Model15
   0.1000000 0.010000000 rsq     standard   0.5476825     5 0.028620466 Preprocessor1_Model15
   ⋮ (rows for Model16 through Model34 truncated in the printed output)
 100.0000000 0.1000000   rmse    standard   6.4155504     5 0.32635118  Preprocessor1_Model35
 100.0000000 0.1000000   rsq     standard   0.4360800     5 0.07095699  Preprocessor1_Model35
   0.1000000 0.3162278   rmse    standard   6.4946261     5 0.43613805  Preprocessor1_Model36
   0.1000000 0.3162278   rsq     standard   0.4736080     5 0.03708764  Preprocessor1_Model36
   0.3162278 0.3162278   rmse    standard   5.6362026     5 0.43931174  Preprocessor1_Model37
   0.3162278 0.3162278   rsq     standard   0.5392246     5 0.02671818  Preprocessor1_Model37
   1.0000000 0.3162278   rmse    standard   5.1931752     5 0.38012825  Preprocessor1_Model38
   1.0000000 0.3162278   rsq     standard   0.5722448     5 0.01260412  Preprocessor1_Model38
   3.1622777 0.3162278   rmse    standard   5.2187727     5 0.33864221  Preprocessor1_Model39
   3.1622777 0.3162278   rsq     standard   0.5569203     5 0.02404800  Preprocessor1_Model39
  10.0000000 0.3162278   rmse    standard   5.2571019     5 0.32352796  Preprocessor1_Model40
  10.0000000 0.3162278   rsq     standard   0.5489426     5 0.02967668  Preprocessor1_Model40
  31.6227766 0.3162278   rmse    standard   5.2571019     5 0.32352796  Preprocessor1_Model41
  31.6227766 0.3162278   rsq     standard   0.5489426     5 0.02967668  Preprocessor1_Model41
 100.0000000 0.3162278   rmse    standard   5.2571019     5 0.32352796  Preprocessor1_Model42
 100.0000000 0.3162278   rsq     standard   0.5489426     5 0.02967668  Preprocessor1_Model42
   0.1000000 1.0000000   rmse    standard   7.5253833     5 0.38366783  Preprocessor1_Model43
   0.1000000 1.0000000   rsq     standard   0.2324882     5 0.03983346  Preprocessor1_Model43
   0.3162278 1.0000000   rmse    standard   7.2473472     5 0.40711516  Preprocessor1_Model44
   0.3162278 1.0000000   rsq     standard   0.2583491     5 0.03670843  Preprocessor1_Model44
   1.0000000 1.0000000   rmse    standard   6.7914166     5 0.43822868  Preprocessor1_Model45
   1.0000000 1.0000000   rsq     standard   0.3313275     5 0.03799679  Preprocessor1_Model45
   3.1622777 1.0000000   rmse    standard   6.6751977     5 0.44997757  Preprocessor1_Model46
   3.1622777 1.0000000   rsq     standard   0.3501769     5 0.03892751  Preprocessor1_Model46
  10.0000000 1.0000000   rmse    standard   6.6751804     5 0.44997577  Preprocessor1_Model47
  10.0000000 1.0000000   rsq     standard   0.3501800     5 0.03892593  Preprocessor1_Model47
  31.6227766 1.0000000   rmse    standard   6.6751804     5 0.44997577  Preprocessor1_Model48
  31.6227766 1.0000000   rsq     standard   0.3501800     5 0.03892593  Preprocessor1_Model48
 100.0000000 1.0000000   rmse    standard   6.6751804     5 0.44997577  Preprocessor1_Model49
 100.0000000 1.0000000   rsq     standard   0.3501800     5 0.03892593  Preprocessor1_Model49
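Scanning the table, the best cross-validated RMSE (about 4.06) comes from a large cost with a small rbf_sigma. A sketch of the usual next step, using tune's select_best() and finalize_workflow() with the RES, WF, and TRAIN objects from above:

```r
# Pick the (cost, rbf_sigma) combination with the lowest cross-validated RMSE
BEST <- select_best(RES, metric = "rmse")
BEST

# Fix the winning parameters in the workflow and refit on all of TRAIN
FINAL <- WF %>%
    finalize_workflow(BEST) %>%
    fit(data = TRAIN)
```

select_best() only looks at the mean metric; select_by_one_std_err() is a common alternative when several candidates are statistically indistinguishable, as they are here for the largest costs.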

An example with a classification problem#

data(WINE, package='regclass')

# Using the entire dataset for training is not usually right if we want to
# assess testing performance, but it keeps this example short.
TRAIN <- WINE

REC <- recipe(Quality ~ ., TRAIN) %>%
    step_normalize(all_numeric_predictors()) %>%
    step_dummy(all_nominal_predictors()) %>%
    step_nzv(all_predictors()) %>%
    step_corr(all_predictors()) %>%
    step_lincomb(all_predictors())
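To see what this chain of steps actually does to the predictors (normalization, dummy coding, then dropping near-zero-variance, highly correlated, and linearly dependent columns), the recipe can be estimated and applied on its own. A quick sketch, not required for the workflow below:

```r
# prep() estimates each step from TRAIN; bake(new_data = NULL) returns
# the processed training predictors so you can inspect what survives
REC %>%
    prep() %>%
    bake(new_data = NULL) %>%
    head()
```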

WF <- workflow() %>%
    add_recipe(REC) %>%
    add_model(svm_linear( mode = 'classification', cost = tune() ))

GRID <- expand.grid( cost = 10^seq(-1,2,by=0.5) )

RES <- WF %>%
    tune_grid(
        resamples = vfold_cv(TRAIN, v = 5),
        grid = GRID,
        metrics = metric_set(accuracy, precision, recall)
    )

METRICS <- collect_metrics(RES)

METRICS
A tibble: 21 × 7
        cost .metric   .estimator      mean     n     std_err .config
       <dbl> <chr>     <chr>          <dbl> <int>       <dbl> <chr>
   0.1000000 accuracy  binary     0.8244444     5 0.009910436 Preprocessor1_Model1
   0.1000000 precision binary     0.7837739     5 0.013676757 Preprocessor1_Model1
   0.1000000 recall    binary     0.7616896     5 0.018313221 Preprocessor1_Model1
   0.3162278 accuracy  binary     0.8244444     5 0.009910436 Preprocessor1_Model2
   0.3162278 precision binary     0.7837739     5 0.013676757 Preprocessor1_Model2
   0.3162278 recall    binary     0.7616896     5 0.018313221 Preprocessor1_Model2
   1.0000000 accuracy  binary     0.8244444     5 0.009910436 Preprocessor1_Model3
   1.0000000 precision binary     0.7837739     5 0.013676757 Preprocessor1_Model3
   1.0000000 recall    binary     0.7616896     5 0.018313221 Preprocessor1_Model3
   3.1622777 accuracy  binary     0.8244444     5 0.009910436 Preprocessor1_Model4
   3.1622777 precision binary     0.7837739     5 0.013676757 Preprocessor1_Model4
   3.1622777 recall    binary     0.7616896     5 0.018313221 Preprocessor1_Model4
  10.0000000 accuracy  binary     0.8244444     5 0.009910436 Preprocessor1_Model5
  10.0000000 precision binary     0.7837739     5 0.013676757 Preprocessor1_Model5
  10.0000000 recall    binary     0.7616896     5 0.018313221 Preprocessor1_Model5
  31.6227766 accuracy  binary     0.8244444     5 0.009910436 Preprocessor1_Model6
  31.6227766 precision binary     0.7837739     5 0.013676757 Preprocessor1_Model6
  31.6227766 recall    binary     0.7616896     5 0.018313221 Preprocessor1_Model6
 100.0000000 accuracy  binary     0.8244444     5 0.009910436 Preprocessor1_Model7
 100.0000000 precision binary     0.7837739     5 0.013676757 Preprocessor1_Model7
 100.0000000 recall    binary     0.7616896     5 0.018313221 Preprocessor1_Model7
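Every cost gives identical cross-validated metrics here, so the choice of cost is immaterial on this data. A sketch of finalizing the model and inspecting its predictions; since TRAIN is all of WINE, the resulting confusion matrix is in-sample and therefore optimistic:

```r
# Any cost would do; take the "best" by accuracy anyway
BEST <- select_best(RES, metric = "accuracy")

FINAL <- WF %>%
    finalize_workflow(BEST) %>%
    fit(data = TRAIN)

# augment() attaches a .pred_class column; conf_mat() tabulates it
# against the true Quality labels (resubstitution estimate - optimistic)
augment(FINAL, new_data = TRAIN) %>%
    conf_mat(truth = Quality, estimate = .pred_class)
```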

More experiments and details#

Important

Please continue the above examples:

  • start with training/testing splits;

  • use other modeling functions;

  • plot the metrics with different tuning parameters.

Pros and Cons#

Pros:

  • Models non-linear relationships with ease and with little user input.

  • Actually solves the problem it sets out to solve: given the tuning parameters (cost, \(\epsilon\), etc.), the algorithm finds the optimal model, whereas none of the tree-based methods is guaranteed to stumble upon the best collection of rules.

  • Automatic and implicit feature creation via the “kernel trick” (which largely solves the problem of deciding what the predictor variables should be).

  • Sparse (solution depends only on support vectors); easy to store and make predictions.

  • Not too much tuning required.

  • Strong mathematical and theoretical foundation.

Cons:

  • For classification, it is hard to get “probabilities” of being in a class; the raw predictions are simply class labels.

  • No confidence intervals (but most other methods can’t do this either).

  • While a strong model, it is often not the top performer.

  • Can be very slow for big problems.
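On the probability point: libsvm-based implementations offer a partial workaround called Platt scaling, which fits a logistic curve to the SVM's decision values after training. These are post-hoc estimates, not a native part of the SVM. A sketch with e1071 (already loaded above), using the built-in iris data for illustration:

```r
# probability = TRUE asks libsvm to fit a Platt-scaling model on top of the SVM
FIT <- svm(Species ~ ., data = iris, probability = TRUE)

# The class probabilities come back as an attribute of the predictions
PRED <- predict(FIT, newdata = iris, probability = TRUE)
head(attr(PRED, "probabilities"))
```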