library(tidymodels)
library(e1071)
library(MASS)
library(ISLR)
library(robustbase)

Support Vector Machines#

Support Vector Machines (SVMs) are a relatively recent approach to classification and regression that, like random forests, tend to work well with little user input or tuning.

Contrary to the name, the technique has nothing to do with machines. The word “support” is used only in its mathematical sense, and the word “vector” has nothing to do with the vectors we create in R. But it sure sounds fancy.

With classification, the technique is concerned with finding pivotal individuals in the dataset that seem to play a key role in differentiating one class from another. With regression, it is concerned with finding pivotal individuals whose errors are the important ones to minimize.

Maximum Margin Classifier#

A separating hyperplane is simply a “flat” boundary that separates the data space (all possible combinations of the predictor variables) into two regions.

Maybe there’s a separating hyperplane that can split the data space so that all members of one class are on one side and all members of the other class are on the other side?

With one predictor, the “data space” is just the number line, so a “separating hyperplane” would just be some particular point on the line (or, if you like, a “line” that extends into a second dimension of “no man’s land”, a region outside the data space).

set.seed(474); x <- runif(20,1,3)
y <- rep(0,length(x))
plot(y~x,ylim=c(-.5,.5),col=ifelse(x<1.8,"red","blue"),pch=20,cex=1.4,axes=FALSE,ylab="",xlim=c(1,3))
abline(v=1.78,lwd=2)
abline(h=0)
axis(1,at=seq(1,3,by=0.5))
../_images/673e4fec72a2f1e282fda0a33d608b533b80a38f61a081bd106408ee957e2092.png

With two predictors, the data space is visualized as the standard scatterplot, and a “separating hyperplane” would be a line that bisects it. You can imagine this line extending out towards you and away from you into a third dimension (“no man’s land”; not part of the data space).

set.seed(474); x1 <- runif(30,1,3); x2 <- runif(30,3,5)
plot(x1~x2,col=ifelse(x1+x2<6,"red","blue"),pch=20,cex=1.3)
lines(c(3,5),c(3,1))
../_images/153d7b7ade84dffac5c96a1f191ca3c49476c565ccb6719d5bb1b5908f0fa5d7.png

With four predictor variables, the data space is 4-dimensional, and the separating hyperplane would extend into an “unseen” 5th dimension. There is no way to visualize this, but you get the idea.

Note

The maximum margin classifier seeks the separating hyperplane that puts the most distance between itself and the nearest example of each class (assuming, of course, that it is possible to do so).

Separating Hyperplane#

All four of these “separating hyperplanes” succeed in bisecting the data space into a “blue region” and a “red region”. Classification proceeds by figuring out which side a new individual is on: red or blue? But which of these separating hyperplanes is the “best”?

set.seed(474)
X <- matrix( rnorm(80,.2,.15),ncol=2 )
Y <- matrix( rnorm(80,.73,.17),ncol=2 )
DATA <- as.data.frame( rbind(X,Y) )
names(DATA) <- c("x1","x2")
DATA$class <- factor( c(rep("red",40),rep("blue",40) ))
plot(x2~x1,data=DATA,col=2*as.numeric(DATA$class),pch=20,cex=2)
abline(1,-1.5)
abline(0.9,-1.4)
abline(0.8,-1.2)
abline(3,-6)
../_images/753421c404cc2e137e2724f2359d8b24b4606ad703b5efb5b00a30e7d9362528.png

Maximal Margin Classifiers#

The maximal margin classifier takes the best separating hyperplane to be the one that gives the largest “margin” (i.e., spacing; black arrows) between it and the nearest point of each class (on the plot, the distance from the separating hyperplane to the dotted lines, measured perpendicularly). All points are not only on the correct side of the hyperplane, they are also on the correct side of the margin (red points are above the dashed red line; blue points are below the dashed blue line)!

set.seed(474)
X <- matrix( rnorm(80,.2,.15),ncol=2 )
Y <- matrix( rnorm(80,.73,.17),ncol=2 )
DATA <- as.data.frame( rbind(X,Y) )
names(DATA) <- c("x1","x2")
DATA$class <- factor( c(rep("red",40),rep("blue",40)) )
plot(x2~x1,data=DATA,col=2*as.numeric(DATA$class),pch=20,cex=2)
SVM <- svm(class~.,data=DATA,type='C-classification', kernel='linear',scale=FALSE,cost=1000)
W <- t(SVM$coefs) %*% SVM$SV
abline(SVM$rho/W[2],-W[1]/W[2],lwd=3 )
#points(SVM$SV,pch=8,cex=2)
margin <- mean( abs( (SVM$SV)[,2] - SVM$rho/W[2] + W[1]/W[2]*(SVM$SV)[,1]) )
abline(SVM$rho/W[2]+margin,-W[1]/W[2],lty=3,col="blue",lwd=2 )
abline(SVM$rho/W[2]-margin,-W[1]/W[2],lty=3,col="red",lwd=2 )
arrows(-0.01067408,0.4691493,0.14137917,0.8686653,col="blue",lwd=3)
arrows(0.7782815,0.4236349,0.6204904,-0.01128126,col="red",lwd=3)
arrows(0.1700685,0.5247781,0.2618742,0.7270647,col="black",lwd=2,code=3,length=.1)
../_images/1795bca158bd773c8054d5a199a886dc484b52b02b449599ccd28d2c104174e9.png

Support Vectors#

You’ll note that three of the points have been starred. These are the support vectors. They “support” the separating hyperplane because if their values changed, so would the hyperplane.

Notice that the rest of the data is essentially irrelevant at this point: it’s only the positions of the support vectors that matter. This means the procedure can be quite robust to outliers.
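As a small sketch of this sparsity (not part of the original figures; it reuses the simulated data from above), refitting the classifier on only its support vectors recovers essentially the same hyperplane:

```r
library(e1071)
# Same simulated, separable data as in the figures above
set.seed(474)
X <- matrix( rnorm(80,.2,.15), ncol=2 )
Y <- matrix( rnorm(80,.73,.17), ncol=2 )
DATA <- as.data.frame( rbind(X,Y) )
names(DATA) <- c("x1","x2")
DATA$class <- factor( c(rep("red",40),rep("blue",40)) )
SVM <- svm(class~., data=DATA, type='C-classification',
           kernel='linear', scale=FALSE, cost=1000)
W <- t(SVM$coefs) %*% SVM$SV      # normal vector of the hyperplane
# Refit using ONLY the support vectors (their row numbers are in SVM$index)
SVM2 <- svm(class~., data=DATA[SVM$index,], type='C-classification',
            kernel='linear', scale=FALSE, cost=1000)
W2 <- t(SVM2$coefs) %*% SVM2$SV
max(abs(W - W2))                  # essentially zero: same hyperplane
```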

DATA <- as.data.frame( rbind(X,Y) )
names(DATA) <- c("x1","x2")
DATA$class <- factor( c(rep("red",40),rep("blue",40) ))
plot(x2~x1,data=DATA,col=2*as.numeric(DATA$class),pch=20,cex=2)
SVM <- svm(class~.,data=DATA,type='C-classification', kernel='linear',scale=FALSE,cost=1000)
W <- t(SVM$coefs) %*% SVM$SV
abline(SVM$rho/W[2],-W[1]/W[2],lwd=3 )
points(SVM$SV,pch=8,cex=2)
margin <- mean( abs( (SVM$SV)[,2] - SVM$rho/W[2] + W[1]/W[2]*(SVM$SV)[,1]) )
abline(SVM$rho/W[2]+margin,-W[1]/W[2],lty=3 )
abline(SVM$rho/W[2]-margin,-W[1]/W[2],lty=3 )
../_images/4cbf528d72dcad743952262a384748e1aa4d512180fdff7bda2ab1fe1f4cd48c.png

What if not separable?#

What if a hyperplane doesn’t neatly divide the data space in two so that all of one class lives on one side and all of the other is on the other side?

set.seed(474)
X <- matrix( rnorm(80,.2,.15),ncol=2 )
Y <- matrix( rnorm(80,.5,.17),ncol=2 )
DATA <- as.data.frame( rbind(X,Y) )
names(DATA) <- c("x1","x2")
DATA$class <- factor( c(rep("red",40),rep("blue",40) ))
plot(x2~x1,data=DATA,col=2*as.numeric(DATA$class),pch=20,cex=2)
../_images/0e852024b38169b94360e6f95ac93e1ea767bb1d23f0cba20509bc20ba46ee94.png


Support vector classifiers (aka soft margin classifiers) allow some observations to be on the wrong side of the margin and even on the wrong side of the hyperplane.

set.seed(474)
X <- matrix( rnorm(80,.2,.15),ncol=2 )
Y <- matrix( rnorm(80,.5,.17),ncol=2 )
DATA <- as.data.frame( rbind(X,Y) )
names(DATA) <- c("x1","x2")
DATA$class <- factor( c(rep("red",40),rep("blue",40)) )
plot(x2~x1,data=DATA,col=2*as.numeric(DATA$class),cex=2)
SVM <- svm(class~.,data=DATA,type='C-classification', kernel='linear',scale=FALSE,cost=10)
W <- t(SVM$coefs) %*% SVM$SV
abline(SVM$rho/W[2],-W[1]/W[2],lwd=3 )
points(SVM$SV,pch=20,cex=2,col=as.character(DATA$class[ as.numeric(rownames( SVM$SV ) ) ]) )
margin <- max( abs( (SVM$SV)[,2] - SVM$rho/W[2] + W[1]/W[2]*(SVM$SV)[,1]) )
abline(SVM$rho/W[2]+margin,-W[1]/W[2],lty=3,col="blue",lwd=2 )
abline(SVM$rho/W[2]-margin,-W[1]/W[2],lty=3,col="red",lwd=2 )
arrows(-0.05221021,0.2874830,0.07263393,0.7178204,col="blue",lwd=2)
arrows(0.7569648,0.4267098,0.6321207,-0.1091025,col="red",lwd=2)
../_images/c98f96ae7847d6ac5e82ebe8842022487c333ffa5488eb5f36e9cc4d370c36db.png

Soft margin classifier details#

The soft margin classifier depends on a parameter \(C\). This is essentially a “budget” that allows for some misclassifications.

  • When \(C>0\), no more than \(C\) observations can be on the wrong side of the hyperplane.

  • Larger values of \(C\) allow more misclassifications (and more support vectors).

  • Smaller values of \(C\) allow fewer misclassifications (and fewer support vectors).

The “best” value of \(C\) is typically estimated with cross-validation, since its value will depend on how mixed together the classes are. Smaller values of \(C\) yield models that are more complex and fit the training data better (since few misclassifications are allowed). Larger values of \(C\) yield simpler models which may generalize better (at the cost of less accuracy). Be aware that the cost argument in software such as \(e1071\) penalizes violations rather than budgeting them, so it acts roughly as the inverse of this \(C\): a large cost corresponds to a small budget.
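A quick sketch with \(e1071\) (reusing the overlapping data from the previous figure): because the software’s cost argument penalizes violations, increasing it typically leaves fewer points inside the margin, and hence fewer support vectors.

```r
library(e1071)
# Overlapping two-class data, as in the soft margin example above
set.seed(474)
X <- matrix( rnorm(80,.2,.15), ncol=2 )
Y <- matrix( rnorm(80,.5,.17), ncol=2 )
DATA <- as.data.frame( rbind(X,Y) )
names(DATA) <- c("x1","x2")
DATA$class <- factor( c(rep("red",40),rep("blue",40)) )
# Larger cost = a smaller misclassification budget = fewer support vectors
for (cost in c(0.01, 1, 100)) {
  SVM <- svm(class~., data=DATA, type='C-classification',
             kernel='linear', cost=cost)
  cat("cost =", cost, "-> support vectors:", nrow(SVM$SV), "\n")
}
```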

Kernel Trick in Support Vector Machine#

Pictured is an extreme artificial case where no hyperplane would do a good job separating the classes, but there is still a distinct but nonlinear boundary.

set.seed(474)
plot(0,0,col="white",xlim=c(-1,1),ylim=c(-1,1),xlab="x1",ylab="x2")
INNER <- matrix(0,ncol=2,nrow=150)
OUTER <- INNER
i <- 1
while( i <= 150 ) {
  x <- runif(2,-1,1)
  if( x[1]^2 + x[2]^2 <= .5^2 ) { INNER[i,] <- x; i <- i+ 1}
}
i<-1
while( i <= 150 ) {
  x <- runif(2,-1,1)
  if( x[1]^2 + x[2]^2 <= 1 & x[1]^2 + x[2]^2 >= .5^2 ) { OUTER[i,] <- x; i <- i+ 1}
}
points(INNER,cex=1,pch=20,col="red")
points(OUTER,cex=1,pch=20,col="blue")
../_images/cd6c3d1794b084e7b8cbd6549c405a10c1c8449b31cdffa356e2a6317b67042a.png

Support vector machines are somewhat unique in the field of data mining and machine learning in that they take the philosophy that the model is good; it’s the data that needs to become more flexible.

The support vector machine believes that a linear regression (for numerical \(y\)) is a great model, it’s just that the original data space isn’t the place to fit it. Likewise, it believes that a flat boundary (separating hyperplane) is a great way to make classifications, it’s just the original data space isn’t the place to use it.

This is in stark contrast to other models, which take the observed values (or transformations or derived quantities) as given, then modify the model until there’s a good fit.

The support vector machine is an extension to the soft margin classifier that uses the “kernel trick” to discern non-linear boundaries between classes.

The mathematics of the kernel trick are beyond the scope of this course, but in essence it takes the original data and projects it into a data space of higher dimensionality, where it hopes it can find a separating hyperplane.

Usually, the “curse of dimensionality” means that tasks in high dimensions (many predictors) become difficult. Oddly enough, here projecting into a higher dimension sometimes makes the task easier.

Kernel Trick#

Imagine classifying individuals as red or blue depending on the value of \(x\). It is clear that no separating hyperplane exists. It is also clear that a non-linear boundary exists between the two classes. However, this “boundary” reaches outside the data space (the one-dimensional line).

plot(0,0,col="white",xlab="x",ylab="",xlim=c(0,1),ylim=c(-.5,.5),yaxt='n')
abline(h=0)
set.seed(474)
x <- runif(30)
x[1:2] <- c(.38,.52)
y <- factor( ifelse(x < 0.6 & x >0.37,"red","blue") )
points(x,rep(0,length(x)),col=as.character(y),pch=20,cex=1.5)
curve( 10*(x-0.35)*(x-0.55), add=TRUE )
../_images/aec46da479c34014c9dfbd866d77b605223f69bca96087b58fd851a1b5e7cec3.png

The problem is that figuring out a non-linear boundary is very difficult because there are so many possible shapes it could take. Lines and hyperplanes are much easier to work with, so instead of trying to figure out the non-linear boundary, the “kernel trick” takes the data and projects it into a higher dimensional space where it hopes a separating hyperplane exists.

What does “projecting into a higher dimensional space” even mean? It basically means creating new variables from ones that you already have. If \(x\) is the one predictor you have collected, then using \(x^2\) and \(\log x\) as additional predictors essentially makes the data space three-dimensional.

These “new” predictors don’t contain any new information about the classes of course, but oddly enough, they can make the boundary between classes linear.
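As a sketch of this idea on the 1D example above (same simulated data; the xsq column name is ours), adding \(x^2\) as a second predictor lets a plain linear SVM separate the classes:

```r
library(e1071)
# The 1D toy data from the figure above
set.seed(474)
x <- runif(30)
x[1:2] <- c(.38,.52)
y <- factor( ifelse(x < 0.6 & x > 0.37, "red", "blue") )
# No separating point exists on the number line, but in the (x, x^2)
# data space a linear boundary separates the two classes
D <- data.frame(x=x, xsq=x^2, y=y)
SVM <- svm(y~x+xsq, data=D, type='C-classification',
           kernel='linear', scale=FALSE, cost=1000)
mean(predict(SVM) == D$y)   # training accuracy
```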

Project from one to two dimensions#

There was no separating hyperplane in the 1D data space (just \(x\) as a predictor). But by projecting into a 2D data space by adding \(x^2\) as a predictor, all of a sudden there is a separating hyperplane.

plot(x,x^2,col=as.character(y),xlab="x",ylab="square of x",xlim=c(0,1),pch=20)
m <- (.52^2-.38^2)/(.52-.38)
b <- .38^2 - m*.38
abline(b+.005,m)
../_images/856bf6b911c193d2b460b75e52c10bdb3cd16cb607a2590b4110026b97986869.png

No separating hyperplane in two dimensions#

There is no separating hyperplane in the 2D dataspace using the predictors \(x_1\) and \(x_2\). But we do see that a curved boundary exists.

set.seed(474)
plot(0,0,col="white",xlim=c(-1,1),ylim=c(-1,1),xlab="x1",ylab="x2")
INNER <- matrix(0,ncol=2,nrow=150)
OUTER <- INNER
i <- 1
while( i <= 150 ) {
  x <- runif(2,-1,1)
  if( x[1]^2 + x[2]^2 <= .48^2 ) { INNER[i,] <- x; i <- i+ 1}
}
i<-1
while( i <= 150 ) {
  x <- runif(2,-1,1)
  if( x[1]^2 + x[2]^2 <= 1 & x[1]^2 + x[2]^2 >= .52^2 ) { OUTER[i,] <- x; i <- i+ 1}
}
points(INNER,col="red")
points(OUTER,col="blue")
../_images/81157d7d10314e359224c3a930626df55276bcc2d432b32c357e0e0e3e3c0e51.png

Projecting from two to three dimensions#

By adding a third “predictor” \(x_3 = x_1^2 + x_2^2\) (no new information about the problem), the data space becomes 3D. Now there is a separating hyperplane that neatly divides the space into a “red region” and “blue region”, achieving 100% accuracy.

DATA <- data.frame(x1=rbind(INNER,OUTER)[,1],x2=rbind(INNER,OUTER)[,2])
DATA$x3 <- DATA$x1^2 + DATA$x2^2
plot(0,0,col="white",xlim=c(-1,1),ylim=c(0,1),xlab="x1",ylab="x3")
points(DATA[,c('x1','x3')],col=c(rep("red",150),rep("blue",150)))
../_images/f6a24598a1973888bf345316dd62dbbe076f060be1435bfde773d645e60b5135.png

Adding additional predictors to project the data into a higher dimensional space often works, but the question is: which predictors? There are infinitely many possible combinations.

SVMs use the “kernel trick” to get around this problem. A kernel is just a matrix whose values record a somewhat abstract measure of similarity (a kind of closeness score) between individuals. The values of the kernel matrix are given by an equation we choose, like

\[K[i,j] = (1 + x_{1,i}x_{1,j} + x_{2,i}x_{2,j})^3\]

The amazing thing is that by defining the kernel matrix, we can implicitly define additional predictors without specifying exactly what they are or how to get them.
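For instance (a toy sketch, not from the original notes), this kernel matrix can be computed entirely from inner products of the original two predictors; no higher-dimensional coordinates are ever constructed:

```r
# Polynomial kernel K[i,j] = (1 + x1_i*x1_j + x2_i*x2_j)^3 for a tiny dataset
set.seed(474)
X <- cbind( x1=runif(5), x2=runif(5) )
K <- (1 + X %*% t(X))^3        # all pairwise "similarities" at once
K[1,2]                          # similarity between individuals 1 and 2
all.equal( K[1,2], (1 + sum(X[1,]*X[2,]))^3 )  # matches the formula
```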

Kernels#

Note

It turns out that the equation for the optimal separating hyperplane does not depend on the exact locations (coordinates) of the individuals in the higher-dimensional space we are projecting into. Rather, it depends only on the entries in the kernel matrix, i.e., on how we quantify the similarity between two individuals.

  • We don’t actually have to project each individual into the higher-dimensional space; we only have to compute the kernel/similarities (which is more computationally efficient).

  • The kernel implicitly captures the higher-dimensional space without explicitly having to construct it. This even allows projecting into infinite-dimensional spaces when exploring the existence of a separating hyperplane.

This is all very trippy stuff!

If you are pressed for a simple explanation, try out the following oversimplification.

Note

When doing classification, the support vector machine assigns a score to each individual, where the score is some number derived from transformations of the predictor variables. If that score exceeds a certain threshold, the individual is assigned class A. Otherwise it is assigned class B.
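You can peek at these scores in \(e1071\): predict() can return the raw decision values that get thresholded to produce the class labels. (A sketch on a two-species subset of the built-in iris data, not part of the original notes.)

```r
library(e1071)
# Two-class subset of iris: each flower gets a numeric score, and the sign
# of the score determines the predicted class
D <- droplevels( subset(iris, Species != "virginica") )
SVM <- svm(Species~Petal.Length+Petal.Width, data=D,
           type='C-classification', kernel='linear')
p <- predict(SVM, D, decision.values=TRUE)
scores <- attr(p, "decision.values")
head(scores)   # positive scores -> one class, negative -> the other
table(predicted=p, actual=D$Species)
```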

Radial Basis Kernel#

Like “magic”, the SVM with a radial basis kernel is able to recognize that the boundary between classes is non-linear, and it does a pretty good job of figuring it out. In the classification plot produced by plot.svm, the X’s mark the support vectors and the O’s mark the remaining data points.

DATA <- as.data.frame( rbind(INNER,OUTER) )
names(DATA) <- c("x1","x2")
DATA$y <- factor( c(rep("red",150),rep("blue",150)))
plot(DATA[,2:1],col=as.character(DATA[,3]),pch=20)
SVM <- svm(y~.,data=DATA,type='C-classification', kernel='radial',scale=FALSE,cost=50)
plot(SVM,DATA)
../_images/817c87c377c7fadc062e0fa7b9d1787895f42f174c18fedb484ba65379e31fb4.png ../_images/a89fd09957936450d20dd312e16985589969fff3ebbdf72d295eb9a4b109ee3a.png

Another example.

DATA <- as.data.frame(matrix(runif(300,-1,1),ncol=2))
DATA[,2] <- DATA[,2]+.5
names(DATA) <- c("x1","x2")
DATA$y <- factor( ifelse(DATA[,2] > DATA[,1]^2,"red","blue"))
plot(DATA[,2:1],col=as.character(DATA[,3]),pch=20)
curve(sqrt(abs(pmax(0,x))),add=TRUE,n=1001,from=-.01,to=1.7)
curve(-sqrt(abs(pmax(0,x))),add=TRUE,n=1001,from=-.01,to=1.7)
SVM <- svm(y~.,data=DATA,type='C-classification', kernel='radial',scale=FALSE,cost=15)
plot(SVM,DATA)
../_images/865a37c7facabc1e5621d8d31249f9692eedf686429d94322e3327c39c2021c6.png ../_images/0d10005b3e04abc95414282256dacb0ec08f2bc361484b1347bb33b121e20772.png

Support Vector Machines for classification summary#

While the whole procedure seems a bit wild and too good to be true, SVMs often work well in practice. The main considerations are:

  • What kernel to use? Popular choices are the polynomial kernel (new predictors are polynomial combinations of the original ones) and the radial basis kernel / Gaussian kernel (which projects into the infinite dimensional space by essentially using an infinite number of polynomials).

  • What parameters of the kernel should be used? The polynomial kernel needs a “degree” parameter (how many powers of the predictors should be used). The radial basis kernel needs a “spread” parameter (\(\gamma\)), which controls how quickly the values inside the kernel matrix decrease with the distance between individuals.

  • What is the “cost parameter”, i.e., the penalty to apply to a proposed separating hyperplane when it makes misclassifications? Larger costs make the model fit the training data better (lower bias) at the expense of increasing its variance (sensitivity to the training data) and its risk of overfitting.

The Support Vector Machine (SVM) approaches classification in a new and interesting way.

  • Find a “separating hyperplane” (basically a flat plane) that divides the data space into two parts where most (hopefully all) members of a class are on one side of the plane.

  • The location of this separating hyperplane depends on only a few pivotal individuals in the dataset (the support vectors), so the solution is “sparse”. Because so few points are needed, the prediction algorithm is extremely memory- and computationally-efficient (once the model is fit).

Problem: the boundary between classes may not be well-described by a straight edge!

Solution: project the data into a higher-dimensional data space (i.e., create new variables from the ones we already have) where the boundary is better described by a straight edge.

Support Vector Machines for Regression#

Support Vector Machines can also be used to predict numerical responses (an SVM just “won” a class competition in BAS 479). The procedure borrows heavily from that used for classification.

  • SVM assumes that there is a linear relationship between the \(y\) variable and some predictor variables (the \(x\)’s). However, these predictors may not be the raw values that have been collected.

  • We will project the data into a higher dimensional data space with the “kernel trick”, thereby implicitly creating new predictor variables. With the “right” kernel, relationships may be approximately linear.

  • A linear regression model is fit to the relationship in this higher-dimensional space but with a twist: not all errors made by the model count against it.

  • Instead of choosing the form of the linear regression by minimizing its sum of squared errors, we will minimize the sum of the absolute values of only those errors that are “large” (greater than some threshold).

  • The support vectors will be the points whose errors contributed to the form of the model, so once again the model is “sparse” (it depends on only a few pivotal individuals).
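The last two points can be sketched with \(e1071\)’s eps-regression (simulated data of our own choosing): only points with errors in excess of \(\epsilon\) become support vectors, so raising \(\epsilon\) makes the model sparser.

```r
library(e1071)
# Noisy linear data (similar in spirit to the figure below)
set.seed(479)
x <- seq(2, 10, length=50)
y <- 5*x + 20 + rnorm(length(x), 0, 10)
# Only points whose errors exceed epsilon become support vectors
for (eps in c(0.1, 5, 20)) {
  SVM <- svm(x, y, type='eps-regression', kernel='linear',
             scale=FALSE, epsilon=eps)
  cat("epsilon =", eps, "-> support vectors:", nrow(SVM$SV), "\n")
}
```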

How it works#

The errors (lines connecting the dots to the solid blue line) “don’t count” if they are smaller than some threshold \(\epsilon\) (thick vertical black line segment). SVM finds the line whose coefficients are closest to 0 (i.e., regularization is being performed) and whose errors (specifically the errors in excess of \(\epsilon\), given by the lengths of the red segments) are smallest. The support vectors (stars) are the points whose errors are in excess of \(\epsilon\).

set.seed(479)
x <- seq(2,10,length=10)
y <- 5*x + 20 + rnorm(length(x),0,10)
plot(y~x,pch=20,cex=1.5,xlim=c(1,11),ylim=c(10,90))
M <- lm(y~x)
abline(lsfit(x,y),col="blue",lwd=2)
slope <- coef(lm(y~x))[2]
intercept <- coef(M)[1]
abline(intercept-10,slope,lty=2,col="blue")
abline(intercept+10,slope,lty=2,col="blue")
for(i in 1:10) { lines( c(x[i],x[i]), c(y[i],fitted(M)[i]), col="green", lwd=ifelse( abs(y[i]-fitted(M)[i]) > 10, 2, 1 ), lty=1 ) }
lines( c(6.7,6.7), c( predict(M,newdata=data.frame(x=6.7)), predict(M,newdata=data.frame(x=6.7))+10 ),lwd=4)
text(6.5,63,expression(epsilon),cex=1.5)
oops <- which( abs( fitted(M) - y ) > 10 )
for (i in oops) {
  lines(  c(x[i],x[i]), c(y[i],fitted(M)[i] + 10*ifelse(y[i]>fitted(M)[i],1,-1)  ) ,col="red",lwd=3 )
}
points(x,y,pch=ifelse( abs(y-fitted(M))>10,8,1),cex=1.2)
../_images/67c05fd468e52aa948ff7e702ce671d585e524b5afc15aa3421e99875a091d74.png

Bias-variance tradeoff#

When the form of the SVM is being determined, it balances the bias and variance of the resulting model through the use of a cost parameter \(C\).

  • Large values of \(C\) focus the attention towards the errors made by the model. The result will be a model that has a low bias (it fits the training data well) but potentially high variance (poor generalizability due to sensitivity to the specific individuals in the training data).

  • Small values of \(C\) focus the attention towards the size of the coefficients. The result will be a model that makes a larger error on the training data (higher bias since it won’t have as much “wiggle room” to fit the data) but that has a lower variance (not as sensitive to the individuals in the data).

The optimal values of the cost parameter \(C\), the error threshold \(\epsilon\), and any kernel-specific parameters (the degree of a polynomial kernel and the \(\gamma\) parameter of a radial basis kernel) are found via cross-validation.

Larger cost = errors are weighted more. The model will fit the training data better, but could be too wiggly to generalize.

Lower cost = coefficients are weighted more. The model will not be as wiggly and will make larger errors on the training data, but at least it is not overfit.

set.seed(474)
x <- sort( runif(50,2,10) )
y <- x^2*sin(x) + 200 + rnorm(50,0,10)
plot(y~x,pch=20,cex=1.5)
SVM <- svm(x,y,scale=TRUE,cost=.1)
new <- data.frame(x=seq(2,10,length=500) )
pred.y <- predict(SVM,newdata=new,scale=TRUE)
lines(new$x,pred.y,col="black")
SVM <- svm(x,y,scale=TRUE,cost=10)
new <- data.frame(x=seq(2,10,length=500) )
pred.y <- predict(SVM,newdata=new,scale=TRUE)
lines(new$x,pred.y,col="red")
legend("topleft",c("Cost=0.1","Cost=10"),lty=1,col=c("black","red"))
../_images/c818a47f7d6d610c1a26adbe8e8e6b42c7a24646edd8ecf957b1a33c7d0c6075.png

Tuning parameters and making predictions#

Tuning the parameters in \(e1071\)#

The \(e1071\) package offers additional options that are not available through \(train\) and \(caret\) that you should be aware of. The cost and \(\epsilon\) (epsilon) parameters must always be tuned. The default kernel is the radial basis kernel, and this requires additional tuning of the \(\gamma\) (“gamma”) parameter (how quickly the entries of the kernel matrix go to 0 for points that are separated by larger distances). It’s possible to use \(tune\) from the \(e1071\) library.

# perform a grid search
SVMparams <- tune("svm", y~x,ranges=
                    list(epsilon=seq(0,1,0.2),cost=2^(2:8),gamma=10^(seq(-3,1,.5))))
SVMparams
Parameter tuning of ‘svm’:

- sampling method: 10-fold cross validation 

- best parameters:
 epsilon cost     gamma
     0.2  256 0.3162278

- best performance: 79.12614 

Plotting the selected model#

In this case, the defaults do a pretty good job (at least visually), but the selected parameters are designed to offer a reasonable tradeoff between the bias and variance of the model so that it generalizes well to new data.

plot(y~x,pch=20,cex=1.5)
SVM <- svm(x,y,scale=TRUE,cost=256,gamma=0.3162278,epsilon=0.2)
new <- data.frame(x=seq(2,10,length=500) )
pred.y <- predict(SVM,newdata=new,scale=TRUE)
lines(new$x,pred.y,col="red")
../_images/f0ed39f317d106eea936c150a25ec8a89f59f6dbf848a97390d0541ab444dc3c.png

Support vector machines with tidymodels#

Now we build SVMs with tidymodels. We try both linear and non-linear kernels:

  • svm_linear() defines a support vector machine model. For classification, the model tries to maximize the width of the margin between classes (using a linear class boundary). For regression, the model optimizes a robust loss function that is only affected by very large model residuals and uses a linear fit.

  • svm_poly() defines a support vector machine model. For classification, the model tries to maximize the width of the margin between classes using a polynomial class boundary. For regression, the model optimizes a robust loss function that is only affected by very large model residuals and uses polynomial functions of the predictors.

  • svm_rbf() defines a support vector machine model. For classification, the model tries to maximize the width of the margin between classes using a nonlinear class boundary. For regression, the model optimizes a robust loss function that is only affected by very large model residuals and uses nonlinear functions of the predictors.

An example with a regression problem#

data(BODYFAT, package='regclass')

# Using all of the data as the training set is not usually right
# if you want to investigate testing performance.
TRAIN <- BODYFAT

REC <- recipe(BodyFat ~ ., TRAIN) %>%
    step_normalize(all_numeric_predictors()) %>%
    step_dummy(all_nominal_predictors()) %>%
    step_nzv(all_predictors()) %>%
    step_corr(all_predictors()) %>%
    step_lincomb(all_predictors())

WF <- workflow() %>%
    add_recipe(REC) %>%
    add_model(svm_linear( mode = 'regression', cost = tune() ))

GRID <- expand.grid( cost = 10^seq(-1,2,by=0.5) )

RES <- WF %>%
    tune_grid(
        resamples = vfold_cv(TRAIN, v = 5),
        grid = GRID
    )

METRICS <- collect_metrics(RES)

METRICS
A tibble: 14 × 7

  cost        .metric .estimator  mean        n  std_err     .config
  <dbl>       <chr>   <chr>       <dbl>   <int>  <dbl>       <chr>
    0.1000000 rmse    standard    4.2334779   5  0.19911762  Preprocessor1_Model1
    0.1000000 rsq     standard    0.7075926   5  0.01175335  Preprocessor1_Model1
    0.3162278 rmse    standard    4.2334779   5  0.19911762  Preprocessor1_Model2
    0.3162278 rsq     standard    0.7075926   5  0.01175335  Preprocessor1_Model2
    1.0000000 rmse    standard    4.2334779   5  0.19911762  Preprocessor1_Model3
    1.0000000 rsq     standard    0.7075926   5  0.01175335  Preprocessor1_Model3
    3.1622777 rmse    standard    4.2334779   5  0.19911762  Preprocessor1_Model4
    3.1622777 rsq     standard    0.7075926   5  0.01175335  Preprocessor1_Model4
   10.0000000 rmse    standard    4.2334779   5  0.19911762  Preprocessor1_Model5
   10.0000000 rsq     standard    0.7075926   5  0.01175335  Preprocessor1_Model5
   31.6227766 rmse    standard    4.2334779   5  0.19911762  Preprocessor1_Model6
   31.6227766 rsq     standard    0.7075926   5  0.01175335  Preprocessor1_Model6
  100.0000000 rmse    standard    4.2334779   5  0.19911762  Preprocessor1_Model7
  100.0000000 rsq     standard    0.7075926   5  0.01175335  Preprocessor1_Model7
WF <- workflow() %>%
    add_recipe(REC) %>%
    add_model(svm_poly( mode = 'regression', cost = tune(), degree = tune() ))

GRID <- expand.grid( cost = 10^seq(-1,2,by=0.5), degree = seq(1,3) )

RES <- WF %>%
    tune_grid(
        resamples = vfold_cv(TRAIN, v = 5),
        grid = GRID
    )

METRICS <- collect_metrics(RES)

METRICS
A tibble: 42 × 8

  cost        degree  .metric .estimator  mean          n  std_err       .config
  <dbl>       <int>   <chr>   <chr>       <dbl>     <int>  <dbl>         <chr>
    0.1000000 1       rmse    standard      4.2772574   5    0.05810841  Preprocessor1_Model01
    0.1000000 1       rsq     standard      0.6884937   5    0.03654208  Preprocessor1_Model01
    0.3162278 1       rmse    standard      4.2528116   5    0.06796573  Preprocessor1_Model02
    0.3162278 1       rsq     standard      0.6907363   5    0.04110505  Preprocessor1_Model02
    1.0000000 1       rmse    standard      4.2453571   5    0.07287966  Preprocessor1_Model03
    1.0000000 1       rsq     standard      0.6919614   5    0.03914640  Preprocessor1_Model03
    3.1622777 1       rmse    standard      4.2320469   5    0.07855867  Preprocessor1_Model04
    3.1622777 1       rsq     standard      0.6932687   5    0.03972725  Preprocessor1_Model04
   10.0000000 1       rmse    standard      4.2358162   5    0.07803618  Preprocessor1_Model05
   10.0000000 1       rsq     standard      0.6929180   5    0.03957301  Preprocessor1_Model05
   31.6227766 1       rmse    standard      4.2366662   5    0.07725952  Preprocessor1_Model06
   31.6227766 1       rsq     standard      0.6927763   5    0.03956189  Preprocessor1_Model06
  100.0000000 1       rmse    standard      4.2386710   5    0.07667949  Preprocessor1_Model07
  100.0000000 1       rsq     standard      0.6924852   5    0.03955387  Preprocessor1_Model07
    0.1000000 2       rmse    standard      5.2153421   5    0.25317917  Preprocessor1_Model08
    0.1000000 2       rsq     standard      0.5758829   5    0.06429783  Preprocessor1_Model08
    0.3162278 2       rmse    standard      5.7140481   5    0.22669922  Preprocessor1_Model09
    0.3162278 2       rsq     standard      0.5296920   5    0.06855940  Preprocessor1_Model09
    1.0000000 2       rmse    standard      7.5380347   5    1.03636336  Preprocessor1_Model10
    1.0000000 2       rsq     standard      0.4173636   5    0.08193462  Preprocessor1_Model10
    3.1622777 2       rmse    standard      8.5160534   5    1.69340525  Preprocessor1_Model11
    3.1622777 2       rsq     standard      0.3901460   5    0.07930041  Preprocessor1_Model11
   10.0000000 2       rmse    standard      8.6303556   5    1.77936173  Preprocessor1_Model12
   10.0000000 2       rsq     standard      0.3867060   5    0.07652065  Preprocessor1_Model12
   31.6227766 2       rmse    standard      8.5630588   5    1.73669920  Preprocessor1_Model13
   31.6227766 2       rsq     standard      0.3954085   5    0.06915727  Preprocessor1_Model13
  100.0000000 2       rmse    standard      9.2302374   5    1.62033764  Preprocessor1_Model14
  100.0000000 2       rsq     standard      0.3393308   5    0.09226738  Preprocessor1_Model14
    0.1000000 3       rmse    standard     91.3880946   5   68.46649436  Preprocessor1_Model15
    0.1000000 3       rsq     standard      0.1263342   5    0.06111487  Preprocessor1_Model15
    0.3162278 3       rmse    standard    141.5058503   5  120.19382589  Preprocessor1_Model16
    0.3162278 3       rsq     standard      0.1503114   5    0.06539197  Preprocessor1_Model16
    1.0000000 3       rmse    standard    245.4066103   5  222.76878794  Preprocessor1_Model17
    1.0000000 3       rsq     standard      0.1660929   5    0.05626166  Preprocessor1_Model17
    3.1622777 3       rmse    standard    272.7828149   5  249.50410098  Preprocessor1_Model18
    3.1622777 3       rsq     standard      0.1605847   5    0.05414295  Preprocessor1_Model18
   10.0000000 3       rmse    standard    272.7828149   5  249.50410098  Preprocessor1_Model19
   10.0000000 3       rsq     standard      0.1605847   5    0.05414295  Preprocessor1_Model19
   31.6227766 3       rmse    standard    272.7828149   5  249.50410098  Preprocessor1_Model20
   31.6227766 3       rsq     standard      0.1605847   5    0.05414295  Preprocessor1_Model20
  100.0000000 3       rmse    standard    272.7828149   5  249.50410098  Preprocessor1_Model21
  100.0000000 3       rsq     standard      0.1605847   5    0.05414295  Preprocessor1_Model21
WF <- workflow() %>%
    add_recipe(REC) %>%
    add_model(svm_rbf( mode = 'regression', cost = tune(), rbf_sigma = tune() ))

GRID <- expand.grid( cost = 10^seq(-1,2,by=0.5), rbf_sigma = 10^seq(-3,0,by=0.5) )

RES <- WF %>%
    tune_grid(
        resamples = vfold_cv(TRAIN, v = 5),
        grid = GRID
    )

METRICS <- collect_metrics(RES)

METRICS
A tibble: 98 × 8
        cost   rbf_sigma .metric .estimator      mean     n     std_err .config
       <dbl>       <dbl> <chr>   <chr>          <dbl> <int>       <dbl> <chr>
   0.1000000 0.001000000 rmse    standard   7.2193444     5 0.380066743 Preprocessor1_Model01
   0.1000000 0.001000000 rsq     standard   0.4623490     5 0.026279555 Preprocessor1_Model01
   0.3162278 0.001000000 rmse    standard   6.5169477     5 0.391014745 Preprocessor1_Model02
   0.3162278 0.001000000 rsq     standard   0.4890917     5 0.028176366 Preprocessor1_Model02
   1.0000000 0.001000000 rmse    standard   5.6242385     5 0.361280294 Preprocessor1_Model03
   1.0000000 0.001000000 rsq     standard   0.5648684     5 0.033165114 Preprocessor1_Model03
   3.1622777 0.001000000 rmse    standard   4.8112776     5 0.319776846 Preprocessor1_Model04
   3.1622777 0.001000000 rsq     standard   0.6393887     5 0.026069230 Preprocessor1_Model04
  10.0000000 0.001000000 rmse    standard   4.2812841     5 0.287633312 Preprocessor1_Model05
  10.0000000 0.001000000 rsq     standard   0.6991226     5 0.021149283 Preprocessor1_Model05
  31.6227766 0.001000000 rmse    standard   4.0641470     5 0.193576958 Preprocessor1_Model06
  31.6227766 0.001000000 rsq     standard   0.7244006     5 0.009198177 Preprocessor1_Model06
 100.0000000 0.001000000 rmse    standard   4.0613777     5 0.126205181 Preprocessor1_Model07
 100.0000000 0.001000000 rsq     standard   0.7271831     5 0.014283924 Preprocessor1_Model07
   0.1000000 0.003162278 rmse    standard   6.6063137     5 0.396994794 Preprocessor1_Model08
   0.1000000 0.003162278 rsq     standard   0.4930286     5 0.023542752 Preprocessor1_Model08
   0.3162278 0.003162278 rmse    standard   5.6317618     5 0.369515614 Preprocessor1_Model09
   0.3162278 0.003162278 rsq     standard   0.5651315     5 0.026921486 Preprocessor1_Model09
   1.0000000 0.003162278 rmse    standard   4.7901056     5 0.307670897 Preprocessor1_Model10
   1.0000000 0.003162278 rsq     standard   0.6518101     5 0.023240692 Preprocessor1_Model10
   3.1622777 0.003162278 rmse    standard   4.2562479     5 0.261801485 Preprocessor1_Model11
   3.1622777 0.003162278 rsq     standard   0.7085381     5 0.017381385 Preprocessor1_Model11
  10.0000000 0.003162278 rmse    standard   4.0908816     5 0.169849564 Preprocessor1_Model12
  10.0000000 0.003162278 rsq     standard   0.7225774     5 0.013465151 Preprocessor1_Model12
  31.6227766 0.003162278 rmse    standard   4.0932864     5 0.106654798 Preprocessor1_Model13
  31.6227766 0.003162278 rsq     standard   0.7237566     5 0.014452649 Preprocessor1_Model13
 100.0000000 0.003162278 rmse    standard   4.1531276     5 0.105710772 Preprocessor1_Model14
 100.0000000 0.003162278 rsq     standard   0.7172330     5 0.011932147 Preprocessor1_Model14
   0.1000000 0.010000000 rmse    standard   5.8847100     5 0.386888762 Preprocessor1_Model15
   0.1000000 0.010000000 rsq     standard   0.5476825     5 0.028620466 Preprocessor1_Model15
   ⋮ (rows for Model16 through Model34 truncated in the printed output)
 100.0000000 0.1000000   rmse    standard   6.4155504     5 0.32635118  Preprocessor1_Model35
 100.0000000 0.1000000   rsq     standard   0.4360800     5 0.07095699  Preprocessor1_Model35
   0.1000000 0.3162278   rmse    standard   6.4946261     5 0.43613805  Preprocessor1_Model36
   0.1000000 0.3162278   rsq     standard   0.4736080     5 0.03708764  Preprocessor1_Model36
   0.3162278 0.3162278   rmse    standard   5.6362026     5 0.43931174  Preprocessor1_Model37
   0.3162278 0.3162278   rsq     standard   0.5392246     5 0.02671818  Preprocessor1_Model37
   1.0000000 0.3162278   rmse    standard   5.1931752     5 0.38012825  Preprocessor1_Model38
   1.0000000 0.3162278   rsq     standard   0.5722448     5 0.01260412  Preprocessor1_Model38
   3.1622777 0.3162278   rmse    standard   5.2187727     5 0.33864221  Preprocessor1_Model39
   3.1622777 0.3162278   rsq     standard   0.5569203     5 0.02404800  Preprocessor1_Model39
  10.0000000 0.3162278   rmse    standard   5.2571019     5 0.32352796  Preprocessor1_Model40
  10.0000000 0.3162278   rsq     standard   0.5489426     5 0.02967668  Preprocessor1_Model40
  31.6227766 0.3162278   rmse    standard   5.2571019     5 0.32352796  Preprocessor1_Model41
  31.6227766 0.3162278   rsq     standard   0.5489426     5 0.02967668  Preprocessor1_Model41
 100.0000000 0.3162278   rmse    standard   5.2571019     5 0.32352796  Preprocessor1_Model42
 100.0000000 0.3162278   rsq     standard   0.5489426     5 0.02967668  Preprocessor1_Model42
   0.1000000 1.0000000   rmse    standard   7.5253833     5 0.38366783  Preprocessor1_Model43
   0.1000000 1.0000000   rsq     standard   0.2324882     5 0.03983346  Preprocessor1_Model43
   0.3162278 1.0000000   rmse    standard   7.2473472     5 0.40711516  Preprocessor1_Model44
   0.3162278 1.0000000   rsq     standard   0.2583491     5 0.03670843  Preprocessor1_Model44
   1.0000000 1.0000000   rmse    standard   6.7914166     5 0.43822868  Preprocessor1_Model45
   1.0000000 1.0000000   rsq     standard   0.3313275     5 0.03799679  Preprocessor1_Model45
   3.1622777 1.0000000   rmse    standard   6.6751977     5 0.44997757  Preprocessor1_Model46
   3.1622777 1.0000000   rsq     standard   0.3501769     5 0.03892751  Preprocessor1_Model46
  10.0000000 1.0000000   rmse    standard   6.6751804     5 0.44997577  Preprocessor1_Model47
  10.0000000 1.0000000   rsq     standard   0.3501800     5 0.03892593  Preprocessor1_Model47
  31.6227766 1.0000000   rmse    standard   6.6751804     5 0.44997577  Preprocessor1_Model48
  31.6227766 1.0000000   rsq     standard   0.3501800     5 0.03892593  Preprocessor1_Model48
 100.0000000 1.0000000   rmse    standard   6.6751804     5 0.44997577  Preprocessor1_Model49
 100.0000000 1.0000000   rsq     standard   0.3501800     5 0.03892593  Preprocessor1_Model49
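Scanning the table, the best cross-validated RMSE (about 4.06) comes from a large cost with a small rbf_sigma. A sketch of the usual next step, using tune's select_best() and finalize_workflow() with the RES, WF, and TRAIN objects from above:

```r
# Pick the (cost, rbf_sigma) combination with the lowest cross-validated RMSE
BEST <- select_best(RES, metric = "rmse")
BEST

# Fix the winning parameters in the workflow and refit on all of TRAIN
FINAL <- WF %>%
    finalize_workflow(BEST) %>%
    fit(data = TRAIN)
```

select_best() only looks at the mean metric; select_by_one_std_err() is a common alternative when several candidates are statistically indistinguishable, as they are here for the largest costs.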

An example with a classification problem#

data(WINE, package='regclass')

# Using the entire dataset for training is not usually right if we want to
# assess testing performance, but it keeps this example short.
TRAIN <- WINE

REC <- recipe(Quality ~ ., TRAIN) %>%
    step_normalize(all_numeric_predictors()) %>%
    step_dummy(all_nominal_predictors()) %>%
    step_nzv(all_predictors()) %>%
    step_corr(all_predictors()) %>%
    step_lincomb(all_predictors())
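To see what this chain of steps actually does to the predictors (normalization, dummy coding, then dropping near-zero-variance, highly correlated, and linearly dependent columns), the recipe can be estimated and applied on its own. A quick sketch, not required for the workflow below:

```r
# prep() estimates each step from TRAIN; bake(new_data = NULL) returns
# the processed training predictors so you can inspect what survives
REC %>%
    prep() %>%
    bake(new_data = NULL) %>%
    head()
```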

WF <- workflow() %>%
    add_recipe(REC) %>%
    add_model(svm_linear( mode = 'classification', cost = tune() ))

GRID <- expand.grid( cost = 10^seq(-1,2,by=0.5) )

RES <- WF %>%
    tune_grid(
        resamples = vfold_cv(TRAIN, v = 5),
        grid = GRID,
        metrics = metric_set(accuracy, precision, recall)
    )

METRICS <- collect_metrics(RES)

METRICS
A tibble: 21 × 7
        cost .metric   .estimator      mean     n     std_err .config
       <dbl> <chr>     <chr>          <dbl> <int>       <dbl> <chr>
   0.1000000 accuracy  binary     0.8244444     5 0.009910436 Preprocessor1_Model1
   0.1000000 precision binary     0.7837739     5 0.013676757 Preprocessor1_Model1
   0.1000000 recall    binary     0.7616896     5 0.018313221 Preprocessor1_Model1
   0.3162278 accuracy  binary     0.8244444     5 0.009910436 Preprocessor1_Model2
   0.3162278 precision binary     0.7837739     5 0.013676757 Preprocessor1_Model2
   0.3162278 recall    binary     0.7616896     5 0.018313221 Preprocessor1_Model2
   1.0000000 accuracy  binary     0.8244444     5 0.009910436 Preprocessor1_Model3
   1.0000000 precision binary     0.7837739     5 0.013676757 Preprocessor1_Model3
   1.0000000 recall    binary     0.7616896     5 0.018313221 Preprocessor1_Model3
   3.1622777 accuracy  binary     0.8244444     5 0.009910436 Preprocessor1_Model4
   3.1622777 precision binary     0.7837739     5 0.013676757 Preprocessor1_Model4
   3.1622777 recall    binary     0.7616896     5 0.018313221 Preprocessor1_Model4
  10.0000000 accuracy  binary     0.8244444     5 0.009910436 Preprocessor1_Model5
  10.0000000 precision binary     0.7837739     5 0.013676757 Preprocessor1_Model5
  10.0000000 recall    binary     0.7616896     5 0.018313221 Preprocessor1_Model5
  31.6227766 accuracy  binary     0.8244444     5 0.009910436 Preprocessor1_Model6
  31.6227766 precision binary     0.7837739     5 0.013676757 Preprocessor1_Model6
  31.6227766 recall    binary     0.7616896     5 0.018313221 Preprocessor1_Model6
 100.0000000 accuracy  binary     0.8244444     5 0.009910436 Preprocessor1_Model7
 100.0000000 precision binary     0.7837739     5 0.013676757 Preprocessor1_Model7
 100.0000000 recall    binary     0.7616896     5 0.018313221 Preprocessor1_Model7
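Every cost gives identical cross-validated metrics here, so the choice of cost is immaterial on this data. A sketch of finalizing the model and inspecting its predictions; since TRAIN is all of WINE, the resulting confusion matrix is in-sample and therefore optimistic:

```r
# Any cost would do; take the "best" by accuracy anyway
BEST <- select_best(RES, metric = "accuracy")

FINAL <- WF %>%
    finalize_workflow(BEST) %>%
    fit(data = TRAIN)

# augment() attaches a .pred_class column; conf_mat() tabulates it
# against the true Quality labels (resubstitution estimate - optimistic)
augment(FINAL, new_data = TRAIN) %>%
    conf_mat(truth = Quality, estimate = .pred_class)
```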

More experiments and details#

Important

Please continue the above examples:

  • start with training/testing splits;

  • use other modeling functions;

  • plot the metrics with different tuning parameters.

Pros and Cons#

Pros:

  • Models non-linear relationships with ease and with little user input.

  • Actually solves the problem it sets out to solve: given the tuning parameters (cost, \(\epsilon\), etc.), the algorithm finds the optimal model, whereas none of the tree-based methods is guaranteed to stumble upon the best collection of rules.

  • Automatic and implicit feature creation via the “kernel trick” (which largely solves the problem of deciding what the predictor variables should be).

  • Sparse (solution depends only on support vectors); easy to store and make predictions.

  • Not too much tuning required.

  • Strong mathematical and theoretical foundation.

Cons:

  • For classification, it is hard to get “probabilities” of being in a class; the raw predictions are simply class labels.

  • No confidence intervals (but most other methods can’t do this either).

  • While a strong model, it is often not the top performer.

  • Can be very slow for big problems.
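On the probability point: libsvm-based implementations offer a partial workaround called Platt scaling, which fits a logistic curve to the SVM's decision values after training. These are post-hoc estimates, not a native part of the SVM. A sketch with e1071 (already loaded above), using the built-in iris data for illustration:

```r
# probability = TRUE asks libsvm to fit a Platt-scaling model on top of the SVM
FIT <- svm(Species ~ ., data = iris, probability = TRUE)

# The class probabilities come back as an attribute of the predictions
PRED <- predict(FIT, newdata = iris, probability = TRUE)
head(attr(PRED, "probabilities"))
```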