Classification (Supervised Learning)#

Overview#

In BAS 320, you learned how to do linear and logistic regression for descriptive analytics, where the goal is to describe the relationship between some quantity of interest (numerical value like sales, or probability of some class like buy/not buy) and different predictor variables.

In BAS 474, you will instead focus on predictive analytics, where the goal is to make accurate predictions on as-yet unobserved individuals.

For all models that we discuss (in this unit and beyond), you will be responsible for knowing:

  • how each model “works” at a fundamental level (and how to describe this to a layperson)

  • what assumptions (if any) each model makes (i.e., under what conditions should the model be “good”)

  • what types of relationships are well-suited for a model and which types are not

  • what parameters we can “tweak” in a model to make it potentially perform better

  • how the choice of parameters affects the bias-variance tradeoff of the model

Note

Logistic regression example

  • how the model “works”. Logistic regression predicts the probability \(p\) of each class with a logistic curve that involves a weighted sum of predictor variables.

  • what assumptions the model makes. Logistic regression assumes that when comparing “otherwise identical individuals” (the same values for all other predictors in the model), individuals with larger values of \(x\) always have either larger values of \(p\) or smaller values of \(p\).

  • what types of relationships are well-suited for the model and which types are not. The relationships should be relatively simple: transformations of variables and interactions are difficult to model in this context.

  • what parameters we can “tweak” in the model to make it potentially perform better. Choosing which variables go into the model and what transformations are needed (not an easy task at all).

  • how the choice of parameters affects the bias-variance tradeoff of the model. Each additional predictor lowers the bias of the model, since a more complex equation can capture more sophisticated relationships, but at the expense of increasing the model’s variance and potentially reducing its ability to generalize.
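The “weighted sum plus logistic curve” idea above can be sketched in a few lines. The coefficients below are made up purely for illustration; in practice they would be estimated from training data.

```python
import math

# Hypothetical fitted coefficients (intercept and weights are invented
# for illustration, not estimated from any real data).
b0, b1, b2 = -1.5, 0.8, -0.3

def predict_prob(x1, x2):
    """Logistic regression prediction: logistic curve of a weighted sum."""
    z = b0 + b1 * x1 + b2 * x2      # weighted sum of predictors (the log-odds)
    return 1 / (1 + math.exp(-z))   # logistic curve squeezes z into (0, 1)

p = predict_prob(2.0, 1.0)          # predicted probability of the "positive" class
```

Note how the assumption from the example shows up here: because \(b_1 > 0\), an “otherwise identical individual” with a larger \(x_1\) always gets a larger predicted \(p\).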

Key terms in Predictive Analytics#

  • Generalization error: the error (in the long run) made by a model on individuals it has not yet seen. This quantity cannot be *directly* observed because it requires predicting on an infinite number of new individuals, but it can be estimated with \(K\)-fold cross-validation, and it can be calculated on a specific set of new individuals (e.g., a holdout sample).

  • Training data: a set of individuals that are used to “build” the model, i.e., determine its form, estimate coefficients, etc.

  • Holdout sample: a set of individuals that are used to “test” the model. The model does not look at these individuals when its form is determined, so they are “new” for all intents and purposes.

  • Overfitting: A model is overfit when it becomes “too complex” and includes features that are unique to the particular set of individuals that happen to be in the training data. Since these features are not present in the population at large, we say the model is “fitting noise”. An overfit model’s generalization error will be larger than that of a simpler model that just captures “the gist” of the relationships.
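The holdout and cross-validation ideas can be sketched together. To keep the example self-contained, the “model” below is a tiny 1-nearest-neighbor classifier and the data are made up; the mechanics of splitting off a holdout sample and rotating folds are the point, not the model itself.

```python
# Sketch: estimating generalization error with a holdout sample and with
# K-fold cross-validation (made-up data, toy 1-nearest-neighbor model).
def predict(train, x):
    """Classify x with the class of its nearest training individual."""
    return min(train, key=lambda row: abs(row[0] - x))[1]

def error_rate(train, test):
    """Misclassification rate of the model on a set of individuals."""
    return sum(predict(train, x) != y for x, y in test) / len(test)

data = [(x, int(x > 3)) for x in range(10)]   # 10 individuals: (feature, class)

# Holdout sample: "build" on the first 7 individuals, test on the last 3,
# which the model never looked at
train, holdout = data[:7], data[7:]
holdout_error = error_rate(train, holdout)

# K-fold cross-validation (K = 5): each fold takes a turn as the test set,
# and the average fold error estimates the generalization error
K = 5
folds = [data[i::K] for i in range(K)]
cv_error = sum(
    error_rate([row for j, f in enumerate(folds) if j != i for row in f], folds[i])
    for i in range(K)
) / K
```

Since these toy data are perfectly separable, both estimates come out to zero here; on real data they would not, and the gap between training error and holdout/CV error is what reveals overfitting.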

When talking about classification and numerical prediction models, we use letters to represent what quantity is being predicted and what quantities are used to make the predictions. Sometimes people refer to these quantities as the “response” and “features”.

  • \(y\) is generally used to represent the quantity that we want to predict (classification for “Junk mail” vs. “Safe mail”, numerical prediction for lifetime value, etc.).

  • \(x\) is generally used to represent the set of quantities that we use to make those predictions (e.g., letter/word frequencies, graduation year, major, etc.)

  • Usually multiple predictors are used, and \(x_1\), \(x_2\), etc., are used to refer to them.

  • Response: this is our y variable and the quantity we wish to predict. With classification, \(y\) is a factor with two or more levels. With numerical prediction, \(y\) is a number.

  • Feature: this is a characteristic of an individual, i.e., an x variable, that we will use to make predictions (aka predictor variable).

Most algorithms in data mining “work better” if the distributions of the predictor variables look roughly symmetric (we saw this with clustering).

  • Outliers have the potential to influence the coefficients and form of the model. Models whose form is very sensitive to the particular set of individuals that happen to be in the training data have a “high variance” and potentially a large generalization error.

  • Regression models (both linear and logistic) tend to work better with symmetric predictors.

  • Nearest Neighbors and Support Vector Machines tend to work better with symmetric predictors, though making transformations somewhat changes the notion of similarity (or “inner product”) between individuals.

  • Tree-based models introduce rules that split between two unique values of a predictor variable, so symmetry/skewness is not relevant (the same sets of individuals will be above/below a chosen threshold regardless of whether a transformation of the predictor is made)!

Evaluating models (metrics)#

The terms and quantities used to evaluate classification models vs. regression models are different.

Classification

  • Misclassification Rate

  • Accuracy

  • AUC (area under the ROC curve), false positive rate, false negative rate, Kappa, etc.

Regression

  • RMSE: the “root mean squared error” (typical size of error made by model)

  • \(\mathbf{R^2}\): the “R-squared” (fraction of the variation in \(y\) explained by the model).

  • MAE: mean absolute error, etc.
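The regression metrics are simple enough to compute by hand. The actual/predicted values below are made up purely for illustration.

```python
import math

# Made-up actual vs. predicted values for five individuals
actual    = [10.0, 12.0,  9.0, 15.0, 11.0]
predicted = [11.0, 11.5, 10.0, 14.0, 12.0]

errors = [a - p for a, p in zip(actual, predicted)]
rmse = math.sqrt(sum(e**2 for e in errors) / len(errors))   # typical size of error
mae  = sum(abs(e) for e in errors) / len(errors)            # mean absolute error

mean_y = sum(actual) / len(actual)
ss_res = sum(e**2 for e in errors)                          # unexplained variation
ss_tot = sum((a - mean_y)**2 for a in actual)               # total variation in y
r2 = 1 - ss_res / ss_tot                                    # fraction explained
```

Note that RMSE penalizes large errors more heavily than MAE does (squaring before averaging), which is why the two can disagree about which of two models is “better”.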

Classification models:

  • Misclassification Rate: the fraction of individuals in the data that are misclassified by the model (e.g., predicted “junk” but in actuality that email was “safe”).

  • false positive rate and false negative rate break down the errors further (since a model can predict “junk” when the email is actually “safe”, or “safe” when the email is actually “junk”)

  • Accuracy: the fraction of individuals in the data that are correctly classified by the model (e.g., predicted “junk” and the email actually is “junk”).

  • Kappa: a measure of the accuracy of a model relative to the accuracy a “random guess” model would achieve

  • AUC: the fraction of predicted probabilities that are “ranked” correctly. If a model is predicting if an email is junk, and the AUC is 0.87, then if the model makes predictions on random examples of junk and safe emails, there is an 87% chance the predicted probability of being junk for the junk mail will be higher than the predicted probability of being junk for the safe email.
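The accuracy, misclassification rate, and the pairwise-ranking interpretation of AUC can all be computed by hand. The labels and predicted probabilities below are made up (“junk” coded as 1).

```python
# Made-up true classes and predicted probabilities of "junk" for six emails
actual = [1, 1, 1, 0, 0, 0]
prob   = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
pred   = [1 if p >= 0.5 else 0 for p in prob]   # classify with a 0.5 cutoff

accuracy = sum(a == p for a, p in zip(actual, pred)) / len(actual)
misclassification = 1 - accuracy

# AUC as a ranking: over all (junk, safe) pairs, how often does the junk
# email get the higher predicted probability of being junk?
junk = [p for a, p in zip(actual, prob) if a == 1]
safe = [p for a, p in zip(actual, prob) if a == 0]
auc = sum(j > s for j in junk for s in safe) / (len(junk) * len(safe))
```

Here 8 of the 9 (junk, safe) pairs are ranked correctly, so the AUC is about 0.89 even though the 0.5-cutoff accuracy is only about 0.67; AUC measures ranking quality rather than any single cutoff.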

Classification and Business Analytics#

The task of classification is perhaps the most popular pillar of data mining.

Note

Given a set of training data and the class labels of those individuals, what features and characteristics of these individuals provide information on that individual’s class? How should this information be synthesized and combined into a set of rules that assigns a class to a new individual?

  • Emails have two class labels: junk (spam) or legitimate. What are the characteristics of junk vs. legitimate email? How can an algorithm determine whether an email is junk or not?

  • Alumni have a few class labels: never donors (those who will never donate to UT), major donors (those who donate more than $10,000 over their life), and casual donors (the rest). What characterizes each class? How can an algorithm decide who is who?

Many business questions are classification problems.

  • Classify customers as a future buyer or non-buyer.

  • Classify a web surfer as one who will click on an ad or not.

  • Classify political leaning: left, middle, right.

  • Classify outcome of a new restaurant: succeed, fail.

  • Classify customer’s future loan payment status: on time, late, default.

  • Classify loyalty of customers: churn vs. stay.

What other problems might businesses face that can be treated as a classification problem?