Assessing Classification Models#

With linear regression models, we have focused on the \(RMSE\) (root mean squared error), which tells us the typical size of the error made by the model. When doing predictive analytics, we gauged the "generalization error" of a model by estimating what the \(RMSE\) will be when the model makes predictions on new individuals.

There is no direct equivalent of the \(RMSE\) for logistic regression models. The closest is the “misclassification rate”, which is the percentage of classifications the model gets wrong (large is bad). The opposite is the “accuracy”, which is the percentage of classifications the model gets correct (large is good). Another metric is the AUC, which requires us to compare the predicted probabilities of individuals.

Naive Model#

For regression, the naive model finds the average value of \(y\) in the training data and uses that value as the predicted value for all individuals (in the training or the holdout!).

For classification, the naive model finds the majority class in the training data and uses it as the predicted class for all individuals (in the training or the holdout!).

table(TRAIN$Donate) #Naive model would predict "No" for everyone
#   No   Yes 
#14529  4843 
table(TRAIN$Income) #Naive model would predict "f60t75" for everyone
#   f0t30   f30t45   f45t60   f60t75   f75t90 f90toINF 
#      47       97       97      106       96       57 
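As a quick sketch (using the `Donate` counts from the table above), the naive model's training accuracy is simply the majority class's share of the data:

```r
# Class counts matching the Donate table above
counts <- c(No = 14529, Yes = 4843)

# The naive model predicts the majority class for everyone,
# so its accuracy is the majority class's share of the data
naive_class    <- names(which.max(counts))
naive_accuracy <- max(counts) / sum(counts)

naive_class     # "No"
naive_accuracy  # 0.75
```

Any fitted model should at least beat this baseline accuracy before we call it "good".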

Confusion Matrix#

In many contexts, there is a particular class we are interested in identifying (customers who will churn, emails that are junk, products that will succeed, alumni who will donate). Let us denote that class as the “Yes” class (or “positive” class) and the other as the “No” class (or “negative” class). The confusion matrix tabulates which predictions the model gets right and which the model gets wrong.

In general, there are two types of mistakes – the model may predict the individual to be part of the “Yes” class but in reality it is part of the “No” class, or the model may predict the individual to be part of the “No” class when in reality it is part of the “Yes” class.

|             | Actual Yes | Actual No |
|-------------|------------|-----------|
| Predict Yes | Correct    | Incorrect |
| Predict No  | Incorrect  | Correct   |

Below is one example:

          Reference   #Reference stands for "Actual Class"
Prediction No Yes
       No   3   5
       Yes 24  60

In total, the model made 8 “No” predictions and 84 “Yes” predictions. In reality, there are 27 “No” individuals and 65 “Yes” individuals. We see that of the 27 “No”s, the model correctly identifies 3 but classes 24 incorrectly as “Yes”.

Misclassification Rate and Accuracy#

A model’s misclassification rate is the fraction of predictions that it gets wrong. To find this quantity, add up the number of “Incorrect”s and divide by the total number of predictions.

The model’s correct classification rate or accuracy is the fraction of predictions that it gets correct. To find this quantity, add up the number of “Correct”s and divide by the total number of predictions.

The misclassification rate and correct classification rate will always add up to 100% (unless rounding takes place).
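A minimal sketch using the example matrix above (predictions as rows, actual classes as columns): correct classifications sit on the diagonal, errors sit off it.

```r
# Confusion matrix from the example above: rows = Prediction, cols = Reference
CM <- matrix(c(3, 24, 5, 60), nrow = 2,
             dimnames = list(Prediction = c("No", "Yes"),
                             Reference  = c("No", "Yes")))

accuracy          <- sum(diag(CM)) / sum(CM)  # correct = diagonal elements
misclassification <- 1 - accuracy             # errors = off-diagonal elements

accuracy           # (3 + 60) / 92, about 0.685
misclassification  # about 0.315
```

Note the two quantities add to exactly 1 (100%), as stated above.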

An example: major donors to UT (which are rare), where “Yes” is Donor and “No” is Nondonor.

|                  | Actual Donor | Actual Nondonor |
|------------------|--------------|-----------------|
| Predict Donor    | 25           | 1032            |
| Predict Nondonor | 36           | 9428            |

Out of the 10521 predictions made (sum of all numbers in the matrix), a total of 1068 are incorrect (36+1032, the sum of the “off-diagonal elements” of the matrix) and 9453 are correct (25+9428, the sum of the “diagonal elements” of the matrix).

The misclassification rate is 1068/10521 = 0.1015, or 10.15%.

The correct classification rate or accuracy is 9453/10521 = 0.8985, or 89.85%.

The misclassification rate for the UT donor problem is 10.15%. This sounds pretty good, right? Not so fast.

In fact, if our model predicted every alumnus to be a non-donor, it would have a misclassification rate of 0.58% (the model classifies the 61 major donors as non-donors and makes 61 total errors out of 10521).

Note

Always compare the misclassification rate to that achieved by simply classifying all individuals as the majority class (the naive model). If we gauge a model’s performance only on its misclassification rate, it had better beat this baseline!

In fact, this “classify everyone as the majority” model is exactly the naive model introduced earlier. A data mining model is only “good” if it beats the naive model.

In this case, the “naive model” would classify all individuals as the “No” class since they are the majority (10460 vs. 61).

The misclassification rate of the naive model is 61/10521 = 0.58%. This sounds impressive, but it is completely useless for the purpose of identifying donors!

Accuracy and Kappa \(\kappa\)#

As we have seen, looking at the misclassification rate or accuracy can be misleading, especially if the classes are highly “imbalanced” (present in vastly different proportions).

  • Imagine the Yes vs. No classes in the data appear in a 2% / 98% mix.

  • An accuracy of 97% (sounds good) is actually worse than that of the naive model which predicts “No” for everyone no matter what.

Note

The kappa statistic (\(\kappa\)) compares the accuracy of a model to the “expected accuracy”, i.e., the accuracy of a model that predicts classes at random (with frequencies proportional to what appears in the data).

Larger values of \(\kappa\) indicate better models.

Let \(n\) be the number of predictions.

\[\kappa = \frac{Accuracy_{model} - Accuracy_{random}}{1 - Accuracy_{random}}\]
\[Accuracy_{random} = \frac{ No_{actual}\cdot No_{pred} + Yes_{actual}\cdot Yes_{pred}}{n^2}\]

Just as there is no set threshold for when the correlation between two variables is “large”, there are only rough guidelines as to what values of kappa are “good”. In terms of the model’s agreement with reality, the guidelines are: above 0.75 is excellent, 0.4-0.75 is fair/good, and less than 0.4 is poor.

Note: I will not have you calculate kappa by hand.

#       Confusion Matrix and Statistics
#          Reference
#Prediction No Yes
#       No   232   54
#       Yes   30   76
accuracy <- (232+76)/(232+54+30+76)
expected.accuracy <- ( (232+54)*(232+30) + (54+76)*(30+76) )/(232+54+30+76)^2
accuracy; expected.accuracy
kappa <- (accuracy-expected.accuracy)/(1-expected.accuracy)
kappa  #a value of 0.49 is fair/good

The expected accuracy is what a model guessing at random would achieve in terms of correct classification rate. There are 262 “No”s and 130 “Yes”s, so “at random” means picking “No” with roughly 262/392 = 67% chance and “Yes” with 33% chance.

Based on the arbitrary guidelines introduced, we’d say this is a “fair to good” classifier that does noticeably better than guessing at random.

False Positives and False Negatives#

Often, one type of error is more important or damaging than the other. Two additional ways to gauge a model’s performance emerge.

  • A false positive occurs when the model predicts that an individual has the “yes” class, but in reality they have the “no” class.

  • A false negative occurs when the model predicts that an individual has the “no” class, but in reality they have the “yes” class.

  • Junkmail data. Junk = positive = “yes” class. Safe = negative = “no” class. Here, a false positive is really bad since a safe email is classified as junk and deleted.

  • Diabetes data. Diabetes = positive = “yes” class. Healthy = negative = “no” class. Here, a false negative is really bad since a woman with diabetes is classified as healthy and will not get the care that is needed.

  • Churn data. Churn = positive = “yes” class. Renew = negative = “no” class. A false negative is disastrous to a company. When this happens, a customer who ends up churning is classified as renewing and an opportunity to intervene and keep the customer is lost.

The false positive rate is the fraction of negatives that the model incorrectly classifies as positives. The false negative rate is the fraction of positives that the model incorrectly classifies as negatives.

|             | Actual Yes     | Actual No      |
|-------------|----------------|----------------|
| Predict Yes | True Positive  | False Positive |
| Predict No  | False Negative | True Negative  |

  • False positive rate (FPR). The fraction of individuals with the “no” class that the model predicts to be “yes”.

  • False negative rate (FNR). The fraction of individuals with the “yes” class that the model predicts to be “no”.

  • True positive rate (TPR). The fraction of the individuals with the “yes” class that the model predicts to be “yes”.

  • True negative rate (TNR). The fraction of the individuals with the “no” class that the model predicts to be “no”.

|                  | Actual Donor | Actual Nondonor |
|------------------|--------------|-----------------|
| Predict Donor    | 25           | 1032            |
| Predict Nondonor | 36           | 9428            |

  • False positive rate (predict yes but actually no). 1032/(1032+9428) = 9.87%. The fraction of individuals with the “no” class that the model predicts to be “yes”.

  • False negative rate (predict no but actually yes). 36/(25+36) = 59.0%. The fraction of individuals with the “yes” class that the model predicts to be “no”.

  • True positive rate. 25/(25+36) = 41.0%. The fraction of the individuals with the “yes” class that the model predicts to be “yes”.

  • True negative rate. 9428/(1032+9428) = 90.13%. The fraction of the individuals with the “no” class that the model predicts to be “no”.
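The arithmetic above can be verified in a few lines of R (the four counts come from the donor matrix):

```r
# Cells of the donor confusion matrix
TP <- 25    # predict Donor, actually Donor
FP <- 1032  # predict Donor, actually Nondonor
FN <- 36    # predict Nondonor, actually Donor
TN <- 9428  # predict Nondonor, actually Nondonor

TPR <- TP / (TP + FN)  # yes's predicted yes (TPR + FNR = 1)
FNR <- FN / (TP + FN)  # yes's predicted no
FPR <- FP / (FP + TN)  # no's predicted yes (FPR + TNR = 1)
TNR <- TN / (FP + TN)  # no's predicted no

round(100 * c(TPR = TPR, FNR = FNR, FPR = FPR, TNR = TNR), 2)
```

Each rate divides by an *actual* class total (61 donors or 10460 nondonors), never by a predicted total.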

Sensitivity and Specificity#

Sensitivity is another word for the true positive rate: the fraction of individuals in the “Yes” class who are correctly classified as “Yes”.

Specificity is another word for the true negative rate: the fraction of individuals in the “No” class who are correctly classified as “No”.

Mnemonic: sensiTivity (T for “trues that are true”, i.e., yes’s classified as yes) and speciFicity (F for “falses that are false”, i.e., no’s classified as no).

There are an insane number of terms to describe a model’s performance (recall and precision are two other common ones). You’ll have a hard enough time remembering sensitivity and specificity as is, so let’s leave the others behind.

Which metric do you think should be used if we want to choose a model to predict diabetes? Should this value be maximized or should it be minimized?

          Reference
Prediction  No Yes
       No  232  54
       Yes  30  76
  • Misclassification rate: (54+30)/392 = 21.4%

  • Correct classification rate (Accuracy): (232+76)/392 = 78.6%

  • False positive rate: 30/(232+30) = 11.5%

  • False negative rate: 54/(54+76) = 41.5%

  • True positive rate (sensitivity): 76/130 = 58.5%

  • True negative rate (specificity): 232/262 = 88.5%

Metric Debate#

In reality, there are pros and cons to each metric. Typically what is sought is a balance between false positives, false negatives, misclassifications, etc.

  • Misclassification rate and accuracy. If we optimize these, we are in effect treating each type of error on equal footing.

  • False positive rate: 30/262 = 11.5%. A model with a lower false positive rate will always exist (simply classify everyone as the negative class so that there are no false positives, though that is a useless model).

  • False negative rate: 54/130 = 41.5%. A model with a lower false negative rate always exists (simply classify everyone as the positive class so that there are no false negatives, though that is a useless model).

  • True positive rate (sensitivity): 76/130 = 58.5%. A model with a higher true positive rate always exists (simply classify everyone as the positive class, which results in a useless model).

  • True negative rate (specificity): 232/262 = 88.5%. A model with a higher true negative rate always exists (simply classify everyone as the negative class, which results in a useless model).

Cost matrix for different error types#

It is possible to give different “costs” to errors (e.g., a false positive is worse than a false negative). This allows a custom weighting for each type of error, and a model can be chosen that minimizes the “cost”.

Perhaps for the churn example, a false negative is 18 times worse than a false positive. For example, if the company does not intervene with a promotion and the customer churns, the company makes $10 (it makes no further money from the customer, but it didn’t spend $10 in the promotion). If the company does offer a customer a promotion who wasn’t going to churn, the company makes $180 (it makes $190 from the customer in the future, but it “loses” $10 in the promotion).
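A minimal sketch of the idea, using the 18-to-1 weighting from the churn story above (the confusion-matrix counts here are made up for illustration): the model's total cost is each cell's count weighted by that cell's cost.

```r
# Hypothetical confusion matrix (rows = Prediction, cols = Actual)
CM <- matrix(c(60, 10, 25, 5), nrow = 2,
             dimnames = list(Prediction = c("Churn", "Renew"),
                             Actual     = c("Churn", "Renew")))

# Cost matrix: a false negative (predict Renew, actual Churn) costs
# 18 times as much as a false positive (predict Churn, actual Renew);
# correct classifications cost nothing
COST <- matrix(c(0, 18, 1, 0), nrow = 2, dimnames = dimnames(CM))

total_cost <- sum(CM * COST)  # weight each cell's count by its cost
total_cost                    # 10*18 + 25*1 = 205
```

A model chosen to minimize this total cost will tolerate extra false positives if doing so avoids the much more expensive false negatives.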

ROC curve and AUC#

For models that predict the probability that an individual belongs to the class of interest, we use the simple rule that if \(p \ge 0.5\) then the individual is classified as having the class of interest.

What if we used a different classification threshold besides 0.5? That would change the true positive, false positive, true negative, and false negative rates. Maybe that’s a good thing.

The ROC curve (receiver-operating characteristic curve) shows the performance of a classifier (predicting Yes/No) for the full range of thresholds between 0 and 1. It plots pairs of true positive rates (\(Sensitivity\)) and false positive rates (\(1-Specificity\)) for various values of the threshold, then connects them with a curve.
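To make the threshold idea concrete, here is a minimal sketch (with made-up probabilities and classes, not a real dataset) that computes one (FPR, TPR) point per threshold. These are exactly the points the ROC curve connects.

```r
# Hypothetical predicted probabilities and actual classes
p      <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)
actual <- c("Yes", "Yes", "No", "Yes", "No", "Yes", "No", "No")

# Classify with a given threshold, then compute the (FPR, TPR) pair
roc_point <- function(threshold) {
  pred <- ifelse(p >= threshold, "Yes", "No")
  TPR  <- mean(pred[actual == "Yes"] == "Yes")  # sensitivity
  FPR  <- mean(pred[actual == "No"]  == "Yes")  # 1 - specificity
  c(FPR = FPR, TPR = TPR)
}

# One point on the ROC curve per threshold
sapply(c(0.25, 0.5, 0.75), roc_point)
```

Lowering the threshold moves the point up and to the right (more "Yes" predictions mean a higher TPR but also a higher FPR); raising it does the opposite.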

The Area under the ROC curve or AUC is a key metric as well and one that is typically the favorite in data mining.

This is the ROC curve for the \(PIMA\) data where we predict whether a woman has diabetes.

library(tidymodels)
library(pROC)
data(PIMA, package='regclass')
REC <- recipe(Diabetes ~ ., PIMA) %>%
    step_nzv(all_predictors()) %>%
    step_corr(all_predictors()) %>%
    step_lincomb(all_predictors())

MODEL <- workflow() %>%
    add_recipe(REC) %>%
    add_model(logistic_reg()) %>%
    fit(data = PIMA)

ACTUAL <- PIMA$Diabetes
PREDICTIONS <- predict(MODEL, new_data=PIMA, type='prob') %>% pull(.pred_Yes)
plot( roc(ACTUAL, PREDICTIONS), xlab="True Negative Rate", ylab="True Positive Rate", xlim=c(1,0), ylim=c(0,1) )
Setting levels: control = No, case = Yes
Setting direction: controls < cases
[Figure: ROC curve for the PIMA diabetes model]

Note: do take care to specify `xlab` and `ylab` or the axes default to Specificity and Sensitivity (which can be confusing). Also, the horizontal axis goes from 1 to 0.

  • Ideally, we’d like a model to have a 100% true positive rate and a 0% false positive rate (the point at the upper-left of the plot). However, in the real world this never happens.

  • The diagonal line represents a “completely uninformative model” where the model is making classifications “at random”.

  • A “good” model will have the curve shoot up very quickly on the left and flatten out. The closer the curve is to the diagonal line, the worse the model.

  • A “good” model will have a large “area under the ROC curve”.

Area under the ROC curve (AUC) - Ranking Probabilities#

The AUC (area under the ROC curve) is a key quantity in machine learning and business analytics.

  • What if we don’t necessarily care about the probabilities that come out of the model? Maybe a 4% chance of churning is still worrisome!

  • What if we are really just interested in the individuals who have the highest probability of possessing the “Yes” class?

Imagine the alumni office is using a model to decide which of two alumni an officer will meet with when they fly up to Chicago, with the goal of talking the alumnus into donating big. If neither (or both) have the potential to be “major donors”, then it does not really matter who the officer meets with. However, if one has the potential to be a major donor while the other does not, we’d like the model to tell us who to visit.

The officer will meet up with alumnus who has the higher probability of being a major donor. What is the chance that this decision is correct? The AUC tells us this.

Note

Imagine picking an individual from the “Yes” class at random and an individual from the “No” class at random. The AUC gives the chance that these two individuals are ranked correctly, i.e., that the “Yes” individual receives a higher probability or score than the “No” individual.

Note: the model may give the “Yes” individual a probability of \(p=0.15\) and the “No” individual a probability of \(p=0.07\) (so both are classified as “No” by the default criteria), but since the ordering of the probability scores is correct (the “Yes” individual had the higher score) this is considered a triumph.

The AUC is the fraction of pairs of individuals from opposite classes in the data whose probabilities are ordered correctly.

Looking at the ROC curve, you can see that 1.0 is the maximum possible value for the AUC since the possible area is the area of a \(1 \times 1\) square.

The “worst-case” value corresponds to the “guessing at random” model (the diagonal line). Thus we see that an AUC of 0.5 corresponds to a “completely uninformative model”.
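This pairwise interpretation can be verified directly. A minimal sketch with made-up scores (ties between scores need extra care and are ignored here):

```r
# Hypothetical predicted probabilities for the two classes
p_yes <- c(0.9, 0.8, 0.6, 0.3)  # scores of "Yes" individuals
p_no  <- c(0.7, 0.4, 0.2, 0.1)  # scores of "No" individuals

# Every (Yes, No) pair, and the fraction where the Yes score is higher
pairs   <- expand.grid(yes = p_yes, no = p_no)
auc_est <- mean(pairs$yes > pairs$no)
auc_est  # 13 of the 16 pairs are ordered correctly: 0.8125
```

Note that the pair (0.3, 0.7) is ordered incorrectly even though both individuals would be classified "No" at the 0.5 threshold: the AUC cares only about the ordering of the scores, not the classifications.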

In business analytics, it is common for a model to have a terrible misclassification rate: when the class of interest is very rare, the predicted probability of belonging to that class rarely exceeds 50%, so the model almost never predicts it. That same model may simultaneously have a respectable AUC of 0.9.

AUC for Diabetes#

ACTUAL <- PIMA$Diabetes
PREDICTIONS <- predict(MODEL, new_data=PIMA, type='prob') %>% pull(.pred_Yes)
roc(ACTUAL, PREDICTIONS)$auc
Setting levels: control = No, case = Yes
Setting direction: controls < cases
0.848649442160893

The AUC is 0.85. If we picked a woman who has diabetes at random and a woman who does not have diabetes at random, the model has an 85% chance of giving the woman with diabetes a higher score than the woman without diabetes.

The AUC is useful here because maybe we want to follow up with the 50 women who are most at risk (regardless of the actual probabilities of them having diabetes).
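Selecting the individuals most at risk is just a sort on the predicted scores. A sketch with hypothetical probabilities (the names and values are made up):

```r
# Hypothetical predicted probabilities, one per individual
probs <- c(id1 = 0.12, id2 = 0.67, id3 = 0.45, id4 = 0.91, id5 = 0.08)

# Pick the k individuals with the highest scores for follow-up
k       <- 3
at_risk <- names(sort(probs, decreasing = TRUE))[1:k]
at_risk  # "id4" "id2" "id3"
```

Only the ordering of the scores matters for this decision, which is exactly what the AUC measures.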

AUC for Junk#

Consider a model predicting whether an email is safe or junk based on word and punctuation frequencies.

data(JUNK, package='regclass')

REC <- recipe(Junk ~ ., JUNK)
#REC <- step_nzv(REC, all_predictors())
#REC <- step_corr(REC, all_predictors())
#REC <- step_lincomb(REC, all_predictors())

MODEL <- workflow() %>%
    add_recipe(REC) %>%
    add_model(logistic_reg()) %>%
    fit(data = JUNK)

ROC <- roc(JUNK$Junk, predict(MODEL, new_data=JUNK ,type="prob")$.pred_Junk)
ROC$auc

plot(ROC, xlim=c(1,0), ylim=c(0,1))
Warning message:
“glm.fit: fitted probabilities numerically 0 or 1 occurred”
Setting levels: control = Junk, case = Safe
Setting direction: controls > cases
0.977368633676279
[Figure: ROC curve for the JUNK email model]

The AUC is 0.9774.

When comparing a randomly picked piece of junk mail to a randomly picked piece of legitimate mail, the model has a 97.74% chance of giving the junk mail a higher “junk” probability than the safe mail. The orderings of the predicted probabilities look good!

Is the AUC a useful metric here? Not really. It’s not like a junk mail filter deletes the “10 most likely emails to be junk”. A model that maximizes the accuracy while keeping the false positive rate (predict junk but actually safe) near 0 is desirable.

When is AUC useful?#

When should a model be “tuned” based on its accuracy, and when should the AUC be used?

  • If both types of errors are equally bad, then accuracy/misclassification rate is a fine thing to use.

  • If one type of error is a lot worse than the other, then AUC is generally preferred.

  • If the ranks of the probabilities (and not the probabilities themselves) are important, then the AUC is used.

Often in business analytics, models are used to select the “5000 most likely customers to churn”. The AUC, since it focuses on ranking probabilities, is the preferred metric for these types of problems.