Classification Framework#
Two approaches to classification:
Probability: calculate the probability that an individual belongs to class A, B, C, etc., then classify the individual into whichever class has the highest probability.
Scoring: calculate some “score” for each class, then classify the individual into the class with the highest score.
Probabilities have a specific interpretation and restrictions on their possible values, so algorithms that output probabilities are in general more restrictive (and based on more assumptions) than others. However, probabilities have a ready interpretation while “scores” do not.
Probability#
Note
Definition: the probability of an event is the long-run fraction (or proportion) of the time that the event occurs.
For example, to estimate the probability an ad gets clicked by a web surfer, we can wait until a “large” number of visitors have seen the ad and compute the fraction of them that clicked the ad. The actual probability is this ratio as the number of surfers approaches infinity:
\(p_{click\,ad} =\) fraction of surfers who have clicked the ad as #surfers \(\rightarrow \infty\)
Because the probability of an event refers to a number that emerges only after an infinite number of individuals have been observed, its definition is somewhat problematic in practice.
Note
Except in special cases (where all possible outcomes of an event are known and are equally likely to occur, like the faces of a die), the probability of an event is never directly observed, and we can never truly know the numerical value of a probability.
Probability Example - Clicking an ad#
To illustrate, imagine that we know that the probability of clicking an ad is 10%. Let’s run a simulation to get a feeling for what this 10% means. To simulate a surfer, let’s generate a random number between 1 and 10. If it is a 7, the surfer clicks the ad (since this happens with a 10% chance).
| Surfer | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Random # | 6 | 9 | 7 | 4 | 8 | 7 | 7 | 2 | 10 | 3 | 10 |
| Click? | No | No | Yes | No | No | Yes | Yes | No | No | No | No |
| Fraction of clicks | 0 | 0 | 1/3 | 1/4 | 1/5 | 2/6 | 3/7 | 3/8 | 3/9 | 3/10 | 3/11 |
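The “Fraction of clicks” row is just a running average. As a quick sketch, it can be reproduced in R from the random numbers in the table above:

```r
# Random numbers drawn for the 11 simulated surfers (from the table)
random_draws <- c(6, 9, 7, 4, 8, 7, 7, 2, 10, 3, 10)

# A surfer clicks when the draw is a 7 (a 1-in-10 chance)
clicks <- as.numeric(random_draws == 7)

# Running fraction of clicks after each surfer
fraction_of_clicks <- cumsum(clicks) / seq_along(clicks)
round(fraction_of_clicks, 3)
```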
As the number of visitors to the ad becomes extremely large, the fraction that click the ad settles down to (close to) 10%. This is what we mean by the probability of clicking being 10%. After a particular number of visits we don’t expect exactly a 10% click rate, but it should be “close”; the more visitors, the closer to 10%.
```r
# Simulate 10 million visitors; each clicks with probability 0.1
set.seed(474)
observed <- rbinom(1e7, 1, 0.1)

# Running total of clicks and running estimate of the click probability
total_observed <- cumsum(observed)
estimated_probability <- total_observed / 1:1e7

# Thin to ~1000 points evenly spaced on a log scale for plotting
selected <- unique(round(10^(seq(0, 7, length = 1000))))
estimated_probability <- estimated_probability[selected]

plot(selected, estimated_probability,
     xlab = "Number of Visitors (logarithmic scale)",
     ylab = "Fraction who clicked", log = "x", type = "l", lwd = 2)
abline(h = 0.1, col = "red")  # the true probability
```
Important Probability Properties#
Probabilities are numbers between 0 and 1 since they represent the proportion of time an event happens.
\(p=0\) implies that it is impossible for the event to occur
\(p=1\) implies that the event is certain to occur
If a data mining model outputs the probability of a class, it better be between 0 and 1!
The probability of an event NOT happening is one minus the probability of the event happening and vice versa: \(P(A) = 1-P(not \, A)\) and \(P(not \, A) = 1 - P(A)\).
If the probability of a customer making a purchase is 70%, then the probability of them NOT making a purchase is 30%.
If the probability of a customer NOT clicking an ad is 0.999, then the probability of them clicking an ad is 0.001.
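The complement rule is simple enough to check directly; a minimal sketch in R using the two examples above:

```r
# Complement rule: P(not A) = 1 - P(A)
p_purchase <- 0.70
p_no_purchase <- 1 - p_purchase   # 0.30

p_no_click <- 0.999
p_click <- 1 - p_no_click         # 0.001
```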
Conditional Probability#
The philosophy of business analytics can be summed up as “information matters”. Once new information is available for a problem, strategies may change. This is true for probabilities too. While we might start with a baseline rate at which we think events may occur, we should update when new information becomes available.
Overall, 3% of people respond to email blasts about pet insurance (baseline rate/marginal probability). However, among pet owners, 11% respond (conditional probability given that the person owns a pet).
Overall, 0.4% of credit card charges are fraudulent (baseline rate/marginal probability). However, given the charge is made in a different country than the account holder’s residence, this increases to 6% (conditional probability given purchase characteristic).
When updating a probability by conditioning on new information, there are a few key terms to be familiar with. Imagine we are considering the probability of event \(A\) (e.g., major donor to UT), and the new information we have is referred to as event \(B\) (e.g., alumnus graduated in 1965 with an engineering major).
Prior probability of A - \(P(A)\): this is the baseline probability of A (in the absence of any other information). For example, it might be known that overall, 6% of alumni are major donors, so the prior probability of an alumnus being a major donor is 0.06. The prior probability is also referred to as marginal probability of the event.
Posterior probability of A - \(P(A|B)\): this is the probability of A given (\(|\) symbol) the additional information B. For example, the posterior probability of an alumnus being a major donor (\(A\)) given the alumnus is a 1965 engineering major (\(B\)) may be \(P(A|B) = 13\%\), i.e., 13% of 1965 engineering graduates are major donors.
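With individual-level data, both probabilities are just fractions: the prior is the overall fraction, and the posterior is the fraction within the subgroup defined by \(B\). A sketch on made-up data (the data frame, column names, and values here are hypothetical):

```r
# Hypothetical alumni data: is each a major donor, and a 1965 engineering grad?
alumni <- data.frame(
  major_donor = c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE),
  eng_1965    = c(TRUE, TRUE,  FALSE, TRUE, FALSE, TRUE,  FALSE, FALSE)
)

# Prior P(A): overall fraction of major donors
prior <- mean(alumni$major_donor)

# Posterior P(A|B): fraction of major donors among 1965 engineering grads
posterior <- mean(alumni$major_donor[alumni$eng_1965])
```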
Note
The more information you condition on, the more difficult it is to estimate probabilities based on data.
Jane is a 42 year old single mother of 2 (kids are 10 and 13) who works 40 hours a week at an investment firm making 89K a year. She has been a customer of Verizon for 8 years and subscribes to a family plan with 2 additional lines (presumably for her 2 children). Currently she has a 2-year contract that is set to expire next month.
What is the probability that she churns (doesn’t renew the contract, potentially going to another provider)?
The long-run frequency definition of probability discussed above makes estimating her churn probability difficult!
There is only one Jane, so “repeated trials” are impossible
If we instead estimate the probability by looking at the fraction of similar customers (same age, marital status, kids, job, salary, subscription history), we may be out of luck since that combination of characteristics might be unique to Jane or be very rare.
Remember that the margin of error when estimating a probability from \(n\) similar individuals is at most \(1/\sqrt{n}\); if only 10 others like Jane exist, the margin of error could be about 32%.
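The \(1/\sqrt{n}\) bound makes the point concrete; a quick check in R:

```r
# Worst-case margin of error when estimating a probability from n individuals
margin_of_error <- function(n) 1 / sqrt(n)

margin_of_error(10)     # about 0.32: 10 Jane-like customers tell us little
margin_of_error(10000)  # 0.01: a large sample pins the probability down
```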
