Clustering (Unsupervised Learning)#

Outline#

Three types of analytics#

Generally, people think of analytics as fitting into three broad categories:

  • Descriptive Analytics - what’s happening now?

  • Predictive Analytics - what will happen next?

  • Prescriptive Analytics - what are the best choices that can be made?

More detailed examples:

  • Descriptive Analytics - describing characteristics / relationships / trends in an existing dataset. Answers “what’s happened in the past” or “what’s happening now”. For example, what has been the average lifetime value of customers for each combination of gender and state of residence?

  • Predictive Analytics - fitting a model that predicts well the characteristics of as-yet-unseen individuals. A good predictive model “generalizes well” (accurately reproduces the characteristics of individuals that already exist but that have not yet been measured, or of individuals in the future). Answers “what will happen next”. For example, what will be the lifetime value of Jane, a 42-year-old single mother who just made her first purchase?

  • Prescriptive Analytics - given a model that provides a reasonable reflection of reality and/or generalizes well to “new” individuals, what can be done to “optimize” the process of interest? Answers “what are the best choices that can be made”. For example, how often should emails be “spammed out” to customers, and who should receive which emails?

Four pillars of data mining#

Data mining is the process of extracting actionable insight from data. People often split it up into four pillars:

  • Clustering/Segmentation (descriptive analytics)

  • Association rules (market basket analysis; descriptive analytics)

  • Classification (Descriptive or Predictive Analytics)

  • Regression / Numerical Prediction (Descriptive or Predictive Analytics)

More detailed examples:

  • Classification (Predictive Analytics) - given a set of training data and the “answers” (actual classes of each individual), develop some model that is effective at predicting the class labels of as-yet-unseen individuals.

  • Regression / Numerical Prediction (Predictive Analytics) - given a set of training data and the “answers” (actual \(y\)-values of each individual), develop some model that is effective at predicting the \(y\) value of as-yet-unseen individuals.

  • Association rules (market basket analysis; descriptive analytics) - find commonly co-occurring traits. What grocery store items are often purchased together? What sets of movies are commonly enjoyed by the same individual? What web pages are useful given a search string?

  • Clustering/Segmentation (descriptive analytics) - find groups of individuals that are “similar” in some regard and cluster together, though we don’t know ahead of time whether clusters exist or what they should even look like. For example, what market segments exist? The price-sensitive consumer, the die-hard fan, the early adopter, etc.

Clustering: Fundamental Concepts and Challenges#

Clustering applications in business#

  • A grocery chain may use clustering to segment its 1.3 million loyalty card customers into 5 different groups based on their buying behavior (fresh food lovers, convenience junkies, etc.). It may then adopt customized marketing strategies for each of these segments in order to target them more effectively.

  • Products in a store can be clustered into groups based on attributes like use, size, brand, flavor, etc.

  • Stores can be clustered together based on sales, size, customer base, etc.

  • Webpages can be clustered together based on “similar” content and recommendations for useful links can be made.

  • Warehouse items could be clustered based on similar demand patterns to organize layouts better.

Clustering example#

A bank wants to understand its customer base and to develop “profiles” so it can build targeted campaigns. But what do we cluster by? What information do we use? What clusters do we expect to see?

  • We can split customers into a “cluster” of males and a “cluster” of females. These are definitely two different groups. But is this distinction meaningful and are these groups useful for designing campaigns? Maybe?

  • We can split customers into two clusters based on whether they have a savings account. Once again, is this distinction meaningful for designing campaigns? Do these customers behave differently in the way the bank is interested in (i.e., interest in refinancing a mortgage)? Maybe?

  • Checking balance?

  • Number of transactions in last 1/3/6/12 months?

  • Balance change in last 1/3/6/12 months?

  • Demographic information?

We can come up with endless ways of “clustering” customers! It remains to be seen whether the distinctions between the various groups are meaningful for the purposes of targeting customers with a new product.

Are the clusters going to differ in ways the bank cares about? It is difficult to say ahead of time.

It’s a good idea to let the data suggest what clusters exist instead of coming up with ways that we think might help in splitting individuals up.

Challenges of Clustering#

The task sounds simple: split the individuals into subsets (clusters) so that:

  • Individuals in a cluster share a meaningful set of common characteristics with each other.

  • Individuals in a cluster do not share a meaningful set of common characteristics with members of other clusters.

However, clustering is perhaps the most difficult task in data mining to pull off effectively! Why?

Note

Clustering is unsupervised learning.

Clustering is an example of unsupervised learning because we don’t have class labels to let us know when our clusters are “right”. We don’t know which customers will respond to a campaign (or buy a new type of product, etc.) ahead of time. If we did, it would instead be a problem for predictive analytics.

Note

Performing clustering is nearly impossible without a deep understanding of the business problem. Because there are not necessarily “right” answers, the techniques and interpretations are very subjective (what are “similar” individuals?). Clustering needs to be viewed as a jumping-off point for additional descriptive or predictive analytics and not as a stand-alone technique.

How do you know when your clustering scheme is correct?

Note

Your clustering scheme is “correct” when the clusters are useful for the type of analysis that needs to be done.

Completely dependent on the task at hand!

Let’s consider the following example that illustrates the difficulty of clustering.

Here are six objects. What is the best way to cluster them?

Person 1: obviously we have two clusters here. Sphere vs. cube! Shape is the defining characteristic.

Person 2: No, there are three clusters! Red vs. orange vs. blue! Color is the defining characteristic.

Person 3: No, there are only two clusters, but the “real” clusters are fruit vs. not fruit. Edibility is the defining characteristic.

So which of the three people is correct? Which is the right set of clusters?

Answer: all of them, and none of them!

Without a question we want to answer or a line of analysis we want to explore, each of these ways of clustering is as “good” as any other. As long as individuals within a cluster are somehow more similar (however we want to define it) to each other than they are to individuals in other cluster(s), it’s a valid scheme!

Which of the following is the “outlier” and doesn’t belong? The big blue ball, since it won’t fit on a table? The purple piece of art because of its unique color or shape? The apple because it has two dominant colors?

Difficulty of Clustering#

Although clustering is “unsupervised” because we don’t have the answers in the training data, it takes a tremendous amount of “supervision” on our part to do clustering.

  • We need a goal in mind!

  • We need to define exactly what it is that makes individuals similar or dissimilar to each other. Different metrics will give different results (e.g., cluster on purpose, cluster on color, cluster on shape). What kind of differences between individuals do we find meaningful or interesting?

  • We need to decide whether we want the groups to be meaningful, useful, or both. Meaningful clusters will reflect the structure in the data, but they may not be the most useful (clustering on shape is valid, but it doesn’t tell us what will be edible, e.g., a banana).

  • Clustering is only a starting point for more detailed analysis, e.g., summarization, input variable for predictive model.

Clustering Successes#

Although the task of clustering is not easy, there have been great successes involving its use even outside of business.

  • Biology - find groups of genes that have similar functions.

  • Medical - detect patterns in spatial or temporal distribution of a disease; identify different types of depression.

  • Information Retrieval - group search results into a small number of clusters, each of which captures a particular aspect of a query (e.g., search for “movie” and one cluster is reviews, another cluster is previews, another cluster is people involved in the movie, etc.).

Let’s discuss one example: RFM segmentation.

RFM (recency, frequency, monetary) analysis takes the marketing axiom that “80% of your business comes from 20% of your customers” and develops clusters. Often 11 different clusters are generated.

  • Champions (bought recently, buy often, spend the most)

  • Loyal Customers (spend good money often, responsive to promotions)

  • Potential Loyalist (recent customer, spends a good amount, bought more than once)

  • New Customers (bought most recently, but not often)

  • Promising (recent shopper but hasn’t spent much)

  • Customer Needing Attention (above average frequency/amount, but not too recent)

  • About to Sleep (below average recency, frequency; will lose if not reactivated)

  • At Risk (spent big money and purchased often in past; need to bring back!)

  • Can’t Lose Them (made biggest purchases and often but haven’t bought in long time)

  • Hibernating (last purchase was a long time ago; low spender, low # orders)

  • Lost (lowest recency, frequency, and monetary scores)
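Before any of these segments can be assigned, the raw recency, frequency, and monetary values have to be computed per customer. A minimal sketch in R; the transaction log, its column names, and the reference date are entirely hypothetical:

```r
# Hypothetical transaction log (customer id, purchase date, amount spent)
transactions <- data.frame(
  id     = c(1, 1, 2, 3, 3, 3),
  date   = as.Date(c("2023-01-05", "2023-06-01", "2022-03-15",
                     "2023-05-20", "2023-05-28", "2023-06-10")),
  amount = c(50, 80, 20, 200, 150, 300)
)
today <- as.Date("2023-06-15")

# One row per customer: days since last purchase (recency),
# number of purchases (frequency), total spent (monetary)
rfm <- data.frame(
  recency   = as.numeric(today) - tapply(as.numeric(transactions$date),
                                         transactions$id, max),
  frequency = as.vector(table(transactions$id)),
  monetary  = as.vector(tapply(transactions$amount, transactions$id, sum))
)
rfm
```

Customers would then be binned on each of the three scores (e.g., quintiles) and mapped into segments like the eleven above.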

Once clusters are discovered, what do you do with the info?

  • Champions - reward them! Will probably promote your brand. Early adopters?

  • Loyal Customers - upsell higher value products, ask for reviews, engage them!

  • Potential Loyalist - offer membership/loyalty program, recommend other products

  • New Customers - provide on-boarding support, give them early success, start building relationship

  • Promising - create brand awareness, offer free trials

  • Customer Needing Attention - make limited time offers, recommend products based on past purchase

  • About to Sleep - recommend popular products or renewals at discount; reconnect

  • At Risk - send personalized emails to reconnect, offer renewals

  • Can’t Lose Them - win them back via renewals/new products; talk to them

  • Hibernating - offer other relevant products and special discounts

  • Lost - revive interest with reach-out campaign, ignore otherwise

Measures of Similarity/Dissimilarity#

The philosophy behind clustering is simple: individuals with similar sets of characteristics are “alike” in some regard.

  • Emails with similar frequencies of the words “offer”, “rich”, “enlargement”, etc., have similar probabilities of being junk.

  • Customers with similar calling patterns, plans, and phones may have similar probabilities of churning.

  • Customers who buy similar sets of groceries may have similar probabilities of trying out a particular new product.

  • Customers who have purchased similar amounts of meat and fresh produce may purchase similar amounts of cereal.

This raises the question of how to measure similarity.

The major question is: how do we measure how “similar” two individuals are to each other?

It turns out it’s often easier to come up with a score of how “dissimilar” two individuals are to each other.

  • If the dissimilarity score is 0, then two individuals are identical (at least in regards to the characteristics we have measured).

  • Larger values of dissimilarity scores indicate progressively larger differences between individuals.

Surprisingly, there is no universally appropriate measure to quantify the level of dissimilarity between individuals.

Euclidean distance (“ruler” distance):

  • Measures the straight-line distance between individuals as if they were locations on a map, i.e., “physical proximity”.

  • Default measure for most distance-based algorithms.

Note

The dissimilarity (distance) between two individuals \(A\) and \(B\) with \(c\) measured characteristics \(x_1\), \(x_2\), \(\ldots\), \(x_c\) is:

\[d_{AB} = \sqrt{ (x_{1A} - x_{1B})^2 + (x_{2A} - x_{2B})^2 + \ldots + (x_{cA} - x_{cB})^2 }\]

The distance is the square root of the sum of squared differences in values for each feature. Note: the first number in the subscript refers to the variable being measured (first variable, second variable, etc., and can go from 1 up to \(c\), the number of measured characteristics). The second number in the subscript refers to the individual (first individual or second individual, so it is always either \(A\) or \(B\)).

Only one characteristic: \(d_{AB} = \sqrt{(x_A - x_B)^2} = |x_A - x_B|\) (absolute value of the difference)

# Illustrate d_AB when there is only one characteristic: a number line
plot(0,0,xlim=c(0,1),ylim=c(0,1),xlab="x",ylab="",yaxt='n',col="white")
abline(h=0.5,lwd=4)                             # the number line
lines( c(0.3,0.7),c(0.5,0.5),lwd=4,col="red")   # segment joining A and B
points( c(0.3,0.7),c(0.5,0.5),pch=20,cex=3)     # the two individuals
text(0.3,0.55,"xA")
text(0.7,0.55,"xB")
text(0.5,0.55,"d_AB")
\[d = | 0.3 - 0.7 | = 0.4\]

Two characteristics: \(d_{AB} = \sqrt{ (x_{1A} - x_{1B})^2 + (x_{2A} - x_{2B})^2}\)

# Illustrate d_AB with two characteristics: points in the plane
plot(0,0,xlim=c(0,1),ylim=c(0,1),xlab="x1",ylab="x2",col="white")
lines( c(0.3,0.7),c(0.3,0.8),lwd=4,col="red")   # segment joining A and B
points( c(0.3,0.7),c(0.3,0.8),pch=20,cex=3)     # A = (0.3, 0.3), B = (0.7, 0.8)
text(0.3,0.25,"xA")
text(0.7,0.85,"xB")
text(0.55,0.5,"d_AB")
\[ d = \sqrt{ (0.3-0.7)^2 + (0.3-0.8)^2 } = \sqrt{ 0.41 } \approx 0.64 \]

The concept of Euclidean distance still makes sense to us for three characteristics (since we can visualize three-dimensional space).

\[ d_{AB} = \sqrt{ (x_{1A} - x_{1B})^2 + (x_{2A} - x_{2B})^2 + (x_{3A} - x_{3B})^2 } \]

While we cannot grasp intuitively the concept of distance in four or more dimensions, the formula for calculating the distance in these higher dimensions (there will be a total of \(c\) dimensions with \(c\) recorded characteristics) is a straightforward extension.

\[ d_{AB} = \sqrt{ (x_{1A} - x_{1B})^2 + (x_{2A} - x_{2B})^2 + \ldots + (x_{cA} - x_{cB})^2 } \]

Shorthand:

\[ d_{AB} = \sqrt{ \displaystyle{ \sum_{i=1}^{c} \left(x_{iA} - x_{iB}\right)^2 } } \]

where \(i\) runs over the \(c\) measured characteristics.
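This formula translates directly into R. A minimal sketch with made-up values for two individuals measured on three characteristics (R’s built-in `dist()` uses Euclidean distance by default):

```r
# Two individuals A and B measured on c = 3 characteristics
A <- c(1.0, 2.5, 0.3)
B <- c(2.0, 0.5, 0.7)

# Direct translation of the formula:
# square root of the sum of squared differences
d_AB <- sqrt(sum((A - B)^2))
d_AB

# dist() computes the same Euclidean distance by default
dist(rbind(A, B))
```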

Data cleaning: scaling#

Before doing clustering, it is important to scale each variable, which forces each characteristic to vary over approximately the same range of values.

  • Scale each variable to have a mean of 0 and a standard deviation of 1 (the standard approach).

  • Scale each variable to be between 0 and 1 (ok when values are roughly equally likely over some range).

Scaling is also known as standardizing, and you will see these words used interchangeably.

Scaling allows each feature to contribute equally to the clustering scheme, i.e., each characteristic is on equal footing.

Note

Scaling is critical so that every variable contributes more or less equally to the dissimilarity score. If scaling is not done first, then whichever characteristic varies over the biggest range will likely dominate the dissimilarity score!


Imagine we are clustering based on a woman’s \(x_1 =\) shoe size and \(x_2 =\) yearly income. Shoe sizes tend to range from 5 to 9 for women, and incomes could vary from 0 to a few million.

Imagine Jane has \(x_1 = 5\) and \(x_2 = 45000\) while Mary has \(x_1 = 8\) and \(x_2 = 75000\). The Euclidean distance between them is:

\[d = \sqrt{ (5-8)^2 + (45000-75000)^2 } = \sqrt{ 900000009 } \approx 30000\]

Notice how the distance is basically just equal to the difference in incomes (because these differences were large in comparison to differences in shoe size). Characteristics that vary over a larger range in values will dominate the dissimilarity score!

For the most part, we are going to scale each variable so that the average is 0 and the standard deviation is 1. This is easily accomplished. Calculate the average value of the variable and its standard deviation, then compute new values according to:

\[Scaled \, Value = \frac{ Original \, Value - Average}{Standard \, Deviation }\]

Scaling between a range of values \(A\) to \(B\) can be accomplished once we know the maximum and minimum of the original values via:

\[Scaled \, Value = A + (B-A) \times \frac{ Original \, Value - Minimum}{Maximum - Minimum } \]
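Both scalings can be sketched in R with the Jane/Mary numbers from above (`minmax` is an illustrative helper, not a built-in function; `scale()` is built in):

```r
# Jane and Mary from the example above: x1 = shoe size, x2 = income
X <- cbind(shoe = c(5, 8), income = c(45000, 75000))

# Unscaled Euclidean distance: income dominates (about 30000)
sqrt(sum((X[1, ] - X[2, ])^2))

# scale() subtracts each column's mean and divides by its standard deviation
Z <- scale(X)

# After scaling, both characteristics contribute equally
# (with only two individuals the distance works out to exactly 2)
sqrt(sum((Z[1, ] - Z[2, ])^2))

# Min-max scaling to the range [A, B] (here A = 0, B = 1)
minmax <- function(x, A = 0, B = 1) A + (B - A) * (x - min(x)) / (max(x) - min(x))
minmax(X[, "income"])
```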

Categorical variables and distances#

Since categorical variables are non-numerical, dissimilarity metrics have to give them special consideration.

Note

You cannot simply replace the levels of a categorical variable with numbers (e.g., Southeast=1, Northeast=2, Northwest=3, Southwest=4) and treat it as a numerical value!

You can only replace levels of a categorical variable with numbers if the categorical variable is ordinal (meaning that the levels have a meaningful order).

In that case, it may make sense to treat the levels with an appropriately chosen set of numbers. For example, if the levels of Opinion are Disagree, Neutral, Agree, then it might be ok to treat the levels as the numbers -1, 0, and 1, respectively.

However, you have to be smart about it! If the levels are “small”, “medium”, and “large”, and the values in “large” are, say, typically three times as big as those in “small”, the numbers chosen should reflect that ratio. Choosing 12 to represent “large” and 4 to represent “small” would be ok, but choosing 100 and 4 would not.

In the \(CUSTLOYALTY\) dataframe there is an ordinal variable called \(Income\) with levels \(f0t30\), \(f30t45\), \(\ldots\), \(f75t90\), \(f90toINF\) that represents the yearly income of the individual. Since these have a natural ordering, it may be ok to treat them as numbers.

data(CUSTLOYALTY, package='regclass')
x <- CUSTLOYALTY$Income
levels(x)  # 'f0t30' 'f30t45' 'f45t60' 'f60t75' 'f75t90' 'f90toINF'
levels(x) <- c("30","45","60","75","90","150") #new levels
x <- as.numeric( as.character(x) ) #convert from a factor to numbers
table(x)
x
 30  45  60  75  90 150 
 47  97  97 106  96  57 

Here I chose to represent each level by the number at the upper end of its income range (and chose 150 for the “above 90K” group). This scheme makes some sense (90 is three times 30, and the 75-90 level has about three times as much income as the 0-30 level), though the 150 feels a bit arbitrary.

Non-ordinal variables#

When the levels do not have a natural ordering, the simplest way to represent the “difference” in two individuals’ characteristics is to let

\[d_{AB} = 1 \hspace{.2in} \textrm{if individuals $A$ and $B$ have different levels}\]
\[d_{AB} = 0 \hspace{.2in} \textrm{if individuals $A$ and $B$ have the same level}\]

For example, if state of birth was a characteristic in the data, then when calculating the component of the distance due to differences in this variable, the “\((x_A-x_B)^2\)” term would either be 0 if the individuals have the same value or 1 if they are different.
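This 0/1 scheme is a one-liner in R. A sketch (`cat_diff` is a hypothetical helper name, not a standard function):

```r
# Contribution of a non-ordinal categorical variable to the distance:
# 1 if the two individuals have different levels, 0 if they match
cat_diff <- function(a, b) as.numeric(a != b)

cat_diff("Tennessee", "Tennessee")  # same state of birth
cat_diff("Tennessee", "New York")   # different states
```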

Incorporating categorical variables into distance calculations#

The easiest way to incorporate categorical variables is to add a set of indicator variables to represent the levels. Unlike in a regression (where to incorporate an \(L\)-level categorical variable we define a set of \(L-1\) indicator variables), all levels will receive an indicator variable.

Example: Gender (Male/Female)

\[\begin{split} Male = \begin{cases} 0, & \mbox{if female} \\ 1, & \mbox{if male} \end{cases} \hspace{.4in} Female = \begin{cases} 1, & \mbox{if female} \\ 0, & \mbox{if male} \end{cases} \end{split}\]

Example: Color (Blue, Red, Green)

\[\begin{split} Red = \begin{cases} 1, & \mbox{if red} \\ 0, & \mbox{otherwise} \end{cases} \hspace{.2in} Green = \begin{cases} 1, & \mbox{if green} \\ 0, & \mbox{otherwise} \end{cases} \hspace{.2in} Blue = \begin{cases} 1 & \mbox{if blue} \\ 0, & \mbox{otherwise} \end{cases} \end{split}\]
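One way to generate such indicators in R is `model.matrix()`. A sketch with a made-up Color variable:

```r
# Example data: a 3-level Color variable
color <- factor(c("Blue", "Red", "Green", "Red"))

# model.matrix() normally drops one level (the regression convention);
# removing the intercept with "- 1" keeps an indicator for every level
indicators <- model.matrix(~ color - 1)
indicators
```

Each row has exactly one 1, marking that individual’s level.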

Note

Scaling the numerical variables should happen before converting categorical variables into indicator variables.

Indicator Variables and Scaling#

It is a contentious issue as to whether you scale indicator variables or not.

  • In my opinion, you should be creating sets of indicator variables that are equal to 0 or \(1\) and scaling them.

  • Most procedures just create indicator variables that are 0/1.

  • Opinion is split as to whether you should scale them. There will be some situations where one method works better, and others that favor the other method. Most default procedures in R won’t scale them.

Summary#

Clustering is difficult#

Clustering individuals into groups is difficult because there are no “right” or “wrong” ways of coming up with a clustering scheme, only those that are more useful than others.

To “do clustering” you need to have an idea ahead of time about what you want to get out of the clustering scheme. Otherwise, you won’t be able to assess whether the clustering makes sense, is interesting, or is useful.

\(K\)-means and hierarchical clustering are both effective ways of identifying groups of individuals that are similar to each other in some meaningful way, but you’ll need to identify exactly what individuals in clusters have in common and what they do not have in common.

Data Cleaning#

Because the clustering techniques we have discussed measure similarity in terms of Euclidean distance (imagining individuals are locations on a map, and using the ruler distance between them), it is important that each variable have a roughly symmetric (or at least not extremely skewed) distribution, so transform your characteristics to symmetry when appropriate.

Also, it is important that each variable is scaled (mean of 0, standard deviation of 1) so that each variable is on equal footing when it comes to determining the clustering scheme.

Interpreting clusters#

Using the \(aggregate\) command to find the average (or median) value of each characteristic for each cluster lets us know what traits individuals have in common and how they differ between and within clusters.

  • Positive values mean above average; negative values mean below average. When the data has been transformed, however, it’s difficult to tell exactly how much above/below average those individuals are.

  • Values of \(\pm 1\) or beyond get special attention because they show that the characteristic is quite different from the overall average (i.e., it is a “cluster-defining characteristic”).

  • Looking at the median values of the originally recorded values for each cluster and talking about what traits separate one cluster from another completes the story.
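A sketch of this interpretation workflow, using the built-in iris measurements purely as stand-in data (any scaled numeric data frame would work the same way):

```r
# Scale the characteristics, then cluster
X <- scale(iris[, 1:4])
set.seed(1)
km <- kmeans(X, centers = 3, nstart = 20)

# Average scaled value of each characteristic within each cluster;
# entries near +/-1 or beyond flag cluster-defining characteristics
profiles <- aggregate(X, by = list(cluster = km$cluster), FUN = mean)
profiles
```

Repeating the `aggregate` call on the original (unscaled) values, with `FUN = median`, completes the story in interpretable units.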

Determining number of clusters#

Although there are graphical techniques for estimating a “reasonable” number of clusters, it might be a good idea to let your interpretation of them guide you.

  • Make two clusters and see how they differ from each other.

  • Make three clusters and see if the third cluster is “significantly” different and adds a lot more to the story/analysis based on what you want to get out of the clustering scheme. If so, try 4. If not, stay at 2.

  • Etc.
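This try-2-then-3 approach can be supplemented with one of the graphical aids mentioned above: refit for successive values of \(k\) and watch the total within-cluster sum of squares shrink (a sketch on stand-in data, the scaled iris measurements):

```r
# Total within-cluster sum of squares for k = 2..6;
# a sharp "elbow" suggests little is gained by adding more clusters
X <- scale(iris[, 1:4])
set.seed(1)
wss <- sapply(2:6, function(k) kmeans(X, centers = k, nstart = 20)$tot.withinss)
names(wss) <- 2:6
round(wss, 1)
```

The numbers always decrease as \(k\) grows, so the question is where the decrease stops being worth the extra complexity, judged against what you want out of the clustering scheme.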