Unsupervised Clustering Models
My original unsupervised learning project was based on data from a local ski resort. The objective was to group skiers into categories based on specific characteristics: how often they visited the resort, how much they spent on tickets, how far they travelled, and the types of tickets they purchased. The idea was that these customer groups could be used to improve marketing strategies and grow the skier retention rate.
After I built my initial K-Means cluster model for my Flatiron Capstone project, it was recommended that I try some different models for comparison. The following is my account of the models I attempted and a look at their outcomes. The internet has many clustering examples that use the Iris dataset, or random clusters in only two dimensions (which are visually pleasing to graph), but few of them tackle specific applications or build the more complex, multi-dimensional models needed outside an academic environment.
One of the first places I looked for help was the scikit-learn website, to read about other options. While its list of clustering models is robust and very detailed, I find it overwhelming and possibly designed for data scientists with much more experience. I do love the colorful chart below, though, along with the table (not pictured here) of highlighted features and specific parameters for each method.
K-Means Cluster Model
The basic model that I created used the k-means clustering algorithm. Below is the fairly straightforward code I used: initiating the model, fitting and predicting, and then adding the cluster labels back into the original data frame. Knowing that I would need to run the model numerous times, I started by creating a copy of the cleaned and standardized dataset, which I could use again for each of my different models.
from sklearn.cluster import KMeans
# Create a copy of cleaned, standardized data
kmeans_all_customers = all_customers.copy()
# Set clusters = 5 and include random state for reproducibility
kmeans_all_customers_model_5 = KMeans(n_clusters=5, random_state=123)
# Fit and predict the model based on the following columns
y = kmeans_all_customers_model_5.fit_predict(scaled_all_customers_df[['Number of Trips', 'Total Revenue', 'Mean Order Time', 'Adult Tickets', 'Youth/Senior Tickets', 'Miles to Resort']])
# Add cluster feature back to data frame copy
kmeans_all_customers['Cluster'] = y
# Create a groupby of the clusters
kmeans_model_clusters = kmeans_all_customers.groupby('Cluster')
# Run my examine_cluster function to get a better look at the cluster stats:
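The snippet above relies on a standardized frame called scaled_all_customers_df that is created elsewhere. As a minimal sketch of how such a frame might be built, assuming scikit-learn's StandardScaler was the tool used (the post does not say), here is an example on a tiny made-up table. The values and the three-column subset are placeholders for illustration, not the resort data.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the cleaned customer data (values are made up).
all_customers = pd.DataFrame({
    "Number of Trips": [1, 4, 12, 2, 7],
    "Total Revenue": [89.0, 310.0, 1250.0, 150.0, 640.0],
    "Miles to Resort": [15.0, 42.0, 8.0, 230.0, 60.0],
})

feature_cols = ["Number of Trips", "Total Revenue", "Miles to Resort"]

# StandardScaler rescales each column to mean 0 and unit variance, so that
# no single feature (for example revenue in dollars) dominates the distances.
scaler = StandardScaler()
scaled_all_customers_df = pd.DataFrame(
    scaler.fit_transform(all_customers[feature_cols]),
    columns=feature_cols,
    index=all_customers.index,  # keep row alignment with the original frame
)

print(scaled_all_customers_df.round(2))
```

Keeping the original index is what makes it safe to fit on the scaled frame and then attach the labels to the unscaled copy, as the code above does.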
Finally, I ran my newly modified data frame through my ‘examine_clusters_again’ function, which summarizes each cluster by either a mean or a total for a few of the features I wanted to highlight. (I wrote this code to analyze the results of the cluster model, but decided not to paste it into this blog for the sake of time and space.) The function was originally written for just the k-means model, so I had to tweak it a little in order to use it ‘again’ with a variety of new models. Below you can see each cluster and how the mean of customers in each group compares.
- Cluster 0 has the highest Number of Visitors
- Cluster 1 has the highest Revenue Mean and highest Trips Mean
- Cluster 2 has the highest Miles Away Mean and highest Order Mean
- Cluster 3 has the highest Number of Visitors
- Cluster 4 has the highest Order to Trip Mean
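Since the ‘examine_clusters_again’ function itself is not shown in this post, here is a minimal, hypothetical sketch of the kind of per-cluster summary it describes. The name summarize_clusters, the chosen statistics, and the toy values are all assumptions for illustration, not the original function or data.

```python
import pandas as pd


def summarize_clusters(df, cluster_col="Cluster"):
    """Hypothetical helper: one row of summary statistics per cluster."""
    return df.groupby(cluster_col).agg(
        visitors=("Total Revenue", "size"),      # number of customers in the cluster
        revenue_mean=("Total Revenue", "mean"),
        trips_mean=("Number of Trips", "mean"),
        miles_mean=("Miles to Resort", "mean"),
    )


# Toy frame with made-up values, already labelled with a 'Cluster' column.
toy = pd.DataFrame({
    "Cluster": [0, 0, 1, 1, 1],
    "Total Revenue": [100.0, 200.0, 900.0, 1100.0, 1000.0],
    "Number of Trips": [1, 3, 10, 12, 14],
    "Miles to Resort": [20.0, 40.0, 5.0, 10.0, 15.0],
})

summary = summarize_clusters(toy)
print(summary)

# Which cluster leads on each statistic.
print(summary.idxmax())
```

Because the helper only needs a frame with a ‘Cluster’ column, the same call would work on the labelled output of any of the models in this post.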
I also used various graphs to explore the clusters visually. Again, for the sake of this article, I don’t want to include every graph for each cluster and each method. Instead, below is one example of a box plot and the Python code used.
# Plotting libraries used for the box plot
import matplotlib.pyplot as plt
import seaborn as sns
# Get a visual on the revenue characteristics of each cluster using revenue.
sns.boxplot(x='Cluster', y='Total Revenue', data=kmeans_all_customers)
plt.title('KMeans Customer Revenue', fontsize=24)
plt.xlabel('Cluster Number', fontsize=20)
plt.ylabel('Total Revenue', fontsize=20)
My box plot above shows that cluster 1 has far and away the greatest mean total revenue for the skiers in this group. From this information I am able to target these skiers for marketing purposes, or perhaps look at the skiers in other clusters, such as cluster 3, where the total revenue mean is the lowest.
I was able to create a box plot like this for each of the five clusters and cross-reference it with the column features that I created in my function. For the rest of this post, let’s just look at the Revenue Mean column and how the different models compared, both in the output of the ‘examine_clusters_again’ function and in the box plots of revenue mean.
Gaussian Mixture Model
I read that Gaussian Mixture is one of the faster algorithms, and it seemed to have pretty straightforward parameters. After reading about some of its history, as well as how it uses Mahalanobis distances, I decided to run it.
“A Gaussian mixture model is a probabilistic model that assumes all the data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters. One can think of mixture models as generalizing k-means clustering to incorporate information about the covariance structure of the data as well as the centers of the latent Gaussians.” [Source]
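To make the quoted idea a little more concrete, here is a minimal sketch on synthetic two-dimensional data (not the resort data). It shows the two things a Gaussian mixture adds over k-means: a fitted covariance matrix per component and a probability of membership in each component, rather than only a hard label. The blob locations and sizes are arbitrary assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(123)

# Two well-separated synthetic blobs of 100 points each.
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(100, 2)),
    rng.normal(loc=[5.0, 5.0], scale=0.5, size=(100, 2)),
])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=123)
labels = gmm.fit_predict(X)      # hard label: the most likely component
probs = gmm.predict_proba(X)     # soft assignment: one probability per component

print(gmm.means_)                # fitted centers of the latent Gaussians
print(gmm.covariances_.shape)    # one 2x2 covariance matrix per component
print(probs[:3].round(3))        # each row sums to 1
```

With covariance_type="full" each component can stretch and rotate to fit its cluster, which is the "covariance structure" the quote refers to.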
from sklearn.mixture import GaussianMixture
# Create a new copy of cleaned, standardized data
gausmix_all_customers = all_customers.copy()
# Set clusters = 5 and include random state for reproducibility
gaussian_all_customers_model_5 = GaussianMixture(n_components=5, random_state=123)
# Fit and predict the model based on columns
y = gaussian_all_customers_model_5.fit_predict(scaled_all_customers_df[['Number of Trips', 'Total Revenue', 'Mean Order Time', 'Adult Tickets', 'Youth/Senior Tickets', 'Miles to Resort']])
# Add cluster feature back to data frame copy
gausmix_all_customers['Cluster'] = y
# Create a groupby of the clusters
gausmix_model_clusters = gausmix_all_customers.groupby('Cluster')
# Run my examine_cluster function to get a better look at the cluster stats:
The code above is pretty much the same as the code for my k-means model, only using the Gaussian Mixture model type. The ‘examine_clusters_again’ function gives the cluster results below.
- Cluster 0 has the highest Number of Visitors
- Cluster 1 has the highest Revenue Mean and has the highest Trips Mean
- Cluster 2 is just sort of an average group across the board.
- Cluster 3 has the highest Miles Away Mean and has the highest Order mean
- Cluster 4 has the lowest Youth Tickets Mean, which is none!
It’s interesting to me that the numbers are very different, but this box plot turned out pretty similar to the one from the previous model, especially for cluster 1. One obvious difference is that the number of outliers seems to be lower in the Gaussian Mixture model, especially in cluster 2 (which seems to be filled with average customers).
Looking more closely at the specific cities that customers in cluster 1 came from, compared with cluster 1 in the k-means model, there was definitely some overlap, even if the top ten cities were slightly different. I don’t plan to show these outputs, for the sake of privacy.
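One way the overlap between two models could be checked without exposing any customer details is to compare the label vectors directly. The sketch below uses short, made-up label lists (not the real assignments) with a pandas cross-tabulation and scikit-learn's adjusted Rand index, which scores agreement between two clusterings regardless of how the clusters happen to be numbered.

```python
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Hypothetical cluster labels for the same eight customers from two models.
kmeans_labels = [0, 0, 1, 1, 2, 2, 2, 0]
gausmix_labels = [1, 1, 0, 0, 2, 2, 0, 1]

# Rows are k-means clusters, columns are Gaussian mixture clusters;
# each cell counts the customers that received that pair of labels.
overlap_table = pd.crosstab(
    pd.Series(kmeans_labels, name="KMeans"),
    pd.Series(gausmix_labels, name="GaussianMixture"),
)
print(overlap_table)

# 1.0 means identical groupings (up to renumbering); values near 0 mean
# the agreement is about what random labelling would give.
ari = adjusted_rand_score(kmeans_labels, gausmix_labels)
print(round(ari, 3))
```

Note that cluster numbers are arbitrary in every model here, so "cluster 1" in one model only corresponds to "cluster 1" in another if a table like this shows that the same customers landed in both.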
Agglomerative Clustering
Once again, I ran all the code above, only this time I used Agglomerative Clustering. Using a bottom-up hierarchical approach, this model has a “rich-get-richer” behavior that can lead to uneven cluster sizes. For me, it seemed to do a pretty good job of spreading out the data based on the characteristics I had hoped to find (which are summarized by my ‘examine_clusters_again’ function).
“The Agglomerative Clustering object performs a hierarchical clustering using a bottom up approach: each observation starts in its own cluster, and clusters are successively merged together.” [Source]
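The agglomerative code is not shown in this post, so here is a minimal sketch of the same fit-and-label pattern on synthetic blobs (not the resort data). The choice of Ward linkage is an assumption; it is scikit-learn's default. Counting the members of each cluster is a quick way to see whether the uneven cluster sizes mentioned above have appeared.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Synthetic data standing in for the standardized customer features.
X, _ = make_blobs(n_samples=300, centers=5, cluster_std=1.0, random_state=123)

# Bottom-up merging until 5 clusters remain. Ward linkage merges the pair of
# clusters that increases within-cluster variance the least.
agglo = AgglomerativeClustering(n_clusters=5, linkage="ward")
labels = agglo.fit_predict(X)

# Number of points per cluster; very unequal counts would indicate the
# "rich-get-richer" effect.
cluster_sizes = np.bincount(labels)
print(cluster_sizes)
```

Unlike KMeans and GaussianMixture, AgglomerativeClustering takes no random_state, because the merge procedure is deterministic for a given dataset.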
- Cluster 0 is the sort of second rate group this time.
- Cluster 1 has the highest Revenue Mean and has the highest Trips Mean
- Cluster 2 has the highest Miles Away Mean
- Cluster 3 has the highest Order Mean
- Cluster 4 has the highest Number of Visitors
Again, cluster 1 came out as the Total Revenue champion. And again, looking at some of the cities where these skiers came from, it was obvious that all the models are grouping similar customers together. This box plot is also similar to those of the two models above, with perhaps even less variation among the remaining clusters (notice how clusters 2, 3 and 4 all have pretty low mean revenues). At the same time, I think this model ended up with more outliers and less variance from the mean.
Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
The DBSCAN model really didn’t work out for me. I’m not sure what I need to do in order to get 5 distinct clusters. Perhaps this just isn’t the best application of the model. I tried tirelessly to adjust the parameters, to no avail. The documentation says that you need to adjust ‘min_samples’ and ‘eps’, but I still couldn’t get it to group the data in a successful way.
“The DBSCAN algorithm views clusters as areas of high density separated by areas of low density…There are two parameters to the algorithm, min_samples and eps, which define formally what we mean when we say dense. Higher min_samples or lower eps indicate higher density necessary to form a cluster.” [Source]
Here are the lone two clusters formed by my DBSCAN model. Since it grouped pretty much all the customers together, I didn’t bother creating a box plot or looking more deeply into these clusters. Instead, I played with the parameters and spent a lot of time waiting for the model to run!
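Part of the difficulty is that DBSCAN has no parameter for the number of clusters; the count falls out of ‘eps’ and ‘min_samples’. The sketch below, on synthetic blobs rather than the resort data, shows one common heuristic for picking ‘eps’ (looking at sorted nearest-neighbor distances) and how the number of clusters and noise points changes with it. The three trial values of eps are arbitrary assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Synthetic, standardized data standing in for the customer features.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=123)
X = StandardScaler().fit_transform(X)

min_samples = 5

# For each point, the distance to the farthest of its min_samples nearest
# neighbors (the point itself counts as the first, at distance 0).
# Plotting these sorted distances and looking for the "elbow" is a common
# way to choose a starting value for eps.
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)
kth_distances = np.sort(distances[:, -1])


def count_clusters(eps):
    """Return (number of clusters, number of noise points) for a given eps."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # -1 marks noise
    n_noise = int(np.sum(labels == -1))
    return n_clusters, n_noise


# Too small an eps labels most points as noise; too large merges everything.
results = {eps: count_clusters(eps) for eps in (0.01, 0.3, 5.0)}
print(results)
```

A result of one or two big clusters, like the one described above, is what the large-eps end of this range looks like: the neighborhoods are wide enough that the dense regions all connect.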
Spectral Clustering
This model didn’t really work out that well either. After a little bit of trial and error I was able to get 5 clusters, though once again, looking at the output, it didn’t really help me group the customers.
“The present version of SpectralClustering requires the number of clusters to be specified in advance. It works well for a small number of clusters, but is not advised for many clusters….Note that if the values of your similarity matrix are not well distributed, e.g. with negative values or with a distance matrix rather than a similarity, the spectral problem will be singular and the problem not solvable.” [Source]
- Cluster 0 has the highest Miles Away Mean
- Cluster 1 has the highest Revenue Mean
- Cluster 2 has the highest Trips Mean
- Cluster 3 has the highest Youth Tickets Mean
- Cluster 4 has the highest Number of Visitors
Even though I was able to get five clusters, I’m not sure it really counts when some of the clusters contain so few customers. It also seems strange that cluster 4 is so full of outliers, especially since it has the majority of customers anyway. Perhaps I need to go back, clean my data more and remove more of the outliers. Again, I spent some time playing with the parameters, but somehow this model wasn’t very useful.
All in all, I’m really glad I spent some time trying additional models. In the end, I think I’m still happiest with my k-means model, though there is still a lot more I could do to analyze how different customers behaved across the various clusters. I also hardly tried to create a visual that might better represent how the different clusters are being grouped, though since this is a multi-dimensional dataset, I believe that is beyond my Pythonic capabilities at this point!
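For reference, here is a minimal sketch of a spectral clustering run on synthetic blobs (not the resort data) that also counts the members of each cluster, which makes near-empty clusters easy to spot. The nearest-neighbors affinity and n_neighbors=10 are assumptions chosen for illustration; they are not necessarily the settings used for the results above.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_blobs

# Synthetic data standing in for the standardized customer features.
X, _ = make_blobs(n_samples=200, centers=5, cluster_std=0.6, random_state=123)

# The number of clusters must be given up front. The affinity option controls
# how the similarity matrix is built; "nearest_neighbors" connects each point
# to its closest neighbors instead of using raw distances.
spectral = SpectralClustering(
    n_clusters=5,
    affinity="nearest_neighbors",
    n_neighbors=10,
    assign_labels="kmeans",
    random_state=123,
)
labels = spectral.fit_predict(X)

# Points per cluster; minlength keeps a slot for any label that was never used.
cluster_sizes = np.bincount(labels, minlength=5)
print(cluster_sizes)
```

scikit-learn may warn that the neighbor graph is not fully connected on data like this; that is a warning rather than an error, but it is a hint that the similarity matrix deserves a closer look.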
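One common way to get a two-dimensional picture of multi-dimensional clusters is to project the features with principal component analysis (PCA) and color the points by cluster. The sketch below does this on synthetic six-feature data (not the resort data); PCA is a suggestion here, not something used in the analysis above.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic six-feature data standing in for the standardized customer features.
X, _ = make_blobs(n_samples=300, centers=5, n_features=6, random_state=123)
labels = KMeans(n_clusters=5, n_init=10, random_state=123).fit_predict(X)

# Project the six features onto the two directions of greatest variance.
pca = PCA(n_components=2, random_state=123)
coords = pd.DataFrame(pca.fit_transform(X), columns=["PC1", "PC2"])
coords["Cluster"] = labels

# Share of the total variance that survives the projection; a low value means
# the 2-D picture hides much of the structure.
explained = pca.explained_variance_ratio_.sum()
print(coords.head())
print(round(explained, 3))
```

Passing coords to something like sns.scatterplot(data=coords, x='PC1', y='PC2', hue='Cluster') would then draw the clusters on a single chart.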