How to fill those ski lifts? — Photo Credit: Colin Cassidy

Examining Small Ski Resort Data

Looking at Skier Data through KMeans Clusters

For my capstone data science project, I was able to get four years of skier data from a small New England ski resort. Not only do I enjoy skiing, but I thought I might be able to help out this local resort by providing some useful data analysis. All the big ski resort conglomerates have data science teams who examine a wide variety of customer data, from which ski lifts are most popular on a given day to which foods are ordered most at the mountain lodge.

Unfortunately, the data I received was limited to skiers who purchased tickets online. There was no data on day tickets, season pass holders, or summer visits. Most small resorts can't afford the ticket scanners or bar-coded tickets that let major resorts know exactly where skiers went and how much they skied on a given day, and small resorts probably don't want all that (Uncle Sam is watching) technology either!

Large resorts, though, are collecting as much data as they can and seem to be using it to their advantage: customizing trip packages, making individualized operational adjustments such as knowing which lifts may see higher skier volume or which mountain restaurants are going to be busy at certain times, and informing future expansion and marketing decisions.

Since the data for this project was loaned to me in confidence, I don't plan to walk through the whole EDA and cleaning process. Instead, I want to focus on the KMeans cluster models I built, as well as the specific function I wrote to help me interpret the cluster results. While I was able to find countless Stack Overflow threads and coding articles about visualizing data with a relatively small number of variables (or randomly constructed blobs), it was hard to find any KMeans clustering examples with real data and numerous features.

I loaded the .csv files I was given by the resort into a Jupyter notebook. For this project I used Python, along with a handful of standard libraries. I also imported a file from www.unitedstateszipcodes.org, which allowed me to obtain the latitude and longitude coordinates matching my zip codes.
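The setup looked roughly like this. The file names and column names below are placeholders, since the real resort files are confidential:

```python
import pandas as pd

# Four seasons of online ticket sales (placeholder file name)
df = pd.read_csv('resort_online_tickets.csv')

# Zip code reference table from www.unitedstateszipcodes.org,
# assuming it carries 'zip', 'lat', and 'lng' columns
zips = pd.read_csv('zip_code_database.csv', usecols=['zip', 'lat', 'lng'])

# Attach a latitude/longitude pair to every purchase
df = df.merge(zips, how='left', left_on='zip_code', right_on='zip')
```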

For the most part, I had some basic information about the guests for the last four seasons: city, state, zip code, amount paid for the tickets, order date, and trip date. There were a few extraneous columns with ticket numbers, order numbers, order status, etc., but I dropped these pretty quickly since I was unable to find much relevance in them.

There were numerous other features included as well: skier type (skier or snowboarder, for instance), skier experience (beginner, intermediate, advanced), birthdate, height, shoe size, whether they needed rental equipment, etc. While I would have enjoyed playing with this data more, it was only filled in for about 10% of the records and therefore had to be thrown out. I did do some fun EDA on these columns, keeping in mind that they probably didn't represent the average customer or the population as a whole, since people who ski often don't need to rent equipment and are familiar enough with the routine that filling out additional information is usually not required.

One of the most difficult aspects of the data was how it was organized. If a person bought two tickets for a particular day, they showed up as two separate purchases (most likely because only one person can use a ticket) in two separate rows. All the information about the tickets differed except for the 'address' and 'trip date' columns. The same was true if the customer came back later in the season, or even a year later, and bought two more tickets. The 'address' column was the only way I could identify the buyer, since names had been deleted for privacy. I knew that I would be calling some difficult groupby() methods in my future…

I learned that the majority of the guests, at least in my minimal data group, were skiers rather than snowboarders, and that the median age was around 26 (again, perhaps older skiers didn't fill in the "how old are you?" question on the online form). From there, I started my serious EDA with the difference between "order date" and "trip date". I had already converted the "birthdate" column to a datetime using the pd.to_datetime() method and cross-referenced it against "trip date" to figure out each skier's age at the time of their visit. Now I just had to do something similar with the "order date" and "trip date" columns, giving me a new column called "order_to_trip_days".
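The date math is short in pandas. A sketch, using column names assumed from the description above:

```python
import pandas as pd

# Parse the raw date strings into datetimes
for col in ['birthdate', 'order_date', 'trip_date']:
    df[col] = pd.to_datetime(df[col])

# Age at the time of the visit (assumes missing birthdates were dropped)
df['age_at_trip'] = ((df['trip_date'] - df['birthdate']).dt.days / 365.25).astype(int)

# Days between placing the order and actually skiing
df['order_to_trip_days'] = (df['trip_date'] - df['order_date']).dt.days
```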

It turns out that most people buy their online tickets one to two days in advance, probably depending mostly on the weather and snow conditions. The mean was actually around 5 days, but that is clearly right-skewed by the handful of vacation planners who had their date picked out a month or more in advance. I had hoped this information might give me some insight into the different purchasing types among the mountain's customers, but in reality I'm afraid most skiers are more concerned with just getting good snow!

Next I looked at zip codes. I had recently completed a project that involved cross-referencing zip codes with latitude and longitude from a separate dataset in order to get a visual on the zip code locations (since apparently it is next to impossible to map out zip code polygons using Python). Not only did I want to see where the skiers were coming from, but I also wanted to know how far away they lived. Since zip codes are really just categorical data with no numerical meaning, I needed another way to determine distance.

To be honest, the zip codes needed a lot of cleaning. Some people put in the full nine-digit ZIP+4 code, others left the field out completely. Some people entered random letters instead of numbers. Plus there was a whole assortment of stray characters standing in for the dash '-' between the standard five-digit zip code and the four-digit extension. I tried some fancy code that split the zip codes and kept only the first five digits, but in the end I just decided to join a separate, city-based dataset onto my original data frame to get a more complete zip code list.
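For what it's worth, the split-and-truncate route I abandoned only takes a few lines of pandas. A sketch, assuming the raw column is called 'zip_code':

```python
# Keep only entries that start with five digits, then truncate
# any ZIP+4 extension (e.g. '05602-1234' -> '05602')
df['zip_code'] = df['zip_code'].astype(str).str.strip()
df = df[df['zip_code'].str.match(r'\d{5}')]
df['zip_code'] = df['zip_code'].str[:5]
```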

In order to figure out how far customers were traveling to the resort, my project advisor pointed me to Google's Distance Matrix API. But I really didn't want to have to pay, especially since I would have thousands of zip codes to look up. Instead, I took a less precise and far simpler approach. I used geopy.distance to get a straight-line, "as the crow flies" distance, which follows the curvature of the earth but certainly doesn't account for the less-than-straight path of roads and highways. Considering the majority of travelers would be coming from New England, and probably not from very far away, I figured the tradeoffs were worth it.
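A sketch of the geopy approach, with made-up coordinates standing in for the resort and the assumed 'lat'/'lng' columns from the zip code merge:

```python
import geopy.distance

# Hypothetical coordinates for the resort's base area
RESORT = (44.53, -72.78)

# Geodesic distance in miles from each customer's
# zip code centroid to the resort
df['miles to resort'] = df.apply(
    lambda row: geopy.distance.distance((row['lat'], row['lng']), RESORT).miles,
    axis=1,
)
```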

I then added this new distance column to my data frame, calling it 'miles to resort'. You can set the distance to metric units as well, which might make more sense (as it usually does), but 'miles' just felt more familiar to me. I did a quick double-check of the numbers against a couple of different websites that give straight-line distances and found my numbers to be pretty similar.

At this point I was ready to start modeling. Of course, in order to do anything I first needed to get the data sorted. I used the .groupby() method to identify unique customers by 'address' and paired each with their 'trip date' values.

I then did a similar groupby for my other columns, adding them to the original groupby. The only other numerical columns I had involved the price of the tickets (which changed depending on the year as well as the type of ticket), the distance feature I created, and the 'order_to_trip_days' column we worked on earlier.
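Put together, the aggregation looked something like this. The aggregated column names here are my stand-ins, not the resort's actual ones:

```python
# Collapse ticket rows into one row per unique customer ('address'),
# since address is the only buyer identifier left in the data
customers = df.groupby('address').agg(
    trips=('trip_date', 'nunique'),             # distinct days visited
    revenue=('amount_paid', 'sum'),             # total spent on tickets
    miles=('miles to resort', 'first'),         # same for every ticket
    order_days=('order_to_trip_days', 'mean'),  # avg booking lead time
)
```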

Before running the model I wanted to scale the data, since two of the features above clearly have much larger ranges. I decided to use StandardScaler() because KMeans relies on distance-based metrics. I also wanted to convert the scaled array back into a data frame so I could keep the column names.
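A sketch of that step, continuing from the hypothetical `customers` frame above:

```python
from sklearn.preprocessing import StandardScaler
import pandas as pd

scaler = StandardScaler()
scaled = scaler.fit_transform(customers)

# fit_transform() returns a plain NumPy array, so rebuild a
# DataFrame to keep the column names and the address index
customers_scaled = pd.DataFrame(
    scaled, columns=customers.columns, index=customers.index
)
```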

Now, with my new data frame created, it was time to build my model. I used a WCSS (Within-Cluster Sum of Squares) elbow plot to get an idea of the optimal cluster parameter and then built a variety of different models from there. The elbow graph suggested that several cluster counts could work, so I ended up running models with 3, 5, and 7 clusters. After looking at the results, I decided that the model with n_clusters equal to 5 had the most interpretable results.
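A sketch of the elbow-plus-fit workflow, again using the assumed frame names from the earlier snippets:

```python
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# WCSS (inertia) for k = 1..10 to draw the elbow plot
wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(customers_scaled)
    wcss.append(km.inertia_)

plt.plot(range(1, 11), wcss, marker='o')
plt.xlabel('number of clusters (k)')
plt.ylabel('WCSS')
plt.show()

# The elbow was ambiguous, so compare k = 3, 5, and 7;
# five clusters proved the most interpretable
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
customers['cluster'] = kmeans.fit_predict(customers_scaled)
```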

Notice the new cluster column. Depending on the model I created and the number of clusters chosen, the values in the new column on the right changed. I thought about doing another groupby to look at the items in each cluster, but I found it difficult to compare cluster characteristics and view all the data at the same level that way.

Finally, I created the function below, which loops through the original groupby data frame, depending on the number of clusters set, and outputs a new data frame with some basic stats for each cluster. I experimented with different stats in order to get a variety of data on each cluster and help determine the type of customer and their characteristics.
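The original function isn't reproduced here, but a sketch of the idea, built on the hypothetical column names from the earlier snippets, might look like:

```python
def cluster_summary(df, n_clusters,
                    cols=('trips', 'revenue', 'miles', 'order_days')):
    """Build one row of basic stats per cluster so the
    groups can be compared side by side."""
    rows = []
    for c in range(n_clusters):
        members = df[df['cluster'] == c]
        row = {'cluster': c, 'customers': len(members)}
        for col in cols:
            row[f'{col}_mean'] = members[col].mean()
            row[f'{col}_median'] = members[col].median()
        rows.append(row)
    return pd.DataFrame(rows).set_index('cluster')

summary = cluster_summary(customers, n_clusters=5)
```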

From this data frame I can pinpoint which characteristics set each cluster apart. Cluster 0 has the highest mean revenue (Revenue Mean), as well as the highest mean number of trips (Trips Mean).

Skiers from further away tended to buy their tickets more in advance (a higher 'order mean'), though not exclusively (Cluster 4). The largest cluster was made up mostly of skiers who lived close to the mountain (Cluster 1), while the skiers who lived farthest away landed in Cluster 3.

Looking at my model, it's easy to see that distance was an important feature. Order time, or how far in advance customers bought their tickets, also seemed to vary greatly from cluster to cluster, whereas number of trips and ticket revenue were much more similar. I still need to determine whether any of these differences are statistically significant and how they might play a role in marketing to new customers.
