I recently started working on an independent data science project using a dataset I found on Kaggle. The objective is to build a model that will help classify mushrooms as either poisonous or edible based on a number of visually describable characteristics. This post will focus on turning the categorical data used to describe different parts of mushrooms into numerical data that can then be used in a classification model.
To eat or not to eat
Now I love mushrooms. I love them sautéed or fried, I love them on pizza, in an omelet, on a salad. You name it. I even enjoy seeing them in nature, though I’m scared to death of foraging in the woods for my own. Mostly because no matter how much I read about mushrooms, it still seems pretty difficult to tell which ones are going to taste yummy and which ones are going to give you an upset stomach, or worse, a trip to the emergency room.
But as much as I would enjoy discussing what I have learned about how to tell apart different varieties of mushrooms, today I would like to discuss an interesting issue that arose during my data analysis.
This mushroom dataset found on Kaggle comes from the UCI Machine Learning Repository and has been around for about thirty years. As I load the data into a pandas DataFrame using Python, I notice that all the features are categorical. Even the target column, “class”, uses either the string ‘p’ for poisonous or ‘e’ for edible. Someone clearly knew their mushrooms back then!
# Load the data
import pandas as pd
df = pd.read_csv("mushrooms.csv")
Nevertheless, I know that if I’m going to build a model, I will have to turn these values into numerical data. Now the two ways that I am familiar with, and there may be others I haven’t researched yet, are Label Encoding and One Hot Encoding.
I first decided to try Label Encoding, using a label encoder such as scikit-learn’s LabelEncoder, because I had seen a few notebooks online where other aspiring data scientists like myself had converted each column to integers, one for each distinct attribute that the column contained.
For instance, ‘e’ which stands for edible would become a ‘0’ and ‘p’ for poisonous would become ‘1’.
Side note here… I’m still confused as to whether the target class of ‘edible’ should be a 0 or a 1. Usually the thing we want would become a ‘1’, such as in a feature that has ‘yes’/‘no’ or ‘win’/‘loss’ values. I would think these would be 1/0, but I’m not sure if it matters to the supervised model.
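On that side note, here is a minimal sketch of how to choose which class becomes 1, assuming a tiny hand-made sample rather than the real file. It uses an explicit mapping with pandas’ Series.map instead of a label encoder, so the choice is visible in the code. The sample values and the variable names are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the 'class' column (values invented for illustration)
target = pd.Series(["p", "e", "e", "p", "e"], name="class")

# Explicit mapping: poisonous is treated as the positive class (1)
y_poison_positive = target.map({"e": 0, "p": 1})

# Flipped mapping: edible is treated as the positive class (1)
y_edible_positive = target.map({"e": 1, "p": 0})

print(y_poison_positive.tolist())  # [1, 0, 0, 1, 0]
print(y_edible_positive.tolist())  # [0, 1, 1, 0, 1]
```

As far as I can tell, a classifier separates the two classes equally well either way. The choice mostly matters when reading metrics such as precision and recall, which are usually reported for whichever class is labeled 1.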
Now this Label Encoding works great when there are just two attributes. But in many of the other columns, such as the mushroom ‘cap-color’, we see that there are many different possibilities. Take a look at all of them:
# cap-color before Label Encoding
In this particular column there are ten different options, and the documentation explains what each string represents. I’m not sure why these letters were chosen, but the mapping is: ‘cap-color’ = [brown=n, buff=b, cinnamon=c, gray=g, green=r, pink=p, purple=u, red=e, white=w, yellow=y].
What I found was that when I used the label encoder, it transformed each attribute to an integer according to its alphabetical order (from 0–9), which is to be expected.
# cap-color after label encoding
What I didn’t want was for one color to carry a value that was worth more than another. I didn’t want ‘n’, which became a 4, to be worth twice as much as ‘e’, which became a 2, since ‘brown’ isn’t more important than ‘red’. The more I researched Label Encoding, the more I found that it is most often used when there is a hierarchy to the data, which is clearly not present here. Therefore, before building my models and doing much more EDA, I decided to try the other method.
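To make the alphabetical ordering concrete, here is a small sketch assuming scikit-learn’s LabelEncoder is the encoder in use (my assumption, since the exact call isn’t shown here). It encodes one sample of each cap-color letter from the documentation rather than the real column.

```python
from sklearn.preprocessing import LabelEncoder

# One sample of each cap-color code from the dataset documentation
cap_colors = ["n", "b", "c", "g", "r", "p", "u", "e", "w", "y"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(cap_colors)

# classes_ holds the sorted unique labels; a label's position is its code
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}

print(mapping["e"], mapping["n"])  # 2 4
```

The integers come purely from sorting the letters, so any model that treats them as magnitudes would read ‘brown’ (4) as twice ‘red’ (2), even though the colors have no such relationship.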
One Hot Encoding
One Hot Encoding can be done with the pandas function pd.get_dummies(). Basically this creates a new column for every attribute of every feature. So for the feature called ‘odor’, with the attributes [‘almond’=a, ‘anise’=l, ‘creosote’=c, ‘fishy’=y, ‘foul’=f, ‘musty’=m, ‘none’=n, ‘pungent’=p, ‘spicy’=s], One Hot Encoding would create 9 columns, or 8 if the first attribute is dropped. Each column is filled with zeros except in the rows where that specific attribute is present, where you see a one instead.
# odor attributes with values dataframe 'X'
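Here is a minimal sketch of what pd.get_dummies() does to the odor feature, assuming a toy frame with one row per odor letter listed above rather than the real data. The prefix argument and the variable names are my own choices for illustration.

```python
import pandas as pd

# One sample row per odor code (toy data, not the real file)
odor = pd.DataFrame({"odor": ["a", "l", "c", "y", "f", "m", "n", "p", "s"]})

# One column per attribute
full = pd.get_dummies(odor["odor"], prefix="odor")

# Same encoding, but the first attribute in sorted order ('a') is dropped
reduced = pd.get_dummies(odor["odor"], prefix="odor", drop_first=True)

print(full.shape[1])     # 9
print(reduced.shape[1])  # 8
print(list(reduced.columns))
```

In the full version every row has exactly one column switched on. In the reduced version the ‘almond’ row is the one that is all zeros.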
Now I knew that one hot encoding would cause the number of columns to increase dramatically. And while I read the documentation about each feature I still wasn’t exactly sure how many total columns I would get but there was a sure way to find out!
# Use One Hot Encoding to change all categorical data to 0s and 1s
X = pd.concat([pd.get_dummies(X[col], drop_first=True) for col in X], axis=1, keys=X.columns)
So I went from 23 columns up to 95 (no longer including the target variable, which I set aside as my dependent variable ‘y’). That count already reflects dropping the first attribute of each feature: if I had not passed drop_first=True to pd.get_dummies(), I would have had one more column per feature. The drop_first parameter removes the column for the first attribute of each feature, but no information is lost: rows with that attribute are still present and are simply represented by zeros in all of that feature’s remaining columns.
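The column count can also be predicted before encoding. The sketch below assumes a small invented frame in place of the real features and checks that each feature contributes its number of unique values minus one when drop_first=True. It calls pd.get_dummies() on the whole frame at once, which is a simpler variant of the concat approach above.

```python
import pandas as pd

# Toy feature frame (invented values) standing in for the mushroom features
X = pd.DataFrame({
    "cap-color": ["n", "y", "w", "n"],
    "odor": ["p", "a", "l", "p"],
    "veil-type": ["p", "p", "p", "p"],  # only one category in this toy column
})

# With drop_first=True each feature yields (unique values - 1) columns
expected = int((X.nunique() - 1).sum())

encoded = pd.get_dummies(X, drop_first=True)

print(expected, encoded.shape[1])  # 4 4
```

Note that a feature with a single category contributes no columns at all once its first attribute is dropped. Calling pd.get_dummies() on the whole frame names columns with a feature prefix, while the concat version with keys builds two-level column labels; the number of columns should be the same either way.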
In the end I decided to go with One Hot Encoding, in the hope that I could reduce the number of columns before modeling without introducing any unintended order into the data. Both the Label Encoding and the One Hot Encoding yielded models that over-fit the data and had extremely high validation scores.
I would like to play with these methods more on this dataset. There is already so much information out there about transforming categorical data, though, that I need to do more research and keep learning about the other options available to data scientists.