Identifying the cute but very poisonous Fly Agarics — Source: ‘walkman200’, freeimages.com

Mushroom Labelling

Label Encoding versus One Hot Encoding with categorical data.

I recently started working on an independent data science project using a dataset I found on Kaggle. The objective is to build a model that will help classify mushrooms as either poisonous or edible based on a number of visually describable characteristics. This blog will focus on turning the categorical data used to identify different parts of mushrooms into numerical data that can then be used in a classification model.

To eat or not to eat

As much as I would enjoy discussing what I have learned about how to tell apart different varieties of mushrooms, today I would like to discuss an interesting issue that arose during my data analysis.

Can someone actually identify each of these mushrooms? — Source: Andrew Ridley, Unsplash.com

The Data

# Load the data
import pandas as pd

df = pd.read_csv("mushrooms.csv")
df.head()

Nevertheless, I know that if I’m going to build a model, I will have to turn these values into numerical data. Now the two ways that I am familiar with, and there may be others I haven’t researched yet, are Label Encoding and One Hot Encoding.

Label Encoding

For instance, ‘e’ which stands for edible would become a ‘0’ and ‘p’ for poisonous would become ‘1’.
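To make the mapping concrete, here is a minimal sketch on a made-up target column. The toy values and the variable names are my own assumptions, and I use pandas' `factorize` with `sort=True` as a stand-in for a label encoder, since it also assigns integers in alphabetical order.

```python
import pandas as pd

# Toy stand-in for the target column (hypothetical values, not the real data)
target = pd.Series(["p", "e", "e", "p", "e"], name="class")

# sort=True assigns codes in alphabetical order: 'e' -> 0, 'p' -> 1
encoded, classes = pd.factorize(target, sort=True)

print(list(classes))  # the original labels, in code order
print(list(encoded))  # the integer code for each row
```

Keeping `classes` around is handy because it lets you translate the integers back into the original letters later on.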

Side note here… I’m still confused as to whether the target class of ‘edible’ should be a 0 or a 1. Usually the thing we want would become a ‘1’, such as in a feature that has ‘yes’/‘no’ or ‘win’/‘loss’ values. I would think these would be 1/0, but I’m not sure if it matters to the supervised model.
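If you would rather decide for yourself which class becomes the 1, one option is to spell the mapping out instead of relying on alphabetical order. The sketch below is only an illustration: the toy column and the `label_map` dictionary are assumptions of mine, and treating ‘poisonous’ as the 1 is just one possible choice.

```python
import pandas as pd

# Toy stand-in for the target column (hypothetical values, not the real data)
target = pd.Series(["p", "e", "e", "p", "e"], name="class")

# Explicit, hand-written mapping (an assumption for illustration):
# here 'poisonous' is treated as the class of interest and becomes 1
label_map = {"e": 0, "p": 1}
y = target.map(label_map)

print(y.tolist())
```

Swapping the two values in `label_map` flips which class is the 1. As far as I can tell, the main thing it changes is which class is treated as the "positive" one when you read metrics such as precision and recall.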

Now this Label Encoding works great when there are just two attributes. But in many of the other columns, such as the mushroom ‘cap-color’, there are many different possibilities. Take a look at all of them:

# cap-color before Label Encoding
df['cap-color'].value_counts()

In the case of this particular column there are ten different options, and the documentation explains what each letter represents. I'm not sure why these letters were chosen, but for ‘cap-color’ they are: brown = n, buff = b, cinnamon = c, gray = g, green = r, pink = p, purple = u, red = e, white = w, yellow = y.

When I applied label encoding, each of these letters was replaced by an integer from 0 to 9 according to its position in alphabetical order, which is to be expected.
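The alphabetical assignment can be reproduced on just the ten letters from the documentation. This is a rough sketch rather than the exact encoder call: it uses pandas category codes, which also follow sorted order, and the variable names are my own.

```python
import pandas as pd

# The ten cap-color letters listed in the dataset documentation
cap_colors = pd.Series(["n", "b", "c", "g", "r", "p", "u", "e", "w", "y"])

# Category codes follow the sorted (alphabetical) order of the letters
codes = cap_colors.astype("category").cat.codes

# Letter -> integer lookup, converted to plain Python ints for readability
mapping = {letter: int(code) for letter, code in zip(cap_colors, codes)}

for letter in sorted(mapping):
    print(letter, mapping[letter])
```

Under this ordering ‘n’ (brown) lands on 4 and ‘e’ (red) lands on 2.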

# cap-color after label encoding
data['cap-color'].value_counts()

What I didn’t want was for one color to carry a value that was worth more than another. I didn’t want ‘n’, which became a 4, to be worth twice as much as ‘e’, which became a 2, since ‘brown’ isn’t more important than ‘red’. The more I researched Label Encoding, the more I found that it is most often used when there is a hierarchy to the data, which is clearly not present here. Therefore, before building my models and doing much more EDA, I decided to try the other method.

One Hot Encoding

# odor values in the feature dataframe 'X'
X[['odor']].head()

Now I knew that one hot encoding would cause the number of columns to increase dramatically. And while I read the documentation about each feature I still wasn’t exactly sure how many total columns I would get but there was a sure way to find out!

# Use One Hot Encoding to change all categorical data to 0s and 1s
X = pd.concat([pd.get_dummies(X[col], drop_first=True) for col in X], axis=1, keys=X.columns)
X.head()

So I went from 23 columns up to 95. This count no longer includes the target variable, which I set aside as my dependent variable ‘y’, and it already reflects dropping the first attribute of each column. If I had not passed drop_first=True to pd.get_dummies(), I would have had even more columns. The drop_first parameter removes the dummy column for the first attribute of each feature. The rows that had that attribute are still present; they are simply represented by zeros in all of that feature's remaining dummy columns.
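To see the effect of drop_first on the column count, here is a small sketch on a made-up two-column frame. The data and the helper name `one_hot` are assumptions for illustration; the helper simply wraps the same concat/get_dummies pattern used above.

```python
import pandas as pd

# Hypothetical toy features, each with three distinct values
X_toy = pd.DataFrame({
    "odor": ["a", "n", "p", "n"],
    "cap-color": ["n", "e", "w", "e"],
})

def one_hot(frame, drop_first):
    """One hot encode every column, keeping the source column name as a key."""
    return pd.concat(
        [pd.get_dummies(frame[col], drop_first=drop_first) for col in frame],
        axis=1,
        keys=list(frame.columns),
    )

full = one_hot(X_toy, drop_first=False)     # one dummy column per distinct value
reduced = one_hot(X_toy, drop_first=True)   # first value of each feature dropped

print(full.shape[1], reduced.shape[1])

# Row 0 has odor 'a', the dropped value: its remaining odor dummies are all zero
print(reduced["odor"].iloc[0].tolist())
```

In general, a feature with k distinct values produces k dummy columns, or k - 1 with drop_first=True, so dropping saves one column per feature.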

Summary

I would like to play with these methods more in relation to this dataset. There is already so much information about transforming categorical data, though, that I need to do more research and continue to learn about the other options available to data scientists.

Aspiring Data Scientist
