Visualizing wine quality through data. Source: mirofoto, Unsplash.com

Wine Quality EDA

Examining different ‘quality’ visualizations

TJ Whipple
Nov 24, 2020

How do you know what wine to choose? Some people look at the picture on the label. Some choose by the color of the bottle. Others follow a sommelier’s blog and read wine reviews. What if we could use data? While doing some exploratory data analysis (EDA) on a dataset found on Kaggle, I made some interesting plots with Seaborn, a Python data visualization library built on matplotlib for drawing informative statistical graphics. The plots below are by no means meant to provide a once-and-for-all understanding of this small sample, let alone of wines in general.

The dataset

The dataset comes from the UCI Machine Learning Repository and is considered an open database for educational purposes. There are two datasets, one for the red and one for the white variety of the Portuguese wine called “Vinho Verde”; I will be using the red variant. I was particularly excited to find this dataset since this wine seems to be pretty hot these days among wine connoisseurs and is sometimes slightly effervescent.

https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009

I loaded the dataset and then checked to see if there were any missing/null values or duplicates. It turns out the version I downloaded was pretty clean and didn’t need a lot of data manipulation. You can see below the different columns that were included (I believe the original dataset also has a column labeled ‘red’ or ‘white’ but this version has already been filtered for some reason).

import pandas as pd

# Load the csv file into a pandas dataframe
df = pd.read_csv("winequality-red.csv")
df.head()
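The missing-value and duplicate checks mentioned above can be sketched roughly as follows. This is a minimal illustration rather than the exact notebook code: the helper name `summarize_cleanliness` and the toy values are assumptions made up for the example.

```python
import pandas as pd


def summarize_cleanliness(frame: pd.DataFrame) -> dict:
    """Count missing values per column and fully duplicated rows (hypothetical helper)."""
    return {
        "missing_per_column": frame.isnull().sum().to_dict(),
        "duplicate_rows": int(frame.duplicated().sum()),
    }


# Toy values for illustration only, not taken from the wine data
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, 9.8, None],
    "quality": [5, 5, 5, 6],
})
report = summarize_cleanliness(toy)
print(report)
```

On the real data the call would be `summarize_cleanliness(df)`, and `df.drop_duplicates()` is one way to drop repeated rows if any turn up.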

Target variable

I decided to look at a number of the features that I felt might contribute to the quality of the wine. It is also important to understand how the target feature is described in the dataset. You can see above that each wine has an integer ‘quality’ rating. Below I show all the ‘quality’ values that occur. According to the dataset description, the wines were rated on a scale from 0 to 10, but there were no perfect scores and none so horrible as to be undrinkable. In this sample the worst and the best scores (4 and 8 respectively) are also the least common.

# Distinct quality ratings that appear in the data
df['quality'].unique()

It also turned out that the majority of the wines received a score of either 5 or 6, so the ratings form a roughly bell-shaped distribution around the middle scores.
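One way to see how common each rating is would be a normalized value count. The sketch below assumes pandas; the function name `rating_shares` and the toy ratings are invented for illustration and are not the real counts.

```python
import pandas as pd


def rating_shares(quality: pd.Series) -> pd.Series:
    """Share of wines at each rating, ordered from the lowest to the highest score."""
    return quality.value_counts(normalize=True).sort_index()


# Toy ratings for illustration only
toy_quality = pd.Series([5, 5, 6, 6, 6, 7, 4, 5])
shares = rating_shares(toy_quality)
print(shares)
```

With the real data this would be `rating_shares(df['quality'])`.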

Visualizations

The first couple of visuals that I looked at were just straightforward bar plots. I wanted to know if there was a relationship between ‘quality’ and the other features. Actually, the first graphs I made were a correlation matrix and a histogram grid for all columns. These two graphs allowed me to check for multicollinearity among the features and to see whether each distribution looked roughly normal. After those graphs I wanted to explore features that might affect the quality rating.
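The correlation matrix and histogram grid mentioned above could be drawn along these lines. This is only a sketch: the function name `plot_overview`, the figure sizes, the bin count, and the toy values are all assumptions, not the original notebook code.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def plot_overview(frame: pd.DataFrame):
    """Draw a correlation heatmap and a histogram grid for the numeric columns."""
    corr = frame.select_dtypes("number").corr()  # pairwise Pearson correlations
    fig, ax = plt.subplots(figsize=(9, 7))
    sns.heatmap(corr, annot=True, fmt=".2f", cmap="Reds", ax=ax)
    ax.set_title("Feature Correlations")
    hist_axes = frame.hist(figsize=(12, 10), bins=20)  # one histogram per column
    return corr, hist_axes


# Toy values for illustration only
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5, 11.2, 12.0],
    "density": [0.9978, 0.9968, 0.9960, 0.9950, 0.9940],
    "quality": [5, 5, 6, 6, 7],
})
corr, hist_axes = plot_overview(toy)
# Call plt.show() to display both figures
```

Strongly correlated feature pairs in the heatmap are the ones to watch for multicollinearity, and the histogram grid gives a quick read on how skewed each column is.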

import matplotlib.pyplot as plt
import seaborn as sns

# Bar plot of the mean alcohol content for each quality rating
fig = plt.figure(figsize = (9,5))
# ci = None hides the error bars (seaborn 0.12 and later spell this errorbar = None)
sns.barplot(x = 'quality', y = 'alcohol', data = df, palette='Reds', ci = None)
plt.xlabel("Quality Rating", fontsize = 16)
plt.ylabel("Alcohol Percentage", fontsize = 16)
plt.title("Alcohol by Wine Quality", fontsize = 20)
plt.show()

You can see from the bar graph above that, on average, wines with a higher quality rating also have a higher alcohol percentage. The group averages all fall within a relatively small range (around 2–2.5 percentage points), but I found it interesting that, at least in this dataset, the better quality wines had slightly more alcohol.

Using code similar to the above, I decided to check out chlorides. I read online that chloride is part of the compound we use as table salt (sodium chloride). The chlorides in wine help adjust acidity and taste. According to this sample dataset, wines with lower chloride levels tended to receive a higher rating, as you can see in the bar plot below.
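A sketch of what that similar code might look like for chlorides follows. It is an assumption-laden illustration: the helper `plot_mean_by_quality`, the bar color, and the toy values are made up, and the means are computed with a pandas groupby first so the plot does not depend on seaborn's error-bar arguments.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def plot_mean_by_quality(frame: pd.DataFrame, column: str, ylabel: str):
    """Bar plot of the mean of `column` for each quality rating (hypothetical helper)."""
    means = frame.groupby("quality")[column].mean().reset_index()
    fig, ax = plt.subplots(figsize=(9, 5))
    sns.barplot(x="quality", y=column, data=means, color="firebrick", ax=ax)
    ax.set_xlabel("Quality Rating", fontsize=16)
    ax.set_ylabel(ylabel, fontsize=16)
    ax.set_title(f"{ylabel} by Wine Quality", fontsize=20)
    return means, ax


# Toy values for illustration only
toy = pd.DataFrame({
    "quality": [5, 5, 6, 6, 7],
    "chlorides": [0.10, 0.08, 0.08, 0.06, 0.05],
})
means, ax = plot_mean_by_quality(toy, "chlorides", "Chlorides")
# Call plt.show() to display the figure
```

The same helper could be reused for any other column, for example `plot_mean_by_quality(df, 'alcohol', 'Alcohol Percentage')`.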

While bar plots are some of the easiest graphs to understand, I also needed a better look at how the data was spread out. The following two box plots give a great visual of not only the median for each quality rating, but also how the data is spread out within each rating. After looking at a few of these box plots I decided to get rid of some of the outliers in the data. The plots below still show a few outliers (the black dots above and below the whiskers) but for the most part the graphs look much better.
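The post does not say how the outliers were removed, so the following is only one common option: the 1.5 × IQR rule, which is also the default rule box plots use to place their whiskers. The function name `drop_iqr_outliers`, the multiplier `k`, and the toy values are assumptions for illustration.

```python
import pandas as pd


def drop_iqr_outliers(frame: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Keep rows whose `column` value lies within k * IQR of the quartiles."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    keep = frame[column].between(q1 - k * iqr, q3 + k * iqr)
    return frame[keep]


# Toy values for illustration only; 0.61 is the obvious outlier
toy = pd.DataFrame({"chlorides": [0.07, 0.08, 0.08, 0.09, 0.10, 0.61]})
cleaned = drop_iqr_outliers(toy, "chlorides")
print(len(toy), "->", len(cleaned))
```

Because this filter looks at the whole column at once while a box plot computes whiskers separately for each quality rating, some points can still show up as outliers in the per-rating plots after filtering.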

# Boxplot of wine density
# (catplot creates its own figure; use height= and aspect= to resize it)
sns.catplot(x = 'quality', y = 'density', kind="box", data = df, palette='Reds')
plt.xlabel("Quality Rating", fontsize = 16)
plt.ylabel("Density", fontsize = 16)
plt.title("Density by Wine Quality", fontsize = 20)
plt.show()

You can see that the density of a wine also seems to play a part in how it was rated. According to the description in the dataset, the density of wine is close to that of water, depending on the percent alcohol and sugar content. So there must be a balance between these two features that affects the density; perhaps higher alcohol makes the density go down. Again, the difference in the medians here (the horizontal line in the center of each box) is small, but they do trend downward as the quality goes up.

Below, I used similar code to look at citric acid. Citric acid is added to give ‘freshness’ and flavor to wines. I’m not sure how it’s measured, but it looks like the more the merrier! Higher-rated wines seem to have higher citric acid levels, and their values tend to fall in a tighter range than those of the lower-rated wines (notice the length of the rectangles for lower-rated wines).
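The "tighter range" impression from a box plot can also be checked numerically by computing the median and interquartile range (the length of each rectangle) per rating. This is a sketch under assumptions: the helper name `spread_by_quality` and the toy values are invented for the example.

```python
import pandas as pd


def spread_by_quality(frame: pd.DataFrame, column: str) -> pd.DataFrame:
    """Median and interquartile range of `column` for each quality rating."""
    grouped = frame.groupby("quality")[column]
    return pd.DataFrame({
        "median": grouped.median(),
        "iqr": grouped.quantile(0.75) - grouped.quantile(0.25),
    })


# Toy values for illustration only
toy = pd.DataFrame({
    "quality": [4, 4, 4, 4, 7, 7, 7, 7],
    "citric acid": [0.0, 0.1, 0.3, 0.5, 0.38, 0.40, 0.42, 0.44],
})
summary = spread_by_quality(toy, "citric acid")
print(summary)
```

A smaller `iqr` for a rating corresponds to a shorter box in the plot.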

Next I decided to try some fun violin plots. To me, these are just box plots that also show how the data is spread out all the way to the ends of the whiskers. Below is the code I used to create these graphs. The white dot in the middle of each plot represents the median, while the width of each ‘violin’ shows the estimated density of the data at that value.

# Violin catplot of wine fixed acidity
# (catplot creates its own figure; use height= and aspect= to resize it)
sns.catplot(x = 'quality', y = 'fixed acidity', kind="violin", data = df, palette='Reds')
plt.xlabel("Quality Rating", fontsize = 16)
plt.ylabel("Fixed Acidity", fontsize = 16)
plt.title("Fixed Acidity by Wine Quality", fontsize = 20)
plt.show()

For ‘fixed acidity’ it is hard to state any obvious conclusions. Perhaps fixed acidity depends on numerous other features and is less a signal of quality than simply a measurable feature. The wines with a higher quality rating seem to be spread out all over the map, meaning that good-tasting wines can have a high or a low fixed acidity, whereas the lower-rated wines tend to be more clustered around the median.

This violin plot of the pH of wine does show a slight downward trend, especially looking at the medians and interquartile ranges (the long thin black box in the middle of each plot), though again, the difference is pretty small. I remember from science class that water tends to have a pH close to 7 and lemons have a pH around 2. So most of these wines are definitely on the acidic side, though I imagine it’s pretty hard to taste the difference between them.

For many of the features it was difficult to discern any relationship to the quality rating, so I’ve tried to share some of the more prominent columns here. I was convinced that ‘residual sugar’ would be in this list of notable features, but the plots showed that sugar was less important, even though I prefer dry wines and find sweet wines to be less sophisticated. The truth is that it may not be the sweetness, but the different acidity levels, additives, and alcohol content that I find more appealing in a good bottle of wine.
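A quick numeric companion to the plots is to rank each feature by its correlation with ‘quality’. The sketch below uses plain Pearson correlation, which is an assumption on my part and only captures linear relationships; the helper name `rank_features` and the toy values are also made up for illustration.

```python
import pandas as pd


def rank_features(frame: pd.DataFrame, target: str = "quality") -> pd.Series:
    """Pearson correlation of each feature with the target, strongest first."""
    features = frame.drop(columns=target)
    corr = features.corrwith(frame[target])
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Toy values for illustration only
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5, 11.2, 12.0],
    "residual sugar": [2.0, 1.9, 2.1, 2.0, 2.0],
    "quality": [5, 5, 6, 6, 7],
})
ranked = rank_features(toy)
print(ranked)
```

Features near the bottom of this ranking are the ones where the plots are least likely to show a visible trend.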

After all this EDA I needed to build some predictive models. With the data cleaned and outliers removed, I still needed to split my data into train/test sets and then scale the data. Hopefully I’ll be able to build a model that can help me accurately pick out a tasty bottle of wine. You’ll have to check out my GitHub repo for the full conclusion!
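The split-then-scale step could look something like the sketch below, assuming scikit-learn. The 80/20 split, the random seed, the choice of `StandardScaler`, the helper name `split_and_scale`, and the randomly generated toy data are all assumptions for illustration; the actual modeling code lives in the repo.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def split_and_scale(frame: pd.DataFrame, target: str = "quality",
                    test_size: float = 0.2, seed: int = 42):
    """Split into train/test sets, then scale features using training statistics only."""
    X = frame.drop(columns=target)
    y = frame[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed
    )
    scaler = StandardScaler().fit(X_train)  # fit on train only to avoid leakage
    return scaler.transform(X_train), scaler.transform(X_test), y_train, y_test


# Randomly generated toy data for illustration only
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "alcohol": rng.normal(10.5, 1.0, 50),
    "pH": rng.normal(3.3, 0.15, 50),
    "quality": rng.integers(4, 8, 50),
})
X_train, X_test, y_train, y_test = split_and_scale(toy)
print(X_train.shape, X_test.shape)
```

Splitting before scaling matters: fitting the scaler on the full dataset would let information from the test rows leak into the training features.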
