Ternary of Zebras in Tanzania Drinking at a Water-well! — Photo Credit: Gene Taylor

A Ternary Classification

My First Multi-class Python Classification Model

In data science, we build models to help us understand data. Classification models help us predict which category a non-continuous (discrete) outcome will fall into, or how observations can be grouped in a useful way.

We usually begin by learning about binary models, which sort the target into two categories based on the independent variables, often called ‘values’. The target class, or y-variable, is the feature that the supervised learning model tries to predict.

A ternary classification sorts the dependent, or ‘target’, variable into three groups; the word ‘ternary’ comes from a Latin root similar to ‘tri’. Many different models can perform this task, though they sometimes need slightly different settings than they do for binary problems. Ternary classifications, like any problem with more than two target classes, are generally considered ‘multiclass’ or multinomial problems.

In order to demonstrate a ternary classification I will build a few different models and use the dataset from the Tanzanian Pump-It-Up water-point competition. This project involved using information given about water-points in Tanzania to predict whether or not a given water source (wells, pumps, standpipes, fountains, boreholes, etc.) was working correctly.

The given data included a target with three classes: ‘functional’, ‘non functional’, and ‘functional needs repair’. The goal was to build a model that could predict which of these three classes a given water-point falls into. The two main classes were fairly balanced, but the third class was under-represented, making up less than 10 percent of the target. The target column is labeled ‘status_group’ in the dataset, as seen below.

import pandas as pd
# Load the target
target = pd.read_csv("train_set_labels.csv")
# Check out a few of the rows
target.head()
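A quick way to see how unbalanced the classes are is to look at the share of each label rather than the raw rows. The sketch below uses a made-up stand-in for the labels file (the counts are hypothetical, chosen only to mimic a small third class), so treat it as an illustration and not as the real distribution.

```python
import pandas as pd

# Hypothetical stand-in for the competition labels; these counts are invented
toy_target = pd.DataFrame({
    "status_group": ["functional"] * 54
                    + ["non functional"] * 38
                    + ["functional needs repair"] * 8
})

# Share of each class in the target, largest class first
class_share = toy_target["status_group"].value_counts(normalize=True)
print(class_share)
```

Running the same `value_counts(normalize=True)` call on the real `target` frame would show the actual proportions.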

While this was presented as a fun data science competition, a productive model with high accuracy could be very beneficial in practice. Not only could the communities of Tanzania be positively impacted, but the same approach could be useful anywhere water-points are unreliable or water is not supplied to modern standards. A data science model could be used to help decide which communities most need clean, potable water, which is considered a vital part of health.

In order for the classification model to correctly predict the target, we need to change the labels from strings to integers.

# Replace target values - there are three classes
target = target.replace({'status_group': {'functional': 1,
                                          'non functional': 0,
                                          'functional needs repair': 2}})
# Check to see that it worked
target['status_group'].value_counts()
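One thing worth checking after recoding is that no label was silently left behind, for example because of a typo in the dictionary. The sketch below swaps `replace` for `Series.map` (my substitution, not what the code above uses), because `map` turns any unmapped label into NaN, which makes gaps easy to count. The toy labels are hypothetical.

```python
import pandas as pd

# Same coding as in the article: 0, 1, 2 for the three classes
label_map = {"non functional": 0, "functional": 1, "functional needs repair": 2}

# Hypothetical handful of labels standing in for the real column
toy_labels = pd.Series(
    ["functional", "non functional", "functional needs repair", "functional"]
)
encoded = toy_labels.map(label_map)

# map() leaves NaN for any label missing from the dictionary, so count the gaps
unmapped = int(encoded.isna().sum())
print(encoded.tolist(), unmapped)
```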

The dataset also included 40 features, which we will use as our independent ‘values’. We then need to combine the target and the values into one dataset so that they can be cleaned together.

# Load the rest of the data
values = pd.read_csv("train_set_values.csv")
# Merge the target and values into one dataset
df = pd.concat([values, target], axis=1)
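Concatenating side by side relies on both files listing the water-points in the same row order. A slightly more defensive alternative, sketched here on hypothetical toy frames, is to join on the shared ‘id’ column. The `gps_height` column and all values are placeholders for illustration.

```python
import pandas as pd

# Toy frames whose rows are deliberately in a different order
toy_values = pd.DataFrame({"id": [3, 1, 2], "gps_height": [1390, 0, 686]})
toy_target = pd.DataFrame({"id": [1, 2, 3], "status_group": [0, 1, 2]})

# Join on the key; validate raises an error if an id appears more than once
merged = toy_values.merge(toy_target, on="id", how="inner", validate="one_to_one")
print(merged)
```

Joining this way also leaves a single ‘id’ column instead of two.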

Eleven of the columns had data that was continuous, entered as numbers on a given scale — such as height of the water-point, amount of water available, the latitudinal/longitudinal location, construction year, etc. You will notice that many of the features need to be examined in more detail due to a large percentage of zeros, outliers, and irrelevant information.

# View continuous features
df.select_dtypes(include='number').describe()

The majority of the values were categorical, and many of them contained administrative information such as who installed the water-point, who manages it, whether or not the water-point has a permit, etc. Unfortunately, most of this data proved to be irrelevant and was removed from my model. The dataset also contained a lot of information about the location of the water-points, such as the region, district, ward, and water-basin. Many of these features were also removed because of the overwhelming number of categories per feature, which made them of little use for classification, but as we will see, location still plays an important role in building the model.

# View all categorical features
df.select_dtypes(include='object').head()

After extensive research and cleaning, I decided that the following columns needed to be dropped either due to irrelevant information or due to redundant information that was found in another column (such as ‘payment’ and ‘payment_type’).

# Drop all columns with redundant information
col_to_delete = ['id', 'recorded_by', 'funder', 'installer',
                 'lga', 'ward', 'region', 'scheme_management',
                 'wpt_name', 'scheme_name', 'extraction_type',
                 'extraction_type_group', 'management', 'payment_type',
                 'quality_group', 'source_type', 'source',
                 'waterpoint_type_group', 'quantity_group', 'subvillage']

df = df.drop(col_to_delete, axis=1)

I also needed to do some further cleaning of the data. I first dropped all duplicate rows that still remained and then dropped all rows with missing information.

Next, I used one-hot encoding on the remaining categorical features so that they could be used in the classification model. Many of these features had up to half a dozen possible categories, which widens the dataset considerably but is still necessary. By dropping the first new column created for each feature, I can reduce the total number of columns without losing any information, since the dropped category is implied when all of the others are zero.

# Drop all duplicates
df = df.drop_duplicates()
# Drop all rows with missing information
df = df.dropna(axis=0)
# One-hot encode with get_dummies and drop the first level of each feature
df = pd.get_dummies(df, drop_first=True)
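To make the effect of `drop_first=True` concrete, here is a small sketch on a hypothetical two-column frame. The column names echo features mentioned above, but the values are invented.

```python
import pandas as pd

# Hypothetical categorical features: one with three levels, one with two
toy = pd.DataFrame({
    "quantity": ["enough", "dry", "seasonal", "enough"],
    "permit": ["yes", "no", "yes", "yes"],
})

full = pd.get_dummies(toy)                      # one column per level
reduced = pd.get_dummies(toy, drop_first=True)  # one fewer column per feature

print(full.shape, reduced.shape)
print(list(reduced.columns))
```

The first level of each feature (in sorted order) becomes the baseline: a row with zeros in both remaining `quantity` columns must be ‘dry’.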

The data was then randomly split into two groups: a training group and a testing group. The machine learning algorithms are first fitted to the training group using Python and the scikit-learn library. This gives the model a chance to ‘learn’ how the features distinguish the classes. Since we hold back some of the data in a testing group that the model never sees during training, we can use it to get an honest estimate of the model's performance. The results from different models were analyzed and compared in order to determine the best overall fit.

from sklearn.model_selection import train_test_split
# Three classes of the dependent target variable
y = df.status_group
# Drop the target to leave the independent 'values' variables
X = df.drop('status_group', axis=1)
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=123)
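Because the third class is so small, a purely random split can leave it slightly over- or under-represented in the test set. One option, which is not what the split above does, is to pass `stratify=y` so that both groups keep the same class proportions. The sketch below uses synthetic data with hypothetical class counts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and an imbalanced three-class target (counts are invented)
rng = np.random.default_rng(123)
X_toy = rng.normal(size=(1000, 4))
y_toy = np.array([1] * 540 + [0] * 380 + [2] * 80)

# stratify keeps the class proportions the same in both groups
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=123, stratify=y_toy
)

minority_share_train = np.mean(y_tr == 2)
minority_share_test = np.mean(y_te == 2)
print(minority_share_train, minority_share_test)
```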

I plan to build at least three models: Logistic Regression, Random Forest, and XGBoost.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Logistic model
# Note: recent scikit-learn releases deprecate the multi_class argument
log_clf = LogisticRegression(random_state=123, multi_class='multinomial', solver='newton-cg')
log_model = log_clf.fit(X_train, y_train)
log_training_preds = log_clf.predict(X_train)
log_training_accuracy = accuracy_score(y_train, log_training_preds)
log_val_preds = log_clf.predict(X_test) # y_hat
log_val_accuracy = accuracy_score(y_test, log_val_preds)

The Log Training Accuracy was 73.79%, which is not that great but a good start. The Log Validation Accuracy was similar at 73.7%, which is not surprising but also not as strong as we are looking to get. It is important to note that with the Logistic model we need to set the parameter multi_class=‘multinomial’ to use a single cross-entropy loss across all three classes. There is also the option multi_class=‘ovr’ (one-vs-rest), which fits a separate binary model for each class.
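The difference between the two strategies can be sketched on synthetic data. This is only an illustration under assumptions: it uses scikit-learn's `OneVsRestClassifier` wrapper for the one-vs-rest case instead of the `multi_class` argument, and it relies on `LogisticRegression` fitting a multinomial model by default for a three-class target, which is my understanding of recent scikit-learn versions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic three-class problem standing in for the water-point data
X_toy, y_toy = make_classification(
    n_samples=600, n_features=8, n_informative=4, n_redundant=0,
    n_classes=3, random_state=123
)

# One joint model trained with cross-entropy over all three classes
softmax_clf = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

# Three separate binary models, one per class against the rest
ovr_clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_toy, y_toy)

print(softmax_clf.coef_.shape)      # one row of coefficients per class
print(len(ovr_clf.estimators_))     # one fitted binary model per class
print(softmax_clf.score(X_toy, y_toy), ovr_clf.score(X_toy, y_toy))
```

Both approaches return one probability per class; which one scores better depends on the data, so it is worth trying both.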

from sklearn.metrics import confusion_matrix
# Confusion matrix for Logistic Regression
log_matrix = confusion_matrix(y_test, log_val_preds)
print('Confusion Matrix:\n', log_matrix)

If our model were able to correctly identify all three classes, the only non-zero entries in the Confusion Matrix would lie along the diagonal. All the numbers that are not on the diagonal are misclassified observations. In the Logistic Model above, 2761 of the 4809 functioning water-points were correctly identified by the model, whereas 6411 of the 7062 non-functioning water-points were correctly identified. The number of correct observations divided by the total number of instances gives us the Accuracy for the model.
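The relationship between the confusion matrix and the accuracy can be sketched in a few lines of NumPy. The matrix below is hypothetical (it is not the output of any of my models); rows are the true classes and columns are the predicted classes, as in scikit-learn.

```python
import numpy as np

# Hypothetical 3x3 confusion matrix: rows = true class, columns = predicted class
toy_matrix = np.array([
    [50,  8,  2],
    [ 6, 70,  4],
    [ 3,  5, 12],
])

# Accuracy: correctly classified (the diagonal) over all observations
accuracy = np.trace(toy_matrix) / toy_matrix.sum()

# Share of each true class that was identified correctly
per_class_correct = np.diag(toy_matrix) / toy_matrix.sum(axis=1)

print(accuracy)            # 0.825
print(per_class_correct)
```

Note how the small third class can be classified poorly (12 of 20 here) while the overall accuracy still looks respectable.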

Next I tried the Random Forest model, which is a collection of Decision Trees that are each built on a random sample of the rows and features and whose predictions are then combined to get a more stable overall model. Random Forests can take a lot of time to build, but they handle multi-class targets without any special settings.

from sklearn.ensemble import RandomForestClassifier
# Forest Model
forest_clf = RandomForestClassifier()
forest_model = forest_clf.fit(X_train, y_train)
forest_training_preds = forest_clf.predict(X_train)
forest_training_accuracy = accuracy_score(y_train, forest_training_preds)
forest_val_preds = forest_clf.predict(X_test) # y_hat
forest_val_accuracy = accuracy_score(y_test, forest_val_preds)

The accuracy of the Random Forest was better than that of the Logistic model. The Forest Training accuracy was 98.41%, while the Forest Validation accuracy on the test set was 78.48%. A gap of roughly 20 percentage points between the two suggests that the forest is overfitting the training data.
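One common way to investigate a train/validation gap like this is to limit how far each tree can grow and see whether the gap narrows. The sketch below does this on synthetic data. The `max_depth` and `min_samples_leaf` values are arbitrary assumptions, not tuned settings, and I did not run this experiment on the water-point data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Noisy synthetic three-class data standing in for the real features
X_toy, y_toy = make_classification(
    n_samples=1500, n_features=10, n_informative=5,
    n_classes=3, flip_y=0.1, random_state=123
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=123
)

def train_val_gap(model):
    """Fit the model and return training accuracy minus validation accuracy."""
    model.fit(X_tr, y_tr)
    return model.score(X_tr, y_tr) - model.score(X_te, y_te)

# Fully grown trees versus trees with a depth and leaf-size limit
full_gap = train_val_gap(RandomForestClassifier(random_state=123))
limited_gap = train_val_gap(
    RandomForestClassifier(max_depth=8, min_samples_leaf=10, random_state=123)
)
print(full_gap, limited_gap)
```

A smaller gap does not automatically mean a better model, so the validation accuracy itself still needs to be compared.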

# Confusion matrix for Random Forest
forest_matrix = confusion_matrix(y_test, forest_val_preds)
print('Confusion Matrix:\n', forest_matrix)

The XGBoost Classifier

import xgboost as xgb
# XGB classifier
xgb_clf = xgb.XGBClassifier()
xgb_clf.fit(X_train, y_train)
xgb_training_preds = xgb_clf.predict(X_train)
xgb_training_accuracy = accuracy_score(y_train, xgb_training_preds)
xgb_val_preds = xgb_clf.predict(X_test)
xgb_val_accuracy = accuracy_score(y_test, xgb_val_preds)

The XGBoost Classifier didn’t do as well as the Random Forest. The XGB Training Accuracy was 73.72% and the XGB Validation accuracy was 73.87%. Below is the confusion matrix showing the correctly classified data.

# XGBoost confusion matrix
xgb_matrix = confusion_matrix(y_test, xgb_val_preds)
print('Confusion Matrix:\n', xgb_matrix)

There are also the Precision and Recall metrics, which can be used to help validate the performance of the model. Precision is the fraction of relevant items among all the retrieved items: here, the number of water-points correctly predicted to be functioning divided by all the water-points that the model believes to be functioning. Recall is the fraction of relevant items that have been retrieved out of the total number of relevant instances: in other words, the number of water-points correctly predicted to be functioning divided by the total number of water-points that are actually functioning.
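With three classes, Precision and Recall are computed once per class. The sketch below uses ten hypothetical labels and predictions (not output from my models) and asks scikit-learn for one score per class with `average=None`.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical true labels and predictions for ten water-points
y_true = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 1, 0, 2, 1, 1]

# average=None returns one value per class, in the order 0, 1, 2
precision = precision_score(y_true, y_pred, average=None)
recall = recall_score(y_true, y_pred, average=None)

print(precision)  # class 1: 3 correct out of 6 predicted as class 1
print(recall)     # class 2: 1 found out of 3 that truly are class 2
```

In this toy example the third class has perfect precision but poor recall: the model rarely predicts it, so it misses most of the cases that belong to it.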

Precision and Recall are used as measurements of the model's performance in addition to the Accuracy, which is simply the percentage of correctly identified items. I found that all of my models had difficulty getting the accuracy higher than 80% due to all the missing information and the high number of zeros in the dataset. In addition, because of the limited number of data points in the third class (functioning needs repair), my models had difficulty predicting labels for this class.

Furthermore, I felt that for this dataset it was best to judge the models on accuracy, since neither false positives nor false negatives carry the extreme cost that they do in some problems. That said, while it might be a waste of time and resources to visit a water-point flagged as non-functioning only to find that it was in fact still working, I believe it is better to err on the side of too many such false alarms than to let people go without a safe source of drinking water.

