Was it the iceberg or the lack of lifeboats that defined the Titanic? Source: Annie Spratt, Unsplash.com

My First Kaggle Competition

Building a model of Survivors!

This past week I decided to enter my first Kaggle Competition. The competition can be found here on the Kaggle website. While there were many different competitions to join, I decided that “Titanic: Machine Learning from Disaster” was a good place to start. The object of this competition is to predict which passengers survived the Titanic sinking, as well as to get familiar with basic machine learning concepts.

While I had already worked with this dataset during my course at the Flatiron School, that was over a year ago and I am now much better prepared. Plus, I would need to complete this notebook online in a Kaggle Kernel and submit my final model to receive a score, one based on how many survivors I could correctly identify with the model I built.

The Kaggle Kernel is an online platform where you can run your Python code in a Jupyter notebook. Thankfully, it was very similar to the Jupyter notebook I run on my own machine and pretty straightforward. There were even a few things that I like better than my Anaconda version.

On the right side of the screen there is a dropdown box where you can easily add data. I clicked on the + symbol and chose the Titanic data from the competition. While I was a little confused at first about the contents of these files, a little research helped sort things out. The ‘gender_submission.csv’ file is a sample of what your solution needs to look like before submitting your answer to the competition. Then there are the ‘test.csv’ and ‘train.csv’ files. The “Copy file path” feature is very user-friendly and makes importing the data into the notebook a breeze!

The submission file caused me some difficulty because the format of my data was incorrect. After running the data through my model I realized that I was missing two things. First, my target column had the wrong name. I suspect that when I ran the data through the train/test split or through the StandardScaler, the data frame was converted into a NumPy array and the column name ‘Survived’ was lost.

This was an easy fix using the ‘.rename’ method.

forest_output = forest_output.rename(columns={0: "Survived"})

Second, I had deleted the “PassengerId” column and needed to somehow get it back into my solution! This was more complicated, and I would love to find a simpler work-around.

Finally, after some guess-and-check and a little bit of searching the web, I found a way using the ‘.merge’ method. I also tried ‘.concat’ but wasn’t successful.

forest_output = pd.merge(test_set['PassengerId'], forest_target, how='left', left_index=True, right_index=True)
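In case it helps anyone looking for that simpler work-around: since both frames carry the same default integer index, pandas’ ‘.join’ can reattach the column in one step. Here is a minimal sketch with toy stand-in values (the names ‘test_set’ and ‘forest_target’ mirror the ones above, but the data is made up):

```python
import pandas as pd

# Toy stand-ins for the real frames (hypothetical values)
test_set = pd.DataFrame({'PassengerId': [892, 893, 894], 'Pclass': [3, 3, 2]})
forest_target = pd.DataFrame({0: [0, 1, 0]})

# Both frames share the same default integer index,
# so selecting the ID column and joining reattaches the predictions
forest_output = test_set[['PassengerId']].join(forest_target)
print(forest_output)
```

This relies on neither frame having had its index shuffled or reset along the way.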

Another thing I like about Kaggle Kernels is the ability to quickly add either a ‘Code’ cell or a ‘Markdown’ cell. Not that it’s hard to do in my version of Jupyter notebook; I just prefer the hover-over of this web-based style to a keyboard shortcut, or to using ‘Insert’ cell below and then having to declare what type it is. Below the cell of necessary libraries that I imported, you can see the two + signs for adding new cells.

While I knew that I needed a test set and a training set of data, I wasn’t sure what to do with the two different files (‘train.csv’ and ‘test.csv’). Should I combine them for my exploratory research, even though the ‘test.csv’ file doesn’t contain the target variable? Would it help or make a difference? And would I need to perform all of the same cleaning on the ‘test.csv’ file before running it through my model, as I did with the ‘train.csv’ file?

There are obviously answers out there, and I could easily have looked up any number of notebooks that have already been completed. But instead I wanted to make my way through this process on my own. I decided that keeping the two datasets separate would be more similar to a real-life scenario. Upon closer inspection of the two datasets, there are missing values in different columns, meaning that the two datasets would require two unique cleaning solutions.
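A quick ‘.isna().sum()’ on each frame is enough to see the mismatch. Below is a sketch with tiny made-up frames (in the real competition files, if I recall correctly, ‘train.csv’ has gaps in Age, Cabin, and Embarked while ‘test.csv’ has gaps in Age, Cabin, and Fare):

```python
import numpy as np
import pandas as pd

# Toy frames illustrating the point: each file has gaps in different columns
train = pd.DataFrame({'Age': [22.0, np.nan], 'Embarked': ['S', np.nan], 'Fare': [7.25, 8.05]})
test = pd.DataFrame({'Age': [34.5, np.nan], 'Embarked': ['Q', 'S'], 'Fare': [7.83, np.nan]})

# .isna().sum() gives a per-column count of missing values
print(train.isna().sum())
print(test.isna().sum())
```

Comparing the two printouts column by column tells you exactly which cleaning steps each file needs.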

For the sake of time, I do not plan to go through the whole notebook. I did some EDA (Exploratory Data Analysis) of the ‘train.csv’ features, made some fun graphs, and then cleaned up the columns with missing data. Next I converted the categorical columns to 1’s and 0’s using One Hot Encoding, separated the target variable from the features, and then applied the train_test_split method. Finally, I used StandardScaler to scale the data and then fit the data to a variety of models.
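The preprocessing steps above can be sketched end to end on a toy stand-in for the cleaned ‘train.csv’ (the column choices here are illustrative, not my actual feature set):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the cleaned training data (made-up rows)
df = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male'] * 5,
    'Age': [22, 38, 26, 35] * 5,
    'Fare': [7.25, 71.28, 7.92, 53.1] * 5,
    'Survived': [0, 1, 1, 0] * 5,
})

# One Hot Encode the categorical column into 1's and 0's
df = pd.get_dummies(df, columns=['Sex'], drop_first=True)

# Separate the target variable from the features
y = df['Survived']
X = df.drop(columns='Survived')

# Hold out a validation split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Fit the scaler on the training features only, then transform both splits
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

Note that the scaler is fit on the training split alone and only applied to the validation split, so no information leaks from the held-out rows.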

For this project I decided to try a bunch of classifier models: KNN, RandomForest, LogisticRegression, SVC, and XGBoost. The Random Forest model gave me the highest accuracy, so I decided to move forward with it. Up until now, I have mostly just built models and then tried to evaluate their performance using a variety of techniques such as RMSE scores, cross-validation scores, or Principal Component Analysis (PCA). Here is an example of my Random Forest code:

# Forest Model
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

forest_clf = RandomForestClassifier()
forest_model = forest_clf.fit(X_train, y_train)

# Accuracy on the training split
forest_training_preds = forest_clf.predict(X_train)
forest_training_accuracy = accuracy_score(y_train, forest_training_preds)

# Accuracy on the held-out validation split
forest_val_preds = forest_clf.predict(X_test)
forest_val_accuracy = accuracy_score(y_test, forest_val_preds)

print("Forest Training Accuracy: {:.4}%".format(forest_training_accuracy * 100))
print("Forest Validation accuracy: {:.4}%".format(forest_val_accuracy * 100))
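Since an unconstrained random forest tends to score near-perfectly on its own training data, the cross-validation scores mentioned earlier give a more honest estimate. A sketch on synthetic stand-in data (the real X and y would come from the cleaned ‘train.csv’):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data for illustration only
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

forest_clf = RandomForestClassifier(random_state=42)

# Five-fold cross-validated accuracy: each fold is scored on rows
# the model never trained on
scores = cross_val_score(forest_clf, X, y, cv=5)
print(scores.mean())
```

A large gap between training accuracy and the cross-validated mean is a quick sign of overfitting.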

Below is the Training Accuracy as well as the Validation score.

Now it was time to import the ‘test.csv’ data and really check how my model performed. First I needed to clean up the data (which, as I mentioned earlier, had missing values in different columns than the ‘train.csv’ file) and fill in any missing information. I also used One Hot Encoding on the categorical columns after deleting the same features that I had deleted and deemed unhelpful in the original dataset. Next I ran the data through the StandardScaler and got ready to use my model predictor.
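One pitfall worth flagging when the two files are cleaned separately: if a category shows up in one file but not the other, One Hot Encoding produces different columns, and the model will refuse the mismatched input. A small sketch of one way to guard against that, using toy frames:

```python
import pandas as pd

# Toy frames: the test file can lack a category that appears in train
train = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})
test = pd.DataFrame({'Embarked': ['S', 'S', 'C']})

train_enc = pd.get_dummies(train, columns=['Embarked'])
test_enc = pd.get_dummies(test, columns=['Embarked'])

# Align the test columns to the training layout, filling absent dummies with 0
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(test_enc.columns))
```

After the reindex, both frames have identical columns in identical order, so the fitted scaler and model accept the test data without complaint.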

The ‘forest_clf’ is my model classifier, and ‘test’ is the scrubbed data frame that goes into the .predict() method. Since I used a number of different models, I decided to name this one ‘forest_test_preds’!

# Predict target values on test set
forest_test_preds = forest_clf.predict(test)

Since the prediction returns an array, I wanted to put it back into a data frame and view the results.

# Change array to a data frame
forest_target = pd.DataFrame(forest_test_preds)

For fun I wanted to see how this model compared to my training data in terms of the proportion of survivors. In the ‘train.csv’ file I discovered that only about 38% of the passengers lived.

# Percentage of survivors vs. those that perished
forest_target[0].value_counts(normalize=True)

My model predicts that about 35% of the passengers survived.

Now I just needed to complete the steps I described above: merging in the ‘PassengerId’, renaming the prediction column to ‘Survived’, and finally writing the data frame back out to a .csv file for submission.

forest_output.to_csv('whipple_titanic_forest_submission.csv', index=False)
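Pulled together with toy stand-in data, those three finishing steps look like this (the values are made up; the variable names mirror the ones used throughout the post):

```python
import pandas as pd

# Toy stand-ins: three test-set IDs and three predictions
test_set = pd.DataFrame({'PassengerId': [892, 893, 894]})
forest_target = pd.DataFrame([0, 1, 0])

# 1. Merge PassengerId back in on the shared index
forest_output = pd.merge(test_set['PassengerId'], forest_target,
                         how='left', left_index=True, right_index=True)

# 2. Rename the prediction column to match the sample submission
forest_output = forest_output.rename(columns={0: 'Survived'})

# 3. Write the two-column CSV, dropping the index
forest_output.to_csv('whipple_titanic_forest_submission.csv', index=False)
print(list(forest_output.columns))
```

The resulting file has exactly the two columns the sample submission shows: ‘PassengerId’ and ‘Survived’.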

My submission scored 76.55%. Not bad for my first model! I was hoping to do at least as well as my own validation score of 80.72%, but the gap suggests my model overfit the training data a bit. Now it’s time to go back and tweak my project to see if I can improve my results.

All in all, this was a fun project. I was a little discouraged to discover that tons of people have actually scored 100%, correctly identifying all of the survivors in the ‘test.csv’ file, even if the majority of submissions have results more similar to mine. Thank you to Kaggle for this opportunity, and I look forward to trying some more difficult competitions in the future.

Aspiring Data Scientist
