Predicting Titanic Survivors¶
The Titanic was the famous ocean liner that struck an iceberg and sank, killing most of its passengers. What if I told you we could figure out who was going to die and who wasn't, just based on the passengers' social standing, ticket price, gender, and age? I bet the passengers would have liked to have had this kind of technology before they boarded.
We're going to look at a dataset describing all of these passengers, and we'll try to train a model to predict whether a passenger survived or not.
Let's load our training data and take a look at it.
import pandas as pd
train = pd.read_csv("train.csv")
train.head(6)
Data Dictionary¶
| Variable | Definition | Key |
|---|---|---|
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| Age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
Overview¶
One of the reasons so many people died on the Titanic was that there weren't enough lifeboats for everyone. You have to wonder how they decided who got a spot in a lifeboat and who didn't. My guess would be that women and children were let on first, and after that it was probably the upper class who were allowed on. By the end of this post we'll be able to compare a test set of passengers with our classification results and see how well we can predict survivors.
In the end we can also create fake passengers to see what kinds of passengers were likely to survive. For instance, we could make up a 25-year-old lower-class man with no family on board whose ticket cost very little, see what our model predicts for him, and draw some conclusions about the situation as a whole. Maybe lower-class passengers were left to die because they were poor, or maybe young men honorably let the women and children go before them. We'll sketch exactly that at the end of the post. Let's start digging into the data to see what we can find.
Preprocessing¶
To process this data we're going to go through four steps:
- removing unnecessary attributes
- dealing with missing values
- encoding categorical attributes
- normalizing continuous attributes
Removing unnecessary attributes¶
It seems like Cabin won't be of much use. We could do some research into cabin naming conventions and try to extract features from it, but we'll leave that for later.
For now, we'll remove PassengerId, Name, Ticket, and Cabin. Everything else is either a continuous variable or a categorical variable with two or three categories.
We'll also store Survived in a separate series as the labels, and we'll encode Sex, Pclass, and Embarked as one-hot encoded dummy variables.
removed_columns = ["PassengerId", "Name","Cabin","Ticket"]
train = train.drop(removed_columns, axis=1)
train.head(5)
Checking for missing values¶
Let's check whether any of our variables have missing values.
print(train["Survived"].isnull().value_counts())
print(train["Pclass"].isnull().value_counts())
print(train["Sex"].isnull().value_counts())
print(train["Age"].isnull().value_counts())
print(train["Parch"].isnull().value_counts())
print(train["SibSp"].isnull().value_counts())
print(train["Fare"].isnull().value_counts())
print(train["Embarked"].isnull().value_counts())
There are only 2 entries where Embarked is empty, so we can simply remove those rows. The 177 missing values for Age are a bigger problem. We could drop those rows too, but we would lose a significant portion of the data. Let's look at some of the entries where Age is unavailable.
# Remove the two entries with no Embarked value
train = train.loc[train["Embarked"].notnull()]
# Look at a few of the entries that are missing an Age
train.loc[train["Age"].isnull()].head(5)
print("Class - Missing Age")
print(train.loc[train["Age"].isnull() == True]["Pclass"].value_counts())
print("Percent lower class (no age): %f.2%%" % (136/(136+30+11)))
print("\nClass - All")
print(train["Pclass"].value_counts())
print("Percent lower class (all): %f.2%%" % (491/(491+216+184)))
Lower Class Overrepresented¶
Among the passengers who were missing their age, about 76% were lower class, even though the lower class only made up about 55% of all passengers. If we simply dropped everyone who is missing an age, we would disproportionately lose information about lower-class passengers. For this reason, it seems better to fill the missing ages with an average age based on the passenger's class.
# Fill missing values with the average age of each passenger class
avg_lower_age = train.loc[train["Pclass"] == 3, "Age"].mean()
train.loc[train["Pclass"] == 3] = train.loc[train["Pclass"] == 3].fillna(avg_lower_age)
avg_middle_age = train.loc[train["Pclass"] == 2, "Age"].mean()
train.loc[train["Pclass"] == 2] = train.loc[train["Pclass"] == 2].fillna(avg_middle_age)
avg_high_age = train.loc[train["Pclass"] == 1, "Age"].mean()
train.loc[train["Pclass"] == 1] = train.loc[train["Pclass"] == 1].fillna(avg_high_age)
Separating the Labels¶
We saved this step until now because we had to remove some entries first. Now we'll pull the Survived column out into its own series to use as the labels.
train_y = train["Survived"]
train = train.drop(["Survived"], axis=1)
Encoding Categorical Variables¶
train["Pclass"] = train["Pclass"].astype("category")
train = pd.get_dummies(train)
train.head(5)
Normalizing Continuous Attributes¶
Now that we've removed the unnecessary data and converted the categorical data to one-hot encoded dummy variables, we need to deal with our continuous variables. Age, SibSp, Parch, and Fare will each be min-max scaled to the range [0, 1].
norm_columns = ["Age","Parch","Fare","SibSp"]
for n in norm_columns:
train[n] = ((train[n] - min(train[n])) / (max(train[n]) - min(train[n])))
train.head(5)
Process Test Data¶
Now that we've done all our processing on the training data, we'll apply the same steps to the test data so we can use it to measure the accuracy of our model.
test = pd.read_csv("test.csv")
# gender_submission.csv is Kaggle's sample submission; we use its Survived column as our test labels
classifiers = pd.read_csv("gender_submission.csv")
test_y = classifiers["Survived"]
removed_columns = ["PassengerId", "Name","Cabin","Ticket"]
test = test.drop(removed_columns, axis = 1)
# Remove any entries with no Embarked value
test = test.loc[test["Embarked"].notnull()]
# Fill missing values with the average age of each passenger class, as we did for the training set
# (note: fillna fills every remaining NaN in these rows, not just Age)
avg_lower_age = test.loc[test["Pclass"] == 3, "Age"].mean()
test.loc[test["Pclass"] == 3] = test.loc[test["Pclass"] == 3].fillna(avg_lower_age)
avg_middle_age = test.loc[test["Pclass"] == 2, "Age"].mean()
test.loc[test["Pclass"] == 2] = test.loc[test["Pclass"] == 2].fillna(avg_middle_age)
avg_high_age = test.loc[test["Pclass"] == 1, "Age"].mean()
test.loc[test["Pclass"] == 1] = test.loc[test["Pclass"] == 1].fillna(avg_high_age)
test["Pclass"] = test["Pclass"].astype("category")
test = pd.get_dummies(test)
for n in norm_columns:
test[n] = ((test[n] - min(test[n])) / (max(test[n]) - min(test[n])))
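One thing worth knowing about this step: pd.get_dummies only creates columns for the categories it actually sees, so if the test set were missing a category (or had an extra one) its columns wouldn't line up with the training set. That doesn't bite us on this dataset, but a small guard like the following sketch keeps the two frames aligned. Ideally we would also reuse the training set's minimum and maximum when scaling the test columns, but we'll keep things simple for now.
# Sketch: force the test columns to match the training columns, filling any missing dummies with 0
test = test.reindex(columns=train.columns, fill_value=0)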
Training Model and Classifying¶
Now we'll train our model with our prepared data. We're going to use a random forest for our model.
from sklearn.ensemble import RandomForestClassifier
# Initialize the model with some parameters.
model = RandomForestClassifier(n_estimators=5, min_samples_leaf=5, random_state=1)
# Fit the model to the data.
model.fit(train, train_y)
# Make predictions.
predictions = model.predict(test)
# Compare the predictions with the labels from gender_submission.csv
num_correct = 0
for i, pred in enumerate(predictions):
    if pred == test_y[i]:
        num_correct += 1
print("Percent Correct: %.2f%%" % (num_correct * 100 / predictions.size))
Results¶
These are some good initial results: we were able to predict which passengers would survive with 91% accuracy. There are a few more things we could try to improve that:
- Extracting features from the attributes we removed (name, ticket, cabin); see the small sketch after this list
- Trying out different machine learning algorithms (e.g. a neural network)
- Filling in missing age values by training a regression model to predict age from the other attributes
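To give a flavor of the first idea, here's a small sketch that pulls the title out of each passenger's name (re-reading train.csv, since we dropped Name earlier). Titles like "Master" and "Mrs" carry age and sex information that could be turned into extra features:
# Sketch: extract the title (Mr, Mrs, Miss, Master, ...) from the Name column
raw = pd.read_csv("train.csv")
raw["Title"] = raw["Name"].str.extract(r",\s*([^.]+)\.", expand=False)
print(raw["Title"].value_counts().head())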
All in all I'm pleased with the results we got from this dataset. According to the rankings on https://www.kaggle.com/c/titanic/leaderboard, this accuracy would rank me 22nd out of 6000 competitors. I can see that some people have achieved 100%, so that is definitely something to strive for.
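Finally, as promised in the overview, we can probe the model with a made-up passenger. This is only a sketch: the row has to use the same one-hot encoded, normalized columns that get_dummies produced above, and the Age and Fare values below are rough hand-normalized guesses rather than exact transforms.
# A hypothetical passenger: 25-year-old lower-class man, no family aboard, very cheap ticket
fake = pd.DataFrame([{
    "Age": 0.3,    # roughly 25 years old on the 0-1 scale (the oldest passenger is about 80)
    "SibSp": 0.0, "Parch": 0.0,
    "Fare": 0.01,  # very cheap ticket on the 0-1 scale
    "Pclass_1": 0, "Pclass_2": 0, "Pclass_3": 1,
    "Sex_female": 0, "Sex_male": 1,
    "Embarked_C": 0, "Embarked_Q": 0, "Embarked_S": 1,
}])
# Line the columns up with the training data before predicting
fake = fake.reindex(columns=train.columns, fill_value=0)
print(model.predict(fake))  # 0 = did not survive, 1 = survived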