Predicting Titanic Survivors

The Titanic was the famous ocean liner that struck an iceberg and sank, killing most of its passengers. What if I told you we could figure out who was going to die and who wasn't, just based on the passengers' social standing, ticket price, gender, and age? I bet the passengers would have liked to have this kind of technology before they boarded.

We're going to look at a dataset describing all of these passengers, and we'll try to train a model to predict whether a passenger survived or not.

Let's load our training data and take a look at it.

In [2]:
import pandas as pd

train = pd.read_csv("train.csv")
train.head(6)
Out[2]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q

Data Dictionary

survival  Survival (0 = No, 1 = Yes)
pclass    Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
sex       Sex
age       Age in years
sibsp     # of siblings / spouses aboard the Titanic
parch     # of parents / children aboard the Titanic
ticket    Ticket number
fare      Passenger fare
cabin     Cabin number
embarked  Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

Overview

One of the reasons so many people died on the Titanic was that there weren't enough lifeboats for everyone. You have to wonder how they decided who got a seat in a lifeboat and who didn't. My guess would be that women and children were let on first, and that after them priority probably went to the upper class. By the end of this post we'll be able to compare a test set of passengers with our classification results and see how well we can predict survivors.

In the end we can also create fake passengers to see what types of passengers were likely to survive. For instance, we could make up a 25-year-old lower-class man with no family on board and a very cheap ticket. We could see what our model predicts for this type of passenger and draw some conclusions about the situation as a whole. Maybe lower-class passengers were left to die because they were poor, or maybe young men honorably let the women and children go before them. Let's start digging into the data to see what we can find.
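
Once the model is trained at the very end of this post, a query like that could look roughly like the sketch below. Treat this purely as a preview: the column names and the 0-to-1 scaling come from the preprocessing we're about to do, and the numbers for our made-up passenger are only approximate.

# A sketch, not an executed cell: ask the trained model about a made-up passenger.
# The columns must match the processed training frame exactly, and Age/Fare must be
# scaled the same way (roughly 0.31 and 0.014 here for a 25-year-old on a ~7 pound ticket).
fake_passenger = pd.DataFrame([{
    "Age": 0.31, "SibSp": 0.0, "Parch": 0.0, "Fare": 0.014,
    "Pclass_1": 0, "Pclass_2": 0, "Pclass_3": 1,
    "Sex_female": 0, "Sex_male": 1,
    "Embarked_C": 0, "Embarked_Q": 0, "Embarked_S": 1,
}], columns=train.columns)
print(model.predict(fake_passenger))  # 0 = did not survive, 1 = survived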

Preprocessing

To process this data we'll go through four steps:

  • removing unnecessary attributes
  • dealing with missing values
  • encoding categorical attributes
  • normalizing continuous attributes

Removing unnecessary attributes

It seems like Cabin won't be of much use. We could research cabin naming conventions and try to extract some features from it, but we'll leave that for later.

For now, we'll remove PassengerId, Name, Ticket, and Cabin. Everything else is either a continuous variable or a categorical variable with two or three categories.

We'll also store Survived in a separate series as the labels we want to predict, and encode Sex, Pclass, and Embarked as one-hot dummy variables.

In [3]:
removed_columns = ["PassengerId", "Name","Cabin","Ticket"]

train = train.drop(removed_columns, axis=1)

train.head(5)
Out[3]:
Survived Pclass Sex Age SibSp Parch Fare Embarked
0 0 3 male 22.0 1 0 7.2500 S
1 1 1 female 38.0 1 0 71.2833 C
2 1 3 female 26.0 0 0 7.9250 S
3 1 1 female 35.0 1 0 53.1000 S
4 0 3 male 35.0 0 0 8.0500 S

Checking for missing values

Lets check if there are any missing values for any of our variables.

In [4]:
print(train["Survived"].isnull().value_counts())
print(train["Pclass"].isnull().value_counts())
print(train["Sex"].isnull().value_counts())
print(train["Age"].isnull().value_counts())
print(train["Parch"].isnull().value_counts())
print(train["SibSp"].isnull().value_counts())
print(train["Fare"].isnull().value_counts())
print(train["Embarked"].isnull().value_counts())
False    891
Name: Survived, dtype: int64
False    891
Name: Pclass, dtype: int64
False    891
Name: Sex, dtype: int64
False    714
True     177
Name: Age, dtype: int64
False    891
Name: Parch, dtype: int64
False    891
Name: SibSp, dtype: int64
False    891
Name: Fare, dtype: int64
False    889
True       2
Name: Embarked, dtype: int64
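
As a side note, a more compact way to get the same information is pandas' DataFrame-wide null check; a one-liner like this should produce equivalent counts:

# Count missing values in every column at once.
print(train.isnull().sum())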

There are only 2 entries where Embarked is empty, so we can just remove those to clean up the data. The 177 missing Age values are a bigger problem. We could drop those rows to keep the data clean and unbiased, but we would lose a significant portion of the data. Let's look at some of the entries where Age is unavailable.

In [5]:
#remove entries with no Embarked value
train = train.loc[train["Embarked"].isnull() == False]
In [6]:
train.loc[train["Age"].isnull() == True].head(5)
Out[6]:
Survived Pclass Sex Age SibSp Parch Fare Embarked
5 0 3 male NaN 0 0 8.4583 Q
17 1 2 male NaN 0 0 13.0000 S
19 1 3 female NaN 0 0 7.2250 C
26 0 3 male NaN 0 0 7.2250 C
28 1 3 female NaN 0 0 7.8792 Q
In [7]:
print("Class - Missing Age")
print(train.loc[train["Age"].isnull() == True]["Pclass"].value_counts())
print("Percent lower class (no age): %.2f%%" % (100 * 136/(136+30+11)))
print("\nClass - All")
print(train["Pclass"].value_counts())
print("Percent lower class (all): %.2f%%" % (100 * 491/(491+214+184)))
Class - Missing Age
3    136
1     30
2     11
Name: Pclass, dtype: int64
Percent lower class (no age): 76.84%

Class - All
3    491
1    214
2    184
Name: Pclass, dtype: int64
Percent lower class (all): 55.23%

Lower Class Overrepresented Among Missing Ages

Among the passengers missing an age in the records, about 76% were lower class, even though the lower class made up only about 55% of all passengers. This means that if we simply drop everyone who is missing an age, we'll lose a disproportionate amount of information about lower-class passengers. For this reason, it seems better to fill the missing ages with an average age based on the passenger's class.

In [8]:
# Fill missing values with the average age of each passenger's class
# (Age is the only column that still contains NaN in the training data).
avg_lower_age = train.loc[train["Pclass"] == 3, "Age"].mean()
train.loc[train["Pclass"] == 3] = train.loc[train["Pclass"] == 3].fillna(avg_lower_age)

avg_middle_age = train.loc[train["Pclass"] == 2, "Age"].mean()
train.loc[train["Pclass"] == 2] = train.loc[train["Pclass"] == 2].fillna(avg_middle_age)

avg_high_age = train.loc[train["Pclass"] == 1, "Age"].mean()
train.loc[train["Pclass"] == 1] = train.loc[train["Pclass"] == 1].fillna(avg_high_age)
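
As an aside, the same imputation can be written more compactly with a groupby. This is just an equivalent sketch, not what was run above:

# Fill each passenger's missing Age with the mean Age of their Pclass group.
train["Age"] = train["Age"].fillna(train.groupby("Pclass")["Age"].transform("mean"))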

Separating the Labels

We saved this step until now because we first had to remove a couple of rows. Now we'll move the Survived labels into their own series.

In [9]:
train_y = train["Survived"]
train = train.drop(["Survived"], axis=1)

Encoding Categorical Variables

In [10]:
train["Pclass"] = train["Pclass"].astype("category")
train = pd.get_dummies(train)

train.head(5)
Out[10]:
Age SibSp Parch Fare Pclass_1 Pclass_2 Pclass_3 Sex_female Sex_male Embarked_C Embarked_Q Embarked_S
0 22.0 1 0 7.2500 0 0 1 0 1 0 0 1
1 38.0 1 0 71.2833 1 0 0 1 0 1 0 0
2 26.0 0 0 7.9250 0 0 1 1 0 0 0 1
3 35.0 1 0 53.1000 1 0 0 1 0 0 0 1
4 35.0 0 0 8.0500 0 0 1 0 1 0 0 1

Normalizing Continuous Attributes

Now that we've removed the unnecessary data and converted the categorical data to one-hot encoded dummy variables, we need to deal with our continuous variables. Age, SibSp, Parch, and Fare will be min-max normalized into the 0-1 range.

In [11]:
norm_columns = ["Age","Parch","Fare","SibSp"]
for n in norm_columns:
    train[n] = ((train[n] - min(train[n])) / (max(train[n]) - min(train[n])))
train.head(5)
Out[11]:
Age SibSp Parch Fare Pclass_1 Pclass_2 Pclass_3 Sex_female Sex_male Embarked_C Embarked_Q Embarked_S
0 0.271174 0.125 0.0 0.014151 0 0 1 0 1 0 0 1
1 0.472229 0.125 0.0 0.139136 1 0 0 1 0 1 0 0
2 0.321438 0.000 0.0 0.015469 0 0 1 1 0 0 0 1
3 0.434531 0.125 0.0 0.103644 1 0 0 1 0 0 0 1
4 0.434531 0.000 0.0 0.015713 0 0 1 0 1 0 0 1
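
One thing to note: we scale with the training set's own minimum and maximum, and below we'll scale the test set with its own minimum and maximum. If we wanted both sets on exactly the same scale, one option is to fit a scaler on the training data and reuse it, for example with scikit-learn's MinMaxScaler. A minimal sketch (not what this post actually does):

from sklearn.preprocessing import MinMaxScaler

# Fit the 0-1 scaling on the training data only, then reuse it for the test data.
scaler = MinMaxScaler()
train[norm_columns] = scaler.fit_transform(train[norm_columns])
# later, once the test set has been preprocessed:
# test[norm_columns] = scaler.transform(test[norm_columns])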

Process Test Data

Now that we've done all our processing on the training data, we'll apply the same steps to the test data so we can use it to measure the accuracy of our model.

In [13]:
test = pd.read_csv("test.csv")

classifiers = pd.read_csv("gender_submission.csv")

test_y = classifiers["Survived"]

removed_columns = ["PassengerId", "Name","Cabin","Ticket"]
test = test.drop(removed_columns, axis = 1)

#remove entries with no Embarked value
test = test.loc[test["Embarked"].isnull() == False]

avg_lower_age = test.loc[test["Pclass"] == 3, "Age"].mean()
test.loc[test["Pclass"] == 3] = test.loc[test["Pclass"] == 3].fillna(avg_lower_age)

avg_middle_age = test.loc[test["Pclass"] == 2, "Age"].mean()
test.loc[test["Pclass"] == 2] = test.loc[test["Pclass"] == 2].fillna(avg_middle_age)

avg_high_age = test.loc[test["Pclass"] == 1, "Age"].mean()
test.loc[test["Pclass"] == 1] = test.loc[test["Pclass"] == 1].fillna(avg_high_age)

test["Pclass"] = test["Pclass"].astype("category")
test = pd.get_dummies(test)

for n in norm_columns:
    test[n] = ((test[n] - min(test[n])) / (max(test[n]) - min(test[n])))
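
One caveat worth flagging: pd.get_dummies only creates columns for categories that actually appear in the data, so if the test set were missing a category its dummy columns could end up misaligned with the training frame. It happens to line up here, but a defensive one-liner like the following (a sketch) would guarantee the same column order:

# Make the test frame's columns match the training frame, filling any
# missing dummy columns with 0.
test = test.reindex(columns=train.columns, fill_value=0)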

Training Model and Classifying

Now we'll train a model on our prepared data. We're going to use a random forest.

In [14]:
from sklearn.ensemble import RandomForestClassifier

# Initialize the model with some parameters.
model = RandomForestClassifier(n_estimators=5, min_samples_leaf=5, random_state=1)
# Fit the model to the data.
model.fit(train, train_y)
# Make predictions.
predictions = model.predict(test)
In [15]:
# Compare the predictions with the labels from gender_submission.csv.
num_correct = (predictions == test_y.values).sum()

print("Percent Correct: %.2f%%" % (num_correct * 100 / predictions.size))
Percent Correct: 91.39%
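
Before wrapping up, it can be interesting to peek at which attributes the forest leans on most. scikit-learn exposes this through feature_importances_; a quick look could go something like this (actual values will depend on the data and random_state):

# Rank features by how much the forest relied on them.
importances = pd.Series(model.feature_importances_, index=train.columns)
print(importances.sort_values(ascending=False))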

Results

These are some good initial results. We were able to predict which passengers would survive with about 91% accuracy. There are a few more things we could try to improve our accuracy:

  • Extract features from the attributes we removed (Name, Ticket, Cabin); a small sketch follows this list
  • Try out some different machine learning algorithms (e.g. a neural network)
  • Fill in missing age values by training a regression model that predicts age from the other attributes
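
For the first idea, one possible starting point is pulling a passenger's title (Mr, Mrs, Miss, ...) out of the Name column, since it hints at age, sex, and social standing all at once. A rough sketch, assuming we re-read the raw training file because Name was dropped earlier:

# Extract the title that sits between the comma and the period in each name,
# e.g. "Braund, Mr. Owen Harris" -> "Mr".
raw = pd.read_csv("train.csv")
raw["Title"] = raw["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
print(raw["Title"].value_counts().head())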

All in all I'm pleased with the results we got from this dataset. According to the rankings on https://www.kaggle.com/c/titanic/leaderboard, my accuracy would rank me 22nd out of 6000 competitors. I can see that there are some people who achieved 100%, so that is definitely something to strive for.
