Predicting Titanic Survivors¶
The Titanic was the famous ocean liner that struck an iceberg and sank, killing most of its passengers. What if I told you we could figure out who was going to die and who wasn't, just based on the passengers' social standing, ticket price, gender, and age? I bet the passengers would have liked to have had this kind of technology before they boarded.
We're going to look at a dataset describing all of these passengers, and we'll try to train a model to predict whether a passenger survived or not.
Let's load our training data and take a look at it.
import pandas as pd
train = pd.read_csv("train.csv")
train.head(6)
Data Dictionary¶
| Variable | Definition | Key |
|---|---|---|
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| Age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
Overview¶
One of the reasons so many people died on the Titanic was that there weren't enough lifeboats for everyone. You have to wonder how they decided who got a spot in a lifeboat and who didn't. My guess would be that women and children were let on first, and after that it was probably the upper class who were allowed on. By the end of this post we'll be able to compare a test set of passengers with our classification results and see how well we can predict survivors.
In the end we can also create fake passengers to see what kinds of passengers were likely to survive. For instance, we could make up a 25-year-old lower-class man with no family on board whose ticket cost very little, see what our model predicts for him, and draw some conclusions about the situation as a whole. Maybe lower-class passengers were left to die because they were poor, or maybe young men honorably let the women and children go before them. We'll sketch exactly that at the end of the post. Let's start digging into the data to see what we can find.
Preprocessing¶
To process this data we're going to go through four steps:
- removing unnecessary attributes
- dealing with missing values
- encoding categorical attributes
- normalizing continuous attributes
Removing unnecessary attributes¶
It seems like Cabin won't be of much use. We could do some research into cabin naming conventions and try to extract features from it, but we'll leave that for later.
For now, we'll remove PassengerId, Name, Ticket, and Cabin. Everything else is either a continuous variable or a categorical variable with two or three categories.
We'll also store Survived in a separate series as the labels, and we'll encode Sex, Pclass, and Embarked as one-hot encoded dummy variables.
removed_columns = ["PassengerId", "Name","Cabin","Ticket"]
train = train.drop(removed_columns, axis=1)
train.head(5)
Checking for missing values¶
Let's check whether any of our variables have missing values.
print(train["Survived"].isnull().value_counts())
print(train["Pclass"].isnull().value_counts())
print(train["Sex"].isnull().value_counts())
print(train["Age"].isnull().value_counts())
print(train["Parch"].isnull().value_counts())
print(train["SibSp"].isnull().value_counts())
print(train["Fare"].isnull().value_counts())
print(train["Embarked"].isnull().value_counts())
There are only 2 entries where Embarked is empty, so we can simply remove those rows. The 177 missing values for Age are a bigger problem. We could drop those rows too, but we would lose a significant portion of the data. Let's look at some of the entries where Age is unavailable.
# Remove the two entries with no Embarked value
train = train.loc[train["Embarked"].notnull()]
# Look at a few of the entries that are missing an Age
train.loc[train["Age"].isnull()].head(5)
print("Class - Missing Age")
print(train.loc[train["Age"].isnull() == True]["Pclass"].value_counts())
print("Percent lower class (no age): %f.2%%" % (136/(136+30+11)))
print("\nClass - All")
print(train["Pclass"].value_counts())
print("Percent lower class (all): %f.2%%" % (491/(491+216+184)))
Lower Class Overrepresented¶
Among the passengers who were missing their age, about 76% were lower class, even though the lower class only made up about 55% of all passengers. If we simply dropped everyone who is missing an age, we would disproportionately lose information about lower-class passengers. For this reason, it seems better to fill the missing ages with an average age based on the passenger's class.
# Fill missing values with the average age of each passenger class
avg_lower_age = train.loc[train["Pclass"] == 3, "Age"].mean()
train.loc[train["Pclass"] == 3] = train.loc[train["Pclass"] == 3].fillna(avg_lower_age)
avg_middle_age = train.loc[train["Pclass"] == 2, "Age"].mean()
train.loc[train["Pclass"] == 2] = train.loc[train["Pclass"] == 2].fillna(avg_middle_age)
avg_high_age = train.loc[train["Pclass"] == 1, "Age"].mean()
train.loc[train["Pclass"] == 1] = train.loc[train["Pclass"] == 1].fillna(avg_high_age)
Separating the Labels¶
We saved this step until now because we had to remove some entries first. Now we'll pull the Survived column out into its own series to use as the labels.
train_y = train["Survived"]
train = train.drop(["Survived"], axis=1)
Encoding Categorical Variables¶
train["Pclass"] = train["Pclass"].astype("category")
train = pd.get_dummies(train)
train.head(5)
Normalizing Continuous Attributes¶
Now that we've removed the unnecessary data and converted the categorical data to one-hot encoded dummy variables, we need to deal with our continuous variables. Age, SibSp, Parch, and Fare will each be min-max scaled to the range [0, 1].
norm_columns = ["Age","Parch","Fare","SibSp"]
for n in norm_columns:
train[n] = ((train[n] - min(train[n])) / (max(train[n]) - min(train[n])))
train.head(5)
Process Test Data¶
Now that we've done all our processing on the training data, we'll apply the same steps to the test data so we can use it to measure the accuracy of our model.
test = pd.read_csv("test.csv")
# gender_submission.csv is Kaggle's sample submission; we use its Survived column as our test labels
classifiers = pd.read_csv("gender_submission.csv")
test_y = classifiers["Survived"]
removed_columns = ["PassengerId", "Name","Cabin","Ticket"]
test = test.drop(removed_columns, axis = 1)
# Remove any entries with no Embarked value
test = test.loc[test["Embarked"].notnull()]
# Fill missing values with the average age of each passenger class, as we did for the training set
# (note: fillna fills every remaining NaN in these rows, not just Age)
avg_lower_age = test.loc[test["Pclass"] == 3, "Age"].mean()
test.loc[test["Pclass"] == 3] = test.loc[test["Pclass"] == 3].fillna(avg_lower_age)
avg_middle_age = test.loc[test["Pclass"] == 2, "Age"].mean()
test.loc[test["Pclass"] == 2] = test.loc[test["Pclass"] == 2].fillna(avg_middle_age)
avg_high_age = test.loc[test["Pclass"] == 1, "Age"].mean()
test.loc[test["Pclass"] == 1] = test.loc[test["Pclass"] == 1].fillna(avg_high_age)
test["Pclass"] = test["Pclass"].astype("category")
test = pd.get_dummies(test)
for n in norm_columns:
test[n] = ((test[n] - min(test[n])) / (max(test[n]) - min(test[n])))
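One thing worth knowing about this step: pd.get_dummies only creates columns for the categories it actually sees, so if the test set were missing a category (or had an extra one) its columns wouldn't line up with the training set. That doesn't bite us on this dataset, but a small guard like the following sketch keeps the two frames aligned. Ideally we would also reuse the training set's minimum and maximum when scaling the test columns, but we'll keep things simple for now.
# Sketch: force the test columns to match the training columns, filling any missing dummies with 0
test = test.reindex(columns=train.columns, fill_value=0)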
Training Model and Classifying¶
Now we'll train our model with our prepared data. We're going to use a random forest for our model.
from sklearn.ensemble import RandomForestClassifier
# Initialize the model with some parameters.
model = RandomForestClassifier(n_estimators=5, min_samples_leaf=5, random_state=1)
# Fit the model to the data.
model.fit(train, train_y)
# Make predictions.
predictions = model.predict(test)
# Compare the predictions with the labels from gender_submission.csv
num_correct = 0
for i, pred in enumerate(predictions):
    if pred == test_y[i]:
        num_correct += 1
print("Percent Correct: %.2f%%" % (num_correct * 100 / predictions.size))
Results¶
These are some good initial results: we were able to predict which passengers would survive with 91% accuracy. There are a few more things we could try to improve that:
- Extracting features from the attributes we removed (name, ticket, cabin); see the small sketch after this list
- Trying out different machine learning algorithms (e.g. a neural network)
- Filling in missing age values by training a regression model to predict age from the other attributes
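To give a flavor of the first idea, here's a small sketch that pulls the title out of each passenger's name (re-reading train.csv, since we dropped Name earlier). Titles like "Master" and "Mrs" carry age and sex information that could be turned into extra features:
# Sketch: extract the title (Mr, Mrs, Miss, Master, ...) from the Name column
raw = pd.read_csv("train.csv")
raw["Title"] = raw["Name"].str.extract(r",\s*([^.]+)\.", expand=False)
print(raw["Title"].value_counts().head())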
All in all I'm pleased with the results we got from this dataset. According to the rankings on https://www.kaggle.com/c/titanic/leaderboard, this accuracy would rank me 22nd out of 6000 competitors. I can see that some people have achieved 100%, so that is definitely something to strive for.
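Finally, as promised in the overview, we can probe the model with a made-up passenger. This is only a sketch: the row has to use the same one-hot encoded, normalized columns that get_dummies produced above, and the Age and Fare values below are rough hand-normalized guesses rather than exact transforms.
# A hypothetical passenger: 25-year-old lower-class man, no family aboard, very cheap ticket
fake = pd.DataFrame([{
    "Age": 0.3,    # roughly 25 years old on the 0-1 scale (the oldest passenger is about 80)
    "SibSp": 0.0, "Parch": 0.0,
    "Fare": 0.01,  # very cheap ticket on the 0-1 scale
    "Pclass_1": 0, "Pclass_2": 0, "Pclass_3": 1,
    "Sex_female": 0, "Sex_male": 1,
    "Embarked_C": 0, "Embarked_Q": 0, "Embarked_S": 1,
}])
# Line the columns up with the training data before predicting
fake = fake.reindex(columns=train.columns, fill_value=0)
print(model.predict(fake))  # 0 = did not survive, 1 = survived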