Classification Problem: Predict the chance of survival of a voyager on the Titanic based on the voyager’s information

Author

Rafiq Islam

Published

October 15, 2021

The Data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns 
from mywebstyle import plot_style
plot_style('#f4f4f4')

titanic = pd.read_csv('titanic_train.csv')

Exploratory Data Analysis

Descriptive Statistics

titanic.head()
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S
titanic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

Some values appear to be missing for the Age, Cabin, and Embarked features. A heatmap of the null mask makes this visible:

sns.heatmap(titanic.isnull(), yticklabels=False, cbar=False, cmap='viridis')

Approximately \(20\%\) of the Age values are missing (177 of 891). Cabin is missing for most passengers (687 of 891), which is too many to impute reliably. Embarked has only two missing observations. So, we need to take extra care of these features in the data cleaning and preparation stage.
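The heatmap gives a visual impression; the same information can be read off as numbers. The sketch below uses a small made-up frame (`toy` is hypothetical, not the Titanic data) to show how the share of missing values per column can be computed.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `titanic`; the values are made up for illustration.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan, np.nan],
    "Embarked": ["S", "C", "S", None, "S"],
})

# isnull() gives a boolean frame; the mean of each boolean column
# is the fraction of missing entries in that column.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print((missing_share * 100).round(1))
```

On the real data the same expression would be `titanic.isnull().mean()`. From the `info()` output above, that corresponds to about 19.9% missing for Age and about 77.1% for Cabin.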

Data Visualization

sns.countplot(x='Survived', hue='Sex', data= titanic, palette='RdBu_r')

Most of the passengers who did not survive were male.

sns.countplot(x='Survived', hue='Pclass', data=titanic, palette='rainbow')

This plot shows that third-class passengers make up the largest group among those who did not survive, while first-class passengers make up the largest group among the survivors.
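Count plots show group sizes, not rates. To compare survival rates directly, one option is a group-wise mean of the 0/1 `Survived` column. This is a minimal sketch on a made-up frame (`toy` is hypothetical), not output from the Titanic data.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0, 1, 0],
    "Sex": ["male", "female", "female", "female", "male", "male", "male", "female"],
    "Pclass": [3, 1, 3, 1, 3, 2, 1, 3],
})

# The mean of a 0/1 column within each group is that group's survival rate.
rate_by_sex = toy.groupby("Sex")["Survived"].mean()
rate_by_class = toy.groupby("Pclass")["Survived"].mean()

# Row-normalised cross-tabulation: share of each outcome within each class.
class_table = pd.crosstab(toy["Pclass"], toy["Survived"], normalize="index")

print(rate_by_sex)
print(class_table)
```

With the real data, replacing `toy` by `titanic` would give the per-group rates behind the two count plots.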

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(7.9, 4))
titanic['Age'].hist(bins=35, color='darkred', alpha=0.6, ax=ax1)
ax1.set_xlabel('Age')
ax1.set_title('Age Distribution')
titanic['Fare'].hist(bins=30, color='darkred', alpha=0.6, ax=ax2)
ax2.set_xlabel('Fare')
ax2.set_title('Fare Distribution')
plt.tight_layout()
plt.show()

Age looks roughly normally distributed, whereas Fare is positively skewed. Next, the count-valued features SibSp and Parch:

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(7.9, 4))

sns.countplot(x='SibSp', data=titanic, ax=ax1)
ax1.set_title('Number of Siblings/Spouse')

sns.countplot(x='Parch', data=titanic, ax=ax2)
ax2.set_title('Number of Parents/Children')

plt.tight_layout()
plt.show()
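The skewness noted for Fare in the histograms can also be checked numerically with `Series.skew()`. The sketch below uses synthetic stand-ins (`age_like` and `fare_like` are hypothetical, drawn from a normal and a log-normal distribution), so the numbers it prints say nothing about the actual Titanic columns.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins: a roughly symmetric variable and a right-skewed one.
age_like = pd.Series(rng.normal(loc=30, scale=12, size=1000))
fare_like = pd.Series(rng.lognormal(mean=2.5, sigma=1.0, size=1000))

# Skewness near 0 suggests symmetry; a clearly positive value means a long right tail.
print("skew (symmetric):   ", round(age_like.skew(), 2))
print("skew (right-skewed):", round(fare_like.skew(), 2))

# log1p compresses the long right tail, which usually reduces the skewness.
print("skew after log1p:   ", round(np.log1p(fare_like).skew(), 2))
```

On the real data, `titanic['Age'].skew()` and `titanic['Fare'].skew()` would give the corresponding values. Whether a log transform of Fare helps the model is not something this post tests.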

Data Cleaning and Preparation

Handling Missing Data

Age is a continuous feature and is roughly normally distributed, so its missing values could simply be imputed with one overall central value such as the mean. However, Age also varies across the categorical features Sex, Pclass, SibSp, and Parch, so we can be smarter and impute within groups. Pclass is a good choice for the grouping: it has only three levels, and the boxplot below suggests that the age distribution differs between them.

sns.boxplot(
    x='Pclass', y='Age', hue='Pclass',
    data=titanic, palette='winter'
    )

So, for a passenger in 1st class the median Age (the center line of the box) is around 37, and for 2nd and 3rd class it is around 29 and 24, respectively.

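def age_imputation(cols):
    # cols is one row holding the 'Age' and 'Pclass' values;
    # label-based access avoids the deprecated positional cols[0] lookup
    Age = cols['Age']
    Pclass = cols['Pclass']
    if pd.isnull(Age):
        # Fill a missing age with the typical age read off the boxplot
        if Pclass == 1:
            return 37
        elif Pclass == 2:
            return 29
        else:
            return 24
    else:
        return Age

titanic['Age'] = titanic[['Age', 'Pclass']].apply(age_imputation, axis=1)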
sns.heatmap(titanic.isnull(), yticklabels=False, cbar=False, cmap='viridis')
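The hard-coded values can also be computed from the data. As an alternative sketch (not the method used above), `groupby(...).transform('median')` broadcasts each class's median Age back to every row, and `fillna` then fills only the gaps. The `toy` frame is made up so that the class medians come out as 37, 29, and 24.

```python
import numpy as np
import pandas as pd

# Made-up rows; class medians of the observed ages are 37, 29 and 24.
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3, 3, 3],
    "Age": [36.0, 38.0, np.nan, 29.0, np.nan, 22.0, 26.0, np.nan],
})

# Median Age within each class, aligned to the original row order.
class_median = toy.groupby("Pclass")["Age"].transform("median")

# Fill only the missing ages; observed ages are left untouched.
toy["Age"] = toy["Age"].fillna(class_median)
print(toy)
```

This avoids reading numbers off a plot and keeps working if the data change, at the cost of being a little less explicit than the function above.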

Cabin has too many missing values to be useful, so we drop that column, and we also drop the two rows with a missing Embarked value.

titanic.drop('Cabin', axis=1, inplace=True)
titanic.dropna(inplace=True)
sns.heatmap(titanic.isnull(), yticklabels=False, cbar=False, cmap='viridis')

Now no column has missing values. Next, we convert the categorical features.

Converting the Categorical Features

titanic['Male'] = pd.get_dummies(titanic.Sex,dtype=int)['male']
emb = pd.get_dummies(titanic['Embarked'],drop_first=True, dtype=int)
titanic = pd.concat([titanic, emb], axis=1)
titanic.drop(['Sex','Embarked','Name','Ticket'], axis = 1, inplace=True)
titanic.head()
   PassengerId  Survived  Pclass  Age   SibSp  Parch  Fare     Male  Q  S
0  1            0         3       22.0  1      0      7.2500   1     0  1
1  2            1         1       38.0  1      0      71.2833  0     0  0
2  3            1         3       26.0  0      0      7.9250   0     0  1
3  4            1         1       35.0  1      0      53.1000  0     0  1
4  5            0         3       35.0  0      0      8.0500   1     0  1
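The `drop_first=True` argument used for `Embarked` removes one dummy column, because the dropped level is implied when all the remaining dummies are 0. A small sketch on a made-up series (`ports` is hypothetical) shows the difference:

```python
import pandas as pd

# Made-up embarkation codes for illustration.
ports = pd.Series(["S", "C", "Q", "S"], name="Embarked")

# One column per level: C, Q, S. Every row sums to 1, so one column is redundant.
full = pd.get_dummies(ports, dtype=int)

# Drop the first level (C); a row with Q == 0 and S == 0 now means "C".
reduced = pd.get_dummies(ports, drop_first=True, dtype=int)

print(full)
print(reduced)
```

This is why the table above has only the Q and S columns, and why the second row (Embarked C) has 0 in both.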

Modeling

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

X_train, X_test, y_train, y_test = train_test_split(
    titanic.drop('Survived', axis=1),
    titanic.Survived, test_size=0.30,
    random_state=123
    )
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
pred = logreg.predict(X_test)
/opt/hostedtoolcache/Python/3.10.15/x64/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:469: ConvergenceWarning:

lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
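The warning above suggests either raising `max_iter` or scaling the features. A minimal sketch of both, assuming a scikit-learn pipeline with `StandardScaler`, is shown below on synthetic data from `make_classification` (not the Titanic features). The name `scaled_logreg` and the choice `max_iter=1000` are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; one column is inflated to mimic a large-valued feature.
X, y = make_classification(n_samples=500, n_features=8, random_state=123)
X[:, 0] *= 1000.0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=123
)

# Standardise the features, then fit logistic regression with a higher iteration cap.
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logreg.fit(X_train, y_train)

accuracy = scaled_logreg.score(X_test, y_test)
n_iter = scaled_logreg.named_steps["logisticregression"].n_iter_[0]
print(f"test accuracy: {accuracy:.2f}, lbfgs iterations used: {n_iter}")
```

The same pipeline could be fit on the Titanic `X_train` and `y_train`. Whether that changes the scores reported in the evaluation is not shown here.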

Evaluation

print(metrics.classification_report(y_test,pred))
              precision    recall  f1-score   support

           0       0.79      0.89      0.84       161
           1       0.79      0.64      0.71       106

    accuracy                           0.79       267
   macro avg       0.79      0.76      0.77       267
weighted avg       0.79      0.79      0.79       267
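The precision, recall, and f1-score columns in the report all come from the confusion matrix. The sketch below works this out by hand for class 1 on a tiny made-up pair of label arrays (`y_true` and `y_pred` here are hypothetical, not the model's predictions).

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels: 1 = survived, 0 = did not survive.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

# For binary labels the matrix is [[tn, fp], [fn, tp]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # of those predicted 1, the share that are truly 1
recall = tp / (tp + fn)      # of those truly 1, the share that were predicted 1
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Read this way, the recall of 0.64 for class 1 in the report means that roughly 64% of the 106 actual survivors in the test split were predicted as survivors.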

Citation

BibTeX citation:
@online{islam2021,
  author = {Islam, Rafiq},
  title = {Classification {Problem:} {Predict} the Chance of Survival of
    a Voyager on {Titanic} Based on the Voyager’s Information},
  date = {2021-10-15},
  url = {https://mrislambd.github.io/codepages/titanic/},
  langid = {en}
}
For attribution, please cite this work as:
Islam, Rafiq. 2021. “Classification Problem: Predict the Chance of Survival of a Voyager on Titanic Based on the Voyager’s Information.” October 15, 2021. https://mrislambd.github.io/codepages/titanic/.