import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from mywebstyle import plot_style
plot_style('#f4f4f4')
titanic = pd.read_csv('titanic_train.csv')
Classification Problem: Predict the chance of survival of a voyager on the Titanic based on the voyager’s information
The Data
Exploratory Data Analysis
Descriptive Statistics
titanic.head()
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
titanic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
Seems like there are some missing data for the Age, Cabin, and Embarked features. To see this with a visualization:
sns.heatmap(titanic.isnull(), yticklabels=False, cbar=False, cmap='viridis')
Approximately \(20\%\) of the Age variable is missing. For the Cabin feature, too many observations are missing. For Embarked, there are only two missing observations. So we need to take extra care of these features in the data cleaning and preparation stage.
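The exact shares back these estimates up; a quick sketch:

# Fraction of missing values per column, largest first
titanic.isnull().mean().sort_values(ascending=False).head()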
Data Visualization
sns.countplot(x='Survived', hue='Sex', data=titanic, palette='RdBu_r')
Looks like most of the passengers who didn’t survive are male.
sns.countplot(x='Survived', hue='Pclass', data=titanic, palette='rainbow')
From this plot we see that class 3 passengers make up the highest proportion of those who didn’t survive, while among the survivors, class 1 passengers have the highest proportion.
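These proportions are easy to confirm numerically; a short sketch using groupby:

# Survival rate broken down by sex and by passenger class
print(titanic.groupby('Sex')['Survived'].mean())
print(titanic.groupby('Pclass')['Survived'].mean())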
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(7.9, 4))
titanic['Age'].hist(bins=35, color='darkred', alpha=0.6, ax=ax1)
ax1.set_xlabel('Age')
ax1.set_title('Age Distribution')
titanic['Fare'].hist(bins=30, color='darkred', alpha=0.6, ax=ax2)
ax2.set_xlabel('Fare')
ax2.set_title('Fare Distribution')
plt.tight_layout()
plt.show()
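The shape of each distribution can also be checked numerically; a quick sketch using pandas’ built-in sample skewness (values near 0 indicate symmetry):

# Skewness of Age and Fare; NaNs in Age are skipped by default
print(titanic['Age'].skew(), titanic['Fare'].skew())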
Seems like Age is almost normally distributed. However, Fare is positively skewed. Next, the other categorical features:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(7.9, 4))
sns.countplot(x='SibSp', data=titanic, ax=ax1)
ax1.set_title('Number of Siblings/Spouse')
sns.countplot(x='Parch', data=titanic, ax=ax2)
ax2.set_title('Number of Parents/Children')
plt.tight_layout()
plt.show()
Data Cleaning and Preparation
Handling Missing Data
Here, the Age feature is continuous and almost normally distributed, so we could simply impute missing values with the mean of the Age variable. However, Age can also be stratified by other categorical features such as Sex, Pclass, SibSp, or Parch, and we can be smarter by imputing within the groups of a feature whose levels are large and clearly separated in Age. In this case, Pclass is the perfect one.
sns.boxplot(
    x='Pclass', y='Age', hue='Pclass',
    data=titanic, palette='winter'
)
So, whenever a passenger is in the 1st class, the mean Age is around 37; for the 2nd and 3rd classes, the mean Ages are about 29 and 24, respectively.
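These cutoffs can be cross-checked against the data; a one-line sketch of the per-class averages that the hard-coded values below approximate:

# Average Age within each passenger class
titanic.groupby('Pclass')['Age'].mean()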
def age_imputation(cols):
    # Fill a missing Age with the typical Age of the passenger's Pclass
    Age = cols['Age']
    Pclass = cols['Pclass']
    if pd.isnull(Age):
        if Pclass == 1:
            return 37
        elif Pclass == 2:
            return 29
        else:
            return 24
    else:
        return Age

titanic.Age = titanic[['Age', 'Pclass']].apply(age_imputation, axis=1)
sns.heatmap(titanic.isnull(), yticklabels=False, cbar=False, cmap='viridis')
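As an aside, the same idea can be written more compactly by letting pandas compute the per-class means instead of hard-coding them; a sketch of this equivalent alternative (not to be run in addition to the function above):

# Fill each missing Age with the mean Age of that passenger's Pclass
titanic['Age'] = titanic['Age'].fillna(
    titanic.groupby('Pclass')['Age'].transform('mean')
)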
Since there are too many missing values in Cabin, we can drop that column entirely, along with the two rows missing the Embarked feature.
titanic.drop('Cabin', axis=1, inplace=True)
titanic.dropna(inplace=True)
sns.heatmap(titanic.isnull(), yticklabels=False, cbar=False, cmap='viridis')
So there is no missing value in any column. Next, we convert the categorical features.
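A programmatic check confirms the same thing; a one-liner sketch:

# Count remaining missing values per column (all should be zero)
titanic.isnull().sum()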
Converting the Categorical Features
titanic['Male'] = pd.get_dummies(titanic.Sex, dtype=int)['male']  # 0/1 indicator for male
emb = pd.get_dummies(titanic['Embarked'], drop_first=True, dtype=int)  # one-hot Q and S
titanic = pd.concat([titanic, emb], axis=1)
titanic.drop(['Sex', 'Embarked', 'Name', 'Ticket'], axis=1, inplace=True)
titanic.head()
| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare | Male | Q | S |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | 22.0 | 1 | 0 | 7.2500 | 1 | 0 | 1 |
| 1 | 2 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 0 | 0 | 0 |
| 2 | 3 | 1 | 3 | 26.0 | 0 | 0 | 7.9250 | 0 | 0 | 1 |
| 3 | 4 | 1 | 1 | 35.0 | 1 | 0 | 53.1000 | 0 | 0 | 1 |
| 4 | 5 | 0 | 3 | 35.0 | 0 | 0 | 8.0500 | 1 | 0 | 1 |
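Note that drop_first=True guards against the dummy-variable trap, where one indicator column is a perfect linear combination of the others. The same encoding can also be done in a single call; a sketch of this alternative (run instead of, not after, the steps above; the resulting column names, such as Sex_male, differ from the manual Male column):

# One-hot encode both categorical columns at once, dropping one level each
titanic = pd.get_dummies(titanic, columns=['Sex', 'Embarked'],
                         drop_first=True, dtype=int)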
Modeling
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

X_train, X_test, y_train, y_test = train_test_split(
    titanic.drop('Survived', axis=1),
    titanic.Survived, test_size=0.30,
    random_state=123
)
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
pred = logreg.predict(X_test)
/opt/hostedtoolcache/Python/3.10.15/x64/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:469: ConvergenceWarning:
lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
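The warning is harmless here but easy to address by following the message itself: scale the features and allow more solver iterations. A sketch using scikit-learn’s Pipeline (the max_iter value is an arbitrary choice, and the resulting scores may shift slightly from those reported below):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize features before fitting so lbfgs converges cleanly
logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
logreg.fit(X_train, y_train)
pred = logreg.predict(X_test)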
Evaluation
print(metrics.classification_report(y_test,pred))
              precision    recall  f1-score   support

           0       0.79      0.89      0.84       161
           1       0.79      0.64      0.71       106

    accuracy                           0.79       267
   macro avg       0.79      0.76      0.77       267
weighted avg       0.79      0.79      0.79       267
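The same predictions can also be summarized as a confusion matrix; a quick sketch:

# Rows are true classes, columns are predicted classes
print(metrics.confusion_matrix(y_test, pred))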
Citation
@online{islam2021,
author = {Islam, Rafiq},
  title = {Classification {Problem:} {Predict} the Chance of Survival of
    a Voyager on {Titanic} Based on the Voyager’s Information},
date = {2021-10-15},
url = {https://mrislambd.github.io/codepages/titanic/},
langid = {en}
}