import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns 
from mywebstyle import plot_style
plot_style('#f4f4f4')
titanic = pd.read_csv('titanic_train.csv')
Classification Problem: Predict the chance of survival of a voyager on the Titanic based on the voyager’s information
The Data
Exploratory Data Analysis
Descriptive Statistics
titanic.head()
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | 
titanic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
It seems there are some missing values in the Age, Cabin, and Embarked features. We can check this with a quick visualization, as sketched below.
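One way to do that is to look at the per-column missing fractions together with a heatmap of nulls (a sketch in the spirit of the heatmaps used later in this post; the original figure code is not shown here):
titanic.isnull().mean().sort_values(ascending=False)  # fraction of missing values per column
sns.heatmap(titanic.isnull(), yticklabels=False, cbar=False, cmap='viridis')  # NaN cells show up as bright bars
plt.show()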
Approximately \(20\%\) of the Age values are missing. For the Cabin feature, far too many observations are missing, while Embarked has only two missing observations. So, we need to take extra care of these features in the data cleaning and preparation stage.
Data Visualization
It looks like most of the passengers who did not survive were male.
From this plot we see that third-class passengers have the highest proportion of non-survivors, while among the survivors, first-class passengers make up the largest share.
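Both figures correspond roughly to count plots along the following lines (a sketch, since the original plotting code is not reproduced in this section):
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(7.9, 4))
sns.countplot(x='Survived', hue='Sex', data=titanic, ax=ax1)     # survival counts split by sex
sns.countplot(x='Survived', hue='Pclass', data=titanic, ax=ax2)  # survival counts split by passenger class
plt.tight_layout()
plt.show()
Turning to the continuous features, Age and Fare: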
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(7.9, 4))
titanic['Age'].hist(bins=35, color='darkred', alpha=0.6, ax=ax1)
ax1.set_xlabel('Age')
ax1.set_title('Age Distribution')
titanic['Fare'].hist(bins=30, color='darkred', alpha=0.6, ax=ax2)
ax2.set_xlabel('Fare')
ax2.set_title('Fare Distribution')
plt.tight_layout()
plt.show()
It seems that Age is approximately normally distributed, while Fare is positively skewed. The remaining categorical features can be explored in the same way.
Data Cleaning and Preparation
Handling Missing Data
Here, the Age feature is continuous and approximately normally distributed, so we could impute the missing values with the overall mean of Age. However, Age also varies with other categorical features such as Sex, Pclass, SibSp, or Parch, so we can be smarter by conditioning on a categorical feature that splits the data into large, reasonably homogeneous groups. In this case, Pclass is the natural choice.
So, whenever a passenger is in the 1st class, the mean Age is around 37, and for the 2nd and 3rd classes the mean Ages are about 29 and 24, respectively.
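These class-wise figures can be checked directly with a group-by (a quick sketch; the values 37, 29, and 24 used below are rounded from this kind of summary):
titanic.groupby('Pclass')['Age'].agg(['mean', 'median'])  # typical age per passenger class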
def age_imputation(cols):
    # Fill a missing Age with a class-wise typical age; otherwise keep the observed Age
    age = cols['Age']
    pclass = cols['Pclass']
    if pd.isnull(age):
        if pclass == 1:
            return 37
        elif pclass == 2:
            return 29
        else:
            return 24
    return age
titanic['Age'] = titanic[['Age', 'Pclass']].apply(age_imputation, axis=1)
sns.heatmap(titanic.isnull(), yticklabels=False, cbar=False, cmap='viridis')
Since Cabin has too many missing values, we drop that column entirely, along with the two rows with missing Embarked values.
titanic.drop('Cabin', axis=1, inplace=True)
titanic.dropna(inplace=True)
sns.heatmap(titanic.isnull(), yticklabels=False, cbar=False, cmap='viridis')
Now there is no missing value in any column. Next, we convert the categorical features.
Converting the Categorical Features
titanic['Male'] = pd.get_dummies(titanic.Sex, dtype=int)['male']  # 1 if male, 0 if female
emb = pd.get_dummies(titanic['Embarked'], drop_first=True, dtype=int)  # indicator columns Q and S (C is the baseline)
titanic = pd.concat([titanic, emb], axis=1)
titanic.drop(['Sex', 'Embarked', 'Name', 'Ticket'], axis=1, inplace=True)  # drop the original text columns
titanic.head()
| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare | Male | Q | S |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | 22.0 | 1 | 0 | 7.2500 | 1 | 0 | 1 | 
| 1 | 2 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 0 | 0 | 0 | 
| 2 | 3 | 1 | 3 | 26.0 | 0 | 0 | 7.9250 | 0 | 0 | 1 | 
| 3 | 4 | 1 | 1 | 35.0 | 1 | 0 | 53.1000 | 0 | 0 | 1 | 
| 4 | 5 | 0 | 3 | 35.0 | 0 | 0 | 8.0500 | 1 | 0 | 1 | 
Modeling
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
X_train, X_test, y_train, y_test = train_test_split(
    titanic.drop('Survived', axis=1),
    titanic.Survived, test_size=0.30,
    random_state=123
    )
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
pred = logreg.predict(X_test)
/opt/hostedtoolcache/Python/3.10.18/x64/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:469: ConvergenceWarning:
lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
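The warning itself points to the fix: scale the features and/or increase max_iter. A minimal way to do both, sketched here with a scikit-learn pipeline (not part of the original workflow), would be:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
# Standardize the features and give the lbfgs solver more iterations
logreg_scaled = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
logreg_scaled.fit(X_train, y_train)
pred_scaled = logreg_scaled.predict(X_test)
With scaled inputs, lbfgs typically converges well within the iteration limit, and the rest of the workflow is unchanged.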
Evaluation
print(metrics.classification_report(y_test, pred))
              precision    recall  f1-score   support
           0       0.79      0.89      0.84       161
           1       0.79      0.64      0.71       106
    accuracy                           0.79       267
   macro avg       0.79      0.76      0.77       267
weighted avg       0.79      0.79      0.79       267
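For a complementary view of where the errors occur, the confusion matrix from the same metrics module can be printed alongside the report:
print(metrics.confusion_matrix(y_test, pred))  # rows are actual classes (0, 1); columns are predicted classes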
Citation
@online{islam2021,
  author = {Islam, Rafiq},
  title = {Classification {Problem:} {Predict} the Chance of Survival of
    a Voyager on {Titanic} Based on the Voyager’s Information},
  date = {2021-10-15},
  url = {https://mrislambd.github.io/codepages/titanic/},
  langid = {en}
}







