Lendingclub’s loan default prediction

Author

Rafiq Islam

Published

February 23, 2023

Load the data and primary library

Data

kaggle.api.authenticate()
kaggle.api.dataset_download_files(
    'jeandedieunyandwi/lending-club-dataset', path='.', unzip=True
)

Primary Libraries

from mywebstyle import plot_style
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
import kaggle
plot_style('#f4f4f4')
loandata = pd.read_csv('lending_club_loan_two.csv')
first_13_cols = loandata.iloc[:, :13]
first_13_cols.head()

	loan_amnt	term	int_rate	installment	grade	sub_grade	emp_title	emp_length	home_ownership	annual_inc	verification_status	issue_d	loan_status
0	10000	36 months	11.44	329.48	B	B4	Marketing	10+ years	RENT	117000.0	Not Verified	Jan-15	Fully Paid
1	8000	36 months	11.99	265.68	B	B5	Credit analyst	4 years	MORTGAGE	65000.0	Not Verified	Jan-15	Fully Paid
2	15600	36 months	10.49	506.97	B	B3	Statistician	< 1 year	RENT	43057.0	Source Verified	Jan-15	Fully Paid
3	7200	36 months	6.49	220.65	A	A2	Client Advocate	6 years	RENT	54000.0	Not Verified	Nov-14	Fully Paid
4	24375	60 months	17.27	609.33	C	C5	Destiny Management Inc.	9 years	MORTGAGE	55000.0	Verified	Apr-13	Charged Off

second_13_cols = loandata.iloc[:, 13:]
second_13_cols.head()

	purpose	title	dti	earliest_cr_line	open_acc	revol_bal	revol_util	total_acc	initial_list_status	application_type	mort_acc	address
0	vacation	Vacation	26.24	Jun-90	16	36369	41.8	25	w	INDIVIDUAL	0.0	0174 Michelle Gateway\nMendozaberg, OK 22690
1	debt_consolidation	Debt consolidation	22.05	Jul-04	17	20131	53.3	27	f	INDIVIDUAL	3.0	1076 Carney Fort Apt. 347\nLoganmouth, SD 05113
2	credit_card	Credit card refinancing	12.79	Aug-07	13	11987	92.2	26	f	INDIVIDUAL	0.0	87025 Mark Dale Apt. 269\nNew Sabrina, WV 05113
3	credit_card	Credit card refinancing	2.60	Sep-06	6	5472	21.5	13	f	INDIVIDUAL	0.0	823 Reid Ford\nDelacruzside, MA 00813
4	credit_card	Credit Card Refinance	33.95	Mar-99	13	24584	69.8	43	f	INDIVIDUAL	1.0	679 Luna Roads\nGreggshire, VA 11650

Exploratory Data Analysis

Descriptive Statistics

loandata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 395900 entries, 0 to 395899
Data columns (total 27 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   loan_amnt             395900 non-null  int64  
 1   term                  395900 non-null  object 
 2   int_rate              395900 non-null  float64
 3   installment           395900 non-null  float64
 4   grade                 395900 non-null  object 
 5   sub_grade             395900 non-null  object 
 6   emp_title             372982 non-null  object 
 7   emp_length            377608 non-null  object 
 8   home_ownership        395900 non-null  object 
 9   annual_inc            395900 non-null  float64
 10  verification_status   395900 non-null  object 
 11  issue_d               395900 non-null  object 
 12  loan_status           395900 non-null  object 
 13  purpose               395900 non-null  object 
 14  title                 394145 non-null  object 
 15  dti                   395900 non-null  float64
 16  earliest_cr_line      395900 non-null  object 
 17  open_acc              395900 non-null  int64  
 18  pub_rec               395900 non-null  int64  
 19  revol_bal             395900 non-null  int64  
 20  revol_util            395624 non-null  float64
 21  total_acc             395900 non-null  int64  
 22  initial_list_status   395900 non-null  object 
 23  application_type      395900 non-null  object 
 24  mort_acc              358117 non-null  float64
 25  pub_rec_bankruptcies  395365 non-null  float64
 26  address               395900 non-null  object 
dtypes: float64(7), int64(5), object(15)
memory usage: 81.6+ MB

sns.heatmap(loandata.isnull(), yticklabels=False, cbar=False, cmap='viridis')

Seems like there are some missing data in the emp_length, emp_title, and mort_acc columns. Around 4.62$\%$ missing data in emp_length column, 5.79$\%$ missing data in emp_title column, and 9.54$\%$ in the mort_acc coulumn. Also, there is very small proportion, 0.07$\%$ of missing data in the revol_util column.

We need to take care of this missing observations. We may drop the missing values if it is insignificantly missing or doesn’t seem to be very strongly related to the predictive/dependent variable.

loandata.describe().T

	count	mean	std	min	25%	50%	75%	max
loan_amnt	395900.0	14114.249305	8357.637338	500.00	8000.00	12000.00	20000.00	40000.00
int_rate	395900.0	13.639385	4.472112	5.32	10.49	13.33	16.49	30.99
installment	395900.0	431.859947	250.733444	16.08	250.33	375.43	567.30	1533.81
annual_inc	395900.0	74206.819251	61645.032777	0.00	45000.00	64000.00	90000.00	8706582.00
dti	395900.0	17.379187	18.021550	0.00	11.28	16.91	22.98	9999.00
open_acc	395900.0	11.311081	5.137591	0.00	8.00	10.00	14.00	90.00
pub_rec	395900.0	0.178204	0.530716	0.00	0.00	0.00	0.00	86.00
revol_bal	395900.0	15844.331435	20589.846553	0.00	6026.00	11181.00	19620.00	1743266.00
revol_util	395624.0	53.793449	24.452575	0.00	35.80	54.80	72.90	892.30
total_acc	395900.0	25.414622	11.887279	2.00	17.00	24.00	32.00	151.00
mort_acc	358117.0	1.814091	2.148006	0.00	0.00	1.00	3.00	34.00
pub_rec_bankruptcies	395365.0	0.121647	0.356176	0.00	0.00	0.00	0.00	8.00

Data Visualization

Loan Status

First, let’s look at the class balance in the dependent variable

sns.countplot(x='loan_status', data=loandata)
fp = np.round(
    len(loandata[loandata['loan_status'] == 'Fully Paid'])/len(loandata)*100, 2
)

co = np.round(
    len(loandata[loandata['loan_status'] == 'Charged Off']) /
    len(loandata)*100, 2
)

So, 80.39$\%$ are Fully Paid category where as 19.61$\%$ is labeled as Charged Off or defaulter.

Insights: With 80.39 % labeled as “Fully Paid” and 19.61 % as “Charged Off,” this dataset qualifies as moderately imbalanced. The minority class (“Charged Off”) is well underrepresented, but not severely so. This level of imbalance can still impact model performance, especially if the model is biased towards predicting the majority class . Techniques like resampling(oversampling the minority class or undersampling the majority), using metrics like F1-score or AUC-ROC, or even experimenting with algorithms specifically designed for imbalanced datasets(such as XGBoost or balanced random forests) could be effective ways to address this imbalance in the predictive modeling process.

Loan Amount and Installment Interaction With Loan Status

Next, let’s see how the features interacts with the dependent variable

fig = plt.figure(figsize=(9, 4))
ax1 = fig.add_subplot(121)
loandata[loandata['loan_status'] == 'Charged Off']['loan_amnt'].hist(
    alpha=0.5, color='red', bins=30, ax=ax1,
    label='Charged Off'
)
loandata[loandata['loan_status'] == 'Fully Paid']['loan_amnt'].hist(
    alpha=0.5, color='blue', bins=30, ax=ax1,
    label='Fully Paid'
)
ax1.set_title('Loan Amount Distribution')
ax1.set_xlabel('Loan Amount')
ax1.legend()

ax2 = fig.add_subplot(122)
loandata[loandata['loan_status'] == 'Charged Off']['installment'].hist(
    alpha=0.5, color='red', bins=30, ax=ax2,
    label='Charged Off'
)
loandata[loandata['loan_status'] == 'Fully Paid']['installment'].hist(
    alpha=0.5, color='blue', bins=30, ax=ax2,
    label='Fully Paid'
)
ax2.set_title('Installment Distribution')
ax2.set_xlabel('Installment')
ax2.legend()

Both the loan_amnt and installment are slightly positively skewed. How about their mean, upper quartile, lower quartile based on loan_status?

fig = plt.figure(figsize=(8.8, 4))
ax1 = fig.add_subplot(121)
sns.boxplot(
    x='loan_status', y='loan_amnt', hue='loan_status',
    data=loandata, ax=ax1, palette='winter'
)
ax1.set_title('Loan Amount Boxplot')

ax2 = fig.add_subplot(122)
sns.boxplot(
    x='loan_status', y='installment', hue='loan_status',
    data=loandata, ax=ax2, palette='winter'
)
ax2.set_title('Installment Boxplot')

Text(0.5, 1.0, 'Installment Boxplot')

Insights: From the above plot we see that mean loan amount of the fully paid and charged off categories are $\$12, 500$ and $\$14, 000$, respectively.

Term, Grade, and Sub-grade Interaction With Loan Status

fig = plt.figure(figsize=(8.8, 4))
ax1 = fig.add_subplot(121)
sns.countplot(
    x='loan_status',
    hue='term', data=loandata,
    palette='RdBu_r', ax=ax1
)
ax2 = fig.add_subplot(122)
sns.countplot(
    x='loan_status',
    hue='grade', data=loandata,
    palette='winter', ax=ax2
)

fig, ax = plt.subplots(figsize=(10, 5))
sns.histplot(x='sub_grade', hue='loan_status', data=loandata, ax=ax)

Employment Title and Employment Length

loandata['emp_title'] = loandata['emp_title'].str.lower()
loandata.emp_title.value_counts()[:25]

emp_title
manager                     5635
teacher                     5426
registered nurse            2626
supervisor                  2589
sales                       2381
driver                      2306
owner                       2200
rn                          2072
project manager             1776
office manager              1638
general manager             1460
truck driver                1288
director                    1192
engineer                    1187
police officer              1041
vice president               961
sales manager                961
operations manager           960
store manager                941
president                    877
administrative assistant     865
accountant                   845
account manager              845
technician                   839
mechanic                     753
Name: count, dtype: int64

Let’s work with the employment length column

loandata['emp_length'] = loandata['emp_length'].replace({
    '< 1 year': 0,
    '1 year': 1,
    '2 years': 2,
    '3 years': 3,
    '4 years': 4,
    '5 years': 5,
    '6 years': 6,
    '7 years': 7,
    '8 years': 8,
    '9 years': 9,
    '10+ years': 10
}
).infer_objects(copy=False)

loandata['emp_length_group'] = pd.cut(
    loandata['emp_length'],
    bins=[-1, 2, 7, 10],  # Bins: <3 years, 3-7 years, > 7 years
    labels=['Short-term', 'Mid-term', 'Long-term']
)

sns.countplot(
    x='emp_length_group',
    hue='loan_status',
    data=loandata,
    palette='winter',
    stat='count'
)

/tmp/ipykernel_3062/3832649202.py:1: FutureWarning:

Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`

Insights: So, from this plot we can see a trend. As employment length goes up, the chances of fully paid gets higher, but the charged of situation remains almost same, irrespective of the employment length. Therefore, it may not be very useful when we may think of data imputation for the missing values.

Home Ownership and Annual Income

fig = plt.figure(figsize=(8, 10))
ax1 = fig.add_subplot(211)
sns.boxplot(
    x='loan_status', y='annual_inc',
    hue='loan_status', palette='winter',
    data=loandata, ax=ax1
)
ax1.set_title('Income Distribution')
ax1.set_xlabel('Loan Status')
ax1.set_ylabel('Annual Income')

ax2 = fig.add_subplot(212)
sns.countplot(
    x='home_ownership', hue='loan_status',
    data=loandata, ax=ax2
)

Insights: From the Annual Income column, there is not enough insights based on the plot. But for the Home Ownership plot shows that, if the house is owned, it’s less likely to be charged off.

Issue Date and Verification Status

loandata['issue_d'] = pd.to_datetime(
    loandata['issue_d'], format='%b-%y'
)
loandata = loandata.sort_values('issue_d')
loan_status_trend = loandata.groupby(
    ['issue_d', 'loan_status']).size().unstack()

fig = plt.figure(figsize=(8, 10))
ax1 = fig.add_subplot(211)
loan_status_trend.plot(
    kind='line', marker='o', ax=ax1
)
ax1.set_title('Loan Status Over Time by Issue Date')
ax1.set_xlabel('Issue Date(mm-yyyy)')
ax1.set_ylabel('Number of Loans')
ax1.legend(title='Loan Status')

ax2 = fig.add_subplot(212)
sns.countplot(
    x='verification_status', hue='loan_status',
    data=loandata, palette='winter', ax=ax2
)

Insights: From issue date plot we see that most of loans that were marked as charged off happened during the year 2012 to 2016, with the pick in in 2015. For the verification status, we can see the less charged off incidents when the source was not varified.

Purpose and Debt-to-Income Ratio

loandata['purpose'].value_counts()
fig = plt.figure(figsize=(7.9, 4))
ax1 = fig.add_subplot(121)
sns.countplot(
    y='purpose', hue='loan_status',
    data=loandata, palette='coolwarm'
)
ax1.set_title('Loan Purpose')

ax2 = fig.add_subplot(122)
dti_threshold = loandata['dti'].quantile(0.95)
filtereddata = loandata[loandata['dti'] <= dti_threshold]

sns.boxplot(
    x='loan_status', y='dti',
    hue='loan_status', data=filtereddata,
    palette='coolwarm', ax=ax2
)
ax2.set_title('Debt-to-Income Ratio on Loan Status')

Text(0.5, 1.0, 'Debt-to-Income Ratio on Loan Status')

Insights: From the purpose column, we see that most of the loans that were charged off were used to make debt consolidation. Therefore, debt consolidation may have been a significant factor when a loan is charged off. Another insight we obtain from the debt-to-income ratio is that the charged off loans have higher dti ratio

Number of Credit Accounts and Number of Public Records

fig = plt.figure(figsize=(9, 4))
ax1 = fig.add_subplot(121)
filtered_open_account_threshold = loandata['open_acc'].quantile(0.98)
filtered_open_account = loandata[loandata['open_acc']
                                 <= filtered_open_account_threshold]
filtered_open_account[filtered_open_account['loan_status'] == 'Fully Paid']['open_acc'].hist(
    alpha=0.5, color='green', bins=30,label='Fully Paid' ,ax=ax1
)
filtered_open_account[filtered_open_account['loan_status'] == 'Charged Off']['open_acc'].hist(
    alpha=0.5, color='red', bins=30,label='Charged Off' ,ax=ax1
)
ax1.set_xlabel('Number of opened credit acc.')
ax1.legend()
ax1.set_title('Number of open credit accounts and Loan Status')

loandata['pub_rec_group'] = pd.cut(
    loandata['pub_rec'], bins=[-1, 0, 1, 3, loandata['pub_rec'].max()],
    labels=['0', '1', '2-3', '4+']
)
loan_status_by_pub_rec = loandata.groupby(
    ['pub_rec_group', 'loan_status'], observed=False
).size().unstack()

ax2 = fig.add_subplot(122)
loan_status_by_pub_rec.plot(
    kind='bar', stacked=False, edgecolor='black',
    color=['#1f77b4', '#ff7f0e'], ax=ax2
)
ax2.set_title('Public Records and Loan Status')
ax2.set_xlabel('Public Records')
ax2.set_xticklabels(ax2.get_xticklabels(), rotation=0)

sns.boxplot(
    y='open_acc', x='loan_status',
    hue='loan_status',
    data=filtered_open_account
)

Insights: Number of credit account seems normally distributed among both groups except for some outliers. However, the mean number of credit accounts are slightly higher for the charged off category than the fully paid category. So, higher credit account has some sort of relation with loan being charged off. Also, people who doesn’t have any public record seems to have higher chance of loan status being charged off

Revolving Balance and Utilization

fig = plt.figure(figsize=(9,4))
ax1 = fig.add_subplot(121)
revol_bal_threshold = loandata['revol_bal'].quantile(0.95)
filtered_revol_bal = loandata[loandata['revol_bal']<=revol_bal_threshold]
sns.boxplot(
    x='loan_status', y='revol_bal',
    data=filtered_revol_bal, hue='loan_status',
    palette='Set2',ax=ax1
)
ax1.set_xlabel('Loan Status')
ax1.set_ylabel('Revolving Balance')
ax1.set_title('Revolving Balance vs Loan Status')

ax2 = fig.add_subplot(122)
revol_util_threshold = loandata['revol_util'].quantile(0.95)
filtered_revol_util = loandata[loandata['revol_util']<=revol_util_threshold]
sns.boxplot(
    x='loan_status', y='revol_util',
    hue='loan_status', data=filtered_revol_util,
    palette='Set1',ax=ax2
)
ax2.set_xlabel('Loan Status')
ax2.set_ylabel('Revolving Utilization')
ax2.set_title('Revolving Utilization vs Loan Status')

Text(0.5, 1.0, 'Revolving Utilization vs Loan Status')

Insights: From the above plots, we don’t see much difference between fully paid or charged off status regarding the revolving balance, however, for revolving utilization, it’s higher for the charged off category.

Total Account and Initial List Status

fig = plt.figure(figsize=(9, 4))
ax1 = fig.add_subplot(121)
filtered_total_account_threshold = loandata['total_acc'].quantile(0.95)
filtered_total_account = loandata[loandata['total_acc']
                                 <= filtered_total_account_threshold]
filtered_total_account[filtered_total_account['loan_status'] == 'Fully Paid']['total_acc'].hist(
    alpha=0.5, color='green', bins=30,label='Fully Paid' ,ax=ax1
)
filtered_total_account[filtered_total_account['loan_status'] == 'Charged Off']['total_acc'].hist(
    alpha=0.5, color='red', bins=30,label='Charged Off' ,ax=ax1
)
ax1.set_xlabel('Number of Total Credit Acc.')
ax1.legend()
ax1.set_title('Number of Total credit accounts and Loan Status')


ax2 = fig.add_subplot(122)
sns.countplot(
    x='initial_list_status',hue='loan_status',
    data=loandata, palette='winter'
)
ax2.set_title('Initial Status and Loan Status')
ax2.set_xlabel('Initial Status (Funded or Withdrawn)')

Text(0.5, 0, 'Initial Status (Funded or Withdrawn)')

next,

sns.boxplot(
    y='total_acc', x='loan_status',
    hue='loan_status',
    data=filtered_total_account
)

Insights: The number of total credit account seems to have no impact on loan status based on the plottings above. Similarly, Initial Loan Status also doesn’t give much information.

Number of Mortgage Account and Number of Public Record of Bankruptcies

fig = plt.figure(figsize=(9,4))

ax1 = fig.add_subplot(121)

loandata['mort_acc_group'] = pd.cut(
    loandata['mort_acc'], bins=[-1, 0, 2, 5, 10, loandata['mort_acc'].max()],
    labels=['0', '1-2', '3-5', '6-10', '10+']
)

mort_acc_counts = loandata.groupby(
    ['mort_acc_group', 'loan_status'],observed=False
).size().unstack()

mort_acc_counts.plot(
    kind='bar', stacked=False, 
    colormap='viridis', ax=ax1
)
ax1.set_title('Loan Status by the # of Mort. Acc')
ax1.set_xlabel('Number of Mortgage Accounts (Grouped)')
ax1.set_ylabel('Number of Loans')
ax1.set_xticklabels(ax1.get_xticklabels(), rotation=0)
ax1.legend(title='Loan Status')

ax2 = fig.add_subplot(122)

loandata['pub_rec_bankruptcies_group'] = pd.cut(
    loandata['pub_rec_bankruptcies'], bins=[-1, 0, loandata['pub_rec_bankruptcies'].max()],
    labels=['0', '1 or More']
)

# Plot the grouped bar chart
pub_rec_bankruptcies_counts = loandata.groupby(
    ['pub_rec_bankruptcies_group', 'loan_status'],observed=False
).size().unstack()

pub_rec_bankruptcies_counts.plot(
    kind='bar', stacked=False,
    colormap='coolwarm', ax=ax2
)
ax2.set_title('Loan Status by the # of Pub Rec of Bankruptcies')
ax2.set_xlabel('Public Record of Bankruptcies (Grouped)')
ax2.set_ylabel('Number of Loans')
ax2.set_xticklabels(ax2.get_xticklabels(), rotation=0)
ax2.legend(title='Loan Status')
plt.show()

Insigths: Again, loans where the applicants has less number of mortgage acc or public record of bankruptcies are more likely to be tagged as charged off.

Application Type

sns.countplot(
    x='loan_status', hue= 'application_type', 
    data=loandata, palette='winter'
)

We see only the individual type applications have impact on loans being charged off.

Now that we have analyzed all the features, how about the correlations among the features?

numeric_loandata = loandata.select_dtypes(include=['float64','int64'])
plt.figure(figsize=(10,8))
sns.heatmap(numeric_loandata.corr(), annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.show()

Conclusion

Citation

BibTeX citation:

@online{islam2023,
  author = {Islam, Rafiq},
  title = {Lendingclub’s Loan Default Prediction},
  date = {2023-02-23},
  url = {https://mrislambd.github.io/codepages/lendingclub/},
  langid = {en}
}

For attribution, please cite this work as:

Islam, Rafiq. 2023. “Lendingclub’s Loan Default Prediction.” February 23, 2023. https://mrislambd.github.io/codepages/lendingclub/.