Insurance Cost Forecast by using Linear Regression
Project Overview
This predictive modeling project uses personal medical data to predict medical insurance charges with a linear regression model.
Dataset
The dataset used in this project was collected from Kaggle.
Columns
age
: age of primary beneficiary
sex
: gender of the insurance contractor (female, male)
bmi
: Body mass index, an objective measure of whether body weight is high or low relative to height, computed as weight divided by height squared (\(\frac{kg}{m^2}\)); ideally \(18.5\) to \(24.9\)
children
: Number of children covered by health insurance / Number of dependents
smoker
: Smoking status of the beneficiary
region
: the beneficiary’s residential area in the US, northeast, southeast, southwest, northwest.
charges
: Individual medical costs billed by health insurance
Acknowledgements
The dataset is available on GitHub here.
Stakeholders
Can we accurately predict insurance costs?
Key Performance Indicators (KPIs)
All the features were considered for modeling. However, exploratory data analysis and mathematical analysis showed that charges usually go up with factors such as increasing age, living in certain regions, and having a certain number of children. This pattern does not always hold, though; it depends on the smoker variable. There is also a strong correlation between the age and bmi variables. As a result, new features such as age_bmi and age_bmi_smoker were created to see how they interact with charges.
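The interaction features described above could be built roughly as follows. This is a minimal sketch: the source does not say how age_bmi and age_bmi_smoker are defined, so the plain products used here, the `smoker_flag` helper column, and the toy rows are all assumptions made for illustration.

```python
import pandas as pd

# Toy rows shaped like the dataset's columns (values are made up).
df = pd.DataFrame(
    {
        "age": [19, 45, 60],
        "bmi": [27.9, 30.0, 25.0],
        "smoker": ["yes", "no", "yes"],
    }
)

# Assumption: encode smoking status as 1 (smoker) / 0 (non-smoker).
df["smoker_flag"] = (df["smoker"] == "yes").astype(int)

# Assumption: the interaction features are plain products of the inputs.
df["age_bmi"] = df["age"] * df["bmi"]
df["age_bmi_smoker"] = df["age_bmi"] * df["smoker_flag"]
```

With a product definition like this, age_bmi_smoker is zero for every non-smoker, so its coefficient only affects predictions for smokers.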
Modeling
Modeling Approaches
We consider the following models:
Baseline model: assumes that the charges variable can be modeled by its mean value.
\[ \text{charges}=\mathbb{E}[\text{charges}]+\xi \]
Linear Regression with age-bmi-smoke interaction
\[ \text{charges}=\beta_0+\beta_1 (\text{age\_bmi})+\beta_2 (\text{male})+\beta_3 (\text{smoke})+\beta_4 (\text{children})+\beta_5 (\text{region})+\beta_6 (\text{age-bmi-smoke})+\xi \]
K-Neighbor Regression: \(k\)NN using all the original features with \(k=10\)
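The three candidates listed above map naturally onto standard estimators. The sketch below assumes scikit-learn (the source does not name the library used) and fits each model on synthetic stand-in data, not on the insurance dataset; the variable names are hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for an encoded feature matrix and the charges target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 5.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

models = {
    "baseline": DummyRegressor(strategy="mean"),  # always predicts mean(y)
    "linear": LinearRegression(),                 # intercept plus weighted sum of features
    "knn": KNeighborsRegressor(n_neighbors=10),   # average target of the 10 nearest rows
}

for model in models.values():
    model.fit(X, y)

# The baseline ignores the features and returns the training mean.
baseline_prediction = models["baseline"].predict(X[:1])[0]
```

The baseline gives a floor for comparison: a model is only useful if it beats predicting the mean charge for everyone.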
Final Model
The final model was selected as the one with the lowest MSE in 5-fold cross-validation. It has the following form:
\[\begin{align*} \text{charges} &=10621.25+ 3346.14\times \text{Age\_BMI}+4570.76\times \text{Male}+ 479.61\times \text{Smoke}-315.12\times \text{Children}\\ &+13274.48\times \text{Region}-212.22\times \text{Age\_BMI\_Smoke} \end{align*}\]
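Selecting a model by its lowest 5-fold cross-validated MSE can be sketched as below. This is an illustration under stated assumptions, not the project's code: it uses NumPy only to make the fold mechanics explicit, runs on synthetic data, and the helper names (`predict_mean`, `predict_ols`, `predict_knn`, `cv_mse`) are hypothetical.

```python
import numpy as np

def predict_mean(X_tr, y_tr, X_te):
    # Baseline: predict the training mean for every test row.
    return np.full(len(X_te), y_tr.mean())

def predict_ols(X_tr, y_tr, X_te):
    # Ordinary least squares with an intercept column.
    A_tr = np.column_stack([np.ones(len(X_tr)), X_tr])
    A_te = np.column_stack([np.ones(len(X_te)), X_te])
    beta, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    return A_te @ beta

def predict_knn(X_tr, y_tr, X_te, k=10):
    # Average the targets of the k nearest training rows (Euclidean distance).
    dists = np.linalg.norm(X_te[:, None, :] - X_tr[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    return y_tr[nearest].mean(axis=1)

def cv_mse(fit_predict, X, y, n_splits=5, seed=0):
    # Shuffle once, split into folds, and average the held-out MSE.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_splits)
    errors = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        pred = fit_predict(X[train_idx], y[train_idx], X[test_idx])
        errors.append(np.mean((y[test_idx] - pred) ** 2))
    return float(np.mean(errors))

# Synthetic data with a linear signal (not the insurance dataset).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

candidates = {"baseline": predict_mean, "linear": predict_ols, "knn": predict_knn}
cv_scores = {name: cv_mse(fn, X, y) for name, fn in candidates.items()}
best_name = min(cv_scores, key=cv_scores.get)  # model with the lowest CV MSE
```

Every candidate is scored on the same folds (same seed), so the comparison reflects the models and not the split. scikit-learn's `cross_val_score` with `scoring="neg_mean_squared_error"` offers the same procedure in library form.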
Results and Outcomes
Model Accuracy
The model above returns an RMSE of \(5853.0\) on the training set and an RMSE of \(5600.0\) on the test set, with \(R^2=80\%\).
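For reference, the two metrics reported above can be computed from predictions as in the sketch below. The function names and the toy numbers are illustrative assumptions; they are not the project's data or results.

```python
import math

def rmse(y_true, y_pred):
    # Root mean squared error: square root of the average squared residual.
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def r_squared(y_true, y_pred):
    # 1 - (residual sum of squares / total sum of squares around the mean).
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Toy values for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 6.0]

toy_rmse = rmse(y_true, y_pred)
toy_r2 = r_squared(y_true, y_pred)
```

RMSE is in the same units as charges, so it reads as a typical prediction error in dollars, while \(R^2\) is the share of variance in charges that the model explains.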
Web Application
The final model was developed and deployed using Streamlit. To try a single instance, fill out the following form and then click predict charges.
Future Directions
A future project on the same data could add a neural network and compare its performance with that of the linear regression model.
Citation
@online{islam2024,
author = {Islam, Rafiq},
title = {Insurance {Cost} {Forecast} by Using {Linear} {Regression}},
date = {2024-08-30},
url = {https://mrislambd.github.io/portfolio/dsp/medicalcost/},
langid = {en}
}