Insurance Cost Forecast by using Linear Regression
Project Overview
This predictive modeling project uses personal medical data to predict individual medical insurance charges with a linear regression model.
Dataset
The dataset used in this project was collected from Kaggle.
Columns
age: age of primary beneficiary
sex: gender of the insurance contractor (female or male)
bmi: body mass index, an objective index of body weight that indicates whether weight is relatively high or low for a given height, computed as weight divided by height squared \(\frac{kg}{m^2}\); ideally \(18.5\) to \(24.9\)
children: Number of children covered by health insurance / Number of dependents
smoker: smoking status of the beneficiary
region: the beneficiary’s residential area in the US (northeast, southeast, southwest, or northwest)
charges: Individual medical costs billed by health insurance
Acknowledgements
The dataset is available on GitHub here.
Stakeholders
Can we accurately predict insurance costs?
Key Performance Indicators (KPIs)
All features were considered for modeling. However, exploratory data analysis and mathematical analysis showed that charges usually go up with factors such as increasing age, living in certain regions, and having a certain number of children. This pattern does not always hold, though, and depends on the smoker variable. There is also a strong correlation between the age and bmi variables. As a result, new features, age_bmi and age_bmi_smoker, were created to see how these variables interact in their effect on charges.
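The interaction features described above can be sketched in a few lines. The source does not state how age_bmi and age_bmi_smoker are computed, so this minimal sketch assumes plain products (age times bmi, and that product times a 0/1 smoker indicator). The helper name `add_interactions`, the "yes"/"no" coding of smoker, and the sample records are all hypothetical.

```python
def add_interactions(row):
    """Return a copy of `row` with assumed product-style interaction features."""
    # Assumption: smoker is coded as the strings "yes" / "no"
    smoker_flag = 1 if row["smoker"] == "yes" else 0
    out = dict(row)
    # Assumption: the interaction is a simple product of the raw columns
    out["age_bmi"] = row["age"] * row["bmi"]
    out["age_bmi_smoker"] = out["age_bmi"] * smoker_flag
    return out

# Made-up records for illustration only (not taken from the dataset)
records = [
    {"age": 30, "bmi": 25.0, "smoker": "yes"},
    {"age": 40, "bmi": 30.0, "smoker": "no"},
]
features = [add_interactions(r) for r in records]
```

With this construction the three-way feature is zero for non-smokers, so its coefficient only shifts predictions for smokers.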
Modeling
Modeling Approaches
We consider the following models:
Baseline model: assumes that the charges variable can be modeled with the mean value of this charges variable.
\[ \text{charges}=\mathbb{E}[\text{charges}]+\xi \]
Linear Regression with age-bmi-smoke interaction
\[ \text{charges}=\beta_0+\beta_1 (\text{age\_bmi})+\beta_2 (\text{male})+\beta_3 (\text{smoke})+\beta_4 (\text{children})+\beta_5 (\text{region})+\beta_6 (\text{age-bmi-smoke})+\xi \]
K-Neighbor Regression: \(k\)NN using all the original features with \(k=10\)
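Candidate models like the ones above can be compared by their average mean squared error under k-fold cross-validation. The following is a minimal pure-Python sketch of that idea, not the author's pipeline: it implements only the mean baseline and a \(k\)NN regressor, runs on synthetic data, and all function names are hypothetical. A linear regression could be plugged in as another `fit` function with the same interface.

```python
import random

def mean_baseline(train_X, train_y):
    """Baseline: always predict the mean of the training targets."""
    mu = sum(train_y) / len(train_y)
    return lambda x: mu

def knn_regressor(k):
    """Return a fit function for k-nearest-neighbour regression (squared Euclidean distance)."""
    def fit(train_X, train_y):
        def predict(x):
            dists = sorted(
                (sum((a - b) ** 2 for a, b in zip(x, xi)), yi)
                for xi, yi in zip(train_X, train_y)
            )
            nearest = dists[:k]
            return sum(target for _, target in nearest) / len(nearest)
        return predict
    return fit

def cv_mse(fit, X, y, n_folds=5, seed=0):
    """Average held-out MSE over n_folds shuffled folds."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    scores = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        predict = fit([X[i] for i in train], [y[i] for i in train])
        errors = [(predict(X[i]) - y[i]) ** 2 for i in fold]
        scores.append(sum(errors) / len(errors))
    return sum(scores) / len(scores)

# Synthetic data for illustration only: one feature, linear signal plus noise
rng = random.Random(1)
X = [(rng.uniform(0, 10),) for _ in range(100)]
y = [3.0 * x[0] + rng.gauss(0, 1) for x in X]

baseline_mse = cv_mse(mean_baseline, X, y)
knn_mse = cv_mse(knn_regressor(k=10), X, y)
```

The model with the lowest cross-validated MSE would then be refit on the full training set.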
Final Model
Finally, the model was selected based on the lowest MSE value found from 5-fold cross-validation. The selected model has the following form
\[\begin{align*} \text{charges} &=10621.25+ 3346.14\times \text{Age\_BMI}+4570.76\times \text{Male}+ 479.61\times \text{Smoke}-315.12\times \text{Children}\\ &+13274.48\times \text{Region}-212.22\times \text{Age\_BMI\_Smoke} \end{align*}\]
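Evaluating the fitted equation for one instance amounts to an intercept plus a weighted sum of the features. The sketch below copies the coefficients from the equation above. The source does not describe how the inputs were scaled or encoded (for example, how region becomes a single numeric value), so the function name `predict_charges` and the example input are hypothetical, and the input is assumed to be already transformed the way the model expects.

```python
# Coefficients copied from the fitted equation above
INTERCEPT = 10621.25
COEFS = {
    "age_bmi": 3346.14,
    "male": 4570.76,
    "smoke": 479.61,
    "children": -315.12,
    "region": 13274.48,
    "age_bmi_smoke": -212.22,
}

def predict_charges(x):
    """Evaluate the linear model; `x` maps each feature name to its encoded value."""
    return INTERCEPT + sum(COEFS[name] * x[name] for name in COEFS)

# Hypothetical, already-encoded input (values are illustrative, not real data)
example = {
    "age_bmi": 1.0,
    "male": 1.0,
    "smoke": 0.0,
    "children": 2.0,
    "region": 0.0,
    "age_bmi_smoke": 0.0,
}
predicted = predict_charges(example)
```

When every feature is zero, the prediction reduces to the intercept, which is a quick sanity check on any reimplementation.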
Results and Outcomes
Model Accuracy
The model above returns an RMSE of \(5853.0\) on the training set and an RMSE of \(5600.0\) on the test set with an \(R^2=80\%\).
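For reference, RMSE and \(R^2\) can be computed directly from true and predicted values. This is a minimal sketch with made-up numbers, not the author's evaluation code; the function names are hypothetical.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up values for illustration only
y_true = [1000.0, 2000.0, 3000.0, 4000.0]
y_pred = [1100.0, 1900.0, 3200.0, 3800.0]

error = rmse(y_true, y_pred)
score = r_squared(y_true, y_pred)
```

RMSE is expressed in the same units as charges, while \(R^2\) is unitless and equals zero for a model that always predicts the mean.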
Web Application
The final model was developed and deployed using Streamlit. To try a single instance, fill out the following form and then click predict charges.
Future Directions
A future project on the same data could add a neural network and compare the relative performance of the two models.
Citation
@online{islam2024,
author = {Islam, Rafiq},
title = {Insurance {Cost} {Forecast} by Using {Linear} {Regression}},
date = {2024-08-30},
url = {https://mrislambd.github.io/portfolio/dsp/medicalcost/},
langid = {en}
}