Insurance Cost Forecast by using Linear Regression

Author

Rafiq Islam

Published

August 30, 2024


Project Overview

This predictive modeling project uses personal medical data to predict medical insurance charges with a linear regression model.

Dataset

The dataset used in this project was collected from Kaggle.

Columns

age: age of primary beneficiary

sex: gender of the insurance contractor (female, male)

bmi: body mass index, an objective measure of whether body weight is high or low relative to height, computed as weight divided by height squared (\(\frac{kg}{m^2}\)); ideally \(18.5\) to \(24.9\)

children: Number of children covered by health insurance / Number of dependents

smoker: smoking status of the beneficiary

region: the beneficiary’s residential area in the US (northeast, southeast, southwest, northwest)

charges: Individual medical costs billed by health insurance

Acknowledgements

The dataset is available on GitHub here.

Stakeholders

Can we accurately predict insurance costs?

Key Performance Indicators (KPIs)

All features were considered for modeling. The exploratory data analysis and mathematical analysis showed that charges generally go up with increasing age, with living in certain regions, and with having certain numbers of children. However, this pattern is not uniform: it changes depending on the smoker variable. There is also a strong correlation between the age and bmi variables. As a result, new features such as age_bmi and age_bmi_smoker were created to capture how these variables interact in their effect on charges.
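As a rough illustration of what such interaction features might look like, here is a minimal sketch. It assumes the interactions are plain products of the columns, which the write-up does not state explicitly; the helper name `add_interactions` and the sample values are hypothetical, and any scaling applied in the actual notebook is not reproduced here.

```python
import numpy as np

def add_interactions(age, bmi, smoker):
    """Build interaction features from raw columns.

    Assumption: age_bmi = age * bmi and age_bmi_smoker = age * bmi * smoke,
    where smoke is 1.0 for smokers and 0.0 otherwise.
    """
    age = np.asarray(age, dtype=float)
    bmi = np.asarray(bmi, dtype=float)
    # Encode the yes/no smoker column as a 0/1 indicator.
    smoke = (np.asarray(smoker) == "yes").astype(float)
    age_bmi = age * bmi
    age_bmi_smoker = age_bmi * smoke
    return age_bmi, age_bmi_smoker

# Made-up sample rows, not taken from the dataset.
age_bmi, age_bmi_smoker = add_interactions(
    age=[19, 33, 45],
    bmi=[27.9, 22.7, 30.1],
    smoker=["yes", "no", "yes"],
)
print(age_bmi)
print(age_bmi_smoker)
```

With a product-style interaction, the age_bmi_smoker column is zero for every non-smoker, so its coefficient only shifts the predictions for smokers.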

Modeling

Modeling Approaches

We consider the following models:

  1. Baseline model: assumes that the charges variable can be modeled by its mean value plus noise.
    \[ \text{charges}=\mathbb{E}[\text{charges}]+\xi \]

  2. Linear Regression with age-bmi-smoke interaction
    \[ \text{charges}=\beta_0+\beta_1 (\text{age\_bmi})+\beta_2 (\text{male})+\beta_3 (\text{smoke})+\beta_4 (\text{children})+\beta_5 (\text{region})+\beta_6 (\text{age\_bmi\_smoke})+\xi \]

  3. k-Nearest Neighbors Regression
    \(k\)NN using all the original features with \(k=10\)
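The comparison of these three candidates by cross-validated MSE can be sketched as below. This is an illustrative NumPy-only version, not the project's notebook code: the helper names are hypothetical, the data is a synthetic stand-in rather than the insurance dataset, and the \(k\)NN uses plain Euclidean distance on unscaled features.

```python
import numpy as np

def kfold_indices(n, n_splits=5, seed=0):
    """Yield (train_idx, test_idx) pairs for shuffled k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), n_splits)
    for i in range(n_splits):
        train_idx = np.concatenate([folds[j] for j in range(n_splits) if j != i])
        yield train_idx, folds[i]

def predict_baseline(X_train, y_train, X_test):
    # Model 1: predict the training mean for every test row.
    return np.full(len(X_test), y_train.mean())

def predict_linear(X_train, y_train, X_test):
    # Model 2: ordinary least squares with an intercept column.
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    beta, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return A_test @ beta

def predict_knn(X_train, y_train, X_test, k=10):
    # Model 3: average the targets of the k nearest training rows.
    dists = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

def cv_mse(predict, X, y, n_splits=5):
    """Average the per-fold test MSE of one model."""
    scores = []
    for train_idx, test_idx in kfold_indices(len(y), n_splits):
        y_hat = predict(X[train_idx], y[train_idx], X[test_idx])
        scores.append(np.mean((y[test_idx] - y_hat) ** 2))
    return float(np.mean(scores))

# Synthetic stand-in data (NOT the insurance data): linear signal plus noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 5.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)

results = {
    "baseline": cv_mse(predict_baseline, X, y),
    "linear": cv_mse(predict_linear, X, y),
    "knn": cv_mse(predict_knn, X, y),
}
best = min(results, key=results.get)  # model with the lowest CV MSE
print(results, best)
```

Because every model is scored on the same five folds, the MSE values are directly comparable, and the model with the smallest value is the one carried forward.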

Final Model

The final model was chosen as the one with the lowest MSE in 5-fold cross-validation. It has the following form:

\[\begin{align*} \text{charges} &=10621.25+ 3346.14\times \text{Age\_BMI}+4570.76\times \text{Male}+ 479.61\times \text{Smoke}-315.12\times \text{Children}\\ &+13274.48\times \text{Region}-212.22\times \text{Age\_BMI\_Smoke} \end{align*}\]
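To make the fitted equation concrete, the sketch below evaluates it for a single input. The coefficients are copied from the equation above; everything else is an assumption. In particular, this write-up does not say how the features are encoded or scaled, so the function expects values that are already in whatever encoding the model was trained on, and the name `predict_charges` is hypothetical.

```python
# Coefficients copied from the final-model equation.
COEFFICIENTS = {
    "intercept": 10621.25,
    "age_bmi": 3346.14,
    "male": 4570.76,
    "smoke": 479.61,
    "children": -315.12,
    "region": 13274.48,
    "age_bmi_smoke": -212.22,
}

def predict_charges(features):
    """Evaluate the linear model for one row of already-encoded features.

    `features` maps each feature name to its numeric value. The encoding
    and scaling of those values is an assumption left to the caller.
    """
    total = COEFFICIENTS["intercept"]
    for name, coef in COEFFICIENTS.items():
        if name != "intercept":
            total += coef * features[name]
    return total

# With every feature at zero the prediction is just the intercept.
zero_row = {name: 0.0 for name in COEFFICIENTS if name != "intercept"}
print(predict_charges(zero_row))

# Raising age_bmi by one unit (others unchanged) adds its coefficient.
one_step = dict(zero_row, age_bmi=1.0)
print(predict_charges(one_step) - predict_charges(zero_row))
```

Reading the model this way also shows how to interpret each coefficient: it is the change in predicted charges for a one-unit change in that encoded feature, holding the others fixed.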

Results and Outcomes

Model Accuracy

The model above returns an RMSE of \(5853.0\) on the training set and an RMSE of \(5600.0\) on the test set with an \(R^2=80\%\).
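For reference, RMSE and \(R^2\) can be computed from predictions as in the sketch below. The numbers in it are toy values chosen so the arithmetic is easy to check by hand; they are not the project's predictions, and the function names are illustrative.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt of the average squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Toy values: residuals are 1, 0, -1, 0, so SS_res = 2 and SS_tot = 20.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
print(rmse(y_true, y_pred))       # sqrt(2 / 4), about 0.7071
print(r_squared(y_true, y_pred))  # 1 - 2 / 20 = 0.9
```

RMSE is expressed in the same units as charges, which is why it is convenient for comparing training and test error directly.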

Web Application

The final model was deployed as a web application built with Streamlit. To try a single instance, fill out the form in the web app and then click predict charges.

Future Directions

A future project on the same data could add a neural network and compare the relative performance of the two models.


Citation

BibTeX citation:
@online{islam2024,
  author = {Islam, Rafiq},
  title = {Insurance {Cost} {Forecast} by Using {Linear} {Regression}},
  date = {2024-08-30},
  url = {https://mrislambd.github.io/portfolio/dsp/medicalcost/},
  langid = {en}
}
For attribution, please cite this work as:
Islam, Rafiq. 2024. “Insurance Cost Forecast by Using Linear Regression.” August 30, 2024. https://mrislambd.github.io/portfolio/dsp/medicalcost/.