How do we treat categorical features for our data science project?
Statistics
Data Science
Data Engineering
Machine Learning
Artificial Intelligence
Author
Rafiq Islam
Published
September 12, 2024
Introduction
Suppose we are working on a data science project where the data contains both continuous and categorical variables. For example, we want to build a predictive model for a life insurance company. The model will predict the annual company spending on individuals depending on their age, BMI, sex, smoking habit, number of children, and the region of the US where they live. So here, our target variable is continuous, while the feature variables contain both continuous and categorical variables.
There are many possible ways to understand how important each feature is. For example, during exploratory data analysis (EDA) we can plot how each feature interacts with the target variable, or calculate correlations between the features and the target. When a feature is continuous, computing the correlation matrix is not a big issue. But when the feature is categorical or ordinal, as in this predictive modeling case, how do we know whether the number of children or the smoking habit has an impact on insurance charges? Plotting a `boxplot` or `countplot` with `seaborn` or any other library may help and give some initial idea, but how do we quantify the relationship?
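For instance, a minimal sketch of such a plot, assuming the insurance data used later in this post is available as `insurance.csv`, might look like this:

```{python}
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Look at the spread of charges for each number of children
data = pd.read_csv('insurance.csv')
sns.boxplot(data=data, x='children', y='charges')
plt.title('Insurance charges by number of children')
plt.show()
```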
Here the statistical method one-way ***Analysis of Variance (ANOVA)*** comes in, among many other alternatives. Libraries like `scipy` have built-in functions that can compute the ANOVA test for each categorical feature; we will see that implementation at the end of this post. This blog post gives a simple explanation of the mathematics behind the ANOVA method.
ANOVA
Here is a random sample of 5 rows from the data we are talking about. We will use this data to explain the mathematical formulation of the method.
```{python}
import pandas as pd

data = pd.read_csv('insurance.csv')
print(data.sample(5, random_state=111))
```

```
      age   sex    bmi  children smoker     region     charges
1000   30  male  22.99         2    yes  northwest  17361.7661
53     36  male  34.43         0    yes  southeast  37742.5757
432    42  male  26.90         0     no  southwest   5969.7230
162    54  male  39.60         1     no  southwest  10450.5520
1020   51  male  37.00         0     no  southwest   8798.5930
```
We will explain the method using the feature `children`.
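To set up the notation, we split the `charges` column by the number of children, which gives six groups that will be used throughout the rest of the post:

```{python}
# Group sizes and one list of charges per number-of-children group
child = data.children.value_counts().sort_index()

c0 = data[data['children'] == 0].charges.values.tolist()
c1 = data[data['children'] == 1].charges.values.tolist()
c2 = data[data['children'] == 2].charges.values.tolist()
c3 = data[data['children'] == 3].charges.values.tolist()
c4 = data[data['children'] == 4].charges.values.tolist()
c5 = data[data['children'] == 5].charges.values.tolist()

print(child)
```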
A one-way analysis of variance is a method to compare \(k\) homogeneous groups when the experiment has \(n_i\) response values in each group \(i\). The total number of observations is \(n=\sum_{i} n_i\), and \(y_{ij}\) denotes the \(j\)th observation of the \(i\)th group. For our example above, we have \(\sum_{i=1}^{6}n_i = 574+324+240+157+25+18 = 1338\), and \(y_{12}=\) 21984.47061, meaning the second observation of group 1 (the policyholders with no children). Now let's define the group means
\[\mu_i = \frac{1}{n_i}\sum_{j=1}^{n_i} y_{ij}, \hspace{4mm}\text{for } i=1,2,\cdots, 6\]
Since all the groups come from the same population, we must assume that they all have a common variance. This \(\textcolor{red}{\text{homogeneity assumption is crucial}}\) for the ANOVA analysis. So, irrespective of its group assignment, each \(y_{ij}\sim (\mu_i, \sigma^2)\).
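As a quick sanity check on this assumption, a minimal sketch using Levene's test from `scipy.stats` (one common way to check variance homogeneity; not part of the ANOVA derivation itself) could look like this:

```{python}
from scipy.stats import levene

# Null hypothesis of Levene's test: all groups have equal variance
stat, pvalue = levene(c0, c1, c2, c3, c4, c5)
print(f"Levene statistic = {stat:.3f}, p-value = {pvalue:.3f}")
```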
What does one-way ANOVA do?
The main purpose of one-way ANOVA is to act as a judge in a courtroom. It assumes that there is no variation between the groups, that is, all groups have the same mean. So it sets the null hypothesis \(H_0: \mu_1=\mu_2=\cdots=\mu_6\), declaring that there is no difference between the groups, whereas the alternative is that at least one group mean differs. Let's see what happens with our data.
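The group means can be computed directly from the lists built above:

```{python}
import numpy as np

# Mean insurance charge for each number-of-children group
for i, group in enumerate([c0, c1, c2, c3, c4, c5], start=1):
    print(f"mu_{i} = {round(np.mean(group), 2)}")
```

And, as promised, the ANOVA itself is a one-liner with `scipy`. A minimal sketch, assuming `scipy` is installed: `f_oneway` returns the F-statistic and the p-value for the null hypothesis that all group means are equal.

```{python}
from scipy.stats import f_oneway

# One-way ANOVA across the six children groups
f_stat, p_value = f_oneway(c0, c1, c2, c3, c4, c5)
print(f"F-statistic = {f_stat:.3f}, p-value = {p_value:.3f}")
```

A small p-value would lead us to reject \(H_0\) and conclude that the number of children does have an effect on the insurance charges.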