import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
Simple Linear Regression
A linear regression with multiple predictors (also called input variables, features, independent variables, explanatory variables, regressors, or covariates) often takes the form
\[ y=f(\mathbf{x})+\epsilon =\mathbf{\beta}^{\top}\mathbf{x}+\epsilon \]
where \(\mathbf{\beta} \in \mathbb{R}^d\) is the vector of regression parameters, the constants that we aim to estimate, and \(\epsilon \sim \mathcal{N}(0,1)\) is a normally distributed error term, independent of \(\mathbf{x}\), often called white noise.
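As a quick illustration of this matrix form, here is a minimal sketch that simulates data from \(y=\mathbf{\beta}^{\top}\mathbf{x}+\epsilon\); the dimension \(d=3\), the coefficient values, and the variable names are arbitrary choices for the example, not part of what follows.

# Hypothetical illustration: simulate y = beta^T x + eps with d = 3 predictors
rng = np.random.default_rng(0)       # seeded generator for reproducibility
n, d = 200, 3                        # number of observations and predictors
beta = np.array([1.0, -2.0, 0.5])    # example "true" coefficients
Xmat = rng.random((n, d))            # n x d matrix of predictors
eps = rng.standard_normal(n)         # white-noise error, N(0, 1)
y_sim = Xmat @ beta + eps            # response generated by the linear model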
In the single-predictor case, the model becomes:
\[ y=f(x)+\epsilon=\beta_0+\beta_1 x+\epsilon \]
Therefore, in our model we need to estimate the parameters \(\beta_0,\beta_1\). The true relationship between the explanatory variable and the dependent variable is \(y=f(x)+\epsilon\), while our fitted model is \(\hat{y}=\hat{f}(x)=\hat{\beta}_0+\hat{\beta}_1 x\). There is therefore some error in the model prediction, the residual \(e=y-\hat{y}\), where \(y\) is the true value and \(\hat{y}\) is the predicted value; the underlying error term \(\epsilon\) is assumed to be normally distributed with mean 0 and variance 1. To get the best estimates of the parameters \(\beta_0,\beta_1\) we minimize the total squared error. So, for \(n\) observations \((x_1,y_1),\dots,(x_n,y_n)\), we define the residual sum of squares (RSS) as:
\[\begin{align} RSS &=e_1^2+e_2^2+\cdots+e_n^2\\ &= \sum_{i=1}^{n}(y_i-\hat{\beta}_0-\hat{\beta}_1 x_i)^2\\ \ell(\hat{\beta}_0,\hat{\beta}_1)&=\sum_{i=1}^{n}(y_i-\hat{\beta}_0-\hat{\beta}_1 x_i)^2 \end{align}\]
Using multivariate calculus, we take the partial derivatives of \(\ell\) with respect to \(\hat{\beta}_0\) and \(\hat{\beta}_1\):
\[\begin{align} \frac{\partial \ell}{\partial \hat{\beta}_0}&=\sum_{i=1}^{n} 2(y_i-\hat{\beta}_0-\hat{\beta}_1 x_i)(-1)\\ \frac{\partial \ell}{\partial \hat{\beta}_1}&= \sum_{i=1}^{n} 2(y_i-\hat{\beta}_0-\hat{\beta}_1 x_i)(-x_i) \end{align}\]
Setting the partial derivatives to zero we solve for \(\hat{\beta_0},\hat{\beta_1}\) as follows
\[\begin{align*} \frac{\partial \ell}{\partial \hat{\beta}_0}&=0\\ \implies \sum_{i=1}^{n} y_i-n \hat{\beta}_0-\hat{\beta}_1\left(\sum_{i=1}^{n} x_i\right)&=0\\ \implies \hat{\beta}_0&=\bar{y}-\hat{\beta}_1\bar{x} \end{align*}\]
and,
\[\begin{align*} \frac{\partial \ell}{\partial \hat{\beta}_1}&=0\\ \implies \sum_{i=1}^{n} 2(y_i-\hat{\beta}_0-\hat{\beta}_1 x_i)(-x_i)&=0\\ \implies \sum_{i=1}^{n} (y_i-\hat{\beta}_0-\hat{\beta}_1 x_i)x_i&=0\\ \implies \sum_{i=1}^{n} x_iy_i-\hat{\beta}_0\left(\sum_{i=1}^{n} x_i\right)-\hat{\beta}_1\left(\sum_{i=1}^{n} x_i^2\right)&=0\\ \implies \sum_{i=1}^{n} x_iy_i-\left(\bar{y}-\hat{\beta}_1\bar{x}\right)\left(\sum_{i=1}^{n} x_i\right)-\hat{\beta}_1\left(\sum_{i=1}^{n} x_i^2\right)&=0\\ \implies \sum_{i=1}^{n} x_iy_i-n\bar{x}\bar{y}-\hat{\beta}_1\left(\sum_{i=1}^{n}x_i^2-n\bar{x}^2\right)&=0\\ \implies \hat{\beta}_1&=\frac{\sum_{i=1}^{n} x_iy_i-n\bar{x}\bar{y}}{\sum_{i=1}^{n}x_i^2-n\bar{x}^2} \end{align*}\]
Expanding the products and using \(\sum_{i=1}^{n}x_i=n\bar{x}\) and \(\sum_{i=1}^{n}y_i=n\bar{y}\), we have \(\sum_{i=1}^{n}x_iy_i-n\bar{x}\bar{y}=\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})\) and \(\sum_{i=1}^{n}x_i^2-n\bar{x}^2=\sum_{i=1}^{n}(x_i-\bar{x})^2\), so
\[ \hat{\beta}_1=\frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2} \]
Therefore, we have the following
\[\begin{align*} \hat{\beta}_0&=\bar{y}-\hat{\beta}_1\bar{x}\\ \hat{\beta}_1&=\frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2} \end{align*}\]
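These closed-form formulas are straightforward to implement directly. Here is a minimal sketch, where the helper name ols_fit and the toy arrays are purely illustrative, that computes \(\hat{\beta}_0,\hat{\beta}_1\) with NumPy:

# Illustrative helper: closed-form OLS estimates for simple linear regression
def ols_fit(x, y):
    x_bar, y_bar = x.mean(), y.mean()
    beta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1

# Tiny example: points lying exactly on y = 2x + 0.5 recover beta0 = 0.5, beta1 = 2
x_toy = np.array([0.0, 1.0, 2.0, 3.0])
y_toy = 2 * x_toy + 0.5
print(ols_fit(x_toy, y_toy))   # (0.5, 2.0)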
Simple linear regression (slr) is applicable to a data set with a single feature and a continuous response variable.
Assumptions of Linear Regression
- Linearity: The relationship between the feature set and the target variable has to be linear.
- Homoscedasticity: The variance of the residuals has to be constant.
- Independence: All the observations are independent of each other.
- Normality: The error term (equivalently, the distribution of \(y\) conditional on \(x\)) has to be normally distributed.
Synthetic Data
To implement the algorithm, we need some synthetic data. To generate it, we use the linear equation \(y(x)=2x+\frac{1}{2}+\xi\) where \(\xi\sim \mathcal{N}(0,1)\).
X = np.random.random(100)
y = 2*X + 0.5 + np.random.randn(100)
Note that we used two random number generators, np.random.random(n) and np.random.randn(n). The first one generates \(n\) random values uniformly from the range \((0,1)\), and the second one generates \(n\) values from the standard normal distribution with mean 0 and variance (and standard deviation) 1.
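To see the difference, here is a quick sketch comparing summary statistics of the two generators; the sample size of 10,000 is an arbitrary choice for the illustration.

# Quick comparison of the two generators
u = np.random.random(10_000)   # uniform on (0, 1)
z = np.random.randn(10_000)    # standard normal, N(0, 1)
print(u.min(), u.max(), u.mean())   # roughly 0, 1, and 0.5
print(z.mean(), z.std())            # roughly 0 and 1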
plt.figure(figsize=(9, 6))
plt.scatter(X, y)
plt.xlabel('$X$')
plt.ylabel('y')
plt.gca().set_facecolor('#f4f4f4')
plt.gcf().patch.set_facecolor('#f4f4f4')
plt.show()
Model
We want to fit a simple linear regression to the above data.
slr = LinearRegression()
Now, to fit our data \(X\) and \(y\), we need to reshape the input variable, because if we look at \(X\):
X
array([0.56856587, 0.17423288, 0.40129224, 0.03280717, 0.54098864,
0.29660473, 0.66391506, 0.89033492, 0.17885744, 0.95990687,
0.36462402, 0.40674152, 0.14675139, 0.87909539, 0.63080773,
0.53839877, 0.12473846, 0.11900568, 0.07201608, 0.58065377,
0.04431626, 0.0072257 , 0.37659324, 0.49757598, 0.28649567,
0.33284351, 0.57301211, 0.62663095, 0.50147347, 0.04433713,
0.26319543, 0.61344242, 0.67052889, 0.89647799, 0.85831712,
0.17178016, 0.18087074, 0.65129641, 0.72596824, 0.30622122,
0.75513251, 0.16522543, 0.61771188, 0.18175136, 0.0647351 ,
0.88276012, 0.37657094, 0.06991887, 0.86900206, 0.87705882,
0.95791386, 0.35986784, 0.19088845, 0.80896819, 0.69386082,
0.30152154, 0.15326753, 0.18509181, 0.9961451 , 0.14013671,
0.19277641, 0.24059626, 0.53998499, 0.32534802, 0.79087255,
0.13104557, 0.28326053, 0.56381408, 0.20079243, 0.32677786,
0.93752833, 0.95799509, 0.73057342, 0.19006122, 0.13442495,
0.8295378 , 0.47808489, 0.15775223, 0.78753582, 0.33932299,
0.73967636, 0.74865527, 0.94241147, 0.578305 , 0.8819345 ,
0.41292441, 0.36738979, 0.6988793 , 0.41269004, 0.51400896,
0.32262575, 0.94121051, 0.58636257, 0.23706789, 0.78174534,
0.24518401, 0.18770689, 0.74447288, 0.36082694, 0.24436498])
It is a one-dimensional array (vector), but the slr object expects the input variable as a two-dimensional array (matrix) of shape (n_samples, n_features).
X = X.reshape(-1, 1)
X[:10]
array([[0.56856587],
[0.17423288],
[0.40129224],
[0.03280717],
[0.54098864],
[0.29660473],
[0.66391506],
[0.89033492],
[0.17885744],
[0.95990687]])
Now we fit the data to our model
slr.fit(X, y)
slr.predict([[2], [3]])
array([4.43623082, 6.34522069])
For \(X=2\) and \(X=3\), the corresponding predicted \(y\) values in the output above are pretty close to the true model \(y=2x+\frac{1}{2}\), which gives \(4.5\) and \(6.5\).
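A quick sanity check, purely illustrative, comparing the predictions with the noise-free line:

# Compare model predictions with the noise-free line y = 2x + 0.5
x_new = np.array([[2.0], [3.0]])
print(slr.predict(x_new))        # model predictions, about [4.44, 6.35]
print(2 * x_new.ravel() + 0.5)   # true line values: [4.5, 6.5]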
intercept = round(slr.intercept_, 4)
slope = slr.coef_
Now our model parameters are: intercept \(\hat{\beta}_0 \approx 0.6183\) and slope \(\hat{\beta}_1 \approx 1.909\) (slr.coef_ returns an array, here array([1.90898987])).
plt.figure(figsize=(9, 6))
plt.scatter(X, y, alpha=0.7, label="Sample Data")
plt.plot(np.linspace(0, 1, 100),
         slr.predict(np.linspace(0, 1, 100).reshape(-1, 1)),
         'k',
         label=r'Model $\hat{f}$')
plt.plot(np.linspace(0, 1, 100),
         2 * np.linspace(0, 1, 100) + 0.5,
         'r--',
         label='$f$')
plt.xlabel('$X$')
plt.ylabel('y')
plt.legend(fontsize=10)
plt.gca().set_facecolor('#f4f4f4')
plt.gcf().patch.set_facecolor('#f4f4f4')
plt.show()
So the model fits the data almost perfectly.
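As an informal check of the homoscedasticity and normality assumptions listed earlier, here is a minimal sketch of residual diagnostics; the figure size, bin count, and styling are arbitrary choices.

# Residual diagnostics: informal check of the assumptions listed above
residuals = y - slr.predict(X)

plt.figure(figsize=(9, 4))
plt.subplot(1, 2, 1)
plt.scatter(slr.predict(X), residuals, alpha=0.7)
plt.axhline(0, color='k', linewidth=1)
plt.xlabel('Fitted values')
plt.ylabel('Residuals')    # no funnel shape suggests constant variance

plt.subplot(1, 2, 2)
plt.hist(residuals, bins=20)
plt.xlabel('Residuals')    # roughly bell-shaped suggests normal errors
plt.tight_layout()
plt.show()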
Up next: multiple linear regression.
Citation
@online{islam2024,
author = {Islam, Rafiq},
title = {Simple {Linear} {Regression}},
date = {2024-08-29},
url = {https://mrislambd.github.io/dsandml/simplelinreg/},
langid = {en}
}