Simple Linear Regression

Data Science
Machine Learning
Artificial Intelligence
Author: Rafiq Islam

Published: August 29, 2024

Simple Linear Regression

A linear regression with multiple predictors, also called input variables, features, independent variables, explanatory variables, regressors, or covariates (many names), often takes the form

\[ y=f(\mathbf{x})+\epsilon =\boldsymbol{\beta}^\top\mathbf{x}+\epsilon \]

where \(\boldsymbol{\beta} \in \mathbb{R}^d\) is the vector of regression parameters (constants that we aim to estimate) and \(\epsilon \sim \mathcal{N}(0,1)\) is a normally distributed error term, independent of \(\mathbf{x}\), also called white noise.

In the simple case of a single predictor \(x\), the model becomes:

\[ y=f(x)+\epsilon=\beta_0+\beta_1 x+\epsilon \]

Therefore, in our model we need to estimate the parameters \(\beta_0,\beta_1\). The true relationship between the explanatory variable and the dependent variable is \(y=f(x)+\epsilon\), while our working (fitted) model is \(\hat{y}=\hat{f}(x)=\hat{\beta}_0+\hat{\beta}_1 x\). Each prediction therefore incurs an error \(\epsilon_i=y_i-\hat{y}_i\), where \(y_i\) is the true value and \(\hat{y}_i\) is the predicted value; these errors are assumed to be normally distributed with mean 0 and variance 1. Suppose we have \(n\) observations \((x_1,y_1),\dots,(x_n,y_n)\). To get the best estimates of the parameters \(\beta_0,\beta_1\), we minimize the total squared error. So, we define the residual sum of squares (RSS) as:

\[\begin{align} RSS &=\epsilon_1^2+\epsilon_2^2+\cdots+\epsilon_{n}^2\\ &= \sum_{i=1}^{n}(y_i-\hat{\beta}_0-\hat{\beta}_1 x_i)^2\\ \ell(\hat{\beta}_0,\hat{\beta}_1)&=\sum_{i=1}^{n}(y_i-\hat{\beta}_0-\hat{\beta}_1 x_i)^2 \end{align}\]

Using multivariate calculus, we compute the partial derivatives of \(\ell\):

\[\begin{align} \frac{\partial \ell}{\partial \hat{\beta}_0}&=\sum_{i=1}^{n} 2(y_i-\hat{\beta}_0-\hat{\beta}_1 x_i)(-1)\\ \frac{\partial \ell}{\partial \hat{\beta}_1}&= \sum_{i=1}^{n} 2(y_i-\hat{\beta}_0-\hat{\beta}_1 x_i)(-x_i) \end{align}\]

Setting the partial derivatives to zero, we solve for \(\hat{\beta}_0,\hat{\beta}_1\) as follows

\[\begin{align*} \frac{\partial \ell}{\partial \hat{\beta}_0}&=0\\ \implies \sum_{i=1}^{n} y_i-n \hat{\beta}_0-\hat{\beta}_1\left(\sum_{i=1}^{n} x_i\right)&=0\\ \implies \hat{\beta}_0&=\bar{y}-\hat{\beta}_1\bar{x} \end{align*}\]

and,

\[\begin{align*} \frac{\partial \ell}{\partial \hat{\beta}_1}&=0\\ \implies \sum_{i=1}^{n} 2(y_i-\hat{\beta}_0-\hat{\beta}_1 x_i)(-x_i)&=0\\ \implies \sum_{i=1}^{n} (y_i-\hat{\beta}_0-\hat{\beta}_1 x_i)x_i&=0\\ \implies \sum_{i=1}^{n} x_iy_i-\hat{\beta}_0\left(\sum_{i=1}^{n} x_i\right)-\hat{\beta}_1\left(\sum_{i=1}^{n} x_i^2\right)&=0\\ \implies \sum_{i=1}^{n} x_iy_i-\left(\bar{y}-\hat{\beta}_1\bar{x}\right)\left(\sum_{i=1}^{n} x_i\right)-\hat{\beta}_1\left(\sum_{i=1}^{n} x_i^2\right)&=0\\ \implies \sum_{i=1}^{n} x_iy_i-\bar{y}\left(\sum_{i=1}^{n} x_i\right)+\hat{\beta}_1\bar{x}\left(\sum_{i=1}^{n} x_i\right)-\hat{\beta}_1\left(\sum_{i=1}^{n} x_i^2\right)&=0\\ \implies \sum_{i=1}^{n} x_iy_i-\bar{y}\left(\sum_{i=1}^{n} x_i\right)-\hat{\beta}_1\left(\sum_{i=1}^{n}x_i^2-\bar{x}\sum_{i=1}^{n}x_i\right)&=0\\ \implies \sum_{i=1}^{n} x_iy_i-n\bar{x}\bar{y}-\hat{\beta}_1\left(\sum_{i=1}^{n}x_i^2-n\bar{x}^2\right)&=0\\ \implies \hat{\beta}_1&=\frac{\sum_{i=1}^{n} x_iy_i-n\bar{x}\bar{y}}{\sum_{i=1}^{n}x_i^2-n\bar{x}^2}\\ \implies \hat{\beta}_1&=\frac{\sum_{i=1}^{n} x_iy_i-\bar{y}\sum_{i=1}^{n}x_i-\bar{x}\sum_{i=1}^{n}y_i+n\bar{x}\bar{y}}{\sum_{i=1}^{n}x_i^2-2\bar{x}\sum_{i=1}^{n}x_i+n\bar{x}^2}\\ \implies \hat{\beta}_1&=\frac{\sum_{i=1}^{n}\left(x_iy_i-x_i\bar{y}-\bar{x}y_i+\bar{x}\bar{y}\right)}{\sum_{i=1}^{n}\left(x_i^2-2\bar{x}x_i+\bar{x}^2\right)}\\ \implies \hat{\beta}_1&=\frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2} \end{align*}\]

where we used \(\sum_{i=1}^{n}x_i=n\bar{x}\) and \(\sum_{i=1}^{n}y_i=n\bar{y}\).

Therefore, we have the following

\[\begin{align*} \hat{\beta}_0&=\bar{y}-\hat{\beta}_1\bar{x}\\ \hat{\beta}_1&=\frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2} \end{align*}\]
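
To make these formulas concrete, here is a minimal NumPy sketch (not part of the original derivation; the data and variable names are purely illustrative) that computes \(\hat{\beta}_0\) and \(\hat{\beta}_1\) directly:

import numpy as np

# Purely illustrative data; any paired observations would do
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.3, 4.4, 6.7, 8.4, 10.6])

x_bar, y_bar = x.mean(), y.mean()

# Closed-form least-squares estimates derived above
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

print(beta0_hat, beta1_hat)

These are exactly the least-squares estimates that scikit-learn's LinearRegression computes for a single feature.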

Simple linear regression (SLR) is applicable to a data set with a single feature and a continuous response variable.

import numpy as np 
import matplotlib.pyplot as plt 
from sklearn.linear_model import LinearRegression

Assumptions of Linear Regression

  • Linearity: The relationship between the feature set and the target variable has to be linear.
  • Homoscedasticity: The variance of the residuals has to be constant.
  • Independence: All the observations are independent of each other.
  • Normality: The residuals (error terms) are normally distributed.
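
As a quick, informal check of the homoscedasticity and normality assumptions, one can inspect the residuals of a fitted model. The following is a minimal sketch, not from the original post; the data here are generated only for illustration:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Illustrative data, generated only for this check
rng = np.random.default_rng(0)
X_chk = rng.random(100).reshape(-1, 1)
y_chk = 2 * X_chk.ravel() + 0.5 + rng.standard_normal(100)

model = LinearRegression().fit(X_chk, y_chk)
fitted = model.predict(X_chk)
residuals = y_chk - fitted

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(fitted, residuals, alpha=0.6)
ax1.axhline(0, color='k', lw=1)
ax1.set_xlabel('Fitted values')
ax1.set_ylabel('Residuals')   # roughly constant spread suggests homoscedasticity
ax2.hist(residuals, bins=20)
ax2.set_xlabel('Residuals')   # roughly bell-shaped suggests normal errors
plt.show()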

Synthetic Data

To implement the algorithm, we need some synthetic data. To generate it, we use the linear equation \(y(x)=2x+\frac{1}{2}+\xi\), where \(\xi\sim \mathcal{N}(0,1)\).

X=np.random.random(100)
y=2*X+0.5+np.random.randn(100)

Note that we used two random number generators, np.random.random(n) and np.random.randn(n). The first one generates \(n\) values drawn uniformly from the interval \([0,1)\), and the second one generates \(n\) values from the standard normal distribution with mean 0 and standard deviation 1.
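
To see the difference, one can draw a few values from each generator (a tiny illustrative snippet; the exact numbers vary from run to run):

# Uniform draws from [0, 1)
print(np.random.random(3))
# Standard normal draws (mean 0, standard deviation 1); these can be negative
print(np.random.randn(3))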

plt.figure(figsize=(9,6))
plt.scatter(X,y)
plt.xlabel('$X$')
plt.ylabel('y')
plt.gca().set_facecolor('#f4f4f4') 
plt.gcf().patch.set_facecolor('#f4f4f4')
plt.show()

Model

We want to fit a simple linear regression to the above data.

slr=LinearRegression()

Now, to fit our data \(X\) and \(y\), we need to reshape the input variable, because if we look at \(X\):

X
array([0.56856587, 0.17423288, 0.40129224, 0.03280717, 0.54098864,
       0.29660473, 0.66391506, 0.89033492, 0.17885744, 0.95990687,
       0.36462402, 0.40674152, 0.14675139, 0.87909539, 0.63080773,
       0.53839877, 0.12473846, 0.11900568, 0.07201608, 0.58065377,
       0.04431626, 0.0072257 , 0.37659324, 0.49757598, 0.28649567,
       0.33284351, 0.57301211, 0.62663095, 0.50147347, 0.04433713,
       0.26319543, 0.61344242, 0.67052889, 0.89647799, 0.85831712,
       0.17178016, 0.18087074, 0.65129641, 0.72596824, 0.30622122,
       0.75513251, 0.16522543, 0.61771188, 0.18175136, 0.0647351 ,
       0.88276012, 0.37657094, 0.06991887, 0.86900206, 0.87705882,
       0.95791386, 0.35986784, 0.19088845, 0.80896819, 0.69386082,
       0.30152154, 0.15326753, 0.18509181, 0.9961451 , 0.14013671,
       0.19277641, 0.24059626, 0.53998499, 0.32534802, 0.79087255,
       0.13104557, 0.28326053, 0.56381408, 0.20079243, 0.32677786,
       0.93752833, 0.95799509, 0.73057342, 0.19006122, 0.13442495,
       0.8295378 , 0.47808489, 0.15775223, 0.78753582, 0.33932299,
       0.73967636, 0.74865527, 0.94241147, 0.578305  , 0.8819345 ,
       0.41292441, 0.36738979, 0.6988793 , 0.41269004, 0.51400896,
       0.32262575, 0.94121051, 0.58636257, 0.23706789, 0.78174534,
       0.24518401, 0.18770689, 0.74447288, 0.36082694, 0.24436498])

It is a one-dimensional array (a vector), but the slr object expects the input in two-dimensional format: a matrix with one row per observation and one column per feature.

X=X.reshape(-1,1)
X[:10]
array([[0.56856587],
       [0.17423288],
       [0.40129224],
       [0.03280717],
       [0.54098864],
       [0.29660473],
       [0.66391506],
       [0.89033492],
       [0.17885744],
       [0.95990687]])

Now we fit the data to our model

slr.fit(X,y)
slr.predict([[2],[3]])
array([4.43623082, 6.34522069])

We predicted at \(X=2\) and \(X=3\); the corresponding \(y\) values in the output above are close to the true values from the model \(y=2x+\frac{1}{2}\), namely \(4.5\) and \(6.5\).
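
As a sanity check, a small illustrative snippet (assuming slr has been fitted as above; X_new is a hypothetical name) compares the predictions with the noiseless true line:

X_new = np.array([[2], [3]])
print(slr.predict(X_new))        # model predictions
print(2 * X_new.ravel() + 0.5)   # true values without noise: [4.5 6.5]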

intercept = round(slr.intercept_,4)
slope = slr.coef_

Now our model parameters are: intercept \(\beta_0=\) 0.6183 and slope \(\beta_1=\) array([1.90898987]). Note that slr.coef_ returns one coefficient per feature, so here the slope is the single element of that array.
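
These fitted values can also be recovered from the closed-form formulas derived earlier. Below is a minimal check, assuming the X and y arrays from above (beta0_manual and beta1_manual are hypothetical names; the exact numbers differ between runs):

x_flat = X.ravel()
beta1_manual = np.sum((x_flat - x_flat.mean()) * (y - y.mean())) / np.sum((x_flat - x_flat.mean()) ** 2)
beta0_manual = y.mean() - beta1_manual * x_flat.mean()
print(beta0_manual, beta1_manual)   # should match slr.intercept_ and slr.coef_[0]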

plt.figure(figsize=(9,6))
plt.scatter(X,y, alpha=0.7,label="Sample Data")
plt.plot(np.linspace(0,1,100),
    slr.predict(np.linspace(0,1,100).reshape(-1,1)),
    'k',
    label=r'Model $\hat{f}$'
)
plt.plot(np.linspace(0,1,100),
    2*np.linspace(0,1,100)+0.5,
    'r--',
    label='$f$'
)
plt.xlabel('$X$')
plt.ylabel('y')
plt.legend(fontsize=10)
plt.gca().set_facecolor('#f4f4f4') 
plt.gcf().patch.set_facecolor('#f4f4f4')
plt.show()

So the fitted line \(\hat{f}\) closely tracks the true line \(f\), despite the noise in the data.

Up next: multiple linear regression.

