import numpy as np
from sklearn.linear_model import LinearRegression
=np.random.randn(1000,2)
X=3*X[:,0]+2*X[:,1]+1+np.random.randn(1000) y
Multiple Liear Regression
Multiple Linear Regression
The multiple linear regression takes the form
\[
y=\beta_0+\beta_1 x_1+\beta_2 x_2+\cdots +\beta_d x_d+\xi=\vec{x}\cdot \vec{\beta}+\xi
\]
with \(\{\beta_i\}_{i=0}^{d}\in \mathbb{R}\) constants or parameters of the model. In vector notation, \(\vec{\beta}\in \mathbb{R}^{d+1}\),
\[ \vec{\beta}=\begin{pmatrix}\beta_0\\ \beta_1\\ \vdots \\ \beta_d \end{pmatrix};\hspace{4mm}\vec{x}=\begin{pmatrix}1\\ x_1\\ x_2\\ \vdots\\ x_d\end{pmatrix} \]
For \(n\) data points, in matrix algebra notation, we can write \(y=X\vec{\beta}+\xi\) where \(X\in \mathcal{M}_{n\times (d+1)}\) and \(y\in \mathbb{R}^{d+1}\) with
\[ X=\begin{pmatrix}1&x_{11}&x_{12}&\cdots&x_{1d}\\1&x_{21}&x_{22}&\cdots&x_{2d}\\ \vdots& \vdots &\vdots&\ddots &\vdots\\1&x_{n1}&x_{n2}&\cdots&x_{nd} \end{pmatrix};\hspace{4mm} y=\begin{pmatrix}y_1\\y_2\\ \vdots\\ y_n\end{pmatrix};\hspace{4mm} \xi=\begin{pmatrix}\xi_1\\ \xi_2\\ \vdots\\ \xi_n\end{pmatrix} \]
We fit the \(n\) data points with the objective to minimize the loss function, mean squared error
\[ MSE(\vec{\beta})=\frac{1}{n}\sum_{i=1}^{n}\left(y_i-f_{\vec{\beta}}(\vec{x}_i)\right)^2=\frac{1}{n}\left|\vec{y}-X\vec{\beta}\right|^2 \]
Ordinary Least Square Method
The scikit-learn
library uses Ordinary Least Squares (OLS) method to find the parameters. This method is good for a simple and relatively smaller dataset. Here is a short note on this method. However, when the dimension is very high and the dataset is bigger, scikit-learn
uses another method called Stochastic Gradient Descent for optimization which is discussed in the next section.
The goal of OLS is to find the parameter vector \(\hat{\beta}\) that minimizes the sum of squared errors (SSE) between the observed target values \(y\) and the predicted values \(\hat{y}\):
\[ \text{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - X_i\beta)^2 \]
This can be expressed in matrix form as:
\[ \text{SSE} = (y - X\beta)^T(y - X\beta) \]
To minimize the SSE, let’s first expand the expression:
\[\begin{align} \text{SSE} &= (y - X\beta)^T(y - X\beta)\\ &=(y^T-\beta^TX^T)(y-X\beta)\\ & = y^T y - y^T X\beta - \beta^T X^T y + \beta^T X^T X \beta \end{align}\]
Since \(\beta^T X^T y\) is a scalar (a 1x1 matrix), it is equal to its transpose. That is
\[\begin{align*} \beta^TX^Ty&=\left(\beta^TX^Ty\right)^T\\ &= \left((\beta^TX^T)y\right)^T\\ &=y^T(\beta^TX^T)^T\\ &=y^T(\beta^TX^T)^T\\ &=y^T\left(X^T\right)^T\left(\beta^T\right)^T\\ &=y^TX\beta \end{align*}\]
and therefore,
\[ \text{SSE} = y^T y - 2\beta^T X^T y + \beta^T X^T X \beta \]
To find the minimum of the SSE, we take the derivative with respect to \(\beta\) and set it to zero:
\[ \frac{\partial \text{SSE}}{\partial \beta} = -2X^T y + 2X^T X \beta = 0 \]
Now, solve for \(\beta\):
\[ X^T X \beta = X^T y \]
To isolate \(\beta\), we multiply both sides by \((X^T X)^{-1}\) (assuming \(X^T X\) is invertible):
\[ \beta = (X^T X)^{-1} X^T y \]
The vector \(\hat{\beta} = (X^T X)^{-1} X^T y\) gives the estimated coefficients that minimize the sum of squared errors between the observed target values \(y\) and the predicted values \(\hat{y} = X\hat{\beta}\). This method is exact and works well when \(X^T X\) is invertible and the dataset size is manageable.
This method is very efficient for small to medium-sized datasets but can become computationally expensive for very large datasets due to the inversion of the matrix \(X^TX\).
Iterative Method
Gradient Descent
GIF Credit: gbhat.com
Gradient Descent is an optimization algorithm used to minimize the cost function. The cost function \(f(\beta)\) measures how well a model with parameters \(\beta\) fits the data. The goal is to find the values of \(\beta\) that minimize this cost function. In terms of the iterative method, we want to find \(\beta_{k+1}\) and \(\beta_k\) such that \(f(\beta_{k+1})<f(\beta_k)\).
For a small change in \(\beta\), we can approximate \(f(\beta)\) using Taylor series expansion
\[f(\beta_{k+1})=f(\beta_k +\Delta\beta_k)\approx f(\beta_k)+\nabla f(\beta_k)^T \Delta \beta_k+\text{higher-order terms}\]
The update rule for vanilla gradient descent is given by:
\[ \beta_{k+1} = \beta_k - \eta \nabla f(\beta_k) \]
Where:
- \(\beta_k\) is the current estimate of the parameters at iteration \(k\).
- \(\eta\) is the learning rate, a small positive scalar that controls the step size.
- \(\nabla f(\beta_k)\) is the gradient of the cost function \(f\) with respect to \(\beta\) at the current point \(\beta_k\).
The update rule comes from the idea of moving the parameter vector \(\beta\) in the direction that decreases the cost function the most.
Gradient: The gradient \(\nabla f(\beta_k)\) represents the direction and magnitude of the steepest ascent of the function \(f\) at the point \(\beta_k\). Since we want to minimize the function, we move in the opposite direction of the gradient.
Step Size: The term \(\eta \nabla f(\beta_k)\) scales the gradient by the learning rate \(\eta\), determining how far we move in that direction. If \(\eta\) is too large, the algorithm may overshoot the minimum; if it’s too small, the convergence will be slow.
Iterative Update: Starting from an initial guess \(\beta_0\), we repeatedly apply the update rule until the algorithm converges, meaning that the changes in \(\beta_k\) become negligible, and \(\beta_k\) is close to the optimal value \(\beta^*\).
Stochastic Gradient Descent (SGD)
Stochastic Gradient Descent is a variation of the vanilla gradient descent. Instead of computing the gradient using the entire dataset, SGD updates the parameters using only a single data point or a small batch of data points at each iteration. The later one we call it mini batch SGD.
Suppose our cost function is defined as the average over a dataset of size \(n\):
\[ f(\beta) = \frac{1}{n} \sum_{i=1}^{n} f_i(\beta) \]
Where \(f_i(\beta)\) represents the contribution of the \(i\)-th data point to the total cost function. The gradient of the cost function with respect to \(\beta\) is:
\[ \nabla f(\beta) = \frac{1}{n} \sum_{i=1}^{n} \nabla f_i(\beta) \]
Vanilla gradient descent would update the parameters as:
\[ \beta_{k+1} = \beta_k - \eta \nabla f(\beta_k) \]
Instead of using the entire dataset to compute the gradient, SGD approximates the gradient by using only a single data point (or a small batch). The update rule for SGD is:
\[ \beta_{k+1} = \beta_k - \eta \nabla f_{i_k}(\beta_k) \]
Where:
- \(i_k\) is the index of a randomly selected data point at iteration \(k\).
- \(\nabla f_{i_k}(\beta_k)\) is the gradient of the cost function with respect to the parameter \(\beta_k\), evaluated only at the data point indexed by \(i_k\).
Python Execution
Synthetic Data
So for this project, our known relationship is \(y=1+3x_1+2x_2+\xi\).
Fit the data: Using scikit-learn Library
=LinearRegression()
mlr
mlr.fit(X,y)=mlr.coef_.tolist()
coefficients=mlr.intercept_.tolist() slope
So the model parameters: slope \(\beta_0=\) 0.9981 and coefficients \(\beta_1=\) 2.974, and \(\beta_2=\) 1.9881
Fit the data: Using Custom Library OLS
First we create our custom NewLinearRegression
using the OLS formula above and save this python
class as mlreg.py
import numpy as np
class NewLinearRegression:
def __init__(self) -> None:
self.beta = None
def fit(self, X, y):
= np.concatenate([np.ones((len(X), 1)), X], axis=1)
X = np.dot(X.transpose(), X)
X_transpose_X = np.linalg.inv(X_transpose_X)
X_transpose_X_inverse = np.dot(X.transpose(), y)
X_transpose_y self.beta = np.dot(X_transpose_X_inverse, X_transpose_y)
def predict(self, X):
= np.concatenate([np.ones((len(X), 1)), X], axis=1)
X return np.dot(X, self.beta)
def coeff_(self):
return self.beta[1:].tolist()
def interceptt_(self):
return self.beta[0].tolist()
Now it’s time to use the new class
from mlreg import NewLinearRegression
= NewLinearRegression()
mlr1
mlr1.fit(X,y)=mlr1.coeff_()
coefficients1=mlr1.interceptt_() slope1
So the model parameters: slope \(\beta_0=\) 0.9981 and coefficients \(\beta_1=\) 2.974, and \(\beta_2=\) 1.9881
Fit the data: Using Gradient Descent
We create the class
class GDLinearRegression:
def __init__(self, learning_rate=0.01, number_of_iteration=1000) -> None:
self.learning_rate = learning_rate
self.number_of_iteration = number_of_iteration
self.weights = None
self.bias = None
def fit(self, X, y):
= X.shape
num_of_samples, num_of_features self.weights = np.zeros(num_of_features)
self.bias = 0
for _ in range(self.number_of_iteration):
= np.dot(X, self.weights) + self.bias
y_predicted
= (1 / num_of_samples) * np.dot(X.T, (y_predicted - y))
d_weights = (1 / num_of_samples) * np.sum(y_predicted - y)
d_bias
self.weights -= self.learning_rate * d_weights
self.bias -= self.learning_rate * d_bias
def predict(self, X):
= np.dot(X, self.weights) + self.bias
y_predicted return y_predicted
def coefff_(self):
return self.weights.tolist()
def intercepttt_(self):
return self.bias
Now we use this similarly as before,
from mlreg import GDLinearRegression
= GDLinearRegression(learning_rate=0.008)
mlr2
mlr2.fit(X,y)=mlr2.coefff_()
coefficients2=mlr2.intercepttt_() slope2
So the model parameters: slope \(\beta_0=\) 0.9979 and coefficients \(\beta_1=\) 2.973, and \(\beta_2=\) 1.9872
Fit the data: Using Stochastic Gradient Descent
First we define the class
class SGDLinearRegression:
def __init__(self, learning_rate=0.01, num_iterations=1000, batch_size=1) -> None:
self.learning_rate = learning_rate
self.num_iterations = num_iterations
self.batch_size = batch_size
self.theta = None
self.mse_list = None # Initialize mse_list as an instance attribute
def _loss_function(self, X, y, beta):
= len(y)
num_samples = X.dot(beta)
y_predicted = (1/num_samples) * np.sum(np.square(y_predicted - y))
mse return mse
def _gradient_function(self, X, y, beta):
= len(y)
num_samples = X.dot(beta)
y_predicted = (1/num_samples) * X.T.dot(y_predicted - y)
grad return grad
def fit(self, X, y):
# Adding the intercept term (bias) as a column of ones
= np.concatenate([np.ones((len(X), 1)), X], axis=1)
X = X.shape[1]
num_features self.theta = np.zeros((num_features, 1))
self.mse_list = np.zeros(self.num_iterations) # Initialize mse_list
for i in range(self.num_iterations):
# Randomly select a batch of data points
= np.random.choice(
indices len(y), size=self.batch_size, replace=False)
= X[indices]
X_i = y[indices].reshape(-1, 1)
y_i
# Compute the gradient and update the weights
= self._gradient_function(X_i, y_i, self.theta)
gradient self.theta = self.theta - self.learning_rate * gradient
# Calculate loss for the entire dataset (optional)
self.mse_list[i] = self._loss_function(X, y, self.theta)
return self.theta, self.mse_list
def predict(self, X):
# Adding the intercept term (bias) as a column of ones
= np.concatenate([np.ones((len(X), 1)), X], axis=1)
X return X.dot(self.theta)
def coef_(self):
# Return the coefficients (excluding the intercept term)
return self.theta[1:].flatten().tolist()
def intercept_(self):
# Return the intercept term
return self.theta[0].item()
def mse_losses(self):
# Return the mse_list
return self.mse_list.tolist()
Now
import matplotlib.pyplot as plt
from mlreg import SGDLinearRegression
=SGDLinearRegression(learning_rate=0.01, num_iterations=1000, batch_size=10)
mlr3= mlr3.fit(X, y) theta, _
So the model parameters: slope \(\beta_0=\) array([0.99366046]) and coefficients \(\beta_1=\) array([2.98592199]), and \(\beta_2=\) array([1.99782022])
Up next knn regression
Share on
You may also like
Citation
@online{islam2024,
author = {Islam, Rafiq},
title = {Multiple {Liear} {Regression}},
date = {2024-08-29},
url = {https://mrislambd.github.io/dsandml/multiplelinreg/},
langid = {en}
}