import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
def f(x,y, c, d):
return np.sqrt((x-c)**2+(y-d)**2)
= np.linspace(-10,10, 400)
x = np.linspace(-10,10, 400)
y = np.meshgrid(x,y)
x, y
= 0,0
c, d
= f(x, y, c, d)
z
= plt.figure()
fig = fig.add_subplot(111, projection ='3d')
ax = 'viridis')
ax.plot_surface(x, y, z, cmap'X')
ax.set_xlabel('Y')
ax.set_ylabel('Z')
ax.set_zlabel('#f4f4f4')
plt.gca().set_facecolor('#f4f4f4')
plt.gcf().patch.set_facecolor('sgd.png')
plt.savefig( plt.show()
Implementation of Gradient Descent (GD) and Stochastic Gradient Descent (SGD)
Gradient Descent
GIF Credit: gbhat.com
Gradient Descent is an optimization algorithm used to minimize the cost function. The cost function \(f(\beta)\) measures how well a model with parameters \(\beta\) fits the data. The goal is to find the values of \(\beta\) that minimize this cost function. In terms of the iterative method, we want to find \(\beta_{k+1}\) and \(\beta_k\) such that \(f(\beta_{k+1})<f(\beta_k)\).
For a small change in \(\beta\), we can approximate \(f(\beta)\) using Taylor series expansion
\[f(\beta_{k+1})=f(\beta_k +\Delta\beta_k)\approx f(\beta_k)+\nabla f(\beta_k)^T \Delta \beta_k+\text{higher-order terms}\]
The update rule for vanilla gradient descent is given by:
\[ \beta_{k+1} = \beta_k - \eta \nabla f(\beta_k) \]
Where:
- \(\beta_k\) is the current estimate of the parameters at iteration \(k\).
- \(\eta\) is the learning rate, a small positive scalar that controls the step size.
- \(\nabla f(\beta_k)\) is the gradient of the cost function \(f\) with respect to \(\beta\) at the current point \(\beta_k\).
The update rule comes from the idea of moving the parameter vector \(\beta\) in the direction that decreases the cost function the most.
Gradient: The gradient \(\nabla f(\beta_k)\) represents the direction and magnitude of the steepest ascent of the function \(f\) at the point \(\beta_k\). Since we want to minimize the function, we move in the opposite direction of the gradient.
Step Size: The term \(\eta \nabla f(\beta_k)\) scales the gradient by the learning rate \(\eta\), determining how far we move in that direction. If \(\eta\) is too large, the algorithm may overshoot the minimum; if it’s too small, the convergence will be slow.
Iterative Update: Starting from an initial guess \(\beta_0\), we repeatedly apply the update rule until the algorithm converges, meaning that the changes in \(\beta_k\) become negligible, and \(\beta_k\) is close to the optimal value \(\beta^*\).
Stochastic Gradient Descent (SGD)
Stochastic Gradient Descent is a variation of the vanilla gradient descent. Instead of computing the gradient using the entire dataset, SGD updates the parameters using only a single data point or a small batch of data points at each iteration. The later one we call it mini batch SGD.
Suppose our cost function is defined as the average over a dataset of size \(n\):
\[ f(\beta) = \frac{1}{n} \sum_{i=1}^{n} f_i(\beta) \]
Where \(f_i(\beta)\) represents the contribution of the \(i\)-th data point to the total cost function. The gradient of the cost function with respect to \(\beta\) is:
\[ \nabla f(\beta) = \frac{1}{n} \sum_{i=1}^{n} \nabla f_i(\beta) \]
Vanilla gradient descent would update the parameters as:
\[ \beta_{k+1} = \beta_k - \eta \nabla f(\beta_k) \]
Instead of using the entire dataset to compute the gradient, SGD approximates the gradient by using only a single data point (or a small batch). The update rule for SGD is:
\[ \beta_{k+1} = \beta_k - \eta \nabla f_{i_k}(\beta_k) \]
Where:
- \(i_k\) is the index of a randomly selected data point at iteration \(k\).
- \(\nabla f_{i_k}(\beta_k)\) is the gradient of the cost function with respect to the parameter \(\beta_k\), evaluated only at the data point indexed by \(i_k\).
Implementation
Let’s imagine a hypothetical scenario, Walmart Inc. wants to explore their business in a new twon. They want to have their store in location so that the total distance of the store from all the houses in the neighborhood is the smallest possible. If they have the data of \(n\) houses with corresponding coordinates of the houses, return the optimized location for the store.
The Euclidean distance between two points \((x_1,y_1)\) and \((x_2,y_2)\) is given by
\[d=\sqrt{(x_1-x_2)^2+(y_1-y_2)}\]
Assume that \(P=(x,y)\) is the coordinate of Walmart. So for a total of \(n\) such points the total distance \(D\) from the point \(P\) is a function of two variable \(x\) and \(y\) of the following form
\[D=f(x,y)=\sum_{i=1}^{n}\sqrt{(x-x_i)^2+(y-y_i)^2}\]
all we need to do is to minimize the function \(f(x,y)\) and to do that we need to calculate the gradient vector which is the partial derivative of \(f(x,y)\) with respect to \(x\) and \(y\). So,
\[\begin{align*} \frac{\partial f}{\partial x}& = \sum_{i=1}^{n} \frac{x-x_i}{\sqrt{(x-x_i)^2+(y-y_i)^2}}\\ \frac{\partial f}{\partial y}& = \sum_{i=1}^{n} \frac{y-y_i}{\sqrt{(x-x_i)^2+(y-y_i)^2}}\\ & \\ \implies \nabla f(x,y) &= \begin{bmatrix}\frac{\partial f}{\partial x}\\\frac{\partial f}{\partial y}\end{bmatrix} \end{align*}\]
Then the algorithm
\[\begin{align*} \begin{bmatrix}x_{i+1}\\y_{i+1}\end{bmatrix}&= \begin{bmatrix}x_{i}\\y_{i}\end{bmatrix} - \eta_i \begin{bmatrix}\frac{\partial f}{\partial x}|_{x_i}\\\frac{\partial f}{\partial y}|_{y_i}\end{bmatrix} \end{align*}\]
where, the \(\eta\) is the step size or learning rate that scales the size of the move towards the opposite of the gradient direction.
Next, how do we control the numerical stability of the algorithm? We need to decrease the step size at each iteration which. This is called the rate of decay
. We also need a termination factor or tolerance
level that determines if we can stop the iteration. Sometimes, for a deep down convex function, the process oscillates back and forth around a range of values. In this case, applying a damping factor
increases the chance for a smooth convergence.
Gradient Descent (GD)
import numpy as np
import matplotlib.pyplot as plt
import math
class GDdistanceMin:
def __init__(self, step_size=1, decay_rate=0.99, tolerance=1e-7, damping_rate=0.75, points=[]):
self.step_size = step_size
self.decay_rate = decay_rate
self.tolerance = tolerance
self.damping_rate = damping_rate
self.points = points
self.x = sum(x for x, y in points) / len(points) # Initialization
self.y = sum(y for x, y in points) / len(points) # Initialization
self.x_updates = []
self.y_updates = []
def _partial_derivative_x(self, x, y):
= 0
grad_x for xi, yi in self.points:
if x != xi or y != yi:
+= (x - xi) / math.sqrt((x - xi)**2 + (y - yi)**2)
grad_x return grad_x
def _partial_derivative_y(self, x, y):
= 0
grad_y for xi, yi in self.points:
if x != xi or y != yi:
+= (y - yi) / math.sqrt((x - xi)**2 + (y - yi)**2)
grad_y return grad_y
def gradient_descent(self):
= 0, 0
dx, dy while self.step_size > self.tolerance:
= self._partial_derivative_x(self.x, self.y) + self.damping_rate * dx
dx = self._partial_derivative_y(self.x, self.y) + self.damping_rate * dy
dy self.x -= self.step_size * dx
self.x_updates.append(self.x)
self.y -= self.step_size * dy
self.y_updates.append(self.y)
self.step_size *= self.decay_rate
return (self.x, self.y)
def f(x, y, c, d):
return np.sqrt((x - c)**2 + (y - d)**2)
# Define points
= [(1, 3), (-2, 4), (3, 4), (-2, 1), (9, 2), (-5, 2)]
points = GDdistanceMin(points=points)
gd_min = gd_min.gradient_descent()
min_pt = gd_min.x_updates
xs = gd_min.y_updates
ys print("Minimum point:", min_pt)
= min_pt
c, d
# Create a grid for plotting
= np.linspace(-10, 10, 400)
x = np.linspace(-10, 10, 400)
y = np.meshgrid(x, y)
x_grid, y_grid = f(x_grid, y_grid, c, d)
z
# Calculate z values for the updates
= [f(xi, yi, c, d) for xi, yi in zip(xs, ys)]
zs
# Plotting
= plt.figure()
fig = fig.add_subplot(111, projection='3d')
ax ='viridis', alpha=0.6)
ax.plot_surface(x_grid, y_grid, z, cmap='red', s=50, label="Updates", marker='o')
ax.scatter(xs, ys, zs, color'X')
ax.set_xlabel('Y')
ax.set_ylabel('Z')
ax.set_zlabel('#f4f4f4')
plt.gca().set_facecolor('#f4f4f4')
plt.gcf().patch.set_facecolor( plt.show()
Minimum point: (0.9999997458022071, 2.9999999769909707)
Testing
Share on
You may also like
Citation
@online{islam2024,
author = {Islam, Rafiq},
title = {Implementation of {Gradient} {Descent} {(GD)} and
{Stochastic} {Gradient} {Descent} {(SGD)}},
date = {2024-09-19},
url = {https://mrislambd.github.io/posts/sgd/},
langid = {en}
}