Implementation of Gradient Descent (GD) and Stochastic Gradient Descent (SGD)
Gradient Descent
(Animation of gradient descent iterations. GIF credit: gbhat.com)
Gradient Descent is an optimization algorithm used to minimize the cost function. The cost function \(f(\beta)\) measures how well a model with parameters \(\beta\) fits the data. The goal is to find the values of \(\beta\) that minimize this cost function. In terms of the iterative method, we want to find \(\beta_{k+1}\) and \(\beta_k\) such that \(f(\beta_{k+1})<f(\beta_k)\).
For a small change in \(\beta\), we can approximate \(f(\beta)\) using a Taylor series expansion:
\[f(\beta_{k+1})=f(\beta_k +\Delta\beta_k)\approx f(\beta_k)+\nabla f(\beta_k)^T \Delta \beta_k+\text{higher-order terms}\]
Choosing \(\Delta\beta_k=-\eta\nabla f(\beta_k)\) makes the first-order term equal to \(-\eta\|\nabla f(\beta_k)\|^2\le 0\), so for a sufficiently small \(\eta\) the cost decreases. This yields the update rule for vanilla gradient descent:
\[ \beta_{k+1} = \beta_k - \eta \nabla f(\beta_k) \]
Where:
- \(\beta_k\) is the current estimate of the parameters at iteration \(k\).
- \(\eta\) is the learning rate, a small positive scalar that controls the step size.
- \(\nabla f(\beta_k)\) is the gradient of the cost function \(f\) with respect to \(\beta\) at the current point \(\beta_k\).
The update rule comes from the idea of moving the parameter vector \(\beta\) in the direction that decreases the cost function the most.
Gradient: The gradient \(\nabla f(\beta_k)\) represents the direction and magnitude of the steepest ascent of the function \(f\) at the point \(\beta_k\). Since we want to minimize the function, we move in the opposite direction of the gradient.
Step Size: The term \(\eta \nabla f(\beta_k)\) scales the gradient by the learning rate \(\eta\), determining how far we move in that direction. If \(\eta\) is too large, the algorithm may overshoot the minimum; if it’s too small, the convergence will be slow.
Iterative Update: Starting from an initial guess \(\beta_0\), we repeatedly apply the update rule until the algorithm converges, meaning that the changes in \(\beta_k\) become negligible, and \(\beta_k\) is close to the optimal value \(\beta^*\).
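To make the update rule concrete, here is a minimal generic sketch of the iteration. The quadratic cost and all hyper-parameter values below are illustrative choices, not taken from any particular model:

import numpy as np

def grad_f(beta):
    # Gradient of an illustrative quadratic cost f(beta) = ||beta - 1||^2
    return 2*(beta - 1.0)

eta = 0.1            # learning rate (illustrative value)
beta = np.zeros(2)   # initial guess beta_0
for k in range(100):
    beta = beta - eta*grad_f(beta)  # beta_{k+1} = beta_k - eta * grad f(beta_k)

print(beta)  # converges toward the minimizer [1, 1]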
Stochastic Gradient Descent (SGD)
Stochastic Gradient Descent is a variation of vanilla gradient descent. Instead of computing the gradient over the entire dataset, SGD updates the parameters using only a single data point, or a small batch of data points, at each iteration. The latter variant is called mini-batch SGD.
Suppose our cost function is defined as the average over a dataset of size \(n\):
\[ f(\beta) = \frac{1}{n} \sum_{i=1}^{n} f_i(\beta) \]
Where \(f_i(\beta)\) represents the contribution of the \(i\)-th data point to the total cost function. The gradient of the cost function with respect to \(\beta\) is:
\[ \nabla f(\beta) = \frac{1}{n} \sum_{i=1}^{n} \nabla f_i(\beta) \]
Vanilla gradient descent would update the parameters as:
\[ \beta_{k+1} = \beta_k - \eta \nabla f(\beta_k) \]
Instead of using the entire dataset to compute the gradient, SGD approximates the gradient by using only a single data point (or a small batch). The update rule for SGD is:
\[ \beta_{k+1} = \beta_k - \eta \nabla f_{i_k}(\beta_k) \]
Where:
- \(i_k\) is the index of a randomly selected data point at iteration \(k\).
- \(\nabla f_{i_k}(\beta_k)\) is the gradient of the cost function with respect to the parameter \(\beta_k\), evaluated only at the data point indexed by \(i_k\).
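In code, the only change from the full-batch update is which gradient gets evaluated at each step. Here is a minimal sketch for the least-squares case \(f_i(\beta)=\frac{1}{2}(x_i^T\beta-y_i)^2\); the synthetic data and hyper-parameter values are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # 100 data points, 3 features (synthetic)
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + 0.01*rng.normal(size=100)

eta = 0.05
beta = np.zeros(3)
for k in range(2000):
    i = rng.integers(len(y))                # draw the index i_k uniformly at random
    grad_i = (X[i] @ beta - y[i]) * X[i]    # gradient of f_i at the current beta_k
    beta = beta - eta*grad_i                # SGD update

print(beta)  # hovers near beta_true because of the constant step size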
Implementation of GD
In 1D
Assume that we have a function \(f(x)=x^2-3x+\frac{13}{4}\) which we want to minimize. That is, we want to find the \(x\) that minimizes the function.
Next, let’s implement GD1
import numpy as np
import matplotlib.pyplot as plt

# Domain from which to draw the initial guess
# (assumed here as [-5, 5]; the original setup cell is not shown)
x = np.linspace(-5, 5, 1000)

# Define the function
def f(x):
    return x**2 - 3*x + (13/4)

# Define the gradient function
def grad_f(x):
    return 2*x - 3

# Take a random guess from the domain
local_min = np.random.choice(x)
print('Initial guess: ', local_min)

# Hyper-parameters
learning_rate = 0.01
training_epochs = 200

model_params = np.zeros((training_epochs, 2))
for i in range(training_epochs):
    grad = grad_f(local_min)
    local_min = local_min - learning_rate*grad
    model_params[i, :] = [local_min, grad]

print('Empirical/estimated local minimum:', local_min)

fig, axs = plt.subplots(1, 2, figsize=(8.5, 3.8))
for i in range(2):
    axs[i].plot(model_params[:, i], '-')
    axs[i].set_xlabel('Iteration')
    axs[i].set_title(f'Final estimated Min: {local_min:.5f}')
axs[0].set_ylabel('Local Min')
axs[1].set_ylabel('Derivative')
plt.show()
Initial guess: 0.025025025025025016
Empirical/estimated local minimum: 1.4740582188953646
Alternatively, if we want to stop based on a tolerance instead of a fixed number of epochs, this is how we set that up:
# Take a random guess from the domain
local_min = np.random.choice(x)
print('Initial guess:', local_min)

# Hyper-parameters
learning_rate = 0.01
tolerance = 1e-2  # Stop if gradient is smaller than this value

# Lists to store optimization progress
local_min_vals = []
grad_vals = []
iteration = 0  # Track the number of iterations

# Gradient Descent Loop (Runs until gradient is small enough)
while True:
    grad = grad_f(local_min)
    # Stop when gradient is smaller than tolerance
    if abs(grad) < tolerance:
        print(f"Stopping at iteration {iteration} (grad={grad:.5f})")
        break
    # Update local minimum
    local_min = local_min - learning_rate * grad
    # Store values
    local_min_vals.append(local_min)
    grad_vals.append(grad)
    iteration += 1  # Increment iteration count

print('Empirical/estimated local minimum:', local_min)

# Convert lists to numpy arrays
local_min_vals = np.array(local_min_vals)
grad_vals = np.array(grad_vals)

# Plot Results
fig, axs = plt.subplots(1, 2, figsize=(8.5, 3.8))

axs[0].plot(local_min_vals, '.', label="Local Min Estimate")
axs[0].set_xlabel('Iteration')
axs[0].set_ylabel('Local Min')
axs[0].legend()

axs[1].plot(grad_vals, '.', label="Gradient Value")
axs[1].set_xlabel('Iteration')
axs[1].set_ylabel('Derivative')
axs[1].legend()

plt.suptitle(f'Final estimated Min: {local_min:.5f}')
plt.show()
Initial guess: 2.701701701701702
Stopping at iteration 272 (grad=0.00987)
Empirical/estimated local minimum: 1.5049350239483508
Parametric variation
Let’s consider a different problem, \(f(x)=-e^{-\frac{x^2}{10}}(0.2x\cos x+\sin x)\). We want to find the \(x\) value and the hyper-parameters that minimize \(f(x)\).
x = np.linspace(-3*np.pi, 3*np.pi, 400)
y = -np.exp(-(x**2/10))*(0.2*x*np.cos(x)+np.sin(x))

dy = np.exp(-(x**2/10))*((0.04*x**2-1.2)*np.cos(x)+0.4*x*np.sin(x))

plt.plot(x, y, x, dy)
plt.legend(['f(x)', 'df'])
plt.show()
Clearly, the global minimum is somewhere between 0 and 2.5, maybe around 1.25. Now let’s apply GD with varying parameters.
def f(x):
    f = -np.exp(-(x**2/10))*(0.2*x*np.cos(x)+np.sin(x))
    return f

def df(x):
    ddx_of_f = np.exp(-(x**2/10))*((0.04*x**2-1.2)*np.cos(x)+0.4*x*np.sin(x))
    return ddx_of_f

learning_rate = 0.003
training_epochs = 1000
local_min = np.random.choice(x, 1)
print('The chosen initial point: ', local_min)
for i in range(training_epochs):
    grad = df(local_min)
    local_min = local_min - learning_rate*grad

plt.plot(x, f(x), x, df(x), '--')
plt.plot(local_min, df(local_min), 'ro')
plt.plot(local_min, f(local_min), 'ro')
plt.xlabel('x')
plt.ylabel('f(x)')
plt.legend(['f(x)', 'df', 'f(x) min'])
plt.title('Iterated local min at x = %s' % local_min[0])
plt.show()
The chosen initial point: [5.17299843]
Now let’s see how we can pick the right initial point
# Vary the starting point
starting_points = np.linspace(-5, 5, 50)
stopping_points = np.zeros(len(starting_points))

training_epochs = 1000
learning_rate = 0.01
for idx, local_min in enumerate(starting_points):
    for i in range(training_epochs):
        grad = df(local_min)
        local_min = local_min - learning_rate * grad
    stopping_points[idx] = local_min

plt.plot(starting_points, stopping_points, 's-')
plt.xlabel('Starting points')
plt.ylabel('Stopping points')
plt.show()
Based on this, if we start at any point roughly in the interval \((-1, 3.5)\), we will end up at the global minimum. Points outside this interval, however, may take the iterative process to other local minima.
# Varying learning rates
lrs = np.linspace(1e-10, 1e-2, 30)
stopping_points = np.zeros(len(lrs))

training_epochs = 1000

for idx, lr in enumerate(lrs):
    # fixed initial point
    local_min = -0.03
    for i in range(training_epochs):
        grad = df(local_min)
        local_min = local_min - lr*grad
    stopping_points[idx] = local_min

plt.plot(lrs, stopping_points, 's-')
plt.xlabel('learning rates')
plt.ylabel('Stopping points')
plt.show()
So this suggests that a learning rate between 0.003 and 0.005 is a good start.
lrs = np.linspace(1e-10, 1e-2, 30)
training_epochs = np.round(np.linspace(10, 1000, 50))

stopping_points = np.zeros((len(lrs), len(training_epochs)))

for lr_idx, lr in enumerate(lrs):
    for tr_idx, tr_epoch in enumerate(training_epochs):
        local_min = -0.03

        for i in range(int(tr_epoch)):
            grad = df(local_min)
            local_min = local_min - lr * grad

        stopping_points[lr_idx, tr_idx] = local_min

# Transpose so rows (y-axis) correspond to epochs and columns (x-axis) to
# learning rates, matching the axis labels; origin='lower' puts small epochs at the bottom
plt.imshow(stopping_points.T, extent=[lrs[0], lrs[-1],
           training_epochs[0], training_epochs[-1]], aspect='auto',
           origin='lower', vmin=0, vmax=1.5)
plt.xlabel('learning rate')
plt.ylabel('training epochs')
plt.colorbar()
plt.show()
We know the expected minimum is at \(x \approx 1.17\), so the color corresponding to it is light. If we choose a very small learning rate and/or too few training epochs, we end up in the darker region, which is not a good approximation.
In 2D
Let’s imagine a hypothetical scenario: Walmart Inc. wants to expand its business into a new town. The company wants to place its store at a location such that the total distance from the store to all the houses in the neighborhood is as small as possible. Given the coordinates of the \(n\) houses in the neighborhood, find the optimal location for the store.
The Euclidean distance between two points \((x_1,y_1)\) and \((x_2,y_2)\) is given by
\[d=\sqrt{(x_1-x_2)^2+(y_1-y_2)^2}\]
Assume that \(P=(x,y)\) is the coordinate of the Walmart store. Then, for a total of \(n\) such points, the total distance \(D\) from the point \(P\) is a function of the two variables \(x\) and \(y\) of the following form:
\[D=f(x,y)=\sum_{i=1}^{n}\sqrt{(x-x_i)^2+(y-y_i)^2}\]
import plotly.offline as iplot
import plotly as py
py.offline.init_notebook_mode(connected=True)
import plotly.graph_objects as go

def f(x, y, c, d):
    return np.sqrt((x-c)**2+(y-d)**2)

x = np.linspace(-10, 10, 400)
y = np.linspace(-10, 10, 400)
x, y = np.meshgrid(x, y)

c, d = 0, 0

z = f(x, y, c, d)

fig = go.Figure(data=[go.Surface(z=z, x=x, y=y)])

fig.update_layout(
    title='3D plot',
    scene=dict(
        xaxis_title='x',
        yaxis_title='y',
        zaxis_title='z'
    )
)
fig.show()
All we need to do is minimize the function \(f(x,y)\), and for that we need the gradient vector, whose components are the partial derivatives of \(f(x,y)\) with respect to \(x\) and \(y\). So,
\[\begin{align*} \frac{\partial f}{\partial x}& = \sum_{i=1}^{n} \frac{x-x_i}{\sqrt{(x-x_i)^2+(y-y_i)^2}}\\ \frac{\partial f}{\partial y}& = \sum_{i=1}^{n} \frac{y-y_i}{\sqrt{(x-x_i)^2+(y-y_i)^2}}\\ & \\ \implies \nabla f(x,y) &= \begin{bmatrix}\frac{\partial f}{\partial x}\\\frac{\partial f}{\partial y}\end{bmatrix} \end{align*}\]
Then the algorithm
\[\begin{align*} \begin{bmatrix}x_{i+1}\\y_{i+1}\end{bmatrix}&= \begin{bmatrix}x_{i}\\y_{i}\end{bmatrix} - \eta_i \begin{bmatrix}\frac{\partial f}{\partial x}|_{x_i}\\\frac{\partial f}{\partial y}|_{y_i}\end{bmatrix} \end{align*}\]
where, the \(\eta\) is the step size or learning rate that scales the size of the move towards the opposite of the gradient direction.
Next, how do we control the numerical stability of the algorithm? We decrease the step size at each iteration; the factor by which it shrinks is called the decay rate. We also need a termination criterion, or tolerance level, that determines when we can stop the iteration. Sometimes, for a function with a deep, narrow minimum, the process oscillates back and forth around a range of values; in this case, applying a damping factor increases the chance of a smooth convergence.
import numpy as np
import matplotlib.pyplot as plt
import math

class GDdistanceMin:
    def __init__(self, step_size=1, decay_rate=0.99, tolerance=1e-7, damping_rate=0.75, points=[]):
        self.step_size = step_size
        self.decay_rate = decay_rate
        self.tolerance = tolerance
        self.damping_rate = damping_rate
        self.points = points
        self.x = sum(x for x, y in points) / len(points)  # Initialization at the centroid
        self.y = sum(y for x, y in points) / len(points)  # Initialization at the centroid
        self.x_updates = []
        self.y_updates = []

    def _partial_derivative_x(self, x, y):
        grad_x = 0
        for xi, yi in self.points:
            if x != xi or y != yi:  # skip a house point, where the gradient is undefined
                grad_x += (x - xi) / math.sqrt((x - xi)**2 + (y - yi)**2)
        return grad_x

    def _partial_derivative_y(self, x, y):
        grad_y = 0
        for xi, yi in self.points:
            if x != xi or y != yi:
                grad_y += (y - yi) / math.sqrt((x - xi)**2 + (y - yi)**2)
        return grad_y

    def gradient_descent(self):
        dx, dy = 0, 0
        while self.step_size > self.tolerance:
            # Damping: carry over a fraction of the previous step (momentum-like)
            dx = self._partial_derivative_x(self.x, self.y) + self.damping_rate * dx
            dy = self._partial_derivative_y(self.x, self.y) + self.damping_rate * dy
            self.x -= self.step_size * dx
            self.x_updates.append(self.x)
            self.y -= self.step_size * dy
            self.y_updates.append(self.y)
            self.step_size *= self.decay_rate  # decay the step size every iteration
        return (self.x, self.y)

def f(x, y, c, d):
    return np.sqrt((x - c)**2 + (y - d)**2)

# Define points
points = [(1, 3), (-2, 4), (3, 4), (-2, 1), (9, 2), (-5, 2)]
gd_min = GDdistanceMin(points=points)
min_pt = gd_min.gradient_descent()
xs = gd_min.x_updates
ys = gd_min.y_updates
print("Minimum point:", min_pt)

c, d = min_pt

# Create a grid for plotting
x = np.linspace(-10, 10, 400)
y = np.linspace(-10, 10, 400)
x_grid, y_grid = np.meshgrid(x, y)
z = f(x_grid, y_grid, c, d)

# Calculate z values for the updates
zs = [f(xi, yi, c, d) for xi, yi in zip(xs, ys)]

# Plotting
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(x_grid, y_grid, z, cmap='viridis', alpha=0.6)
ax.scatter(xs, ys, zs, color='red', s=50, label="Updates", marker='o')
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('Z')
plt.gca().set_facecolor('#f4f4f4')
plt.gcf().patch.set_facecolor('#f4f4f4')
plt.show()
Minimum point: (0.9999997458022071, 2.9999999769909707)
Different approach2
import sympy as sym
sym.init_printing()
from IPython.display import display, Math
from matplotlib_inline.backend_inline import set_matplotlib_formats
set_matplotlib_formats('svg')

def distance_function(x, y, c, d):
    x, y = np.meshgrid(x, y)
    z = np.sqrt((x-c)**2+(y-d)**2)
    return z

c, d = 0, 0
x = np.linspace(-5, 5, 400)
y = np.linspace(-5, 5, 400)
z = distance_function(x, y, c, d)

plt.imshow(z, extent=[x[0], x[-1], y[0], y[-1]], vmin=-5, vmax=5, origin='lower')
plt.show()
So, the minimum is at the center of the dark-colored area. Let’s compute the derivatives using the sympy library.
sx, sy = sym.symbols('sx,sy')
sz = sym.sqrt((sx-c)**2+(sy-d)**2)

df_sx = sym.lambdify((sx, sy), sym.diff(sz, sx), 'sympy')
df_sy = sym.lambdify((sx, sy), sym.diff(sz, sy), 'sympy')

display(Math(f"\\frac{{\\partial f}}{{\\partial x}}={sym.latex(sym.diff(sz,sx))} \
\\text{{ and }} \\frac{{\\partial f}}{{\\partial x}}\\mid_{{(1,1)}} = {df_sx(1,1).evalf()}"))
display(Math(f"\\frac{{\\partial f}}{{\\partial y}}={sym.latex(sym.diff(sz,sy))} \
\\text{{ and }} \\frac{{\\partial f}}{{\\partial y}}\\mid_{{(1,1)}} = {df_sy(1,1).evalf()}"))
\(\displaystyle \frac{\partial f}{\partial x}=\frac{sx}{\sqrt{sx^{2} + sy^{2}}} \text{ and } \frac{\partial f}{\partial x}\mid_{(1,1)} = 0.707106781186548\)
\(\displaystyle \frac{\partial f}{\partial y}=\frac{sy}{\sqrt{sx^{2} + sy^{2}}} \text{ and } \frac{\partial f}{\partial y}\mid_{(1,1)} = 0.707106781186548\)
Now let’s implement GD
# Let's pick a random starting point (uniformly distributed between -2 and 2)
local_min = np.random.rand(2)*4 - 2
start_point = local_min.copy()  # make a copy (slicing a numpy array only gives a view)

learning_rate = 0.01
training_epochs = 1000

trajectory = np.zeros((training_epochs, 2))

for i in range(training_epochs):
    grad = np.array([
        df_sx(local_min[0], local_min[1]).evalf(),
        df_sy(local_min[0], local_min[1]).evalf()
    ])
    local_min = local_min - learning_rate*grad
    trajectory[i, :] = local_min

print('Starting point ', start_point)
print('Local min found ', local_min)

plt.imshow(z, extent=[x[0], x[-1], y[0], y[-1]], vmin=-5, vmax=5, origin='lower')
plt.plot(start_point[0], start_point[1], 'bs')
plt.plot(local_min[0], local_min[1], 'rs')
plt.plot(trajectory[:, 0], trajectory[:, 1], 'b', linewidth=3)
plt.legend(['random start', 'local min'])
plt.colorbar()
plt.show()
Starting point [1.22754924 0.7182846 ]
Local min found [0.00194648371310253 0.00113895983415082]
Implementation of SGD
Personally, I think this webpage is one of the best resources for learning how to implement SGD.
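That said, here is a minimal sketch of the mini-batch variant mentioned earlier, applied to a least-squares cost with a per-epoch decaying learning rate. The batch size, schedule, and synthetic data are all illustrative assumptions, not the linked resource’s implementation:

import numpy as np

rng = np.random.default_rng(1)
n, p = 500, 4
X = rng.normal(size=(n, p))            # synthetic design matrix
beta_true = rng.normal(size=p)
y = X @ beta_true + 0.05*rng.normal(size=n)

def minibatch_sgd(X, y, eta0=0.1, batch_size=32, epochs=50):
    beta = np.zeros(X.shape[1])
    for epoch in range(epochs):
        eta = eta0 / (1 + epoch)              # simple decaying step size
        idx = rng.permutation(len(y))         # shuffle the data once per epoch
        for start in range(0, len(y), batch_size):
            b = idx[start:start + batch_size]
            resid = X[b] @ beta - y[b]
            grad = X[b].T @ resid / len(b)    # average gradient over the mini batch
            beta -= eta * grad
    return beta

print(minibatch_sgd(X, y))  # should land close to beta_true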
Footnotes
The implementation is taken from the Udemy course A Deep Understanding of Deep Learning.↩︎
The implementation is taken from the Udemy course A Deep Understanding of Deep Learning.↩︎