Implementation of Gradient Descent (GD) and Stochastic Gradient Descent (SGD)

Data Science

Machine Learning

Stochastic Gradient Descent

Optimization

Author

Rafiq Islam

Published

September 19, 2024

Gradient Descent

Collected from gbhat.com
GIF Credit: gbhat.com

Gradient Descent is an optimization algorithm used to minimize the cost function. The cost function $f(\beta)$ measures how well a model with parameters $\beta$ fits the data. The goal is to find the values of $\beta$ that minimize this cost function. In terms of the iterative method, we want to find $\beta_{k+1}$ and $\beta_k$ such that $f(\beta_{k+1})<f(\beta_k)$.

For a small change in $\beta$, we can approximate $f(\beta)$ using Taylor series expansion
\[f(\beta_{k+1})=f(\beta_k +\Delta\beta_k)\approx f(\beta_k)+\nabla f(\beta_k)^T \Delta \beta_k+\text{higher-order terms}\]

The update rule for vanilla gradient descent is given by:

\[ \beta_{k+1} = \beta_k - \eta \nabla f(\beta_k) \]

Where:

$\beta_k$ is the current estimate of the parameters at iteration $k$.
$\eta$ is the learning rate, a small positive scalar that controls the step size.
$\nabla f(\beta_k)$ is the gradient of the cost function $f$ with respect to $\beta$ at the current point $\beta_k$.

The update rule comes from the idea of moving the parameter vector $\beta$ in the direction that decreases the cost function the most.

Gradient: The gradient $\nabla f(\beta_k)$ represents the direction and magnitude of the steepest ascent of the function $f$ at the point $\beta_k$. Since we want to minimize the function, we move in the opposite direction of the gradient.
Step Size: The term $\eta \nabla f(\beta_k)$ scales the gradient by the learning rate $\eta$, determining how far we move in that direction. If $\eta$ is too large, the algorithm may overshoot the minimum; if it’s too small, the convergence will be slow.
Iterative Update: Starting from an initial guess $\beta_0$, we repeatedly apply the update rule until the algorithm converges, meaning that the changes in $\beta_k$ become negligible, and $\beta_k$ is close to the optimal value $\beta^*$.

Stochastic Gradient Descent (SGD)

Stochastic Gradient Descent is a variation of the vanilla gradient descent. Instead of computing the gradient using the entire dataset, SGD updates the parameters using only a single data point or a small batch of data points at each iteration. The later one we call it mini batch SGD.

Suppose our cost function is defined as the average over a dataset of size $n$:

\[ f(\beta) = \frac{1}{n} \sum_{i=1}^{n} f_i(\beta) \]

Where $f_i(\beta)$ represents the contribution of the $i$-th data point to the total cost function. The gradient of the cost function with respect to $\beta$ is:

\[ \nabla f(\beta) = \frac{1}{n} \sum_{i=1}^{n} \nabla f_i(\beta) \]

Vanilla gradient descent would update the parameters as:

\[ \beta_{k+1} = \beta_k - \eta \nabla f(\beta_k) \]

Instead of using the entire dataset to compute the gradient, SGD approximates the gradient by using only a single data point (or a small batch). The update rule for SGD is:

\[ \beta_{k+1} = \beta_k - \eta \nabla f_{i_k}(\beta_k) \]

Where:

$i_k$ is the index of a randomly selected data point at iteration $k$.
$\nabla f_{i_k}(\beta_k)$ is the gradient of the cost function with respect to the parameter $\beta_k$, evaluated only at the data point indexed by $i_k$.

Implementation of GD

In 1D

Assume that we have a function $f(x)=x^2-3x+\frac{13}{4}$ which we want to minimize. Meaning, we want to find $x$ that minimizes the function.

Next, let’s implement the GD¹

# Define the function
def f(x):
    return x**2-3*x+(13/4)
# Define the gradient function 
def grad_f(x):
    return 2*x-3

# Take a random guess from the domain
local_min = np.random.choice(x)
print('Initial guess: ',local_min )

# Hyper-parameters 
learning_rate = 0.01
training_epochs = 200

model_params = np.zeros((training_epochs,2))
for i in range(training_epochs):
    grad = grad_f(local_min)
    local_min =local_min - learning_rate*grad 
    model_params[i,:] = [local_min, grad]

print('Empirical/estimated local minimum:',local_min)

fig, axs = plt.subplots(1,2, figsize=(8.5,3.8))
for i in range(2):
    axs[i].plot(model_params[:,i],'-')
    axs[i].set_xlabel('Iteration')
    axs[i].set_title(f'Final estimated Min: {local_min:.5f}')
axs[0].set_ylabel('Local Min')
axs[1].set_ylabel('Derivative')
plt.show()

Initial guess:  0.025025025025025016
Empirical/estimated local minimum: 1.4740582188953646

Alternatively, if we want to set a tolerance, this is how we set that

# Take a random guess from the domain
local_min = np.random.choice(x)
print('Initial guess:', local_min)

# Hyper-parameters 
learning_rate = 0.01
tolerance = 1e-2  # Stop if gradient is smaller than this value

# Lists to store optimization progress
local_min_vals = []
grad_vals = []
iteration = 0  # Track the number of iterations

# Gradient Descent Loop (Runs until gradient is small enough)
while True:
    grad = grad_f(local_min)
    # Stop when gradient is smaller than tolerance
    if abs(grad) < tolerance:
        print(f"Stopping at iteration {iteration} (grad={grad:.5f})")
        break
    # Update local minimum
    local_min = local_min - learning_rate * grad  
    # Store values
    local_min_vals.append(local_min)
    grad_vals.append(grad)
    iteration += 1  # Increment iteration count

print('Empirical/estimated local minimum:', local_min)

# Convert lists to numpy arrays
local_min_vals = np.array(local_min_vals)
grad_vals = np.array(grad_vals)

# Plot Results
fig, axs = plt.subplots(1, 2, figsize=(8.5, 3.8))

axs[0].plot(local_min_vals, '.', label="Local Min Estimate")
axs[0].set_xlabel('Iteration')
axs[0].set_ylabel('Local Min')
axs[0].legend()

axs[1].plot(grad_vals, '.', label="Gradient Value")
axs[1].set_xlabel('Iteration')
axs[1].set_ylabel('Derivative')
axs[1].legend()

plt.suptitle(f'Final estimated Min: {local_min:.5f}')
plt.show()

Initial guess: 2.701701701701702
Stopping at iteration 272 (grad=0.00987)
Empirical/estimated local minimum: 1.5049350239483508

Parametric variation

Let’s consider a different problem, $f(x)=-e^{-\frac{x^2}{10}}(0.2x\cos x+\sin x)$. We want to find the $x$ values and optimal hyper-parameters that minimizes $f(x)$.

x = np.linspace(-3*np.pi, 3*np.pi, 400)
y = -np.exp(-(x**2/10))*(0.2*x*np.cos(x)+np.sin(x))

dy = np.exp(-(x**2/10))*((0.04*x**2-1.2)*np.cos(x)+0.4*x*np.sin(x))
plt.plot(x,y, x, dy)
plt.legend(['f(x)','df'])
plt.show()

Clearly, the global minimum is somewhere in between 0 to 2.5, maybe around 1.25. Now let’s apply GD with varying parameters

def f(x):
    f = -np.exp(-(x**2/10))*(0.2*x*np.cos(x)+np.sin(x))
    return f
def df(x):
    ddx_of_f = np.exp(-(x**2/10))*((0.04*x**2-1.2)*np.cos(x)+0.4*x*np.sin(x))
    return ddx_of_f

learning_rate = 0.003
training_epochs = 1000
local_min = np.random.choice(x,1)
print('The chosen initial point: ', local_min)

for i in range(training_epochs):
    grad = df(local_min)
    local_min = local_min - learning_rate*grad 

plt.plot(x,f(x), x, df(x),'--')
plt.plot(local_min, df(local_min),'ro')
plt.plot(local_min, f(local_min),'ro')
plt.xlabel('x')
plt.ylabel('f(x)')
plt.legend(['f(x)','df','f(x) min'])
plt.title('Iterated local min at x = %s'%local_min[0])
plt.show()

The chosen initial point:  [5.17299843]

Now let’s see how we can pick the right initial point

# Vary the starting point 
starting_points = np.linspace(-5,5,50)
stopping_points = np.zeros(len(starting_points))

training_epochs = 1000
learning_rate = 0.01
for idx, local_min in enumerate(starting_points):
    for i in range(training_epochs):
        grad = df(local_min)
        local_min = local_min - learning_rate * grad 
    stopping_points[idx] = local_min
plt.plot(starting_points, stopping_points, 's-')
plt.xlabel('Starting points')
plt.ylabel('Stopping points')
plt.show()

Based on this, if we start with any point roughly between $(-1,3.5)$ we will end up with the global minimum. However, points outside this interval may take the iterative process to other local minima.

# varying learning rates 
lrs = np.linspace(1e-10, 1e-2, 30)
stopping_points = np.zeros(len(lrs))

training_epochs = 1000

for idx, lr  in enumerate(lrs):
    # fixed initial point 
    local_min = -0.03
    for i in range(training_epochs):
        grad = df(local_min)
        local_min = local_min - lr*grad 
    stopping_points[idx] = local_min
plt.plot(lrs, stopping_points, 's-')
plt.xlabel('learning rates')
plt.ylabel('Stopping points')
plt.show()

So this suggests that a learning rate between 0.003 to 0.005 is a good start.

lrs = np.linspace(1e-10, 1e-2, 30)
training_epochs = np.round(np.linspace(10,1000,50))

stopping_points = np.zeros((len(lrs), len(training_epochs)))

for lr_idx, lr in enumerate(lrs):
    for tr_idx, tr_epoch in enumerate(training_epochs):
        local_min = -0.03
        
        for i in range(int(tr_epoch)):
            grad = df(local_min)
            local_min = local_min - lr * grad 

        stopping_points[lr_idx, tr_idx] = local_min
plt.imshow(stopping_points, extent=[lrs[0], lrs[-1],\
     training_epochs[0], training_epochs[-1]], aspect='auto',\
        vmin=0, vmax=1.5)
plt.xlabel('learning rate')
plt.ylabel('training epochs')
plt.colorbar()
plt.show()

We know that the expected minimum at $x \approx 1.17$ so the color corresponding to this is light. Therefore, if we choose a very small learning rate and/or a larger training epoch then we end up to the darker area which is not a good approximation.

plt.plot(lrs, stopping_points)
plt.xlabel('learning rate')
plt.ylabel('Iterated solution')
plt.title('Each line is a training epoch N')
plt.show

In 2D

Let’s imagine a hypothetical scenario, Walmart Inc. wants to explore their business in a new twon. They want to have their store in location so that the total distance of the store from all the houses in the neighborhood is the smallest possible. If they have the data of $n$ houses with corresponding coordinates of the houses, return the optimized location for the store.

The Euclidean distance between two points $(x_1,y_1)$ and $(x_2,y_2)$ is given by
\[d=\sqrt{(x_1-x_2)^2+(y_1-y_2)}\]

Assume that $P=(x,y)$ is the coordinate of Walmart. So for a total of $n$ such points the total distance $D$ from the point $P$ is a function of two variable $x$ and $y$ of the following form

\[D=f(x,y)=\sum_{i=1}^{n}\sqrt{(x-x_i)^2+(y-y_i)^2}\]

import plotly.offline as iplot
import plotly as py
py.offline.init_notebook_mode(connected=True)
import plotly.graph_objects as go


def f(x,y, c, d):
    return np.sqrt((x-c)**2+(y-d)**2)

x = np.linspace(-10,10, 400)
y = np.linspace(-10,10, 400)
x, y = np.meshgrid(x,y)

c, d = 0,0

z = f(x, y, c, d)

fig = go.Figure(data=[go.Surface(z=z, x=x, y=y)])
fig.update_layout(
    title='3D plot',
    scene=dict(
        xaxis_title = 'x',
        yaxis_title = 'y',
        zaxis_title = 'z'
    )
)

fig.show()

all we need to do is to minimize the function $f(x,y)$ and to do that we need to calculate the gradient vector which is the partial derivative of $f(x,y)$ with respect to $x$ and $y$. So,

\[\begin{align*} \frac{\partial f}{\partial x}& = \sum_{i=1}^{n} \frac{x-x_i}{\sqrt{(x-x_i)^2+(y-y_i)^2}}\\ \frac{\partial f}{\partial y}& = \sum_{i=1}^{n} \frac{y-y_i}{\sqrt{(x-x_i)^2+(y-y_i)^2}}\\ & \\ \implies \nabla f(x,y) &= \begin{bmatrix}\frac{\partial f}{\partial x}\\\frac{\partial f}{\partial y}\end{bmatrix} \end{align*}\]

Then the algorithm

\[\begin{align*} \begin{bmatrix}x_{i+1}\\y_{i+1}\end{bmatrix}&= \begin{bmatrix}x_{i}\\y_{i}\end{bmatrix} - \eta_i \begin{bmatrix}\frac{\partial f}{\partial x}|_{x_i}\\\frac{\partial f}{\partial y}|_{y_i}\end{bmatrix} \end{align*}\]

where, the $\eta$ is the step size or learning rate that scales the size of the move towards the opposite of the gradient direction.

Next, how do we control the numerical stability of the algorithm? We need to decrease the step size at each iteration which. This is called the rate of decay. We also need a termination factor or tolerance level that determines if we can stop the iteration. Sometimes, for a deep down convex function, the process oscillates back and forth around a range of values. In this case, applying a damping factor increases the chance for a smooth convergence.

import numpy as np
import matplotlib.pyplot as plt
import math

class GDdistanceMin:
    def __init__(self, step_size=1, decay_rate=0.99, tolerance=1e-7, damping_rate=0.75, points=[]):
        self.step_size = step_size
        self.decay_rate = decay_rate
        self.tolerance = tolerance
        self.damping_rate = damping_rate
        self.points = points
        self.x = sum(x for x, y in points) / len(points)  # Initialization
        self.y = sum(y for x, y in points) / len(points)  # Initialization
        self.x_updates = []
        self.y_updates = []

    def _partial_derivative_x(self, x, y):
        grad_x = 0
        for xi, yi in self.points:
            if x != xi or y != yi:
                grad_x += (x - xi) / math.sqrt((x - xi)**2 + (y - yi)**2)
        return grad_x

    def _partial_derivative_y(self, x, y):
        grad_y = 0
        for xi, yi in self.points:
            if x != xi or y != yi:
                grad_y += (y - yi) / math.sqrt((x - xi)**2 + (y - yi)**2)
        return grad_y

    def gradient_descent(self):
        dx, dy = 0, 0
        while self.step_size > self.tolerance:
            dx = self._partial_derivative_x(self.x, self.y) + self.damping_rate * dx 
            dy = self._partial_derivative_y(self.x, self.y) + self.damping_rate * dy 
            self.x -= self.step_size * dx 
            self.x_updates.append(self.x)
            self.y -= self.step_size * dy 
            self.y_updates.append(self.y)
            self.step_size *= self.decay_rate
        return (self.x, self.y)

def f(x, y, c, d):
    return np.sqrt((x - c)**2 + (y - d)**2)

# Define points
points = [(1, 3), (-2, 4), (3, 4), (-2, 1), (9, 2), (-5, 2)]
gd_min = GDdistanceMin(points=points)
min_pt = gd_min.gradient_descent()
xs = gd_min.x_updates
ys = gd_min.y_updates
print("Minimum point:", min_pt)

c, d = min_pt

# Create a grid for plotting
x = np.linspace(-10, 10, 400)
y = np.linspace(-10, 10, 400)
x_grid, y_grid = np.meshgrid(x, y)
z = f(x_grid, y_grid, c, d)

# Calculate z values for the updates
zs = [f(xi, yi, c, d) for xi, yi in zip(xs, ys)]

# Plotting
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(x_grid, y_grid, z, cmap='viridis', alpha=0.6)
ax.scatter(xs, ys, zs, color='red', s=50, label="Updates", marker='o')
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('Z')
plt.gca().set_facecolor('#f4f4f4') 
plt.gcf().patch.set_facecolor('#f4f4f4')
plt.show()

Minimum point: (0.9999997458022071, 2.9999999769909707)

Different approach²

import sympy as sym 
sym.init_printing()
from IPython.display import display, Math
from matplotlib_inline.backend_inline import set_matplotlib_formats
set_matplotlib_formats('svg') 


def distance_function(x,y,c,d):
    x,y = np.meshgrid(x,y)
    z = np.sqrt((x-c)**2+(y-d)**2)
    return z

c,d = 0,0
x = np.linspace(-5,5, 400)
y = np.linspace(-5,5, 400)
z = distance_function(x,y,c,d)

plt.imshow(z, extent=[x[0],x[-1], y[0],y[-1]],vmin=-5, vmax=5, origin='lower')
plt.show()

So, the minimum is in the center of the deep color area. Let’s compute the derivatives using sympy library

sx, sy = sym.symbols('sx,sy')
sz = sym.sqrt((sx-c)**2+(sy-d)**2)

df_sx = sym.lambdify((sx,sy), sym.diff(sz, sx), 'sympy')
df_sy = sym.lambdify((sx,sy), sym.diff(sz, sy), 'sympy')

display(Math(f"\\frac{{\\partial f}}{{\\partial x}}={sym.latex(sym.diff(sz,sx))}\
     \\text{{ and }}  \\frac{{\\partial f}}{{\\partial x}}\\mid_{{(1,1)}} = {df_sx(1,1).evalf()}"))
display(Math(f"\\frac{{\\partial f}}{{\\partial y}}={sym.latex(sym.diff(sz,sy))} \
     \\text{{ and }}  \\frac{{\\partial f}}{{\\partial y}}\\mid_{{(1,1)}} = {df_sy(1,1).evalf()}"))

$\displaystyle \frac{\partial f}{\partial x}=\frac{sx}{\sqrt{sx^{2} + sy^{2}}} \text{ and } \frac{\partial f}{\partial x}\mid_{(1,1)} = 0.707106781186548$

$\displaystyle \frac{\partial f}{\partial y}=\frac{sy}{\sqrt{sx^{2} + sy^{2}}} \text{ and } \frac{\partial f}{\partial y}\mid_{(1,1)} = 0.707106781186548$

Now let’s implement the GD

# Let's pick a random starting point(uniformly distributed between -2 and 2)
local_min = np.random.rand(2)*4-2
start_point = local_min[:] # make a copy

learning_rate = 0.01
training_epochs = 1000

trajectory = np.zeros((training_epochs,2))

for i in range(training_epochs):
    grad = np.array([
        df_sx(local_min[0],local_min[1]).evalf(),
        df_sy(local_min[0],local_min[1]).evalf()
    ])
    local_min = local_min - learning_rate*grad
    trajectory[i,:] = local_min

print('Starting point ',start_point)
print('Local min found ',local_min)

plt.imshow(z, extent=[x[0],x[-1], y[0],y[-1]],vmin=-5, vmax=5, origin='lower')
plt.plot(start_point[0], start_point[1], 'bs')
plt.plot(local_min[0],local_min[1],'rs')
plt.plot(trajectory[:,0],trajectory[:,1], 'b', linewidth=3)
plt.legend(['random start', 'local min'])
plt.colorbar()
plt.show()

Starting point  [1.22754924 0.7182846 ]
Local min found  [0.00194648371310253 0.00113895983415082]

Implementation of SGD

Personally, I think this webpage is one of the best resources for learning how to implement SGD.

Share on

You may also like

Footnotes

The implementation is taken from the the Udemy course A Deep Understanding of Deep Learning ↩︎
The implementation is taken from the the Udemy course A Deep Understanding of Deep Learning ↩︎

Citation

BibTeX citation:

@online{islam2024,
  author = {Islam, Rafiq},
  title = {Implementation of {Gradient} {Descent} {(GD)} and
    {Stochastic} {Gradient} {Descent} {(SGD)}},
  date = {2024-09-19},
  url = {https://mrislambd.github.io/posts/sgd/},
  langid = {en}
}

For attribution, please cite this work as:

Islam, Rafiq. 2024. “Implementation of Gradient Descent (GD) and Stochastic Gradient Descent (SGD).” September 19, 2024. https://mrislambd.github.io/posts/sgd/.

--- title: 'Implementation of Gradient Descent (GD) and Stochastic Gradient Descent (SGD)' author: "Rafiq Islam" date: 2024-09-19 categories: [Data Science, Machine Learning, Stochastic Gradient Descent, Optimization] citation: true search: true lightbox: true image: sgd.png listing: contents: "/../../posts" max-items: 3 type: grid categories: false date-format: full fields: [image, date, title, author, reading-time] --- ## Gradient Descent <p style="text-align: justify"> <img align="right" height=350 width=450 src="/_assets/images/gradient_descent.gif" alt="Collected from gbhat.com" style="margin-left: 20px; margin-bottom: 20px"/> <br> GIF Credit: <a href="https://gbhat.com/machine_learning/gradient_descent_anim.html" target="_blank" style="text-decoration:none">gbhat.com</a> <br> Gradient Descent is an optimization algorithm used to minimize the cost function. The cost function $f(\beta)$ measures how well a model with parameters $\beta$ fits the data. The goal is to find the values of $\beta$ that minimize this cost function. In terms of the iterative method, we want to find $\beta_{k+1}$ and $\beta_k$ such that $f(\beta_{k+1})<f(\beta_k)$. <br> <br> For a small change in $\beta$, we can approximate $f(\beta)$ using Taylor series expansion $$f(\beta_{k+1})=f(\beta_k +\Delta\beta_k)\approx f(\beta_k)+\nabla f(\beta_k)^T \Delta \beta_k+\text{higher-order terms}$$ </p> The update rule for vanilla gradient descent is given by: $$ \beta_{k+1} = \beta_k - \eta \nabla f(\beta_k) $$ Where: - $\beta_k$ is the current estimate of the parameters at iteration $k$. - $\eta$ is the learning rate, a small positive scalar that controls the step size. - $\nabla f(\beta_k)$ is the gradient of the cost function $f$ with respect to $\beta$ at the current point $\beta_k$. <p style="text-align: justify"> The update rule comes from the idea of moving the parameter vector $\beta$ in the direction that decreases the cost function the most. 1. **Gradient**: The gradient $\nabla f(\beta_k)$ represents the direction and magnitude of the steepest ascent of the function $f$ at the point $\beta_k$. Since we want to minimize the function, we move in the opposite direction of the gradient. 2. **Step Size**: The term $\eta \nabla f(\beta_k)$ scales the gradient by the learning rate $\eta$, determining how far we move in that direction. If $\eta$ is too large, the algorithm may overshoot the minimum; if it's too small, the convergence will be slow. 3. **Iterative Update**: Starting from an initial guess $\beta_0$, we repeatedly apply the update rule until the algorithm converges, meaning that the changes in $\beta_k$ become negligible, and $\beta_k$ is close to the optimal value $\beta^*$. ## Stochastic Gradient Descent (SGD) <p style="text-align: justify"> Stochastic Gradient Descent is a variation of the vanilla gradient descent. Instead of computing the gradient using the entire dataset, SGD updates the parameters using only a single data point or a small batch of data points at each iteration. The later one we call it mini batch SGD. </p> Suppose our cost function is defined as the average over a dataset of size $n$: $$ f(\beta) = \frac{1}{n} \sum_{i=1}^{n} f_i(\beta) $$ Where $f_i(\beta)$ represents the contribution of the $i$-th data point to the total cost function. The gradient of the cost function with respect to $\beta$ is: $$ \nabla f(\beta) = \frac{1}{n} \sum_{i=1}^{n} \nabla f_i(\beta) $$ Vanilla gradient descent would update the parameters as: $$ \beta_{k+1} = \beta_k - \eta \nabla f(\beta_k) $$ Instead of using the entire dataset to compute the gradient, SGD approximates the gradient by using only a single data point (or a small batch). The update rule for SGD is: $$ \beta_{k+1} = \beta_k - \eta \nabla f_{i_k}(\beta_k) $$ Where: - $i_k$ is the index of a randomly selected data point at iteration $k$. - $\nabla f_{i_k}(\beta_k)$ is the gradient of the cost function with respect to the parameter $\beta_k$, evaluated only at the data point indexed by $i_k$. ## Implementation of GD ### In 1D Assume that we have a function $f(x)=x^2-3x+\frac{13}{4}$ which we want to minimize. Meaning, we want to find $x$ that minimizes the function. ```{python} #| echo: false import numpy as np import matplotlib.pyplot as plt from mywebstyle import plot_style # Apply custom plot style plot_style('#f4f4f4') # Generate data x = np.linspace(-2, 5, 1000) y = x**2 - 3*x + (13/4) # Create plot plt.plot(x, y, label='$y = x^2 - 3x + 13/4$') plt.plot(x,2*x-3, label="y'=2x-3") plt.ylim(-4,14) # Add X and Y axes plt.axhline(0, color='black', linewidth=1) plt.axvline(0, color='black', linewidth=1) # Show grid and legend plt.grid(True, linestyle='--', alpha=0.9) plt.legend() # Show plot plt.show() ``` Next, let's implement the GD[^1] ```{python} # Define the function def f(x): return x**2-3*x+(13/4) # Define the gradient function def grad_f(x): return 2*x-3 # Take a random guess from the domain local_min = np.random.choice(x) print('Initial guess: ',local_min ) # Hyper-parameters learning_rate = 0.01 training_epochs = 200 model_params = np.zeros((training_epochs,2)) for i in range(training_epochs): grad = grad_f(local_min) local_min =local_min - learning_rate*grad model_params[i,:] = [local_min, grad] print('Empirical/estimated local minimum:',local_min) fig, axs = plt.subplots(1,2, figsize=(8.5,3.8)) for i in range(2): axs[i].plot(model_params[:,i],'-') axs[i].set_xlabel('Iteration') axs[i].set_title(f'Final estimated Min: {local_min:.5f}') axs[0].set_ylabel('Local Min') axs[1].set_ylabel('Derivative') plt.show() ``` Alternatively, if we want to set a tolerance, this is how we set that ```{python} # Take a random guess from the domain local_min = np.random.choice(x) print('Initial guess:', local_min) # Hyper-parameters learning_rate = 0.01 tolerance = 1e-2 # Stop if gradient is smaller than this value # Lists to store optimization progress local_min_vals = [] grad_vals = [] iteration = 0 # Track the number of iterations # Gradient Descent Loop (Runs until gradient is small enough) while True: grad = grad_f(local_min) # Stop when gradient is smaller than tolerance if abs(grad) < tolerance: print(f"Stopping at iteration {iteration} (grad={grad:.5f})") break # Update local minimum local_min = local_min - learning_rate * grad # Store values local_min_vals.append(local_min) grad_vals.append(grad) iteration += 1 # Increment iteration count print('Empirical/estimated local minimum:', local_min) # Convert lists to numpy arrays local_min_vals = np.array(local_min_vals) grad_vals = np.array(grad_vals) # Plot Results fig, axs = plt.subplots(1, 2, figsize=(8.5, 3.8)) axs[0].plot(local_min_vals, '.', label="Local Min Estimate") axs[0].set_xlabel('Iteration') axs[0].set_ylabel('Local Min') axs[0].legend() axs[1].plot(grad_vals, '.', label="Gradient Value") axs[1].set_xlabel('Iteration') axs[1].set_ylabel('Derivative') axs[1].legend() plt.suptitle(f'Final estimated Min: {local_min:.5f}') plt.show() ``` #### Parametric variation Let's consider a different problem, $f(x)=-e^{-\frac{x^2}{10}}(0.2x\cos x+\sin x)$. We want to find the $x$ values and optimal hyper-parameters that minimizes $f(x)$. ```{python} x = np.linspace(-3*np.pi, 3*np.pi, 400) y = -np.exp(-(x**2/10))*(0.2*x*np.cos(x)+np.sin(x)) dy = np.exp(-(x**2/10))*((0.04*x**2-1.2)*np.cos(x)+0.4*x*np.sin(x)) plt.plot(x,y, x, dy) plt.legend(['f(x)','df']) plt.show() ``` Clearly, the global minimum is somewhere in between 0 to 2.5, maybe around 1.25. Now let's apply GD with varying parameters ```{python} def f(x): f = -np.exp(-(x**2/10))*(0.2*x*np.cos(x)+np.sin(x)) return f def df(x): ddx_of_f = np.exp(-(x**2/10))*((0.04*x**2-1.2)*np.cos(x)+0.4*x*np.sin(x)) return ddx_of_f learning_rate = 0.003 training_epochs = 1000 local_min = np.random.choice(x,1) print('The chosen initial point: ', local_min) for i in range(training_epochs): grad = df(local_min) local_min = local_min - learning_rate*grad plt.plot(x,f(x), x, df(x),'--') plt.plot(local_min, df(local_min),'ro') plt.plot(local_min, f(local_min),'ro') plt.xlabel('x') plt.ylabel('f(x)') plt.legend(['f(x)','df','f(x) min']) plt.title('Iterated local min at x = %s'%local_min[0]) plt.show() ``` Now let's see how we can pick the right initial point ```{python} # Vary the starting point starting_points = np.linspace(-5,5,50) stopping_points = np.zeros(len(starting_points)) training_epochs = 1000 learning_rate = 0.01 for idx, local_min in enumerate(starting_points): for i in range(training_epochs): grad = df(local_min) local_min = local_min - learning_rate * grad stopping_points[idx] = local_min plt.plot(starting_points, stopping_points, 's-') plt.xlabel('Starting points') plt.ylabel('Stopping points') plt.show() ``` Based on this, if we start with any point roughly between $(-1,3.5)$ we will end up with the global minimum. However, points outside this interval may take the iterative process to other local minima. ```{python} # varying learning rates lrs = np.linspace(1e-10, 1e-2, 30) stopping_points = np.zeros(len(lrs)) training_epochs = 1000 for idx, lr in enumerate(lrs): # fixed initial point local_min = -0.03 for i in range(training_epochs): grad = df(local_min) local_min = local_min - lr*grad stopping_points[idx] = local_min plt.plot(lrs, stopping_points, 's-') plt.xlabel('learning rates') plt.ylabel('Stopping points') plt.show() ``` So this suggests that a learning rate between 0.003 to 0.005 is a good start. ```{python} lrs = np.linspace(1e-10, 1e-2, 30) training_epochs = np.round(np.linspace(10,1000,50)) stopping_points = np.zeros((len(lrs), len(training_epochs))) for lr_idx, lr in enumerate(lrs): for tr_idx, tr_epoch in enumerate(training_epochs): local_min = -0.03 for i in range(int(tr_epoch)): grad = df(local_min) local_min = local_min - lr * grad stopping_points[lr_idx, tr_idx] = local_min plt.imshow(stopping_points, extent=[lrs[0], lrs[-1],\ training_epochs[0], training_epochs[-1]], aspect='auto',\ vmin=0, vmax=1.5) plt.xlabel('learning rate') plt.ylabel('training epochs') plt.colorbar() plt.show() ``` We know that the expected minimum at $x \approx 1.17$ so the color corresponding to this is light. Therefore, if we choose a very small learning rate and/or a larger training epoch then we end up to the darker area which is not a good approximation. ```{python} plt.plot(lrs, stopping_points) plt.xlabel('learning rate') plt.ylabel('Iterated solution') plt.title('Each line is a training epoch N') plt.show ``` ### In 2D <p style="text-align:justify"> Let's imagine a hypothetical scenario, Walmart Inc. wants to explore their business in a new twon. They want to have their store in location so that the total distance of the store from all the houses in the neighborhood is the smallest possible. If they have the data of $n$ houses with corresponding coordinates of the houses, return the optimized location for the store. </p> The Euclidean distance between two points $(x_1,y_1)$ and $(x_2,y_2)$ is given by $$d=\sqrt{(x_1-x_2)^2+(y_1-y_2)}$$ Assume that $P=(x,y)$ is the coordinate of Walmart. So for a total of $n$ such points the total distance $D$ from the point $P$ is a function of two variable $x$ and $y$ of the following form $$D=f(x,y)=\sum_{i=1}^{n}\sqrt{(x-x_i)^2+(y-y_i)^2}$$ ```{python} import plotly.offline as iplot import plotly as py py.offline.init_notebook_mode(connected=True) import plotly.graph_objects as go def f(x,y, c, d): return np.sqrt((x-c)**2+(y-d)**2) x = np.linspace(-10,10, 400) y = np.linspace(-10,10, 400) x, y = np.meshgrid(x,y) c, d = 0,0 z = f(x, y, c, d) fig = go.Figure(data=[go.Surface(z=z, x=x, y=y)]) fig.update_layout( title='3D plot', scene=dict( xaxis_title = 'x', yaxis_title = 'y', zaxis_title = 'z' ) ) fig.show() ``` all we need to do is to minimize the function $f(x,y)$ and to do that we need to calculate the gradient vector which is the partial derivative of $f(x,y)$ with respect to $x$ and $y$. So, \begin{align*} \frac{\partial f}{\partial x}& = \sum_{i=1}^{n} \frac{x-x_i}{\sqrt{(x-x_i)^2+(y-y_i)^2}}\\ \frac{\partial f}{\partial y}& = \sum_{i=1}^{n} \frac{y-y_i}{\sqrt{(x-x_i)^2+(y-y_i)^2}}\\ & \\ \implies \nabla f(x,y) &= \begin{bmatrix}\frac{\partial f}{\partial x}\\\frac{\partial f}{\partial y}\end{bmatrix} \end{align*} Then the algorithm \begin{align*} \begin{bmatrix}x_{i+1}\\y_{i+1}\end{bmatrix}&= \begin{bmatrix}x_{i}\\y_{i}\end{bmatrix} - \eta_i \begin{bmatrix}\frac{\partial f}{\partial x}|_{x_i}\\\frac{\partial f}{\partial y}|_{y_i}\end{bmatrix} \end{align*} where, the $\eta$ is the step size or learning rate that scales the size of the move towards the opposite of the gradient direction. Next, how do we control the numerical stability of the algorithm? We need to decrease the step size at each iteration which. This is called the `rate of decay`. We also need a termination factor or `tolerance` level that determines if we can stop the iteration. Sometimes, for a deep down convex function, the process oscillates back and forth around a range of values. In this case, applying a `damping factor` increases the chance for a smooth convergence. ```{python} #| code-fold: false import numpy as np import matplotlib.pyplot as plt import math class GDdistanceMin: def __init__(self, step_size=1, decay_rate=0.99, tolerance=1e-7, damping_rate=0.75, points=[]): self.step_size = step_size self.decay_rate = decay_rate self.tolerance = tolerance self.damping_rate = damping_rate self.points = points self.x = sum(x for x, y in points) / len(points) # Initialization self.y = sum(y for x, y in points) / len(points) # Initialization self.x_updates = [] self.y_updates = [] def _partial_derivative_x(self, x, y): grad_x = 0 for xi, yi in self.points: if x != xi or y != yi: grad_x += (x - xi) / math.sqrt((x - xi)**2 + (y - yi)**2) return grad_x def _partial_derivative_y(self, x, y): grad_y = 0 for xi, yi in self.points: if x != xi or y != yi: grad_y += (y - yi) / math.sqrt((x - xi)**2 + (y - yi)**2) return grad_y def gradient_descent(self): dx, dy = 0, 0 while self.step_size > self.tolerance: dx = self._partial_derivative_x(self.x, self.y) + self.damping_rate * dx dy = self._partial_derivative_y(self.x, self.y) + self.damping_rate * dy self.x -= self.step_size * dx self.x_updates.append(self.x) self.y -= self.step_size * dy self.y_updates.append(self.y) self.step_size *= self.decay_rate return (self.x, self.y) def f(x, y, c, d): return np.sqrt((x - c)**2 + (y - d)**2) # Define points points = [(1, 3), (-2, 4), (3, 4), (-2, 1), (9, 2), (-5, 2)] gd_min = GDdistanceMin(points=points) min_pt = gd_min.gradient_descent() xs = gd_min.x_updates ys = gd_min.y_updates print("Minimum point:", min_pt) c, d = min_pt # Create a grid for plotting x = np.linspace(-10, 10, 400) y = np.linspace(-10, 10, 400) x_grid, y_grid = np.meshgrid(x, y) z = f(x_grid, y_grid, c, d) # Calculate z values for the updates zs = [f(xi, yi, c, d) for xi, yi in zip(xs, ys)] # Plotting fig = plt.figure() ax = fig.add_subplot(111, projection='3d') ax.plot_surface(x_grid, y_grid, z, cmap='viridis', alpha=0.6) ax.scatter(xs, ys, zs, color='red', s=50, label="Updates", marker='o') ax.set_xlabel('X') ax.set_ylabel('Y') ax.set_zlabel('Z') plt.gca().set_facecolor('#f4f4f4') plt.gcf().patch.set_facecolor('#f4f4f4') plt.show() ``` Different approach[^1] ```{python} import sympy as sym sym.init_printing() from IPython.display import display, Math from matplotlib_inline.backend_inline import set_matplotlib_formats set_matplotlib_formats('svg') def distance_function(x,y,c,d): x,y = np.meshgrid(x,y) z = np.sqrt((x-c)**2+(y-d)**2) return z c,d = 0,0 x = np.linspace(-5,5, 400) y = np.linspace(-5,5, 400) z = distance_function(x,y,c,d) plt.imshow(z, extent=[x[0],x[-1], y[0],y[-1]],vmin=-5, vmax=5, origin='lower') plt.show() ``` So, the minimum is in the center of the deep color area. Let's compute the derivatives using `sympy` library ```{python} sx, sy = sym.symbols('sx,sy') sz = sym.sqrt((sx-c)**2+(sy-d)**2) df_sx = sym.lambdify((sx,sy), sym.diff(sz, sx), 'sympy') df_sy = sym.lambdify((sx,sy), sym.diff(sz, sy), 'sympy') display(Math(f"\\frac{{\\partial f}}{{\\partial x}}={sym.latex(sym.diff(sz,sx))}\ \\text{{ and }} \\frac{{\\partial f}}{{\\partial x}}\\mid_{{(1,1)}} = {df_sx(1,1).evalf()}")) display(Math(f"\\frac{{\\partial f}}{{\\partial y}}={sym.latex(sym.diff(sz,sy))} \ \\text{{ and }} \\frac{{\\partial f}}{{\\partial y}}\\mid_{{(1,1)}} = {df_sy(1,1).evalf()}")) ``` Now let's implement the GD ```{python} # Let's pick a random starting point(uniformly distributed between -2 and 2) local_min = np.random.rand(2)*4-2 start_point = local_min[:] # make a copy learning_rate = 0.01 training_epochs = 1000 trajectory = np.zeros((training_epochs,2)) for i in range(training_epochs): grad = np.array([ df_sx(local_min[0],local_min[1]).evalf(), df_sy(local_min[0],local_min[1]).evalf() ]) local_min = local_min - learning_rate*grad trajectory[i,:] = local_min print('Starting point ',start_point) print('Local min found ',local_min) plt.imshow(z, extent=[x[0],x[-1], y[0],y[-1]],vmin=-5, vmax=5, origin='lower') plt.plot(start_point[0], start_point[1], 'bs') plt.plot(local_min[0],local_min[1],'rs') plt.plot(trajectory[:,0],trajectory[:,1], 'b', linewidth=3) plt.legend(['random start', 'local min']) plt.colorbar() plt.show() ``` ## Implementation of SGD Personally, I think this [webpage](https://sebastianraschka.com/faq/docs/sgd-methods.html){target="_blank" style="text-decoration:none"} is one of the best resources for learning how to implement SGD. [^1]: The implementation is taken from the the Udemy course [A Deep Understanding of Deep Learning](https://www.udemy.com/course/deeplearning_x/?couponCode=ST11MT170325G3){target="_blank" style="text-decoration:none"} --- **Share on** ::::{.columns} :::{.column width="33%"} <a href="https://www.facebook.com/sharer.php?u=https://mrislambd.github.io/posts/sgd/" target="_blank" style="color:#1877F2; text-decoration: none;"> {{< fa brands facebook size=3x >}} </a> ::: :::{.column width="33%"} <a href="https://www.linkedin.com/sharing/share-offsite/?url=https://mrislambd.github.io/posts/sgd/" target="_blank" style="color:#0077B5; text-decoration: none;"> {{< fa brands linkedin size=3x >}} </a> ::: :::{.column width="33%"} <a href="https://www.twitter.com/intent/tweet?url=https://mrislambd.github.io/posts/sgd/" target="_blank" style="color:#1DA1F2; text-decoration: none;"> {{< fa brands twitter size=3x >}} </a> ::: :::: <script src="https://giscus.app/client.js" data-repo="mrislambd/mrislambd.github.io" data-repo-id="R_kgDOMV8crA" data-category="Announcements" data-category-id="DIC_kwDOMV8crM4CjbQW" data-mapping="pathname" data-strict="0" data-reactions-enabled="1" data-emit-metadata="0" data-input-position="bottom" data-theme="light" data-lang="en" crossorigin="anonymous" async> </script> <div id="fb-root"></div> <script async defer crossorigin="anonymous" src="https://connect.facebook.net/en_US/sdk.js#xfbml=1&version=v20.0"></script> <div class="fb-comments" data-href="https://mrislambd.github.io/posts/sgd/" data-width="750" data-numposts="5"></div> **You may also like**