Gradient Descent Algorithm
Gradient descent is an optimization algorithm that finds a minimum of a function by iteratively moving in the direction of the negative gradient. It is most commonly used in machine learning and deep learning to tune model parameters.
1. Understanding the Concept
Suppose you are standing on a mountain and want to reach the lowest point, that is, the global minimum. Since you cannot see the entire terrain, you take small steps in the direction of the steepest downward slope. Gradient descent works the same way: it iteratively updates the parameters in the direction of the negative gradient until it reaches a minimum.
2. Mathematical Formulation
Let’s define a function f(x) that we want to minimize. The gradient descent algorithm follows these steps:
1. Initialize parameters x0 randomly or with a given value.
2. Compute the gradient (derivative) of the function at the current point:
∇f(x)=df(x)/dx
3. Update x using the update rule:
x_{t+1} = x_t − α ∇f(x_t)
where:
- x_t is the current parameter value.
- α is the learning rate (step size).
- ∇f(x_t) is the gradient at x_t.
4. Repeat the process until convergence (when the gradient is close to zero or the change in function value is small).
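To make the update rule concrete, here is a short worked example (the specific numbers are chosen for illustration and are not from the original text), using the same function f(x) = x² that appears in the code in Section 6, with α = 0.1 and x_0 = 10:
- ∇f(x) = 2x, so each step is x_{t+1} = x_t − 0.1 · 2x_t = 0.8 x_t.
- x_1 = 10 − 0.1 · 20 = 8
- x_2 = 8 − 0.1 · 16 = 6.4
- x_3 = 6.4 − 0.1 · 12.8 = 5.12
Each iteration shrinks x by the constant factor 0.8, so the iterates approach the minimum at x = 0.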
3. Gradient Descent Variants
There are three primary types of gradient descent, distinguished by how much of the training data is used to compute each parameter update:
A. Batch Gradient Descent
- It uses the whole dataset to calculate the gradient.
- The update is performed only after an epoch (a complete pass over the data).
- Convergence is smooth, but each update is expensive, making it very slow for large datasets.
B. Stochastic Gradient Descent (SGD)
- Parameters are updated after each training example.
- Much faster per update, but the updates are noisy, so the iterates oscillate around the minimum.
- This noise can help the algorithm escape shallow local minima.
C. Mini-Batch Gradient Descent
- Uses a small batch of data (e.g., 32, 64, or 128 examples).
- Best of both worlds: the stability of batch GD with the efficiency of SGD.
- The most commonly used variant in deep learning (see the sketch after this list).
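To make the distinction concrete, here is a minimal NumPy sketch of one training epoch of mini-batch gradient descent on a linear-regression problem. The synthetic data, the mse_gradient helper, and the batch size are assumptions made for this illustration, not part of the original text; setting batch_size=len(X) recovers batch GD, while batch_size=1 recovers SGD.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic data for illustration (assumed): y = 3x plus a little noise
X = rng.normal(size=(1000, 1))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=1000)

def mse_gradient(w, X_batch, y_batch):
    # Gradient of the mean squared error for the linear model y_hat = X @ w
    residual = X_batch @ w - y_batch
    return 2.0 * X_batch.T @ residual / len(y_batch)

def run_epoch(w, X, y, learning_rate=0.1, batch_size=32):
    # One full pass over the data, updating w after every mini-batch
    indices = rng.permutation(len(X))              # shuffle once per epoch
    for start in range(0, len(X), batch_size):
        batch = indices[start:start + batch_size]
        w = w - learning_rate * mse_gradient(w, X[batch], y[batch])
    return w

w = np.zeros(1)
for _ in range(5):
    w = run_epoch(w, X, y)
print(w)   # converges toward the true coefficient, approximately [3.]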
4. Choosing the Learning Rate
- If α is too small → Slow convergence.
- If α is too large → May overshoot the minimum or diverge.
- Solution: Use techniques like learning rate scheduling or adaptive optimizers (e.g., Adam, RMSprop).
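To see these effects concretely, continue the f(x) = x² example from Section 2 (the thresholds below follow from that particular choice of function and are given only for illustration). The update becomes x_{t+1} = x_t − α · 2x_t = (1 − 2α) x_t, so each step multiplies x by the constant factor (1 − 2α):
- 0 < α < 0.5: the factor lies between 0 and 1, and x shrinks smoothly toward the minimum.
- 0.5 < α < 1: the factor is negative but smaller than 1 in magnitude, so x overshoots and oscillates around the minimum while still converging.
- α > 1: the factor is larger than 1 in magnitude, and the iterates diverge.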
5. Gradient Descent in Higher Dimensions
For multiple variables (x_1, x_2, …, x_n), the update rule generalizes to:
θ_{t+1} = θ_t − α ∇J(θ_t)
where:
- J(θ) is the cost function.
- θ is the vector of parameters.
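Below is a minimal sketch in the same style as the example in Section 6, using the elongated quadratic bowl J(θ) = θ_1² + 10 θ_2² as the cost function; this particular J and the variable names are assumptions chosen for illustration.

import numpy as np

def J(theta):
    # Example cost function: an elongated quadratic bowl with its minimum at (0, 0)
    return theta[0]**2 + 10.0 * theta[1]**2

def grad_J(theta):
    # Gradient of J, computed analytically
    return np.array([2.0 * theta[0], 20.0 * theta[1]])

theta = np.array([5.0, 5.0])   # initial parameter vector
alpha = 0.05                   # learning rate
for _ in range(100):
    theta = theta - alpha * grad_J(theta)

print(theta, J(theta))   # both components approach the minimum at (0, 0)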
6. Gradient Descent with Python Example
Here’s an implementation of gradient descent for the simple function f(x) = x²:
import numpy as np
import matplotlib.pyplot as plt

# Define function and its gradient
def f(x):
    return x**2

def gradient(x):
    return 2*x

# Gradient Descent Algorithm
def gradient_descent(starting_point, learning_rate=0.1, iterations=20):
    x = starting_point
    history = [x]
    for _ in range(iterations):
        grad = gradient(x)
        x = x - learning_rate * grad
        history.append(x)
    return history

# Run gradient descent
history = gradient_descent(starting_point=10)

# Plot results
x_vals = np.linspace(-10, 10, 100)
y_vals = f(x_vals)
plt.plot(x_vals, y_vals, label="f(x) = x²")
plt.scatter(history, [f(x) for x in history], color="red", label="Steps")
plt.xlabel("x")
plt.ylabel("f(x)")
plt.legend()
plt.show()
7. Challenges and Solutions
| Challenge | Solution |
|---|---|
| Getting stuck in local minima | Use momentum or the Adam optimizer |
| Slow convergence | Adaptive learning rate (e.g., RMSprop, Adam) |
| Vanishing gradients | Use ReLU activation in neural networks |
| Choosing the right learning rate | Hyperparameter tuning |
8. Optimized Variants of Gradient Descent
- Momentum: Uses past gradients to accelerate convergence (see the sketch after this list).
- RMSprop: Adapts the learning rate for each parameter.
- Adam: Combines momentum and RMSprop, widely used in deep learning.
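As an illustration of the momentum idea, here is a minimal sketch using the same f(x) = x² as Section 6. The formulation shown (a velocity that decays by a factor beta while accumulating gradients) is one common variant; the function names and hyperparameter values are assumptions made for this example, not part of the original text.

def gradient(x):
    # Gradient of f(x) = x**2, as in the example above
    return 2 * x

def momentum_descent(starting_point, learning_rate=0.1, beta=0.9, iterations=200):
    # Gradient descent with momentum: the velocity accumulates past gradients,
    # which smooths the trajectory and speeds up progress along consistent directions.
    x = starting_point
    velocity = 0.0
    for _ in range(iterations):
        velocity = beta * velocity + gradient(x)   # exponentially decaying sum of past gradients
        x = x - learning_rate * velocity           # step along the accumulated direction
    return x

print(momentum_descent(starting_point=10.0))   # approaches the minimum at x = 0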
9. Conclusion
- Gradient descent minimizes a function by moving in the direction of the negative gradient.
- Three categories: Batch GD, Stochastic GD, and Mini-batch GD.
- Proper selection of learning rate is critical.
- Advanced optimizers like Adam improve performance.