Gradient Descent Algorithm

Gradient Descent is an optimization algorithm that finds the minimum of a function by iteratively moving in the direction of the negative gradient. It is most commonly used in machine learning and deep learning to tune the parameters of a model.

1. Understanding the Concept

Suppose you are standing on a mountain and want to reach the lowest point, that is, the global minimum. Since you cannot see the entire terrain, you take small steps in the direction of the steepest downward slope. Gradient descent works the same way: it iteratively updates the parameters in the direction of the negative gradient to find a minimum.

2. Mathematical Formulation

Let’s define a function f(x) that we want to minimize. The gradient descent algorithm follows these steps:

1. Initialize the parameter x_0 randomly or with a given value.

2. Compute the gradient (derivative) of the function at the current point:

∇f(x) = df(x)/dx

3. Update x using the update rule:

x_{t+1} = x_t − α ∇f(x_t)

where:

• x_t is the current parameter value.
• α is the learning rate (step size).
• ∇f(x_t) is the gradient at x_t.

4. Repeat the process until convergence (when the gradient is close to zero or the change in the function value is small).
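
For example, to minimize f(x) = x² with α = 0.1 starting from x_0 = 10: the gradient is 2x, so x_1 = 10 − 0.1·20 = 8, x_2 = 8 − 0.1·16 = 6.4, x_3 = 6.4 − 0.1·12.8 = 5.12, and each step moves x toward the minimum at x = 0 by a factor of 0.8.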

3. Gradient Descent Variants

There are three primary types of gradient descent, distinguished by how much data is used to compute each parameter update (a short code sketch contrasting all three follows the list below):

A. Batch Gradient Descent

• Uses the entire dataset to compute the gradient.
• The update is performed only once per epoch (a complete pass over the data).
• Converges smoothly but is very slow for large datasets.

B. Stochastic Gradient Descent (SGD)

• Parameters are updated after each individual training example.
• Much faster, but the updates are noisy and the iterates oscillate around the minimum.
• The noise can help the algorithm escape local minima.

C. Mini-Batch Gradient Descent

• Uses a small batch of data (e.g., 32, 64, or 128 examples) per update.
• Best of both worlds: the stability of batch GD and the efficiency of SGD.
• The most commonly used variant in deep learning.
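
The sketch below shows all three variants as a single update loop that differs only in the batch size. The one-parameter linear model, the synthetic data, and the specific batch sizes are assumptions made purely for illustration.

    import numpy as np

    # Toy data for illustration: y is roughly 3x plus a little noise (assumed example)
    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, size=256)
    y = 3.0 * X + 0.1 * rng.normal(size=256)

    def gradient_descent(X, y, batch_size, learning_rate=0.1, epochs=50):
        # batch_size = len(X) -> batch GD (one update per epoch)
        # batch_size = 1      -> stochastic GD (one update per example)
        # batch_size = 32     -> mini-batch GD
        w = 0.0
        n = len(X)
        for _ in range(epochs):
            order = rng.permutation(n)                 # reshuffle the data each epoch
            for start in range(0, n, batch_size):
                idx = order[start:start + batch_size]
                error = w * X[idx] - y[idx]            # prediction error on this batch
                grad = 2 * np.mean(error * X[idx])     # gradient of the mean squared error
                w -= learning_rate * grad              # gradient descent update
        return w

    print(gradient_descent(X, y, batch_size=len(X)))   # batch GD
    print(gradient_descent(X, y, batch_size=1))        # SGD
    print(gradient_descent(X, y, batch_size=32))       # mini-batch GD

All three runs should end up near the true slope of 3, with batch GD taking the smoothest path and SGD the noisiest.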

4. Choosing the Learning Rate

• If α is too small → slow convergence.
• If α is too large → may overshoot the minimum or diverge.
• Solution: use techniques like learning rate scheduling or adaptive optimizers (e.g., Adam, RMSprop).
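
The short sketch below illustrates these regimes on the same f(x) = x² used in the Python example of Section 6; the specific α values are arbitrary choices picked to show each behavior.

    # Effect of the learning rate on f(x) = x**2, starting from x = 10
    def run(alpha, steps=25):
        x = 10.0
        for _ in range(steps):
            x -= alpha * 2 * x      # gradient of x**2 is 2x
        return x

    print(run(0.001))   # too small: x is still near 9.5 after 25 steps
    print(run(0.1))     # reasonable: x ends up close to the minimum at 0
    print(run(1.1))     # too large: |x| grows every step, so the iteration diverges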

5. Gradient Descent in Higher Dimensions

For multiple variables (x_1, x_2, …, x_n), the update rule generalizes to:

θ_{t+1} = θ_t − α ∇J(θ_t)

where:

• J(θ) is the cost function.
• θ is the vector of parameters.
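
Below is a minimal sketch of this vector-valued update, using an assumed two-variable cost J(θ) = θ_1² + 10θ_2² chosen only for illustration.

    import numpy as np

    # Assumed example cost: J(theta) = theta[0]**2 + 10 * theta[1]**2
    def cost(theta):
        return theta[0]**2 + 10 * theta[1]**2

    def grad(theta):
        return np.array([2 * theta[0], 20 * theta[1]])   # vector of partial derivatives

    theta = np.array([4.0, -2.0])   # initial parameter vector
    alpha = 0.05
    for _ in range(100):
        theta = theta - alpha * grad(theta)   # theta_{t+1} = theta_t - alpha * grad J(theta_t)

    print(theta, cost(theta))       # theta approaches the minimum at [0, 0]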

6. Gradient Descent with a Python Example

Here is an implementation of gradient descent for the simple function f(x) = x²:

    import numpy as np
    import matplotlib.pyplot as plt
    
    # Define function and its gradient
    def f(x):
        return x**2
    
    def gradient(x):
        return 2*x
    
    # Gradient Descent Algorithm
    def gradient_descent(starting_point, learning_rate=0.1, iterations=20):
        x = starting_point
        history = [x]
        
        for _ in range(iterations):
            grad = gradient(x)
            x = x - learning_rate * grad
            history.append(x)
        
        return history
    
    # Run gradient descent
    history = gradient_descent(starting_point=10)
    
    # Plot results
    x_vals = np.linspace(-10, 10, 100)
    y_vals = f(x_vals)
    
    plt.plot(x_vals, y_vals, label="f(x) = x²")
    plt.scatter(history, [f(x) for x in history], color="red", label="Steps")
    plt.xlabel("x")
    plt.ylabel("f(x)")
    plt.legend()
    plt.show()

7. Challenges and Solutions

Challenge                           Solution
Getting stuck in local minima       Use momentum or the Adam optimizer
Slow convergence                    Use an adaptive learning rate (e.g., RMSprop, Adam)
Vanishing gradients                 Use the ReLU activation in neural networks
Choosing the right learning rate    Hyperparameter tuning

8. Optimized Variants of Gradient Descent

• Momentum: accumulates past gradients to accelerate convergence.
• RMSprop: adapts the learning rate for each parameter individually.
• Adam: combines momentum and RMSprop; widely used in deep learning.
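
As a minimal sketch, the momentum update below is applied to the same f(x) = x² as in Section 6; β = 0.9 is a typical but assumed choice, and this is the common "heavy-ball" form rather than the only formulation.

    # Momentum: keep a running "velocity" built from past gradients
    def momentum_descent(x, learning_rate=0.1, beta=0.9, iterations=100):
        velocity = 0.0
        for _ in range(iterations):
            grad = 2 * x                          # gradient of x**2
            velocity = beta * velocity + grad     # accumulate past gradients
            x = x - learning_rate * velocity      # parameter update
        return x

    print(momentum_descent(10.0))   # approaches the minimum at 0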

9. Conclusion

• Gradient descent minimizes a function by moving in the direction of the negative gradient.
• There are three main variants: batch GD, stochastic GD, and mini-batch GD.
• Choosing an appropriate learning rate is critical.
• Advanced optimizers such as Adam improve performance.