Gradient Descent Algorithm
Gradient descent is an optimization algorithm that finds a minimum of a function by iteratively moving in the direction of the negative gradient. It is most commonly used in machine learning and deep learning to tune model parameters.
1. Understanding the Concept
Suppose you are standing on a mountain and want to reach the lowest point, that is, the global minimum. Since you cannot see the entire terrain, you take small steps in the direction of the steepest downward slope. Gradient descent works the same way: it iteratively updates the parameters in the direction of the negative gradient until it reaches a minimum.
2. Mathematical Formulation
Let’s define a function f(x) that we want to minimize. The gradient descent algorithm follows these steps:
1. Initialize parameters x0 randomly or with a given value.
2. Compute the gradient (derivative) of the function at the current point:
∇f(x)=df(x)/dx
3. Update x using the update rule:
x_{t+1} = x_t − α ∇f(x_t)
where:
- x_t is the current parameter value.
- α is the learning rate (step size).
- ∇f(x_t) is the gradient at x_t.
4. Repeat the process until convergence (when the gradient is close to zero or the change in function value is small).
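To make the update rule concrete, here is a short worked example (the specific numbers are chosen for illustration and are not from the original text), using the same function f(x) = x² that appears in the code in Section 6, with α = 0.1 and x_0 = 10:
- ∇f(x) = 2x, so each step is x_{t+1} = x_t − 0.1 · 2x_t = 0.8 x_t.
- x_1 = 10 − 0.1 · 20 = 8
- x_2 = 8 − 0.1 · 16 = 6.4
- x_3 = 6.4 − 0.1 · 12.8 = 5.12
Each iteration shrinks x by the constant factor 0.8, so the iterates approach the minimum at x = 0.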
3. Gradient Descent Variants
There are three primary types of gradient descent, distinguished by how much of the training data is used to compute each parameter update:
A. Batch Gradient Descent
- It uses the whole dataset to calculate the gradient.
- The update is performed only after an epoch (a complete pass over the data).
- Convergence is smooth, but each update is expensive, making it very slow for large datasets.
B. Stochastic Gradient Descent (SGD)
- Parameters are updated after each training example.
- Much faster per update, but the updates are noisy, so the iterates oscillate around the minimum.
- This noise can help the algorithm escape shallow local minima.
C. Mini-Batch Gradient Descent
- Uses a small batch of data (e.g., 32, 64, or 128 examples).
- Best of both worlds: the stability of batch GD with the efficiency of SGD.
- The most commonly used variant in deep learning (see the sketch after this list).
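To make the distinction concrete, here is a minimal NumPy sketch of one training epoch of mini-batch gradient descent on a linear-regression problem. The synthetic data, the mse_gradient helper, and the batch size are assumptions made for this illustration, not part of the original text; setting batch_size=len(X) recovers batch GD, while batch_size=1 recovers SGD.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic data for illustration (assumed): y = 3x plus a little noise
X = rng.normal(size=(1000, 1))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=1000)

def mse_gradient(w, X_batch, y_batch):
    # Gradient of the mean squared error for the linear model y_hat = X @ w
    residual = X_batch @ w - y_batch
    return 2.0 * X_batch.T @ residual / len(y_batch)

def run_epoch(w, X, y, learning_rate=0.1, batch_size=32):
    # One full pass over the data, updating w after every mini-batch
    indices = rng.permutation(len(X))              # shuffle once per epoch
    for start in range(0, len(X), batch_size):
        batch = indices[start:start + batch_size]
        w = w - learning_rate * mse_gradient(w, X[batch], y[batch])
    return w

w = np.zeros(1)
for _ in range(5):
    w = run_epoch(w, X, y)
print(w)   # converges toward the true coefficient, approximately [3.]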
4. Choosing the Learning Rate
- If α is too small → Slow convergence.
- If α is too large → May overshoot the minimum or diverge.
- Solution: Use techniques like learning rate scheduling or adaptive optimizers (e.g., Adam, RMSprop).
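To see these effects concretely, continue the f(x) = x² example from Section 2 (the thresholds below follow from that particular choice of function and are given only for illustration). The update becomes x_{t+1} = x_t − α · 2x_t = (1 − 2α) x_t, so each step multiplies x by the constant factor (1 − 2α):
- 0 < α < 0.5: the factor lies between 0 and 1, and x shrinks smoothly toward the minimum.
- 0.5 < α < 1: the factor is negative but smaller than 1 in magnitude, so x overshoots and oscillates around the minimum while still converging.
- α > 1: the factor is larger than 1 in magnitude, and the iterates diverge.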
5. Gradient Descent in Higher Dimensions
For multiple variables (x_1, x_2, …, x_n), the update rule generalizes to:
θ_{t+1} = θ_t − α ∇J(θ_t)
where:
- J(θ) is the cost function.
- θ is the vector of parameters.
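Below is a minimal sketch in the same style as the example in Section 6, using the elongated quadratic bowl J(θ) = θ_1² + 10 θ_2² as the cost function; this particular J and the variable names are assumptions chosen for illustration.

import numpy as np

def J(theta):
    # Example cost function: an elongated quadratic bowl with its minimum at (0, 0)
    return theta[0]**2 + 10.0 * theta[1]**2

def grad_J(theta):
    # Gradient of J, computed analytically
    return np.array([2.0 * theta[0], 20.0 * theta[1]])

theta = np.array([5.0, 5.0])   # initial parameter vector
alpha = 0.05                   # learning rate
for _ in range(100):
    theta = theta - alpha * grad_J(theta)

print(theta, J(theta))   # both components approach the minimum at (0, 0)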
6. Gradient Descent with Python Example
Here’s an implementation of gradient descent for the simple function f(x) = x²:
import numpy as np
import matplotlib.pyplot as plt

# Define function and its gradient
def f(x):
    return x**2

def gradient(x):
    return 2*x

# Gradient Descent Algorithm
def gradient_descent(starting_point, learning_rate=0.1, iterations=20):
    x = starting_point
    history = [x]
    for _ in range(iterations):
        grad = gradient(x)
        x = x - learning_rate * grad
        history.append(x)
    return history

# Run gradient descent
history = gradient_descent(starting_point=10)

# Plot results
x_vals = np.linspace(-10, 10, 100)
y_vals = f(x_vals)
plt.plot(x_vals, y_vals, label="f(x) = x²")
plt.scatter(history, [f(x) for x in history], color="red", label="Steps")
plt.xlabel("x")
plt.ylabel("f(x)")
plt.legend()
plt.show()
7. Challenges and Solutions
| Challenge | Solution |
|---|---|
| Getting stuck in local minima | Use momentum or the Adam optimizer |
| Slow convergence | Adaptive learning rate (e.g., RMSprop, Adam) |
| Vanishing gradients | Use ReLU activation in neural networks |
| Choosing the right learning rate | Hyperparameter tuning |
8. Optimized Variants of Gradient Descent
- Momentum: Uses past gradients to accelerate convergence (see the sketch after this list).
- RMSprop: Adapts the learning rate for each parameter.
- Adam: Combines momentum and RMSprop, widely used in deep learning.
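As an illustration of the momentum idea, here is a minimal sketch using the same f(x) = x² as Section 6. The formulation shown (a velocity that decays by a factor beta while accumulating gradients) is one common variant; the function names and hyperparameter values are assumptions made for this example, not part of the original text.

def gradient(x):
    # Gradient of f(x) = x**2, as in the example above
    return 2 * x

def momentum_descent(starting_point, learning_rate=0.1, beta=0.9, iterations=200):
    # Gradient descent with momentum: the velocity accumulates past gradients,
    # which smooths the trajectory and speeds up progress along consistent directions.
    x = starting_point
    velocity = 0.0
    for _ in range(iterations):
        velocity = beta * velocity + gradient(x)   # exponentially decaying sum of past gradients
        x = x - learning_rate * velocity           # step along the accumulated direction
    return x

print(momentum_descent(starting_point=10.0))   # approaches the minimum at x = 0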
9. Conclusion
- Gradient descent minimizes a function by moving in the direction of the negative gradient.
- Three categories: Batch GD, Stochastic GD, and Mini-batch GD.
- Proper selection of learning rate is critical.
- Advanced optimizers like Adam improve performance.