Gradient Descent#

Problem: minimize \(f(x)\), \(x = [x_1, x_2, \ldots, x_d] \in \mathbb{R}^d\)

Idea: the gradient of a function

\[\nabla f(x) = \left[ \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_d} \right]\]

at a point gives the direction of the steepest ascent of the function at that point.
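
For example, for \(f(x) = x_1^2 + 2 x_2^2\) the gradient is \(\nabla f(x) = [2 x_1, 4 x_2]\); at the point \((1, 1)\) it is \([2, 4]\), the direction of steepest ascent there.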

If we move in the direction opposite to the gradient, we move down the slope of the function:

\[x^{(i+1)} = x^{(i)} - \alpha \nabla f(x^{(i)})\]

where \(\alpha\) is the step size (learning rate), \(x^{(i)}\) is the current point, \(x^{(i+1)}\) is the next point, and \(x^{(0)}\) is the initial guess.

We repeat this process until \(\|\nabla f(x^{(i)})\|\) is close to 0 for some \(i\).

Note:

  • This algorithm typically converges to a nearby local minimum, not necessarily the global one.

  • If \(\nabla f(x^{(i)})=0\), then we are at a stationary point and we won’t move.

  • In the simplest form \(\alpha\) is fixed. For more advanced versions, it is usually adaptive or randomized.
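
As a minimal sketch of the update rule (using a made-up one-dimensional example \(f(x) = (x - 3)^2\), whose gradient is \(f'(x) = 2(x - 3)\) and whose minimum is at \(x = 3\)):

def grad_f(x):
    return 2 * (x - 3)

alpha = 0.1      # fixed step size (learning rate)
x = 0.0          # initial guess x^(0)
for _ in range(100):
    g = grad_f(x)
    if abs(g) < 1e-8:     # stop once the gradient is close to zero
        break
    x = x - alpha * g     # step in the direction opposite to the gradient

print(x)   # approaches 3.0, the minimizer of f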

Interactive Visualization#

1D visualization

2D visualization

Example#

Use gradient descent to solve a linear regression problem.

For a single-variable linear regression model \(f(x) = wx + b\) and a dataset \(\{ (x_i, y_i) \}\), the Mean Squared Error (MSE) loss function is defined as:

\[ L(w, b) = \frac{1}{N} \sum_{i=1}^{N} (w x_i + b - y_i)^2 \]

where \(N\) is the number of data points.
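
For instance, with two made-up data points \((x_1, y_1) = (1, 2)\) and \((x_2, y_2) = (2, 5)\), and parameters \(w = 1\), \(b = 1\):

\[ L(1, 1) = \frac{1}{2}\left[ (1 \cdot 1 + 1 - 2)^2 + (1 \cdot 2 + 1 - 5)^2 \right] = \frac{1}{2}(0 + 4) = 2 \]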

Derivatives of the Loss Function#

The gradients of the loss function with respect to \(w\) and \(b\) are:

\[ \frac{\partial L}{\partial w} = \frac{2}{N} \sum_{i=1}^{N} (w x_i + b - y_i) x_i \]
\[ \frac{\partial L}{\partial b} = \frac{2}{N} \sum_{i=1}^{N} (w x_i + b - y_i) \]

Using gradient descent, the parameters \(w\) and \(b\) are updated as:

\[ w_{k+1} = w_k - \alpha \frac{\partial L}{\partial w} \]
\[ b_{k+1} = b_k - \alpha \frac{\partial L}{\partial b} \]

where \(\alpha\) is the learning rate and the derivatives are evaluated at the current iterate \((w_k, b_k)\).
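
Before using them, the analytic gradients can be sanity-checked against a central finite-difference approximation. This is a short standalone sketch with made-up data, separate from the example below:

import numpy as np

np.random.seed(0)
x = np.random.rand(20)
y = 3 * x + 2 + 0.1 * np.random.randn(20)
w, b = 1.5, 0.5

def loss(w, b):
    return np.mean((w * x + b - y) ** 2)

# Analytic gradients from the formulas above
dL_dw = 2 * np.mean((w * x + b - y) * x)
dL_db = 2 * np.mean(w * x + b - y)

# Central finite-difference approximations
eps = 1e-6
dL_dw_fd = (loss(w + eps, b) - loss(w - eps, b)) / (2 * eps)
dL_db_fd = (loss(w, b + eps) - loss(w, b - eps)) / (2 * eps)

print(abs(dL_dw - dL_dw_fd), abs(dL_db - dL_db_fd))   # both differences should be tiny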

%matplotlib notebook
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.animation as animation
from IPython.display import HTML, display

# Step 1: Generate synthetic data
np.random.seed(42)
x = 2 * np.random.rand(100, 1)
y = 4 + 3 * x + np.random.randn(100, 1)

# for plotting the straight line
gridx = np.linspace(0, 2, 100).reshape(-1, 1)

# Step 2: Define the loss function
def compute_loss(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

# Create a grid of w and b values to visualize the loss landscape
w_values = np.linspace(0, 8, 200)
b_values = np.linspace(0, 8, 200)
w_grid, b_grid = np.meshgrid(w_values, b_values)
loss_grid = np.zeros_like(w_grid)

# Calculate the loss for each combination of w and b
for i in range(w_grid.shape[0]):
    for j in range(w_grid.shape[1]):
        w, b = w_grid[i, j], b_grid[i, j]
        y_pred = w * x + b
        loss_grid[i, j] = compute_loss(y, y_pred)

# Step 3: Implement gradient descent on w and b
learning_rate = 0.3
n_iterations = 50
w = 2
b = 1

# Store values for plotting the trajectory
trajectory = [(w, b)]
y_preds = [w * gridx + b]

# Precompute gradients and predictions
ws = [w]
bs = [b]
for iteration in range(n_iterations):
    # Predictions
    y_pred = ws[-1] * x + bs[-1]

    # Compute gradients (y - y_pred = -(w*x + b - y), so the leading minus signs
    # match dL/dw and dL/db derived above)
    w_gradient = -2 * np.mean(x * (y - y_pred))
    b_gradient = -2 * np.mean(y - y_pred)

    # Update parameters
    ws.append(ws[-1] - learning_rate * w_gradient)
    bs.append(bs[-1] - learning_rate * b_gradient)

    # Save the trajectory and predictions
    trajectory.append((ws[-1], bs[-1]))
    y_preds.append(ws[-1] * gridx + bs[-1])

# Prepare the figure
fig, axs = plt.subplots(1, 2, figsize=(12, 6))

# Left subplot: Loss landscape and trajectory
levels = [0, 0.1, 1, 2, 5, 10, 15, 20, 30, 50, 75, 100]
cp = axs[0].contour(w_grid, b_grid, loss_grid, levels=levels, cmap='viridis')
axs[0].clabel(cp, inline=True, fontsize=10)
traj_plot, = axs[0].plot([], [], 'r-o', label='Gradient Descent Path')
axs[0].set_xlabel('w')
axs[0].set_ylabel('b')
axs[0].set_title('Trajectory of Gradient Descent in the Loss Landscape')
axs[0].legend()

# Right subplot: Current line over noisy data
scatter = axs[1].scatter(x, y, color='blue', label='Noisy Data')
line, = axs[1].plot([], [], 'r-', label='Current Fit')
axs[1].set_xlabel('x')
axs[1].set_ylabel('y')
axs[1].set_title('Current Fit of the Model')
axs[1].legend()

def animate(i):
    # Update trajectory plot
    traj_plot.set_data(*zip(*trajectory[:i+1]))

    # Update line plot
    y_pred = y_preds[i]
    line.set_data(gridx.flatten(), y_pred.flatten())

    return traj_plot, line

ani = animation.FuncAnimation(fig, animate, frames=n_iterations + 1, interval=200, blit=True)  # +1 so the final fit is shown

# Display the animation in the notebook
html_output = HTML(ani.to_jshtml())
display(html_output)
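
As a quick check (assuming the cell above has been run, so x, y, ws, and bs are defined), the gradient descent estimates can be compared with the closed-form least-squares solution:

# Closed-form least-squares fit for comparison
X = np.hstack([x, np.ones_like(x)])               # design matrix with an intercept column
w_ls, b_ls = np.linalg.lstsq(X, y, rcond=None)[0].ravel()

print(ws[-1], bs[-1])   # gradient descent estimates
print(w_ls, b_ls)       # least-squares estimates (close to the true 3 and 4)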

More gradient descent algorithms