statistics

The Gradient Descent Algorithm Explained Simply

Imagine wanting to find the fastest route to reach a destination by car. You could use a road map to estimate the distance and travel time of different roads. However, this method doesn’t account for traffic, which can vary significantly throughout the day.

Gradient Descent can be used to find the fastest route in real-time. In this case:

The cost function represents the travel time of the journey.
The parameter to optimize is the route to follow.
The gradient indicates the direction in which travel time increases most rapidly.

The Gradient Descent algorithm can then be used to update the route iteratively, getting closer to the fastest route with each iteration.

Let’s now try to organize the definitions a bit.

Gradient Descent is an algorithm that tries to find the minimum of an objective function, i.e., the lowest possible value that the function can assume. To do this, the algorithm starts from a random point and moves in the opposite direction of the gradient, which is the direction in which the function grows most rapidly. The gradient is calculated as the derivative of the function, i.e., the slope of the curve at a point. The higher the gradient, the steeper the function.

Let’s start our journey… and try to minimize the cost function!

What we’ll discuss

The learning rate
The stopping criterion
The objective function and the cost function
A very simple code example
Gradient Descent: Advantages and Disadvantages

The learning rate

The learning rate is a parameter that determines how much the algorithm moves at each step. If the learning rate is too high, the algorithm might overshoot the minimum and go beyond it.
If the learning rate is too low, the algorithm might take too long to reach the minimum or stop before reaching it.
The learning rate must be chosen to balance speed and accuracy. This is an extremely important and delicate point to consider.

The stopping criterion

The stopping criterion is a condition that determines when the algorithm should stop. It can be based on the number of steps, the difference between two consecutive values of the objective function, or a preset threshold value. The stopping criterion serves to prevent the algorithm from continuing to run unnecessarily or getting stuck at a point that is not the minimum.

An analogy to better understand gradient descent is that of a person trying to descend a mountain without seeing where they’re going. The person can use a compass to orient themselves and follow the direction where the slope is greatest. The person can also decide how far to walk at each step and when to stop, depending on how tired or confident they are.

An example of gradient descent applied to machine learning is linear regression, which is the method for finding the line that best approximates a set of points. The objective function in this case is the mean squared error between the points and the line, i.e., the sum of the squares of the vertical distances between the points and the line divided by the number of points.
The goal is to find the parameters of the line (intercept and slope) that minimize this error.
Gradient descent starts from a random line and updates the parameters based on the gradient of the mean squared error until reaching an optimal line.

Another example of gradient descent applied to machine learning is neural networks, which are mathematical models that mimic the functioning of biological neurons.
Neural networks are composed of several layers of nodes (neurons) connected by weights (synapses). Each node receives inputs from previous layers, combines them with weights, and produces an output for subsequent layers.
The objective function in this case is the difference between the desired output and the actual output of the neural network, i.e., the prediction error. The goal is to find the weights that minimize this error.
Gradient descent starts from random weights and updates them based on the gradient of the prediction error until reaching an optimal neural network.

The objective function and the cost function

The objective function and the cost function are terms that are often used interchangeably, but they also have different nuances depending on the context.

In general, the objective function is a function that you want to optimize, i.e., maximize or minimize, based on a certain criterion. The cost function is a type of objective function that measures the “cost” or “loss” associated with a solution or prediction. In other words, the cost function is an objective function that you want to minimize.

For example, if you want to find the line that best approximates a set of points, you can use linear regression. In this case, the objective function is the mean squared error between the points and the line, i.e., the sum of the squares of the vertical distances between the points and the line divided by the number of points. This objective function is also a cost function because it represents the cost of having an imperfect line. The goal is to find the parameters of the line that minimize this cost.

However, not all objective functions are cost functions.

Some objective functions can be utility functions, profit functions, likelihood functions, etc. These are functions that you want to maximize because they represent the benefit or probability associated with a solution or prediction. In this case, we don’t talk about cost or loss, but optimization.

For example, if you want to find the probabilistic model that best describes a set of data, you can use the maximum likelihood method. In this case, the objective function is the probability of generating the data set given the model, i.e., the product of the probabilities of each data point given the model. This objective function is not a cost function, but a likelihood function.
The goal is to find the parameters of the model that maximize this likelihood.

A very simple code example

To keep the exposition as simple as possible, I provide an example in R of implementing a Gradient Descent algorithm, for purely educational purposes. Here’s the code:

# Cost function
f <- function(x) {
  return(x^2 + 2*x + 1)
}

# Derivative of the cost function
df <- function(x) {
  return(2*x + 2)
}

# Initial parameters
x <- 1
learning_rate <- 0.1

# Iteration loop
for (i in 1:10) {
  # Calculate gradient
  gradient <- df(x)
  
  # Update parameter
  x <- x - learning_rate * gradient
  
  # Print current value
  print(x)
}

The code is intentionally very simple, and we’ll examine it line by line:

Step 1: Cost function

The function f <- function(x) { return(x^2 + 2*x + 1) } defines the cost function we want to minimize. In this case, the function is a simple parabola with a minimum at x = -1.

Step 2: Derivative of the cost function

The function df <- function(x) { return(2*x + 2) } calculates the derivative of the cost function with respect to x. The derivative tells us the direction in which the function grows most rapidly.

Step 3: Initial parameters

The variables x <- 1 and learning_rate <- 0.1 define the initial parameters of the algorithm. The variable x represents the initial value of the parameter to be optimized, while learning_rate controls the convergence speed of the algorithm.

Step 4: Iteration loop

The loop for (i in 1:10) performs 10 iterations of the Gradient Descent algorithm. In each iteration, the following operations are performed:

Calculate gradient: The function df(x) is used to calculate the gradient of the cost function at the current point x.
Update parameter: The value of x is updated by subtracting from it a factor proportional to the learning rate multiplied by the gradient.
Print current value: The current value of x is printed to the screen.

Gradient Descent: Advantages and Disadvantages

In general, Gradient Descent is a versatile and powerful optimization algorithm that can be used to solve a variety of problems. However, it’s important to be aware of its advantages and disadvantages to choose the most suitable algorithm for the specific problem.

Advantages:

Simplicity: The Gradient Descent algorithm is relatively simple to understand and implement.
Efficiency: The algorithm is efficient in terms of time and memory, especially for moderate-sized problems.
Flexibility: Gradient Descent can be applied to a variety of optimization problems with different cost functions and constraints.
Robustness: The algorithm is robust and generally converges to a solution, even if the cost function is not convex.

Disadvantages:

Local convergence: The algorithm can converge to a local minimum rather than a global minimum, especially if the cost function is non-convex.
Sensitivity to learning rate: The choice of learning rate can significantly affect the speed of convergence and the quality of the solution.
Difficulty in tuning: Adjusting the algorithm’s parameters can be difficult, especially for complex problems.
Slowness for large-scale problems: The algorithm can become slow for large-scale problems with a high number of parameters.