1. What Are Quasi-Newton Methods?

Quasi-Newton methods are iterative optimization algorithms that approximate Newton's method without directly computing the second-order derivatives (i.e., the Hessian matrix). They are used to solve unconstrained optimization problems, particularly when calculating the exact Hessian matrix is computationally expensive or infeasible.

In Newton's method, the update rule for optimizing a function $f(w)$ involves both the gradient and the Hessian matrix:

$$w_{k+1} = w_k - [\nabla^2 f(w_k)]^{-1} \nabla f(w_k)$$

However, computing the Hessian $\nabla^2 f(w_k)$ and its inverse is expensive in high dimensions. Quasi-Newton methods sidestep this issue by approximating the Hessian matrix iteratively as the optimization proceeds, rather than computing it directly at each step.
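To make the baseline concrete, here is a minimal sketch of a single Newton step on a hypothetical two-variable quadratic (the objective, its gradient, and its Hessian are made up for illustration; NumPy is assumed):

```python
import numpy as np

# Hypothetical toy objective: f(w) = w0**2 + 10*w1**2
def grad(w):
    return np.array([2.0 * w[0], 20.0 * w[1]])

def hessian(w):
    return np.diag([2.0, 20.0])

w = np.array([3.0, -1.0])

# One Newton step: w_new = w - H(w)^{-1} grad(w)
# (solve the linear system rather than forming the inverse explicitly)
step = np.linalg.solve(hessian(w), grad(w))
w_new = w - step
print(w_new)  # lands exactly at the minimum [0, 0] for this quadratic
```

For a quadratic the Hessian solve is cheap, but for a model with millions of parameters forming and factorizing an $n \times n$ Hessian at every step is exactly the cost quasi-Newton methods avoid.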

2. Purpose of Quasi-Newton Methods

The primary goal of Quasi-Newton methods is to find a compromise between the fast convergence of Newton’s method and the lower computational cost of simpler methods like gradient descent. By building an approximation to the Hessian matrix, these methods achieve much of the efficiency of Newton’s method without the full computational burden.

3. Intuition Behind Quasi-Newton Methods

Instead of using the true Hessian matrix, quasi-Newton methods construct an approximation to the Hessian by looking at changes in gradients as we update the parameter vector $w$.

The core idea is that:

  • After taking a step along the current search direction, you compare the gradient before and after the step.
  • This change in the gradient provides information about the curvature of the objective function.
  • Using this information, the algorithm refines its approximation to the inverse Hessian, allowing for more intelligent step sizes in future updates.

Thus, quasi-Newton methods gradually improve their understanding of the curvature of the loss function over time, making the updates more efficient.
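A hypothetical one-dimensional sketch of this idea: the ratio of the change in gradient to the change in parameter is a finite-difference (secant) estimate of the curvature $f''(w)$, which is exactly the information a quasi-Newton method accumulates.

```python
# 1D illustration (hypothetical example): estimate curvature from gradient changes.
def grad(w):                   # gradient of f(w) = 2*w**2 + w, so f''(w) = 4 everywhere
    return 4.0 * w + 1.0

w_old, w_new = 1.0, 0.7        # two successive iterates
s = w_new - w_old              # change in parameters
y = grad(w_new) - grad(w_old)  # change in gradients

curvature_estimate = y / s     # secant approximation of f''(w)
print(curvature_estimate)      # 4.0, matching the true curvature
```

In higher dimensions the same pair of differences (s, y) constrains the Hessian approximation through the secant condition, and each iteration adds one more such constraint.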

4. How Do Quasi-Newton Methods Work?

Quasi-Newton methods rely on approximating the inverse of the Hessian matrix through rank-one updates or low-rank updates. These updates are applied iteratively as the algorithm progresses, refining the estimate of the Hessian at each step.

The update formula for the weights is:

$$w_{k+1} = w_k - \alpha_k B_k^{-1} \nabla f(w_k)$$

Here, $B_k$ is the approximation of the Hessian matrix at iteration $k$ and $B_k^{-1}$ is its inverse ($\alpha_k$ is a step size, usually chosen by a line search). The method updates $B_k^{-1}$ iteratively based on the changes in the gradients.
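As a small sketch (function and variable names are illustrative, not from the original text), the step itself only needs the current inverse-Hessian approximation and the gradient:

```python
import numpy as np

def quasi_newton_step(w, grad_w, H_inv, alpha=1.0):
    """One parameter update: w_{k+1} = w_k - alpha * B_k^{-1} * grad f(w_k).

    H_inv is the current approximation of the inverse Hessian B_k^{-1};
    in practice alpha would come from a line search rather than being fixed.
    """
    return w - alpha * H_inv @ grad_w
```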

Popular quasi-Newton methods include:

  • BFGS (Broyden-Fletcher-Goldfarb-Shanno) Method: This is the most commonly used quasi-Newton method and maintains an estimate of the inverse Hessian.
  • L-BFGS (Limited-memory BFGS): A version of BFGS that is optimized for large-scale problems by storing only a limited number of gradients and updates (making it less memory intensive).

The BFGS method updates the inverse Hessian approximation $H_k \approx B_k^{-1}$ using the following formula:

$$H_{k+1} = \left(I - \rho_k s_k y_k^{T}\right) H_k \left(I - \rho_k y_k s_k^{T}\right) + \rho_k s_k s_k^{T}, \qquad \rho_k = \frac{1}{y_k^{T} s_k}$$

Where:

  • $s_k = w_{k+1} - w_k$ is the change in the parameters.
  • $y_k = \nabla f(w_{k+1}) - \nabla f(w_k)$ is the change in the gradients.

This update formula ensures that the new approximation is still symmetric and positive definite (provided the curvature condition $y_k^{T} s_k > 0$ holds), maintaining the important properties of the true Hessian.
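A minimal NumPy sketch of the BFGS inverse-Hessian update above (an illustrative implementation, not a production solver; the variable names s, y, H mirror the formula):

```python
import numpy as np

def bfgs_update(H, s, y):
    """BFGS update of the inverse-Hessian approximation H.

    s = w_{k+1} - w_k          (change in parameters)
    y = grad_{k+1} - grad_k    (change in gradients)
    Assumes the curvature condition y^T s > 0 holds, which keeps the
    updated matrix symmetric positive definite.
    """
    rho = 1.0 / (y @ s)
    I = np.eye(len(s))
    V = I - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)
```

Each call costs a few matrix-vector and matrix-matrix products, far cheaper than recomputing and inverting the true Hessian.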

5. Purpose of Inverse in Quasi-Newton Methods

In quasi-Newton methods, we are interested in approximating the inverse of the Hessian rather than the Hessian itself. The inverse Hessian is used to adjust the direction and size of the steps taken in each iteration.

The inverse of the Hessian helps to take more appropriate steps in the parameter space by scaling the gradient. Instead of blindly following the steepest descent direction (as in gradient descent), the inverse Hessian scales the gradient by the curvature, leading to faster convergence, especially near local minima.

For large problems, directly inverting the Hessian is impractical, so quasi-Newton methods maintain and update an approximation to the inverse Hessian based on the changes in gradients. The approximate inverse is used to adjust the gradient direction and size, effectively adapting the step size as the optimization progresses.
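The effect of this curvature scaling can be seen on a badly conditioned quadratic (a hypothetical example): the raw gradient is dominated by the steep direction, while the inverse-Hessian-scaled step points straight at the minimum.

```python
import numpy as np

# f(w) = 0.5 * w^T A w with very different curvature in each direction
A = np.diag([1.0, 100.0])
w = np.array([1.0, 1.0])
g = A @ w                         # gradient at w

gd_step = -g                      # steepest-descent direction
qn_step = -np.linalg.solve(A, g)  # curvature-scaled (Newton-like) direction

print(gd_step)   # [-1., -100.]  dominated by the steep axis
print(qn_step)   # [-1., -1.]    points directly at the minimum at the origin
```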

6. Significance of Quasi-Newton Methods

Quasi-Newton methods are important in optimization because they strike a balance between:

  • Accuracy: They retain much of the power of Newton’s method by using second-order information (through the Hessian approximation).
  • Efficiency: They avoid the expensive computations required for calculating and inverting the full Hessian matrix.

Why Quasi-Newton Methods Over Gradient Descent?

  • Faster Convergence: Quasi-Newton methods typically converge much faster than gradient descent because they account for the curvature of the objective function.
  • Adaptive Step Size: Quasi-Newton methods automatically adjust the step direction and size based on the approximated inverse Hessian. In contrast, plain gradient descent uses a single fixed learning rate, which may not be optimal for every iteration.

However, for extremely large problems, even quasi-Newton methods can become inefficient. This is where limited-memory versions like L-BFGS come into play, which only store a limited number of past gradients and updates, making them more scalable for large-scale optimization.
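For completeness, here is a compact sketch of the standard L-BFGS two-loop recursion, which computes the product of the implicit inverse-Hessian approximation with the gradient using only the m most recent (s, y) pairs (illustrative code; a real implementation adds safeguards such as curvature checks):

```python
import numpy as np

def lbfgs_direction(grad, s_list, y_list):
    """Two-loop recursion: approximates H_k @ grad from the stored (s, y) pairs.

    s_list and y_list hold the m most recent parameter and gradient
    differences, oldest first; the search direction is minus the result.
    """
    if not s_list:                       # no curvature information yet
        return grad.copy()

    q = grad.copy()
    rhos = [1.0 / (y @ s) for s, y in zip(s_list, y_list)]
    alphas = []

    # First loop: newest pair to oldest
    for s, y, rho in zip(reversed(s_list), reversed(y_list), reversed(rhos)):
        a = rho * (s @ q)
        alphas.append(a)
        q -= a * y
    alphas.reverse()                     # re-order to match s_list / y_list

    # Initial scaling of the implicit H_k^0 from the most recent pair
    gamma = (s_list[-1] @ y_list[-1]) / (y_list[-1] @ y_list[-1])
    r = gamma * q

    # Second loop: oldest pair to newest
    for s, y, rho, a in zip(s_list, y_list, rhos, alphas):
        beta = rho * (y @ r)
        r += (a - beta) * s
    return r
```

Because only m vector pairs are stored (typically m is between 5 and 20), the memory cost is O(m·n) rather than the O(n²) of a full inverse-Hessian approximation.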

7. Use of Quasi-Newton Methods in Gradient-Based Learning

In gradient-based learning, quasi-Newton methods offer a more sophisticated way to minimize the loss function, especially for problems where the objective function has complex curvature.

In machine learning:

  • BFGS or L-BFGS is often used for training models with a relatively small number of parameters (a short SciPy example follows this list).
  • In deep learning, because the number of parameters is usually very large, quasi-Newton methods are not as common. Instead, optimizers like Adam or RMSprop (which are also adaptive) are more frequently used, as they require less memory and fewer computational resources.
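In practice this usually means calling an existing solver rather than writing your own; for example, SciPy exposes L-BFGS through scipy.optimize.minimize. The toy least-squares problem below is hypothetical:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical toy problem: least squares for y ≈ X @ w
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=100)

def loss(w):
    r = X @ w - y
    return 0.5 * r @ r

def grad(w):
    return X.T @ (X @ w - y)

result = minimize(loss, x0=np.zeros(5), jac=grad, method="L-BFGS-B")
print(result.x)  # recovered weights, close to the true coefficients
```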

8. Comparison with Second-Order Methods

  • Second-Order Methods (Newton’s method): Directly compute the inverse Hessian or the Hessian itself, offering very fast convergence at the cost of high computational complexity.
  • Quasi-Newton Methods: Approximate the inverse Hessian using information from past iterations, providing a middle ground between fast convergence and lower computational cost.

For large-scale machine learning problems, quasi-Newton methods like L-BFGS are often favored for their efficiency and scalability.


Summary of Key Concepts:

  • Quasi-Newton Methods: Optimization methods that approximate the Hessian matrix and use it to perform efficient updates without explicitly computing the Hessian.
  • Hessian Approximation: In quasi-Newton methods, the Hessian is approximated by tracking changes in the gradient, leading to more informed step sizes and directions.
  • BFGS: A popular quasi-Newton method that iteratively refines its approximation of the inverse Hessian, providing faster convergence than gradient descent.
  • L-BFGS: A limited-memory version of BFGS optimized for large-scale problems by storing only a small number of past updates.

Quasi-Newton methods are powerful tools in optimization, balancing the accuracy of second-order methods with the efficiency of first-order methods, making them widely used in various machine learning tasks.


Rank-One and Low-Rank Updates:

Rank of a Matrix

The rank of a matrix is a fundamental concept in linear algebra that represents the maximum number of linearly independent rows or columns in the matrix. Essentially, it tells us how much “information” or “dimension” the matrix contains.

Formal Definition

Given a matrix $A$ with dimensions $m \times n$, the rank of $A$, denoted as $\text{rank}(A)$, is the maximum number of linearly independent rows or columns in the matrix. In other words, it is the dimension of the vector space spanned by either the rows or the columns of the matrix.

For any matrix, the row rank (number of linearly independent rows) is equal to the column rank (number of linearly independent columns), so we can simply refer to this as the rank of the matrix.

Interpretation

  • If the rank of a matrix is equal to the smaller of its two dimensions (i.e., for an $m \times n$ matrix, $\text{rank}(A) = \min(m, n)$), the matrix is said to have full rank. This means that all rows (or columns) are linearly independent, and the matrix has maximal “information content”.
  • If the rank of a matrix is less than $\min(m, n)$, it is said to be rank-deficient. This indicates that some rows or columns are linearly dependent, reducing the effective dimensionality of the matrix.

Rank-One and Low-Rank Matrices

  • A rank-one matrix is a matrix that can be written as the outer product of two vectors. For example, for vectors $u$ and $v$, the matrix $uv^{T}$ has rank one. All rows (or columns) of this matrix are linearly dependent on each other.
  • A low-rank matrix is a matrix whose rank is significantly smaller than its dimensions. Low-rank matrices often appear in practical applications, such as in compressed data representations and optimization algorithms like quasi-Newton methods, where only small-rank updates are required.

Why is the Rank Important?

  • Solving Systems of Linear Equations: The rank of a matrix determines whether a system of linear equations has a unique solution, infinitely many solutions, or no solution. A matrix with full rank ensures that the system has a unique solution.
  • Matrix Inverses: A square matrix has an inverse if and only if it has full rank (i.e., its rank is equal to its dimension). A rank-deficient square matrix is singular and does not have an inverse (a quick numerical check follows this list).
  • Dimensionality: In applications like machine learning and data analysis, the rank of a matrix provides insight into the dimensionality of the data. Low-rank matrices indicate that the data lies in a lower-dimensional subspace, which can be useful for dimensionality reduction techniques like Principal Component Analysis (PCA).
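A quick NumPy check of the invertibility point, using a hypothetical 2×2 matrix whose second row is a multiple of the first:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [2.0, 4.0]])           # second row = 2 * first row

print(np.linalg.matrix_rank(A))      # 1  (rank-deficient)
print(np.linalg.det(A))              # ~0 (singular)

try:
    np.linalg.inv(A)
except np.linalg.LinAlgError as e:
    print("not invertible:", e)      # inversion fails for a singular matrix
```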

How to Calculate Rank

The rank of a matrix can be computed using various methods:

  • Gaussian Elimination: By performing row reduction on a matrix to bring it to row echelon form, the number of non-zero rows corresponds to the rank of the matrix.
  • Singular Value Decomposition (SVD): Another method involves decomposing the matrix into the product of three matrices, where the rank is given by the number of non-zero singular values.
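Both approaches are available in NumPy; np.linalg.matrix_rank itself uses an SVD-based criterion. The example matrix below is hypothetical:

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],       # = 2 * row 1, so linearly dependent
              [1.0, 0.0, 1.0]])

print(np.linalg.matrix_rank(A))      # 2

# Equivalent SVD view: the rank is the number of "non-zero" singular values
singular_values = np.linalg.svd(A, compute_uv=False)
print(singular_values)               # two significant values, one near zero
```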

Summary

The rank of a matrix is a measure of the matrix’s linear independence and dimensionality. It plays a crucial role in determining properties such as the solvability of systems of equations, invertibility of matrices, and the effective dimensionality in data. In optimization, rank-one and low-rank matrices are used for computational efficiency in updates and approximations of larger, complex matrices.

Rank-one updates and low-rank updates are techniques used in optimization, particularly in quasi-Newton methods, to iteratively update an approximation of the Hessian matrix (or its inverse) without needing to compute it directly. These updates offer computational efficiency, reducing the complexity of operations involved in second-order optimization methods.

Rank-one updates

A rank-one update is a way to modify a matrix by adding or subtracting a matrix that has rank one. A rank-one matrix can be represented as the outer product of two vectors. If $u$ and $v$ are vectors of length $n$, then their outer product $uv^{T}$ is an $n \times n$ matrix with rank one.
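A small NumPy illustration of this (the vectors are arbitrary examples): every row of the outer product is a multiple of $v^{T}$, so the rank is one.

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

M = np.outer(u, v)                   # 3x3 matrix u v^T
print(M)
print(np.linalg.matrix_rank(M))      # 1
```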

In the context of quasi-Newton methods, a rank-one update modifies the current estimate of the inverse Hessian matrix (denoted $H_k$) to improve its accuracy while preserving its structure. Mathematically, a rank-one update can be written as:

$$H_{k+1} = H_k + u v^{T}$$

Here:

  • $H_k$ is the current approximation of the inverse Hessian matrix.
  • $u v^{T}$ is the rank-one correction term, where $u$ and $v$ are vectors computed based on the gradient differences between iterations or changes in the position of the parameters.

The idea is that adding a rank-one matrix to the current Hessian approximation adjusts the matrix slightly while keeping the computational cost low. However, simple rank-one updates may not always be stable or sufficient for all optimization problems, as they don’t guarantee positive definiteness of the Hessian matrix, which is critical for convergence.
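The classic rank-one quasi-Newton update is the symmetric rank-one (SR1) formula; a minimal sketch of its inverse-Hessian form is below (illustrative only: real implementations skip the update when the denominator is near zero, precisely because of the stability issue mentioned above).

```python
import numpy as np

def sr1_update(H, s, y, eps=1e-8):
    """Symmetric rank-one (SR1) update of the inverse-Hessian approximation.

    H_{k+1} = H_k + (s - H y)(s - H y)^T / ((s - H y)^T y)

    The correction is the outer product of a single vector with itself,
    i.e. a rank-one matrix. Unlike BFGS, the result is not guaranteed to
    stay positive definite, so the update is skipped when the denominator
    is too small relative to the vectors involved.
    """
    r = s - H @ y
    denom = r @ y
    if abs(denom) <= eps * np.linalg.norm(r) * np.linalg.norm(y):
        return H                     # skip an unstable update
    return H + np.outer(r, r) / denom
```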

Low-rank updates

A low-rank update extends the idea of a rank-one update by allowing the correction to have higher rank. Instead of just adding or subtracting a rank-one matrix, we can update the Hessian approximation by adding a matrix with rank greater than one but still much smaller than the full matrix dimension.

In practice, low-rank updates usually involve adding or subtracting a matrix with rank two or another small rank, depending on the method. The most common form of low-rank updates in optimization is the rank-two update, where the Hessian approximation is updated using a combination of two rank-one matrices. This is seen in methods like the BFGS (Broyden–Fletcher–Goldfarb–Shanno) update, which is a quasi-Newton method.

For instance, the BFGS update formula for the inverse Hessian approximation is:

$$H_{k+1} = \left(I - \rho_k s_k y_k^{T}\right) H_k \left(I - \rho_k y_k s_k^{T}\right) + \rho_k s_k s_k^{T}, \qquad \rho_k = \frac{1}{y_k^{T} s_k}$$

Here, the update combines two rank-one terms, making it a rank-two update (a quick numerical check of this follows the definitions below).

  • $y_k = \nabla f(w_{k+1}) - \nabla f(w_k)$, the difference in gradients between successive iterations.
  • $s_k = w_{k+1} - w_k$, the difference in parameters.
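As a quick numerical check (the toy quadratic and vectors are made up for illustration), the BFGS correction $H_{k+1} - H_k$ indeed has rank at most two:

```python
import numpy as np

n = 5
rng = np.random.default_rng(1)
A = np.diag([1.0, 2.0, 3.0, 4.0, 5.0])   # "true" Hessian of a toy quadratic
H = np.eye(n)                             # current inverse-Hessian approximation
s = rng.normal(size=n)                    # parameter difference
y = A @ s                                 # matching gradient difference

rho = 1.0 / (y @ s)
V = np.eye(n) - rho * np.outer(s, y)
H_new = V @ H @ V.T + rho * np.outer(s, s)

print(np.linalg.matrix_rank(H_new - H))   # 2: the correction is rank-two
```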

Significance of rank-one and low-rank updates

  • Efficiency: These updates are computationally much cheaper than recomputing the entire Hessian or its inverse at each iteration, which is especially important for large-scale optimization problems.
  • Memory savings: Instead of storing and manipulating a full Hessian matrix, which can be computationally prohibitive for large $n$ (an $n \times n$ Hessian has $n^2$ entries), we only need to store a few vectors or matrices of lower rank.
  • Preserving structure: These updates are designed to modify the Hessian approximation gradually while maintaining important properties like positive definiteness, ensuring convergence to a local minimum.

In summary, rank-one updates modify the Hessian approximation by adding a matrix with rank one, while low-rank updates generalize this idea by using matrices with a higher but still small rank. Both techniques improve computational efficiency in second-order optimization methods like quasi-Newton methods by reducing the need to compute the full Hessian matrix, allowing for faster convergence in large-scale problems.