198 Nesterov Accelerated Gradient

198.1 1. Introduction

Gradient descent is the workhorse of continuous optimization, but in its plain form it pays a steep price on ill-conditioned problems. When the curvature of the objective varies sharply across directions, gradient descent zigzags across narrow valleys and crawls along their floors. Momentum methods address this by accumulating a velocity that smooths the trajectory. Among them, Nesterov Accelerated Gradient (NAG), introduced by Yurii Nesterov in 1983, occupies a special place: it is not merely a heuristic improvement over classical momentum but a method that provably attains the optimal convergence rate for smooth convex problems within the class of first order methods.

This chapter develops NAG from three angles. First we recall classical momentum and expose its limitation. Then we introduce the lookahead gradient that defines NAG and explain why this single change of evaluation point matters. Finally we state and motivate the convergence rate, the celebrated $O(1/k^2)$ guarantee, and contrast it with the $O(1/k)$ rate of plain gradient descent.

198.2 2. Setting and Notation

We seek to minimize a differentiable function $f : \mathbb{R}^n \to \mathbb{R}$,

\[ \min_{x \in \mathbb{R}^n} f(x). \]

We assume $f$ is convex and that its gradient is $L$-Lipschitz continuous, meaning

\[ \|\nabla f(x) - \nabla f(y)\| \le L \|x - y\| \quad \text{for all } x, y. \]

The constant $L$ bounds the curvature: it is an upper bound on the largest eigenvalue of the Hessian wherever the Hessian exists. A consequence of $L$-smoothness is the descent inequality

\[ f(y) \le f(x) + \nabla f(x)^\top (y - x) + \tfrac{L}{2}\|y - x\|^2, \]

which says $f$ is dominated everywhere by a quadratic with curvature $L$. Some results additionally assume strong convexity with parameter $\mu > 0$, meaning $f(y) \ge f(x) + \nabla f(x)^\top (y - x) + \tfrac{\mu}{2}\|y - x\|^2$. The ratio $\kappa = L/\mu$ is the condition number, and it governs how badly plain gradient descent struggles.

198.3 3. Classical Momentum

198.3.1 3.1 Plain gradient descent

The baseline iteration is

\[ x_{k+1} = x_k - \eta \nabla f(x_k), \]

with step size $\eta$. With $\eta = 1/L$, gradient descent on a smooth convex objective satisfies $f(x_k) - f^\star \le \frac{L \|x_0 - x^\star\|^2}{2k}$, an $O(1/k)$ rate. On a strongly convex quadratic, the error contracts geometrically but at a rate tied to $\kappa$, so a poorly conditioned problem needs many iterations.

198.3.2 3.2 The heavy ball

Polyak’s heavy ball method adds inertia. Picture a heavy ball rolling down the loss surface: it does not stop and reverse every time the local gradient flips sign, because its momentum carries it forward. The update maintains a velocity $v_k$,

\[ v_{k+1} = \beta v_k - \eta \nabla f(x_k), \qquad x_{k+1} = x_k + v_{k+1}, \]

where $\beta \in [0, 1)$ is the momentum coefficient. Expanding the recursion shows that $v_{k+1}$ is an exponentially weighted sum of past gradients. Along a narrow valley, oscillating gradient components partly cancel in this sum while the consistent downhill component accumulates, so the ball accelerates along the floor of the valley and damps the zigzag across it.

The weakness of the heavy ball is that the gradient is always evaluated at the current point $x_k$, before the inertial step is taken. The method commits to a momentum step and only afterward, at the next iteration, learns whether that step was wise. When momentum is large the iterate can overshoot and the correction arrives a step too late. On general convex functions the heavy ball does not improve the worst case rate over plain gradient descent, and it can even fail to converge on some non quadratic problems.

198.4 4. The Lookahead Gradient

198.4.1 4.1 Nesterov’s reordering

Nesterov’s insight is to evaluate the gradient not at the current iterate but at the point the momentum is about to carry us toward. We first take the inertial step to a lookahead point, then compute the gradient there, then apply the correction. One common formulation uses a velocity variable:

\[ v_{k+1} = \beta v_k - \eta \nabla f\big(x_k + \beta v_k\big), \qquad x_{k+1} = x_k + v_{k+1}. \]

The only change from the heavy ball is the argument of $\nabla f$. The heavy ball evaluates $\nabla f(x_k)$; NAG evaluates $\nabla f(x_k + \beta v_k)$, the gradient at the lookahead point $y_k = x_k + \beta v_k$.

198.4.2 4.2 Why the lookahead helps

The lookahead point is where pure momentum would land before any gradient correction. Evaluating the gradient there lets the method sense the slope at its anticipated destination and correct before committing. If the momentum step is about to overshoot a minimum along some direction, the gradient at the lookahead point already points back, so the correction counteracts the overshoot within the same iteration rather than one iteration later. The heavy ball reacts; NAG anticipates. This converts the late correction of the heavy ball into a within step damping that stabilizes large momentum and is the mechanism behind the faster guaranteed rate.

198.4.3 4.3 The two sequence form

The form used in convergence proofs separates the iterate from the extrapolated point explicitly. With step size $\eta = 1/L$ it reads:

y_0 = x_0
for k = 0, 1, 2, ...
    x_{k+1} = y_k - (1/L) grad f(y_k)        # gradient step from the lookahead point
    t_{k+1} = (1 + sqrt(1 + 4 t_k^2)) / 2     # momentum schedule
    gamma   = (t_k - 1) / t_{k+1}
    y_{k+1} = x_{k+1} + gamma (x_{k+1} - x_k)  # extrapolation

Here $x_k$ is the sequence of solution estimates and $y_k$ is the extrapolated lookahead sequence at which gradients are taken. The auxiliary scalars $t_k$, with $t_0 = 1$, define a time varying momentum coefficient $\gamma_k = (t_k - 1)/t_{k+1}$ that grows toward $1$ as iterations proceed. This schedule is not arbitrary; it is precisely what the convergence analysis requires.

198.4.4 4.4 Equivalence of the forms

The velocity form and the two sequence form describe the same algorithm under the identification $v_{k} = x_{k} - x_{k-1}$ together with a matching choice of the coefficient. Setting $y_k = x_k + \beta_k (x_k - x_{k-1})$ and taking a gradient step from $y_k$ reproduces the velocity recursion when $\beta_k = \gamma_k$. The two sequence presentation is favored for analysis because the extrapolation coefficient and the gradient evaluation point appear separately, which makes the estimate sequence argument transparent.

198.5 5. Convergence Rate for Convex Problems

198.5.1 5.1 The main guarantee

For convex $f$ with $L$-Lipschitz gradient, NAG with step size $1/L$ and the momentum schedule above satisfies

\[ f(x_k) - f^\star \le \frac{2 L \|x_0 - x^\star\|^2}{(k+1)^2}. \]

Compare this with the $O(1/k)$ bound for plain gradient descent. To drive the optimality gap below a tolerance $\varepsilon$, gradient descent needs $O(1/\varepsilon)$ iterations whereas NAG needs only $O(1/\sqrt{\varepsilon})$. For $\varepsilon = 10^{-4}$ this is roughly the difference between ten thousand and one hundred iterations, a quadratic improvement in iteration count at essentially the same per iteration cost of one gradient evaluation.

198.5.2 5.2 Strongly convex case

When $f$ is also $\mu$-strongly convex, a constant momentum coefficient $\beta = \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}$ yields linear convergence with rate governed by $\sqrt{\kappa}$ rather than $\kappa$:

\[ f(x_k) - f^\star \le C \left(1 - \frac{1}{\sqrt{\kappa}}\right)^{k}. \]

Plain gradient descent contracts at the slower rate $(1 - 1/\kappa)$. The replacement of $\kappa$ by $\sqrt{\kappa}$ is the same acceleration seen in the convex case, expressed through the condition number. For an ill conditioned problem with $\kappa = 10^4$, the acceleration cuts the iteration count from order $\kappa$ to order $\sqrt{\kappa}$, a hundredfold reduction.

198.5.3 5.3 Optimality

Nesterov’s rate is not merely fast; it is optimal in a precise sense. Nemirovski and Yudin established a lower bound showing that any first order method, meaning any method whose iterates lie in the span of the gradients it has observed, must suffer

\[ f(x_k) - f^\star \ge \frac{3 L \|x_0 - x^\star\|^2}{32 (k+1)^2} \]

on some smooth convex function, for $k$ not too large relative to the dimension. The lower bound matches the NAG upper bound up to a constant factor. No first order method can asymptotically beat the $1/k^2$ rate on this problem class, so NAG is worst case optimal. This is why acceleration is regarded as a fundamental phenomenon rather than a tuning trick.

198.5.4 5.4 An intuition for the square root

A useful way to see where the improvement comes from is the estimate sequence machinery Nesterov introduced. One builds a sequence of simple quadratic models $\phi_k$ that lower bound a running average of the gradient information and whose minima track $f^\star$. The momentum schedule is chosen so that the gap between $f(x_k)$ and the model shrinks like the product of factors $(1 - 1/t_k)$, and because $t_k$ grows linearly in $k$, this product decays like $1/k^2$ rather than $1/k$. The accumulation of past gradients into the extrapolated point is what allows each model to improve quadratically. Plain gradient descent uses only the current gradient and cannot build such an accelerating estimate.

198.6 6. Practical Considerations

198.6.1 6.1 Restarting

The momentum coefficient in the convex schedule increases toward $1$, which is ideal asymptotically but can cause the iterate to oscillate when it overshoots a region of strong local curvature. Adaptive restarting resets the momentum schedule, setting $t_k$ back to $1$, whenever a simple indicator fires, for example when the objective increases or when the momentum step forms an obtuse angle with the gradient. Restarting often recovers the linear rate on locally strongly convex regions without knowing $\mu$ in advance.

198.6.2 6.2 Use in deep learning

The form of NAG popularized for training neural networks rewrites the update so that the gradient is computed at the current parameters while still capturing the lookahead correction, which fits the structure of stochastic gradient frameworks. In the stochastic setting the clean $O(1/k^2)$ theory does not transfer directly, since gradient noise dominates the deterministic acceleration, yet the Nesterov variant of momentum remains a robust default and is the basis of widely used optimizers. The lookahead idea also reappears in modern wrappers that periodically synchronize a fast inner optimizer with a slow set of weights.

198.6.3 6.3 Choosing parameters

In the deterministic smooth convex case the step size $1/L$ and the prescribed momentum schedule require no tuning beyond an estimate of $L$, which can be found by backtracking line search. In the strongly convex case the optimal $\beta$ depends on $\kappa$, which is rarely known exactly; restarting or a conservative constant momentum near $0.9$ are common substitutes.

198.7 7. Implementation

The companion library ships a from-scratch, gradient-oracle based implementation in all three languages, exercised by a single shared parity fixture. The test objective is the strictly convex quadratic

\[ f(x) = \tfrac{1}{2} x^\top A x - b^\top x, \qquad A = \begin{bmatrix} 5 & 1 \\ 1 & 3 \end{bmatrix}, \quad b = \begin{bmatrix} 1 \\ 2 \end{bmatrix}, \]

with gradient $\nabla f(x) = A x - b$ and unique minimizer $x^\star = A^{-1} b = [\tfrac{1}{14}, \tfrac{9}{14}]$. Its largest eigenvalue is $L = \lambda_{\max}(A) \approx 5.414$, so the step size $\eta = 1/6 < 1/L$ is admissible; the strongly convex form uses the round momentum $\beta = 1/2$. Both constants are clean rationals, so all three languages reproduce bit-for-bit identical iterates. nesterov_convex implements the two-sequence schedule of Section 4.3 and nesterov_momentum implements the velocity form of Section 4.1.

Code

import numpy as np
from aiinaction.ch193_nesterov import nesterov_convex, nesterov_momentum

A = np.array([[5.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
grad = lambda x: A @ x - b           # gradient oracle for f(x) = 0.5 xᵀAx - bᵀx

L = np.linalg.eigvalsh(A).max()      # smoothness constant
res = nesterov_convex(grad, [0.0, 0.0], step_size=1.0 / L, tol=1e-10)
print(f"two-sequence  x = {np.round(res.x, 6)}  iters = {res.n_iter}")

mu = np.linalg.eigvalsh(A).min()
kappa = L / mu
beta = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)
res_m = nesterov_momentum(grad, [0.0, 0.0], step_size=1.0 / L, momentum=beta, tol=1e-10)
print(f"momentum form x = {np.round(res_m.x, 6)}  iters = {res_m.n_iter}")
print(f"exact minimizer x* = {np.round(np.linalg.solve(A, b), 6)}")

two-sequence  x = [0.071429 0.642857]  iters = 54
momentum form x = [0.071429 0.642857]  iters = 23
exact minimizer x* = [0.071429 0.642857]

using AIInAction.Ch193Nesterov
using LinearAlgebra

A = [5.0 1.0; 1.0 3.0]
b = [1.0, 2.0]
grad(x) = A * x .- b

L = maximum(eigvals(A))
res = nesterov_convex(grad, [0.0, 0.0], 1.0 / L; tol=1e-10)
println("two-sequence  x = ", round.(res.x, digits=6), "  iters = ", res.n_iter)

mu = minimum(eigvals(A))
beta = (sqrt(L / mu) - 1) / (sqrt(L / mu) + 1)
res_m = nesterov_momentum(grad, [0.0, 0.0], 1.0 / L, beta; tol=1e-10)
println("momentum form x = ", round.(res_m.x, digits=6), "  iters = ", res_m.n_iter)

use aiinaction::ch193_nesterov::{nesterov_convex, nesterov_momentum};

fn grad(x: &[f64]) -> Vec<f64> {
    // f(x) = 0.5 xᵀAx - bᵀx with A = [[5,1],[1,3]], b = [1,2]
    vec![5.0 * x[0] + x[1] - 1.0, x[0] + 3.0 * x[1] - 2.0]
}

fn main() {
    let l = 5.414213562373095_f64; // largest eigenvalue of A
    let res = nesterov_convex(&grad, &[0.0, 0.0], 1.0 / l, 10_000, 1e-10).unwrap();
    println!("two-sequence  x = {:?}  iters = {}", res.x, res.n_iter);

    let mu = 2.5857864376269046_f64;
    let kappa = l / mu;
    let beta = (kappa.sqrt() - 1.0) / (kappa.sqrt() + 1.0);
    let res_m = nesterov_momentum(&grad, &[0.0, 0.0], 1.0 / l, beta, 10_000, 1e-10).unwrap();
    println!("momentum form x = {:?}  iters = {}", res_m.x, res_m.n_iter);
}

198.8 8. Summary

Classical momentum accelerates gradient descent by accumulating velocity, but it evaluates the gradient at the current point and therefore corrects overshoot only after the fact. Nesterov Accelerated Gradient changes a single thing: it evaluates the gradient at the lookahead point that momentum is about to reach, turning a delayed correction into a within step one. This anticipatory evaluation, combined with a specific momentum schedule, improves the convergence rate for smooth convex problems from $O(1/k)$ to $O(1/k^2)$, and improves the strongly convex contraction from $\kappa$ to $\sqrt{\kappa}$. These rates match known lower bounds for first order methods, making NAG optimal in its class and a cornerstone of modern optimization.

198.9 References

Nesterov, Y. (1983). A method of solving a convex programming problem with convergence rate $O(1/k^2)$. Soviet Mathematics Doklady, 27(2), 372-376. https://www.mathnet.ru/eng/dan46009
Nesterov, Y. (2018). Lectures on Convex Optimization, 2nd ed. Springer. https://link.springer.com/book/10.1007/978-3-319-91578-4
Polyak, B. T. (1964). Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5), 1-17. https://doi.org/10.1016/0041-5553(64)90137-5
Beck, A., & Teboulle, M. (2009). A fast iterative shrinkage thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1), 183-202. https://doi.org/10.1137/080716542
Nemirovski, A., & Yudin, D. (1983). Problem Complexity and Method Efficiency in Optimization. Wiley. https://www2.isye.gatech.edu/~nemirovs/
Sutskever, I., Martens, J., Dahl, G., & Hinton, G. (2013). On the importance of initialization and momentum in deep learning. ICML. https://proceedings.mlr.press/v28/sutskever13.html
O’Donoghue, B., & Candes, E. (2015). Adaptive restart for accelerated gradient schemes. Foundations of Computational Mathematics, 15(3), 715-732. https://doi.org/10.1007/s10208-013-9150-3
Su, W., Boyd, S., & Candes, E. (2016). A differential equation for modeling Nesterov’s accelerated gradient method. Journal of Machine Learning Research, 17(153), 1-43. https://jmlr.org/papers/v17/15-084.html

# Nesterov Accelerated Gradient ## 1. Introduction Gradient descent is the workhorse of continuous optimization, but in its plain form it pays a steep price on ill-conditioned problems. When the curvature of the objective varies sharply across directions, gradient descent zigzags across narrow valleys and crawls along their floors. Momentum methods address this by accumulating a velocity that smooths the trajectory. Among them, Nesterov Accelerated Gradient (NAG), introduced by Yurii Nesterov in 1983, occupies a special place: it is not merely a heuristic improvement over classical momentum but a method that provably attains the optimal convergence rate for smooth convex problems within the class of first order methods. This chapter develops NAG from three angles. First we recall classical momentum and expose its limitation. Then we introduce the lookahead gradient that defines NAG and explain why this single change of evaluation point matters. Finally we state and motivate the convergence rate, the celebrated $O(1/k^2)$ guarantee, and contrast it with the $O(1/k)$ rate of plain gradient descent. ## 2. Setting and Notation We seek to minimize a differentiable function $f : \mathbb{R}^n \to \mathbb{R}$, $$ \min_{x \in \mathbb{R}^n} f(x). $$ We assume $f$ is convex and that its gradient is $L$-Lipschitz continuous, meaning $$ \|\nabla f(x) - \nabla f(y)\| \le L \|x - y\| \quad \text{for all } x, y. $$ The constant $L$ bounds the curvature: it is an upper bound on the largest eigenvalue of the Hessian wherever the Hessian exists. A consequence of $L$-smoothness is the descent inequality $$ f(y) \le f(x) + \nabla f(x)^\top (y - x) + \tfrac{L}{2}\|y - x\|^2, $$ which says $f$ is dominated everywhere by a quadratic with curvature $L$. Some results additionally assume strong convexity with parameter $\mu > 0$, meaning $f(y) \ge f(x) + \nabla f(x)^\top (y - x) + \tfrac{\mu}{2}\|y - x\|^2$. The ratio $\kappa = L/\mu$ is the condition number, and it governs how badly plain gradient descent struggles. ## 3. Classical Momentum ### 3.1 Plain gradient descent The baseline iteration is $$ x_{k+1} = x_k - \eta \nabla f(x_k), $$ with step size $\eta$. With $\eta = 1/L$, gradient descent on a smooth convex objective satisfies $f(x_k) - f^\star \le \frac{L \|x_0 - x^\star\|^2}{2k}$, an $O(1/k)$ rate. On a strongly convex quadratic, the error contracts geometrically but at a rate tied to $\kappa$, so a poorly conditioned problem needs many iterations. ### 3.2 The heavy ball Polyak's heavy ball method adds inertia. Picture a heavy ball rolling down the loss surface: it does not stop and reverse every time the local gradient flips sign, because its momentum carries it forward. The update maintains a velocity $v_k$, $$ v_{k+1} = \beta v_k - \eta \nabla f(x_k), \qquad x_{k+1} = x_k + v_{k+1}, $$ where $\beta \in [0, 1)$ is the momentum coefficient. Expanding the recursion shows that $v_{k+1}$ is an exponentially weighted sum of past gradients. Along a narrow valley, oscillating gradient components partly cancel in this sum while the consistent downhill component accumulates, so the ball accelerates along the floor of the valley and damps the zigzag across it. The weakness of the heavy ball is that the gradient is always evaluated at the current point $x_k$, before the inertial step is taken. The method commits to a momentum step and only afterward, at the next iteration, learns whether that step was wise. When momentum is large the iterate can overshoot and the correction arrives a step too late. On general convex functions the heavy ball does not improve the worst case rate over plain gradient descent, and it can even fail to converge on some non quadratic problems. ## 4. The Lookahead Gradient ### 4.1 Nesterov's reordering Nesterov's insight is to evaluate the gradient not at the current iterate but at the point the momentum is about to carry us toward. We first take the inertial step to a lookahead point, then compute the gradient there, then apply the correction. One common formulation uses a velocity variable: $$ v_{k+1} = \beta v_k - \eta \nabla f\big(x_k + \beta v_k\big), \qquad x_{k+1} = x_k + v_{k+1}. $$ The only change from the heavy ball is the argument of $\nabla f$. The heavy ball evaluates $\nabla f(x_k)$; NAG evaluates $\nabla f(x_k + \beta v_k)$, the gradient at the lookahead point $y_k = x_k + \beta v_k$. ### 4.2 Why the lookahead helps The lookahead point is where pure momentum would land before any gradient correction. Evaluating the gradient there lets the method sense the slope at its anticipated destination and correct before committing. If the momentum step is about to overshoot a minimum along some direction, the gradient at the lookahead point already points back, so the correction counteracts the overshoot within the same iteration rather than one iteration later. The heavy ball reacts; NAG anticipates. This converts the late correction of the heavy ball into a within step damping that stabilizes large momentum and is the mechanism behind the faster guaranteed rate. ### 4.3 The two sequence form The form used in convergence proofs separates the iterate from the extrapolated point explicitly. With step size $\eta = 1/L$ it reads: ```text y_0 = x_0 for k = 0, 1, 2, ... x_{k+1} = y_k - (1/L) grad f(y_k) # gradient step from the lookahead point t_{k+1} = (1 + sqrt(1 + 4 t_k^2)) / 2 # momentum schedule gamma = (t_k - 1) / t_{k+1} y_{k+1} = x_{k+1} + gamma (x_{k+1} - x_k) # extrapolation ``` Here $x_k$ is the sequence of solution estimates and $y_k$ is the extrapolated lookahead sequence at which gradients are taken. The auxiliary scalars $t_k$, with $t_0 = 1$, define a time varying momentum coefficient $\gamma_k = (t_k - 1)/t_{k+1}$ that grows toward $1$ as iterations proceed. This schedule is not arbitrary; it is precisely what the convergence analysis requires. ### 4.4 Equivalence of the forms The velocity form and the two sequence form describe the same algorithm under the identification $v_{k} = x_{k} - x_{k-1}$ together with a matching choice of the coefficient. Setting $y_k = x_k + \beta_k (x_k - x_{k-1})$ and taking a gradient step from $y_k$ reproduces the velocity recursion when $\beta_k = \gamma_k$. The two sequence presentation is favored for analysis because the extrapolation coefficient and the gradient evaluation point appear separately, which makes the estimate sequence argument transparent. ## 5. Convergence Rate for Convex Problems ### 5.1 The main guarantee For convex $f$ with $L$-Lipschitz gradient, NAG with step size $1/L$ and the momentum schedule above satisfies $$ f(x_k) - f^\star \le \frac{2 L \|x_0 - x^\star\|^2}{(k+1)^2}. $$ Compare this with the $O(1/k)$ bound for plain gradient descent. To drive the optimality gap below a tolerance $\varepsilon$, gradient descent needs $O(1/\varepsilon)$ iterations whereas NAG needs only $O(1/\sqrt{\varepsilon})$. For $\varepsilon = 10^{-4}$ this is roughly the difference between ten thousand and one hundred iterations, a quadratic improvement in iteration count at essentially the same per iteration cost of one gradient evaluation. ### 5.2 Strongly convex case When $f$ is also $\mu$-strongly convex, a constant momentum coefficient $\beta = \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}$ yields linear convergence with rate governed by $\sqrt{\kappa}$ rather than $\kappa$: $$ f(x_k) - f^\star \le C \left(1 - \frac{1}{\sqrt{\kappa}}\right)^{k}. $$ Plain gradient descent contracts at the slower rate $(1 - 1/\kappa)$. The replacement of $\kappa$ by $\sqrt{\kappa}$ is the same acceleration seen in the convex case, expressed through the condition number. For an ill conditioned problem with $\kappa = 10^4$, the acceleration cuts the iteration count from order $\kappa$ to order $\sqrt{\kappa}$, a hundredfold reduction. ### 5.3 Optimality Nesterov's rate is not merely fast; it is optimal in a precise sense. Nemirovski and Yudin established a lower bound showing that any first order method, meaning any method whose iterates lie in the span of the gradients it has observed, must suffer $$ f(x_k) - f^\star \ge \frac{3 L \|x_0 - x^\star\|^2}{32 (k+1)^2} $$ on some smooth convex function, for $k$ not too large relative to the dimension. The lower bound matches the NAG upper bound up to a constant factor. No first order method can asymptotically beat the $1/k^2$ rate on this problem class, so NAG is worst case optimal. This is why acceleration is regarded as a fundamental phenomenon rather than a tuning trick. ### 5.4 An intuition for the square root A useful way to see where the improvement comes from is the estimate sequence machinery Nesterov introduced. One builds a sequence of simple quadratic models $\phi_k$ that lower bound a running average of the gradient information and whose minima track $f^\star$. The momentum schedule is chosen so that the gap between $f(x_k)$ and the model shrinks like the product of factors $(1 - 1/t_k)$, and because $t_k$ grows linearly in $k$, this product decays like $1/k^2$ rather than $1/k$. The accumulation of past gradients into the extrapolated point is what allows each model to improve quadratically. Plain gradient descent uses only the current gradient and cannot build such an accelerating estimate. ## 6. Practical Considerations ### 6.1 Restarting The momentum coefficient in the convex schedule increases toward $1$, which is ideal asymptotically but can cause the iterate to oscillate when it overshoots a region of strong local curvature. Adaptive restarting resets the momentum schedule, setting $t_k$ back to $1$, whenever a simple indicator fires, for example when the objective increases or when the momentum step forms an obtuse angle with the gradient. Restarting often recovers the linear rate on locally strongly convex regions without knowing $\mu$ in advance. ### 6.2 Use in deep learning The form of NAG popularized for training neural networks rewrites the update so that the gradient is computed at the current parameters while still capturing the lookahead correction, which fits the structure of stochastic gradient frameworks. In the stochastic setting the clean $O(1/k^2)$ theory does not transfer directly, since gradient noise dominates the deterministic acceleration, yet the Nesterov variant of momentum remains a robust default and is the basis of widely used optimizers. The lookahead idea also reappears in modern wrappers that periodically synchronize a fast inner optimizer with a slow set of weights. ### 6.3 Choosing parameters In the deterministic smooth convex case the step size $1/L$ and the prescribed momentum schedule require no tuning beyond an estimate of $L$, which can be found by backtracking line search. In the strongly convex case the optimal $\beta$ depends on $\kappa$, which is rarely known exactly; restarting or a conservative constant momentum near $0.9$ are common substitutes. ## 7. Implementation The companion library ships a from-scratch, gradient-oracle based implementation in all three languages, exercised by a single shared parity fixture. The test objective is the strictly convex quadratic $$ f(x) = \tfrac{1}{2} x^\top A x - b^\top x, \qquad A = \begin{bmatrix} 5 & 1 \\ 1 & 3 \end{bmatrix}, \quad b = \begin{bmatrix} 1 \\ 2 \end{bmatrix}, $$ with gradient $\nabla f(x) = A x - b$ and unique minimizer $x^\star = A^{-1} b = [\tfrac{1}{14}, \tfrac{9}{14}]$. Its largest eigenvalue is $L = \lambda_{\max}(A) \approx 5.414$, so the step size $\eta = 1/6 < 1/L$ is admissible; the strongly convex form uses the round momentum $\beta = 1/2$. Both constants are clean rationals, so all three languages reproduce bit-for-bit identical iterates. `nesterov_convex` implements the two-sequence schedule of Section 4.3 and `nesterov_momentum` implements the velocity form of Section 4.1. ::: {.panel-tabset} ## Python ```{python} import numpy as np from aiinaction.ch193_nesterov import nesterov_convex, nesterov_momentum A = np.array([[5.0, 1.0], [1.0, 3.0]]) b = np.array([1.0, 2.0]) grad = lambda x: A @ x - b # gradient oracle for f(x) = 0.5 xᵀAx - bᵀx L = np.linalg.eigvalsh(A).max() # smoothness constant res = nesterov_convex(grad, [0.0, 0.0], step_size=1.0 / L, tol=1e-10) print(f"two-sequence x = {np.round(res.x, 6)} iters = {res.n_iter}") mu = np.linalg.eigvalsh(A).min() kappa = L / mu beta = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1) res_m = nesterov_momentum(grad, [0.0, 0.0], step_size=1.0 / L, momentum=beta, tol=1e-10) print(f"momentum form x = {np.round(res_m.x, 6)} iters = {res_m.n_iter}") print(f"exact minimizer x* = {np.round(np.linalg.solve(A, b), 6)}") ``` ## Julia ```julia using AIInAction.Ch193Nesterov using LinearAlgebra A = [5.0 1.0; 1.0 3.0] b = [1.0, 2.0] grad(x) = A * x .- b L = maximum(eigvals(A)) res = nesterov_convex(grad, [0.0, 0.0], 1.0 / L; tol=1e-10) println("two-sequence x = ", round.(res.x, digits=6), " iters = ", res.n_iter) mu = minimum(eigvals(A)) beta = (sqrt(L / mu) - 1) / (sqrt(L / mu) + 1) res_m = nesterov_momentum(grad, [0.0, 0.0], 1.0 / L, beta; tol=1e-10) println("momentum form x = ", round.(res_m.x, digits=6), " iters = ", res_m.n_iter) ``` ## Rust ```rust use aiinaction::ch193_nesterov::{nesterov_convex, nesterov_momentum}; fn grad(x: &[f64]) -> Vec<f64> { // f(x) = 0.5 xᵀAx - bᵀx with A = [[5,1],[1,3]], b = [1,2] vec![5.0 * x[0] + x[1] - 1.0, x[0] + 3.0 * x[1] - 2.0] } fn main() { let l = 5.414213562373095_f64; // largest eigenvalue of A let res = nesterov_convex(&grad, &[0.0, 0.0], 1.0 / l, 10_000, 1e-10).unwrap(); println!("two-sequence x = {:?} iters = {}", res.x, res.n_iter); let mu = 2.5857864376269046_f64; let kappa = l / mu; let beta = (kappa.sqrt() - 1.0) / (kappa.sqrt() + 1.0); let res_m = nesterov_momentum(&grad, &[0.0, 0.0], 1.0 / l, beta, 10_000, 1e-10).unwrap(); println!("momentum form x = {:?} iters = {}", res_m.x, res_m.n_iter); } ``` ::: ## 8. Summary Classical momentum accelerates gradient descent by accumulating velocity, but it evaluates the gradient at the current point and therefore corrects overshoot only after the fact. Nesterov Accelerated Gradient changes a single thing: it evaluates the gradient at the lookahead point that momentum is about to reach, turning a delayed correction into a within step one. This anticipatory evaluation, combined with a specific momentum schedule, improves the convergence rate for smooth convex problems from $O(1/k)$ to $O(1/k^2)$, and improves the strongly convex contraction from $\kappa$ to $\sqrt{\kappa}$. These rates match known lower bounds for first order methods, making NAG optimal in its class and a cornerstone of modern optimization. ## References 1. Nesterov, Y. (1983). A method of solving a convex programming problem with convergence rate $O(1/k^2)$. Soviet Mathematics Doklady, 27(2), 372-376. https://www.mathnet.ru/eng/dan46009 2. Nesterov, Y. (2018). Lectures on Convex Optimization, 2nd ed. Springer. https://link.springer.com/book/10.1007/978-3-319-91578-4 3. Polyak, B. T. (1964). Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5), 1-17. https://doi.org/10.1016/0041-5553(64)90137-5 4. Beck, A., & Teboulle, M. (2009). A fast iterative shrinkage thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1), 183-202. https://doi.org/10.1137/080716542 5. Nemirovski, A., & Yudin, D. (1983). Problem Complexity and Method Efficiency in Optimization. Wiley. https://www2.isye.gatech.edu/~nemirovs/ 6. Sutskever, I., Martens, J., Dahl, G., & Hinton, G. (2013). On the importance of initialization and momentum in deep learning. ICML. https://proceedings.mlr.press/v28/sutskever13.html 7. O'Donoghue, B., & Candes, E. (2015). Adaptive restart for accelerated gradient schemes. Foundations of Computational Mathematics, 15(3), 715-732. https://doi.org/10.1007/s10208-013-9150-3 8. Su, W., Boyd, S., & Candes, E. (2016). A differential equation for modeling Nesterov's accelerated gradient method. Journal of Machine Learning Research, 17(153), 1-43. https://jmlr.org/papers/v17/15-084.html