198 Nesterov Accelerated Gradient
198.1 1. Introduction
Gradient descent is the workhorse of continuous optimization, but in its plain form it pays a steep price on ill-conditioned problems. When the curvature of the objective varies sharply across directions, gradient descent zigzags across narrow valleys and crawls along their floors. Momentum methods address this by accumulating a velocity that smooths the trajectory. Among them, Nesterov Accelerated Gradient (NAG), introduced by Yurii Nesterov in 1983, occupies a special place: it is not merely a heuristic improvement over classical momentum but a method that provably attains the optimal convergence rate for smooth convex problems within the class of first order methods.
This chapter develops NAG from three angles. First we recall classical momentum and expose its limitation. Then we introduce the lookahead gradient that defines NAG and explain why this single change of evaluation point matters. Finally we state and motivate the convergence rate, the celebrated \(O(1/k^2)\) guarantee, and contrast it with the \(O(1/k)\) rate of plain gradient descent.
198.2 2. Setting and Notation
We seek to minimize a differentiable function \(f : \mathbb{R}^n \to \mathbb{R}\),
\[ \min_{x \in \mathbb{R}^n} f(x). \]
We assume \(f\) is convex and that its gradient is \(L\)-Lipschitz continuous, meaning
\[ \|\nabla f(x) - \nabla f(y)\| \le L \|x - y\| \quad \text{for all } x, y. \]
The constant \(L\) bounds the curvature: it is an upper bound on the largest eigenvalue of the Hessian wherever the Hessian exists. A consequence of \(L\)-smoothness is the descent inequality
\[ f(y) \le f(x) + \nabla f(x)^\top (y - x) + \tfrac{L}{2}\|y - x\|^2, \]
which says \(f\) is dominated everywhere by a quadratic with curvature \(L\). Some results additionally assume strong convexity with parameter \(\mu > 0\), meaning \(f(y) \ge f(x) + \nabla f(x)^\top (y - x) + \tfrac{\mu}{2}\|y - x\|^2\). The ratio \(\kappa = L/\mu\) is the condition number, and it governs how badly plain gradient descent struggles.
198.3 3. Classical Momentum
198.3.1 3.1 Plain gradient descent
The baseline iteration is
\[ x_{k+1} = x_k - \eta \nabla f(x_k), \]
with step size \(\eta\). With \(\eta = 1/L\), gradient descent on a smooth convex objective satisfies \(f(x_k) - f^\star \le \frac{L \|x_0 - x^\star\|^2}{2k}\), an \(O(1/k)\) rate. On a strongly convex quadratic, the error contracts geometrically but at a rate tied to \(\kappa\), so a poorly conditioned problem needs many iterations.
198.3.2 3.2 The heavy ball
Polyak’s heavy ball method adds inertia. Picture a heavy ball rolling down the loss surface: it does not stop and reverse every time the local gradient flips sign, because its momentum carries it forward. The update maintains a velocity \(v_k\),
\[ v_{k+1} = \beta v_k - \eta \nabla f(x_k), \qquad x_{k+1} = x_k + v_{k+1}, \]
where \(\beta \in [0, 1)\) is the momentum coefficient. Expanding the recursion shows that \(v_{k+1}\) is an exponentially weighted sum of past gradients. Along a narrow valley, oscillating gradient components partly cancel in this sum while the consistent downhill component accumulates, so the ball accelerates along the floor of the valley and damps the zigzag across it.
The weakness of the heavy ball is that the gradient is always evaluated at the current point \(x_k\), before the inertial step is taken. The method commits to a momentum step and only afterward, at the next iteration, learns whether that step was wise. When momentum is large the iterate can overshoot and the correction arrives a step too late. On general convex functions the heavy ball does not improve the worst case rate over plain gradient descent, and it can even fail to converge on some non quadratic problems.
198.4 4. The Lookahead Gradient
198.4.1 4.1 Nesterov’s reordering
Nesterov’s insight is to evaluate the gradient not at the current iterate but at the point the momentum is about to carry us toward. We first take the inertial step to a lookahead point, then compute the gradient there, then apply the correction. One common formulation uses a velocity variable:
\[ v_{k+1} = \beta v_k - \eta \nabla f\big(x_k + \beta v_k\big), \qquad x_{k+1} = x_k + v_{k+1}. \]
The only change from the heavy ball is the argument of \(\nabla f\). The heavy ball evaluates \(\nabla f(x_k)\); NAG evaluates \(\nabla f(x_k + \beta v_k)\), the gradient at the lookahead point \(y_k = x_k + \beta v_k\).
198.4.2 4.2 Why the lookahead helps
The lookahead point is where pure momentum would land before any gradient correction. Evaluating the gradient there lets the method sense the slope at its anticipated destination and correct before committing. If the momentum step is about to overshoot a minimum along some direction, the gradient at the lookahead point already points back, so the correction counteracts the overshoot within the same iteration rather than one iteration later. The heavy ball reacts; NAG anticipates. This converts the late correction of the heavy ball into a within step damping that stabilizes large momentum and is the mechanism behind the faster guaranteed rate.
198.4.3 4.3 The two sequence form
The form used in convergence proofs separates the iterate from the extrapolated point explicitly. With step size \(\eta = 1/L\) it reads:
y_0 = x_0
for k = 0, 1, 2, ...
x_{k+1} = y_k - (1/L) grad f(y_k) # gradient step from the lookahead point
t_{k+1} = (1 + sqrt(1 + 4 t_k^2)) / 2 # momentum schedule
gamma = (t_k - 1) / t_{k+1}
y_{k+1} = x_{k+1} + gamma (x_{k+1} - x_k) # extrapolation
Here \(x_k\) is the sequence of solution estimates and \(y_k\) is the extrapolated lookahead sequence at which gradients are taken. The auxiliary scalars \(t_k\), with \(t_0 = 1\), define a time varying momentum coefficient \(\gamma_k = (t_k - 1)/t_{k+1}\) that grows toward \(1\) as iterations proceed. This schedule is not arbitrary; it is precisely what the convergence analysis requires.
198.4.4 4.4 Equivalence of the forms
The velocity form and the two sequence form describe the same algorithm under the identification \(v_{k} = x_{k} - x_{k-1}\) together with a matching choice of the coefficient. Setting \(y_k = x_k + \beta_k (x_k - x_{k-1})\) and taking a gradient step from \(y_k\) reproduces the velocity recursion when \(\beta_k = \gamma_k\). The two sequence presentation is favored for analysis because the extrapolation coefficient and the gradient evaluation point appear separately, which makes the estimate sequence argument transparent.
198.5 5. Convergence Rate for Convex Problems
198.5.1 5.1 The main guarantee
For convex \(f\) with \(L\)-Lipschitz gradient, NAG with step size \(1/L\) and the momentum schedule above satisfies
\[ f(x_k) - f^\star \le \frac{2 L \|x_0 - x^\star\|^2}{(k+1)^2}. \]
Compare this with the \(O(1/k)\) bound for plain gradient descent. To drive the optimality gap below a tolerance \(\varepsilon\), gradient descent needs \(O(1/\varepsilon)\) iterations whereas NAG needs only \(O(1/\sqrt{\varepsilon})\). For \(\varepsilon = 10^{-4}\) this is roughly the difference between ten thousand and one hundred iterations, a quadratic improvement in iteration count at essentially the same per iteration cost of one gradient evaluation.
198.5.2 5.2 Strongly convex case
When \(f\) is also \(\mu\)-strongly convex, a constant momentum coefficient \(\beta = \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\) yields linear convergence with rate governed by \(\sqrt{\kappa}\) rather than \(\kappa\):
\[ f(x_k) - f^\star \le C \left(1 - \frac{1}{\sqrt{\kappa}}\right)^{k}. \]
Plain gradient descent contracts at the slower rate \((1 - 1/\kappa)\). The replacement of \(\kappa\) by \(\sqrt{\kappa}\) is the same acceleration seen in the convex case, expressed through the condition number. For an ill conditioned problem with \(\kappa = 10^4\), the acceleration cuts the iteration count from order \(\kappa\) to order \(\sqrt{\kappa}\), a hundredfold reduction.
198.5.3 5.3 Optimality
Nesterov’s rate is not merely fast; it is optimal in a precise sense. Nemirovski and Yudin established a lower bound showing that any first order method, meaning any method whose iterates lie in the span of the gradients it has observed, must suffer
\[ f(x_k) - f^\star \ge \frac{3 L \|x_0 - x^\star\|^2}{32 (k+1)^2} \]
on some smooth convex function, for \(k\) not too large relative to the dimension. The lower bound matches the NAG upper bound up to a constant factor. No first order method can asymptotically beat the \(1/k^2\) rate on this problem class, so NAG is worst case optimal. This is why acceleration is regarded as a fundamental phenomenon rather than a tuning trick.
198.5.4 5.4 An intuition for the square root
A useful way to see where the improvement comes from is the estimate sequence machinery Nesterov introduced. One builds a sequence of simple quadratic models \(\phi_k\) that lower bound a running average of the gradient information and whose minima track \(f^\star\). The momentum schedule is chosen so that the gap between \(f(x_k)\) and the model shrinks like the product of factors \((1 - 1/t_k)\), and because \(t_k\) grows linearly in \(k\), this product decays like \(1/k^2\) rather than \(1/k\). The accumulation of past gradients into the extrapolated point is what allows each model to improve quadratically. Plain gradient descent uses only the current gradient and cannot build such an accelerating estimate.
198.6 6. Practical Considerations
198.6.1 6.1 Restarting
The momentum coefficient in the convex schedule increases toward \(1\), which is ideal asymptotically but can cause the iterate to oscillate when it overshoots a region of strong local curvature. Adaptive restarting resets the momentum schedule, setting \(t_k\) back to \(1\), whenever a simple indicator fires, for example when the objective increases or when the momentum step forms an obtuse angle with the gradient. Restarting often recovers the linear rate on locally strongly convex regions without knowing \(\mu\) in advance.
198.6.2 6.2 Use in deep learning
The form of NAG popularized for training neural networks rewrites the update so that the gradient is computed at the current parameters while still capturing the lookahead correction, which fits the structure of stochastic gradient frameworks. In the stochastic setting the clean \(O(1/k^2)\) theory does not transfer directly, since gradient noise dominates the deterministic acceleration, yet the Nesterov variant of momentum remains a robust default and is the basis of widely used optimizers. The lookahead idea also reappears in modern wrappers that periodically synchronize a fast inner optimizer with a slow set of weights.
198.6.3 6.3 Choosing parameters
In the deterministic smooth convex case the step size \(1/L\) and the prescribed momentum schedule require no tuning beyond an estimate of \(L\), which can be found by backtracking line search. In the strongly convex case the optimal \(\beta\) depends on \(\kappa\), which is rarely known exactly; restarting or a conservative constant momentum near \(0.9\) are common substitutes.
198.7 7. Summary
Classical momentum accelerates gradient descent by accumulating velocity, but it evaluates the gradient at the current point and therefore corrects overshoot only after the fact. Nesterov Accelerated Gradient changes a single thing: it evaluates the gradient at the lookahead point that momentum is about to reach, turning a delayed correction into a within step one. This anticipatory evaluation, combined with a specific momentum schedule, improves the convergence rate for smooth convex problems from \(O(1/k)\) to \(O(1/k^2)\), and improves the strongly convex contraction from \(\kappa\) to \(\sqrt{\kappa}\). These rates match known lower bounds for first order methods, making NAG optimal in its class and a cornerstone of modern optimization.
198.8 References
- Nesterov, Y. (1983). A method of solving a convex programming problem with convergence rate \(O(1/k^2)\). Soviet Mathematics Doklady, 27(2), 372-376. https://www.mathnet.ru/eng/dan46009
- Nesterov, Y. (2018). Lectures on Convex Optimization, 2nd ed. Springer. https://link.springer.com/book/10.1007/978-3-319-91578-4
- Polyak, B. T. (1964). Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5), 1-17. https://doi.org/10.1016/0041-5553(64)90137-5
- Beck, A., & Teboulle, M. (2009). A fast iterative shrinkage thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1), 183-202. https://doi.org/10.1137/080716542
- Nemirovski, A., & Yudin, D. (1983). Problem Complexity and Method Efficiency in Optimization. Wiley. https://www2.isye.gatech.edu/~nemirovs/
- Sutskever, I., Martens, J., Dahl, G., & Hinton, G. (2013). On the importance of initialization and momentum in deep learning. ICML. https://proceedings.mlr.press/v28/sutskever13.html
- O’Donoghue, B., & Candes, E. (2015). Adaptive restart for accelerated gradient schemes. Foundations of Computational Mathematics, 15(3), 715-732. https://doi.org/10.1007/s10208-013-9150-3
- Su, W., Boyd, S., & Candes, E. (2016). A differential equation for modeling Nesterov’s accelerated gradient method. Journal of Machine Learning Research, 17(153), 1-43. https://jmlr.org/papers/v17/15-084.html