197  Momentum and the Heavy-Ball Method

197.1 1. Introduction

Plain gradient descent updates parameters by stepping directly down the local gradient. This is locally optimal in a myopic sense, but it is frequently inefficient on the kind of loss surfaces that arise in machine learning. When the loss is shaped like a long, narrow valley, plain gradient descent zigzags across the steep walls while creeping slowly along the gently sloping floor. Each individual step is reasonable, yet the trajectory wastes most of its motion bouncing back and forth.

Momentum is a modification that gives the optimizer memory. Instead of stepping along the current gradient alone, momentum accumulates a velocity vector that blends the current gradient with the recent history of gradients. Directions that persist across many steps reinforce one another and accelerate, while directions that flip sign from step to step cancel out and are damped. The result is faster progress along consistent descent directions and suppressed oscillation across the narrow ones.

This chapter develops momentum from three complementary viewpoints. We treat it as the heavy-ball method of Polyak, as an exponential moving average of gradients, and as a discretization of a physical system with mass and friction. Each view illuminates why momentum accelerates and why it stabilizes.

197.2 2. The Heavy-Ball Method

197.2.1 2.1 The Update Rule

Let \(f : \mathbb{R}^d \to \mathbb{R}\) be the objective and let \(\theta_t\) denote the parameters at iteration \(t\). Plain gradient descent is

\[ \theta_{t+1} = \theta_t - \alpha \nabla f(\theta_t), \]

where \(\alpha > 0\) is the learning rate. Polyak’s heavy-ball method augments this with a momentum term proportional to the previous step:

\[ \theta_{t+1} = \theta_t - \alpha \nabla f(\theta_t) + \beta (\theta_t - \theta_{t-1}), \]

where \(\beta \in [0, 1)\) is the momentum coefficient. The term \(\theta_t - \theta_{t-1}\) is the displacement produced by the last update, so the method literally keeps moving in the direction it was already going, modulated by \(\beta\).

197.2.2 2.2 The Velocity Formulation

It is cleaner to introduce an explicit velocity vector \(v_t\) and rewrite the recursion as a two-line update:

\[ v_{t} = \beta v_{t-1} + \nabla f(\theta_t), \]

\[ \theta_{t+1} = \theta_t - \alpha v_t. \]

Here \(v_t\) accumulates gradients with a geometric decay governed by \(\beta\). Setting \(\beta = 0\) recovers plain gradient descent, since \(v_t = \nabla f(\theta_t)\). As \(\beta\) approaches \(1\), the velocity retains more of its past, and the optimizer behaves more like a massive object that resists abrupt changes in direction.

v <- beta * v + grad(theta)
theta <- theta - alpha * v

A common equivalent variant scales the gradient by \((1 - \beta)\) before accumulating, which makes \(v_t\) a proper weighted average rather than a sum. The two forms differ only by a constant rescaling of the effective learning rate, so the qualitative behavior is identical.

197.3 3. Momentum as an Exponential Moving Average

197.3.1 3.1 Unrolling the Recursion

The velocity recursion \(v_t = \beta v_{t-1} + g_t\), with \(g_t := \nabla f(\theta_t)\), can be unrolled explicitly. Assuming \(v_{-1} = 0\),

\[ v_t = \sum_{k=0}^{t} \beta^{\,t-k} g_k = g_t + \beta g_{t-1} + \beta^2 g_{t-2} + \cdots \]

This is an exponentially weighted sum of all past gradients. The most recent gradient carries weight \(1\), the one before it weight \(\beta\), the one before that weight \(\beta^2\), and so on. Because \(0 \le \beta < 1\), contributions from the distant past decay geometrically and are effectively forgotten.

197.3.2 3.2 The Effective Averaging Window

The sum of the weights in a fully developed velocity is

\[ \sum_{k=0}^{\infty} \beta^{k} = \frac{1}{1 - \beta}. \]

This quantity is the effective number of gradients being averaged. With \(\beta = 0.9\) the optimizer averages roughly the last \(10\) gradients, and with \(\beta = 0.99\) roughly the last \(100\). This is why momentum is described as smoothing the gradient signal. Noisy or rapidly fluctuating components of the gradient are averaged toward zero, while the slowly varying mean direction survives.

The averaging interpretation explains the central trade-off in choosing \(\beta\). A larger \(\beta\) averages over a longer window, which suppresses more noise and builds more speed along consistent directions, but it also makes the optimizer slower to react when the true descent direction changes. A smaller \(\beta\) stays responsive but retains less of the accelerating and smoothing benefit.

197.4 4. Why Momentum Accelerates and Dampens Oscillation

197.4.1 4.1 The Anisotropic Quadratic Model

The cleanest analysis uses a convex quadratic, which locally approximates any smooth loss near a minimum. Take

\[ f(\theta) = \frac{1}{2} \theta^\top H \theta, \]

with \(H\) symmetric positive definite. Because \(H\) can be diagonalized in an orthonormal eigenbasis, the dynamics decouple into independent scalar problems, one per eigenvalue \(\lambda_i\) of \(H\). Along eigendirection \(i\) the gradient is \(\lambda_i \theta_i\), and the heavy-ball update becomes a one-dimensional linear recurrence.

The eigenvalues span a range from the smallest, \(\lambda_{\min}\), to the largest, \(\lambda_{\max}\). Their ratio,

\[ \kappa = \frac{\lambda_{\max}}{\lambda_{\min}}, \]

is the condition number. A large \(\kappa\) is exactly the long narrow valley: steep curvature along high-eigenvalue directions and shallow curvature along low-eigenvalue ones.

197.4.2 4.2 The Two Regimes

Consider how a single eigendirection evolves. Along a high-curvature direction, plain gradient descent with a learning rate near its stability limit overshoots the minimum and reverses on each step, so the iterates oscillate. The successive gradients \(g_t, g_{t+1}, \dots\) alternate in sign. When momentum sums gradients of alternating sign, the terms partially cancel, and the velocity in that direction stays small. Oscillation is therefore damped.

Along a low-curvature direction the gradient points the same way for many consecutive steps. Momentum sums gradients of the same sign, so the velocity grows, approaching a magnitude of roughly \(1/(1-\beta)\) times the per-step gradient. Progress in that direction is amplified. This asymmetry is the essence of momentum: it accelerates where the signal is consistent and brakes where the signal oscillates.

197.4.3 4.3 The Convergence Rate

For the quadratic model with optimally tuned \(\alpha\) and \(\beta\), the heavy-ball method achieves a contraction factor governed by \(\sqrt{\kappa}\) rather than \(\kappa\). Plain gradient descent contracts the error per step by a factor of order

\[ \frac{\kappa - 1}{\kappa + 1}, \]

while optimally tuned momentum contracts by a factor of order

\[ \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}. \]

The optimal momentum parameters are

\[ \alpha = \left( \frac{2}{\sqrt{\lambda_{\max}} + \sqrt{\lambda_{\min}}} \right)^2, \qquad \beta = \left( \frac{\sqrt{\lambda_{\max}} - \sqrt{\lambda_{\min}}}{\sqrt{\lambda_{\max}} + \sqrt{\lambda_{\min}}} \right)^2 . \]

The replacement of \(\kappa\) by \(\sqrt{\kappa}\) is a genuine acceleration. For an ill-conditioned problem with \(\kappa = 10^4\), the number of iterations needed to reach a target accuracy drops by roughly a factor of \(\sqrt{\kappa} = 100\). This is the formal sense in which momentum is fast: it improves the dependence of the iteration count on the conditioning of the problem from linear in \(\kappa\) to linear in \(\sqrt{\kappa}\).

197.5 5. The Physical Analogy

197.5.1 5.1 A Ball Rolling Under Friction

The name heavy-ball comes from a mechanical picture. Imagine a ball of mass \(m\) rolling on the loss surface, subject to a force equal to the negative gradient and a viscous friction proportional to its velocity. Newton’s second law gives the continuous-time equation of motion

\[ m \, \ddot{\theta}(s) + \mu \, \dot{\theta}(s) + \nabla f(\theta(s)) = 0, \]

where \(s\) is continuous time, \(\mu\) is the friction coefficient, and dots denote time derivatives. The gradient acts as a position-dependent force pulling the ball downhill, mass gives the ball inertia so it cannot change direction instantly, and friction removes energy so the ball eventually settles at a minimum.

197.5.2 5.2 From Physics to the Update

Discretizing this differential equation with a finite time step recovers the heavy-ball recursion. The second derivative \(\ddot{\theta}\) becomes a finite difference involving \(\theta_{t+1}\), \(\theta_t\), and \(\theta_{t-1}\), and the first derivative \(\dot{\theta}\) becomes \(\theta_t - \theta_{t-1}\). Collecting terms reproduces

\[ \theta_{t+1} = \theta_t - \alpha \nabla f(\theta_t) + \beta(\theta_t - \theta_{t-1}), \]

with the learning rate \(\alpha\) set by the mass and time step, and the momentum coefficient \(\beta\) set by the friction. Low friction corresponds to \(\beta\) near \(1\): the ball coasts a long way and oscillates before settling. High friction corresponds to \(\beta\) near \(0\): the ball is sluggish and behaves like overdamped gradient descent.

197.5.3 5.3 What the Analogy Explains

The mechanical view makes the behavior intuitive. Inertia is why the optimizer powers through small bumps and shallow local irregularities that would trap a memoryless method, since a moving mass carries kinetic energy across them. Inertia is also why momentum can overshoot a sharp minimum and oscillate around it before friction dissipates the excess energy. The friction term is what guarantees the ball does not orbit forever; without it, energy would be conserved and the ball would oscillate indefinitely. Tuning \(\beta\) is choosing how much friction to apply, balancing the desire to coast quickly against the need to come to rest.

The analogy also clarifies the failure modes. Too little friction, meaning \(\beta\) too close to \(1\), leaves the system underdamped, and the trajectory overshoots and rings. Too much friction recovers the slow behavior we wanted to escape. The sweet spot is the critically damped regime, where the ball reaches the minimum as quickly as possible without overshooting, and this is precisely the intuition behind the optimal \(\beta\) derived from the quadratic model.

197.6 6. Practical Considerations

In stochastic settings, gradients are estimated from minibatches and are therefore noisy. The exponential moving average view is especially relevant here, because momentum averages out the per-batch noise and yields a more stable descent direction than any single noisy gradient. A typical default is \(\beta = 0.9\), which averages roughly the last ten gradients and works well across a wide range of deep learning problems.

A practical caution concerns the interaction of \(\alpha\) and \(\beta\). Because the steady-state velocity scales like \(1/(1-\beta)\), increasing \(\beta\) effectively increases the size of accumulated steps. Raising \(\beta\) without lowering \(\alpha\) can push the optimizer into the underdamped or unstable regime, producing visible oscillation in the loss curve. When the formulation rescales the gradient by \((1 - \beta)\), this coupling is partly absorbed, which is one reason that variant is often preferred in libraries.

Nesterov’s accelerated gradient is a closely related method that evaluates the gradient at a look-ahead point \(\theta_t + \beta(\theta_t - \theta_{t-1})\) rather than at \(\theta_t\). This anticipatory correction improves the worst-case convergence guarantees for smooth convex objectives and often behaves slightly better in practice, but it shares the same accumulating velocity mechanism analyzed above.

197.7 7. Summary

Momentum equips gradient descent with a velocity that is an exponentially weighted average of past gradients. Viewed as the heavy-ball method, it adds a fraction of the previous step to the current one. Viewed as an exponential moving average, it smooths the gradient signal over an effective window of \(1/(1-\beta)\) steps. Viewed physically, it models a heavy ball rolling under friction, with inertia carrying it efficiently along consistent directions and friction settling it at the minimum.

These views agree on the mechanism. Persistent gradient components reinforce and accelerate, oscillating components cancel and are damped, and on ill-conditioned quadratics the convergence rate improves from a dependence on \(\kappa\) to a dependence on \(\sqrt{\kappa}\). This combination of acceleration and stabilization, achieved with almost no extra computation, is why momentum is a standard ingredient in modern optimizers.

197.8 References

  1. Polyak, B. T. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 1964. https://doi.org/10.1016/0041-5553(64)90137-5
  2. Nesterov, Y. A method for solving the convex programming problem with convergence rate O(1/k^2). Doklady Akademii Nauk SSSR, 1983. https://www.mathnet.ru/eng/dan46009
  3. Sutskever, I., Martens, J., Dahl, G., and Hinton, G. On the importance of initialization and momentum in deep learning. ICML, 2013. https://proceedings.mlr.press/v28/sutskever13.html
  4. Goh, G. Why momentum really works. Distill, 2017. https://distill.pub/2017/momentum/
  5. Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning, Chapter 8. MIT Press, 2016. https://www.deeplearningbook.org/
  6. Qian, N. On the momentum term in gradient descent learning algorithms. Neural Networks, 1999. https://doi.org/10.1016/S0893-6080(98)00116-6