199 Adaptive Learning Rates with AdaGrad
Stochastic gradient descent applies a single global learning rate to every parameter at every step. This choice is rarely optimal. Different parameters in a model often live on very different scales, receive gradients of very different magnitudes, and are updated at very different frequencies. A learning rate that is well tuned for a frequently updated, large-gradient coordinate may be far too aggressive for a rarely updated coordinate, and a rate that is gentle enough for the former may starve the latter of progress. AdaGrad, introduced by Duchi, Hazan, and Singer in 2011, was the first widely adopted optimizer to address this mismatch by giving every parameter its own learning rate that adapts automatically as training proceeds. This chapter develops the AdaGrad update from first principles, explains why it excels on sparse and ill-conditioned problems, and analyzes the vanishing learning rate pathology that ultimately limited its use and motivated its successors.
199.1 1. From Global to Per-Parameter Step Sizes
Consider minimizing a differentiable objective \(f(\theta)\) over \(\theta \in \mathbb{R}^d\). Plain stochastic gradient descent (SGD) computes a stochastic gradient \(g_t = \nabla_\theta f_t(\theta_t)\) at step \(t\) and updates
\[ \theta_{t+1} = \theta_t - \eta\, g_t, \]
where \(\eta > 0\) is a scalar learning rate shared across all \(d\) coordinates. The difficulty is that this single \(\eta\) must serve every coordinate simultaneously. If the loss surface is poorly conditioned, meaning its curvature differs sharply across directions, then the largest stable step in a steep direction is much smaller than the step needed to make timely progress in a flat direction. SGD must respect the steep direction to avoid divergence, so it crawls along the flat ones.
A natural fix is to replace the scalar \(\eta\) with a vector or matrix of step sizes, one tailored to each coordinate. Newton and quasi-Newton methods do this using second-order curvature information, but computing and inverting a \(d \times d\) Hessian is intractable for the high dimensional models common in machine learning. AdaGrad occupies a middle ground. It constructs a cheap, diagonal preconditioner from the history of gradients alone, requiring no second derivatives and only \(O(d)\) extra memory.
199.2 2. The AdaGrad Update Rule
AdaGrad maintains, for each coordinate \(i\), a running sum of the squares of all gradients observed so far in that coordinate. Let \(g_{t,i}\) denote the \(i\)-th component of the gradient at step \(t\). Define the accumulator
\[ G_{t,i} = \sum_{\tau=1}^{t} g_{\tau,i}^2 . \]
The update for coordinate \(i\) is then
\[ \theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,i}} + \epsilon}\, g_{t,i}, \]
where \(\eta\) is a global base learning rate and \(\epsilon\) is a small constant (typically \(10^{-8}\)) added for numerical stability so the denominator never vanishes. In vector form, writing \(G_t = \sum_{\tau=1}^{t} g_\tau \odot g_\tau\) for the elementwise accumulator, the update is
\[ \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t} + \epsilon} \odot g_t, \]
where \(\odot\) denotes elementwise multiplication and the square root and division act componentwise. The effective learning rate for coordinate \(i\) at step \(t\) is
\[ \eta_{t,i} = \frac{\eta}{\sqrt{G_{t,i}} + \epsilon}. \]
Because \(G_{t,i}\) is nondecreasing in \(t\), every effective learning rate \(\eta_{t,i}\) is nonincreasing. The base rate \(\eta\) becomes far less sensitive than in SGD, since the per-coordinate scaling absorbs much of the burden of choosing the right magnitude.
A compact statement of the algorithm:
initialize theta, G = 0
for t = 1, 2, ...:
g = stochastic_gradient(theta)
G = G + g * g # elementwise accumulate
theta = theta - eta * g / (sqrt(G) + epsilon)
The full matrix version of AdaGrad replaces \(G_t\) with the outer product sum \(\sum_\tau g_\tau g_\tau^\top\) and uses its inverse square root as a preconditioner. This captures correlations between coordinates but costs \(O(d^2)\) memory and \(O(d^3)\) per step, so it is almost never used. The diagonal approximation above is what practitioners mean by AdaGrad.
199.3 3. Why the Square Root and the Accumulator
The structure of the denominator is not arbitrary. Two design choices deserve attention.
199.3.1 3.1 Normalizing by Accumulated Magnitude
Dividing by \(\sqrt{G_{t,i}}\) rescales each coordinate so that directions which have historically produced large gradients take proportionally smaller steps, while directions with small gradients take larger steps. This equalizes progress across coordinates regardless of their natural scale. If one feature is measured in units a thousand times larger than another, its gradients will be correspondingly larger, and AdaGrad automatically shrinks its step to compensate. The optimizer therefore behaves like a crude diagonal preconditioner that approximates inverse curvature using gradient magnitudes as a proxy.
199.3.2 3.2 The Square Root Rather Than the Raw Sum
One might ask why we divide by \(\sqrt{G_{t,i}}\) rather than by \(G_{t,i}\) itself. The square root keeps the units consistent. The accumulator \(G_{t,i}\) has the units of a squared gradient, so \(\sqrt{G_{t,i}}\) has the units of a gradient, and the ratio \(g_{t,i} / \sqrt{G_{t,i}}\) is dimensionless and of order one in magnitude. Dividing by the raw sum would over-damp the update and cause the effective rate to collapse far too quickly. The square root also emerges naturally from the regret analysis: AdaGrad was derived to minimize a regret bound in online convex optimization, and the optimal diagonal preconditioner in that analysis is precisely the inverse square root of the accumulated squared gradients.
199.4 4. Regret Guarantee and Theoretical Footing
AdaGrad comes with a rigorous guarantee in the online convex optimization framework. Suppose at each round we play \(\theta_t\), then suffer a convex loss \(f_t\), and we measure performance by the regret relative to the best fixed point \(\theta^\star\) in hindsight:
\[ R_T = \sum_{t=1}^{T} \big( f_t(\theta_t) - f_t(\theta^\star) \big). \]
Duchi et al. show that the diagonal AdaGrad update achieves a regret bound of the form
\[ R_T \le 2 D_\infty \sum_{i=1}^{d} \sqrt{\sum_{t=1}^{T} g_{t,i}^2}, \]
where \(D_\infty\) bounds the distance of iterates from the optimum. The key feature is the dependence on \(\sum_i \sqrt{\sum_t g_{t,i}^2}\) rather than on the worst-case gradient norm times \(\sqrt{T}\). When gradients are sparse, meaning most coordinates are zero on most rounds, this data-dependent bound can be dramatically smaller than the \(O(\sqrt{T})\) regret of standard SGD with a tuned step size. This is the formal statement of AdaGrad’s advantage on sparse problems, and it holds without prior knowledge of which coordinates will turn out to be informative.
199.5 5. Strength on Sparse Features
The clearest practical benefit of AdaGrad appears in models with high dimensional sparse inputs, such as bag-of-words text classifiers, recommendation systems with large categorical vocabularies, and the embedding layers of early neural language models. In these settings most features are absent from most training examples, so the gradient for a given parameter is zero on the vast majority of steps and nonzero only on the rare steps where the corresponding feature fires.
Consider a parameter tied to a rare word. Its accumulator \(G_{t,i}\) grows only on the few steps when that word appears, so it stays small. The effective learning rate \(\eta / (\sqrt{G_{t,i}} + \epsilon)\) therefore remains large, and the parameter takes substantial steps whenever it is updated. By contrast, a parameter tied to a common word accumulates squared gradients on nearly every step, so its effective rate shrinks quickly and its updates become conservative. AdaGrad thus allocates aggressive learning to rare, informative features and cautious learning to frequent ones, all without any manual per-feature tuning. With a single global learning rate, the practitioner faces an impossible trade-off: a rate large enough to learn rare features quickly will be unstable on frequent ones, and a rate stable on frequent features will leave rare ones undertrained. AdaGrad dissolves this trade-off automatically, which is why it became a default choice for sparse linear models and embedding training in the early 2010s.
199.6 6. The Vanishing Learning Rate Problem
The same mechanism that gives AdaGrad its strength also creates its central weakness. Because the accumulator \(G_{t,i}\) is a sum of nonnegative terms, it grows monotonically and without bound as training continues. Whenever a coordinate receives a nonzero gradient, \(G_{t,i}\) strictly increases, so the effective learning rate
\[ \eta_{t,i} = \frac{\eta}{\sqrt{G_{t,i}} + \epsilon} \]
decreases monotonically and, for any coordinate that keeps receiving gradients, tends toward zero. To see the rate of decay, suppose a coordinate receives gradients of roughly constant magnitude \(\bar{g}\) on every step. Then \(G_{t,i} \approx t\,\bar{g}^2\), so
\[ \eta_{t,i} \approx \frac{\eta}{\bar{g}\sqrt{t}}. \]
The effective learning rate decays like \(1/\sqrt{t}\). In the convex setting analyzed by the original authors, this decay is exactly what the theory prescribes and is benign, because a decaying step size is required for convergence to a fixed optimum. In the nonconvex, long-horizon training of deep neural networks, however, it is harmful. Dense parameters in a deep network receive a nonzero gradient at essentially every step, so their accumulators grow steadily, their effective rates shrink toward zero, and the optimizer effectively stops learning well before the loss has been adequately minimized. Training stalls not because a good solution has been reached but because the step sizes have collapsed.
This failure is intrinsic to the accumulate-forever design. AdaGrad has no mechanism to forget old gradients or to recover a larger step size once curvature changes or the loss landscape shifts, as it does throughout the training of a deep model. The accumulator records the entire history with equal weight, so a burst of large gradients early in training permanently suppresses the learning rate for the rest of the run.
The remedy, developed in AdaGrad’s successors, is to replace the unbounded sum with an exponentially weighted moving average of squared gradients. RMSProp uses
\[ G_{t,i} = \rho\, G_{t-1,i} + (1-\rho)\, g_{t,i}^2, \]
with a decay factor \(\rho \in (0,1)\) typically around \(0.9\) or \(0.99\). Because old contributions decay geometrically, the accumulator reflects only the recent gradient regime and reaches a stationary value when the gradient statistics are stationary, so the effective learning rate no longer collapses to zero. Adam combines this same decaying second-moment estimate with a decaying first-moment (momentum) estimate and bias correction. These methods inherit AdaGrad’s per-parameter adaptivity while discarding its fatal monotonic decay, which is why they, rather than AdaGrad itself, dominate modern deep learning. AdaGrad remains the conceptual ancestor and is still a sensible default for convex and strongly sparse problems where its decay schedule is an asset rather than a liability.
199.7 7. Practical Considerations
A few points guide effective use. The base learning rate \(\eta\) still requires tuning, though AdaGrad is markedly less sensitive to it than SGD; values around \(0.01\) are common starting points. The stabilizer \(\epsilon\) is added outside the square root in the original formulation but inside it in many implementations, and the two conventions differ slightly; the value matters little as long as it is small. Because the effective rate only decreases, AdaGrad benefits from a generous initial \(\eta\), since there is no danger of the early steps being too large for long. Finally, AdaGrad’s per-coordinate accumulator must be stored alongside the parameters, doubling the optimizer’s memory footprint relative to plain SGD, which is negligible compared with the \(O(d^2)\) cost of true second-order methods and is the same order as RMSProp.
199.8 8. Summary
AdaGrad replaces SGD’s single global learning rate with a per-parameter rate obtained by dividing the base rate by the square root of the accumulated sum of squared gradients in each coordinate. This data-dependent rescaling acts as a cheap diagonal preconditioner, equalizes progress across coordinates of differing scale, and provably reduces regret on sparse problems by adapting aggressive learning to rare features and cautious learning to frequent ones. Its defining limitation is the vanishing learning rate: the unbounded accumulator forces effective step sizes to decay toward zero, which is appropriate for convex optimization but stalls long-horizon nonconvex training. Recognizing this, later optimizers such as RMSProp and Adam retain AdaGrad’s adaptive structure while replacing the monotone sum with a decaying average, preserving the strengths and curing the weakness.
199.9 References
- Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research, 12, 2121-2159. https://www.jmlr.org/papers/v12/duchi11a.html
- Ruder, S. (2016). An Overview of Gradient Descent Optimization Algorithms. https://www.ruder.io/optimizing-gradient-descent/
- Tieleman, T. and Hinton, G. (2012). Lecture 6.5: RMSProp. Coursera, Neural Networks for Machine Learning. https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
- Kingma, D. P. and Ba, J. (2015). Adam: A Method for Stochastic Optimization. International Conference on Learning Representations. https://arxiv.org/abs/1412.6980
- Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning, Chapter 8: Optimization for Training Deep Models. MIT Press. https://www.deeplearningbook.org/contents/optimization.html