200 Adaptive Learning Rates: RMSProp

200.1 1. Introduction

Gradient descent and its stochastic variants share a single, awkward hyperparameter: the learning rate. A scalar step size $\eta$ must serve every coordinate of a high-dimensional parameter vector, even though different coordinates often live on wildly different scales. A weight feeding into a rarely activated feature may receive sparse, large gradients, while a weight tied to a dense feature receives small, frequent ones. A global $\eta$ that is safe for one is wasteful for the other. Adaptive methods respond by giving each coordinate its own effective step size, derived from the recent history of that coordinate’s gradients.

RMSProp is among the most influential of these methods. It was proposed by Geoffrey Hinton in his Coursera course on neural networks and was never published as a standalone paper, yet it became a workhorse optimizer for recurrent networks and a conceptual ancestor of Adam. This chapter develops RMSProp from the failure mode of its predecessor AdaGrad, explains the exponential moving average that fixes that failure, and examines the small but consequential epsilon term that keeps the update numerically sane.

200.2 2. From AdaGrad to a Decay Problem

200.2.1 2.1 The AdaGrad update

AdaGrad accumulates the sum of squared gradients per coordinate and scales each step by the inverse square root of that accumulator. Let $g_t = \nabla_\theta \mathcal{L}_t(\theta_t)$ be the gradient at step $t$, and let $g_{t,i}$ denote its $i$-th component. AdaGrad maintains a state vector

\[ G_{t,i} = \sum_{\tau=1}^{t} g_{\tau,i}^2, \]

and updates parameters by

\[ \theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,i}} + \epsilon} \, g_{t,i} . \]

Coordinates that have seen large or frequent gradients accumulate a large $G_{t,i}$ and are therefore damped, while quiet coordinates retain large effective steps. For convex problems and sparse features this behavior is provably helpful, and AdaGrad enjoys strong regret guarantees in the online convex optimization setting.
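
The update above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the function name `adagrad_step` and the list-based state are assumptions made for the example, and the gradients are hand-picked constants.

```python
import math

def adagrad_step(theta, G, grad, lr=0.1, eps=1e-8):
    """One AdaGrad update; returns new (theta, G) without mutating the inputs."""
    # Accumulate the squared gradient per coordinate.
    new_G = [G_i + g_i * g_i for G_i, g_i in zip(G, grad)]
    # Scale each coordinate's step by the inverse root of its accumulator.
    new_theta = [
        th_i - lr * g_i / (math.sqrt(G_i) + eps)
        for th_i, G_i, g_i in zip(theta, new_G, grad)
    ]
    return new_theta, new_G

theta, G = [1.0, 1.0], [0.0, 0.0]
# Coordinate 0 sees a large gradient every step, coordinate 1 a small one.
for _ in range(3):
    theta, G = adagrad_step(theta, G, grad=[10.0, 0.1])

effective_lr = [0.1 / (math.sqrt(G_i) + 1e-8) for G_i in G]
print("accumulators: ", G)
print("effective lrs:", effective_lr)
```

The coordinate with the larger gradients ends up with the smaller effective learning rate, here by a factor of about one hundred, which is the ratio of the two gradient magnitudes.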

200.2.2 2.2 Why the accumulator kills learning

The trouble is that $G_{t,i}$ is a sum that only grows. Because every squared gradient is non-negative, $G_{t,i}$ never decreases as $t$ grows. The effective learning rate $\eta / (\sqrt{G_{t,i}} + \epsilon)$ therefore only shrinks, and as long as gradients do not vanish it decays toward zero regardless of whether the optimizer has actually reached a good region. In a deep, non-convex loss landscape where training runs for many thousands of steps, this aggressive and irreversible decay often stalls progress long before the model has converged. The optimizer effectively freezes.

A useful way to see the pathology: if the gradient magnitude stays roughly constant at $|g_i|$, then $G_{t,i} \approx t \, g_i^2$, so the effective step shrinks like $1/\sqrt{t}$. A $1/\sqrt{t}$ schedule is the standard choice in online convex optimization, where vanishing steps are what the regret analysis calls for, but it is far too rigid for the long, nonstationary trajectories of neural network training, where the appropriate scale of a coordinate can change as the model passes through different regions of parameter space.
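
A quick numerical check of the $1/\sqrt{t}$ claim, under the idealized assumption of an exactly constant gradient. The helper name `adagrad_effective_lr` and the values of $\eta$ and $g$ are illustrative choices, not taken from any library.

```python
import math

def adagrad_effective_lr(t, g=0.5, lr=0.1, eps=1e-8):
    """Effective AdaGrad step size after t steps of a constant gradient g."""
    G = t * g * g  # sum of t identical squared gradients
    return lr / (math.sqrt(G) + eps)

for t in (1, 100, 10_000, 1_000_000):
    print(f"t = {t:>9}: effective lr = {adagrad_effective_lr(t):.6f}")
```

Every hundredfold increase in $t$ cuts the effective learning rate by a factor of ten, whether or not the loss has improved in the meantime.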

200.3 3. The RMSProp Fix: Exponential Moving Average

200.3.1 3.1 Replacing a sum with a leaky average

RMSProp keeps AdaGrad’s idea of normalizing each coordinate by the magnitude of its past gradients but replaces the unbounded sum with an exponential moving average (EMA). Instead of accumulating all past squared gradients with equal weight, it forms a running estimate of the second moment that forgets the distant past:

\[ v_{t,i} = \beta \, v_{t-1,i} + (1 - \beta)\, g_{t,i}^2, \]

with $v_{0,i} = 0$ and a decay rate $\beta \in (0,1)$, commonly $\beta = 0.9$. The parameter update is then

\[ \theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{v_{t,i}} + \epsilon}\, g_{t,i} . \]

The quantity $\sqrt{v_{t,i}}$ is the root mean square of recent gradients along coordinate $i$, which is the origin of the name RMSProp (root mean square propagation).

200.3.2 3.2 Why the EMA solves the decay problem

The crucial difference from AdaGrad is that $v_{t,i}$ does not grow without bound. Unrolling the recursion gives a weighted sum with geometrically decaying weights,

\[ v_{t,i} = (1 - \beta) \sum_{\tau=1}^{t} \beta^{\,t - \tau} \, g_{\tau,i}^2, \]

so each past squared gradient contributes with weight $(1-\beta)\beta^{t-\tau}$, which fades as $\tau$ recedes. These weights sum to $1 - \beta^t$, which tends to 1. If gradient magnitudes are roughly stationary with second moment $\mathbb{E}[g_i^2] = s_i$, then $v_{t,i}$ settles near $s_i$ (its expectation converges to $s_i$) rather than diverging, and the effective learning rate, ignoring $\epsilon$, stabilizes around the nonzero value $\eta / \sqrt{s_i}$. Learning no longer grinds to a halt.
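
The equivalence between the recursion and the unrolled sum, and the convergence under a stationary gradient, can be checked numerically. The sketch below uses hypothetical helper names (`ema_recursive`, `ema_unrolled`) and a made-up sequence of squared gradients.

```python
def ema_recursive(sq_grads, beta=0.9):
    """v_t from the recursion v <- beta * v + (1 - beta) * g^2, starting at 0."""
    v = 0.0
    for g2 in sq_grads:
        v = beta * v + (1.0 - beta) * g2
    return v

def ema_unrolled(sq_grads, beta=0.9):
    """v_t from the closed-form geometrically weighted sum."""
    t = len(sq_grads)
    return (1.0 - beta) * sum(
        beta ** (t - tau) * g2 for tau, g2 in enumerate(sq_grads, start=1)
    )

sq_grads = [4.0, 1.0, 9.0, 0.25, 2.25]
print("recursive:", ema_recursive(sq_grads))
print("unrolled: ", ema_unrolled(sq_grads))

# With a constant squared gradient s = 4, the EMA approaches s instead of growing.
print("constant input, 200 steps:", ema_recursive([4.0] * 200))
```

The two forms agree up to floating point rounding, and the constant-input run approaches 4 rather than growing with the number of steps as AdaGrad's accumulator would.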

The decay also makes the method responsive to nonstationarity. The EMA has an effective memory horizon of roughly $1/(1-\beta)$ steps, so with $\beta = 0.9$ it averages over about the last ten gradients. When the optimizer enters a region where a coordinate’s gradients suddenly grow, $v_{t,i}$ rises within a few steps and damps the step; when gradients shrink, $v_{t,i}$ relaxes and the step size recovers. This adaptivity to the local geometry is precisely what AdaGrad’s frozen accumulator cannot provide.
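
To make the contrast with AdaGrad concrete, the following sketch feeds both accumulators the same synthetic gradient stream: 100 steps of magnitude 1 followed by 100 steps of magnitude 0.01. The stream, the function name `effective_lrs`, and the hyperparameters are assumptions chosen for illustration.

```python
import math

def effective_lrs(grads, lr=0.01, beta=0.9, eps=1e-8):
    """Return (adagrad_lr, rmsprop_lr) after consuming a scalar gradient sequence."""
    G, v = 0.0, 0.0
    for g in grads:
        G += g * g                              # AdaGrad: unbounded sum
        v = beta * v + (1.0 - beta) * g * g     # RMSProp: leaky average
    return lr / (math.sqrt(G) + eps), lr / (math.sqrt(v) + eps)

large_phase = [1.0] * 100
small_phase = [0.01] * 100

ada_before, rms_before = effective_lrs(large_phase)
ada_after, rms_after = effective_lrs(large_phase + small_phase)

print(f"AdaGrad: {ada_before:.5f} -> {ada_after:.5f}")
print(f"RMSProp: {rms_before:.5f} -> {rms_after:.5f}")
```

After the gradients shrink, the RMSProp effective learning rate recovers by nearly two orders of magnitude, while AdaGrad's stays pinned near the value set by the earlier large gradients.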

200.3.3 3.3 A geometric reading

It helps to read the update as an approximate, diagonal preconditioner. The vector $1/(\sqrt{v_t} + \epsilon)$ rescales each coordinate so that, after rescaling, the gradient has roughly unit root mean square magnitude in every direction. In a ravine, where the loss is steep across the valley and shallow along it, the steep directions accumulate large $v_{t,i}$ and are shrunk, while the shallow direction keeps a larger step. The net effect is to equalize progress across coordinates and to reduce the zigzagging that plagues plain gradient descent on ill-conditioned problems. RMSProp is not a true second-order method, since it ignores off-diagonal curvature, but the diagonal normalization captures much of the benefit at negligible cost.
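
A small sketch of the equalizing effect, under the simplifying assumption that each coordinate sees a constant gradient. The function name `rmsprop_direction` and the gradient values are hypothetical; the point is only that raw gradients four orders of magnitude apart produce normalized directions of comparable size.

```python
import math

def rmsprop_direction(grad_history, beta=0.9, eps=1e-8):
    """Normalized direction g / (sqrt(v) + eps) for the last gradient in a history."""
    v = [0.0] * len(grad_history[0])
    for g in grad_history:
        v = [beta * v_i + (1.0 - beta) * g_i * g_i for v_i, g_i in zip(v, g)]
    last = grad_history[-1]
    return [g_i / (math.sqrt(v_i) + eps) for g_i, v_i in zip(last, v)]

# Steep coordinate: gradient 100. Shallow coordinate: gradient -0.01.
history = [[100.0, -0.01]] * 100
raw = history[-1]
scaled = rmsprop_direction(history)

print("raw gradient:       ", raw)
print("normalized gradient:", scaled)
```

Both normalized components have magnitude close to 1 and keep the sign of the raw gradient, so a single $\eta$ produces similar progress in the steep and shallow directions.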

200.4 4. The Epsilon Term

200.4.1 4.1 Numerical role

The constant $\epsilon$ in the denominator is small, typically between $10^{-8}$ and $10^{-6}$. Its first job is to prevent division by zero. Early in training, or for a coordinate whose gradient has been near zero for many steps, $v_{t,i}$ can be extremely small or exactly zero. Its square root is larger than $v_{t,i}$ itself whenever $v_{t,i} < 1$, but it still vanishes as $v_{t,i} \to 0$. Without $\epsilon$ the update could then divide by zero, producing an infinite step or, when the gradient is also zero, a NaN from $0/0$. Adding $\epsilon$ guarantees a finite denominator and caps the largest possible effective learning rate at $\eta / \epsilon$.
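
The degenerate case is easy to reproduce. In the sketch below, `rmsprop_update` is a hypothetical scalar helper. In plain Python float arithmetic the failure surfaces as a `ZeroDivisionError`; array libraries that follow IEEE 754 semantics typically return `nan` or `inf` instead.

```python
import math

def rmsprop_update(g, v, lr=0.01, eps=1e-8):
    """Scalar RMSProp parameter change for gradient g and second moment v."""
    return -lr * g / (math.sqrt(v) + eps)

# A coordinate that has never received a gradient: g = 0 and v = 0.
try:
    rmsprop_update(0.0, 0.0, eps=0.0)
    failed = False
except ZeroDivisionError:
    failed = True

safe = rmsprop_update(0.0, 0.0)  # default eps keeps the denominator positive

print("eps = 0 fails:     ", failed)
print("eps = 1e-8 update: ", safe)
```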

200.4.2 4.2 Epsilon as a soft floor on the denominator

Beyond bare numerical safety, $\epsilon$ behaves as a soft floor on the denominator, which caps the effective learning rate a coordinate can receive. Consider a coordinate with a tiny but nonzero second moment. If $\sqrt{v_{t,i}} \ll \epsilon$, the denominator is dominated by $\epsilon$ and the update reduces to $-(\eta/\epsilon)\, g_{t,i}$, an ordinary gradient step with a large but bounded learning rate. If $\sqrt{v_{t,i}} \gg \epsilon$, the $\epsilon$ is negligible and the normalization is essentially pure. So $\epsilon$ smoothly interpolates between unnormalized and fully normalized regimes, and it determines how aggressively the optimizer is allowed to amplify small gradients. Setting $\epsilon$ too small can cause instability in flat regions, where a near-zero $v_{t,i}$ gives a coordinate an enormous effective learning rate, so that even tiny, noisy gradients are blown up to steps of order $\eta$. Setting it too large weakens the adaptivity, since the denominator becomes dominated by a constant and the method drifts back toward plain SGD with learning rate $\eta/\epsilon$.
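
The two regimes can be seen by evaluating the effective learning rate at a very small and a moderately large second moment. The helper name `effective_lr` and the specific values of $v$ are illustrative assumptions.

```python
import math

def effective_lr(v, lr=0.01, eps=1e-8):
    """Effective per-coordinate learning rate lr / (sqrt(v) + eps)."""
    return lr / (math.sqrt(v) + eps)

tiny_v = 1e-24    # sqrt(v) = 1e-12, far below eps: eps dominates
large_v = 1e-4    # sqrt(v) = 1e-2, far above eps: eps is negligible

lr_floor_regime = effective_lr(tiny_v)
lr_normalized_regime = effective_lr(large_v)

print("eps-dominated regime:", lr_floor_regime)        # close to lr / eps = 1e6
print("normalized regime:   ", lr_normalized_regime)   # close to lr / sqrt(v) = 1.0
```

Raising `eps` lowers the ceiling in the first regime and leaves the second essentially unchanged until `eps` becomes comparable to $\sqrt{v}$.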

200.4.3 4.3 Placement matters

A subtle but practically important point is where $\epsilon$ sits. The common form places it outside the square root,

\[ \frac{\eta}{\sqrt{v_{t,i}} + \epsilon}, \]

but some implementations place it inside,

\[ \frac{\eta}{\sqrt{v_{t,i} + \epsilon}} . \]

These are not equivalent. Near $v_{t,i} = 0$ the outside form floors the denominator at $\epsilon$, while the inside form floors it at $\sqrt{\epsilon}$, so the value of $\epsilon$ that produces a given behavior differs between the two forms by roughly a square root. When porting hyperparameters between frameworks it is worth checking which convention is used, since a value tuned for one placement can be badly miscalibrated for the other.
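
The difference between the two placements is easiest to see by tabulating both denominators over a range of second moments. The function names `denom_outside` and `denom_inside` are made up for this sketch.

```python
import math

def denom_outside(v, eps):
    """Denominator with epsilon added after the square root."""
    return math.sqrt(v) + eps

def denom_inside(v, eps):
    """Denominator with epsilon added under the square root."""
    return math.sqrt(v + eps)

eps = 1e-8
for v in (0.0, 1e-12, 1e-4, 1.0):
    print(f"v = {v:<7g} outside = {denom_outside(v, eps):.3e} "
          f"inside = {denom_inside(v, eps):.3e}")
```

At $v = 0$ the inside form with $\epsilon = 10^{-8}$ gives a denominator of about $10^{-4}$, matching the outside form with $\epsilon = 10^{-4}$. For large $v$ the two forms agree closely, so the discrepancy matters mainly for coordinates with very small gradients.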

200.5 5. Algorithm and Practical Notes

The full per-step procedure is compact.

initialize theta, v <- 0
for t = 1, 2, ...:
    g  <- grad(loss(theta))          # gradient at current params
    v  <- beta * v + (1 - beta) * g * g   # EMA of squared gradient
    theta <- theta - eta * g / (sqrt(v) + eps)
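
A runnable transcription of this pseudocode in plain Python might look as follows. It is a sketch for experimentation, not the library implementation described later in the chapter: the function name `rmsprop`, its signature, and the list-based parameters are assumptions made here.

```python
import math

def rmsprop(grad_fn, theta0, steps, lr=0.01, beta=0.9, eps=1e-8):
    """Run RMSProp for a fixed number of steps; returns (theta, v)."""
    theta = list(theta0)
    v = [0.0] * len(theta)                      # second-moment EMA starts at zero
    for _ in range(steps):
        g = grad_fn(theta)                      # gradient at current params
        v = [beta * v_i + (1.0 - beta) * g_i * g_i for v_i, g_i in zip(v, g)]
        theta = [
            th_i - lr * g_i / (math.sqrt(v_i) + eps)   # eps outside the root
            for th_i, v_i, g_i in zip(theta, v, g)
        ]
    return theta, v

# Gradient of f(x) = 0.5 * (x0^2 + 4 * x1^2), evaluated for a single step.
theta1, v1 = rmsprop(lambda x: [x[0], 4.0 * x[1]], [1.0, -2.0], steps=1)
print("params after one step:", theta1)
print("v after one step:     ", v1)
```

After one step both coordinates have moved by the same amount, about $\eta/\sqrt{1-\beta}$, even though their gradients differ by a factor of eight.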

A few practical observations follow from the analysis above.

Unlike Adam, plain RMSProp uses no bias correction for the EMA, so $v_t$ is biased toward zero during the first several steps while it is still warming up from $v_0 = 0$. For roughly constant gradient magnitudes, $v_t \approx (1-\beta^t)\, g^2$, so early updates are inflated by a factor of about $1/\sqrt{1-\beta^t}$, which decays to 1 within a few multiples of $1/(1-\beta)$ steps. In practice a short warmup of the learning rate, or simply tolerating the transient, handles this; Adam later addressed it explicitly with bias correction terms.
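
The warmup bias can be tabulated directly for a constant gradient. The helper `ema_second_moment` is hypothetical, and the division by $1 - \beta^t$ shown here is the Adam-style correction, included only for comparison since plain RMSProp does not apply it.

```python
def ema_second_moment(g, t, beta=0.9):
    """v_t after t steps of a constant gradient g, starting from v_0 = 0."""
    v = 0.0
    for _ in range(t):
        v = beta * v + (1.0 - beta) * g * g
    return v

g = 2.0  # true second moment is g^2 = 4
for t in (1, 2, 5, 10, 50):
    v = ema_second_moment(g, t)
    corrected = v / (1.0 - 0.9 ** t)   # Adam-style correction, not part of plain RMSProp
    print(f"t = {t:>2}: v = {v:.4f}, corrected = {corrected:.4f}")
```

The uncorrected estimate starts at one tenth of the true value and closes the gap over a few tens of steps, while the corrected estimate equals the true value from the first step.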

Typical defaults are $\eta$ in the range $10^{-4}$ to $10^{-3}$, $\beta = 0.9$, and $\epsilon = 10^{-8}$. RMSProp was historically favored for recurrent neural networks, whose gradient magnitudes vary sharply across time steps and parameters, exactly the nonstationary setting the EMA is built for. A momentum variant adds a velocity term on top of the normalized gradient, and the centered variant additionally tracks an EMA of the gradient itself to normalize by an estimate of the variance rather than the raw second moment.
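
One possible arrangement of the centered variant with momentum, for a single scalar parameter, is sketched below. Frameworks differ in details such as where $\epsilon$ sits and how the velocity is scaled, so the function name `centered_rmsprop_step`, the state dictionary, and the momentum coefficient `mu` are assumptions of this example rather than a description of any particular library.

```python
import math

def centered_rmsprop_step(theta, g, state, lr=0.01, beta=0.9, mu=0.9, eps=1e-8):
    """One scalar step of centered RMSProp with momentum (illustrative sketch)."""
    v = beta * state["v"] + (1.0 - beta) * g * g      # EMA of g^2
    m = beta * state["m"] + (1.0 - beta) * g          # EMA of g
    variance = max(v - m * m, 0.0)                    # clamp tiny negative rounding errors
    buf = mu * state["buf"] + g / (math.sqrt(variance) + eps)  # velocity on normalized gradient
    new_state = {"v": v, "m": m, "buf": buf}
    return theta - lr * buf, new_state

state = {"v": 0.0, "m": 0.0, "buf": 0.0}
theta, state = centered_rmsprop_step(1.0, 0.5, state)
print("theta:", theta)
print("state:", state)
```

Because the variance estimate is never larger than the raw second moment, the centered denominator is smaller and the steps correspondingly larger, which is one reason this variant tends to need more care with $\eta$ and $\epsilon$.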

The conceptual line to Adam is direct: Adam combines RMSProp’s EMA of squared gradients with an EMA of the gradient (momentum) and adds bias correction to both. Understanding RMSProp’s second moment estimate and its epsilon is therefore most of the way to understanding Adam.
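
For comparison, a scalar Adam step following the update rule in Kingma and Ba can be sketched as below. The function name `adam_step` and the choice of `lr=0.01` are illustrative; the decay rates `beta1=0.9` and `beta2=0.999` are the defaults suggested in that paper.

```python
import math

def adam_step(theta, g, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One scalar Adam update at step t (t starts at 1)."""
    m = beta1 * m + (1.0 - beta1) * g          # EMA of the gradient (momentum)
    v = beta2 * v + (1.0 - beta2) * g * g      # EMA of the squared gradient, as in RMSProp
    m_hat = m / (1.0 - beta1 ** t)             # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)             # bias-corrected second moment
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = adam_step(theta=1.0, g=2.0, m=0.0, v=0.0, t=1)
first_step = 1.0 - theta
print("first step size:", first_step)
```

With both corrections in place the first step has magnitude close to $\eta$ itself, without the $1/\sqrt{1-\beta}$ inflation that uncorrected RMSProp shows on its first update.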

200.5.1 5.1 The first step in closed form

Because $v_0 = 0$, the very first update has a clean closed form that is worth pinning down, both as a sanity check and because it reveals the warmup behavior. With $v_{1,i} = (1-\beta) g_{1,i}^2$ we get $\sqrt{v_{1,i}} = \sqrt{1-\beta}\,|g_{1,i}|$, so for any coordinate whose gradient dominates $\epsilon$,

\[ \theta_{2,i} - \theta_{1,i} = -\frac{\eta\, g_{1,i}}{\sqrt{1-\beta}\,|g_{1,i}| + \epsilon} \;\approx\; -\frac{\eta}{\sqrt{1-\beta}}\,\operatorname{sign}(g_{1,i}). \]

The first step is essentially a signed step of fixed magnitude $\eta/\sqrt{1-\beta}$, independent of how large the gradient is. With $\beta = 0.9$ this inflates the nominal step by a factor of $1/\sqrt{0.1} \approx 3.16$, which is the concrete face of the missing bias correction. The worked example below reproduces exactly this number on a hand-checkable input.

200.6 6. Reference Implementation

The library ships a small, dependency-light RMSProp in Python, Julia, and Rust. The three share one API: init_state builds the optimizer state with v set to zero, rmsprop_step applies a single update from an externally supplied gradient, and minimize drives a closure that returns the gradient at the current parameters. All three agree on the shared numeric fixtures to within 1e-9.

We demonstrate on the separable quadratic $f(x) = \tfrac{1}{2}\sum_i c_i x_i^2$ with $c = (1, 4)$, whose gradient is $\nabla f(x) = c \odot x$ and whose minimizer is the origin. RMSProp’s per-coordinate normalization lets the steep coordinate ($c_2 = 4$) and the shallow one ($c_1 = 1$) make comparable progress despite the four-to-one curvature ratio.

Python

import numpy as np
from aiinaction.ch195_rmsprop import init_state, rmsprop_step, minimize

# One explicit, hand-checkable step: v starts at zero.
state = init_state([1.0, -2.0, 0.5], lr=0.01, beta=0.9, eps=1e-8)
grad = np.array([0.1, -0.3, 2.0])
step1 = rmsprop_step(state, grad)
print("v after one step:   ", np.round(step1.v, 6))
print("params after step:  ", np.round(step1.params, 6))

# -step / lr equals sign(g) / sqrt(1 - beta); its magnitude is ~= 3.162 for beta = 0.9.
print("step / (lr*sign(g)):", np.round((step1.params - state.params) / (-0.01), 4))

# Minimize the ill-conditioned quadratic f(x) = 0.5 * sum(c * x^2).
c = np.array([1.0, 4.0])
result = minimize(lambda x: c * x, [2.0, 2.0], 60, lr=0.1, beta=0.9)
print("minimizer estimate: ", np.round(result.params, 6))
print("steps taken:        ", result.step_count)

Output of the Python example:

v after one step:    [0.001 0.009 0.4  ]
params after step:   [ 0.968377 -1.968377  0.468377]
step / (lr*sign(g)): [ 3.1623 -3.1623  3.1623]
minimizer estimate:  [0. 0.]
steps taken:         60

Julia

using AIInAction.Ch195Rmsprop

# One explicit, hand-checkable step: v starts at zero.
state = init_state([1.0, -2.0, 0.5]; lr=0.01, beta=0.9, eps=1e-8)
grad = [0.1, -0.3, 2.0]
step1 = rmsprop_step(state, grad)
println("v after one step:  ", round.(step1.v, digits=6))
println("params after step: ", round.(step1.params, digits=6))

# Minimize the ill-conditioned quadratic f(x) = 0.5 * sum(c .* x.^2).
c = [1.0, 4.0]
result = minimize(x -> c .* x, [2.0, 2.0], 60; lr=0.1, beta=0.9)
println("minimizer estimate: ", round.(result.params, digits=6))
println("steps taken:        ", result.step_count)

Rust

use aiinaction::ch195_rmsprop::{init_state, rmsprop_step, minimize};

fn main() {
    // One explicit, hand-checkable step: v starts at zero.
    let state = init_state(&[1.0, -2.0, 0.5], 0.01, 0.9, 1e-8).unwrap();
    let grad = [0.1, -0.3, 2.0];
    let step1 = rmsprop_step(&state, &grad).unwrap();
    println!("v after one step:  {:?}", step1.v);
    println!("params after step: {:?}", step1.params);

    // Minimize the ill-conditioned quadratic f(x) = 0.5 * sum(c * x^2).
    let c = [1.0, 4.0];
    let result = minimize(
        |x: &[f64]| vec![c[0] * x[0], c[1] * x[1]],
        &[2.0, 2.0],
        60,
        0.1,
        0.9,
        1e-8,
    )
    .unwrap();
    println!("minimizer estimate: {:?}", result.params);
    println!("steps taken:        {}", result.step_count);
}

200.7 7. Summary

AdaGrad introduced per-coordinate adaptive learning rates by normalizing with a sum of squared gradients, but that monotone sum forces the effective step to decay toward zero and stalls deep network training. RMSProp replaces the sum with an exponential moving average of squared gradients controlled by a decay rate $\beta$, giving a bounded, forgetting estimate of the local gradient scale that adapts to nonstationary objectives. The epsilon term guarantees numerical stability and acts as a soft floor on the denominator that caps the effective learning rate, with its placement relative to the square root affecting its calibration. Together these choices make RMSProp a robust, low-cost diagonal preconditioner and a direct foundation for Adam.

200.8 References

Hinton, G., Srivastava, N., and Swersky, K. “Neural Networks for Machine Learning, Lecture 6e: RMSProp.” University of Toronto / Coursera. https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
Duchi, J., Hazan, E., and Singer, Y. “Adaptive Subgradient Methods for Online Learning and Stochastic Optimization.” Journal of Machine Learning Research, 12:2121-2159, 2011. https://jmlr.org/papers/v12/duchi11a.html
Kingma, D. P., and Ba, J. “Adam: A Method for Stochastic Optimization.” International Conference on Learning Representations, 2015. https://arxiv.org/abs/1412.6980
Ruder, S. “An Overview of Gradient Descent Optimization Algorithms.” 2016. https://arxiv.org/abs/1609.04747
Goodfellow, I., Bengio, Y., and Courville, A. “Deep Learning,” Chapter 8: Optimization for Training Deep Models. MIT Press, 2016. https://www.deeplearningbook.org/
