Gradient Descent
Prologue: The Map of the Territory
Let us begin our investigation by establishing a precise correspondence between two worlds: the world of classical mechanics, where particles move through physical space under forces, and the world of machine learning, where algorithms navigate through parameter space seeking to minimize loss.
This is not merely an analogy—it is an exact mathematical mapping. Every symbol, every equation in physics has its precise counterpart in optimization theory. Let us make this mapping explicit from the very beginning:
| Physics symbol | Physical meaning | ↔ | ML symbol | ML meaning |
|---|---|---|---|---|
| $\mathbf{x}$ | Position of particle | ↔ | $\mathbf{w}$ | Weights/parameters of neural network |
| $U(\mathbf{x})$ | Potential energy | ↔ | $L(\mathbf{w})$ | Loss function (error function) |
| $\mathbf{v}$ | Velocity of particle | ↔ | $\mathbf{v}$ | Velocity in momentum (same symbol!) |
| $m$ | Mass of particle | ↔ | $m$ | Virtual “mass” of optimizer |
| $\gamma$ | Friction coefficient | ↔ | $\gamma$ | Damping coefficient of optimizer |
| $-\nabla U(\mathbf{x})$ | Force pushing downhill | ↔ | $-\nabla L(\mathbf{w})$ | Gradient pushing toward minimum |
| $t$ | Physical time (continuous) | ↔ | $t$ | Training iteration (discrete) |
| $\Delta t$ | Time step | ↔ | $\Delta t$ | One optimization step |
Fundamental Insight:
This is not just a poetic metaphor. It is an exact mathematical mapping. Whenever we write a physics equation, we can substitute $\mathbf{x} \to \mathbf{w}$ and $U \to L$ and immediately obtain the corresponding gradient descent equation. From now on, we will always show both versions side by side.
Chapter I: The Fundamental Equations
1.1 The Conservative Force and the Loss Gradient
Let us begin with the most fundamental equation: how does the landscape—whether physical or parametric—exert a “force” that guides motion toward lower regions?
🔵 PHYSICS

$$\mathbf{F}(\mathbf{x}) = -\nabla U(\mathbf{x})$$

The force is the negative gradient of potential energy. It pushes the particle toward regions of lower energy.

🟡 MACHINE LEARNING

$$\Delta \mathbf{w} \propto -\nabla L(\mathbf{w})$$

The negative gradient of the loss indicates the direction of steepest descent. It guides the weights toward values that minimize error.
Why the negative sign?
The gradient $\nabla U(\mathbf{x})$ points in the direction of maximum increase of $U$. But we want to minimize $U$ (or $L$), so we move in the opposite direction: $-\nabla U$ (or $-\nabla L$).
In physics: the particle “slides down” the energy gradient.
In ML: the algorithm “descends” along the loss gradient.
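As a quick sanity check, the sign convention can be verified numerically. The sketch below uses an assumed toy loss $L(w) = \frac{1}{2}(w-3)^2$ (an illustrative choice, not from the text) and confirms that a small step along $-\nabla L$ decreases the loss while a step along $+\nabla L$ increases it:

```python
def loss(w):
    """Assumed toy loss L(w) = 1/2 (w - 3)^2, minimized at w = 3."""
    return 0.5 * (w - 3.0) ** 2

def grad(w):
    """Analytic gradient dL/dw = w - 3."""
    return w - 3.0

w = 10.0
eps = 1e-4
# Stepping along -grad lowers the loss; stepping along +grad raises it.
down = loss(w - eps * grad(w))
up = loss(w + eps * grad(w))
assert down < loss(w) < up
```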
1.2 Newton’s Second Law and Parameter Inertia
But force alone does not determine motion. We must specify how the system responds to this force. Newton’s second law provides the answer in physics, and we can write its exact counterpart for optimization:
🔵 PHYSICS

$$m \frac{d\mathbf{v}}{dt} = \mathbf{F}$$

Mass $m$ determines how much force influences acceleration. Large mass → small acceleration (inertia).

🟡 MACHINE LEARNING

$$m \frac{d\mathbf{v}}{dt} = \mathbf{F}$$

“Mass” $m$ determines how much the driving term (here, the negative loss gradient) influences the change in velocity. Large $m$ → gradual changes.
Combining with our force/gradient equation:
🔵 PHYSICS

$$m \frac{d\mathbf{v}}{dt} = -\nabla U(\mathbf{x})$$

🟡 MACHINE LEARNING

$$m \frac{d\mathbf{v}}{dt} = -\nabla L(\mathbf{w})$$
1.3 Friction: Energy Dissipation
In the real world, motion is always subject to friction. For motion through a viscous medium, the friction force is proportional to velocity. This is crucial—let us write both versions:
🔵 PHYSICS

$$\mathbf{F}_{\text{friction}} = -\gamma \mathbf{v}$$

$\gamma$ = viscous friction coefficient
Units: [mass/time]
Large $\gamma$ → high resistance

🟡 MACHINE LEARNING

$$\mathbf{F}_{\text{damping}} = -\gamma \mathbf{v}$$

$\gamma$ = damping coefficient
Units are notional, since an iteration carries no physical dimension
Large $\gamma$ → velocity decays quickly
1.4 The Complete Equation of Motion
Now we combine all forces—the driving force from the gradient and the resistive force from friction—to obtain the complete equation of motion:
🔵 PHYSICS – Complete Equation

$$m \frac{d\mathbf{v}}{dt} = -\nabla U(\mathbf{x}) - \gamma \mathbf{v}$$

Three terms:
• $m \frac{d\mathbf{v}}{dt}$: resistance to velocity change (inertia)
• $-\nabla U(\mathbf{x})$: force downhill
• $-\gamma \mathbf{v}$: friction that slows the motion

🟡 ML – Complete Equation

$$m \frac{d\mathbf{v}}{dt} = -\nabla L(\mathbf{w}) - \gamma \mathbf{v}$$

Three terms:
• $m \frac{d\mathbf{v}}{dt}$: resistance to velocity change
• $-\nabla L(\mathbf{w})$: gradient toward minimum
• $-\gamma \mathbf{v}$: velocity damping
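The complete equation of motion can be integrated numerically. Below is a minimal forward-Euler sketch, assuming a quadratic potential $U(x) = \frac{1}{2}kx^2$ and illustrative values for $m$, $\gamma$, and $k$ (all choices mine, not from the text):

```python
# Forward-Euler integration of  m dv/dt = -dU/dx - gamma*v,  dx/dt = v
# for the assumed potential U(x) = 1/2 * k * x^2 (minimum at x = 0).
m, gamma, k, dt = 1.0, 2.0, 1.0, 0.01
x, v = 5.0, 0.0            # start at rest, away from the minimum

for _ in range(10_000):    # total simulated time: 100 s
    force = -k * x - gamma * v   # downhill force plus viscous friction
    v += dt * force / m          # Newton's second law, discretized
    x += dt * v

# After enough time the particle settles into the minimum.
assert abs(x) < 1e-2 and abs(v) < 1e-2
```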
Chapter II: The Characteristic Time $\tau$
2.1 The Fundamental Time Scale
Let us now introduce the most important quantity in our entire analysis: the characteristic time scale $\tau = m/\gamma$. This single number determines the entire behavior of our system.
The Relaxation Time: $\tau = m/\gamma$
🔵 PHYSICS
Time in which friction significantly reduces velocity.
Example: $m=1$ kg, $\gamma=10$ kg/s → $\tau=0.1$ s
🟡 MACHINE LEARNING
“Time” in which velocity decays significantly (measured in iterations).
Example: $\tau=10$ iterations
Deep Insight: The Meaning of $\tau$
If $\tau$ is LARGE (large $m$, small $\gamma$):
- PHYSICS: heavy particle in air → high inertia, oscillates a lot
- ML: optimizer with lots of “memory” → accumulates velocity, can overshoot
If $\tau$ is SMALL (small $m$, large $\gamma$):
- PHYSICS: light particle in honey → low inertia, follows instantaneous force
- ML: optimizer with little “memory” → responds only to current gradient
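The meaning of $\tau$ as a relaxation time can be checked directly: with no driving force, $m \, d\mathbf{v}/dt = -\gamma \mathbf{v}$ has the exact solution $\mathbf{v}(t) = \mathbf{v}_0 e^{-t/\tau}$. A small sketch using the $m = 1$ kg, $\gamma = 10$ kg/s example above:

```python
import math

m, gamma = 1.0, 10.0
tau = m / gamma              # 0.1 s, matching the example above
v0, dt = 1.0, 1e-4
steps = 1000                 # integrate for exactly t = tau = steps * dt

v = v0
for _ in range(steps):
    v += dt * (-gamma * v / m)   # Euler step of m dv/dt = -gamma*v

# After one relaxation time the velocity has dropped to ~1/e of its start.
assert abs(v - v0 / math.e) < 0.01
```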
2.2 Rewriting the Equation in Terms of $\tau$
Let us divide our equation of motion by $\gamma$ to make the role of $\tau$ explicit:
🔵 PHYSICS

$$\tau \frac{d\mathbf{v}}{dt} = -\mathbf{v} - \frac{1}{\gamma} \nabla U(\mathbf{x})$$

🟡 MACHINE LEARNING

$$\tau \frac{d\mathbf{v}}{dt} = -\mathbf{v} - \frac{1}{\gamma} \nabla L(\mathbf{w})$$
Chapter III: The Limit $\tau \to 0$ and the Emergence of Gradient Descent
3.1 The Overdamped Regime: When $\tau$ Becomes Negligible
Now we arrive at the crucial moment. What happens when $\tau$ becomes very small? This occurs when:
- $m \to 0$ (negligible mass)
- or $\gamma \to \infty$ (infinitely strong friction)
- or both, as long as $m/\gamma \to 0$
In this limit, the term $\tau \frac{d\mathbf{v}}{dt}$ becomes negligible compared to the other terms. Mathematically:

🔵 PHYSICS – Limit $\tau \to 0$

$$\mathbf{0} = -\mathbf{v} - \frac{1}{\gamma} \nabla U(\mathbf{x}) \quad\Longrightarrow\quad \mathbf{v} = -\frac{1}{\gamma} \nabla U(\mathbf{x})$$

Velocity is instantaneously proportional to force. No inertia, no memory.

🟡 ML – Limit $\tau \to 0$

$$\mathbf{v} = -\frac{1}{\gamma} \nabla L(\mathbf{w})$$

Velocity is instantaneously proportional to the gradient. No momentum accumulation.
ATTENTION: What does $m=0$ and $\gamma \to \infty$ physically mean?
| Parameter | Value in Pure GD | Physical Consequence | Consequence in ML |
|---|---|---|---|
| $m$ (mass) | → 0 | No resistance to velocity change | Velocity changes instantly at each step |
| $\gamma$ (friction) | → ∞ | Infinite resistance to movement | Velocity decays instantly if not reinforced by gradient |
| $\tau = m/\gamma$ | → 0 | Zero relaxation time | No “memory” between iterations |
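To see the overdamped limit emerge, one can simulate both the full second-order dynamics and the first-order rule $\mathbf{v} = -\frac{1}{\gamma}\nabla U$ and watch the trajectories coincide. A sketch with an assumed quadratic potential $U(x) = \frac{1}{2}x^2$ and a deliberately tiny $\tau$ (parameter values are illustrative):

```python
# Full dynamics:      m dv/dt = -dU/dx - gamma*v,  dx/dt = v
# Overdamped rule:    dx/dt = -(1/gamma) * dU/dx
# With tau = m/gamma tiny, the two trajectories are nearly identical.
m, gamma, dt = 0.01, 10.0, 1e-4   # tau = m/gamma = 1e-3: heavily overdamped
x_full, v = 1.0, 0.0
x_over = 1.0

for _ in range(2000):             # total simulated time: 0.2 s
    v += dt * (-x_full - gamma * v) / m   # full Newtonian dynamics
    x_full += dt * v
    x_over += dt * (-x_over / gamma)      # overdamped: v slaved to the gradient

# After a brief transient (~tau), the full trajectory tracks the overdamped one.
assert abs(x_full - x_over) < 1e-3
```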
3.2 From Continuous Dynamics to Discretization
In both physics simulations and machine learning, we work with discrete time steps. Let us discretize our overdamped equation:
🔵 PHYSICS – Continuous Form

$$\frac{d\mathbf{x}}{dt} = \mathbf{v} = -\frac{1}{\gamma} \nabla U(\mathbf{x})$$

Velocity is the time derivative of position

🟡 ML – Continuous Form

$$\frac{d\mathbf{w}}{dt} = \mathbf{v} = -\frac{1}{\gamma} \nabla L(\mathbf{w})$$

Velocity is the time derivative of the weights

Now we discretize using $\Delta t$ (time step):

🔵 PHYSICS – Discretization

$$\mathbf{x}_{t+1} = \mathbf{x}_t - \frac{\Delta t}{\gamma} \nabla U(\mathbf{x}_t)$$

🟡 ML – Discretization

$$\mathbf{w}_{t+1} = \mathbf{w}_t - \frac{\Delta t}{\gamma} \nabla L(\mathbf{w}_t)$$
3.3 The Learning Rate: $\eta = \Delta t / \gamma$
We now define the learning rate as the ratio of time step to friction coefficient:

$$\eta \equiv \frac{\Delta t}{\gamma}$$
Deep Insight: What the Learning Rate Really Is
$\eta = \Delta t / \gamma$ is not a magic number to choose randomly. It is the product of two physical quantities:
$\Delta t$ (time step):
- PHYSICS: how much time passes between updates
- ML: how much “virtual time” passes in one iteration
$1/\gamma$ (mobility):
- PHYSICS: how easily the particle moves (inverse of friction)
- ML: how easily weights update
So large $\eta$ means:
- Large $\Delta t$: big temporal jumps
- OR small $\gamma$: little friction, easy movement
- Result: large steps in space ($\mathbf{x}$ or $\mathbf{w}$)
And small $\eta$ means:
- Small $\Delta t$: frequent, gradual updates
- OR large $\gamma$: high friction
- Result: small steps, slow but stable convergence
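One consequence worth checking: only the ratio $\eta = \Delta t / \gamma$ matters for the update, not $\Delta t$ and $\gamma$ individually. A minimal sketch, using an assumed toy gradient:

```python
def step(x, dt, gamma, grad):
    """One overdamped Euler step: x_{t+1} = x_t - (dt / gamma) * U'(x_t)."""
    return x - (dt / gamma) * grad(x)

# Assumed toy potential U(x) = 1/2 (x - 3)^2, so U'(x) = x - 3.
grad = lambda x: x - 3.0

# Two different (dt, gamma) pairs with the same eta = dt / gamma = 0.01
a = step(5.0, 0.1, 10.0, grad)    # big time step, lots of friction
b = step(5.0, 0.01, 1.0, grad)    # small time step, little friction

# Both parameterizations take the identical step: 5 - 0.01 * 2 = 4.98
assert abs(a - b) < 1e-12
assert abs(a - 4.98) < 1e-12
```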
3.4 The Final Gradient Descent Equation
Substituting $\eta = \Delta t / \gamma$, we obtain the canonical form of gradient descent:
🔵 PHYSICS

$$\mathbf{x}_{t+1} = \mathbf{x}_t - \eta \nabla U(\mathbf{x}_t)$$

The particle moves in the direction opposite to the potential energy gradient.

🟡 MACHINE LEARNING

$$\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \nabla L(\mathbf{w}_t)$$

The weights update in the direction opposite to the loss function gradient.
SUMMARY: Pure Gradient Descent (SGD without momentum)
| Parameter | Physical Value | Meaning |
|---|---|---|
| Mass $m$ | = 0 | No inertia, no memory of past motion |
| Friction $\gamma$ | → ∞ | Infinitely strong friction, instant velocity decay |
| Characteristic time $\tau$ | = $m/\gamma$ → 0 | Instantaneous response to forces, extreme overdamped regime |
| Learning rate $\eta$ | = $\Delta t / \gamma$ (finite) | Controls step size; must be small for stability |
| Velocity $\mathbf{v}$ | Not a state variable | Instantly determined by gradient: $\mathbf{v} = -\eta \nabla L$ |
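The summary above translates directly into code. Below is a minimal sketch of pure gradient descent, using an assumed quadratic loss $L(\mathbf{w}) = \frac{1}{2}\|\mathbf{w} - \mathbf{w}^*\|^2$ (an illustrative choice):

```python
import numpy as np

def gradient_descent(grad, w0, eta=0.1, steps=100):
    """Pure gradient descent: w_{t+1} = w_t - eta * grad(w_t).
    No velocity state survives between iterations (tau = 0)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(steps):
        w = w - eta * grad(w)
    return w

# Assumed quadratic loss L(w) = 1/2 ||w - target||^2, gradient w - target.
target = np.array([1.0, -2.0])
w_star = gradient_descent(lambda w: w - target, w0=[0.0, 0.0])

assert np.allclose(w_star, target, atol=1e-4)
```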
Chapter IV: Momentum – When $m \neq 0$
4.1 Restoring the Mass
Let us now explore what happens when we do not take the extreme limit $m \to 0$. Instead, we allow the particle (or the optimizer) to retain some mass, some memory of its previous motion.
We return to our fundamental equation, keeping $\tau = m/\gamma$ finite but still assuming we’re in a reasonably damped regime ($\tau$ not too large):
🔵 PHYSICS – With finite mass

$$\tau \frac{d\mathbf{v}}{dt} = -\mathbf{v} - \frac{1}{\gamma} \nabla U(\mathbf{x})$$

Now the term $\tau \frac{d\mathbf{v}}{dt}$ is NOT negligible

🟡 ML – With finite “mass”

$$\tau \frac{d\mathbf{v}}{dt} = -\mathbf{v} - \frac{1}{\gamma} \nabla L(\mathbf{w})$$

The optimizer has memory; velocity is a state variable
4.2 Discretization with Memory
Let us discretize this equation carefully, approximating the derivative as:

$$\frac{d\mathbf{v}}{dt} \approx \frac{\mathbf{v}_{t+1} - \mathbf{v}_t}{\Delta t}$$

Substituting into our equation and multiplying both sides by $\Delta t / \tau$:

🔵 PHYSICS

$$\mathbf{v}_{t+1} = \mathbf{v}_t - \frac{\Delta t}{\tau} \mathbf{v}_t - \frac{\Delta t}{\gamma \tau} \nabla U(\mathbf{x}_t)$$

🟡 MACHINE LEARNING

$$\mathbf{v}_{t+1} = \mathbf{v}_t - \frac{\Delta t}{\tau} \mathbf{v}_t - \frac{\Delta t}{\gamma \tau} \nabla L(\mathbf{w}_t)$$

Factoring out $\mathbf{v}_t$ from the first two terms:

🔵 PHYSICS

$$\mathbf{v}_{t+1} = \left(1 - \frac{\Delta t}{\tau}\right) \mathbf{v}_t - \frac{\Delta t}{\gamma \tau} \nabla U(\mathbf{x}_t)$$

🟡 MACHINE LEARNING

$$\mathbf{v}_{t+1} = \left(1 - \frac{\Delta t}{\tau}\right) \mathbf{v}_t - \frac{\Delta t}{\gamma \tau} \nabla L(\mathbf{w}_t)$$
4.3 The Momentum Coefficient: $\rho = 1 - \Delta t / \tau$
Now we define the momentum coefficient $\rho$ (rho):

$$\rho \equiv 1 - \frac{\Delta t}{\tau}$$
Deep Insight: The Physical Meaning of $\rho$
$\rho = 1 - \Delta t / \tau$ tells us how much “memory” the system has from one step to the next.
Case 1: $\rho \to 1$ (strong momentum)
This happens when $\Delta t / \tau \to 0$, meaning:
- $\tau = m/\gamma$ is large (large mass, small friction)
- OR $\Delta t$ is much smaller than $\tau$
| Physics | Machine Learning |
|---|---|
| Heavy particle in air, high inertia | Optimizer with strong “memory”, accumulates velocity |
| Velocity decays very slowly | Smoothed gradient descent, can jump shallow minima |
| Can oscillate around minima | Can overshoot the loss minimum |
Case 2: $\rho \to 0$ (weak momentum)
This happens when $\Delta t / \tau \to 1$, meaning:
- $\tau = m/\gamma$ is small (small mass, large friction)
- OR $\Delta t \approx \tau$ (time step equals relaxation time)
| Physics | Machine Learning |
|---|---|
| Light particle in honey, low inertia | Optimizer with little “memory” |
| Velocity decays rapidly | We return to pure gradient descent! |
| Almost instantly follows local force | Almost instantly follows local gradient |
Typical Case: $\rho = 0.9$
In practice, $\rho = 0.9$ is often used. This means:
- $\Delta t / \tau = 0.1$ → time step is 1/10 of relaxation time
- Velocity decays by 10% per step
- After ~10 steps, velocity has decayed to $1/e \approx 37\%$ of initial value
- Good balance between memory of the past and reactivity to the present
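The arithmetic behind the $\rho = 0.9$ rule of thumb is easy to verify: ten multiplications by $0.9$ leave $0.9^{10} \approx 0.35$, close to $1/e$:

```python
import math

rho = 0.9        # momentum coefficient, so dt / tau = 1 - rho = 0.1
v0 = 1.0

# With no gradient input, the velocity shrinks by a factor rho each step.
v = v0
for _ in range(10):
    v *= rho

# After ~tau/dt = 10 steps the velocity is close to 1/e of its initial value.
assert abs(v - 1 / math.e) < 0.03
```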
4.4 The Final Equation with Momentum
Defining $\rho = 1 - \Delta t / \tau$ and $\eta = \Delta t / \gamma$, measuring velocity as displacement per step (which absorbs a factor of $\Delta t$), and folding the remaining constant factor from the discretization into the learning rate, we obtain the classical momentum method:

🔵 PHYSICS – With Momentum

$$\mathbf{v}_{t+1} = \rho \mathbf{v}_t - \eta \nabla U(\mathbf{x}_t), \qquad \mathbf{x}_{t+1} = \mathbf{x}_t + \mathbf{v}_{t+1}$$

Two coupled equations: one for velocity, one for position

🟡 ML – Momentum Method

$$\mathbf{v}_{t+1} = \rho \mathbf{v}_t - \eta \nabla L(\mathbf{w}_t), \qquad \mathbf{w}_{t+1} = \mathbf{w}_t + \mathbf{v}_{t+1}$$

Two coupled equations: velocity update + parameter update
SUMMARY: Gradient Descent with Momentum
| Parameter | Value | Physical Meaning | Meaning in ML |
|---|---|---|---|
| $m$ | > 0 (finite) | Particle has mass, inertia | Optimizer has “virtual mass” |
| $\gamma$ | Finite | Finite viscous friction | Finite velocity damping |
| $\tau = m/\gamma$ | > 0 (finite) | Finite relaxation time | Velocity has finite “memory” |
| $\rho = 1-\Delta t/\tau$ | Typically 0.9 | Fraction of velocity retained | Momentum coefficient |
| $\eta = \Delta t/\gamma$ | Small (e.g. 0.01) | Learning rate | Step size |
| $\mathbf{v}$ | State variable | Particle velocity | Accumulated optimizer velocity |
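The two coupled momentum updates are only a few lines of code. A minimal sketch on an assumed quadratic loss $L(\mathbf{w}) = \frac{1}{2}\|\mathbf{w} - \mathbf{w}^*\|^2$ (an illustrative choice):

```python
import numpy as np

def momentum_descent(grad, w0, eta=0.01, rho=0.9, steps=500):
    """Classical momentum:
    v_{t+1} = rho * v_t - eta * grad(w_t);  w_{t+1} = w_t + v_{t+1}."""
    w = np.asarray(w0, dtype=float)
    v = np.zeros_like(w)          # velocity is now a state variable
    for _ in range(steps):
        v = rho * v - eta * grad(w)
        w = w + v
    return w

# Assumed quadratic loss with minimum at `target`, gradient w - target.
target = np.array([1.0, -2.0])
w_star = momentum_descent(lambda w: w - target, w0=[0.0, 0.0])

assert np.allclose(w_star, target, atol=1e-3)
```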
Chapter V: Visual Comparison and Final Tables
Visualization: The Elongated Valley Problem
Minimizing $f(w_1, w_2) = \frac{1}{2}(20w_1^2 + w_2^2)$ — an elongated valley with high curvature in $w_1$ direction
Key Observation: Why Momentum Helps
In elongated valleys, pure gradient descent oscillates wildly in the narrow dimension (high curvature) while making slow progress along the valley floor (low curvature). Momentum dampens these oscillations by accumulating velocity: successive gradients pointing in opposite directions cancel out, while gradients consistently pointing down the valley amplify each other.
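This behavior can be reproduced on the valley $f(w_1, w_2) = \frac{1}{2}(20w_1^2 + w_2^2)$ from above. The hyperparameters below are hand-tuned for this particular quadratic (my assumption, not values from the text); with them, momentum reaches the minimum at the origin far faster than plain gradient descent:

```python
import numpy as np

# Elongated valley f(w1, w2) = 1/2 * (20*w1^2 + w2^2); gradient (20*w1, w2).
grad = lambda w: np.array([20.0 * w[0], w[1]])

def gd(w, eta, steps):
    for _ in range(steps):
        w = w - eta * grad(w)
    return w

def momentum(w, eta, rho, steps):
    v = np.zeros(2)
    for _ in range(steps):
        v = rho * v - eta * grad(w)
        w = w + v
    return w

w0 = np.array([1.0, 1.0])
# eta < 2/20 = 0.1 is required to keep plain GD stable in the stiff direction.
w_gd = gd(w0, eta=0.095, steps=100)
w_mom = momentum(w0, eta=0.13, rho=0.4, steps=100)

# Momentum ends up much closer to the minimum than plain GD.
assert np.linalg.norm(w_mom) < np.linalg.norm(w_gd) < 1e-2
```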
5.1 Complete Comparison Table
| Aspect | Pure Gradient Descent | GD with Momentum |
|---|---|---|
| Physical limit | $m=0$, $\gamma \to \infty$, $\tau \to 0$ | $m>0$, $\gamma$ finite, $\tau$ finite |
| Equation (physics) | $\mathbf{x}_{t+1} = \mathbf{x}_t - \eta \nabla U(\mathbf{x}_t)$ | $\mathbf{v}_{t+1} = \rho \mathbf{v}_t - \eta \nabla U(\mathbf{x}_t)$; $\mathbf{x}_{t+1} = \mathbf{x}_t + \mathbf{v}_{t+1}$ |
| Equation (ML) | $\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \nabla L(\mathbf{w}_t)$ | $\mathbf{v}_{t+1} = \rho \mathbf{v}_t - \eta \nabla L(\mathbf{w}_t)$; $\mathbf{w}_{t+1} = \mathbf{w}_t + \mathbf{v}_{t+1}$ |
| State variables | Only $\mathbf{x}$ (or $\mathbf{w}$) | $\mathbf{x}$ and $\mathbf{v}$ (or $\mathbf{w}$ and $\mathbf{v}$) |
| Memory of past | None ($\tau=0$) | Yes, through $\mathbf{v}_t$ |
| Physics: particle in… | Very thick honey | Oil or air (depending on $\rho$) |
| Response to constant $\nabla U$ or $\nabla L$ | Constant velocity | Velocity builds up toward the terminal value $-\frac{\eta}{1-\rho} \nabla L$ |
| In narrow valleys | Can oscillate (zig-zag) | Smoother trajectory |
| Shallow minima | Gets stuck easily | Can jump them thanks to momentum |
| Overshooting | Impossible (no inertia) | Possible if $\rho$ too high |
| Convergence | Slower but stable | Faster but can oscillate |
| Hyperparameters | Only $\eta$ | $\eta$ and $\rho$ |
| Typical values | $\eta \approx 0.01\text{-}0.1$ | $\eta \approx 0.01$, $\rho \approx 0.9$ |
Epilogue: The Beauty of Exact Mapping
What we have discovered in this long journey is not merely an analogy, but an exact mathematical isomorphism—a one-to-one correspondence between the equations of dissipative classical mechanics and the algorithms of gradient-based optimization.
Every symbol in physics has its precise counterpart in machine learning. Every physical parameter—mass, friction, relaxation time—has its corresponding role in the optimizer. The gradient descent algorithm is not inspired by physics; it is physics, applied to the abstract space of neural network parameters rather than physical space.
When we set $m = 0$ and $\gamma \to \infty$ (while keeping $\eta = \Delta t / \gamma$ finite), we obtain pure gradient descent: a system with no memory, moving through parameter space like a particle in infinitely viscous honey, always following the instantaneous gradient.
When we allow $m > 0$ and $\gamma$ finite, we obtain the momentum method: a system that remembers its past motion, accumulates velocity in consistent directions, and can coast through unfavorable regions toward better minima.
The learning rate $\eta$ is not an arbitrary tuning parameter—it is precisely $\Delta t / \gamma$, the ratio of temporal discretization to friction. The momentum coefficient $\rho$ is not a magic number—it is exactly $1 - \Delta t / \tau$, encoding how much velocity persists from one iteration to the next.
Yet we must remember: the loss landscapes of neural networks are stranger than any physical terrain. They exist in spaces of unimaginable dimension, they shift with each mini-batch, and they contain structures—sharp versus flat minima, mode connectivity, loss surface geometry—that we are only beginning to understand.
But the physics gives us a foundation, a language, a set of intuitions that guide us through this strange landscape. And that, perhaps, is the deepest lesson: that the mathematics of the natural world and the mathematics of artificial intelligence are not separate domains, but different manifestations of the same underlying principles.