Gradient Descent
Prologue: The Map of the Territory
Let us begin our investigation by establishing a precise correspondence between two worlds: the world of classical mechanics, where particles move through physical space under forces, and the world of machine learning, where algorithms navigate through parameter space seeking to minimize loss.
This is not merely an analogy—it is an exact mathematical mapping. Every symbol, every equation in physics has its precise counterpart in optimization theory. Let us make this mapping explicit from the very beginning:
| Physics symbol | Physical meaning | ↔ | ML symbol | ML meaning |
|---|---|---|---|---|
| $\mathbf{x}$ | Position of particle | ↔ | $\mathbf{w}$ | Weights/parameters of neural network |
| $U(\mathbf{x})$ | Potential energy | ↔ | $L(\mathbf{w})$ | Loss function (error function) |
| $\mathbf{v}$ | Velocity of particle | ↔ | $\mathbf{v}$ | Velocity in momentum (same symbol!) |
| $m$ | Mass of particle | ↔ | $m$ | Virtual “mass” of optimizer |
| $\gamma$ | Friction coefficient | ↔ | $\gamma$ | Damping coefficient of optimizer |
| $-\nabla U(\mathbf{x})$ | Force pushing downhill | ↔ | $-\nabla L(\mathbf{w})$ | Gradient pushing toward minimum |
| $t$ | Physical time (continuous) | ↔ | $t$ | Training iteration (discrete) |
| $\Delta t$ | Time step | ↔ | $\Delta t$ | One optimization step |
Fundamental Insight:
This is not just a poetic metaphor. It is an exact mathematical mapping. Whenever we write a physics equation, we can substitute $\mathbf{x} \to \mathbf{w}$ and $U \to L$ and immediately obtain the corresponding gradient descent equation. From now on, we will always show both versions side by side.
Chapter I: The Fundamental Equations
1.1 The Conservative Force and the Loss Gradient
Let us begin with the most fundamental equation: how does the landscape—whether physical or parametric—exert a “force” that guides motion toward lower regions?
🔵 PHYSICS

$$\mathbf{F}(\mathbf{x}) = -\nabla U(\mathbf{x})$$

The force is the negative gradient of potential energy. It pushes the particle toward regions of lower energy.

🟡 MACHINE LEARNING

$$\Delta \mathbf{w} \propto -\nabla L(\mathbf{w})$$

The negative gradient of the loss indicates the direction of steepest descent. It guides the weights toward values that minimize error.
Why the negative sign?
The gradient $\nabla U(\mathbf{x})$ points in the direction of maximum increase of $U$. But we want to minimize $U$ (or $L$), so we move in the opposite direction: $-\nabla U$ (or $-\nabla L$).
In physics: the particle “slides down” the energy gradient.
In ML: the algorithm “descends” along the loss gradient.
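As a quick sanity check, the sign convention can be verified numerically. The sketch below uses an assumed toy loss $L(w) = \frac{1}{2}(w-3)^2$ (an illustrative choice, not from the text) and confirms that a small step along $-\nabla L$ decreases the loss while a step along $+\nabla L$ increases it:

```python
def loss(w):
    """Assumed toy loss L(w) = 1/2 (w - 3)^2, minimized at w = 3."""
    return 0.5 * (w - 3.0) ** 2

def grad(w):
    """Analytic gradient dL/dw = w - 3."""
    return w - 3.0

w = 10.0
eps = 1e-4
# Stepping along -grad lowers the loss; stepping along +grad raises it.
down = loss(w - eps * grad(w))
up = loss(w + eps * grad(w))
assert down < loss(w) < up
```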
1.2 Newton’s Second Law and Parameter Inertia
But force alone does not determine motion. We must specify how the system responds to this force. Newton’s second law provides the answer in physics, and we can write its exact counterpart for optimization:
🔵 PHYSICS

$$m \frac{d\mathbf{v}}{dt} = \mathbf{F}$$

Mass $m$ determines how much force influences acceleration. Large mass → small acceleration (inertia).

🟡 MACHINE LEARNING

$$m \frac{d\mathbf{v}}{dt} = \mathbf{F}$$

“Mass” $m$ determines how much the driving term (here, the negative loss gradient) influences the change in velocity. Large $m$ → gradual changes.
Combining with our force/gradient equation:
🔵 PHYSICS

$$m \frac{d\mathbf{v}}{dt} = -\nabla U(\mathbf{x})$$

🟡 MACHINE LEARNING

$$m \frac{d\mathbf{v}}{dt} = -\nabla L(\mathbf{w})$$
1.3 Friction: Energy Dissipation
In the real world, motion is always subject to friction. For motion through a viscous medium, the friction force is proportional to velocity. This is crucial—let us write both versions:
🔵 PHYSICS

$$\mathbf{F}_{\text{friction}} = -\gamma \mathbf{v}$$

$\gamma$ = viscous friction coefficient
Units: [mass/time]
Large $\gamma$ → high resistance

🟡 MACHINE LEARNING

$$\mathbf{F}_{\text{damping}} = -\gamma \mathbf{v}$$

$\gamma$ = damping coefficient
Units are notional, since an iteration carries no physical dimension
Large $\gamma$ → velocity decays quickly
1.4 The Complete Equation of Motion
Now we combine all forces—the driving force from the gradient and the resistive force from friction—to obtain the complete equation of motion:
🔵 PHYSICS – Complete Equation

$$m \frac{d\mathbf{v}}{dt} = -\nabla U(\mathbf{x}) - \gamma \mathbf{v}$$

Three terms:
• $m \frac{d\mathbf{v}}{dt}$: resistance to velocity change (inertia)
• $-\nabla U(\mathbf{x})$: force downhill
• $-\gamma \mathbf{v}$: friction that slows the motion

🟡 ML – Complete Equation

$$m \frac{d\mathbf{v}}{dt} = -\nabla L(\mathbf{w}) - \gamma \mathbf{v}$$

Three terms:
• $m \frac{d\mathbf{v}}{dt}$: resistance to velocity change
• $-\nabla L(\mathbf{w})$: gradient toward minimum
• $-\gamma \mathbf{v}$: velocity damping
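The complete equation of motion can be integrated numerically. Below is a minimal forward-Euler sketch, assuming a quadratic potential $U(x) = \frac{1}{2}kx^2$ and illustrative values for $m$, $\gamma$, and $k$ (all choices mine, not from the text):

```python
# Forward-Euler integration of  m dv/dt = -dU/dx - gamma*v,  dx/dt = v
# for the assumed potential U(x) = 1/2 * k * x^2 (minimum at x = 0).
m, gamma, k, dt = 1.0, 2.0, 1.0, 0.01
x, v = 5.0, 0.0            # start at rest, away from the minimum

for _ in range(10_000):    # total simulated time: 100 s
    force = -k * x - gamma * v   # downhill force plus viscous friction
    v += dt * force / m          # Newton's second law, discretized
    x += dt * v

# After enough time the particle settles into the minimum.
assert abs(x) < 1e-2 and abs(v) < 1e-2
```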
Chapter II: The Characteristic Time $\tau$
2.1 The Fundamental Time Scale
Let us now introduce the most important quantity in our entire analysis: the characteristic time scale $\tau = m/\gamma$. This single number determines the entire behavior of our system.
The Relaxation Time: $\tau = m/\gamma$
🔵 PHYSICS
Time in which friction significantly reduces velocity.
Example: $m=1$ kg, $\gamma=10$ kg/s → $\tau=0.1$ s
🟡 MACHINE LEARNING
“Time” in which velocity decays significantly (measured in iterations).
Example: $\tau=10$ iterations
Deep Insight: The Meaning of $\tau$
If $\tau$ is LARGE (large $m$, small $\gamma$):
- PHYSICS: heavy particle in air → high inertia, oscillates a lot
- ML: optimizer with lots of “memory” → accumulates velocity, can overshoot
If $\tau$ is SMALL (small $m$, large $\gamma$):
- PHYSICS: light particle in honey → low inertia, follows instantaneous force
- ML: optimizer with little “memory” → responds only to current gradient
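The meaning of $\tau$ as a relaxation time can be checked directly: with no driving force, $m \, d\mathbf{v}/dt = -\gamma \mathbf{v}$ has the exact solution $\mathbf{v}(t) = \mathbf{v}_0 e^{-t/\tau}$. A small sketch using the $m = 1$ kg, $\gamma = 10$ kg/s example above:

```python
import math

m, gamma = 1.0, 10.0
tau = m / gamma              # 0.1 s, matching the example above
v0, dt = 1.0, 1e-4
steps = 1000                 # integrate for exactly t = tau = steps * dt

v = v0
for _ in range(steps):
    v += dt * (-gamma * v / m)   # Euler step of m dv/dt = -gamma*v

# After one relaxation time the velocity has dropped to ~1/e of its start.
assert abs(v - v0 / math.e) < 0.01
```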
2.2 Rewriting the Equation in Terms of $\tau$
Let us divide our equation of motion by $\gamma$ to make the role of $\tau$ explicit:
🔵 PHYSICS

$$\tau \frac{d\mathbf{v}}{dt} = -\mathbf{v} - \frac{1}{\gamma} \nabla U(\mathbf{x})$$

🟡 MACHINE LEARNING

$$\tau \frac{d\mathbf{v}}{dt} = -\mathbf{v} - \frac{1}{\gamma} \nabla L(\mathbf{w})$$
Chapter III: The Limit $\tau \to 0$ and the Emergence of Gradient Descent
3.1 The Overdamped Regime: When $\tau$ Becomes Negligible
Now we arrive at the crucial moment. What happens when $\tau$ becomes very small? This occurs when:
- $m \to 0$ (negligible mass)
- or $\gamma \to \infty$ (infinitely strong friction)
- or both, as long as $m/\gamma \to 0$
In this limit, the term $\tau \frac{d\mathbf{v}}{dt}$ becomes negligible compared to the other terms. Mathematically:

🔵 PHYSICS – Limit $\tau \to 0$

$$\mathbf{0} = -\mathbf{v} - \frac{1}{\gamma} \nabla U(\mathbf{x}) \quad\Longrightarrow\quad \mathbf{v} = -\frac{1}{\gamma} \nabla U(\mathbf{x})$$

Velocity is instantaneously proportional to force. No inertia, no memory.

🟡 ML – Limit $\tau \to 0$

$$\mathbf{v} = -\frac{1}{\gamma} \nabla L(\mathbf{w})$$

Velocity is instantaneously proportional to the gradient. No momentum accumulation.
ATTENTION: What does $m=0$ and $\gamma \to \infty$ physically mean?
| Parameter | Value in Pure GD | Physical Consequence | Consequence in ML |
|---|---|---|---|
| $m$ (mass) | → 0 | No resistance to velocity change | Velocity changes instantly at each step |
| $\gamma$ (friction) | → ∞ | Infinite resistance to movement | Velocity decays instantly if not reinforced by gradient |
| $\tau = m/\gamma$ | → 0 | Zero relaxation time | No “memory” between iterations |
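To see the overdamped limit emerge, one can simulate both the full second-order dynamics and the first-order rule $\mathbf{v} = -\frac{1}{\gamma}\nabla U$ and watch the trajectories coincide. A sketch with an assumed quadratic potential $U(x) = \frac{1}{2}x^2$ and a deliberately tiny $\tau$ (parameter values are illustrative):

```python
# Full dynamics:      m dv/dt = -dU/dx - gamma*v,  dx/dt = v
# Overdamped rule:    dx/dt = -(1/gamma) * dU/dx
# With tau = m/gamma tiny, the two trajectories are nearly identical.
m, gamma, dt = 0.01, 10.0, 1e-4   # tau = m/gamma = 1e-3: heavily overdamped
x_full, v = 1.0, 0.0
x_over = 1.0

for _ in range(2000):             # total simulated time: 0.2 s
    v += dt * (-x_full - gamma * v) / m   # full Newtonian dynamics
    x_full += dt * v
    x_over += dt * (-x_over / gamma)      # overdamped: v slaved to the gradient

# After a brief transient (~tau), the full trajectory tracks the overdamped one.
assert abs(x_full - x_over) < 1e-3
```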
3.2 From Continuous Dynamics to Discretization
In both physics simulations and machine learning, we work with discrete time steps. Let us discretize our overdamped equation:
🔵 PHYSICS – Continuous Form

$$\frac{d\mathbf{x}}{dt} = \mathbf{v} = -\frac{1}{\gamma} \nabla U(\mathbf{x})$$

Velocity is the time derivative of position

🟡 ML – Continuous Form

$$\frac{d\mathbf{w}}{dt} = \mathbf{v} = -\frac{1}{\gamma} \nabla L(\mathbf{w})$$

Velocity is the time derivative of the weights

Now we discretize using $\Delta t$ (time step):

🔵 PHYSICS – Discretization

$$\mathbf{x}_{t+1} = \mathbf{x}_t - \frac{\Delta t}{\gamma} \nabla U(\mathbf{x}_t)$$

🟡 ML – Discretization

$$\mathbf{w}_{t+1} = \mathbf{w}_t - \frac{\Delta t}{\gamma} \nabla L(\mathbf{w}_t)$$
3.3 The Learning Rate: $\eta = \Delta t / \gamma$
We now define the learning rate as the ratio of time step to friction coefficient:

$$\eta \equiv \frac{\Delta t}{\gamma}$$
Deep Insight: What the Learning Rate Really Is
$\eta = \Delta t / \gamma$ is not a magic number to choose randomly. It is the product of two physical quantities:
$\Delta t$ (time step):
- PHYSICS: how much time passes between updates
- ML: how much “virtual time” passes in one iteration
$1/\gamma$ (mobility):
- PHYSICS: how easily the particle moves (inverse of friction)
- ML: how easily weights update
So large $\eta$ means:
- Large $\Delta t$: big temporal jumps
- OR small $\gamma$: little friction, easy movement
- Result: large steps in space ($\mathbf{x}$ or $\mathbf{w}$)
And small $\eta$ means:
- Small $\Delta t$: frequent, gradual updates
- OR large $\gamma$: high friction
- Result: small steps, slow but stable convergence
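One consequence worth checking: only the ratio $\eta = \Delta t / \gamma$ matters for the update, not $\Delta t$ and $\gamma$ individually. A minimal sketch, using an assumed toy gradient:

```python
def step(x, dt, gamma, grad):
    """One overdamped Euler step: x_{t+1} = x_t - (dt / gamma) * U'(x_t)."""
    return x - (dt / gamma) * grad(x)

# Assumed toy potential U(x) = 1/2 (x - 3)^2, so U'(x) = x - 3.
grad = lambda x: x - 3.0

# Two different (dt, gamma) pairs with the same eta = dt / gamma = 0.01
a = step(5.0, 0.1, 10.0, grad)    # big time step, lots of friction
b = step(5.0, 0.01, 1.0, grad)    # small time step, little friction

# Both parameterizations take the identical step: 5 - 0.01 * 2 = 4.98
assert abs(a - b) < 1e-12
assert abs(a - 4.98) < 1e-12
```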
3.4 The Final Gradient Descent Equation
Substituting $\eta = \Delta t / \gamma$, we obtain the canonical form of gradient descent:
🔵 PHYSICS

$$\mathbf{x}_{t+1} = \mathbf{x}_t - \eta \nabla U(\mathbf{x}_t)$$

The particle moves in the direction opposite to the potential energy gradient.

🟡 MACHINE LEARNING

$$\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \nabla L(\mathbf{w}_t)$$

The weights update in the direction opposite to the loss function gradient.
SUMMARY: Pure Gradient Descent (SGD without momentum)
| Parameter | Physical Value | Meaning |
|---|---|---|
| Mass $m$ | = 0 | No inertia, no memory of past motion |
| Friction $\gamma$ | → ∞ | Infinitely strong friction, instant velocity decay |
| Characteristic time $\tau$ | = $m/\gamma$ → 0 | Instantaneous response to forces, extreme overdamped regime |
| Learning rate $\eta$ | = $\Delta t / \gamma$ (finite) | Controls step size; must be small for stability |
| Velocity $\mathbf{v}$ | Not a state variable | Instantly determined by gradient: $\mathbf{v} = -\eta \nabla L$ |
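The summary above translates directly into code. Below is a minimal sketch of pure gradient descent, using an assumed quadratic loss $L(\mathbf{w}) = \frac{1}{2}\|\mathbf{w} - \mathbf{w}^*\|^2$ (an illustrative choice):

```python
import numpy as np

def gradient_descent(grad, w0, eta=0.1, steps=100):
    """Pure gradient descent: w_{t+1} = w_t - eta * grad(w_t).
    No velocity state survives between iterations (tau = 0)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(steps):
        w = w - eta * grad(w)
    return w

# Assumed quadratic loss L(w) = 1/2 ||w - target||^2, gradient w - target.
target = np.array([1.0, -2.0])
w_star = gradient_descent(lambda w: w - target, w0=[0.0, 0.0])

assert np.allclose(w_star, target, atol=1e-4)
```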
Chapter IV: Momentum – When $m \neq 0$
4.1 Restoring the Mass
Let us now explore what happens when we do not take the extreme limit $m \to 0$. Instead, we allow the particle (or the optimizer) to retain some mass, some memory of its previous motion.
We return to our fundamental equation, keeping $\tau = m/\gamma$ finite but still assuming we’re in a reasonably damped regime ($\tau$ not too large):
🔵 PHYSICS – With finite mass

$$\tau \frac{d\mathbf{v}}{dt} = -\mathbf{v} - \frac{1}{\gamma} \nabla U(\mathbf{x})$$

Now the term $\tau \frac{d\mathbf{v}}{dt}$ is NOT negligible

🟡 ML – With finite “mass”

$$\tau \frac{d\mathbf{v}}{dt} = -\mathbf{v} - \frac{1}{\gamma} \nabla L(\mathbf{w})$$

The optimizer has memory; velocity is a state variable
4.2 Discretization with Memory
Let us discretize this equation carefully, approximating the derivative as:

$$\frac{d\mathbf{v}}{dt} \approx \frac{\mathbf{v}_{t+1} - \mathbf{v}_t}{\Delta t}$$

Substituting into our equation and multiplying both sides by $\Delta t / \tau$:

🔵 PHYSICS

$$\mathbf{v}_{t+1} = \mathbf{v}_t - \frac{\Delta t}{\tau} \mathbf{v}_t - \frac{\Delta t}{\gamma \tau} \nabla U(\mathbf{x}_t)$$

🟡 MACHINE LEARNING

$$\mathbf{v}_{t+1} = \mathbf{v}_t - \frac{\Delta t}{\tau} \mathbf{v}_t - \frac{\Delta t}{\gamma \tau} \nabla L(\mathbf{w}_t)$$

Factoring out $\mathbf{v}_t$ from the first two terms:

🔵 PHYSICS

$$\mathbf{v}_{t+1} = \left(1 - \frac{\Delta t}{\tau}\right) \mathbf{v}_t - \frac{\Delta t}{\gamma \tau} \nabla U(\mathbf{x}_t)$$

🟡 MACHINE LEARNING

$$\mathbf{v}_{t+1} = \left(1 - \frac{\Delta t}{\tau}\right) \mathbf{v}_t - \frac{\Delta t}{\gamma \tau} \nabla L(\mathbf{w}_t)$$
4.3 The Momentum Coefficient: $\rho = 1 - \Delta t / \tau$
Now we define the momentum coefficient $\rho$ (rho):

$$\rho \equiv 1 - \frac{\Delta t}{\tau}$$
Deep Insight: The Physical Meaning of $\rho$
$\rho = 1 - \Delta t / \tau$ tells us how much “memory” the system has from one step to the next.
Case 1: $\rho \to 1$ (strong momentum)
This happens when $\Delta t / \tau \to 0$, meaning:
- $\tau = m/\gamma$ is large (large mass, small friction)
- OR $\Delta t$ is much smaller than $\tau$
| Physics | Machine Learning |
|---|---|
| Heavy particle in air, high inertia | Optimizer with strong “memory”, accumulates velocity |
| Velocity decays very slowly | Smoothed gradient descent, can jump shallow minima |
| Can oscillate around minima | Can overshoot the loss minimum |
Case 2: $\rho \to 0$ (weak momentum)
This happens when $\Delta t / \tau \to 1$, meaning:
- $\tau = m/\gamma$ is small (small mass, large friction)
- OR $\Delta t \approx \tau$ (time step equals relaxation time)
| Physics | Machine Learning |
|---|---|
| Light particle in honey, low inertia | Optimizer with little “memory” |
| Velocity decays rapidly | We return to pure gradient descent! |
| Almost instantly follows local force | Almost instantly follows local gradient |
Typical Case: $\rho = 0.9$
In practice, $\rho = 0.9$ is often used. This means:
- $\Delta t / \tau = 0.1$ → time step is 1/10 of relaxation time
- Velocity decays by 10% per step
- After ~10 steps, velocity has decayed to $1/e \approx 37\%$ of initial value
- Good balance between memory of the past and reactivity to the present
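The arithmetic behind the $\rho = 0.9$ rule of thumb is easy to verify: ten multiplications by $0.9$ leave $0.9^{10} \approx 0.35$, close to $1/e$:

```python
import math

rho = 0.9        # momentum coefficient, so dt / tau = 1 - rho = 0.1
v0 = 1.0

# With no gradient input, the velocity shrinks by a factor rho each step.
v = v0
for _ in range(10):
    v *= rho

# After ~tau/dt = 10 steps the velocity is close to 1/e of its initial value.
assert abs(v - 1 / math.e) < 0.03
```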
4.4 The Final Equation with Momentum
Defining $\rho = 1 - \Delta t / \tau$ and $\eta = \Delta t / \gamma$, measuring velocity as displacement per step (which absorbs a factor of $\Delta t$), and folding the remaining constant factor from the discretization into the learning rate, we obtain the classical momentum method:

🔵 PHYSICS – With Momentum

$$\mathbf{v}_{t+1} = \rho \mathbf{v}_t - \eta \nabla U(\mathbf{x}_t), \qquad \mathbf{x}_{t+1} = \mathbf{x}_t + \mathbf{v}_{t+1}$$

Two coupled equations: one for velocity, one for position

🟡 ML – Momentum Method

$$\mathbf{v}_{t+1} = \rho \mathbf{v}_t - \eta \nabla L(\mathbf{w}_t), \qquad \mathbf{w}_{t+1} = \mathbf{w}_t + \mathbf{v}_{t+1}$$

Two coupled equations: velocity update + parameter update
SUMMARY: Gradient Descent with Momentum
| Parameter | Value | Physical Meaning | Meaning in ML |
|---|---|---|---|
| $m$ | > 0 (finite) | Particle has mass, inertia | Optimizer has “virtual mass” |
| $\gamma$ | Finite | Finite viscous friction | Finite velocity damping |
| $\tau = m/\gamma$ | > 0 (finite) | Finite relaxation time | Velocity has finite “memory” |
| $\rho = 1-\Delta t/\tau$ | Typically 0.9 | Fraction of velocity retained | Momentum coefficient |
| $\eta = \Delta t/\gamma$ | Small (e.g. 0.01) | Learning rate | Step size |
| $\mathbf{v}$ | State variable | Particle velocity | Accumulated optimizer velocity |
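The two coupled momentum updates are only a few lines of code. A minimal sketch on an assumed quadratic loss $L(\mathbf{w}) = \frac{1}{2}\|\mathbf{w} - \mathbf{w}^*\|^2$ (an illustrative choice):

```python
import numpy as np

def momentum_descent(grad, w0, eta=0.01, rho=0.9, steps=500):
    """Classical momentum:
    v_{t+1} = rho * v_t - eta * grad(w_t);  w_{t+1} = w_t + v_{t+1}."""
    w = np.asarray(w0, dtype=float)
    v = np.zeros_like(w)          # velocity is now a state variable
    for _ in range(steps):
        v = rho * v - eta * grad(w)
        w = w + v
    return w

# Assumed quadratic loss with minimum at `target`, gradient w - target.
target = np.array([1.0, -2.0])
w_star = momentum_descent(lambda w: w - target, w0=[0.0, 0.0])

assert np.allclose(w_star, target, atol=1e-3)
```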
Chapter V: Visual Comparison and Final Tables
Visualization: The Elongated Valley Problem
Minimizing $f(w_1, w_2) = \frac{1}{2}(20w_1^2 + w_2^2)$ — an elongated valley with high curvature in $w_1$ direction
Key Observation: Why Momentum Helps
In elongated valleys, pure gradient descent oscillates wildly in the narrow dimension (high curvature) while making slow progress along the valley floor (low curvature). Momentum dampens these oscillations by accumulating velocity: successive gradients pointing in opposite directions cancel out, while gradients consistently pointing down the valley amplify each other.
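This behavior can be reproduced on the valley $f(w_1, w_2) = \frac{1}{2}(20w_1^2 + w_2^2)$ from above. The hyperparameters below are hand-tuned for this particular quadratic (my assumption, not values from the text); with them, momentum reaches the minimum at the origin far faster than plain gradient descent:

```python
import numpy as np

# Elongated valley f(w1, w2) = 1/2 * (20*w1^2 + w2^2); gradient (20*w1, w2).
grad = lambda w: np.array([20.0 * w[0], w[1]])

def gd(w, eta, steps):
    for _ in range(steps):
        w = w - eta * grad(w)
    return w

def momentum(w, eta, rho, steps):
    v = np.zeros(2)
    for _ in range(steps):
        v = rho * v - eta * grad(w)
        w = w + v
    return w

w0 = np.array([1.0, 1.0])
# eta < 2/20 = 0.1 is required to keep plain GD stable in the stiff direction.
w_gd = gd(w0, eta=0.095, steps=100)
w_mom = momentum(w0, eta=0.13, rho=0.4, steps=100)

# Momentum ends up much closer to the minimum than plain GD.
assert np.linalg.norm(w_mom) < np.linalg.norm(w_gd) < 1e-2
```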
5.1 Complete Comparison Table
| Aspect | Pure Gradient Descent | GD with Momentum |
|---|---|---|
| Physical limit | $m=0$, $\gamma \to \infty$, $\tau \to 0$ | $m>0$, $\gamma$ finite, $\tau$ finite |
| Equation (physics) | $\mathbf{x}_{t+1} = \mathbf{x}_t - \eta \nabla U(\mathbf{x}_t)$ | $\mathbf{v}_{t+1} = \rho \mathbf{v}_t - \eta \nabla U(\mathbf{x}_t)$; $\mathbf{x}_{t+1} = \mathbf{x}_t + \mathbf{v}_{t+1}$ |
| Equation (ML) | $\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \nabla L(\mathbf{w}_t)$ | $\mathbf{v}_{t+1} = \rho \mathbf{v}_t - \eta \nabla L(\mathbf{w}_t)$; $\mathbf{w}_{t+1} = \mathbf{w}_t + \mathbf{v}_{t+1}$ |
| State variables | Only $\mathbf{x}$ (or $\mathbf{w}$) | $\mathbf{x}$ and $\mathbf{v}$ (or $\mathbf{w}$ and $\mathbf{v}$) |
| Memory of past | None ($\tau=0$) | Yes, through $\mathbf{v}_t$ |
| Physics: particle in… | Very thick honey | Oil or air (depending on $\rho$) |
| Response to constant $\nabla U$ or $\nabla L$ | Constant velocity | Velocity builds up toward the terminal value $-\frac{\eta}{1-\rho} \nabla L$ |
| In narrow valleys | Can oscillate (zig-zag) | Smoother trajectory |
| Shallow minima | Gets stuck easily | Can jump them thanks to momentum |
| Overshooting | Impossible (no inertia) | Possible if $\rho$ too high |
| Convergence | Slower but stable | Faster but can oscillate |
| Hyperparameters | Only $\eta$ | $\eta$ and $\rho$ |
| Typical values | $\eta \approx 0.01\text{-}0.1$ | $\eta \approx 0.01$, $\rho \approx 0.9$ |
Epilogue: The Beauty of Exact Mapping
What we have discovered in this long journey is not merely an analogy, but an exact mathematical isomorphism—a one-to-one correspondence between the equations of dissipative classical mechanics and the algorithms of gradient-based optimization.
Every symbol in physics has its precise counterpart in machine learning. Every physical parameter—mass, friction, relaxation time—has its corresponding role in the optimizer. The gradient descent algorithm is not inspired by physics; it is physics, applied to the abstract space of neural network parameters rather than physical space.
When we set $m = 0$ and $\gamma \to \infty$ (while keeping $\eta = \Delta t / \gamma$ finite), we obtain pure gradient descent: a system with no memory, moving through parameter space like a particle in infinitely viscous honey, always following the instantaneous gradient.
When we allow $m > 0$ and $\gamma$ finite, we obtain the momentum method: a system that remembers its past motion, accumulates velocity in consistent directions, and can coast through unfavorable regions toward better minima.
The learning rate $\eta$ is not an arbitrary tuning parameter—it is precisely $\Delta t / \gamma$, the ratio of temporal discretization to friction. The momentum coefficient $\rho$ is not a magic number—it is exactly $1 - \Delta t / \tau$, encoding how much velocity persists from one iteration to the next.
Yet we must remember: the loss landscapes of neural networks are stranger than any physical terrain. They exist in spaces of unimaginable dimension, they shift with each mini-batch, and they contain structures—sharp versus flat minima, mode connectivity, loss surface geometry—that we are only beginning to understand.
But the physics gives us a foundation, a language, a set of intuitions that guide us through this strange landscape. And that, perhaps, is the deepest lesson: that the mathematics of the natural world and the mathematics of artificial intelligence are not separate domains, but different manifestations of the same underlying principles.