Automatic Differentiation
Table of Contents
- Prologue: The Dual Paths Through the Computational Graph
- Section I: The Foundation – The Chain Rule
- Section II: Forward Mode (Tangent Mode)
- Section III: Backward Mode (Reverse/Adjoint Mode)
- Section IV: Computational Cost Analysis
- Section V: Comparative Analysis and Selection Criteria
- Epilogue: The Beauty of Duality
Prologue: The Dual Paths Through the Computational Graph
The ability to compute derivatives efficiently is fundamental to modern machine learning. Every gradient descent step and every parameter update depends on evaluating partial derivatives of complex functions.
Behind every neural network’s training procedure lies a computational graph—a directed acyclic structure encoding the sequence of operations that transform inputs into outputs. Through this graph, we can compute derivatives in two distinct ways, each with its own efficiency characteristics and optimal use cases.
These are the forward mode and backward mode of automatic differentiation. Both methods apply the chain rule systematically to compute exact derivatives, differing only in the direction they traverse the computational graph.
Fundamental Insight:
Forward mode and backward mode are dual algorithms that compute the same derivatives through opposite traversals of the computational graph. Forward mode propagates derivatives from inputs to outputs, while backward mode propagates them from outputs to inputs.
The choice between them is determined by computational efficiency, which depends on the dimensionality of the input and output spaces.
Section I: The Foundation – The Chain Rule
1.1 The Composition of Functions
Consider a function $f: \mathbb{R}^D \to \mathbb{R}^M$ that maps an input vector $\mathbf{x} \in \mathbb{R}^D$ to an output vector $\mathbf{y} \in \mathbb{R}^M$. In practice, this function is rarely evaluated as a single atomic operation. Instead, it is the composition of many simpler operations:
$$f = f_L \circ f_{L-1} \circ \cdots \circ f_1$$
Each intermediate operation $f_\ell$ produces a vector $\mathbf{v}_\ell$ of intermediate values, forming a sequence:
$$\mathbf{x} = \mathbf{v}_0 \;\xrightarrow{\,f_1\,}\; \mathbf{v}_1 \;\xrightarrow{\,f_2\,}\; \cdots \;\xrightarrow{\,f_L\,}\; \mathbf{v}_L = \mathbf{y}$$
This sequence defines a computational graph: a directed acyclic graph where nodes represent intermediate values and edges represent dependencies between operations.
1.2 The Jacobian Matrix
Our objective is to compute the Jacobian matrix—the matrix of all first-order partial derivatives:
Definition: The Jacobian Matrix
For a function $f: \mathbb{R}^D \to \mathbb{R}^M$, the Jacobian is an $M \times D$ matrix:
$$J = \frac{\partial \mathbf{f}}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_D} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_M}{\partial x_1} & \cdots & \frac{\partial f_M}{\partial x_D} \end{bmatrix}, \qquad J_{ij} = \frac{\partial f_i}{\partial x_j}.$$
Each row contains the gradient of a single output component:
$$\text{row } i: \quad \nabla f_i = \left[ \frac{\partial f_i}{\partial x_1}, \; \ldots, \; \frac{\partial f_i}{\partial x_D} \right]$$
Each column contains derivatives of all outputs with respect to a single input:
$$\text{column } j: \quad \left[ \frac{\partial f_1}{\partial x_j}, \; \ldots, \; \frac{\partial f_M}{\partial x_j} \right]^{\top}$$
1.3 The Multivariable Chain Rule
The chain rule for compositions of vector-valued functions states that the Jacobian of a composition is the product of the Jacobians of the constituent functions:
$$\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \frac{\partial \mathbf{v}_L}{\partial \mathbf{v}_{L-1}} \cdot \frac{\partial \mathbf{v}_{L-1}}{\partial \mathbf{v}_{L-2}} \cdots \frac{\partial \mathbf{v}_1}{\partial \mathbf{v}_0} = J_L \cdot J_{L-1} \cdots J_1, \qquad J_\ell := \frac{\partial f_\ell}{\partial \mathbf{v}_{\ell-1}}$$
This is a product of $L$ Jacobian matrices. The question now becomes: in what order should we perform these matrix multiplications?
Key Insight: The Associativity of Matrix Multiplication
Matrix multiplication is associative: $(AB)C = A(BC)$. The parenthesization determines the order of operations, but not the final result. However, different parenthesizations have vastly different computational costs.
Forward mode corresponds to right-to-left parenthesization:
$$J = J_L \cdot \bigl( J_{L-1} \cdot ( \cdots (J_2 \cdot J_1) \cdots ) \bigr)$$
Backward mode corresponds to left-to-right parenthesization:
$$J = \bigl( ( \cdots (J_L \cdot J_{L-1}) \cdots ) \cdot J_2 \bigr) \cdot J_1$$
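To make the cost difference concrete, here is a small worked illustration with assumed dimensions (chosen for illustration, not taken from the text). Let $A \in \mathbb{R}^{1 \times n}$ and $B, C \in \mathbb{R}^{n \times n}$. Grouping left to right, $(AB)C$ costs about $n^2 + n^2 = 2n^2$ scalar multiplications, because $AB$ is only a $1 \times n$ row vector. Grouping right to left, $A(BC)$ costs about $n^3 + n^2$, because $BC$ is a full $n \times n$ matrix product. For $n = 1000$ that is roughly $2 \times 10^6$ versus $10^9$ multiplications. Here $A$ plays the role of $J_L$ for a function with a single output, which is why the left-to-right (backward-mode) grouping wins in that case; symmetrically, if the rightmost factor were an $n \times 1$ column (a single input), the right-to-left (forward-mode) grouping would be the cheap one.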
Section II: Forward Mode (Tangent Mode)
2.1 The Philosophy of Forward Mode
Forward mode automatic differentiation propagates derivatives forward through the computational graph, from inputs to outputs. It answers the question: “If I perturb a single input variable, how do all the outputs respond?”
Conceptually, forward mode augments each intermediate value $v$ with a tangent $\dot{v}$, which represents the derivative of that value with respect to a chosen input variable. As the computation proceeds forward, tangents are updated according to local derivative rules.
Forward Mode: Column-by-Column Jacobian Construction
Forward mode computes the Jacobian one column at a time. Each forward pass computes derivatives with respect to a single input variable $x_j$:
$$J \, \mathbf{e}_j = \left[ \frac{\partial f_1}{\partial x_j}, \; \ldots, \; \frac{\partial f_M}{\partial x_j} \right]^{\top} \quad (\text{the } j\text{-th column of } J)$$
To compute the full Jacobian: Perform $D$ forward passes (one per input dimension).
2.2 The Forward Mode Algorithm
To compute the $j$-th column of the Jacobian (derivatives with respect to $x_j$), forward mode proceeds as follows:
FORWARD MODE ALGORITHM
Initialization:
Set the tangent of $x_j$ to 1 (we’re differentiating with respect to $x_j$), and all other tangents to 0.
Forward Propagation:
For each operation $v_\ell = f_\ell(v_{\ell-1})$ in topological order:
- Compute the value: $v_\ell = f_\ell(v_{\ell-1})$
- Compute the tangent using the chain rule: $$\dot{v}_\ell = \frac{\partial f_\ell}{\partial v_{\ell-1}} \cdot \dot{v}_{\ell-1}$$
Output:
At the end, $\dot{y}_i$ contains $\frac{\partial f_i}{\partial x_j}$ for all $i$.
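As a minimal sketch of this recurrence in code, assuming a chain of unary scalar operations (the particular chain and function names are illustrative, not from the text or any library), the value and its tangent can be carried forward side by side:

```python
import math

# Forward-mode propagation through an illustrative chain of unary operations.
# Each stage is a pair (f, df): the operation and its local derivative.
chain = [
    (math.sin,  math.cos),             # v1 = sin(v0),  dv1/dv0 = cos(v0)
    (math.exp,  math.exp),             # v2 = exp(v1),  dv2/dv1 = exp(v1)
    (lambda v: v * v, lambda v: 2*v),  # v3 = v2^2,     dv3/dv2 = 2*v2
]

def forward_mode(x, x_dot, chain):
    """Propagate (value, tangent) through the chain; returns (y, dy/dx * x_dot)."""
    v, v_dot = x, x_dot
    for f, df in chain:
        # chain rule: new tangent = local derivative at old value * old tangent
        v, v_dot = f(v), df(v) * v_dot
    return v, v_dot

y, dy_dx = forward_mode(0.5, 1.0, chain)  # seed x_dot = 1: derivative w.r.t. x
```

Seeding `x_dot = 1.0` corresponds to differentiating with respect to the single input; with vector-valued stages, the scalar product `df(v) * v_dot` becomes a Jacobian-vector product.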
2.3 Forward Mode: Geometric Interpretation
Forward mode can be understood geometrically as computing the directional derivative in the direction of a coordinate axis. When we set $\dot{x}_j = 1$ and all other tangents to zero, we are asking: “How does the function change when we move in the direction of the $j$-th coordinate?”
Jacobian-Vector Product
Forward mode naturally computes products of the form $J \cdot \mathbf{v}$, where $\mathbf{v}$ is a direction vector. By setting $\mathbf{v} = \mathbf{e}_j$ (the $j$-th standard basis vector), we obtain the $j$-th column of the Jacobian.
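A small numerical sketch of this fact, using NumPy with randomly generated local Jacobians (the dimensions and random seed are arbitrary choices): multiplying right to left only ever forms matrix-vector products, and the result matches a column of the explicitly assembled Jacobian.

```python
import numpy as np

# Local Jacobians of a hypothetical 3-step chain, generated randomly.
rng = np.random.default_rng(0)
D, H1, H2, M = 4, 6, 5, 3
J1 = rng.standard_normal((H1, D))
J2 = rng.standard_normal((H2, H1))
J3 = rng.standard_normal((M, H2))

e_j = np.zeros(D); e_j[1] = 1.0        # seed: differentiate w.r.t. the 2nd input
jvp = J3 @ (J2 @ (J1 @ e_j))           # forward mode: only matrix-vector products

J_full = J3 @ J2 @ J1                  # full M x D Jacobian, assembled explicitly
assert np.allclose(jvp, J_full[:, 1])  # JVP with e_j is the 2nd column
```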
2.4 Forward Mode: Example Computation
Let us consider a simple example to illustrate forward mode. Suppose we wish to compute the derivatives of:
$$f(x_1, x_2) = x_1 x_2 + \sin(x_1)$$
The computational graph consists of the following operations:
- $v_1 = x_1 \cdot x_2$
- $v_2 = \sin(x_1)$
- $f = v_1 + v_2$
Computing $\frac{\partial f}{\partial x_1}$ (Column 1 of Jacobian):
Initialization: $\dot{x}_1 = 1$, $\dot{x}_2 = 0$
Forward propagation:
- $v_1 = x_1 \cdot x_2$ → $\dot{v}_1 = x_2 \cdot \dot{x}_1 + x_1 \cdot \dot{x}_2 = x_2 \cdot 1 + x_1 \cdot 0 = x_2$
- $v_2 = \sin(x_1)$ → $\dot{v}_2 = \cos(x_1) \cdot \dot{x}_1 = \cos(x_1) \cdot 1 = \cos(x_1)$
- $f = v_1 + v_2$ → $\dot{f} = \dot{v}_1 + \dot{v}_2 = x_2 + \cos(x_1)$
Result: $\frac{\partial f}{\partial x_1} = x_2 + \cos(x_1)$ ✓
To compute $\frac{\partial f}{\partial x_2}$, we would repeat the process with $\dot{x}_1 = 0$ and $\dot{x}_2 = 1$, obtaining $\frac{\partial f}{\partial x_2} = x_1$.
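The same bookkeeping can be packaged as dual numbers that overload arithmetic. The sketch below is illustrative (the `Dual` class and `dsin` helper are hypothetical names, not from any particular library); it reproduces both partial derivatives numerically at an arbitrary point, one forward pass per input as described above.

```python
import math

class Dual:
    """A value paired with its tangent; arithmetic updates both via the chain rule."""
    def __init__(self, value, tangent=0.0):
        self.value, self.tangent = value, tangent

    def __add__(self, other):
        return Dual(self.value + other.value, self.tangent + other.tangent)

    def __mul__(self, other):
        # product rule: (uv)' = u'v + uv'
        return Dual(self.value * other.value,
                    self.tangent * other.value + self.value * other.tangent)

def dsin(u):
    return Dual(math.sin(u.value), math.cos(u.value) * u.tangent)

def f(x1, x2):
    return x1 * x2 + dsin(x1)   # v1 = x1*x2, v2 = sin(x1), f = v1 + v2

a, b = 1.3, 0.7
df_dx1 = f(Dual(a, 1.0), Dual(b, 0.0)).tangent   # pass 1: seed x1
df_dx2 = f(Dual(a, 0.0), Dual(b, 1.0)).tangent   # pass 2: seed x2

assert math.isclose(df_dx1, b + math.cos(a))     # x2 + cos(x1)
assert math.isclose(df_dx2, a)                   # x1
```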
Section III: Backward Mode (Reverse/Adjoint Mode)
3.1 The Philosophy of Backward Mode
Backward mode automatic differentiation propagates derivatives backward through the computational graph, from outputs to inputs. It answers the question: “How sensitive is a single output to each of the inputs?”
Conceptually, backward mode augments each intermediate value $v$ with an adjoint $\bar{v}$, which represents the derivative of a chosen output variable with respect to that intermediate value. The computation proceeds in two phases: a forward pass to compute all values, followed by a backward pass to accumulate adjoints.
Backward Mode: Row-by-Row Jacobian Construction
Backward mode computes the Jacobian one row at a time. Each backward pass computes derivatives of a single output $f_i$ with respect to all inputs:
$$\mathbf{e}_i^{\top} J = \nabla f_i = \left[ \frac{\partial f_i}{\partial x_1}, \; \ldots, \; \frac{\partial f_i}{\partial x_D} \right] \quad (\text{the } i\text{-th row of } J)$$
To compute the full Jacobian: Perform $M$ backward passes (one per output dimension).
3.2 The Backward Mode Algorithm
To compute the gradient of output $f_i$ (the $i$-th row of the Jacobian), backward mode proceeds in two phases:
BACKWARD MODE ALGORITHM
Phase 1: Forward Pass
Evaluate the function $f$ in the forward direction, storing all intermediate values $v_\ell$ and all local partial derivatives $\frac{\partial f_\ell}{\partial v_{\ell-1}}$.
Phase 2: Backward Pass
Initialization:
Set the adjoint of $f_i$ to 1 (we’re differentiating output $i$), and all other output adjoints to 0.
Backward Propagation:
For each operation $v_\ell = f_\ell(v_{\ell-1})$ in reverse topological order:
- Accumulate the adjoint using the chain rule: $$\bar{v}_{\ell-1} \mathrel{+}= \frac{\partial f_\ell}{\partial v_{\ell-1}} \cdot \bar{v}_\ell$$
Note: += because multiple operations may depend on $v_{\ell-1}$
Output:
At the end, $\bar{x}_j$ contains $\frac{\partial f_i}{\partial x_j}$ for all $j$.
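A minimal sketch of this two-phase procedure using a small hand-built tape (the `Tape` class and its methods are hypothetical names, not a library API). The forward phase records each operation's inputs together with its local partial derivatives evaluated at the stored values; the backward phase replays the tape in reverse and accumulates adjoints with `+=`:

```python
import math

class Tape:
    def __init__(self):
        self.records = []    # (output_id, [(input_id, local_partial), ...]) in forward order
        self.num_vars = 0

    def new_var(self):
        self.num_vars += 1
        return self.num_vars - 1

    def record(self, out_id, partials):
        self.records.append((out_id, partials))

    def backward(self, out_id):
        bar = [0.0] * self.num_vars
        bar[out_id] = 1.0                             # seed the chosen output
        for oid, partials in reversed(self.records):  # reverse topological order
            for iid, dpart in partials:
                bar[iid] += dpart * bar[oid]          # adjoint accumulation (+=)
        return bar

# Illustrative function: g(x1, x2) = x1 * sin(x1 * x2), evaluated at (1.3, 0.7).
t = Tape()
x1, x2 = t.new_var(), t.new_var()
a, b = 1.3, 0.7

v1 = t.new_var(); t.record(v1, [(x1, b), (x2, a)])                 # v1 = x1 * x2
v2 = t.new_var(); t.record(v2, [(v1, math.cos(a * b))])            # v2 = sin(v1)
y  = t.new_var(); t.record(y,  [(x1, math.sin(a * b)), (v2, a)])   # y  = x1 * v2

bar = t.backward(y)
assert math.isclose(bar[x1], math.sin(a*b) + a*b*math.cos(a*b))    # dg/dx1
assert math.isclose(bar[x2], a*a*math.cos(a*b))                    # dg/dx2
```

Note how $x_1$ receives contributions from two records, which is exactly why the update must accumulate rather than overwrite.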
3.3 Backward Mode: Geometric Interpretation
Backward mode can be understood as computing the gradient of a scalar output. The adjoint $\bar{v}_\ell$ represents “how much does the output change if I perturb $v_\ell$?”—which is the definition of $\frac{\partial (\text{output})}{\partial v_\ell}$.
Vector-Jacobian Product
Backward mode naturally computes products of the form $\mathbf{v}^T \cdot J$, where $\mathbf{v}$ is a vector. By setting $\mathbf{v} = \mathbf{e}_i$ (the $i$-th standard basis vector), we obtain the $i$-th row of the Jacobian.
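Mirroring the forward-mode snippet from Section 2.3, here is the analogous NumPy illustration (again with arbitrary random Jacobians): multiplying left to right only ever forms vector-matrix products, and the result matches a row of the assembled Jacobian.

```python
import numpy as np

# The same hypothetical 3-step chain of random local Jacobians.
rng = np.random.default_rng(0)
D, H1, H2, M = 4, 6, 5, 3
J1 = rng.standard_normal((H1, D))
J2 = rng.standard_normal((H2, H1))
J3 = rng.standard_normal((M, H2))

e_i = np.zeros(M); e_i[2] = 1.0        # seed: differentiate the 3rd output
vjp = ((e_i @ J3) @ J2) @ J1           # backward mode: only vector-matrix products

J_full = J3 @ J2 @ J1
assert np.allclose(vjp, J_full[2, :])  # VJP with e_i is the 3rd row
```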
3.4 Backward Mode: Example Computation
Using the same example $f(x_1, x_2) = x_1 x_2 + \sin(x_1)$, let us compute the gradient using backward mode:
Phase 1: Forward Pass
Compute and store all intermediate values:
- $v_1 = x_1 \cdot x_2$
- $v_2 = \sin(x_1)$
- $f = v_1 + v_2$
Phase 2: Backward Pass
Initialization: $\bar{f} = 1$
Backward propagation:
- $f = v_1 + v_2$ → $\bar{v}_1 = \bar{f} \cdot 1 = 1$, $\bar{v}_2 = \bar{f} \cdot 1 = 1$
- $v_2 = \sin(x_1)$ → $\bar{x}_1 \mathrel{+}= \bar{v}_2 \cdot \cos(x_1) = \cos(x_1)$
- $v_1 = x_1 \cdot x_2$ → $\bar{x}_1 \mathrel{+}= \bar{v}_1 \cdot x_2 = x_2$, $\bar{x}_2 = \bar{v}_1 \cdot x_1 = x_1$
Results:
- $\frac{\partial f}{\partial x_1} = \bar{x}_1 = x_2 + \cos(x_1)$ ✓
- $\frac{\partial f}{\partial x_2} = \bar{x}_2 = x_1$ ✓
Notice that backward mode computed both partial derivatives in a single backward pass, whereas forward mode would require two separate passes (one for each input).
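For completeness, the same two-phase computation can be written out as straight-line code at an arbitrary evaluation point (values chosen only for the numerical check); the adjoint updates mirror the steps above, and both partial derivatives indeed come out of a single backward sweep:

```python
import math

x1, x2 = 1.3, 0.7

# Phase 1: forward pass, storing intermediate values
v1 = x1 * x2
v2 = math.sin(x1)
f  = v1 + v2

# Phase 2: backward pass, accumulating adjoints
f_bar  = 1.0
v1_bar = f_bar * 1.0              # from f = v1 + v2
v2_bar = f_bar * 1.0
x1_bar = v2_bar * math.cos(x1)    # from v2 = sin(x1)
x1_bar += v1_bar * x2             # from v1 = x1*x2 (second contribution to x1)
x2_bar  = v1_bar * x1

assert math.isclose(x1_bar, x2 + math.cos(x1))
assert math.isclose(x2_bar, x1)
```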
Section IV: Computational Cost Analysis
4.1 The Fundamental Question
We now address the central question: Which mode is more efficient? To answer this, we must analyze the computational cost of each approach as a function of the input dimensionality $D$, the output dimensionality $M$, and the cost of evaluating the function itself.
Let us denote:
- $D$: Input dimensionality (number of parameters)
- $M$: Output dimensionality (number of outputs)
- $C_f$: Cost of a single forward evaluation of $f$
- $N$: Number of data points (training examples)
4.2 Forward Mode: Cost Structure
FORWARD MODE COST ANALYSIS
Computational Strategy: Forward mode computes one column of the Jacobian per pass—that is, derivatives of all outputs with respect to a single input variable.
Cost per pass: Approximately $C_f$ (comparable to a single forward evaluation)
Total passes required: $D$ (one per input dimension)
For a single data point (complete Jacobian):
$$\text{Cost} \approx D \cdot C_f$$
For $N$ data points:
$$\text{Cost} = O(N \cdot D \cdot C_f)$$
Scaling Behavior:
Cost scales linearly with $D$ (number of inputs), but is independent of $M$ (number of outputs).
This makes forward mode very expensive when the input dimensionality is large—the typical scenario in modern deep learning, where models may have millions or billions of parameters.
4.3 Backward Mode: Cost Structure
BACKWARD MODE COST ANALYSIS
Computational Strategy: Backward mode computes one row of the Jacobian per pass—that is, derivatives of a single output with respect to all input variables.
The computation for each row proceeds in two phases:
- Forward pass: Compute all intermediate values → cost $\approx C_f$
- Backward pass: Propagate adjoints for output $f_i$ → cost $\approx C_f$
Cost per output:
$$\text{Cost} \approx 2\,C_f \quad \text{(one forward pass plus one backward pass)}$$
Total passes required: $M$ (one per output dimension)
For a single data point (complete Jacobian):
$$\text{Cost} \approx 2M \cdot C_f = O(M \cdot C_f)$$
For $N$ data points:
$$\text{Cost} = O(N \cdot M \cdot C_f)$$
Scaling Behavior:
Cost scales linearly with $M$ (number of outputs), but is independent of $D$ (number of inputs).
This makes backward mode highly efficient when the output dimensionality is small relative to the input dimensionality—the canonical case being a scalar loss function ($M = 1$).
4.4 Comparative Cost Summary
| Method | Jacobian Computation | Cost per Data Point | Scales With | Total Cost ($N$ points) |
|---|---|---|---|---|
| Forward Mode | Column-by-column ($\partial f / \partial x_i$ for a single $i$) | $\approx C_f$ per column | $D$ (input dimension) | $O(N \cdot D \cdot C_f)$ |
| Backward Mode | Row-by-row ($\partial f_i / \partial x$ for a single $i$) | $\approx 2C_f$ per row | $M$ (output dimension) | $O(N \cdot M \cdot C_f)$ |
4.5 The Special Case: Scalar Loss Function
In machine learning, we typically optimize a scalar loss function: $L: \mathbb{R}^D \to \mathbb{R}$. In this case, $M = 1$, and the computational costs become:
Forward Mode ($M=1$)
Must perform $D$ forward passes to compute the gradient. Cost grows linearly with parameter count.
Backward Mode ($M=1$)
Requires only 1 backward pass to compute the entire gradient, regardless of parameter count!
Fundamental Result: The Power of Backward Mode for Scalar Functions
For a scalar loss function $L: \mathbb{R}^D \to \mathbb{R}$ with $N$ training examples:
- Forward mode: Requires $D$ passes → Cost $O(N \cdot D \cdot C_f)$
- Backward mode: Requires 1 pass → Cost $O(N \cdot C_f)$
Speedup factor: $D$ (the number of parameters)
For a modern neural network with $D = 10^6$ parameters, backward mode computes the gradient roughly a million times faster than forward mode (up to constant factors).
Section V: Comparative Analysis and Selection Criteria
5.1 The Fundamental Principle
The Dimensionality Criterion
The choice between forward and backward mode depends on the dimensionality ratio between inputs and outputs:
- If $D \ll M$ (few inputs, many outputs), forward mode is the efficient choice.
- If $M \ll D$ (many inputs, few outputs), backward mode is the efficient choice.
Algorithmic Intuition:
- Forward mode: Iterates over input variables, requiring $D$ passes. Cost grows with the number of inputs but is independent of output count.
- Backward mode: Iterates over output variables, requiring $M$ passes. Cost grows with the number of outputs but is independent of input count.
5.2 Detailed Comparison Matrix
| Criterion | Forward Mode | Backward Mode |
|---|---|---|
| Optimal Scenario | Many outputs, few inputs ($D \ll M$) | Few outputs, many inputs ($M \ll D$) |
| Computation Cost | One pass per input dimension: $O(D \cdot C_f)$ | One pass per output dimension: $O(M \cdot C_f)$ |
| Memory Requirement | Minimal; no storage of intermediate values | Higher; must store intermediate values and local derivatives |
| Jacobian Computation | Constructs Jacobian column-by-column | Constructs Jacobian row-by-row |
| Gradient of Scalar Loss ($M = 1$) | Requires $D$ passes: $O(N \cdot D \cdot C_f)$ | Requires 1 pass: $O(N \cdot C_f)$ |
| Machine Learning | Rarely used (large parameter counts) | Standard choice (single loss function) |
| Jacobian-Vector Product | Naturally computes $J \cdot \mathbf{v}$ | Naturally computes $\mathbf{v}^T \cdot J$ |
| Implementation Complexity | Simpler (accumulate tangents during forward pass) | More complex (two-pass algorithm with storage) |
| Propagation Direction | Forward through graph (inputs → outputs) | Backward through graph (outputs → inputs) |
| Data Structure | Each node stores value and tangent | Each node stores value and adjoint |
| Application of Chain Rule | Applied over parent nodes during forward pass | Applied over child nodes during backward pass |
5.3 Memory Considerations
Beyond computational cost, memory usage is an important practical consideration:
Forward Mode Memory
Requirement: Minimal
- No need to store intermediate values
- Each node maintains only current value and tangent
- Memory usage: proportional to the number of simultaneously live values (intermediates can be discarded as soon as later operations no longer need them)
Backward Mode Memory
Requirement: Higher
- Must store all intermediate values from the forward pass until the backward pass has consumed them
- Must store (or be able to recompute) the local partial derivatives
- Memory usage: proportional to the total number of intermediate values in the graph, which for deep networks grows with depth
Memory-Computation Trade-off
Backward mode’s computational efficiency comes at the cost of increased memory usage. For very deep networks, memory consumption can become a bottleneck. Techniques such as gradient checkpointing can trade some recomputation for reduced memory usage.
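A rough sketch of that idea, assuming a chain of unary scalar operations split into fixed-length segments (the function and variable names are hypothetical, and real frameworks implement checkpointing differently): only segment-boundary values are stored during the forward pass, and everything else is recomputed on demand during the backward pass.

```python
import math

# Illustrative chain; each stage is (f, df), with df the derivative of f.
# len(stages) is assumed divisible by the segment length.
stages = [
    (math.sin,  math.cos),
    (math.exp,  math.exp),
    (math.tanh, lambda v: 1.0 - math.tanh(v) ** 2),
    (lambda v: 0.5 * v * v, lambda v: v),
]

def grad_with_checkpoints(x, stages, seg=2):
    """d(output)/dx by reverse mode, storing only every seg-th intermediate value."""
    # Forward pass: keep only segment-boundary values (the checkpoints).
    checkpoints, v = [x], x
    for i, (f, _) in enumerate(stages):
        v = f(v)
        if (i + 1) % seg == 0 and (i + 1) < len(stages):
            checkpoints.append(v)
    # Backward pass: recompute each segment's intermediates from its checkpoint,
    # then accumulate the adjoint right-to-left through the segment.
    adjoint = 1.0
    for s in range(len(checkpoints) - 1, -1, -1):
        segment = stages[s * seg:(s + 1) * seg]
        v, inputs = checkpoints[s], []
        for f, _ in segment:                 # recomputation: the extra work traded for memory
            inputs.append(v)
            v = f(v)
        for (_, df), vin in zip(reversed(segment), reversed(inputs)):
            adjoint *= df(vin)               # chain rule within the segment
    return adjoint

# Check against plain reverse mode that stores every intermediate value.
x0, full, vals, v = 0.3, 1.0, [], 0.3
for f, _ in stages:
    vals.append(v)
    v = f(v)
for (_, df), vin in zip(reversed(stages), reversed(vals)):
    full *= df(vin)
assert math.isclose(grad_with_checkpoints(x0, stages), full)
```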
5.4 Summary: When to Use Each Mode
Use Forward Mode When:
- Few input variables ($D$ small)
- Many output variables ($M$ large)
- Need Jacobian-vector products $J \cdot \mathbf{v}$
- Memory is severely constrained
- Computing sensitivities of many outputs w.r.t. few parameters
Example: Sensitivity analysis in engineering (many outputs, few design parameters)
Use Backward Mode When:
- Many input variables ($D$ large)
- Few output variables ($M$ small)
- Need vector-Jacobian products $\mathbf{v}^T \cdot J$
- Optimizing a scalar loss function ($M=1$)
- Training neural networks
Example: Deep learning (millions of parameters, single loss)
Epilogue: The Beauty of Duality
We have examined two methods for traversing the computational graph—one forward, one backward—and found that they are complementary perspectives on computing derivatives. Forward mode and backward mode are dual algorithms, each optimal in its own regime.
Forward mode asks: “If I perturb an input, how do all outputs respond?” It propagates tangents forward, computing columns of the Jacobian, and is efficient when inputs are few.
Backward mode asks: “How sensitive is a given output to each input?” It propagates adjoints backward, computing rows of the Jacobian, and is efficient when outputs are few.
For machine learning—computing the gradient of a scalar loss with respect to millions of parameters—backward mode provides substantial computational advantages. This efficiency is what makes backpropagation practical for training modern neural networks.
Forward mode remains useful in domains where the dimensionality structure is reversed—few parameters with many outputs. Both methods provide essential insight into the nature of derivative computation.
The Duality Theorem
Forward mode and backward mode are dual algorithms related by the transpose operation: forward mode computes Jacobian-vector products, and backward mode computes the corresponding transposed products,
$$\text{forward mode: } \dot{\mathbf{y}} = J\,\dot{\mathbf{x}}, \qquad \text{backward mode: } \bar{\mathbf{x}} = J^{\top}\,\bar{\mathbf{y}} \quad (\text{equivalently } \bar{\mathbf{x}}^{\top} = \bar{\mathbf{y}}^{\top} J).$$
They compute the same derivatives through the same computational graph, differing only in:
- The direction of propagation (forward vs. backward)
- The order of chain rule application (right-to-left vs. left-to-right parenthesization)
- The efficiency for different dimensionality ratios ($D \ll M$ vs. $M \ll D$)
Together, they form a complete theory of automatic differentiation, providing exact derivatives for arbitrary compositions of differentiable functions.
The choice between forward and backward mode is determined by the structure of your problem—specifically the ratio $M/D$: when $M/D \ll 1$ (few outputs, many inputs), backward mode scales efficiently, and when $M/D \gg 1$ (many outputs, few inputs), forward mode does.
This duality shows that a single mathematical structure can be approached from multiple directions, each revealing different aspects. In automatic differentiation, this duality has enabled the advances in machine learning that characterize our current era.