Automatic Differentiation
Table of Contents
- Prologue: The Dual Paths Through the Computational Graph
- Section I: The Foundation – The Chain Rule
- Section II: Forward Mode (Tangent Mode)
- Section III: Backward Mode (Reverse/Adjoint Mode)
- Section IV: Computational Cost Analysis
- Section V: Comparative Analysis and Selection Criteria
- Epilogue: The Beauty of Duality
Prologue: The Dual Paths Through the Computational Graph
The ability to compute derivatives efficiently is fundamental to modern machine learning. Every gradient descent step and every parameter update depends on evaluating partial derivatives of complex functions.
Behind every neural network’s training procedure lies a computational graph—a directed acyclic structure encoding the sequence of operations that transform inputs into outputs. Through this graph, we can compute derivatives in two distinct ways, each with its own efficiency characteristics and optimal use cases.
These are the forward mode and backward mode of automatic differentiation. Both methods apply the chain rule systematically to compute exact derivatives, differing only in the direction they traverse the computational graph.
Fundamental Insight:
Forward mode and backward mode are dual algorithms that compute the same derivatives through opposite traversals of the computational graph. Forward mode propagates derivatives from inputs to outputs, while backward mode propagates them from outputs to inputs.
The choice between them is determined by computational efficiency, which depends on the dimensionality of the input and output spaces.
Section I: The Foundation – The Chain Rule
1.1 The Composition of Functions
Consider a function $f: \mathbb{R}^D \to \mathbb{R}^M$ that maps an input vector $\mathbf{x} \in \mathbb{R}^D$ to an output vector $\mathbf{y} \in \mathbb{R}^M$. In practice, this function is rarely evaluated as a single atomic operation. Instead, it is the composition of many simpler operations:
$$f = f_L \circ f_{L-1} \circ \cdots \circ f_1$$
Each intermediate operation $f_\ell$ produces a vector $\mathbf{v}_\ell$ of intermediate values, forming a sequence:
$$\mathbf{x} = \mathbf{v}_0 \;\xrightarrow{\,f_1\,}\; \mathbf{v}_1 \;\xrightarrow{\,f_2\,}\; \cdots \;\xrightarrow{\,f_L\,}\; \mathbf{v}_L = \mathbf{y}$$
This sequence defines a computational graph: a directed acyclic graph where nodes represent intermediate values and edges represent dependencies between operations.
1.2 The Jacobian Matrix
Our objective is to compute the Jacobian matrix—the matrix of all first-order partial derivatives:
Definition: The Jacobian Matrix
For a function $f: \mathbb{R}^D \to \mathbb{R}^M$, the Jacobian is an $M \times D$ matrix:
$$J = \frac{\partial \mathbf{f}}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_D} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_M}{\partial x_1} & \cdots & \frac{\partial f_M}{\partial x_D} \end{bmatrix}, \qquad J_{ij} = \frac{\partial f_i}{\partial x_j}.$$
Each row contains the gradient of a single output component:
$$\text{row } i: \quad \nabla f_i = \left[ \frac{\partial f_i}{\partial x_1}, \; \ldots, \; \frac{\partial f_i}{\partial x_D} \right]$$
Each column contains derivatives of all outputs with respect to a single input:
$$\text{column } j: \quad \left[ \frac{\partial f_1}{\partial x_j}, \; \ldots, \; \frac{\partial f_M}{\partial x_j} \right]^{\top}$$
1.3 The Multivariable Chain Rule
The chain rule for compositions of vector-valued functions states that the Jacobian of a composition is the product of the Jacobians of the constituent functions:
$$\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \frac{\partial \mathbf{v}_L}{\partial \mathbf{v}_{L-1}} \cdot \frac{\partial \mathbf{v}_{L-1}}{\partial \mathbf{v}_{L-2}} \cdots \frac{\partial \mathbf{v}_1}{\partial \mathbf{v}_0} = J_L \cdot J_{L-1} \cdots J_1, \qquad J_\ell := \frac{\partial f_\ell}{\partial \mathbf{v}_{\ell-1}}$$
This is a product of $L$ Jacobian matrices. The question now becomes: in what order should we perform these matrix multiplications?
Key Insight: The Associativity of Matrix Multiplication
Matrix multiplication is associative: $(AB)C = A(BC)$. The parenthesization determines the order of operations, but not the final result. However, different parenthesizations have vastly different computational costs.
Forward mode corresponds to right-to-left parenthesization:
$$J = J_L \cdot \bigl( J_{L-1} \cdot ( \cdots (J_2 \cdot J_1) \cdots ) \bigr)$$
Backward mode corresponds to left-to-right parenthesization:
$$J = \bigl( ( \cdots (J_L \cdot J_{L-1}) \cdots ) \cdot J_2 \bigr) \cdot J_1$$
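To make the cost difference concrete, here is a small worked illustration with assumed dimensions (chosen for illustration, not taken from the text). Let $A \in \mathbb{R}^{1 \times n}$ and $B, C \in \mathbb{R}^{n \times n}$. Grouping left to right, $(AB)C$ costs about $n^2 + n^2 = 2n^2$ scalar multiplications, because $AB$ is only a $1 \times n$ row vector. Grouping right to left, $A(BC)$ costs about $n^3 + n^2$, because $BC$ is a full $n \times n$ matrix product. For $n = 1000$ that is roughly $2 \times 10^6$ versus $10^9$ multiplications. Here $A$ plays the role of $J_L$ for a function with a single output, which is why the left-to-right (backward-mode) grouping wins in that case; symmetrically, if the rightmost factor were an $n \times 1$ column (a single input), the right-to-left (forward-mode) grouping would be the cheap one.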
Section II: Forward Mode (Tangent Mode)
2.1 The Philosophy of Forward Mode
Forward mode automatic differentiation propagates derivatives forward through the computational graph, from inputs to outputs. It answers the question: “If I perturb a single input variable, how do all the outputs respond?”
Conceptually, forward mode augments each intermediate value $v$ with a tangent $\dot{v}$, which represents the derivative of that value with respect to a chosen input variable. As the computation proceeds forward, tangents are updated according to local derivative rules.
Forward Mode: Column-by-Column Jacobian Construction
Forward mode computes the Jacobian one column at a time. Each forward pass computes derivatives with respect to a single input variable $x_j$:
$$J \, \mathbf{e}_j = \left[ \frac{\partial f_1}{\partial x_j}, \; \ldots, \; \frac{\partial f_M}{\partial x_j} \right]^{\top} \quad (\text{the } j\text{-th column of } J)$$
To compute the full Jacobian: Perform $D$ forward passes (one per input dimension).
2.2 The Forward Mode Algorithm
To compute the $j$-th column of the Jacobian (derivatives with respect to $x_j$), forward mode proceeds as follows:
FORWARD MODE ALGORITHM
Initialization:
Set the tangent of $x_j$ to 1 (we’re differentiating with respect to $x_j$), and all other tangents to 0.
Forward Propagation:
For each operation $v_\ell = f_\ell(v_{\ell-1})$ in topological order:
- Compute the value: $v_\ell = f_\ell(v_{\ell-1})$
- Compute the tangent using the chain rule: $$\dot{v}_\ell = \frac{\partial f_\ell}{\partial v_{\ell-1}} \cdot \dot{v}_{\ell-1}$$
Output:
At the end, $\dot{y}_i$ contains $\frac{\partial f_i}{\partial x_j}$ for all $i$.
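As a minimal sketch of this recurrence in code, assuming a chain of unary scalar operations (the particular chain and function names are illustrative, not from the text or any library), the value and its tangent can be carried forward side by side:

```python
import math

# Forward-mode propagation through an illustrative chain of unary operations.
# Each stage is a pair (f, df): the operation and its local derivative.
chain = [
    (math.sin,  math.cos),             # v1 = sin(v0),  dv1/dv0 = cos(v0)
    (math.exp,  math.exp),             # v2 = exp(v1),  dv2/dv1 = exp(v1)
    (lambda v: v * v, lambda v: 2*v),  # v3 = v2^2,     dv3/dv2 = 2*v2
]

def forward_mode(x, x_dot, chain):
    """Propagate (value, tangent) through the chain; returns (y, dy/dx * x_dot)."""
    v, v_dot = x, x_dot
    for f, df in chain:
        # chain rule: new tangent = local derivative at old value * old tangent
        v, v_dot = f(v), df(v) * v_dot
    return v, v_dot

y, dy_dx = forward_mode(0.5, 1.0, chain)  # seed x_dot = 1: derivative w.r.t. x
```

Seeding `x_dot = 1.0` corresponds to differentiating with respect to the single input; with vector-valued stages, the scalar product `df(v) * v_dot` becomes a Jacobian-vector product.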
2.3 Forward Mode: Geometric Interpretation
Forward mode can be understood geometrically as computing the directional derivative in the direction of a coordinate axis. When we set $\dot{x}_j = 1$ and all other tangents to zero, we are asking: “How does the function change when we move in the direction of the $j$-th coordinate?”
Jacobian-Vector Product
Forward mode naturally computes products of the form $J \cdot \mathbf{v}$, where $\mathbf{v}$ is a direction vector. By setting $\mathbf{v} = \mathbf{e}_j$ (the $j$-th standard basis vector), we obtain the $j$-th column of the Jacobian.
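A small numerical sketch of this fact, using NumPy with randomly generated local Jacobians (the dimensions and random seed are arbitrary choices): multiplying right to left only ever forms matrix-vector products, and the result matches a column of the explicitly assembled Jacobian.

```python
import numpy as np

# Local Jacobians of a hypothetical 3-step chain, generated randomly.
rng = np.random.default_rng(0)
D, H1, H2, M = 4, 6, 5, 3
J1 = rng.standard_normal((H1, D))
J2 = rng.standard_normal((H2, H1))
J3 = rng.standard_normal((M, H2))

e_j = np.zeros(D); e_j[1] = 1.0        # seed: differentiate w.r.t. the 2nd input
jvp = J3 @ (J2 @ (J1 @ e_j))           # forward mode: only matrix-vector products

J_full = J3 @ J2 @ J1                  # full M x D Jacobian, assembled explicitly
assert np.allclose(jvp, J_full[:, 1])  # JVP with e_j is the 2nd column
```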
2.4 Forward Mode: Example Computation
Let us consider a simple example to illustrate forward mode. Suppose we wish to compute the derivatives of:
$$f(x_1, x_2) = x_1 x_2 + \sin(x_1)$$
The computational graph consists of the following operations:
- $v_1 = x_1 \cdot x_2$
- $v_2 = \sin(x_1)$
- $f = v_1 + v_2$
Computing $\frac{\partial f}{\partial x_1}$ (Column 1 of Jacobian):
Initialization: $\dot{x}_1 = 1$, $\dot{x}_2 = 0$
Forward propagation:
- $v_1 = x_1 \cdot x_2$ → $\dot{v}_1 = x_2 \cdot \dot{x}_1 + x_1 \cdot \dot{x}_2 = x_2 \cdot 1 + x_1 \cdot 0 = x_2$
- $v_2 = \sin(x_1)$ → $\dot{v}_2 = \cos(x_1) \cdot \dot{x}_1 = \cos(x_1) \cdot 1 = \cos(x_1)$
- $f = v_1 + v_2$ → $\dot{f} = \dot{v}_1 + \dot{v}_2 = x_2 + \cos(x_1)$
Result: $\frac{\partial f}{\partial x_1} = x_2 + \cos(x_1)$ ✓
To compute $\frac{\partial f}{\partial x_2}$, we would repeat the process with $\dot{x}_1 = 0$ and $\dot{x}_2 = 1$, obtaining $\frac{\partial f}{\partial x_2} = x_1$.
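The same bookkeeping can be packaged as dual numbers that overload arithmetic. The sketch below is illustrative (the `Dual` class and `dsin` helper are hypothetical names, not from any particular library); it reproduces both partial derivatives numerically at an arbitrary point, one forward pass per input as described above.

```python
import math

class Dual:
    """A value paired with its tangent; arithmetic updates both via the chain rule."""
    def __init__(self, value, tangent=0.0):
        self.value, self.tangent = value, tangent

    def __add__(self, other):
        return Dual(self.value + other.value, self.tangent + other.tangent)

    def __mul__(self, other):
        # product rule: (uv)' = u'v + uv'
        return Dual(self.value * other.value,
                    self.tangent * other.value + self.value * other.tangent)

def dsin(u):
    return Dual(math.sin(u.value), math.cos(u.value) * u.tangent)

def f(x1, x2):
    return x1 * x2 + dsin(x1)   # v1 = x1*x2, v2 = sin(x1), f = v1 + v2

a, b = 1.3, 0.7
df_dx1 = f(Dual(a, 1.0), Dual(b, 0.0)).tangent   # pass 1: seed x1
df_dx2 = f(Dual(a, 0.0), Dual(b, 1.0)).tangent   # pass 2: seed x2

assert math.isclose(df_dx1, b + math.cos(a))     # x2 + cos(x1)
assert math.isclose(df_dx2, a)                   # x1
```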
Section III: Backward Mode (Reverse/Adjoint Mode)
3.1 The Philosophy of Backward Mode
Backward mode automatic differentiation propagates derivatives backward through the computational graph, from outputs to inputs. It answers the question: “How sensitive is a single output to each of the inputs?”
Conceptually, backward mode augments each intermediate value $v$ with an adjoint $\bar{v}$, which represents the derivative of a chosen output variable with respect to that intermediate value. The computation proceeds in two phases: a forward pass to compute all values, followed by a backward pass to accumulate adjoints.
Backward Mode: Row-by-Row Jacobian Construction
Backward mode computes the Jacobian one row at a time. Each backward pass computes derivatives of a single output $f_i$ with respect to all inputs:
$$\mathbf{e}_i^{\top} J = \nabla f_i = \left[ \frac{\partial f_i}{\partial x_1}, \; \ldots, \; \frac{\partial f_i}{\partial x_D} \right] \quad (\text{the } i\text{-th row of } J)$$
To compute the full Jacobian: Perform $M$ backward passes (one per output dimension).
3.2 The Backward Mode Algorithm
To compute the gradient of output $f_i$ (the $i$-th row of the Jacobian), backward mode proceeds in two phases:
BACKWARD MODE ALGORITHM
Phase 1: Forward Pass
Evaluate the function $f$ in the forward direction, storing all intermediate values $v_\ell$ and all local partial derivatives $\frac{\partial f_\ell}{\partial v_{\ell-1}}$.
Phase 2: Backward Pass
Initialization:
Set the adjoint of $f_i$ to 1 (we’re differentiating output $i$), and all other output adjoints to 0.
Backward Propagation:
For each operation $v_\ell = f_\ell(v_{\ell-1})$ in reverse topological order:
- Accumulate the adjoint using the chain rule: $$\bar{v}_{\ell-1} \mathrel{+}= \frac{\partial f_\ell}{\partial v_{\ell-1}} \cdot \bar{v}_\ell$$
Note: += because multiple operations may depend on $v_{\ell-1}$
Output:
At the end, $\bar{x}_j$ contains $\frac{\partial f_i}{\partial x_j}$ for all $j$.
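A minimal sketch of this two-phase procedure using a small hand-built tape (the `Tape` class and its methods are hypothetical names, not a library API). The forward phase records each operation's inputs together with its local partial derivatives evaluated at the stored values; the backward phase replays the tape in reverse and accumulates adjoints with `+=`:

```python
import math

class Tape:
    def __init__(self):
        self.records = []    # (output_id, [(input_id, local_partial), ...]) in forward order
        self.num_vars = 0

    def new_var(self):
        self.num_vars += 1
        return self.num_vars - 1

    def record(self, out_id, partials):
        self.records.append((out_id, partials))

    def backward(self, out_id):
        bar = [0.0] * self.num_vars
        bar[out_id] = 1.0                             # seed the chosen output
        for oid, partials in reversed(self.records):  # reverse topological order
            for iid, dpart in partials:
                bar[iid] += dpart * bar[oid]          # adjoint accumulation (+=)
        return bar

# Illustrative function: g(x1, x2) = x1 * sin(x1 * x2), evaluated at (1.3, 0.7).
t = Tape()
x1, x2 = t.new_var(), t.new_var()
a, b = 1.3, 0.7

v1 = t.new_var(); t.record(v1, [(x1, b), (x2, a)])                 # v1 = x1 * x2
v2 = t.new_var(); t.record(v2, [(v1, math.cos(a * b))])            # v2 = sin(v1)
y  = t.new_var(); t.record(y,  [(x1, math.sin(a * b)), (v2, a)])   # y  = x1 * v2

bar = t.backward(y)
assert math.isclose(bar[x1], math.sin(a*b) + a*b*math.cos(a*b))    # dg/dx1
assert math.isclose(bar[x2], a*a*math.cos(a*b))                    # dg/dx2
```

Note how $x_1$ receives contributions from two records, which is exactly why the update must accumulate rather than overwrite.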
3.3 Backward Mode: Geometric Interpretation
Backward mode can be understood as computing the gradient of a scalar output. The adjoint $\bar{v}_\ell$ represents “how much does the output change if I perturb $v_\ell$?”—which is the definition of $\frac{\partial (\text{output})}{\partial v_\ell}$.
Vector-Jacobian Product
Backward mode naturally computes products of the form $\mathbf{v}^T \cdot J$, where $\mathbf{v}$ is a vector. By setting $\mathbf{v} = \mathbf{e}_i$ (the $i$-th standard basis vector), we obtain the $i$-th row of the Jacobian.
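Mirroring the forward-mode snippet from Section 2.3, here is the analogous NumPy illustration (again with arbitrary random Jacobians): multiplying left to right only ever forms vector-matrix products, and the result matches a row of the assembled Jacobian.

```python
import numpy as np

# The same hypothetical 3-step chain of random local Jacobians.
rng = np.random.default_rng(0)
D, H1, H2, M = 4, 6, 5, 3
J1 = rng.standard_normal((H1, D))
J2 = rng.standard_normal((H2, H1))
J3 = rng.standard_normal((M, H2))

e_i = np.zeros(M); e_i[2] = 1.0        # seed: differentiate the 3rd output
vjp = ((e_i @ J3) @ J2) @ J1           # backward mode: only vector-matrix products

J_full = J3 @ J2 @ J1
assert np.allclose(vjp, J_full[2, :])  # VJP with e_i is the 3rd row
```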
3.4 Backward Mode: Example Computation
Using the same example $f(x_1, x_2) = x_1 x_2 + \sin(x_1)$, let us compute the gradient using backward mode:
Phase 1: Forward Pass
Compute and store all intermediate values:
- $v_1 = x_1 \cdot x_2$
- $v_2 = \sin(x_1)$
- $f = v_1 + v_2$
Phase 2: Backward Pass
Initialization: $\bar{f} = 1$
Backward propagation:
- $f = v_1 + v_2$ → $\bar{v}_1 = \bar{f} \cdot 1 = 1$, $\bar{v}_2 = \bar{f} \cdot 1 = 1$
- $v_2 = \sin(x_1)$ → $\bar{x}_1 \mathrel{+}= \bar{v}_2 \cdot \cos(x_1) = \cos(x_1)$
- $v_1 = x_1 \cdot x_2$ → $\bar{x}_1 \mathrel{+}= \bar{v}_1 \cdot x_2 = x_2$, $\bar{x}_2 = \bar{v}_1 \cdot x_1 = x_1$
Results:
- $\frac{\partial f}{\partial x_1} = \bar{x}_1 = x_2 + \cos(x_1)$ ✓
- $\frac{\partial f}{\partial x_2} = \bar{x}_2 = x_1$ ✓
Notice that backward mode computed both partial derivatives in a single backward pass, whereas forward mode would require two separate passes (one for each input).
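For completeness, the same two-phase computation can be written out as straight-line code at an arbitrary evaluation point (values chosen only for the numerical check); the adjoint updates mirror the steps above, and both partial derivatives indeed come out of a single backward sweep:

```python
import math

x1, x2 = 1.3, 0.7

# Phase 1: forward pass, storing intermediate values
v1 = x1 * x2
v2 = math.sin(x1)
f  = v1 + v2

# Phase 2: backward pass, accumulating adjoints
f_bar  = 1.0
v1_bar = f_bar * 1.0              # from f = v1 + v2
v2_bar = f_bar * 1.0
x1_bar = v2_bar * math.cos(x1)    # from v2 = sin(x1)
x1_bar += v1_bar * x2             # from v1 = x1*x2 (second contribution to x1)
x2_bar  = v1_bar * x1

assert math.isclose(x1_bar, x2 + math.cos(x1))
assert math.isclose(x2_bar, x1)
```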
Section IV: Computational Cost Analysis
4.1 The Fundamental Question
We now address the central question: Which mode is more efficient? To answer this, we must analyze the computational cost of each approach as a function of the input dimensionality $D$, the output dimensionality $M$, and the cost of evaluating the function itself.
Let us denote:
- $D$: Input dimensionality (number of parameters)
- $M$: Output dimensionality (number of outputs)
- $C_f$: Cost of a single forward evaluation of $f$
- $N$: Number of data points (training examples)
4.2 Forward Mode: Cost Structure
FORWARD MODE COST ANALYSIS
Computational Strategy: Forward mode computes one column of the Jacobian per pass—that is, derivatives of all outputs with respect to a single input variable.
Cost per pass: Approximately $C_f$ (comparable to a single forward evaluation)
Total passes required: $D$ (one per input dimension)
For a single data point (complete Jacobian):
$$\text{Cost} \approx D \cdot C_f$$
For $N$ data points:
$$\text{Cost} = O(N \cdot D \cdot C_f)$$
Scaling Behavior:
Cost scales linearly with $D$ (number of inputs), but is independent of $M$ (number of outputs).
This makes forward mode very expensive when the input dimensionality is large—the typical scenario in modern deep learning, where models may have millions or billions of parameters.
4.3 Backward Mode: Cost Structure
BACKWARD MODE COST ANALYSIS
Computational Strategy: Backward mode computes one row of the Jacobian per pass—that is, derivatives of a single output with respect to all input variables.
The computation for each row proceeds in two phases:
- Forward pass: Compute all intermediate values → cost $\approx C_f$
- Backward pass: Propagate adjoints for output $f_i$ → cost $\approx C_f$
Cost per output:
$$\text{Cost} \approx 2\,C_f \quad \text{(one forward pass plus one backward pass)}$$
Total passes required: $M$ (one per output dimension)
For a single data point (complete Jacobian):
$$\text{Cost} \approx 2M \cdot C_f = O(M \cdot C_f)$$
For $N$ data points:
$$\text{Cost} = O(N \cdot M \cdot C_f)$$
Scaling Behavior:
Cost scales linearly with $M$ (number of outputs), but is independent of $D$ (number of inputs).
This makes backward mode highly efficient when the output dimensionality is small relative to the input dimensionality—the canonical case being a scalar loss function ($M = 1$).
4.4 Comparative Cost Summary
| Method | Jacobian Computation | Cost per Data Point | Scales With | Total Cost ($N$ points) |
|---|---|---|---|---|
| Forward Mode | Column-by-column ($\partial f / \partial x_i$ for a single $i$) | $\approx C_f$ per column | $D$ (input dimension) | $O(N \cdot D \cdot C_f)$ |
| Backward Mode | Row-by-row ($\partial f_i / \partial x$ for a single $i$) | $\approx 2C_f$ per row | $M$ (output dimension) | $O(N \cdot M \cdot C_f)$ |
4.5 The Special Case: Scalar Loss Function
In machine learning, we typically optimize a scalar loss function: $L: \mathbb{R}^D \to \mathbb{R}$. In this case, $M = 1$, and the computational costs become:
Forward Mode ($M=1$)
Must perform $D$ forward passes to compute the gradient. Cost grows linearly with parameter count.
Backward Mode ($M=1$)
Requires only 1 backward pass to compute the entire gradient, regardless of parameter count!
Fundamental Result: The Power of Backward Mode for Scalar Functions
For a scalar loss function $L: \mathbb{R}^D \to \mathbb{R}$ with $N$ training examples:
- Forward mode: Requires $D$ passes → Cost $O(N \cdot D \cdot C_f)$
- Backward mode: Requires 1 pass → Cost $O(N \cdot C_f)$
Speedup factor: $D$ (the number of parameters)
For a modern neural network with $D = 10^6$ parameters, backward mode computes the gradient roughly a million times faster than forward mode (up to constant factors).
Section V: Comparative Analysis and Selection Criteria
5.1 The Fundamental Principle
The Dimensionality Criterion
The choice between forward and backward mode depends on the dimensionality ratio between inputs and outputs:
- If $D \ll M$ (few inputs, many outputs), forward mode is the efficient choice.
- If $M \ll D$ (many inputs, few outputs), backward mode is the efficient choice.
Algorithmic Intuition:
- Forward mode: Iterates over input variables, requiring $D$ passes. Cost grows with the number of inputs but is independent of output count.
- Backward mode: Iterates over output variables, requiring $M$ passes. Cost grows with the number of outputs but is independent of input count.
5.2 Detailed Comparison Matrix
| Criterion | Forward Mode | Backward Mode |
|---|---|---|
| Optimal Scenario | Many outputs, few inputs ($D \ll M$) | Few outputs, many inputs ($M \ll D$) |
| Computation Cost | One pass per input dimension: $O(D \cdot C_f)$ | One pass per output dimension: $O(M \cdot C_f)$ |
| Memory Requirement | Minimal; no storage of intermediate values | Higher; must store intermediate values and local derivatives |
| Jacobian Computation | Constructs Jacobian column-by-column | Constructs Jacobian row-by-row |
| Gradient of Scalar Loss ($M = 1$) | Requires $D$ passes: $O(N \cdot D \cdot C_f)$ | Requires 1 pass: $O(N \cdot C_f)$ |
| Machine Learning | Rarely used (large parameter counts) | Standard choice (single loss function) |
| Jacobian-Vector Product | Naturally computes $J \cdot \mathbf{v}$ | Naturally computes $\mathbf{v}^T \cdot J$ |
| Implementation Complexity | Simpler (accumulate tangents during forward pass) | More complex (two-pass algorithm with storage) |
| Propagation Direction | Forward through graph (inputs → outputs) | Backward through graph (outputs → inputs) |
| Data Structure | Each node stores value and tangent | Each node stores value and adjoint |
| Application of Chain Rule | Applied over parent nodes during forward pass | Applied over child nodes during backward pass |
5.3 Memory Considerations
Beyond computational cost, memory usage is an important practical consideration:
Forward Mode Memory
Requirement: Minimal
- No need to store intermediate values
- Each node maintains only current value and tangent
- Memory usage: proportional to the number of simultaneously live values (intermediates can be discarded as soon as later operations no longer need them)
Backward Mode Memory
Requirement: Higher
- Must store all intermediate values from the forward pass until the backward pass has consumed them
- Must store (or be able to recompute) the local partial derivatives
- Memory usage: proportional to the total number of intermediate values in the graph, which for deep networks grows with depth
Memory-Computation Trade-off
Backward mode’s computational efficiency comes at the cost of increased memory usage. For very deep networks, memory consumption can become a bottleneck. Techniques such as gradient checkpointing can trade some recomputation for reduced memory usage.
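A rough sketch of that idea, assuming a chain of unary scalar operations split into fixed-length segments (the function and variable names are hypothetical, and real frameworks implement checkpointing differently): only segment-boundary values are stored during the forward pass, and everything else is recomputed on demand during the backward pass.

```python
import math

# Illustrative chain; each stage is (f, df), with df the derivative of f.
# len(stages) is assumed divisible by the segment length.
stages = [
    (math.sin,  math.cos),
    (math.exp,  math.exp),
    (math.tanh, lambda v: 1.0 - math.tanh(v) ** 2),
    (lambda v: 0.5 * v * v, lambda v: v),
]

def grad_with_checkpoints(x, stages, seg=2):
    """d(output)/dx by reverse mode, storing only every seg-th intermediate value."""
    # Forward pass: keep only segment-boundary values (the checkpoints).
    checkpoints, v = [x], x
    for i, (f, _) in enumerate(stages):
        v = f(v)
        if (i + 1) % seg == 0 and (i + 1) < len(stages):
            checkpoints.append(v)
    # Backward pass: recompute each segment's intermediates from its checkpoint,
    # then accumulate the adjoint right-to-left through the segment.
    adjoint = 1.0
    for s in range(len(checkpoints) - 1, -1, -1):
        segment = stages[s * seg:(s + 1) * seg]
        v, inputs = checkpoints[s], []
        for f, _ in segment:                 # recomputation: the extra work traded for memory
            inputs.append(v)
            v = f(v)
        for (_, df), vin in zip(reversed(segment), reversed(inputs)):
            adjoint *= df(vin)               # chain rule within the segment
    return adjoint

# Check against plain reverse mode that stores every intermediate value.
x0, full, vals, v = 0.3, 1.0, [], 0.3
for f, _ in stages:
    vals.append(v)
    v = f(v)
for (_, df), vin in zip(reversed(stages), reversed(vals)):
    full *= df(vin)
assert math.isclose(grad_with_checkpoints(x0, stages), full)
```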
5.4 Summary: When to Use Each Mode
Use Forward Mode When:
- Few input variables ($D$ small)
- Many output variables ($M$ large)
- Need Jacobian-vector products $J \cdot \mathbf{v}$
- Memory is severely constrained
- Computing sensitivities of many outputs w.r.t. few parameters
Example: Sensitivity analysis in engineering (many outputs, few design parameters)
Use Backward Mode When:
- Many input variables ($D$ large)
- Few output variables ($M$ small)
- Need vector-Jacobian products $\mathbf{v}^T \cdot J$
- Optimizing a scalar loss function ($M=1$)
- Training neural networks
Example: Deep learning (millions of parameters, single loss)
Epilogue: The Beauty of Duality
We have examined two methods for traversing the computational graph—one forward, one backward—and found that they are complementary perspectives on computing derivatives. Forward mode and backward mode are dual algorithms, each optimal in its own regime.
Forward mode asks: “If I perturb an input, how do all outputs respond?” It propagates tangents forward, computing columns of the Jacobian, and is efficient when inputs are few.
Backward mode asks: “How sensitive is a given output to each input?” It propagates adjoints backward, computing rows of the Jacobian, and is efficient when outputs are few.
For machine learning—computing the gradient of a scalar loss with respect to millions of parameters—backward mode provides substantial computational advantages. This efficiency is what makes backpropagation practical for training modern neural networks.
Forward mode remains useful in domains where the dimensionality structure is reversed—few parameters with many outputs. Both methods provide essential insight into the nature of derivative computation.
The Duality Theorem
Forward mode and backward mode are dual algorithms related by the transpose operation: forward mode computes Jacobian-vector products, and backward mode computes the corresponding transposed products,
$$\text{forward mode: } \dot{\mathbf{y}} = J\,\dot{\mathbf{x}}, \qquad \text{backward mode: } \bar{\mathbf{x}} = J^{\top}\,\bar{\mathbf{y}} \quad (\text{equivalently } \bar{\mathbf{x}}^{\top} = \bar{\mathbf{y}}^{\top} J).$$
They compute the same derivatives through the same computational graph, differing only in:
- The direction of propagation (forward vs. backward)
- The order of chain rule application (right-to-left vs. left-to-right parenthesization)
- The efficiency for different dimensionality ratios ($D \ll M$ vs. $M \ll D$)
Together, they form a complete theory of automatic differentiation, providing exact derivatives for arbitrary compositions of differentiable functions.
The choice between forward and backward mode is determined by the structure of your problem—specifically the ratio $M/D$: when $M/D \ll 1$ (few outputs, many inputs), backward mode scales efficiently, and when $M/D \gg 1$ (many outputs, few inputs), forward mode does.
This duality shows that a single mathematical structure can be approached from multiple directions, each revealing different aspects. In automatic differentiation, this duality has enabled the advances in machine learning that characterize our current era.