195 Computational Graphs

A computational graph is the central data structure behind modern machine learning frameworks. It records how an output is produced from inputs through a sequence of primitive operations, and it turns differentiation, scheduling, and optimization into problems on a directed graph. Understanding computational graphs clarifies why automatic differentiation is exact rather than numerical, why some frameworks feel rigid while others feel like ordinary programming, and how compilers extract performance from a model that was written without any thought for hardware.

195.1 1. Nodes, Edges, and the Structure of Computation

A computational graph is a directed acyclic graph (DAG) $G = (V, E)$. Each node $v \in V$ represents either an input value, a constant, a parameter, or a primitive operation such as addition, matrix multiplication, or a nonlinearity. Each directed edge $(u, v) \in E$ indicates that the value produced at node $u$ is consumed as an argument by node $v$. Because data flows from producers to consumers and a value cannot depend on itself, the graph is acyclic, which guarantees that a topological order exists.

Consider the scalar expression $L = (x w + b - y)^2$, a single squared error term. We decompose it into primitives:

t1 = x * w      # multiply
t2 = t1 + b     # add
t3 = t2 - y     # subtract
L  = t3 * t3    # square

The nodes are $\{x, w, b, y, t_1, t_2, t_3, L\}$ and the edges encode the dependencies, for example $x \to t_1$ and $w \to t_1$. Each intermediate node stores both the operation it performs and, at execution time, the cached value it produced. This caching matters: the same intermediate can feed several downstream nodes, and storing it once avoids recomputation.

The graph for this expression is small enough to draw in full. Inputs and parameters sit at the top, primitives in the middle, and the scalar loss at the bottom.

graph TD
    x["x"] --> t1["t1 = x * w"]
    w["w"] --> t1
    t1 --> t2["t2 = t1 + b"]
    b["b"] --> t2
    t2 --> t3["t3 = t2 - y"]
    y["y"] --> t3
    t3 --> L["L = t3 * t3"]

We can fix the structure once and for all with a precise definition.

Definition: computational graph

A computational graph is a directed acyclic graph $G = (V, E)$ together with, for each non input node $v$, a primitive function $f_v$ and an ordered list of parent nodes $\mathrm{pa}(v) = (u_1, \ldots, u_k)$ such that $(u_i, v) \in E$. The value at $v$ is $a_v = f_v(a_{u_1}, \ldots, a_{u_k})$. Input, constant, and parameter nodes have no parents and carry a value directly. Acyclicity guarantees that the values $a_v$ are well defined and computable in a single pass.

The granularity of a node is a design choice. A graph may be coarse, with a whole layer such as $\text{LayerNorm}$ as one node, or fine, with elementwise multiply and reduce-sum as separate nodes. Coarse graphs are easier to reason about and map cleanly to optimized library kernels. Fine graphs expose more structure to a compiler and allow operation fusion. Frameworks usually expose a coarse user level graph that is lowered into a finer internal representation before optimization.

Two structural properties are worth naming. First, a topological order is any ordering of $V$ such that every edge points from an earlier node to a later node. Forward evaluation visits nodes in topological order. Second, the reverse topological order is what gradient computation uses, because a gradient at a node depends on gradients at all of its consumers.

195.2 2. Static Versus Dynamic Graphs

Frameworks differ in when the graph is constructed relative to when it is executed. This distinction shapes the entire developer experience.

195.2.1 2.1 Static Graphs (Define-and-Run)

In the static, or define-and-run, model the program first builds a complete symbolic graph and then executes it repeatedly with concrete data. TensorFlow 1.x and Theano popularized this style. The user constructs placeholder nodes, wires operations together, and only later feeds numerical arrays through a session. Control flow such as loops and conditionals must be expressed with special graph operators, for example a functional while_loop, because ordinary Python control flow runs only once during graph construction.

The advantage is that the whole computation is known before any number is computed. The framework can analyze the graph globally, allocate memory ahead of time, schedule operations across devices, and apply aggressive compilation. The cost is expressiveness and debuggability. A model whose structure depends on the data, such as a recursive network over a parse tree of varying shape, is awkward to write, and a runtime error surfaces inside the opaque execution engine rather than at the Python line that caused it.

195.2.2 2.2 Dynamic Graphs (Define-by-Run)

In the dynamic, or define-by-run, model the graph is built on the fly as the forward computation executes. Every operation both computes its result and records a node. PyTorch, Chainer, and TensorFlow 2.x eager mode follow this style. Ordinary Python if statements and for loops control the shape of the graph directly, so a different input can produce a different graph on each call.

h = x
for layer in layers:          # ordinary Python loop
    h = relu(layer(h))        # each op appends a node
    if h.norm() > threshold:  # data dependent branch
        h = normalize(h)
loss = mse(h, y)

Define-by-run trades some global optimizability for immediacy. Errors appear at the offending operation, intermediate values can be inspected with a debugger or print statement, and dynamic architectures are expressed in plain code. The historical gap in performance has narrowed because modern frameworks add a tracing or just-in-time compilation step, discussed in Section 5, that recovers a static graph from dynamic code when the structure is stable.

195.3 3. Define-by-Run and the Recorded Tape

The mechanism that makes define-by-run support differentiation is the tape, sometimes called a Wengert list. As the forward pass runs, the framework appends to an ordered record every primitive operation, along with references to its input nodes and its output. The tape is precisely a linearized computational graph captured in execution order.

Each recorded entry stores enough information to later compute a local derivative. For a primitive $y = f(x_1, \ldots, x_k)$ the framework knows a backward rule that, given the gradient of the loss with respect to $y$, returns the gradient with respect to each $x_i$. The tape does not store the symbolic form of $f$; it stores the cached inputs and outputs needed to evaluate the backward rule numerically at the current point.

A subtle but important consequence is memory. Because backward rules often need the forward activations, those activations must be retained until the backward pass consumes them. This is why training memory scales with the depth of the graph and why techniques such as gradient checkpointing exist: they discard some activations and recompute them during the backward pass, trading compute for memory. In eager frameworks the tape is typically discarded after one backward pass, so a second backward call requires either retaining the graph explicitly or rebuilding it.

195.4 4. How Graphs Enable Automatic Differentiation

The computational graph is what makes automatic differentiation (AD) exact and efficient. AD is neither symbolic differentiation, which manipulates expressions and can blow up in size, nor numerical differentiation, which approximates with finite differences and suffers truncation and rounding error. Instead AD applies the chain rule mechanically along the edges of the graph, computing derivatives that are exact up to floating point.

195.4.1 4.1 The Chain Rule on a Graph

Let the loss be $L$ and let node $v$ have value $a_v$. The reverse mode adjoint of $v$ is defined as $\bar{a}_v = \partial L / \partial a_v$. For the output node, $\bar{a}_L = 1$. For any other node $v$, the chain rule sums contributions over all consumers $c$ of $v$:

\[ \bar{a}_v = \sum_{c \,:\, (v, c) \in E} \bar{a}_c \, \frac{\partial a_c}{\partial a_v}. \]

The factor $\partial a_c / \partial a_v$ is the local Jacobian of the operation at node $c$ with respect to its input $v$, and it is exactly what the per primitive backward rule computes. Evaluating this recurrence in reverse topological order guarantees that every consumer’s adjoint $\bar{a}_c$ is already available when $\bar{a}_v$ is needed.

The sum over consumers is the part most often misremembered. When a node $v$ feeds more than one downstream node, its adjoint is the sum of the contributions routed back through each consumer, not just the contribution along one path. This is the multivariable chain rule, and it is why a tensor that is reused several times in a forward pass accumulates gradient from every use during the backward pass.

195.4.2 4.2 A Worked Example

Return to $L = (xw + b - y)^2$ and pick concrete numbers $x = 2$, $w = 3$, $b = 1$, $y = 4$. The forward pass fills in the cached values:

\[ t_1 = xw = 6, \quad t_2 = t_1 + b = 7, \quad t_3 = t_2 - y = 3, \quad L = t_3^2 = 9. \]

The backward pass seeds $\bar{a}_L = \partial L / \partial L = 1$ and walks the graph in reverse, applying one local rule per node:

\[ \bar{a}_{t_3} = \bar{a}_L \cdot \frac{\partial L}{\partial t_3} = 1 \cdot 2 t_3 = 6, \] \[ \bar{a}_{t_2} = \bar{a}_{t_3} \cdot \frac{\partial t_3}{\partial t_2} = 6 \cdot 1 = 6, \] \[ \bar{a}_{t_1} = \bar{a}_{t_2} \cdot \frac{\partial t_2}{\partial t_1} = 6 \cdot 1 = 6. \]

The parameter gradients then read off directly from the multiply and add nodes:

\[ \bar{a}_{w} = \bar{a}_{t_1} \cdot \frac{\partial t_1}{\partial w} = 6 \cdot x = 12, \qquad \bar{a}_{b} = \bar{a}_{t_2} \cdot \frac{\partial t_2}{\partial b} = 6 \cdot 1 = 6. \]

We can verify $\bar{a}_w$ against the closed form. Since $L = (xw + b - y)^2$, we have $\partial L / \partial w = 2(xw + b - y) \cdot x = 2 \cdot 3 \cdot 2 = 12$, which matches. The point is that the graph reached the same exact answer mechanically, by composing simple local rules, never differentiating the full expression symbolically and never perturbing inputs numerically.

195.4.3 4.3 Reverse Mode and Backpropagation

195.4.4 4.2 Reverse Mode and Backpropagation

Reverse mode AD computes the full gradient $\nabla_\theta L$ with a single backward sweep whose cost is a small constant multiple of one forward evaluation, independent of the number of parameters. This property is the reason deep networks are trainable: a model may have billions of parameters $\theta$ but a single scalar loss $L$, so a method whose cost is independent of the parameter count and proportional to the function evaluation is essential. Backpropagation is exactly reverse mode AD specialized to neural network graphs.

Forward mode AD, by contrast, propagates derivatives from inputs to outputs and costs a constant multiple per input direction. It is efficient when there are few inputs and many outputs, the opposite regime. For a function $f : \mathbb{R}^n \to \mathbb{R}^m$, reverse mode is preferred when $m \ll n$ and forward mode when $n \ll m$. Training, with $m = 1$ and $n$ enormous, sits firmly in the reverse mode regime. The table below summarizes the trade off.

Property	Forward mode	Reverse mode
Direction of sweep	inputs to outputs	outputs to inputs
Quantity computed per sweep	one column of the Jacobian (one input direction)	one row of the Jacobian (one output direction)
Cost relative to one forward pass	proportional to number of inputs $n$	proportional to number of outputs $m$
Extra memory for the tape	none beyond the dual numbers	stores forward activations until consumed
Best regime	$n \ll m$ (few inputs)	$m \ll n$ (few outputs)
Typical use	Jacobian vector products, sensitivity in a direction	training a scalar loss, backpropagation

A useful intuition is that one reverse sweep yields the gradient of a single scalar with respect to everything, whereas one forward sweep yields the derivative of everything with respect to a single input. Training optimizes one scalar loss against many parameters, so reverse mode wins decisively.

195.4.5 4.4 Vector Jacobian Products

In practice AD frameworks never materialize a full Jacobian matrix. Reverse mode is implemented as a sequence of vector Jacobian products (VJPs): given an upstream adjoint vector $\bar{a}_c$, each backward rule returns $\bar{a}_c^\top J$ where $J = \partial a_c / \partial a_v$, without ever forming $J$ explicitly. For a linear node $y = W x$, the VJP with respect to $x$ is simply $W^\top \bar{a}_y$, a matrix vector product. Composing VJPs along the graph yields the gradient at the cost of ordinary forward style operations.

195.5 5. Graph-Level Optimizations

Once a computation is captured as a graph, it becomes a target for a compiler. Optimizations operate on the graph as a whole, transforming it into an equivalent graph that runs faster or uses less memory. Systems such as XLA, TorchInductor, TVM, and ONNX Runtime apply these passes.

195.5.1 5.1 Common Optimizations

Operation fusion merges several adjacent elementwise nodes into a single kernel. A chain such as multiply, add, then $\text{ReLU}$ would otherwise read and write intermediate arrays to memory three times. Fused into one kernel, the data is loaded once and the intermediates stay in registers, which on memory bound hardware is often the single largest speedup.

Constant folding evaluates subgraphs whose inputs are all known at compile time and replaces them with literal values, removing work from every execution.

Common subexpression elimination detects nodes that compute the same operation on the same inputs and shares a single node, mirroring the value caching described in Section 1.

Algebraic simplification rewrites subgraphs using mathematical identities, for example replacing $x + 0$, collapsing redundant transposes, or fusing a batch normalization into an adjacent convolution at inference time.

Memory planning and liveness analysis compute, from the graph topology, when each buffer is needed and reuse memory once a value is dead. Knowing the full graph in advance enables tight static allocation rather than per operation allocation.

Layout and device assignment choose data layouts such as row major versus channel last and place nodes on specific devices, then insert the communication nodes needed when an edge crosses a device boundary.

195.5.2 5.2 Tracing, Capture, and Lowering

To optimize dynamic code, frameworks first capture a graph. Tracing runs the function once with example inputs and records the operations that occurred, producing a static graph valid for that input structure. Because tracing only sees one path, data dependent control flow must be handled by guards that re-trace when assumptions break, or by graph capture systems that represent control flow explicitly. The captured graph is then lowered through progressively lower level intermediate representations, where the optimizations above are applied, before code generation emits hardware specific kernels. This pipeline is what lets a user write natural define-by-run code and still obtain near static graph performance.

195.5.3 5.3 Correctness Constraints

Every graph transformation must preserve semantics. Two constraints dominate. First, floating point arithmetic is not associative, so reorderings and fusions can change results at the level of rounding error; reputable compilers keep these changes within documented tolerances or behind explicit flags. Second, operations with side effects, such as in place mutation, random number generation, and reads of external state, impose ordering edges that pure dataflow does not capture. Compilers add explicit dependency edges to preserve such ordering, which is why a graph intended for aggressive optimization is easiest to reason about when its operations are pure functions of their inputs.

195.6 6. Practical Guidance and Pitfalls

The graph view of computation is a mental model that pays off well beyond framework internals. A few practical points recur often enough to state plainly.

Choosing static versus dynamic. Reach for dynamic define-by-run code by default. It is the easiest to write, debug, and reason about, and modern just-in-time compilers recover most of the performance through tracing. Move toward an explicitly captured or compiled graph when a model is stable in structure and dominated by a hot loop, when you need to deploy to an environment without a Python interpreter, or when a global optimization such as cross operation fusion materially changes throughput. The two are not exclusive: write in eager mode, then compile the steady state.

Mind the activation memory. The single most common training pitfall traced to the graph is running out of memory because every forward activation needed by a backward rule is held alive until the backward pass consumes it. Training memory therefore scales with graph depth and batch size, not just parameter count. Gradient checkpointing trades compute for memory by discarding selected activations and recomputing them, and is the standard remedy for deep graphs.

Watch for accidental graph retention. In eager frameworks the tape is normally freed after one backward pass. Keeping a reference to a loss tensor, or appending a tensor that still carries its graph to a running total, can pin the entire graph in memory across iterations. Detaching a tensor from the graph, or accumulating only its plain numeric value, avoids a slow memory leak that grows with the number of steps.

Purity helps the compiler. Operations that are pure functions of their inputs can be freely reordered, fused, and deduplicated. Side effects such as in place mutation, random number draws, and reads of external state impose ordering constraints that a pure dataflow graph cannot express, so compilers must add explicit dependency edges and forgo some optimizations. Code written as pure transformations is both easier to differentiate and easier to optimize.

Non differentiable points are real. AD differentiates the specific computation path taken, not the mathematical function in the abstract. At a kink such as $\mathrm{ReLU}$ at zero, the framework returns a chosen subgradient, and operations like rounding or argmax have zero gradient almost everywhere. A model that trains poorly can be a sign that gradient is not flowing through a discrete or piecewise constant operation.

195.7 7. Summary

A computational graph reduces a model to nodes that compute primitives and edges that carry values. The timing of graph construction separates static define-and-run frameworks, which optimize globally at the cost of flexibility, from dynamic define-by-run frameworks, which build the graph as code runs and recover performance through tracing. The graph structure is what makes reverse mode automatic differentiation possible: a single backward sweep in reverse topological order composes local vector Jacobian products into an exact gradient at a cost independent of the parameter count. The same structure is a compilation target, where fusion, constant folding, memory planning, and layout assignment turn a hardware agnostic model into efficient executable code. These ideas, graph, gradient, and compiler, are the foundation on which every modern training and inference system is built.

195.8 References

Baydin, A. G., Pearlmutter, B. A., Radul, A. A., and Siskind, J. M. “Automatic Differentiation in Machine Learning: a Survey.” Journal of Machine Learning Research, 2018. https://jmlr.org/papers/v18/17-468.html
Abadi, M., et al. “TensorFlow: A System for Large-Scale Machine Learning.” OSDI, 2016. https://www.usenix.org/system/files/conference/osdi16/osdi16-abadi.pdf
Paszke, A., et al. “PyTorch: An Imperative Style, High-Performance Deep Learning Library.” NeurIPS, 2019. https://papers.nips.cc/paper/2019/hash/bdbca288fee7f92f2bfa9f7012727740-Abstract.html
Griewank, A., and Walther, A. “Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation.” 2nd ed., SIAM, 2008. https://epubs.siam.org/doi/book/10.1137/1.9780898717761
Chen, T., et al. “TVM: An Automated End-to-End Optimizing Compiler for Deep Learning.” OSDI, 2018. https://www.usenix.org/system/files/osdi18-chen.pdf
The XLA Authors. “XLA: Optimizing Compiler for Machine Learning.” TensorFlow Documentation. https://www.tensorflow.org/xla
Ansel, J., et al. “PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation.” ASPLOS, 2024. https://pytorch.org/assets/pytorch2-2.pdf
Wengert, R. E. “A Simple Automatic Derivative Evaluation Program.” Communications of the ACM, 1964. https://doi.org/10.1145/355586.364791
Chen, T., Xu, B., Zhang, C., and Guestrin, C. “Training Deep Nets with Sublinear Memory Cost.” arXiv, 2016. https://doi.org/10.48550/arXiv.1604.06174

# Computational Graphs A computational graph is the central data structure behind modern machine learning frameworks. It records how an output is produced from inputs through a sequence of primitive operations, and it turns differentiation, scheduling, and optimization into problems on a directed graph. Understanding computational graphs clarifies why automatic differentiation is exact rather than numerical, why some frameworks feel rigid while others feel like ordinary programming, and how compilers extract performance from a model that was written without any thought for hardware. ## 1. Nodes, Edges, and the Structure of Computation A computational graph is a directed acyclic graph (DAG) $G = (V, E)$. Each node $v \in V$ represents either an input value, a constant, a parameter, or a primitive operation such as addition, matrix multiplication, or a nonlinearity. Each directed edge $(u, v) \in E$ indicates that the value produced at node $u$ is consumed as an argument by node $v$. Because data flows from producers to consumers and a value cannot depend on itself, the graph is acyclic, which guarantees that a topological order exists. Consider the scalar expression $L = (x w + b - y)^2$, a single squared error term. We decompose it into primitives: ```text t1 = x * w # multiply t2 = t1 + b # add t3 = t2 - y # subtract L = t3 * t3 # square ``` The nodes are $\{x, w, b, y, t_1, t_2, t_3, L\}$ and the edges encode the dependencies, for example $x \to t_1$ and $w \to t_1$. Each intermediate node stores both the operation it performs and, at execution time, the cached value it produced. This caching matters: the same intermediate can feed several downstream nodes, and storing it once avoids recomputation. The graph for this expression is small enough to draw in full. Inputs and parameters sit at the top, primitives in the middle, and the scalar loss at the bottom. ```{mermaid} graph TD x["x"] --> t1["t1 = x * w"] w["w"] --> t1 t1 --> t2["t2 = t1 + b"] b["b"] --> t2 t2 --> t3["t3 = t2 - y"] y["y"] --> t3 t3 --> L["L = t3 * t3"] ``` We can fix the structure once and for all with a precise definition. ::: {.callout-note title="Definition: computational graph"} A computational graph is a directed acyclic graph $G = (V, E)$ together with, for each non input node $v$, a primitive function $f_v$ and an ordered list of parent nodes $\mathrm{pa}(v) = (u_1, \ldots, u_k)$ such that $(u_i, v) \in E$. The value at $v$ is $a_v = f_v(a_{u_1}, \ldots, a_{u_k})$. Input, constant, and parameter nodes have no parents and carry a value directly. Acyclicity guarantees that the values $a_v$ are well defined and computable in a single pass. ::: The granularity of a node is a design choice. A graph may be coarse, with a whole layer such as $\text{LayerNorm}$ as one node, or fine, with elementwise multiply and reduce-sum as separate nodes. Coarse graphs are easier to reason about and map cleanly to optimized library kernels. Fine graphs expose more structure to a compiler and allow operation fusion. Frameworks usually expose a coarse user level graph that is lowered into a finer internal representation before optimization. Two structural properties are worth naming. First, a topological order is any ordering of $V$ such that every edge points from an earlier node to a later node. Forward evaluation visits nodes in topological order. Second, the reverse topological order is what gradient computation uses, because a gradient at a node depends on gradients at all of its consumers. ## 2. Static Versus Dynamic Graphs Frameworks differ in when the graph is constructed relative to when it is executed. This distinction shapes the entire developer experience. ### 2.1 Static Graphs (Define-and-Run) In the static, or define-and-run, model the program first builds a complete symbolic graph and then executes it repeatedly with concrete data. TensorFlow 1.x and Theano popularized this style. The user constructs placeholder nodes, wires operations together, and only later feeds numerical arrays through a session. Control flow such as loops and conditionals must be expressed with special graph operators, for example a functional `while_loop`, because ordinary Python control flow runs only once during graph construction. The advantage is that the whole computation is known before any number is computed. The framework can analyze the graph globally, allocate memory ahead of time, schedule operations across devices, and apply aggressive compilation. The cost is expressiveness and debuggability. A model whose structure depends on the data, such as a recursive network over a parse tree of varying shape, is awkward to write, and a runtime error surfaces inside the opaque execution engine rather than at the Python line that caused it. ### 2.2 Dynamic Graphs (Define-by-Run) In the dynamic, or define-by-run, model the graph is built on the fly as the forward computation executes. Every operation both computes its result and records a node. PyTorch, Chainer, and TensorFlow 2.x eager mode follow this style. Ordinary Python `if` statements and `for` loops control the shape of the graph directly, so a different input can produce a different graph on each call. ```text h = x for layer in layers: # ordinary Python loop h = relu(layer(h)) # each op appends a node if h.norm() > threshold: # data dependent branch h = normalize(h) loss = mse(h, y) ``` Define-by-run trades some global optimizability for immediacy. Errors appear at the offending operation, intermediate values can be inspected with a debugger or print statement, and dynamic architectures are expressed in plain code. The historical gap in performance has narrowed because modern frameworks add a tracing or just-in-time compilation step, discussed in Section 5, that recovers a static graph from dynamic code when the structure is stable. ## 3. Define-by-Run and the Recorded Tape The mechanism that makes define-by-run support differentiation is the tape, sometimes called a Wengert list. As the forward pass runs, the framework appends to an ordered record every primitive operation, along with references to its input nodes and its output. The tape is precisely a linearized computational graph captured in execution order. Each recorded entry stores enough information to later compute a local derivative. For a primitive $y = f(x_1, \ldots, x_k)$ the framework knows a backward rule that, given the gradient of the loss with respect to $y$, returns the gradient with respect to each $x_i$. The tape does not store the symbolic form of $f$; it stores the cached inputs and outputs needed to evaluate the backward rule numerically at the current point. A subtle but important consequence is memory. Because backward rules often need the forward activations, those activations must be retained until the backward pass consumes them. This is why training memory scales with the depth of the graph and why techniques such as gradient checkpointing exist: they discard some activations and recompute them during the backward pass, trading compute for memory. In eager frameworks the tape is typically discarded after one backward pass, so a second backward call requires either retaining the graph explicitly or rebuilding it. ## 4. How Graphs Enable Automatic Differentiation The computational graph is what makes automatic differentiation (AD) exact and efficient. AD is neither symbolic differentiation, which manipulates expressions and can blow up in size, nor numerical differentiation, which approximates with finite differences and suffers truncation and rounding error. Instead AD applies the chain rule mechanically along the edges of the graph, computing derivatives that are exact up to floating point. ### 4.1 The Chain Rule on a Graph Let the loss be $L$ and let node $v$ have value $a_v$. The reverse mode adjoint of $v$ is defined as $\bar{a}_v = \partial L / \partial a_v$. For the output node, $\bar{a}_L = 1$. For any other node $v$, the chain rule sums contributions over all consumers $c$ of $v$: $$ \bar{a}_v = \sum_{c \,:\, (v, c) \in E} \bar{a}_c \, \frac{\partial a_c}{\partial a_v}. $$ The factor $\partial a_c / \partial a_v$ is the local Jacobian of the operation at node $c$ with respect to its input $v$, and it is exactly what the per primitive backward rule computes. Evaluating this recurrence in reverse topological order guarantees that every consumer's adjoint $\bar{a}_c$ is already available when $\bar{a}_v$ is needed. The sum over consumers is the part most often misremembered. When a node $v$ feeds more than one downstream node, its adjoint is the sum of the contributions routed back through each consumer, not just the contribution along one path. This is the multivariable chain rule, and it is why a tensor that is reused several times in a forward pass accumulates gradient from every use during the backward pass. ### 4.2 A Worked Example Return to $L = (xw + b - y)^2$ and pick concrete numbers $x = 2$, $w = 3$, $b = 1$, $y = 4$. The forward pass fills in the cached values: $$ t_1 = xw = 6, \quad t_2 = t_1 + b = 7, \quad t_3 = t_2 - y = 3, \quad L = t_3^2 = 9. $$ The backward pass seeds $\bar{a}_L = \partial L / \partial L = 1$ and walks the graph in reverse, applying one local rule per node: $$ \bar{a}_{t_3} = \bar{a}_L \cdot \frac{\partial L}{\partial t_3} = 1 \cdot 2 t_3 = 6, $$ $$ \bar{a}_{t_2} = \bar{a}_{t_3} \cdot \frac{\partial t_3}{\partial t_2} = 6 \cdot 1 = 6, $$ $$ \bar{a}_{t_1} = \bar{a}_{t_2} \cdot \frac{\partial t_2}{\partial t_1} = 6 \cdot 1 = 6. $$ The parameter gradients then read off directly from the multiply and add nodes: $$ \bar{a}_{w} = \bar{a}_{t_1} \cdot \frac{\partial t_1}{\partial w} = 6 \cdot x = 12, \qquad \bar{a}_{b} = \bar{a}_{t_2} \cdot \frac{\partial t_2}{\partial b} = 6 \cdot 1 = 6. $$ We can verify $\bar{a}_w$ against the closed form. Since $L = (xw + b - y)^2$, we have $\partial L / \partial w = 2(xw + b - y) \cdot x = 2 \cdot 3 \cdot 2 = 12$, which matches. The point is that the graph reached the same exact answer mechanically, by composing simple local rules, never differentiating the full expression symbolically and never perturbing inputs numerically. ### 4.3 Reverse Mode and Backpropagation ### 4.2 Reverse Mode and Backpropagation Reverse mode AD computes the full gradient $\nabla_\theta L$ with a single backward sweep whose cost is a small constant multiple of one forward evaluation, independent of the number of parameters. This property is the reason deep networks are trainable: a model may have billions of parameters $\theta$ but a single scalar loss $L$, so a method whose cost is independent of the parameter count and proportional to the function evaluation is essential. Backpropagation is exactly reverse mode AD specialized to neural network graphs. Forward mode AD, by contrast, propagates derivatives from inputs to outputs and costs a constant multiple per input direction. It is efficient when there are few inputs and many outputs, the opposite regime. For a function $f : \mathbb{R}^n \to \mathbb{R}^m$, reverse mode is preferred when $m \ll n$ and forward mode when $n \ll m$. Training, with $m = 1$ and $n$ enormous, sits firmly in the reverse mode regime. The table below summarizes the trade off. | Property | Forward mode | Reverse mode | |---|---|---| | Direction of sweep | inputs to outputs | outputs to inputs | | Quantity computed per sweep | one column of the Jacobian (one input direction) | one row of the Jacobian (one output direction) | | Cost relative to one forward pass | proportional to number of inputs $n$ | proportional to number of outputs $m$ | | Extra memory for the tape | none beyond the dual numbers | stores forward activations until consumed | | Best regime | $n \ll m$ (few inputs) | $m \ll n$ (few outputs) | | Typical use | Jacobian vector products, sensitivity in a direction | training a scalar loss, backpropagation | A useful intuition is that one reverse sweep yields the gradient of a single scalar with respect to everything, whereas one forward sweep yields the derivative of everything with respect to a single input. Training optimizes one scalar loss against many parameters, so reverse mode wins decisively. ### 4.4 Vector Jacobian Products In practice AD frameworks never materialize a full Jacobian matrix. Reverse mode is implemented as a sequence of vector Jacobian products (VJPs): given an upstream adjoint vector $\bar{a}_c$, each backward rule returns $\bar{a}_c^\top J$ where $J = \partial a_c / \partial a_v$, without ever forming $J$ explicitly. For a linear node $y = W x$, the VJP with respect to $x$ is simply $W^\top \bar{a}_y$, a matrix vector product. Composing VJPs along the graph yields the gradient at the cost of ordinary forward style operations. ## 5. Graph-Level Optimizations Once a computation is captured as a graph, it becomes a target for a compiler. Optimizations operate on the graph as a whole, transforming it into an equivalent graph that runs faster or uses less memory. Systems such as XLA, TorchInductor, TVM, and ONNX Runtime apply these passes. ### 5.1 Common Optimizations **Operation fusion** merges several adjacent elementwise nodes into a single kernel. A chain such as multiply, add, then $\text{ReLU}$ would otherwise read and write intermediate arrays to memory three times. Fused into one kernel, the data is loaded once and the intermediates stay in registers, which on memory bound hardware is often the single largest speedup. **Constant folding** evaluates subgraphs whose inputs are all known at compile time and replaces them with literal values, removing work from every execution. **Common subexpression elimination** detects nodes that compute the same operation on the same inputs and shares a single node, mirroring the value caching described in Section 1. **Algebraic simplification** rewrites subgraphs using mathematical identities, for example replacing $x + 0$, collapsing redundant transposes, or fusing a batch normalization into an adjacent convolution at inference time. **Memory planning and liveness analysis** compute, from the graph topology, when each buffer is needed and reuse memory once a value is dead. Knowing the full graph in advance enables tight static allocation rather than per operation allocation. **Layout and device assignment** choose data layouts such as row major versus channel last and place nodes on specific devices, then insert the communication nodes needed when an edge crosses a device boundary. ### 5.2 Tracing, Capture, and Lowering To optimize dynamic code, frameworks first capture a graph. Tracing runs the function once with example inputs and records the operations that occurred, producing a static graph valid for that input structure. Because tracing only sees one path, data dependent control flow must be handled by guards that re-trace when assumptions break, or by graph capture systems that represent control flow explicitly. The captured graph is then lowered through progressively lower level intermediate representations, where the optimizations above are applied, before code generation emits hardware specific kernels. This pipeline is what lets a user write natural define-by-run code and still obtain near static graph performance. ### 5.3 Correctness Constraints Every graph transformation must preserve semantics. Two constraints dominate. First, floating point arithmetic is not associative, so reorderings and fusions can change results at the level of rounding error; reputable compilers keep these changes within documented tolerances or behind explicit flags. Second, operations with side effects, such as in place mutation, random number generation, and reads of external state, impose ordering edges that pure dataflow does not capture. Compilers add explicit dependency edges to preserve such ordering, which is why a graph intended for aggressive optimization is easiest to reason about when its operations are pure functions of their inputs. ## 6. Practical Guidance and Pitfalls The graph view of computation is a mental model that pays off well beyond framework internals. A few practical points recur often enough to state plainly. **Choosing static versus dynamic.** Reach for dynamic define-by-run code by default. It is the easiest to write, debug, and reason about, and modern just-in-time compilers recover most of the performance through tracing. Move toward an explicitly captured or compiled graph when a model is stable in structure and dominated by a hot loop, when you need to deploy to an environment without a Python interpreter, or when a global optimization such as cross operation fusion materially changes throughput. The two are not exclusive: write in eager mode, then compile the steady state. **Mind the activation memory.** The single most common training pitfall traced to the graph is running out of memory because every forward activation needed by a backward rule is held alive until the backward pass consumes it. Training memory therefore scales with graph depth and batch size, not just parameter count. Gradient checkpointing trades compute for memory by discarding selected activations and recomputing them, and is the standard remedy for deep graphs. **Watch for accidental graph retention.** In eager frameworks the tape is normally freed after one backward pass. Keeping a reference to a loss tensor, or appending a tensor that still carries its graph to a running total, can pin the entire graph in memory across iterations. Detaching a tensor from the graph, or accumulating only its plain numeric value, avoids a slow memory leak that grows with the number of steps. **Purity helps the compiler.** Operations that are pure functions of their inputs can be freely reordered, fused, and deduplicated. Side effects such as in place mutation, random number draws, and reads of external state impose ordering constraints that a pure dataflow graph cannot express, so compilers must add explicit dependency edges and forgo some optimizations. Code written as pure transformations is both easier to differentiate and easier to optimize. **Non differentiable points are real.** AD differentiates the specific computation path taken, not the mathematical function in the abstract. At a kink such as $\mathrm{ReLU}$ at zero, the framework returns a chosen subgradient, and operations like rounding or argmax have zero gradient almost everywhere. A model that trains poorly can be a sign that gradient is not flowing through a discrete or piecewise constant operation. ## 7. Summary A computational graph reduces a model to nodes that compute primitives and edges that carry values. The timing of graph construction separates static define-and-run frameworks, which optimize globally at the cost of flexibility, from dynamic define-by-run frameworks, which build the graph as code runs and recover performance through tracing. The graph structure is what makes reverse mode automatic differentiation possible: a single backward sweep in reverse topological order composes local vector Jacobian products into an exact gradient at a cost independent of the parameter count. The same structure is a compilation target, where fusion, constant folding, memory planning, and layout assignment turn a hardware agnostic model into efficient executable code. These ideas, graph, gradient, and compiler, are the foundation on which every modern training and inference system is built. ## References 1. Baydin, A. G., Pearlmutter, B. A., Radul, A. A., and Siskind, J. M. "Automatic Differentiation in Machine Learning: a Survey." Journal of Machine Learning Research, 2018. https://jmlr.org/papers/v18/17-468.html 2. Abadi, M., et al. "TensorFlow: A System for Large-Scale Machine Learning." OSDI, 2016. https://www.usenix.org/system/files/conference/osdi16/osdi16-abadi.pdf 3. Paszke, A., et al. "PyTorch: An Imperative Style, High-Performance Deep Learning Library." NeurIPS, 2019. https://papers.nips.cc/paper/2019/hash/bdbca288fee7f92f2bfa9f7012727740-Abstract.html 4. Griewank, A., and Walther, A. "Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation." 2nd ed., SIAM, 2008. https://epubs.siam.org/doi/book/10.1137/1.9780898717761 5. Chen, T., et al. "TVM: An Automated End-to-End Optimizing Compiler for Deep Learning." OSDI, 2018. https://www.usenix.org/system/files/osdi18-chen.pdf 6. The XLA Authors. "XLA: Optimizing Compiler for Machine Learning." TensorFlow Documentation. https://www.tensorflow.org/xla 7. Ansel, J., et al. "PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation." ASPLOS, 2024. https://pytorch.org/assets/pytorch2-2.pdf 8. Wengert, R. E. "A Simple Automatic Derivative Evaluation Program." Communications of the ACM, 1964. https://doi.org/10.1145/355586.364791 9. Chen, T., Xu, B., Zhang, C., and Guestrin, C. "Training Deep Nets with Sublinear Memory Cost." arXiv, 2016. https://doi.org/10.48550/arXiv.1604.06174