11 The AI Technology Stack

Modern artificial intelligence is not a single technology but a deep, layered stack of cooperating systems. A trained model that classifies images or generates text is the visible tip of a pyramid whose base reaches down through orchestration frameworks, serving runtimes, deep learning libraries, numerical kernels, device drivers, and finally the silicon that performs the arithmetic. Understanding AI in practice means understanding how these layers fit together, where the abstractions leak, and which tradeoffs dominate at each boundary. This chapter surveys the stack from the bottom up, treating each layer as an engineering subsystem with its own constraints, vendors, and design decisions.

11.1 1. Why Think in Layers

The layered view borrows its logic from operating systems and networking. Each layer exposes an interface to the one above it and hides implementation detail below it. A researcher writing a training loop in PyTorch rarely thinks about warp scheduling on a streaming multiprocessor, just as a web developer rarely thinks about TCP retransmission. The abstraction is what makes productivity possible at scale.

The catch is that AI abstractions are unusually leaky. Performance, which is the dominant currency of the field, depends on details that cross many layers at once. Whether a model runs in milliseconds or in seconds, fits in memory or overflows it, and costs cents or dollars per query is determined jointly by the choice of accelerator, the memory layout of the tensors, the kernel implementation, the framework’s execution mode, and the serving strategy. Practitioners therefore need at least a working mental model of the whole stack even when they operate primarily at one level.

A second reason to think in layers is the tension between portability and performance. The higher in the stack you write your code, the more portable it is and the less control you have. The lower you write it, the faster it can run and the more it locks you to a vendor. Every serious AI decision sits somewhere on this spectrum, and naming the layers makes the spectrum legible.

It is worth defining the central terms precisely, because they recur at every layer.

Throughput is the rate of useful work, measured in floating-point operations per second (FLOP/s) for compute, in bytes per second for memory and network, or in requests or tokens per second for serving.
Latency is the time from issuing a request to receiving its result. Throughput and latency are distinct: a system can have high throughput and high latency at once, for example by processing large batches.
Arithmetic intensity is the ratio of arithmetic operations to bytes moved from memory, with units of FLOP per byte. Compared against a device’s ratio of peak compute to peak memory bandwidth, it is the single number that determines whether a kernel is limited by computation or by memory traffic.
Utilization is achieved throughput divided by peak throughput. A model achieving forty percent of an accelerator’s peak FLOP/s has forty percent compute utilization. Low utilization is the normal state of affairs, and most performance engineering is the work of raising it.

These four quantities, plus capacity (how many bytes a device can hold), are the vocabulary in which every tradeoff in this chapter is ultimately denominated.
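As a minimal sketch of how these definitions turn into numbers, the snippet below computes throughput, arithmetic intensity, and utilization for one kernel launch. The measurements and the peak rate are hypothetical values chosen for illustration, and the function names are not taken from any library.

```python
def arithmetic_intensity(flop, bytes_moved):
    """FLOP performed per byte moved between memory and the compute units."""
    return flop / bytes_moved


def utilization(achieved, peak):
    """Achieved throughput as a fraction of peak throughput."""
    return achieved / peak


# Hypothetical measurement of one kernel launch (illustrative numbers only).
flop = 2e12            # floating-point operations performed
bytes_moved = 4e10     # bytes read from and written to device memory
seconds = 5e-3         # wall-clock time, i.e. the latency of this one launch
peak_flops = 1e15      # assumed peak compute rate of the device, in FLOP/s

achieved_flops = flop / seconds                      # compute throughput, FLOP/s
intensity = arithmetic_intensity(flop, bytes_moved)  # FLOP per byte
compute_util = utilization(achieved_flops, peak_flops)

print(f"throughput  = {achieved_flops:.2e} FLOP/s")
print(f"intensity   = {intensity:.0f} FLOP/byte")
print(f"utilization = {compute_util:.0%}")
```

With these assumed inputs the kernel reaches forty percent of peak, the same situation as the utilization example in the definitions above.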

The figure below shows the canonical stack as a vertical diagram. Read it from the bottom, where electrons move, to the top, where users interact.

flowchart TD
    L8["Layer 8: Applications"]
    L7["Layer 7: Orchestration and MLOps"]
    L6["Layer 6: Model Hubs and Serving"]
    L5["Layer 5: Data and Pipeline Tooling"]
    L4["Layer 4: Deep Learning Frameworks"]
    L3["Layer 3: Numerical and Tensor Libraries"]
    L2["Layer 2: Systems and Drivers"]
    L1["Layer 1: Hardware Accelerators"]
    L8 --> L7 --> L6 --> L5 --> L4 --> L3 --> L2 --> L1

Layer	Representative tools and components
8. Applications	chatbots, copilots, RAG apps, agents, recommenders
7. Orchestration and MLOps	Kubernetes, Ray, Airflow, MLflow, Kubeflow, Weights and Biases
6. Model hubs and serving	Hugging Face Hub, vLLM, TGI, Triton Inference Server, TorchServe, KServe
5. Data and pipeline tooling	Parquet, Arrow, Spark, Ray Data, DALI, tf.data, Dask
4. Deep learning frameworks	PyTorch, JAX, TensorFlow, Keras
3. Numerical and tensor libraries	cuBLAS, cuDNN, NCCL, MKL, oneDNN, BLAS, LAPACK, NumPy
2. Systems and drivers	CUDA, ROCm, oneAPI, device drivers, compilers such as NVCC, XLA, LLVM
1. Hardware accelerators	GPUs, TPUs, CPUs, NPUs, interconnect such as NVLink and InfiniBand

Support flows upward: each layer relies on the correctness and availability of everything beneath it. Demand flows downward, following the arrows in the diagram: each layer makes requests that ultimately resolve into floating-point operations on silicon.

11.2 2. Layer 1: Hardware and Accelerators

At the foundation sits the physical machinery that executes arithmetic. The central fact of modern AI is that deep learning is dominated by dense linear algebra, principally matrix multiplication, and that this workload maps poorly onto the general-purpose CPU and beautifully onto massively parallel accelerators.

11.2.1 2.1 GPUs, TPUs, and NPUs

The graphics processing unit (GPU) became the workhorse of deep learning because it offers thousands of arithmetic units operating in parallel, very high memory bandwidth, and specialized matrix engines. NVIDIA’s data center GPUs, the A100, H100, and the Blackwell generation, include tensor cores that perform fused multiply-accumulate operations on small matrix tiles at very low precision, which is exactly the operation that dominates transformer training and inference (1). High-bandwidth memory (HBM) sits beside the compute die and feeds it at terabytes per second, because the binding constraint for many AI kernels is memory bandwidth rather than raw arithmetic throughput.

The tensor processing unit (TPU), designed by Google, is an application-specific integrated circuit built around a large systolic array for matrix multiplication (2). Where a GPU is a flexible parallel processor, a TPU is a more specialized device that trades generality for efficiency on the narrow workload of neural network math. Neural processing units (NPUs) bring similar ideas to edge and mobile devices, prioritizing energy efficiency for on-device inference.

11.2.2 2.2 The Roofline Model: When Memory Is the Limit

The recurring claim that AI is bound by memory rather than arithmetic can be made precise with the roofline model, a simple but powerful performance bound (15). Consider a kernel that performs $W$ floating-point operations while moving $Q$ bytes between the accelerator’s compute units and its memory. Define its arithmetic intensity as

\[ I = \frac{W}{Q} \quad \text{(FLOP per byte)}. \]

Let $\pi$ denote the device’s peak compute rate in FLOP/s and $\beta$ its peak memory bandwidth in bytes/s. The time to run the kernel is bounded below by whichever resource is the bottleneck, so the achievable throughput in FLOP/s obeys

\[ P \;\le\; \min\bigl(\pi,\; \beta \cdot I\bigr). \]

The two terms cross at the ridge point $I^* = \pi / \beta$. A kernel with $I < I^*$ is memory-bound: it cannot use the full compute capacity because it spends its time waiting on memory, and the only way to speed it up is to move fewer bytes (better data reuse, lower-precision storage, operator fusion) or to move them faster. A kernel with $I > I^*$ is compute-bound and is limited by the arithmetic units themselves.

The ridge point of modern accelerators is high. A device with roughly $1 \times 10^{15}$ FLOP/s of low-precision compute and roughly $3 \times 10^{12}$ bytes/s of memory bandwidth has $I^* \approx 330$ FLOP per byte. Many important kernels fall well below this. A matrix multiplication of two large square matrices has intensity that grows with the matrix dimension and is comfortably compute-bound, which is why dense training saturates accelerators well. By contrast, the per-token work of autoregressive decoding, an elementwise activation, or a small batched attention step has low intensity and lands on the memory-bound slope. This single inequality explains why so much of the engineering described in this chapter (paged attention, quantization, operator fusion, and high-bandwidth memory) is aimed at the byte-movement term $Q$ and the bandwidth $\beta$ rather than at raw arithmetic.
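The roofline bound is simple enough to evaluate directly. The sketch below applies $P \le \min(\pi, \beta I)$ to the illustrative device from the text and to two kernels. The intensity estimates are assumptions: the matrix multiplication is credited with ideal reuse (each matrix crosses the memory interface once), and the elementwise activation is assumed to do one FLOP per 16-bit element read and written. Function names are illustrative, not from any library.

```python
def ridge_point(peak_flops, bandwidth):
    """Intensity (FLOP/byte) at which the compute and memory roofs meet."""
    return peak_flops / bandwidth


def attainable_flops(intensity, peak_flops, bandwidth):
    """Roofline bound: P <= min(pi, beta * I)."""
    return min(peak_flops, bandwidth * intensity)


def classify(intensity, peak_flops, bandwidth):
    """Label a kernel by which roof limits it."""
    if intensity < ridge_point(peak_flops, bandwidth):
        return "memory-bound"
    return "compute-bound"


def square_matmul_intensity(n, bytes_per_element=2):
    """Idealized intensity of an n x n matrix product.

    Assumes 2 * n**3 FLOP and perfect reuse: A and B are read once and
    C is written once, for 3 * n**2 elements of traffic.
    """
    return (2 * n**3) / (3 * n**2 * bytes_per_element)


PEAK = 1e15   # FLOP/s, the illustrative device from the text
BW = 3e12     # bytes/s, the illustrative device from the text

kernels = {
    "4096 x 4096 matmul": square_matmul_intensity(4096),
    # Assumed: 1 FLOP per element, 2 bytes read plus 2 bytes written.
    "elementwise activation": 1 / 4,
}

print(f"ridge point: {ridge_point(PEAK, BW):.0f} FLOP/byte")
for name, intensity in kernels.items():
    bound = attainable_flops(intensity, PEAK, BW)
    print(f"{name}: I = {intensity:.2f}, bound = {bound:.2e} FLOP/s, "
          f"{classify(intensity, PEAK, BW)}")
```

Under these assumptions the matrix multiplication sits well above the ridge point, while the activation can reach only a tiny fraction of peak compute regardless of how well its arithmetic is implemented.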

11.2.3 2.3 Interconnect and Scale

A single accelerator is rarely enough for frontier models. Training a large language model requires hundreds or thousands of devices cooperating, which makes the interconnect a first-class part of the hardware layer. NVLink connects GPUs within a server at high bandwidth, while InfiniBand or specialized Ethernet fabrics connect servers within a cluster. The efficiency of distributed training depends heavily on how fast gradients can be exchanged across this fabric, so interconnect topology is as important as per-chip throughput.

The cost of synchronization can be quantified. Data-parallel training requires an all-reduce of the gradients after every step, summing each parameter’s gradient across all $p$ devices and returning the result to each. A naive implementation would send the full gradient buffer to a central node and incur traffic proportional to $p$. The standard ring all-reduce instead arranges devices in a ring and pipelines the reduction, so that each device sends and receives only

\[ 2 \,\frac{p-1}{p}\, M \;\approx\; 2M \quad \text{bytes} \]

where $M$ is the size of the gradient buffer in bytes. The remarkable property is that this per-device cost is essentially independent of $p$: it approaches $2M$ as the cluster grows, so bandwidth, not device count, sets the floor on communication time (16). With link bandwidth $\beta_{\text{net}}$, the all-reduce takes roughly $2M / \beta_{\text{net}}$ seconds, and the fraction of each training step lost to communication is this time divided by the step time. Keeping that fraction small is the reason interconnect bandwidth is engineered as aggressively as compute, and the reason gradient compression, overlapping communication with backward computation, and topology-aware collective algorithms all exist.
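The per-device traffic formula can be checked by simulating the algorithm. The following sketch runs a ring all-reduce over plain Python lists in two phases (reduce-scatter, then all-gather) and counts how many elements each device sends. It is a teaching model, not how NCCL or Horovod are implemented, and the gradient size and link bandwidth in the timing estimate are hypothetical.

```python
def ring_allreduce(buffers):
    """Sum equal-length buffers across devices with a simulated ring all-reduce.

    Returns the per-device results and the number of elements each device sent.
    """
    p = len(buffers)
    n = len(buffers[0])
    if n % p != 0:
        raise ValueError("buffer length must be divisible by the device count")
    c = n // p                           # elements per chunk
    data = [list(b) for b in buffers]    # private copy for each device
    sent = [0] * p

    def span(k):
        return slice(k * c, (k + 1) * c)

    # Phase 1, reduce-scatter: after p - 1 steps, device i holds the
    # complete sum of chunk (i + 1) % p.
    for step in range(p - 1):
        outgoing = [((i - step) % p, data[i][span((i - step) % p)])
                    for i in range(p)]
        for i, (k, payload) in enumerate(outgoing):
            dst = (i + 1) % p
            data[dst][span(k)] = [a + b for a, b in zip(data[dst][span(k)], payload)]
            sent[i] += c

    # Phase 2, all-gather: circulate the finished chunks around the ring.
    for step in range(p - 1):
        outgoing = [((i + 1 - step) % p, data[i][span((i + 1 - step) % p)])
                    for i in range(p)]
        for i, (k, payload) in enumerate(outgoing):
            data[(i + 1) % p][span(k)] = payload
            sent[i] += c

    return data, sent


def ring_bytes_per_device(m_bytes, p):
    """Closed form from the text: 2 * (p - 1) / p * M."""
    return 2 * (p - 1) / p * m_bytes


devices = 4
buffers = [[float(i + 1)] * 8 for i in range(devices)]
results, sent = ring_allreduce(buffers)
print(results[0], sent)

# Hypothetical timing estimate: 1.4e10 bytes of gradients, 1e11 bytes/s links.
for p in (8, 1024):
    seconds = ring_bytes_per_device(1.4e10, p) / 1e11
    print(f"p = {p}: about {seconds:.3f} s per all-reduce")
```

Counting one list element as one unit of traffic, the simulated send count matches the closed form, and the timing loop shows the cost barely moving between 8 and 1024 devices.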

11.2.4 2.4 Tradeoffs at the Hardware Layer

The dominant tradeoff is generality versus efficiency. CPUs are the most general and the least efficient per watt for dense linear algebra. GPUs occupy a productive middle ground, flexible enough to run arbitrary research code yet fast enough for production. TPUs and other ASICs push toward maximum efficiency at the cost of flexibility and ecosystem breadth. A second tradeoff is capital versus operating cost: owning accelerators is expensive up front but can be cheaper at scale than renting from a cloud provider, while renting offers elasticity at a premium. Memory capacity is the constraint practitioners feel most acutely, because a model that does not fit on a device forces either smaller batches, model parallelism, or quantization.

11.3 3. Layer 2: Systems and Drivers

Silicon is useless without software to drive it. The systems layer comprises the device drivers, low-level runtimes, and compilers that let higher software issue work to accelerators.

11.3.1 3.1 CUDA and the NVIDIA Moat

CUDA is NVIDIA’s parallel computing platform and programming model, and it is arguably the single most important reason the company dominates AI (3). CUDA exposes the GPU as a programmable device through a C-like language, a runtime, and a vast ecosystem of optimized libraries. Crucially, nearly every deep learning framework targets CUDA first and best. This creates a powerful network effect: researchers write for CUDA because the tooling is mature, and the tooling is mature because researchers write for CUDA. The result is a software moat that is harder to cross than any hardware advantage.

11.3.2 3.2 ROCm, oneAPI, and the Challengers

AMD’s answer is ROCm, an open-source platform that aims for CUDA parity and offers HIP, a portability layer that lets CUDA code be translated to run on AMD hardware (4). Intel’s oneAPI pursues a similar goal of a unified, cross-vendor programming model. These efforts have narrowed the gap, particularly for inference and for the most common operations, but they still lag CUDA in breadth of optimized kernels and in the long tail of community support. The strategic stakes are high, because a credible open alternative would loosen NVIDIA’s grip on pricing and supply.

11.3.3 3.3 Compilers as a Battleground

Increasingly, the systems layer includes domain-specific compilers that transform high-level tensor programs into optimized device code. XLA compiles computations from JAX and TensorFlow, fusing operations and tuning memory layout for the target accelerator. Triton, a Python-embedded language from OpenAI, lets developers write custom GPU kernels at a higher level than raw CUDA while approaching hand-tuned performance. These compilers matter because the gap between a naive kernel and a fused, well-tiled one can be an order of magnitude in speed.

11.4 4. Layer 3: Numerical and Tensor Libraries

Above the driver sit the numerical libraries that implement the actual mathematical primitives. Frameworks do not reimplement matrix multiplication; they call into these libraries.

On NVIDIA hardware, cuBLAS provides dense linear algebra, cuDNN provides tuned implementations of convolution, attention, and other neural network primitives, and NCCL provides the collective communication operations (all-reduce, all-gather) that distributed training depends on (5). On CPUs, the long lineage of BLAS and LAPACK, accelerated by vendor libraries such as Intel MKL and oneDNN, plays the analogous role. NumPy, although usually thought of as a Python convenience, delegates its linear algebra to these CPU kernels and established the n-dimensional array as the lingua franca of scientific computing in Python (6).

The tradeoff at this layer is invisible to most users by design, but it is enormous in effect. These libraries are written and maintained by performance specialists who exploit cache hierarchies, vector instructions, and accelerator microarchitecture in ways that ordinary application code never could. The cost is that they are closed or semi-closed, vendor-specific, and slow to support new hardware. A new accelerator is only as useful as its cuDNN-equivalent, which is why building these kernels is one of the hardest parts of bringing new silicon to market.

11.5 5. Layer 4: Deep Learning Frameworks

The framework layer is where most practitioners spend their time. A framework provides three things: an n-dimensional tensor type with operations that dispatch to the numerical libraries below, automatic differentiation so that gradients need not be derived by hand, and a set of building blocks for constructing and training models.

The middle item deserves a precise statement, because it is the mathematical engine of the whole field. Automatic differentiation (autodiff) is not numerical differentiation by finite differences, and it is not symbolic differentiation that manipulates formulas. It is the systematic application of the chain rule to the elementary operations recorded while a program runs. A framework represents a model as a composition of differentiable primitives $f = f_L \circ \cdots \circ f_2 \circ f_1$. By the chain rule the Jacobian of the whole is the product of the per-layer Jacobians,

\[ J_f = J_{f_L} \, J_{f_{L-1}} \cdots J_{f_1}. \]

Training needs the gradient of a scalar loss with respect to many parameters, which is a vector-Jacobian product. Reverse-mode autodiff, known in this setting as backpropagation, evaluates this product from the output side: it starts with the gradient of the loss with respect to the output, multiplies it by $J_{f_L}$, then by $J_{f_{L-1}}$, and so on, propagating a single vector backward through the recorded operations. Its cost is a small constant multiple of the cost of the forward evaluation, independent of the number of parameters, which is exactly why it scales to networks with billions of weights. The price is memory: the intermediate activations needed for the backward pass must be stored, which makes activation memory a first-order constraint and motivates techniques such as gradient checkpointing that trade recomputation for storage. This asymmetry, cheap gradients but expensive activation memory, is felt at every layer above.
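To make the mechanism concrete, here is a minimal sketch of reverse-mode autodiff on scalars. The `Var` class is hypothetical and far simpler than what PyTorch or JAX do internally, but it shows the essential steps: record each operation and its local derivative during the forward pass, then walk the record backward and accumulate gradients by the chain rule.

```python
import math


class Var:
    """A scalar that remembers how it was computed."""

    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        self.parents = parents   # pairs of (input Var, local derivative)

    def __add__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    def tanh(self):
        t = math.tanh(self.value)
        return Var(t, ((self, 1.0 - t * t),))

    def backward(self):
        """Propagate d(self)/d(node) to every node that self depends on."""
        order, seen = [], set()

        def visit(node):
            if id(node) in seen:
                return
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)       # inputs are appended before their users

        visit(self)
        self.grad = 1.0
        for node in reversed(order):  # every user is handled before its inputs
            for parent, local in node.parents:
                parent.grad += node.grad * local


# Forward pass: loss = tanh(w * x + b) ** 2, with the graph recorded as it runs.
w, x, b = Var(0.5), Var(2.0), Var(-0.3)
y = (w * x + b).tanh()
loss = y * y

loss.backward()
print(w.grad, x.grad, b.grad)
```

Note that every intermediate `Var` stays alive until `backward` has run, which is the scalar-sized version of the activation memory cost described above.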

11.5.1 5.1 PyTorch

PyTorch has become the dominant research framework and is increasingly dominant in production as well (7). Its defining design choice was eager, define-by-run execution: the computational graph is built dynamically as Python executes, which makes models feel like ordinary imperative programs and makes debugging natural. This came at a historical cost in performance and deployment, which the project has addressed with torch.compile, a graph-capture and compilation system that recovers much of the speed of static graphs without sacrificing the eager programming model.

11.5.2 5.2 JAX

JAX takes a different philosophy rooted in functional programming (8). It offers composable function transformations: grad for automatic differentiation, jit for just-in-time compilation through XLA, vmap for automatic vectorization, and pmap or its successors for parallelism across devices. JAX favors pure functions and explicit state, which suits large-scale, highly parallel research, and it has become the framework of choice for much frontier model work, particularly inside Google and in the academic community studying scaling.

11.5.3 5.3 TensorFlow and Keras

TensorFlow, the framework that catalyzed the deep learning industry, pioneered the static-graph approach in which the full computation is defined before execution, enabling aggressive optimization and straightforward deployment to servers, mobile, and the browser (9). Keras, now a multi-backend high-level API, provides an accessible, layer-oriented interface and runs atop TensorFlow, JAX, or PyTorch. While TensorFlow’s share of new research has declined, its production tooling and deployment story remain strong.

11.5.4 5.4 Framework Tradeoffs

The core tension is eager flexibility versus compiled performance, and the field has largely converged on a synthesis: write in an eager, Pythonic style, then compile hot paths for speed. The remaining differentiators are ecosystem (PyTorch’s library breadth is unmatched), parallelism model (JAX’s transformations are uniquely composable), and deployment maturity. Lock-in is real but softening, since interchange formats and multi-backend tools let models move between frameworks more easily than before.

11.6 6. Layer 5: Data and Pipeline Tooling

Models are only as good as the data fed to them, and at scale, getting bytes from storage to the accelerator without starving it becomes a serious engineering problem. The data layer handles storage formats, transformation, and high-throughput loading.

Columnar formats such as Apache Parquet and the in-memory Apache Arrow standard allow efficient storage and zero-copy sharing of large tabular datasets (10). Distributed processing engines such as Apache Spark, Dask, and Ray Data transform and clean data across clusters before it ever reaches training. At the boundary with the accelerator, loaders such as PyTorch’s DataLoader, tf.data, and NVIDIA DALI overlap data preparation with computation so that the expensive GPU is never idle waiting for the next batch.

The governing principle here is that the pipeline must keep the accelerator saturated. An accelerator costing many dollars per hour that sits idle waiting for data is pure waste, so the data layer is engineered around throughput, prefetching, and parallel decoding. The tradeoff is complexity: sophisticated pipelines with sharding, caching, and augmentation are powerful but brittle, and a surprising fraction of real-world training failures trace to the data path rather than the model.
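The overlap of data preparation with computation can be sketched with a background thread and a bounded queue. This is a toy stand-in for what loaders such as DataLoader, tf.data, or DALI do, not their implementation. The `prefetch` helper, the `slow_batches` generator, and the sleep durations are all hypothetical.

```python
import queue
import threading
import time

_DONE = object()


class _Failure:
    """Carries an exception from the producer thread to the consumer."""

    def __init__(self, exc):
        self.exc = exc


def prefetch(batches, depth=2):
    """Yield items from `batches` while a background thread prepares the next ones.

    `depth` bounds how many prepared batches may wait in memory. If the
    consumer stops early, the daemon producer thread simply stays blocked.
    """
    q = queue.Queue(maxsize=depth)

    def producer():
        try:
            for batch in batches:
                q.put(batch)
            q.put(_DONE)
        except Exception as exc:     # surface loader errors in the consumer
            q.put(_Failure(exc))

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is _DONE:
            return
        if isinstance(item, _Failure):
            raise item.exc
        yield item


def slow_batches(n, delay):
    """Stand-in for reading and decoding: each batch takes `delay` seconds."""
    for i in range(n):
        time.sleep(delay)
        yield i


consumed = []
for batch in prefetch(slow_batches(5, 0.01), depth=2):
    time.sleep(0.01)                 # stand-in for one training step
    consumed.append(batch)
print(consumed)
```

Because the next batch is decoded while the current step runs, the loop time approaches the larger of the two delays rather than their sum. The bounded queue is the design choice that keeps the producer from racing ahead and exhausting host memory.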

11.7 7. Layer 6: Model Hubs and Serving

Once a model exists, it must be distributed and then executed on behalf of users. These are distinct concerns, and the stack provides distinct tooling for each.

11.7.1 7.1 Model Hubs

The Hugging Face Hub has become the de facto registry for sharing pretrained models, datasets, and demos, hosting hundreds of thousands of models with versioning, model cards, and a standardized loading interface (11). The companion transformers library turned the use of a state-of-the-art model into a few lines of code, which dramatically lowered the barrier to applied AI. The hub model mirrors the package registries of software engineering, bringing the same benefits of reuse and the same risks around provenance, licensing, and supply-chain trust.

11.7.2 7.2 Serving Runtimes

Serving a large model efficiently is a specialized problem, particularly for autoregressive language models whose generation is memory-bound and sequential. The reason is a direct consequence of the roofline model. Generating one token of output requires reading every model weight from memory exactly once to compute a single forward pass. For a model with $N$ parameters stored at $b$ bytes each, decoding a single sequence moves about $bN$ bytes but performs only about $2N$ FLOP, an arithmetic intensity near $2/b$, which is tiny. Single-stream decoding therefore sits far down the memory-bound slope, and its speed is set almost entirely by memory bandwidth: the time per token is approximately $bN / \beta$, where $\beta$ is the device bandwidth.

This analysis also reveals the cure. The weights are read once per forward pass regardless of how many sequences are processed together, so batching many requests into one forward pass amortizes that fixed byte-movement cost across many tokens, raising arithmetic intensity and pushing the kernel back toward the compute-bound region where the accelerator is well used. The obstacle is the per-request key-value cache, the stored attention keys and values that grow with sequence length and consume the memory that batching needs. Purpose-built inference servers address exactly this. vLLM introduced PagedAttention, which manages the key-value cache like virtual memory in pages, eliminates fragmentation, and allows many requests to share GPU memory efficiently, which in turn allows much larger batches and sharply higher throughput (12). Hugging Face Text Generation Inference, NVIDIA Triton Inference Server, TorchServe, and KServe provide production features such as continuous (dynamic) batching, multi-model hosting, and standardized inference protocols.
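The paging idea can be illustrated with a toy allocator. The sketch below is not vLLM's implementation, and the class and method names are hypothetical. It shows only the bookkeeping: the cache is divided into fixed-size blocks, each sequence owns a block table mapping its logical positions to physical blocks, and a block is claimed only when the previous one fills.

```python
class PagedKVAllocator:
    """Toy block allocator for a key-value cache (bookkeeping only)."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size              # tokens per block
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}                    # seq_id -> physical block ids
        self.lengths = {}                         # seq_id -> tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token; return (block id, offset)."""
        table = self.block_tables.get(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length == len(table) * self.block_size:    # no room in the last block
            if not self.free_blocks:
                raise MemoryError("KV cache is full")
            table = table + [self.free_blocks.pop()]
        self.block_tables[seq_id] = table
        self.lengths[seq_id] = length + 1
        return table[length // self.block_size], length % self.block_size

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the free list."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


kv = PagedKVAllocator(num_blocks=4, block_size=2)
for _ in range(3):
    kv.append_token("a")         # three tokens need two blocks
for _ in range(2):
    kv.append_token("b")         # two tokens need one block

free_while_running = len(kv.free_blocks)
kv.free("a")                     # sequence "a" finishes
free_after_release = len(kv.free_blocks)
print(free_while_running, free_after_release)
```

Since blocks are claimed on demand and need not be contiguous, the unused space per sequence is at most one partially filled block, which is what lets many requests share the same memory.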

The dominant tradeoffs are latency versus throughput (batching more requests raises throughput but can raise the time an individual request waits and is processed) and cost versus quality (quantization and smaller models cut cost at some risk to accuracy).
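The latency and throughput tradeoff of batching follows from the same roofline reasoning. The sketch below estimates the time of one decoding step as the larger of the weight-read time and the arithmetic time. It deliberately ignores key-value cache traffic, activations, and scheduling overhead, and the model size and device figures are illustrative assumptions.

```python
def decode_step_seconds(batch, num_params, bytes_per_param, peak_flops, bandwidth):
    """Lower bound on one decoding step that emits one token per sequence."""
    memory_time = num_params * bytes_per_param / bandwidth  # weights read once per step
    compute_time = 2 * num_params * batch / peak_flops      # about 2 FLOP per weight per sequence
    return max(memory_time, compute_time)


def decode_intensity(batch, bytes_per_param):
    """FLOP per byte of weight traffic: 2 * B / b."""
    return 2 * batch / bytes_per_param


N = 7e9          # parameters (illustrative)
B_PER_PARAM = 2  # bytes per weight, 16-bit storage
PEAK = 1e15      # FLOP/s (illustrative)
BW = 3e12        # bytes/s (illustrative)

for batch in (1, 8, 64, 512):
    step = decode_step_seconds(batch, N, B_PER_PARAM, PEAK, BW)
    print(f"batch {batch:4d}: intensity {decode_intensity(batch, B_PER_PARAM):6.1f} FLOP/byte, "
          f"step {step * 1e3:5.2f} ms, {batch / step:9.0f} tokens/s")
```

In this simplified model the step time stays flat while the batch grows, so throughput rises almost for free, until the arithmetic time overtakes the weight-read time. Beyond that point each additional sequence lengthens the step that every request in the batch must wait for.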

11.7.3 7.3 Worked Example: Reading the Stack Through One Number

Consider serving a 7 billion parameter language model on a single accelerator with peak bandwidth $\beta = 3 \times 10^{12}$ bytes/s. The example is illustrative; the point is the method, not the exact figures.

Store the weights in 16-bit precision, so $b = 2$ bytes and the weights occupy $bN = 2 \times 7 \times 10^9 = 1.4 \times 10^{10}$ bytes, about 14 GB. A single decoding stream must read all of this per token, so the lower bound on time per token is

\[ t \;\approx\; \frac{bN}{\beta} \;=\; \frac{1.4 \times 10^{10}}{3 \times 10^{12}} \;\approx\; 4.7 \times 10^{-3}\ \text{s}, \]

roughly 214 tokens per second as a bandwidth ceiling for one stream, before any overhead. Now quantize the weights to 8 bits, $b = 1$. The bytes moved per token halve, the weights occupy about 7 GB, and the bandwidth ceiling doubles to about 429 tokens per second. This is the concrete mechanism behind the claim that quantization helps inference: its primary benefit for memory-bound decoding is not fewer FLOP but fewer bytes moved, which is the quantity the roofline says actually governs the latency. The freed memory also leaves room for a larger key-value cache and hence larger batches, compounding the throughput gain. The same single number, arithmetic intensity, that classified the kernel in Layer 1 thus dictates a Layer 6 serving decision, an illustration of how tightly the layers are coupled.
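The arithmetic of this worked example is short enough to reproduce in a few lines, which makes it easy to rerun for other model sizes or devices. The function name is illustrative, and the figures are the ones used above.

```python
def decode_ceiling(num_params, bytes_per_param, bandwidth):
    """Return (weight bytes, seconds per token, tokens per second) for one stream."""
    weight_bytes = num_params * bytes_per_param
    seconds_per_token = weight_bytes / bandwidth
    return weight_bytes, seconds_per_token, 1.0 / seconds_per_token


NUM_PARAMS = 7e9     # 7 billion parameters
BANDWIDTH = 3e12     # bytes/s

for label, bytes_per_param in (("16-bit", 2), ("8-bit", 1)):
    size, t, rate = decode_ceiling(NUM_PARAMS, bytes_per_param, BANDWIDTH)
    print(f"{label}: {size / 1e9:.0f} GB of weights, "
          f"{t * 1e3:.2f} ms per token, ceiling {rate:.0f} tokens/s")
```

These are ceilings for a single stream. Measured rates will be lower once key-value cache reads and kernel launch overhead are included.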

11.8 8. Layer 7: Orchestration and MLOps

Production AI is a continuous process, not a one-time artifact, and the orchestration layer manages that process across teams and time. This is the domain of MLOps, the application of DevOps discipline to machine learning.

Kubernetes provides the general substrate for running containerized workloads across clusters, and Ray offers a Python-native framework for scaling training, tuning, and serving (13). Workflow engines such as Apache Airflow, Kubeflow Pipelines, and Metaflow schedule the multi-step pipelines that ingest data, train, evaluate, and deploy. Experiment tracking and registry tools such as MLflow and Weights and Biases record the parameters, metrics, and artifacts of every run so that results are reproducible and models are governed (14). Feature stores and monitoring systems close the loop by serving consistent features and detecting drift once a model is live.

The central insight of this layer is that machine learning systems decay. Data distributions shift, dependencies change, and yesterday’s accurate model degrades silently. The orchestration layer exists to make training reproducible, deployment repeatable, and degradation observable. Its tradeoff is the familiar one of platform engineering: heavyweight, integrated MLOps platforms reduce operational toil but impose process and lock-in, while lightweight, composed tooling stays flexible at the cost of more glue code and more discipline.

11.9 9. Layer 8: The Application Layer

At the top sits the layer that delivers value to people: the chatbots, coding copilots, retrieval-augmented question answering systems, autonomous agents, recommendation engines, and search experiences that constitute the product. With the rise of capable foundation models served behind APIs, a great deal of application development now happens here without any direct contact with the lower layers at all.

This layer has developed its own emerging stack. Orchestration libraries such as LangChain and LlamaIndex compose model calls, tool use, and memory. Vector databases such as Pinecone, Weaviate, and the open-source FAISS library store embeddings for semantic retrieval, the backbone of retrieval-augmented generation. Protocols such as the Model Context Protocol are beginning to standardize how applications connect models to external tools and data sources. The tradeoff at the application layer is build versus buy taken to its logical end: a team can call a hosted model and own almost nothing of the stack below, gaining speed and giving up control, cost predictability, and data sovereignty, or it can self-host and own everything, inverting every term of that bargain.
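The retrieval step at the heart of retrieval-augmented generation reduces to a nearest-neighbor search over embedding vectors. The sketch below does this by brute force with cosine similarity. The three-dimensional vectors and document names are made up for illustration; a real system would obtain embeddings from a model, and libraries such as FAISS typically rely on approximate indexes rather than a full scan.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, index, k=2):
    """Return the ids of the k stored vectors most similar to the query."""
    ranked = sorted(index, key=lambda doc_id: cosine(query, index[doc_id]),
                    reverse=True)
    return ranked[:k]


# Hypothetical embeddings; real ones have hundreds or thousands of dimensions.
index = {
    "gpu-guide": [0.9, 0.1, 0.0],
    "cooking": [0.0, 0.2, 0.9],
    "cuda-notes": [0.8, 0.3, 0.1],
}
query = [1.0, 0.2, 0.0]

context_ids = top_k(query, index, k=2)
print(context_ids)
```

The documents returned here would be inserted into the model's prompt as context, which is the step that connects the vector store to the serving runtime further down the stack.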

11.10 10. How the Layers Fit Together

The power of the stack comes from its composition, and a single inference request illustrates the cooperation. A user types a question into a chat application at layer 8. The application embeds the query, retrieves context from a vector store, and then issues a request to a serving runtime at layer 6. The runtime, perhaps vLLM, schedules the request, manages its key-value cache, and invokes a model defined in a framework at layer 4. The framework dispatches the model’s matrix multiplications and attention operations to numerical libraries at layer 3, such as cuBLAS and cuDNN. Those libraries issue work through the CUDA runtime at layer 2, which drives the GPU at layer 1, where tensor cores finally multiply the numbers. The generated tokens travel back up the same chain. Meanwhile, layer 7 has provisioned the hardware, deployed the model, and is recording metrics for the whole transaction.

The diagram below traces that request as it descends the stack and the generated tokens as they return.

flowchart TD
    U["User question"]
    A["Layer 8: chat app embeds query and retrieves context"]
    S["Layer 6: serving runtime schedules request and manages KV cache"]
    F["Layer 4: framework runs model forward pass"]
    N["Layer 3: cuBLAS and cuDNN compute matmul and attention"]
    D["Layer 2: CUDA runtime issues work to the device"]
    H["Layer 1: tensor cores multiply the numbers"]
    U --> A --> S --> F --> N --> D --> H
    H -. "generated tokens travel back up" .-> U

Two cross-cutting principles govern the whole edifice. First, the binding constraint is usually memory, not arithmetic, which is why so much engineering at every layer (quantization, paged attention, HBM, prefetching pipelines) targets the movement and storage of data rather than the speed of computation. Second, abstractions leak in the direction of performance: a practitioner can ignore the lower layers right up until performance, cost, or memory forces them to look down, at which point a working model of the entire stack becomes indispensable. The stack is therefore best understood not as a set of independent choices but as a coupled system in which decisions at one layer ripple through all the others.

11.11 11. When to Look Down the Stack, and Common Pitfalls

A practitioner does not need to optimize every layer at once. The discipline is knowing which layer the current problem lives in, because effort spent at the wrong layer is wasted.

When to descend. Stay high in the stack by default, since higher layers are more portable and more productive. Descend only when a measured constraint forces it. If latency or cost is unacceptable, first identify whether the workload is memory-bound or compute-bound using the arithmetic-intensity test, because the two require opposite remedies. A memory-bound decode is helped by quantization, fusion, and a better key-value cache (Layers 1, 3, and 6), and not at all by a faster matrix-multiply algorithm. A compute-bound training step is helped by lower-precision tensor-core arithmetic and better tiling, and barely at all by bandwidth tricks. If a single device cannot hold the model, the choice between quantization, model parallelism, and a larger accelerator is again a Layer 1 through Layer 6 decision driven by memory capacity.

Common pitfalls.

Profiling the wrong layer. Blaming the framework for slowness that is actually a starved data pipeline (Layer 5) or a memory-bound kernel (Layer 1) is the most frequent mistake. Always measure utilization before optimizing; low GPU utilization usually points to the data path or to a memory bound, not to the model code.
Ignoring the data path. A surprising fraction of real training failures and slowdowns trace to Layer 5 rather than the model. An accelerator idle while it waits for the next batch is pure waste.
Premature low-level optimization. Writing custom kernels before confirming the kernel is the bottleneck trades large effort for small gains. Reach for compilers (torch.compile, XLA, Triton) before hand-written device code.
Mistaking quantization’s mechanism. Quantization speeds memory-bound inference chiefly by moving fewer bytes, not by reducing arithmetic. Expecting it to help a compute-bound workload, or applying it without measuring the accuracy cost, leads to disappointment.
Underestimating lock-in and decay. Choosing the lowest, most vendor-specific layer for a marginal speedup can trap a project on one accelerator. Deploying without Layer 7 monitoring lets a model degrade silently as data drifts.

The unifying lesson is diagnostic: name the layer, measure the binding resource, and apply the remedy that matches it.

11.12 References

1. NVIDIA. “NVIDIA H100 Tensor Core GPU Architecture.” NVIDIA Corporation. https://www.nvidia.com/en-us/data-center/h100/
2. Jouppi, N. P., et al. “In-Datacenter Performance Analysis of a Tensor Processing Unit.” Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA), 2017. https://arxiv.org/abs/1704.04760
3. NVIDIA. “CUDA Toolkit Documentation.” NVIDIA Corporation. https://docs.nvidia.com/cuda/
4. AMD. “ROCm Open Software Platform Documentation.” Advanced Micro Devices. https://rocm.docs.amd.com/
5. NVIDIA. “cuDNN, cuBLAS, and NCCL Developer Libraries.” NVIDIA Corporation. https://developer.nvidia.com/cudnn
6. Harris, C. R., et al. “Array Programming with NumPy.” Nature, vol. 585, 2020, pp. 357-362. https://www.nature.com/articles/s41586-020-2649-2
7. Paszke, A., et al. “PyTorch: An Imperative Style, High-Performance Deep Learning Library.” Advances in Neural Information Processing Systems (NeurIPS), 2019. https://arxiv.org/abs/1912.01703
8. Bradbury, J., et al. “JAX: Composable Transformations of Python and NumPy Programs.” Google Research. https://github.com/jax-ml/jax
9. Abadi, M., et al. “TensorFlow: A System for Large-Scale Machine Learning.” Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2016. https://www.tensorflow.org/
10. Apache Software Foundation. “Apache Arrow: A Cross-Language Development Platform for In-Memory Data.” https://arrow.apache.org/
11. Wolf, T., et al. “Transformers: State-of-the-Art Natural Language Processing.” Proceedings of EMNLP: System Demonstrations, 2020. https://huggingface.co/docs/hub/
12. Kwon, W., et al. “Efficient Memory Management for Large Language Model Serving with PagedAttention.” Proceedings of the 29th ACM Symposium on Operating Systems Principles (SOSP), 2023. https://arxiv.org/abs/2309.06180
13. Moritz, P., et al. “Ray: A Distributed Framework for Emerging AI Applications.” Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2018. https://arxiv.org/abs/1712.05889
14. Zaharia, M., et al. “Accelerating the Machine Learning Lifecycle with MLflow.” IEEE Data Engineering Bulletin, vol. 41, no. 4, 2018. https://mlflow.org/
15. Williams, S., Waterman, A., and Patterson, D. “Roofline: An Insightful Visual Performance Model for Multicore Architectures.” Communications of the ACM, vol. 52, no. 4, 2009, pp. 65-76. https://doi.org/10.1145/1498765.1498785
16. Sergeev, A., and Del Balso, M. “Horovod: Fast and Easy Distributed Deep Learning in TensorFlow.” arXiv preprint, 2018. https://arxiv.org/abs/1802.05799

# The AI Technology Stack Modern artificial intelligence is not a single technology but a deep, layered stack of cooperating systems. A trained model that classifies images or generates text is the visible tip of a pyramid whose base reaches down through orchestration frameworks, serving runtimes, deep learning libraries, numerical kernels, device drivers, and finally the silicon that performs the arithmetic. Understanding AI in practice means understanding how these layers fit together, where the abstractions leak, and which tradeoffs dominate at each boundary. This chapter surveys the stack from the bottom up, treating each layer as an engineering subsystem with its own constraints, vendors, and design decisions. ## 1. Why Think in Layers The layered view borrows its logic from operating systems and networking. Each layer exposes an interface to the one above it and hides implementation detail below it. A researcher writing a training loop in PyTorch rarely thinks about warp scheduling on a streaming multiprocessor, just as a web developer rarely thinks about TCP retransmission. The abstraction is what makes productivity possible at scale. The catch is that AI abstractions are unusually leaky. Performance, which is the dominant currency of the field, depends on details that cross many layers at once. A model that runs in milliseconds or in seconds, that fits in memory or overflows it, that costs cents or dollars per query, is determined jointly by the choice of accelerator, the memory layout of the tensors, the kernel implementation, the framework's execution mode, and the serving strategy. Practitioners therefore need at least a working mental model of the whole stack even when they operate primarily at one level. A second reason to think in layers is portability versus performance. The higher you write your code, the more portable it is and the less control you have. The lower you write it, the faster it can run and the more it locks you to a vendor. 
Every serious AI decision sits somewhere on this spectrum, and naming the layers makes the spectrum legible. It is worth defining the central terms precisely, because they recur at every layer. - **Throughput** is the rate of useful work, measured in floating-point operations per second (FLOP/s) for compute, in bytes per second for memory and network, or in requests or tokens per second for serving. - **Latency** is the time from issuing a request to receiving its result. Throughput and latency are distinct: a system can have high throughput and high latency at once, for example by processing large batches. - **Arithmetic intensity** is the ratio of arithmetic operations to bytes moved from memory, with units of FLOP per byte. It is the single number that determines whether a kernel is limited by computation or by memory traffic. - **Utilization** is achieved throughput divided by peak throughput. A model achieving forty percent of an accelerator's peak FLOP/s has forty percent compute utilization. Low utilization is the normal state of affairs, and most performance engineering is the work of raising it. These four quantities, plus capacity (how many bytes a device can hold), are the vocabulary in which every tradeoff in this chapter is ultimately denominated. The figure below shows the canonical stack as a vertical diagram. Read it from the bottom, where electrons move, to the top, where users interact. ```{mermaid} flowchart TD L8["Layer 8: Applications"] L7["Layer 7: Orchestration and MLOps"] L6["Layer 6: Model Hubs and Serving"] L5["Layer 5: Data and Pipeline Tooling"] L4["Layer 4: Deep Learning Frameworks"] L3["Layer 3: Numerical and Tensor Libraries"] L2["Layer 2: Systems and Drivers"] L1["Layer 1: Hardware Accelerators"] L8 --> L7 --> L6 --> L5 --> L4 --> L3 --> L2 --> L1 ``` | Layer | Representative tools and components | |---|---| | 8. Applications | chatbots, copilots, RAG apps, agents, recommenders | | 7. 
Orchestration and MLOps | Kubernetes, Ray, Airflow, MLflow, Kubeflow, Weights and Biases | | 6. Model hubs and serving | Hugging Face Hub, vLLM, TGI, Triton, TorchServe, KServe | | 5. Data and pipeline tooling | Parquet, Arrow, Spark, Ray Data, DALI, tf.data, Dask | | 4. Deep learning frameworks | PyTorch, JAX, TensorFlow, Keras | | 3. Numerical and tensor libraries | cuBLAS, cuDNN, NCCL, MKL, oneDNN, BLAS, LAPACK, NumPy | | 2. Systems and drivers | CUDA, ROCm, oneAPI, device drivers, compilers such as NVCC, XLA, LLVM | | 1. Hardware accelerators | GPUs, TPUs, CPUs, NPUs, interconnect such as NVLink and InfiniBand | The arrows of dependency point upward: each layer assumes the correctness and availability of everything beneath it. The arrows of demand point downward: each layer makes requests that ultimately resolve into floating-point operations on silicon. ## 2. Layer 1: Hardware and Accelerators At the foundation sits the physical machinery that executes arithmetic. The central fact of modern AI is that deep learning is dominated by dense linear algebra, principally matrix multiplication, and that this workload maps poorly onto the general-purpose CPU and beautifully onto massively parallel accelerators. ### 2.1 GPUs, TPUs, and NPUs The graphics processing unit (GPU) became the workhorse of deep learning because it offers thousands of arithmetic units operating in parallel, very high memory bandwidth, and specialized matrix engines. NVIDIA's data center GPUs, the A100, H100, and the Blackwell generation, include tensor cores that perform fused multiply-accumulate operations on small matrix tiles at very low precision, which is exactly the operation that dominates transformer training and inference (1). High-bandwidth memory (HBM) sits beside the compute die and feeds it at terabytes per second, because the binding constraint for many AI kernels is memory bandwidth rather than raw arithmetic throughput. 
The tensor processing unit (TPU), designed by Google, is an application-specific integrated circuit built around a large systolic array for matrix multiplication (2). Where a GPU is a flexible parallel processor, a TPU is a more specialized device that trades generality for efficiency on the narrow workload of neural network math. Neural processing units (NPUs) bring similar ideas to edge and mobile devices, prioritizing energy efficiency for on-device inference. ### 2.2 The Roofline Model: When Memory Is the Limit The recurring claim that AI is bound by memory rather than arithmetic can be made precise with the roofline model, a simple but powerful performance bound (15). Consider a kernel that performs $W$ floating-point operations while moving $Q$ bytes between the accelerator's compute units and its memory. Define its arithmetic intensity as $$ I = \frac{W}{Q} \quad \text{(FLOP per byte)}. $$ Let $\pi$ denote the device's peak compute rate in FLOP/s and $\beta$ its peak memory bandwidth in bytes/s. The time to run the kernel is bounded below by whichever resource is the bottleneck, so the achievable throughput in FLOP/s obeys $$ P \;\le\; \min\bigl(\pi,\; \beta \cdot I\bigr). $$ The two terms cross at the **ridge point** $I^\* = \pi / \beta$. A kernel with $I < I^\*$ is **memory-bound**: it cannot use the full compute capacity because it spends its time waiting on memory, and the only way to speed it up is to move fewer bytes (better data reuse, higher precision packing, fusion) or to move them faster. A kernel with $I > I^\*$ is **compute-bound** and is limited by the arithmetic units themselves. The ridge point of modern accelerators is high. A device with roughly $1 \times 10^{15}$ FLOP/s of low-precision compute and roughly $3 \times 10^{12}$ bytes/s of memory bandwidth has $I^\* \approx 300$ FLOP per byte. Many important kernels fall well below this. 
A matrix multiplication of two large square matrices has intensity that grows with the matrix dimension and is comfortably compute-bound, which is why dense training saturates accelerators well. By contrast, the per-token work of autoregressive decoding, an elementwise activation, or a small batched attention step has low intensity and lands on the memory-bound slope. This single inequality explains why so much of the engineering described in later sections, paged attention, quantization, operator fusion, and high-bandwidth memory, is aimed at the byte-movement term $Q$ and the bandwidth $\beta$ rather than at raw arithmetic. ### 2.3 Interconnect and Scale A single accelerator is rarely enough for frontier models. Training a large language model requires hundreds or thousands of devices cooperating, which makes the interconnect a first-class part of the hardware layer. NVLink connects GPUs within a server at high bandwidth, while InfiniBand or specialized Ethernet fabrics connect servers within a cluster. The efficiency of distributed training depends heavily on how fast gradients can be exchanged across this fabric, so interconnect topology is as important as per-chip throughput. The cost of synchronization can be quantified. Data-parallel training requires an **all-reduce** of the gradients after every step, summing each parameter's gradient across all $p$ devices and returning the result to each. A naive implementation would send the full gradient buffer to a central node and incur traffic proportional to $p$. The standard ring all-reduce instead arranges devices in a ring and pipelines the reduction, so that each device sends and receives only $$ 2 \,\frac{p-1}{p}\, M \;\approx\; 2M \quad \text{bytes} $$ where $M$ is the size of the gradient buffer in bytes. The remarkable property is that this per-device cost is essentially independent of $p$: it approaches $2M$ as the cluster grows, so bandwidth, not device count, sets the floor on communication time (16). 
With link bandwidth $\beta_{\text{net}}$, the all-reduce takes roughly $2M / \beta_{\text{net}}$ seconds, and the fraction of each training step lost to communication is this time divided by the step time. Keeping that fraction small is the reason interconnect bandwidth is engineered as aggressively as compute, and the reason gradient compression, overlapping communication with backward computation, and topology-aware collective algorithms all exist. ### 2.4 Tradeoffs at the Hardware Layer The dominant tradeoff is generality versus efficiency. CPUs are the most general and the least efficient per watt for dense linear algebra. GPUs occupy a productive middle ground, flexible enough to run arbitrary research code yet fast enough for production. TPUs and other ASICs push toward maximum efficiency at the cost of flexibility and ecosystem breadth. A second tradeoff is capital versus operating cost: owning accelerators is expensive up front but can be cheaper at scale than renting from a cloud provider, while renting offers elasticity at a premium. Memory capacity is the constraint practitioners feel most acutely, because a model that does not fit on a device forces either smaller batches, model parallelism, or quantization. ## 3. Layer 2: Systems and Drivers Silicon is useless without software to drive it. The systems layer comprises the device drivers, low-level runtimes, and compilers that let higher software issue work to accelerators. ### 3.1 CUDA and the NVIDIA Moat CUDA is NVIDIA's parallel computing platform and programming model, and it is arguably the single most important reason the company dominates AI (3). CUDA exposes the GPU as a programmable device through a C-like language, a runtime, and a vast ecosystem of optimized libraries. Crucially, nearly every deep learning framework targets CUDA first and best. 
This creates a powerful network effect: researchers write for CUDA because the tooling is mature, and the tooling is mature because researchers write for CUDA. The result is a software moat that is harder to cross than any hardware advantage. ### 3.2 ROCm, oneAPI, and the Challengers AMD's answer is ROCm, an open-source platform that aims for CUDA parity and offers HIP, a portability layer that lets CUDA code be translated to run on AMD hardware (4). Intel's oneAPI pursues a similar goal of a unified, cross-vendor programming model. These efforts have narrowed the gap, particularly for inference and for the most common operations, but they still lag CUDA in breadth of optimized kernels and in the long tail of community support. The strategic stakes are high, because a credible open alternative would loosen NVIDIA's grip on pricing and supply. ### 3.3 Compilers as a Battleground Increasingly, the systems layer includes domain-specific compilers that transform high-level tensor programs into optimized device code. XLA compiles computations from JAX and TensorFlow, fusing operations and tuning memory layout for the target accelerator. Triton, a Python-embedded language from OpenAI, lets developers write custom GPU kernels at a higher level than raw CUDA while approaching hand-tuned performance. These compilers matter because the gap between a naive kernel and a fused, well-tiled one can be an order of magnitude in speed. ## 4. Layer 3: Numerical and Tensor Libraries Above the driver sit the numerical libraries that implement the actual mathematical primitives. Frameworks do not reimplement matrix multiplication; they call into these libraries. On NVIDIA hardware, cuBLAS provides dense linear algebra, cuDNN provides tuned implementations of convolution, attention, and other neural network primitives, and NCCL provides the collective communication operations (all-reduce, all-gather) that distributed training depends on (5). 
On CPUs, the long lineage of BLAS and LAPACK, accelerated by vendor libraries such as Intel MKL and oneDNN, plays the analogous role. NumPy, although usually thought of as a Python convenience, is itself a thin and elegant wrapper over these CPU kernels and established the n-dimensional array as the lingua franca of scientific computing in Python (6). The tradeoff at this layer is invisible to most users by design, but it is enormous in effect. These libraries are written and maintained by performance specialists who exploit cache hierarchies, vector instructions, and accelerator microarchitecture in ways that ordinary application code never could. The cost is that they are closed or semi-closed, vendor-specific, and slow to support new hardware. A new accelerator is only as useful as its cuDNN-equivalent, which is why building these kernels is one of the hardest parts of bringing new silicon to market. ## 5. Layer 4: Deep Learning Frameworks The framework layer is where most practitioners spend their time. A framework provides three things: an n-dimensional tensor type with operations that dispatch to the numerical libraries below, automatic differentiation so that gradients need not be derived by hand, and a set of building blocks for constructing and training models. The middle item deserves a precise statement, because it is the mathematical engine of the whole field. Automatic differentiation (autodiff) is not numerical differentiation by finite differences, and it is not symbolic differentiation that manipulates formulas. It is the systematic application of the chain rule to the elementary operations recorded while a program runs. A framework represents a model as a composition of differentiable primitives $f = f_L \circ \cdots \circ f_2 \circ f_1$. By the chain rule the Jacobian of the whole is the product of the per-layer Jacobians, $$ J_f = J_{f_L} \, J_{f_{L-1}} \cdots J_{f_1}. 
$$ Training needs the gradient of a scalar loss with respect to many parameters, which is a vector-Jacobian product. **Reverse-mode** autodiff, known in this setting as backpropagation, evaluates this product right to left, propagating a single vector backward through the recorded operations. Its cost is a small constant multiple of the cost of the forward evaluation, independent of the number of parameters, which is exactly why it scales to networks with billions of weights. The price is memory: the intermediate activations needed for the backward pass must be stored, which makes activation memory a first-order constraint and motivates techniques such as gradient checkpointing that trade recomputation for storage. This asymmetry, cheap gradients but expensive activation memory, is felt at every layer above. ### 5.1 PyTorch PyTorch has become the dominant research framework and is increasingly dominant in production as well (7). Its defining design choice was eager, define-by-run execution: the computational graph is built dynamically as Python executes, which makes models feel like ordinary imperative programs and makes debugging natural. This came at a historical cost in performance and deployment, which the project has addressed with `torch.compile`, a graph-capture and compilation system that recovers much of the speed of static graphs without sacrificing the eager programming model. ### 5.2 JAX JAX takes a different philosophy rooted in functional programming (8). It offers composable function transformations: `grad` for automatic differentiation, `jit` for just-in-time compilation through XLA, `vmap` for automatic vectorization, and `pmap` or its successors for parallelism across devices. JAX favors pure functions and explicit state, which suits large-scale, highly parallel research, and it has become the framework of choice for much frontier model work, particularly inside Google and in the academic community studying scaling. 
### 5.3 TensorFlow and Keras TensorFlow, the framework that catalyzed the deep learning industry, pioneered the static-graph approach in which the full computation is defined before execution, enabling aggressive optimization and straightforward deployment to servers, mobile, and the browser (9). Keras, now a multi-backend high-level API, provides an accessible, layer-oriented interface and runs atop TensorFlow, JAX, or PyTorch. While TensorFlow's share of new research has declined, its production tooling and deployment story remain strong. ### 5.4 Framework Tradeoffs The core tension is eager flexibility versus compiled performance, and the field has largely converged on a synthesis: write in an eager, Pythonic style, then compile hot paths for speed. The remaining differentiators are ecosystem (PyTorch's library breadth is unmatched), parallelism model (JAX's transformations are uniquely composable), and deployment maturity. Lock-in is real but softening, since interchange formats and multi-backend tools let models move between frameworks more easily than before. ## 6. Layer 5: Data and Pipeline Tooling Models are only as good as the data fed to them, and at scale, getting bytes from storage to the accelerator without starving it becomes a serious engineering problem. The data layer handles storage formats, transformation, and high-throughput loading. Columnar formats such as Apache Parquet and the in-memory Apache Arrow standard allow efficient storage and zero-copy sharing of large tabular datasets (10). Distributed processing engines such as Apache Spark, Dask, and Ray Data transform and clean data across clusters before it ever reaches training. At the boundary with the accelerator, loaders such as PyTorch's `DataLoader`, `tf.data`, and NVIDIA DALI overlap data preparation with computation so that the expensive GPU is never idle waiting for the next batch. The governing principle here is that the pipeline must keep the accelerator saturated. 
An accelerator costing many dollars per hour that sits idle waiting for data is pure waste, so the data layer is engineered around throughput, prefetching, and parallel decoding. The tradeoff is complexity: sophisticated pipelines with sharding, caching, and augmentation are powerful but brittle, and a surprising fraction of real-world training failures trace to the data path rather than the model. ## 7. Layer 6: Model Hubs and Serving Once a model exists, it must be distributed and then executed on behalf of users. These are distinct concerns, and the stack provides distinct tooling for each. ### 7.1 Model Hubs The Hugging Face Hub has become the de facto registry for sharing pretrained models, datasets, and demos, hosting hundreds of thousands of models with versioning, model cards, and a standardized loading interface (11). The companion `transformers` library turned the use of a state-of-the-art model into a few lines of code, which dramatically lowered the barrier to applied AI. The hub model mirrors the package registries of software engineering, bringing the same benefits of reuse and the same risks around provenance, licensing, and supply-chain trust. ### 7.2 Serving Runtimes Serving a large model efficiently is a specialized problem, particularly for autoregressive language models whose generation is memory-bound and sequential. The reason is a direct consequence of the roofline model. Generating one token of output requires reading every model weight from memory exactly once to compute a single forward pass. For a model with $N$ parameters stored at $b$ bytes each, decoding a single sequence moves about $bN$ bytes but performs only about $2N$ FLOP, an arithmetic intensity near $2/b$, which is tiny. Single-stream decoding therefore sits far down the memory-bound slope, and its speed is set almost entirely by memory bandwidth: the time per token is approximately $bN / \beta$, where $\beta$ is the device bandwidth. This analysis also reveals the cure. 
The weights are read once per forward pass regardless of how many sequences are processed together, so **batching** many requests into one forward pass amortizes that fixed byte-movement cost across many tokens, raising arithmetic intensity and pushing the kernel back toward the compute-bound region where the accelerator is well used. The obstacle is the per-request **key-value cache**, the stored attention keys and values that grow with sequence length and consume the memory that batching needs. Purpose-built inference servers address exactly this. vLLM introduced PagedAttention, which manages the key-value cache like virtual memory in pages, eliminates fragmentation, and allows many requests to share GPU memory efficiently, which in turn allows much larger batches and sharply higher throughput (12). Hugging Face Text Generation Inference, NVIDIA Triton Inference Server, TorchServe, and KServe provide production features such as continuous (dynamic) batching, multi-model hosting, and standardized inference protocols. The dominant tradeoffs are latency versus throughput (batching more requests raises throughput but can raise the time an individual request waits and is processed) and cost versus quality (quantization and smaller models cut cost at some risk to accuracy). ### 7.3 Worked Example: Reading the Stack Through One Number Consider serving a 7 billion parameter language model on a single accelerator with peak bandwidth $\beta = 3 \times 10^{12}$ bytes/s. The example is illustrative; the point is the method, not the exact figures. Store the weights in 16-bit precision, so $b = 2$ bytes and the weights occupy $bN = 2 \times 7 \times 10^9 = 1.4 \times 10^{10}$ bytes, about 14 GB. 
A single decoding stream must read all of this per token, so the lower bound on time per token is $$ t \;\approx\; \frac{bN}{\beta} \;=\; \frac{1.4 \times 10^{10}}{3 \times 10^{12}} \;\approx\; 4.7 \times 10^{-3}\ \text{s}, $$ roughly 210 tokens per second as a bandwidth ceiling for one stream, before any overhead. Now quantize the weights to 8 bits, $b = 1$. The bytes moved per token halve, the weights occupy about 7 GB, and the bandwidth ceiling roughly doubles to about 420 tokens per second. This is the concrete mechanism behind the claim that quantization helps inference: its primary benefit for memory-bound decoding is not fewer FLOP but fewer bytes moved, which is the quantity the roofline says actually governs the latency. The freed memory also leaves room for a larger key-value cache and hence larger batches, compounding the throughput gain. The same single number, arithmetic intensity, that classified the kernel in Layer 1 thus dictates a Layer 6 serving decision, an illustration of how tightly the layers are coupled. ## 8. Layer 7: Orchestration and MLOps Production AI is a continuous process, not a one-time artifact, and the orchestration layer manages that process across teams and time. This is the domain of MLOps, the application of DevOps discipline to machine learning. Kubernetes provides the general substrate for running containerized workloads across clusters, and Ray offers a Python-native framework for scaling training, tuning, and serving (13). Workflow engines such as Apache Airflow, Kubeflow Pipelines, and Metaflow schedule the multi-step pipelines that ingest data, train, evaluate, and deploy. Experiment tracking and registry tools such as MLflow and Weights and Biases record the parameters, metrics, and artifacts of every run so that results are reproducible and models are governed (14). Feature stores and monitoring systems close the loop by serving consistent features and detecting drift once a model is live. 
The central insight of this layer is that machine learning systems decay. Data distributions shift, dependencies change, and yesterday's accurate model degrades silently. The orchestration layer exists to make training reproducible, deployment repeatable, and degradation observable. Its tradeoff is the familiar one of platform engineering: heavyweight, integrated MLOps platforms reduce operational toil but impose process and lock-in, while lightweight, composed tooling stays flexible at the cost of more glue code and more discipline. ## 9. Layer 8: The Application Layer At the top sits the layer that delivers value to people: the chatbots, coding copilots, retrieval-augmented question answering systems, autonomous agents, recommendation engines, and search experiences that constitute the product. With the rise of capable foundation models served behind APIs, a great deal of application development now happens here without any direct contact with the lower layers at all. This layer has developed its own emerging stack. Orchestration libraries such as LangChain and LlamaIndex compose model calls, tool use, and memory. Vector databases such as Pinecone, Weaviate, and the open-source FAISS library store embeddings for semantic retrieval, the backbone of retrieval-augmented generation. Protocols such as the Model Context Protocol are beginning to standardize how applications connect models to external tools and data sources. The tradeoff at the application layer is build versus buy taken to its logical end: a team can call a hosted model and own almost nothing of the stack below, gaining speed and giving up control, cost predictability, and data sovereignty, or it can self-host and own everything, inverting every term of that bargain. ## 10. How the Layers Fit Together The power of the stack comes from its composition, and a single inference request illustrates the cooperation. A user types a question into a chat application at layer 8. 
The application embeds the query and retrieves context from a vector store at the application layer, then issues a request to a serving runtime at layer 6. The runtime, perhaps vLLM, schedules the request, manages its key-value cache, and invokes a model defined in a framework at layer 4. The framework dispatches the model's matrix multiplications and attention operations to numerical libraries at layer 3, such as cuBLAS and cuDNN. Those libraries issue work through the CUDA runtime at layer 2, which drives the GPU at layer 1, where tensor cores finally multiply the numbers. The generated tokens travel back up the same chain. Meanwhile, layer 7 has provisioned the hardware, deployed the model, and is recording metrics for the whole transaction. The diagram below traces that request as it descends the stack and the generated tokens as they return. ```{mermaid} flowchart TD U["User question"] A["Layer 8: chat app embeds query and retrieves context"] S["Layer 6: serving runtime schedules request and manages KV cache"] F["Layer 4: framework runs model forward pass"] N["Layer 3: cuBLAS and cuDNN compute matmul and attention"] D["Layer 2: CUDA runtime issues work to the device"] H["Layer 1: tensor cores multiply the numbers"] U --> A --> S --> F --> N --> D --> H H -. "generated tokens travel back up" .-> U ``` Two cross-cutting principles govern the whole edifice. First, the binding constraint is usually memory, not arithmetic, which is why so much engineering at every layer (quantization, paged attention, HBM, prefetching pipelines) targets the movement and storage of data rather than the speed of computation. Second, abstractions leak in the direction of performance: a practitioner can ignore the lower layers right up until performance, cost, or memory forces them to look down, at which point a working model of the entire stack becomes indispensable. 
The stack is therefore best understood not as a set of independent choices but as a coupled system in which decisions at one layer ripple through all the others. ## 11. When to Look Down the Stack, and Common Pitfalls A practitioner does not need to optimize every layer at once. The discipline is knowing which layer the current problem lives in, because effort spent at the wrong layer is wasted. **When to descend.** Stay high in the stack by default, since higher layers are more portable and more productive. Descend only when a measured constraint forces it. If latency or cost is unacceptable, first identify whether the workload is memory-bound or compute-bound using the arithmetic-intensity test, because the two require opposite remedies. A memory-bound decode is helped by quantization, fusion, and a better key-value cache (Layers 1, 3, and 6), and not at all by a faster matrix-multiply algorithm. A compute-bound training step is helped by higher-precision tensor cores and better tiling, and barely at all by bandwidth tricks. If a single device cannot hold the model, the choice between quantization, model parallelism, and a larger accelerator is again a Layer 1 through Layer 6 decision driven by memory capacity. **Common pitfalls.** - **Profiling the wrong layer.** Blaming the framework for slowness that is actually a starved data pipeline (Layer 5) or a memory-bound kernel (Layer 1) is the most frequent mistake. Always measure utilization before optimizing; low GPU utilization usually points to the data path or to a memory bound, not to the model code. - **Ignoring the data path.** A surprising fraction of real training failures and slowdowns trace to Layer 5 rather than the model. An accelerator idle while it waits for the next batch is pure waste. - **Premature low-level optimization.** Writing custom kernels before confirming the kernel is the bottleneck trades large effort for small gains. 
Reach for compilers (`torch.compile`, XLA, Triton) before hand-written device code.
- **Mistaking quantization's mechanism.** Quantization speeds memory-bound inference chiefly by moving fewer bytes, not by reducing arithmetic. Expecting it to help a compute-bound workload, or applying it without measuring the accuracy cost, leads to disappointment.
- **Underestimating lock-in and decay.** Choosing the lowest, most vendor-specific layer for a marginal speedup can trap a project on one accelerator. Deploying without Layer 7 monitoring lets a model degrade silently as data drifts.

The unifying lesson is diagnostic: name the layer, measure the binding resource, and apply the remedy that matches it.

## References

1. NVIDIA. "NVIDIA H100 Tensor Core GPU Architecture." NVIDIA Corporation. https://www.nvidia.com/en-us/data-center/h100/
2. Jouppi, N. P., et al. "In-Datacenter Performance Analysis of a Tensor Processing Unit." Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA), 2017. https://arxiv.org/abs/1704.04760
3. NVIDIA. "CUDA Toolkit Documentation." NVIDIA Corporation. https://docs.nvidia.com/cuda/
4. AMD. "ROCm Open Software Platform Documentation." Advanced Micro Devices. https://rocm.docs.amd.com/
5. NVIDIA. "cuDNN, cuBLAS, and NCCL Developer Libraries." NVIDIA Corporation. https://developer.nvidia.com/cudnn
6. Harris, C. R., et al. "Array Programming with NumPy." Nature, vol. 585, 2020, pp. 357-362. https://www.nature.com/articles/s41586-020-2649-2
7. Paszke, A., et al. "PyTorch: An Imperative Style, High-Performance Deep Learning Library." Advances in Neural Information Processing Systems (NeurIPS), 2019. https://arxiv.org/abs/1912.01703
8. Bradbury, J., et al. "JAX: Composable Transformations of Python and NumPy Programs." Google Research. https://github.com/jax-ml/jax
9. Abadi, M., et al. "TensorFlow: A System for Large-Scale Machine Learning."
Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2016. https://www.tensorflow.org/
10. Apache Software Foundation. "Apache Arrow: A Cross-Language Development Platform for In-Memory Data." https://arrow.apache.org/
11. Wolf, T., et al. "Transformers: State-of-the-Art Natural Language Processing." Proceedings of EMNLP: System Demonstrations, 2020. https://huggingface.co/docs/hub/
12. Kwon, W., et al. "Efficient Memory Management for Large Language Model Serving with PagedAttention." Proceedings of the 29th ACM Symposium on Operating Systems Principles (SOSP), 2023. https://arxiv.org/abs/2309.06180
13. Moritz, P., et al. "Ray: A Distributed Framework for Emerging AI Applications." Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2018. https://arxiv.org/abs/1712.05889
14. Zaharia, M., et al. "Accelerating the Machine Learning Lifecycle with MLflow." IEEE Data Engineering Bulletin, vol. 41, no. 4, 2018. https://mlflow.org/
15. Williams, S., Waterman, A., and Patterson, D. "Roofline: An Insightful Visual Performance Model for Multicore Architectures." Communications of the ACM, vol. 52, no. 4, 2009, pp. 65-76. https://doi.org/10.1145/1498765.1498785
16. Sergeev, A., and Del Balso, M. "Horovod: Fast and Easy Distributed Deep Learning in TensorFlow." arXiv preprint, 2018. https://arxiv.org/abs/1802.05799