15 A Preview of the Road Ahead

This chapter closes Part I (Foundations) by laying out a map of the territory that follows. You have now seen why artificial intelligence matters, how the field arrived at its present moment, and what conceptual vocabulary the rest of the book assumes. The purpose here is different. Rather than introduce new technical material, this chapter explains how the book is organized, why the parts are ordered the way they are, and how you can navigate them according to your background and goals. Think of it as a trailhead sign: a description of the trails, their difficulty, where they connect, and what view waits at the summit.

A map is only useful if you can read it precisely, so this chapter does three things. It states the organizing principle of the book as an explicit claim, not a slogan. It makes the dependencies between parts concrete enough to reason about, including a diagram you can consult whenever you are deciding what to read next. And it offers calibrated learning paths for different kinds of reader, with honest notes on where each path is likely to break down. Everything here is in service of one practical question: given who you are and what you need, what should you read, in what order, and what can you safely defer.

15.1 1. The Shape of the Book

“AI in Action” is built around a single conviction: that modern artificial intelligence is best understood as a layered stack, where each layer rests on the one beneath it and exposes a cleaner interface to the one above. You cannot reason carefully about transformers without linear algebra and probability. You cannot build a retrieval system without understanding embeddings, and you cannot understand embeddings without first understanding how data is represented. Agents, in turn, orchestrate language models, retrieval, and tools into systems that act in the world. The book follows this dependency structure deliberately.

The layering claim deserves a precise statement, because the rest of the book’s structure follows from it.

The layering principle. Each part of the book exposes an interface: a small set of concepts and capabilities that later parts can use without re-deriving how they were built. Part II exposes vectors, gradients, and probability distributions. Part V exposes a trained differentiable model. Part VI exposes a language model that maps a prompt to a distribution over continuations. Part VIII exposes an agent that takes goals and acts. A reader standing at any layer needs fluency with the interface immediately below, but not with the full internals of every layer further down.

This is the same discipline that makes software systems tractable. A web developer calls a sorting routine without re-deriving its average-case complexity, yet benefits from knowing one exists and roughly what it costs. The layering principle says learning AI can work the same way, provided you actually master each interface before you rely on it. The failure mode it warns against is using a layer whose interface you have not internalized, which produces the brittle, cargo-cult understanding that collapses the moment a system behaves unexpectedly.

The material is divided into eight parts.

15.1.1 1.1 Part I: Foundations

This part, which you are completing now, sets the intellectual and historical context. It motivates the field, surveys its trajectory, and establishes shared terminology. Its job is orientation rather than mastery. By design it is the part you can read fastest, because it asks you to absorb ideas rather than derive them.

15.1.2 1.2 Part II: Mathematical Foundations

Here the book becomes concrete and quantitative. We develop the linear algebra (vectors, matrices, norms, decompositions), the calculus of optimization (gradients, the chain rule, convexity), and the probability and statistics (distributions, expectation, estimation, information theory) that the rest of the book uses constantly. The treatment is practical: every concept is connected to a place where it later appears, so that the mathematics never feels like an arbitrary detour.

15.1.3 1.3 Part III: Data

No model is better than the data it learns from. This part covers data representation, collection, annotation, cleaning, handling of missing values, and the treatment of time and other structured signals. It also confronts the ethical and practical realities of privacy, consent, and bias. These chapters are where the abstractions of Part II meet the messy texture of the real world.

15.1.4 1.4 Part IV: Classical Machine Learning

Before neural networks dominated, a rich body of methods established the core ideas of learning from data: linear and logistic regression, decision trees, support vector machines, clustering, dimensionality reduction, and the discipline of evaluation. These methods remain the right tool for many problems, and they teach the concepts (overfitting, regularization, the bias and variance tradeoff, cross validation) that carry forward into deep learning.

15.1.5 1.5 Part V: Neural Networks

This part builds the deep learning toolkit from first principles: the perceptron, the multilayer network, backpropagation, and the practical machinery of training (initialization, optimizers, normalization, regularization). It then surveys the major architectural families, including convolutional networks for images and recurrent networks for sequences. By the end you understand not only how these models work but why they are trained the way they are.

15.1.6 1.6 Part VI: The Transformer and Language Models

At the center of the book sits the transformer, the architecture that reshaped the field. We derive attention carefully, build up the full encoder and decoder stacks, and then trace the path from pretraining to large language models, instruction tuning, and alignment. This part connects the mathematics, the data practices, and the neural network foundations into the systems that power today’s most capable AI.

15.1.7 1.7 Part VII: Retrieval and Knowledge Graphs

Language models are powerful but bounded by what they internalized during training. This part shows how to extend them with external memory: vector search, retrieval augmented generation, and knowledge graphs that encode structured relationships. Here embeddings become infrastructure, and the reader learns how to ground a model’s outputs in verifiable sources.

15.1.8 1.8 Part VIII: Agents and Applications

The final part moves from models to systems that act. We cover agent architectures, planning, tool use, multi-agent coordination, and the interoperability protocols that let agents communicate. We close with applications that integrate everything: building, evaluating, and deploying real systems responsibly. This is where the book delivers on its title, turning understanding into action.

15.2 2. How the Parts Build on Each Other

The ordering is not arbitrary, and it is worth seeing the dependencies explicitly so that you can decide what to read and what to defer. The structure is best understood as a directed acyclic graph of prerequisites. An edge from one part to another means the second part assumes fluency with the first. There are no cycles, which is exactly what makes a sensible reading order possible: a topological sort of this graph is a valid sequence in which to learn the material, and the book’s part numbering is one such sort.

flowchart TD
    P1["Part I Foundations"]
    P2["Part II Mathematical Foundations"]
    P3["Part III Data"]
    P4["Part IV Classical Machine Learning"]
    P5["Part V Neural Networks"]
    P6["Part VI Transformers and Language Models"]
    P7["Part VII Retrieval and Knowledge Graphs"]
    P8["Part VIII Agents and Applications"]

    P1 --> P2
    P2 --> P4
    P2 --> P5
    P4 --> P5
    P5 --> P6
    P3 --> P4
    P3 --> P6
    P3 --> P8
    P6 --> P7
    P6 --> P8
    P7 --> P8

Read the diagram as a set of constraints rather than a single forced route. Any ordering that respects the arrows is a legitimate path through the book. The vertical chain (Part II to Part IV to Part V to Part VI) is the load-bearing spine. Data (Part III) and retrieval (Part VII) attach to that spine at the points where they are first needed.

15.2.1 2.1 The Vertical Spine

There is a vertical spine running from mathematics upward: Part II supplies the language of vectors, gradients, and probability; Part IV uses that language to define learning and evaluation; Part V generalizes learning to deep networks; Part VI specializes deep networks into transformers and language models. Each step assumes fluency with the previous one. A reader who skips the gradient material in Part II will find backpropagation in Part V opaque, and a reader who skips backpropagation will find the training dynamics of language models in Part VI mysterious.

The spine is not merely a thematic grouping. Each link is a concrete technical dependency that you can name. To make this auditable rather than asserted, here are the load-bearing prerequisites along the spine.

To understand (later)	You must already command (earlier)	Why the link is hard to skip
Logistic regression, Part IV	The gradient and the chain rule, Part II	The decision boundary is fit by gradient descent on a log-loss whose derivative you must be able to take
Backpropagation, Part V	Partial derivatives and the chain rule, Part II	Backpropagation is the chain rule applied systematically over a computation graph
Attention, Part VI	Matrix multiplication, dot products, the softmax, Part II	An attention score is a scaled dot product passed through a softmax, then used to weight a matrix of values
Alignment and fine tuning, Part VI	Maximum likelihood and the cross entropy objective, Part II and Part IV	Pretraining and instruction tuning both minimize a cross entropy loss, and reward modeling rests on probabilistic estimation

The pattern is that ideas do not so much get replaced as get reused at a larger scale. The chain rule you learn for a two-variable function in Part II is the same chain rule that propagates gradients through a network with billions of parameters in Part VI. Skipping the small case does not save time; it defers the cost to a moment when the same idea is buried inside far more machinery and is correspondingly harder to extract.

15.2.2 2.2 The Horizontal Supports

Two parts act as horizontal supports that touch every layer. Part III (Data) is relevant whether you are fitting a logistic regression or pretraining a transformer, because the questions of representation, quality, and bias never go away. Part VII (Retrieval and Knowledge Graphs) draws on embeddings, which first appear in Part V and mature in Part VI, but it serves the application goals of Part VIII. You can think of data and retrieval as the connective tissue between the vertical spine and the systems built on top of it.

15.2.3 2.3 The Convergence Point

Part VIII is where the threads converge. An agent that answers a customer’s question may call a language model (Part VI), retrieve grounding documents (Part VII), apply a classifier to route the request (Part IV), and depend on clean, well represented data throughout (Part III). The applications are deliberately placed last because they presuppose the rest. Reaching them is the payoff for the climb.

To make the convergence concrete, trace a single request through the stack. A user asks a support agent, “Has my refund been processed?” A lightweight intent classifier (Part IV) routes this to the billing workflow. The agent embeds the question and retrieves the user’s recent transactions and the relevant policy passages (Part VII), which depend on clean, well-represented account data (Part III). It composes a prompt that interleaves the retrieved facts with the question and sends it to a language model (Part VI), whose attention mechanism (Part VI), trained by backpropagation (Part V) on a cross entropy objective (Part II), produces a grounded answer. The agent loop (Part VIII) decides whether the answer is complete or whether another tool call is needed. Every part of the book appears in that single sentence-long interaction, which is precisely why the applications come last.

15.3 3. Suggested Learning Paths

Readers come to this book with different backgrounds and different goals. The chapters are written to support a linear reading, but few readers should feel obligated to read every section in order. Below are three paths, each tuned to a kind of reader. Treat them as starting points to adapt, not prescriptions. Each path is a different traversal of the same dependency graph from Section 2: the practitioner descends from applications and pulls in theory on demand, the researcher ascends from foundations and refuses to leave a layer until it is fully understood, and the student walks the topological order straight through.

15.3.1 3.1 The Practitioner

Suppose you are an engineer who needs to ship a working system soon, and who is comfortable treating some mathematics as a tool to be trusted rather than derived. Your path emphasizes building and evaluating.

Begin with Part I for context, then read Part III (Data) carefully, because data problems will dominate your real work. Skim Part II, returning to specific topics (matrix multiplication, the softmax, gradient descent) only when a later chapter sends you back. Read the practical chapters of Part V on training and the major architectures, then go deep on Part VI, Part VII, and Part VIII, which contain most of what you will use day to day. Run every code example. Your aim is fluency with the systems, and you can acquire the underlying theory opportunistically as questions arise.

The risk of this path is silent failure. A system that runs and returns plausible output can still be wrong in ways that only theory exposes, such as a retrieval index that degrades on out-of-distribution queries or an evaluation metric that rewards the wrong behavior. The practitioner’s discipline is to treat every unexplained result as a prompt to descend one layer, not to move on because the demo worked.

15.3.2 3.2 The Researcher

Suppose you intend to extend the field rather than only apply it, and you want to understand methods well enough to question them. Your path emphasizes derivation and the boundaries of current knowledge.

Read Part II thoroughly and do the proofs and derivations; they are the foundation of any original contribution. Work through Part IV completely, because the classical methods clarify ideas (generalization, regularization, the structure of optimization) that deep learning inherits but rarely re-derives. Study Part V and Part VI line by line, paying attention to the design choices and the alternatives that were not taken. In Part VII and Part VIII, read with a critical eye toward open problems: evaluation, robustness, interpretability, and coordination remain active research frontiers. Follow the references at the end of each chapter into the primary literature.

The risk of this path is the opposite of the practitioner’s: deep local understanding that never integrates into a working system, so that subtle implementation realities (numerical stability, data leakage, the gap between a benchmark and a deployment) stay invisible. The researcher’s discipline is to build at least once what they have only derived, because an idea that has never survived contact with running code is an idea only half understood.

15.3.3 3.3 The Student

Suppose you are learning the field systematically, perhaps alongside a course, and you have time to build understanding from the ground up. Your path is the linear one, and it is the path the book was structured to support.

Read the parts in order. Do the exercises in each chapter before moving on, because the concepts compound and gaps left early become obstacles later. Resist the temptation to jump ahead to the exciting material in Part VI; the excitement is far greater when you can see exactly why attention works and where it came from. Keep a notebook of questions, and revisit it as you advance, since many early puzzles resolve themselves once you have the later machinery. The reward for patience is a genuinely connected mental model rather than a collection of disconnected recipes.

15.3.4 3.4 Choosing and Mixing Paths

These paths are not mutually exclusive. A practitioner who hits a wall will often benefit from briefly adopting the researcher’s path through a single topic. A student preparing for a job may add the practitioner’s emphasis on running code. The honest advice is to read with a purpose in mind, to be willing to backtrack, and to use the dependency map in Section 2 to decide when a detour into earlier material is worth the time.

The following table summarizes the three paths and the failure mode each must guard against.

Reader	Primary emphasis	Typical entry order	Characteristic risk
Practitioner	Build and evaluate systems	I, III, then VI through VIII, with II as reference	Silent failure that only theory would expose
Researcher	Derive and question methods	II and IV in full, then V and VI in depth	Deep understanding that never ships as working code
Student	Connected foundational mastery	Linear, I through VIII	Slower start, traded for a durable mental model

15.4 4. How to Use the Code and Exercises

This book is called “AI in Action” because understanding AI requires doing AI. Every concept that can be made concrete is accompanied by runnable code, and most chapters end with exercises that ask you to extend, break, or apply that code.

15.4.1 4.1 The Role of Code

The code is not decoration, and it is not a finished library to be imported and forgotten. It is written to be read. Implementations favor clarity over performance, so that you can trace exactly how a gradient flows or how an attention score is computed. When you encounter a code listing, the most valuable thing you can do is run it, change a number, and predict the result before you observe it. The gap between your prediction and the outcome is where learning happens.

Throughout, the book leans on mature, free, open-source tools rather than proprietary platforms, so that every example is reproducible without a license or an account. The core Python stack is NumPy for numerical arrays, pandas for tabular data, scikit-learn for classical machine learning, and PyTorch for deep learning, with Matplotlib for visualization. Where a chapter shows the same idea in more than one language, it uses the open Julia and Rust ecosystems for contrast. The point of this choice is durability: an example built on open tools can still be run years from now, whereas one tied to a hosted service can vanish when the service changes.

15.4.2 4.2 Kinds of Exercises

The exercises fall into three rough categories. Some are confirmatory: they ask you to verify a claim made in the text, such as reproducing a derivation numerically. Some are exploratory: they ask you to vary a setting (a learning rate, a dataset size, a model depth) and characterize the effect. Some are constructive: they ask you to build something new from the components introduced in the chapter. The categories grow in difficulty and in value. The constructive exercises, though the most demanding, are the ones that most reliably turn knowledge into skill.

These three categories correspond loosely to a ladder of cognitive demand, from verifying what is given, to manipulating it, to creating something new. The ordering is not accidental. A confirmatory exercise checks that you can follow a result; an exploratory one checks that you can probe its sensitivity; a constructive one checks that you can wield the underlying idea independently. Spending all your effort on the first kind feels productive and teaches little, because recognition is far easier than generation.

15.4.3 4.3 A Suggested Practice

A reliable habit is to read a chapter once for the ideas, then return with an editor open and reconstruct the key code from a blank file, consulting the text only when stuck. This active recall is uncomfortable and slow, and it is far more effective than passive rereading. The effect is well documented: testing yourself on material produces stronger long-term retention than restudying it for the same amount of time, an effect known as the testing effect or retrieval practice (Roediger and Karpicke 2006). Where a chapter provides a dataset, spend a few minutes simply looking at the data before modeling it, because intuition about the input is worth more than any single technique applied to it.

15.5 5. What You Will Be Able to Do by the End

It is worth being concrete about the destination, because a clear goal makes the climb easier. The three competencies below are deliberately distinct: fluency is about reading the field, capability is about building in it, and judgment is about deciding whether what you have built should exist at all. A reader can possess any one without the others, and the book aims for all three.

15.5.1 5.1 Conceptual Fluency

By the end of the book you will be able to read a contemporary research paper or system description and follow its argument. You will recognize the mathematical objects, understand the design choices, and locate the work within the broader stack. The vocabulary that may feel like jargon now will become a language you think in.

15.5.2 5.2 Practical Capability

You will be able to take a real problem, frame it as a learning or reasoning task, choose appropriate methods, prepare and inspect the data, implement and train a model or assemble a system, and evaluate it honestly. You will know how to extend a language model with retrieval, how to give an agent tools and let it plan, and how to tell whether the result actually works. These are the capabilities that turn a reader into a builder.

15.5.3 5.3 Critical Judgment

Perhaps most importantly, you will be able to judge. You will know the failure modes of these systems, the ways evaluation can mislead, and the ethical stakes of deploying models that affect people. You will be equipped to ask whether a given approach is appropriate, what it might be hiding, and what could go wrong. In a field that moves quickly and is prone to hype, this judgment is the most durable thing the book can offer.

15.6 6. A Note on the Pace of the Field

Artificial intelligence advances quickly, and any book risks describing a moving target. The response of this book is to emphasize the foundations that change slowly: the mathematics, the principles of learning, the structure of the transformer, the logic of retrieval, and the architecture of agents. Specific models and benchmarks will be superseded, but the conceptual scaffolding will remain. If you understand why attention works, you will understand its successors faster. If you understand how to evaluate a system honestly, no new model will fool you for long. The goal is not to memorize the present state of the art but to develop the understanding that lets you absorb whatever comes next.

It helps to separate two clocks. One is fast: leaderboards, model names, the size of the latest training run, the API that is fashionable this quarter. The other is slow: the chain rule, maximum likelihood, the bias and variance tradeoff, the attention operation, the idea of grounding a generation in retrieved evidence. The fast clock generates most of the news and almost none of the durable understanding. This book is deliberately weighted toward the slow clock, on the wager that a reader who owns the slow-moving foundations can re-derive or quickly absorb anything on the fast clock, while a reader who only memorized the fast clock is left behind with each release. When a chapter does mention a specific model or benchmark, treat it as a dated example of a durable idea, not as a fact to memorize.

With the map in hand, you are ready to begin the climb. Part II awaits, where the abstract motivation of these opening chapters gives way to the precise mathematics on which everything else is built.

15.7 Further Reading

Bishop, C. M., and Bishop, H. (2024). Deep Learning: Foundations and Concepts. Springer. A modern, mathematically grounded survey that complements the spine of this book.
Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press. The standard reference for the neural network foundations developed in Part V.
Murphy, K. P. (2022, 2023). Probabilistic Machine Learning (two volumes). MIT Press. A thorough treatment of the probabilistic perspective that underlies Parts II and IV.
Jurafsky, D., and Martin, J. H. (2025 draft). Speech and Language Processing (3rd ed.). A comprehensive companion for the language modeling material of Part VI.
Roediger, H. L., and Karpicke, J. D. (2006). Test-enhanced learning: Taking memory tests improves long-term retention. Psychological Science, 17(3), 249-255. The experimental basis for the active-recall study habit recommended in Section 4. (Roediger and Karpicke 2006)

# A Preview of the Road Ahead This chapter closes Part I (Foundations) by laying out a map of the territory that follows. You have now seen why artificial intelligence matters, how the field arrived at its present moment, and what conceptual vocabulary the rest of the book assumes. The purpose here is different. Rather than introduce new technical material, this chapter explains how the book is organized, why the parts are ordered the way they are, and how you can navigate them according to your background and goals. Think of it as a trailhead sign: a description of the trails, their difficulty, where they connect, and what view waits at the summit. A map is only useful if you can read it precisely, so this chapter does three things. It states the organizing principle of the book as an explicit claim, not a slogan. It makes the dependencies between parts concrete enough to reason about, including a diagram you can consult whenever you are deciding what to read next. And it offers calibrated learning paths for different kinds of reader, with honest notes on where each path is likely to break down. Everything here is in service of one practical question: given who you are and what you need, what should you read, in what order, and what can you safely defer. ## 1. The Shape of the Book "AI in Action" is built around a single conviction: that modern artificial intelligence is best understood as a layered stack, where each layer rests on the one beneath it and exposes a cleaner interface to the one above. You cannot reason carefully about transformers without linear algebra and probability. You cannot build a retrieval system without understanding embeddings, and you cannot understand embeddings without first understanding how data is represented. Agents, in turn, orchestrate language models, retrieval, and tools into systems that act in the world. The book follows this dependency structure deliberately. The layering claim deserves a precise statement, because the rest of the book's structure follows from it. > **The layering principle.** Each part of the book exposes an *interface*: a small set of concepts and capabilities that later parts can use without re-deriving how they were built. Part II exposes vectors, gradients, and probability distributions. Part V exposes a trained differentiable model. Part VI exposes a language model that maps a prompt to a distribution over continuations. Part VIII exposes an agent that takes goals and acts. A reader standing at any layer needs fluency with the interface immediately below, but not with the full internals of every layer further down. This is the same discipline that makes software systems tractable. A web developer calls a sorting routine without re-deriving its average-case complexity, yet benefits from knowing one exists and roughly what it costs. The layering principle says learning AI can work the same way, provided you actually master each interface before you rely on it. The failure mode it warns against is using a layer whose interface you have not internalized, which produces the brittle, cargo-cult understanding that collapses the moment a system behaves unexpectedly. The material is divided into eight parts. ### 1.1 Part I: Foundations This part, which you are completing now, sets the intellectual and historical context. It motivates the field, surveys its trajectory, and establishes shared terminology. Its job is orientation rather than mastery. By design it is the part you can read fastest, because it asks you to absorb ideas rather than derive them. ### 1.2 Part II: Mathematical Foundations Here the book becomes concrete and quantitative. We develop the linear algebra (vectors, matrices, norms, decompositions), the calculus of optimization (gradients, the chain rule, convexity), and the probability and statistics (distributions, expectation, estimation, information theory) that the rest of the book uses constantly. The treatment is practical: every concept is connected to a place where it later appears, so that the mathematics never feels like an arbitrary detour. ### 1.3 Part III: Data No model is better than the data it learns from. This part covers data representation, collection, annotation, cleaning, handling of missing values, and the treatment of time and other structured signals. It also confronts the ethical and practical realities of privacy, consent, and bias. These chapters are where the abstractions of Part II meet the messy texture of the real world. ### 1.4 Part IV: Classical Machine Learning Before neural networks dominated, a rich body of methods established the core ideas of learning from data: linear and logistic regression, decision trees, support vector machines, clustering, dimensionality reduction, and the discipline of evaluation. These methods remain the right tool for many problems, and they teach the concepts (overfitting, regularization, the bias and variance tradeoff, cross validation) that carry forward into deep learning. ### 1.5 Part V: Neural Networks This part builds the deep learning toolkit from first principles: the perceptron, the multilayer network, backpropagation, and the practical machinery of training (initialization, optimizers, normalization, regularization). It then surveys the major architectural families, including convolutional networks for images and recurrent networks for sequences. By the end you understand not only how these models work but why they are trained the way they are. ### 1.6 Part VI: The Transformer and Language Models At the center of the book sits the transformer, the architecture that reshaped the field. We derive attention carefully, build up the full encoder and decoder stacks, and then trace the path from pretraining to large language models, instruction tuning, and alignment. This part connects the mathematics, the data practices, and the neural network foundations into the systems that power today's most capable AI. ### 1.7 Part VII: Retrieval and Knowledge Graphs Language models are powerful but bounded by what they internalized during training. This part shows how to extend them with external memory: vector search, retrieval augmented generation, and knowledge graphs that encode structured relationships. Here embeddings become infrastructure, and the reader learns how to ground a model's outputs in verifiable sources. ### 1.8 Part VIII: Agents and Applications The final part moves from models to systems that act. We cover agent architectures, planning, tool use, multi-agent coordination, and the interoperability protocols that let agents communicate. We close with applications that integrate everything: building, evaluating, and deploying real systems responsibly. This is where the book delivers on its title, turning understanding into action. ## 2. How the Parts Build on Each Other The ordering is not arbitrary, and it is worth seeing the dependencies explicitly so that you can decide what to read and what to defer. The structure is best understood as a directed acyclic graph of prerequisites. An edge from one part to another means the second part assumes fluency with the first. There are no cycles, which is exactly what makes a sensible reading order possible: a topological sort of this graph is a valid sequence in which to learn the material, and the book's part numbering is one such sort. ```{mermaid} flowchart TD P1["Part I Foundations"] P2["Part II Mathematical Foundations"] P3["Part III Data"] P4["Part IV Classical Machine Learning"] P5["Part V Neural Networks"] P6["Part VI Transformers and Language Models"] P7["Part VII Retrieval and Knowledge Graphs"] P8["Part VIII Agents and Applications"] P1 --> P2 P2 --> P4 P2 --> P5 P4 --> P5 P5 --> P6 P3 --> P4 P3 --> P6 P3 --> P8 P6 --> P7 P6 --> P8 P7 --> P8 ``` Read the diagram as a set of constraints rather than a single forced route. Any ordering that respects the arrows is a legitimate path through the book. The vertical chain (Part II to Part IV to Part V to Part VI) is the load-bearing spine. Data (Part III) and retrieval (Part VII) attach to that spine at the points where they are first needed. ### 2.1 The Vertical Spine There is a vertical spine running from mathematics upward: Part II supplies the language of vectors, gradients, and probability; Part IV uses that language to define learning and evaluation; Part V generalizes learning to deep networks; Part VI specializes deep networks into transformers and language models. Each step assumes fluency with the previous one. A reader who skips the gradient material in Part II will find backpropagation in Part V opaque, and a reader who skips backpropagation will find the training dynamics of language models in Part VI mysterious. The spine is not merely a thematic grouping. Each link is a concrete technical dependency that you can name. To make this auditable rather than asserted, here are the load-bearing prerequisites along the spine. | To understand (later) | You must already command (earlier) | Why the link is hard to skip | |---|---|---| | Logistic regression, Part IV | The gradient and the chain rule, Part II | The decision boundary is fit by gradient descent on a log-loss whose derivative you must be able to take | | Backpropagation, Part V | Partial derivatives and the chain rule, Part II | Backpropagation is the chain rule applied systematically over a computation graph | | Attention, Part VI | Matrix multiplication, dot products, the softmax, Part II | An attention score is a scaled dot product passed through a softmax, then used to weight a matrix of values | | Alignment and fine tuning, Part VI | Maximum likelihood and the cross entropy objective, Part II and Part IV | Pretraining and instruction tuning both minimize a cross entropy loss, and reward modeling rests on probabilistic estimation | The pattern is that ideas do not so much get replaced as get reused at a larger scale. The chain rule you learn for a two-variable function in Part II is the same chain rule that propagates gradients through a network with billions of parameters in Part VI. Skipping the small case does not save time; it defers the cost to a moment when the same idea is buried inside far more machinery and is correspondingly harder to extract. ### 2.2 The Horizontal Supports Two parts act as horizontal supports that touch every layer. Part III (Data) is relevant whether you are fitting a logistic regression or pretraining a transformer, because the questions of representation, quality, and bias never go away. Part VII (Retrieval and Knowledge Graphs) draws on embeddings, which first appear in Part V and mature in Part VI, but it serves the application goals of Part VIII. You can think of data and retrieval as the connective tissue between the vertical spine and the systems built on top of it. ### 2.3 The Convergence Point Part VIII is where the threads converge. An agent that answers a customer's question may call a language model (Part VI), retrieve grounding documents (Part VII), apply a classifier to route the request (Part IV), and depend on clean, well represented data throughout (Part III). The applications are deliberately placed last because they presuppose the rest. Reaching them is the payoff for the climb. To make the convergence concrete, trace a single request through the stack. A user asks a support agent, "Has my refund been processed?" A lightweight intent classifier (Part IV) routes this to the billing workflow. The agent embeds the question and retrieves the user's recent transactions and the relevant policy passages (Part VII), which depend on clean, well-represented account data (Part III). It composes a prompt that interleaves the retrieved facts with the question and sends it to a language model (Part VI), whose attention mechanism (Part VI), trained by backpropagation (Part V) on a cross entropy objective (Part II), produces a grounded answer. The agent loop (Part VIII) decides whether the answer is complete or whether another tool call is needed. Every part of the book appears in that single sentence-long interaction, which is precisely why the applications come last. ## 3. Suggested Learning Paths Readers come to this book with different backgrounds and different goals. The chapters are written to support a linear reading, but few readers should feel obligated to read every section in order. Below are three paths, each tuned to a kind of reader. Treat them as starting points to adapt, not prescriptions. Each path is a different traversal of the same dependency graph from Section 2: the practitioner descends from applications and pulls in theory on demand, the researcher ascends from foundations and refuses to leave a layer until it is fully understood, and the student walks the topological order straight through. ### 3.1 The Practitioner Suppose you are an engineer who needs to ship a working system soon, and who is comfortable treating some mathematics as a tool to be trusted rather than derived. Your path emphasizes building and evaluating. Begin with Part I for context, then read Part III (Data) carefully, because data problems will dominate your real work. Skim Part II, returning to specific topics (matrix multiplication, the softmax, gradient descent) only when a later chapter sends you back. Read the practical chapters of Part V on training and the major architectures, then go deep on Part VI, Part VII, and Part VIII, which contain most of what you will use day to day. Run every code example. Your aim is fluency with the systems, and you can acquire the underlying theory opportunistically as questions arise. The risk of this path is silent failure. A system that runs and returns plausible output can still be wrong in ways that only theory exposes, such as a retrieval index that degrades on out-of-distribution queries or an evaluation metric that rewards the wrong behavior. The practitioner's discipline is to treat every unexplained result as a prompt to descend one layer, not to move on because the demo worked. ### 3.2 The Researcher Suppose you intend to extend the field rather than only apply it, and you want to understand methods well enough to question them. Your path emphasizes derivation and the boundaries of current knowledge. Read Part II thoroughly and do the proofs and derivations; they are the foundation of any original contribution. Work through Part IV completely, because the classical methods clarify ideas (generalization, regularization, the structure of optimization) that deep learning inherits but rarely re-derives. Study Part V and Part VI line by line, paying attention to the design choices and the alternatives that were not taken. In Part VII and Part VIII, read with a critical eye toward open problems: evaluation, robustness, interpretability, and coordination remain active research frontiers. Follow the references at the end of each chapter into the primary literature. The risk of this path is the opposite of the practitioner's: deep local understanding that never integrates into a working system, so that subtle implementation realities (numerical stability, data leakage, the gap between a benchmark and a deployment) stay invisible. The researcher's discipline is to build at least once what they have only derived, because an idea that has never survived contact with running code is an idea only half understood. ### 3.3 The Student Suppose you are learning the field systematically, perhaps alongside a course, and you have time to build understanding from the ground up. Your path is the linear one, and it is the path the book was structured to support. Read the parts in order. Do the exercises in each chapter before moving on, because the concepts compound and gaps left early become obstacles later. Resist the temptation to jump ahead to the exciting material in Part VI; the excitement is far greater when you can see exactly why attention works and where it came from. Keep a notebook of questions, and revisit it as you advance, since many early puzzles resolve themselves once you have the later machinery. The reward for patience is a genuinely connected mental model rather than a collection of disconnected recipes. ### 3.4 Choosing and Mixing Paths These paths are not mutually exclusive. A practitioner who hits a wall will often benefit from briefly adopting the researcher's path through a single topic. A student preparing for a job may add the practitioner's emphasis on running code. The honest advice is to read with a purpose in mind, to be willing to backtrack, and to use the dependency map in Section 2 to decide when a detour into earlier material is worth the time. The following table summarizes the three paths and the failure mode each must guard against. | Reader | Primary emphasis | Typical entry order | Characteristic risk | |---|---|---|---| | Practitioner | Build and evaluate systems | I, III, then VI through VIII, with II as reference | Silent failure that only theory would expose | | Researcher | Derive and question methods | II and IV in full, then V and VI in depth | Deep understanding that never ships as working code | | Student | Connected foundational mastery | Linear, I through VIII | Slower start, traded for a durable mental model | ## 4. How to Use the Code and Exercises This book is called "AI in Action" because understanding AI requires doing AI. Every concept that can be made concrete is accompanied by runnable code, and most chapters end with exercises that ask you to extend, break, or apply that code. ### 4.1 The Role of Code The code is not decoration, and it is not a finished library to be imported and forgotten. It is written to be read. Implementations favor clarity over performance, so that you can trace exactly how a gradient flows or how an attention score is computed. When you encounter a code listing, the most valuable thing you can do is run it, change a number, and predict the result before you observe it. The gap between your prediction and the outcome is where learning happens. Throughout, the book leans on mature, free, open-source tools rather than proprietary platforms, so that every example is reproducible without a license or an account. The core Python stack is NumPy for numerical arrays, pandas for tabular data, scikit-learn for classical machine learning, and PyTorch for deep learning, with Matplotlib for visualization. Where a chapter shows the same idea in more than one language, it uses the open Julia and Rust ecosystems for contrast. The point of this choice is durability: an example built on open tools can still be run years from now, whereas one tied to a hosted service can vanish when the service changes. ### 4.2 Kinds of Exercises The exercises fall into three rough categories. Some are confirmatory: they ask you to verify a claim made in the text, such as reproducing a derivation numerically. Some are exploratory: they ask you to vary a setting (a learning rate, a dataset size, a model depth) and characterize the effect. Some are constructive: they ask you to build something new from the components introduced in the chapter. The categories grow in difficulty and in value. The constructive exercises, though the most demanding, are the ones that most reliably turn knowledge into skill. These three categories correspond loosely to a ladder of cognitive demand, from verifying what is given, to manipulating it, to creating something new. The ordering is not accidental. A confirmatory exercise checks that you can follow a result; an exploratory one checks that you can probe its sensitivity; a constructive one checks that you can wield the underlying idea independently. Spending all your effort on the first kind feels productive and teaches little, because recognition is far easier than generation. ### 4.3 A Suggested Practice A reliable habit is to read a chapter once for the ideas, then return with an editor open and reconstruct the key code from a blank file, consulting the text only when stuck. This active recall is uncomfortable and slow, and it is far more effective than passive rereading. The effect is well documented: testing yourself on material produces stronger long-term retention than restudying it for the same amount of time, an effect known as the testing effect or retrieval practice [@roediger2006test]. Where a chapter provides a dataset, spend a few minutes simply looking at the data before modeling it, because intuition about the input is worth more than any single technique applied to it. ## 5. What You Will Be Able to Do by the End It is worth being concrete about the destination, because a clear goal makes the climb easier. The three competencies below are deliberately distinct: fluency is about reading the field, capability is about building in it, and judgment is about deciding whether what you have built should exist at all. A reader can possess any one without the others, and the book aims for all three. ### 5.1 Conceptual Fluency By the end of the book you will be able to read a contemporary research paper or system description and follow its argument. You will recognize the mathematical objects, understand the design choices, and locate the work within the broader stack. The vocabulary that may feel like jargon now will become a language you think in. ### 5.2 Practical Capability You will be able to take a real problem, frame it as a learning or reasoning task, choose appropriate methods, prepare and inspect the data, implement and train a model or assemble a system, and evaluate it honestly. You will know how to extend a language model with retrieval, how to give an agent tools and let it plan, and how to tell whether the result actually works. These are the capabilities that turn a reader into a builder. ### 5.3 Critical Judgment Perhaps most importantly, you will be able to judge. You will know the failure modes of these systems, the ways evaluation can mislead, and the ethical stakes of deploying models that affect people. You will be equipped to ask whether a given approach is appropriate, what it might be hiding, and what could go wrong. In a field that moves quickly and is prone to hype, this judgment is the most durable thing the book can offer. ## 6. A Note on the Pace of the Field Artificial intelligence advances quickly, and any book risks describing a moving target. The response of this book is to emphasize the foundations that change slowly: the mathematics, the principles of learning, the structure of the transformer, the logic of retrieval, and the architecture of agents. Specific models and benchmarks will be superseded, but the conceptual scaffolding will remain. If you understand why attention works, you will understand its successors faster. If you understand how to evaluate a system honestly, no new model will fool you for long. The goal is not to memorize the present state of the art but to develop the understanding that lets you absorb whatever comes next. It helps to separate two clocks. One is fast: leaderboards, model names, the size of the latest training run, the API that is fashionable this quarter. The other is slow: the chain rule, maximum likelihood, the bias and variance tradeoff, the attention operation, the idea of grounding a generation in retrieved evidence. The fast clock generates most of the news and almost none of the durable understanding. This book is deliberately weighted toward the slow clock, on the wager that a reader who owns the slow-moving foundations can re-derive or quickly absorb anything on the fast clock, while a reader who only memorized the fast clock is left behind with each release. When a chapter does mention a specific model or benchmark, treat it as a dated example of a durable idea, not as a fact to memorize. With the map in hand, you are ready to begin the climb. Part II awaits, where the abstract motivation of these opening chapters gives way to the precise mathematics on which everything else is built. ## Further Reading - Bishop, C. M., and Bishop, H. (2024). *Deep Learning: Foundations and Concepts*. Springer. A modern, mathematically grounded survey that complements the spine of this book. - Goodfellow, I., Bengio, Y., and Courville, A. (2016). *Deep Learning*. MIT Press. The standard reference for the neural network foundations developed in Part V. - Murphy, K. P. (2022, 2023). *Probabilistic Machine Learning* (two volumes). MIT Press. A thorough treatment of the probabilistic perspective that underlies Parts II and IV. - Jurafsky, D., and Martin, J. H. (2025 draft). *Speech and Language Processing* (3rd ed.). A comprehensive companion for the language modeling material of Part VI. - Roediger, H. L., and Karpicke, J. D. (2006). Test-enhanced learning: Taking memory tests improves long-term retention. *Psychological Science*, 17(3), 249-255. The experimental basis for the active-recall study habit recommended in Section 4. [@roediger2006test]