1 Introduction

“The real voyage of discovery consists not in seeking new landscapes but in having new eyes.” Marcel Proust

Artificial intelligence is no longer a speculative discipline at the edges of computer science. It is now part of the load-bearing structure of modern society. Search engines, recommendation systems, medical diagnostics, scientific discovery pipelines, creative tools, and large language models are all instances of the same underlying idea: we can formalize learning from data, implement it as algorithms, and deploy it as software systems that act in the world.

This book is about learning how to build such systems with mathematical clarity and engineering discipline, using Python as the main instrument. The central promise is threefold:

You will understand why the core methods of modern AI work.
You will know how to implement these methods correctly and robustly in Python.
You will be able to reason about limitations, failure modes, and tradeoffs, not only about performance on a single benchmark.

1.1 What this book is about

At the highest level, artificial intelligence in this book is framed as the study and construction of learning and decision-making systems. The two fundamental recurring questions are:

Statistical question: Given data sampled from some unknown process, how can we infer models that generalize beyond the observed samples?
Computational question: Given a model and an objective, how can we compute good approximate solutions under realistic resource constraints?

We repeatedly return to one central mathematical object:

\[ L(\theta) = \mathbb{E}_{(x, y) \sim \mathcal{D}}\left[\ell\big(f_\theta(x), y\big)\right], \]

the expected loss, also called the population risk, of a parameterized model $f_\theta$ under an unknown data distribution $\mathcal{D}$. Almost everything we do is a variant of answering the question:

How can we choose parameters $\theta$ so that $L(\theta)$ is small, even though $\mathcal{D}$ is unknown and we only observe a finite sample?

The detailed answers differ between linear models, convolutional networks, transformers, diffusion models, or reinforcement learning agents, but the skeleton remains the same. This unity is one of the guiding themes of the book.

Definition: the supervised learning problem

A supervised learning problem is specified by an input space $\mathcal{X}$, an output space $\mathcal{Y}$, a hypothesis class $\mathcal{F} = \{f_\theta : \theta \in \Theta\}$, a per-example loss $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_{\ge 0}$, and a distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$. We observe a training set $S = \{(x_i, y_i)\}_{i=1}^n$ drawn independently and identically distributed (i.i.d.) from $\mathcal{D}$. A learning algorithm is a map $\mathcal{A} : (\mathcal{X} \times \mathcal{Y})^n \to \mathcal{F}$ that turns a finite sample into a hypothesis. The goal is to choose $\mathcal{A}$ so that the returned $f_{\hat\theta} = \mathcal{A}(S)$ has small population risk $L(\hat\theta)$.

Because $\mathcal{D}$ is unknown, we cannot evaluate $L(\theta)$ directly. We instead minimize its sample average, the empirical risk

\[ \hat{L}(\theta) = \frac{1}{n} \sum_{i=1}^n \ell\big(f_\theta(x_i), y_i\big), \]

and hope that small empirical risk implies small population risk. This principle is empirical risk minimization (ERM). Its validity is not automatic. The quantity that controls whether ERM succeeds is the generalization gap,

\[ \mathrm{gap}(\theta) = L(\theta) - \hat{L}(\theta), \]

the difference between true and observed performance. A central message of statistical learning theory is that the gap can be bounded uniformly over the hypothesis class in terms of a complexity measure of $\mathcal{F}$ (for example its Vapnik-Chervonenkis dimension or Rademacher complexity) and the sample size $n$, with the bound shrinking roughly like $1/\sqrt{n}$ (Vapnik 1998; Shalev-Shwartz and Ben-David 2014). We develop these tools in Part I; here we only stress the conclusion: fitting the training data is necessary but never sufficient. The object we actually care about is the population risk.
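The gap can be made tangible with a small simulation. The sketch below is illustrative only: the synthetic distribution (uniform inputs, a linear signal plus Gaussian noise), the noise level, and all function names are assumptions chosen so that the population risk has a closed form. It fits an affine model by ERM on samples of increasing size and averages the gap $L(\hat\theta) - \hat{L}(\hat\theta)$ over repeated draws.

```python
import numpy as np

rng = np.random.default_rng(0)
SIGMA = 0.5  # noise standard deviation of the assumed synthetic distribution


def sample(n: int) -> tuple[np.ndarray, np.ndarray]:
    """Draw n i.i.d. pairs with x ~ Uniform(-1, 1) and y = 2x + 1 + noise."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = 2.0 * x + 1.0 + rng.normal(0.0, SIGMA, size=n)
    return x, y


def empirical_risk(w: float, b: float, x: np.ndarray, y: np.ndarray) -> float:
    """Mean squared error of f(x) = w x + b on the given sample."""
    return float(np.mean((w * x + b - y) ** 2))


def population_risk(w: float, b: float) -> float:
    """Exact expected squared error under the synthetic distribution.

    Uses E[x] = 0 and E[x^2] = 1/3 for x ~ Uniform(-1, 1).
    """
    return (w - 2.0) ** 2 / 3.0 + (b - 1.0) ** 2 + SIGMA**2


def erm(x: np.ndarray, y: np.ndarray) -> tuple[float, float]:
    """Empirical risk minimizer for the affine model (least squares)."""
    X = np.stack([x, np.ones_like(x)], axis=1)
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(theta[0]), float(theta[1])


mean_gap = {}
for n in (10, 100, 1000):
    gaps = []
    for _ in range(500):  # average over repeated training sets
        x, y = sample(n)
        w, b = erm(x, y)
        gaps.append(population_risk(w, b) - empirical_risk(w, b, x, y))
    mean_gap[n] = float(np.mean(gaps))
    print(f"n={n:5d}  mean gap = {mean_gap[n]:.4f}")
```

On average the gap is positive, because the training error is measured on the same data that chose $\hat\theta$, and it should shrink as $n$ grows. Individual draws can still show a negative gap by chance.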

1.1.1 Decomposing the error

It helps to know where error comes from. Fix a loss and let $f^\star$ be the Bayes-optimal predictor, the function (over all measurable functions, not just those in $\mathcal{F}$) that minimizes the population risk. Let $f_{\mathcal{F}}^\star$ minimize risk within the chosen class $\mathcal{F}$. Then the excess risk of a learned model $f_{\hat\theta}$ splits cleanly:

\[ \underbrace{L(\hat\theta) - L(f^\star)}_{\text{excess risk}} = \underbrace{\big(L(f_{\mathcal{F}}^\star) - L(f^\star)\big)}_{\text{approximation error}} + \underbrace{\big(L(\hat\theta) - L(f_{\mathcal{F}}^\star)\big)}_{\text{estimation error}}. \]

The approximation error measures how much we lose by restricting attention to $\mathcal{F}$. It shrinks as the class grows richer. The estimation error measures how much we lose by choosing $\hat\theta$ from finite data rather than knowing $\mathcal{D}$. It tends to grow as the class grows richer, because a more expressive class is easier to overfit. This tension is the bias-variance or approximation-estimation tradeoff, and balancing it (through model size, regularization, and the amount of data) recurs in every chapter. In modern overparameterized models the classical picture is refined by phenomena such as double descent (Belkin et al. 2019), which we revisit in Part I, but the decomposition above remains the right starting point.
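A quick way to see the two error terms pull in opposite directions is to fit nested polynomial classes of increasing degree to a small noisy sample. The following is a minimal sketch under stated assumptions: the target function $\sin(3x)$, the noise level, the sample sizes, and the chosen degrees are all hypothetical, and a large held-out sample stands in for the population risk.

```python
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(1)


def sample(n: int) -> tuple[np.ndarray, np.ndarray]:
    """Draw n pairs with x ~ Uniform(-1, 1) and y = sin(3x) + noise (assumed target)."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(3.0 * x) + rng.normal(0.0, 0.2, size=n)
    return x, y


def mse(coeffs: np.ndarray, x: np.ndarray, y: np.ndarray) -> float:
    """Mean squared error of the polynomial with the given coefficients."""
    return float(np.mean((P.polyval(x, coeffs) - y) ** 2))


x_train, y_train = sample(30)      # small training set
x_test, y_test = sample(20_000)    # large held-out set, proxy for population risk

results = {}
for degree in (1, 3, 5, 9, 12):
    # Least-squares fit within the class of polynomials of this degree.
    coeffs = P.polyfit(x_train, y_train, degree)
    results[degree] = (
        mse(coeffs, x_train, y_train),
        mse(coeffs, x_test, y_test),
    )
    train_err, test_err = results[degree]
    print(f"degree={degree:2d}  train={train_err:.4f}  held-out={test_err:.4f}")
```

Because the classes are nested, the training error can only decrease as the degree grows. The held-out error first drops as approximation error falls, and for the largest degrees it typically stops improving or rises as estimation error takes over.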

From a practical perspective, we also treat AI systems as software artifacts that live in real infrastructure:

They must be trained within budgets of time, energy, and money.
They must be deployed as services that can fail, degrade, drift, and require monitoring.
They must be evaluated not only on accuracy, but also on safety, fairness, robustness, privacy, and overall impact.

The book therefore covers not only mathematical foundations and algorithms, but also:

dataset curation and versioning,
experiment management,
reproducible code organization,
deployment patterns,
evaluation and monitoring.

The two views, the statistical object $L(\theta)$ and the deployed software artifact, are not separate concerns. They form a single loop. Data feeds training, training produces a model, the model is served, serving generates new data and surfaces distribution shift, and that feedback drives the next round. Holding only one half of this loop in view is a common source of systems that score well offline yet fail in production.

flowchart LR
    A["Data collection"] --> B["Training (minimize empirical risk)"]
    B --> C["Evaluation (estimate population risk)"]
    C --> D["Deployment and serving"]
    D --> E["Monitoring and drift detection"]
    E --> A
    C --> B

The arrow from monitoring back to data collection is what distinguishes a living machine learning system from a one-time fit. We return to each node of this loop in Part III.

1.2 Who this book is for

You are likely to benefit from this book if you see yourself in at least one of these profiles:

Research-oriented reader: You know that “it works” is not enough; you want to understand why it works, when it breaks, and how to propose new methods. You might be a graduate student, a postdoctoral researcher, or a scientist in industry.
Engineer building AI systems: You are responsible for running models in production, managing latency and cost constraints, designing metrics, and responding when systems misbehave. You need code that is idiomatic, testable, and maintainable.
Scientist in another field using AI as a tool: You might work in physics, biology, economics, or the humanities and use machine learning as part of your research workflow. You want to understand what your models are really doing, rather than treating them as opaque black boxes.
Advanced learner transitioning from introductory material: You have already read a standard machine learning textbook or taken a first course and now want to connect theory, code, and practical systems design in a coherent way.

This book assumes that you have:

intermediate Python programming skills,
basic familiarity with linear algebra (vectors, matrices, eigenvalues),
basic probability and statistics (random variables, expectations, variances, conditional probability),
some exposure to calculus (derivatives, gradients, simple integrals).

1.3 How this book is structured

The book is organized into three large arcs: Foundations; Modalities; and Systems, Decisions, and Society. Within each arc, chapters follow a consistent internal structure designed to tie together theory, implementation, and practice.

1.3.1 The three recurring layers in every chapter

Each technical chapter is divided into three conceptual layers.

Conceptual foundations
- Clear problem statement, including inputs, outputs, and evaluation criteria.
- Mathematical formalization, including definitions, theorems, and proofs.
- Intuitive explanations: geometric, probabilistic, or information-theoretic perspectives.
- Discussion of assumptions and failure cases.
Python implementation
- A reference implementation using modern Python, type hints, and widely used libraries.
- Emphasis on clarity and correctness instead of clever “tricks”.
- Integration points with numerical libraries such as NumPy, JAX, or PyTorch.
- Unit tests and simple benchmarks.
Line-by-line walkthrough and practice
- Explanation of each major code block and important lines.
- Analysis of computational complexity and memory usage.
- Comments on numerical stability and common implementation bugs.
- Variants and extensions, often linked to recent research papers.

A typical chapter will also include:

Worked examples on synthetic datasets to isolate specific phenomena.
Case studies on real datasets or tasks.
Exercises ranging from straightforward implementation to open-ended research directions.

This pattern is designed so that you can:

skim conceptual sections to gain orientation,
focus on the implementation layer when you want to build systems,
return to the proofs later to deepen understanding.

1.4 Overview of the three arcs

1.4.1 Part I: Foundations

Part I develops the mathematical and algorithmic scaffolding on which the rest of the book rests.

Mathematical bedrock

We revisit probability, information, and optimization from the perspective of machine learning.
- random variables, distributions, and expectations,
- the law of large numbers and concentration of measure,
- bias-variance decomposition,
- basic information measures such as entropy and mutual information,
- geometry of parameter spaces and loss surfaces, including convexity, local minima, and saddle points.
The aim is not to prove every theorem in full generality, but to provide just enough rigor that later use of these concepts is well grounded.
Optimization and numerical methods

Training most modern AI models is an optimization problem. We study:
- first-order methods such as stochastic gradient descent and its variants,
- second-order ideas such as curvature and preconditioning,
- adaptive learning rates and their tradeoffs,
- line searches and trust-region methods in small- and medium-scale problems,
- techniques for mixed-precision training and numerical stability.
We also discuss automatic differentiation, computational graphs, and practical issues such as exploding and vanishing gradients, gradient clipping, and checkpointing.
Representations and embeddings

Almost every model in this book relies on representing complex objects (words, images, audio, graphs) as vectors in relatively low-dimensional spaces. In this chapter, we study:
- linear and nonlinear feature maps,
- distributional and embedding-based representations of words and tokens,
- manifold hypotheses about high-dimensional data,
- information bottleneck ideas and compression,
- geometric properties of learned representations.
This chapter builds conceptual links that reappear when we discuss attention, contrastive learning, and generative models.

1.4.2 Part II: Modalities

Part II dives into specific data modalities and the corresponding model families.

Large language models

Here we focus on transformers for text and code.
- self-attention, positional encodings, and tokenization,
- masked language modeling and next-token prediction objectives,
- scaling laws for model size, data, and compute,
- retrieval-augmented generation,
- in-context learning and prompting strategies,
- instruction tuning and alignment.
We analyze language models both as probabilistic models of text and as components in larger systems such as agents and tools.
Vision, video, and 3D perception

Visual perception models are responsible for interpreting static images, video streams, and three-dimensional scenes.
- convolutional neural networks and their inductive biases,
- modern vision transformers and hybrid architectures,
- object detection and segmentation,
- representations of depth and geometry,
- neural radiance fields and related 3D modeling techniques,
- temporal modeling for video.
We examine tradeoffs in accuracy, latency, and memory in vision models and show how they can be adapted to new tasks.
Speech and audio modeling

This chapter focuses on waveforms and spectrograms for speech, music, and other acoustic signals.
- short-time Fourier transforms and time-frequency representations,
- autoregressive audio models,
- self-supervised pretraining for speech and audio,
- text-to-speech and speech-to-text systems,
- alignment between audio and text.
We study how the same architectural motifs, such as attention and convolution, reappear in the acoustic domain.
Multimodal fusion

Real-world data is rarely purely textual, visual, or acoustic. Multimodal models attempt to connect multiple streams.
- joint and factorized representations of multiple modalities,
- contrastive learning objectives that align encoders, for example on image-text pairs,
- cross-attention mechanisms that allow one modality to guide another,
- multimodal retrieval,
- models that generate one modality from another, such as text-to-image or text-to-video.
Generative intelligence

Generative models aim to model data distributions in a way that allows both sampling and density estimation.
- autoregressive models for text, images, and audio,
- variational autoencoders and latent variable models,
- flows and invertible architectures,
- score-based and diffusion models,
- evaluation metrics for generative models and their limitations.
We also discuss creative and scientific uses of generative models, along with their risks.

1.4.3 Part III: Systems, decisions, and society

The final arc connects modeling with decision making, systems engineering, and societal impact.

Probabilistic programming and causal modeling

Here we treat model structure and uncertainty explicitly.
- probabilistic graphical models and structured factorizations,
- variational inference and Monte Carlo methods,
- differentiable programming and probabilistic programming languages,
- basic causal concepts such as interventions and counterfactuals,
- algorithms for causal discovery under assumptions.
This chapter highlights how causal questions differ from purely predictive ones.
Sequential decision making and agents

Reinforcement learning and related paradigms concern agents that act in environments to maximize expected return.
- Markov decision processes and value functions,
- policy optimization algorithms,
- off-policy evaluation and data efficiency,
- exploration strategies,
- multi-agent interactions and game-theoretic perspectives,
- connections between supervised learning and decision transformers.
We consider both tabular settings and function approximation with deep networks.
Systems, scaling, and deployment

Even the most elegant models must eventually be implemented as systems that train and serve under cost and latency constraints.
- distributed training strategies such as data, model, and pipeline parallelism,
- hardware accelerators and their constraints,
- quantization and pruning for efficient inference,
- serving architectures for real-time and batch predictions,
- observability, logging, and monitoring,
- experimentation frameworks and A/B testing.
We emphasize reproducibility and the organization of codebases for long-lived projects.
Ethics, safety, and governance

AI systems are embedded in social, economic, and political contexts. This chapter surveys:
- bias and fairness notions and their tradeoffs,
- interpretability and explanation methods,
- robustness under distribution shift and adversarial perturbations,
- privacy-preserving training, including differential privacy and federated learning,
- regulatory frameworks and compliance.
The goal is not to provide final answers, but to equip you with a language and a toolkit for reasoning about responsibility and impact.

1.5 How to read this book

There are several productive ways to navigate the material, depending on your goals.

1.5.1 The research path

If your goal is to work on new methods or theoretical analysis:

Read Part I in order, including proofs and exercises.
For each subsequent chapter, focus on:
- the problem formulation,
- the main algorithms,
- the sections that connect to known theorems or open questions.
Reimplement algorithms from scratch in minimal frameworks, without relying on high-level libraries where possible.
Use the references to dig into original research papers.

1.5.2 The engineering path

If you work in industry building systems and need to make design decisions:

Skim the conceptual parts of Part I to refresh key ideas.
In each later chapter, pay particular attention to:
- implementation details,
- complexity and resource analysis,
- failure modes and debugging strategies.
Work through case studies and exercises that involve modifying and extending provided code.
Map presented design patterns to your own infrastructure.

1.5.3 The practitioner path

If you mainly want to use existing methods effectively:

Read the motivation and overview sections of each chapter.
Focus on code examples that show “canonical” usage patterns.
Use the exercises to practice interpreting model outputs and diagnosing basic issues.
Gradually return to more mathematical sections as specific needs arise.

You do not need to read the book strictly sequentially, but Part I provides language and tools that every later chapter uses. Skipping it entirely is not recommended unless you already have a solid mathematical and statistical background.

1.6 A small example: concept, code, explanation

To illustrate the style of the book, consider a very simple supervised learning problem. Given pairs $(x_i, y_i)$ with scalar inputs $x_i$ and outputs $y_i$, we want to fit an affine model

\[ f_\theta(x) = w x + b, \]

where $\theta = (w, b)$ are parameters. We will choose $\theta$ to minimize the mean squared error on a dataset of size $n$:

\[ \hat{L}(\theta) = \frac{1}{n} \sum_{i=1}^n (f_\theta(x_i) - y_i)^2. \]

This is the empirical risk from the definition above, with $\ell(\hat y, y) = (\hat y - y)^2$. Stacking the inputs into a design matrix $X \in \mathbb{R}^{n \times 2}$ whose $i$-th row is $(x_i, 1)$ and the targets into $y \in \mathbb{R}^n$, the empirical risk is $\hat{L}(\theta) = \tfrac{1}{n}\lVert X\theta - y\rVert_2^2$. Because this objective is convex and quadratic in $\theta$, setting its gradient to zero gives the normal equations

\[ X^\top X\,\theta = X^\top y, \]

whose solution $\hat\theta = (X^\top X)^{-1} X^\top y$ is the global minimizer whenever $X^\top X$ is invertible. This is one of the rare cases in the book where the optimum is available in closed form. Most later models replace this single linear solve with iterative stochastic optimization, but the logic (write down a risk, then minimize it) is identical.
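For readers who want the intermediate step, the normal equations follow from expanding the quadratic and differentiating. This is a standard derivation, written here in the notation already introduced:

\[ \hat{L}(\theta) = \frac{1}{n}\big(\theta^\top X^\top X\,\theta - 2\,y^\top X\,\theta + y^\top y\big), \qquad \nabla_\theta \hat{L}(\theta) = \frac{2}{n}\big(X^\top X\,\theta - X^\top y\big) = \frac{2}{n}\,X^\top (X\theta - y). \]

Setting $\nabla_\theta \hat{L}(\theta) = 0$ gives $X^\top X\,\theta = X^\top y$. The Hessian is $\nabla_\theta^2 \hat{L}(\theta) = \tfrac{2}{n} X^\top X$, which is positive semidefinite for every $X$, so any stationary point is a global minimizer. It is positive definite, and the minimizer unique, exactly when $X$ has full column rank, which is the same condition as $X^\top X$ being invertible.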

In Python, we might write:

from dataclasses import dataclass
from typing import Sequence

import numpy as np

Array = np.ndarray


@dataclass
class LinearModel:
    """Simple scalar-input linear regression model f(x) = w * x + b."""
    w: float
    b: float

    def __call__(self, x: Array) -> Array:
        x = np.asarray(x, dtype=float)
        return self.w * x + self.b


def mean_squared_error(y_pred: Array, y_true: Array) -> float:
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    residuals = y_pred - y_true
    return float(np.mean(residuals ** 2))


def fit_linear_regression(
    xs: Sequence[float],
    ys: Sequence[float],
) -> LinearModel:
    """Fit w and b by solving the normal equations."""
    x = np.asarray(xs, dtype=float)
    y = np.asarray(ys, dtype=float)

    # Design matrix with a column of ones for the bias term.
    X = np.stack([x, np.ones_like(x)], axis=1)

    # Solve (X^T X) theta = X^T y for theta = [w, b].
    XtX = X.T @ X
    Xty = X.T @ y

    theta = np.linalg.solve(XtX, Xty)
    # Convert NumPy scalars to plain floats to match the dataclass annotations.
    w, b = float(theta[0]), float(theta[1])
    return LinearModel(w=w, b=b)


def evaluate_model(
    model: LinearModel,
    xs: Sequence[float],
    ys: Sequence[float],
) -> float:
    preds = model(xs)
    return mean_squared_error(preds, ys)

Even this basic code illustrates several recurring themes in the book:

Explicit parameterization: The model is a small data class with named parameters w and b. Throughout the book, we favor explicit parameterization and clear naming over anonymous layers or magical factory functions.
Separation of concerns: The loss function mean_squared_error is independent of the model class. This mirrors how, in more complex models, we separate architecture, objective, and optimization.
Matrix formulation: The function fit_linear_regression uses the design matrix $X$ and solves the normal equations. This highlights the connection between least squares and linear algebra, which generalizes directly to higher dimensions.
Numerical considerations: For this tiny example, directly solving $X^\top X\theta = X^\top y$ is adequate. For larger or ill-conditioned problems, forming $X^\top X$ squares the condition number of $X$ and amplifies rounding error, so we prefer alternative methods such as QR factorization, the singular value decomposition, or iterative optimization. The free, open-source NumPy and SciPy stack provides all of these (for instance numpy.linalg.lstsq, which uses an SVD-based solver) and we lean on such mature libraries throughout.
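The effect on the condition number is easy to check numerically. The sketch below is an illustration under assumptions, not part of the reference implementation: the inputs are deliberately placed far from zero (a hypothetical range of 100 to 101) so that the input column and the column of ones are nearly parallel, and the variable names are ours. It compares the condition numbers of $X$ and $X^\top X$, then solves the same least squares problem with numpy.linalg.lstsq and with a QR factorization.

```python
import numpy as np

rng = np.random.default_rng(2)

n = 200
# Inputs far from zero make the x column and the ones column nearly collinear.
x = rng.uniform(100.0, 101.0, size=n)
y = 3.0 * x - 5.0 + rng.normal(0.0, 0.1, size=n)
X = np.stack([x, np.ones_like(x)], axis=1)

# 2-norm condition numbers: forming the Gram matrix squares the conditioning.
cond_X = np.linalg.cond(X)
cond_XtX = np.linalg.cond(X.T @ X)
print(f"cond(X)     = {cond_X:.3e}")
print(f"cond(X^T X) = {cond_XtX:.3e}")

# SVD-based least squares, working on X directly.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# QR-based least squares: X = QR, then solve R theta = Q^T y.
Q, R = np.linalg.qr(X)
theta_qr = np.linalg.solve(R, Q.T @ y)

print("lstsq:", theta_lstsq)
print("QR:   ", theta_qr)
```

Both solvers avoid forming $X^\top X$ and should agree closely. A simple remedy for this particular source of ill-conditioning is to center the inputs before fitting, which makes the two columns orthogonal.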

When to use, and pitfalls

The closed-form least squares solution is the right tool when the model is linear in its parameters, the loss is squared error, the number of features is modest, and $X^\top X$ is well conditioned. Reach for iterative or regularized methods instead when any of these break down. Common pitfalls, all revisited in later chapters, are: reporting the training error $\hat{L}(\hat\theta)$ as if it were the population risk $L(\hat\theta)$, which is optimistically biased; a near-singular $X^\top X$ caused by collinear features; and silent extrapolation when test inputs fall outside the range of the training inputs.
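The collinearity pitfall can be reproduced in a few lines. In this hedged sketch the data are synthetic and the second feature is constructed as an exact multiple of the first, an assumption made purely to force rank deficiency. The design matrix then has rank 2 rather than 3, $X^\top X$ is singular, and the individual weights are not identified, although the fitted values still are.

```python
import numpy as np

rng = np.random.default_rng(3)

n = 50
x1 = rng.normal(size=n)
x2 = 2.0 * x1  # exact multiple of x1: perfectly collinear by construction
X = np.stack([x1, x2, np.ones(n)], axis=1)
y = 1.5 * x1 - 0.5 + rng.normal(0.0, 0.1, size=n)

# Three columns, but only two linearly independent directions.
rank = np.linalg.matrix_rank(X)
print("rank of X:", rank)

# lstsq reports the numerical rank and returns the minimum-norm solution.
theta, _, lstsq_rank, _ = np.linalg.lstsq(X, y, rcond=None)
print("minimum-norm theta:", theta)

# Dropping the redundant column gives an identified problem with the same fit.
X_reduced = X[:, [0, 2]]
theta_reduced, *_ = np.linalg.lstsq(X_reduced, y, rcond=None)
print("reduced theta:     ", theta_reduced)

# Only the combination theta[0] + 2 * theta[1] is determined by the data.
effective_slope = theta[0] + 2.0 * theta[1]
print("effective slope on x1:", effective_slope)
```

The predictions of the two fits coincide, which is why collinearity often goes unnoticed until someone tries to interpret the individual coefficients or solve the normal equations directly.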

As chapters progress, this pattern scales up: from linear models to deep networks, from closed-form solutions to stochastic optimization, from direct calls to numpy.linalg to full training loops with automatic differentiation and distributed computation.

1.7 Notation and conventions

To reduce cognitive load, we try to maintain consistent notation across the book. The following table collects some of the most frequently used symbols.

Symbol	Meaning
$x$	Input example (feature vector, token sequence, etc.)
$y$	Target output (label, regression target, next token)
$\mathcal{D}$	Data distribution over pairs $(x, y)$
$(x_i, y_i)$	The i-th training example
$n$	Number of training examples
$f_\theta$	Model with parameters $\theta$
$\ell(\cdot, \cdot)$	Loss function for a single example
$L(\theta)$	Population (expected) loss
$\hat{L}(\theta)$	Empirical loss on a finite dataset
$\eta$	Learning rate or step size
$\nabla_\theta$	Gradient with respect to parameters

When we discuss vectors and matrices, we generally use bold letters in prose (for example, “vector w” and “matrix X”) and rely on context, rather than heavy typographical conventions.

Python code follows a few stylistic conventions that you will see repeatedly:

Pure functions with type hints wherever possible.
Small classes that encapsulate well-defined state and behavior.
Separation between library-style code and experiment scripts.
Use of tests to pin down edge cases.

1.8 Reproducibility and execution model

All examples in this book are designed to run as part of a literate programming workflow. The source for this text is written in a format that allows:

rendering as a book with narrative, equations, and figures,
executing code blocks in order, with their outputs captured,
exporting code to separate scripts or notebooks.

In practice, this means:

Each chapter corresponds to a file that can be executed in a notebook environment.
Code examples are self-contained and include imports and definitions.
When an example relies on substantial external data or models, we clearly describe how to obtain them or provide synthetic alternatives.

You are encouraged to treat the book not merely as reading material, but as a collection of experiments that you can run, modify, and extend.

1.9 Historical context in one page

The technical details in later chapters sit within a history that is useful to keep in mind. Very briefly:

Early work in symbolic AI focused on explicit rules and logic.
Statistical learning theory introduced the idea of learning from samples with guarantees about generalization.
Neural networks evolved from simple perceptrons to deep architectures capable of approximating highly complex functions.
Advances in hardware and software stacks enabled training models with billions or trillions of parameters.
Scalability, data availability, and algorithmic innovations combined to produce current families of models such as large language models and diffusion generators.

Throughout the book, we occasionally pause to connect a method to its historical roots. Understanding where ideas come from often clarifies where they might go next.

1.10 What you will be able to do by the end

By the time you reach the end of the book, you should be able to:

Design and analyze learning algorithms for new problems, starting from clear formal problem statements.
Implement a wide range of models and training procedures from first principles in Python.
Critically evaluate the strengths and limitations of standard architectures and techniques.
Build and maintain systems that train, deploy, monitor, and update models in realistic environments.
Engage with the research literature with enough background to understand both technical details and broader implications.
Contribute to conversations about the ethical and societal aspects of AI with an informed technical perspective.

Most importantly, you will have a mental framework in which new developments fit naturally, rather than appearing as isolated tricks. The field will continue to change rapidly, but the underlying principles in probability, optimization, representation, and systems design will remain highly stable.

Artificial intelligence is both a scientific discipline and an engineering craft. It sits at the intersection of mathematics, code, and human values. This book invites you into that intersection. The next chapters build the foundations; from there, we will gradually climb toward current frontiers, always with one eye on theory and the other on implementation.

Let us begin!

# Introduction ![](images/intro/logo.png) > "The real voyage of discovery consists not in seeking new landscapes but in having new eyes." Marcel Proust ------------------------------------------------------------------------ Artificial intelligence is no longer a speculative discipline at the edges of computer science. It is now part of the load-bearing structure of modern society. Search engines, recommendation systems, medical diagnostics, scientific discovery pipelines, creative tools, and large language models are all instances of the same underlying idea: we can formalize learning from data, implement it as algorithms, and deploy it as software systems that act in the world. This book is about learning how to build such systems with mathematical clarity and engineering discipline, using Python as the main instrument. The central promise is threefold: 1. You will understand why the core methods of modern AI work. 2. You will know how to implement these methods correctly and robustly in Python. 3. You will be able to reason about limitations, failure modes, and tradeoffs, not only about performance on a single benchmark. ------------------------------------------------------------------------ ## What this book is about At the highest level, artificial intelligence in this book is framed as the study and construction of **learning and decision-making systems**. The two fundamental recurring questions are: - **Statistical question**: Given data sampled from some unknown process, how can we infer models that generalize beyond the observed samples? - **Computational question**: Given a model and an objective, how can we compute good approximate solutions under realistic resource constraints? 
We repeatedly return to one central mathematical object: $$ L(\theta) = \mathbb{E}_{(x, y) \sim \mathcal{D}}\left[\ell\big(f_\theta(x), y\big)\right], $$ the **expected loss**, also called the **population risk**, of a parameterized model $f_\theta$ under an unknown data distribution $\mathcal{D}$. Almost everything we do is a variant of answering the question: > How can we choose parameters $\theta$ so that $L(\theta)$ is small, even though $\mathcal{D}$ is unknown and we only observe a finite sample? The detailed answers differ between linear models, convolutional networks, transformers, diffusion models, or reinforcement learning agents, but the skeleton remains the same. This unity is one of the guiding themes of the book. ::: {.callout-note title="Definition: the supervised learning problem"} A **supervised learning problem** is specified by an input space $\mathcal{X}$, an output space $\mathcal{Y}$, a hypothesis class $\mathcal{F} = \{f_\theta : \theta \in \Theta\}$, a per-example loss $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_{\ge 0}$, and a distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$. We observe a training set $S = \{(x_i, y_i)\}_{i=1}^n$ drawn independently and identically distributed (i.i.d.) from $\mathcal{D}$. A **learning algorithm** is a map $\mathcal{A} : (\mathcal{X} \times \mathcal{Y})^n \to \mathcal{F}$ that turns a finite sample into a hypothesis. The goal is to choose $\mathcal{A}$ so that the returned $f_{\hat\theta} = \mathcal{A}(S)$ has small population risk $L(\hat\theta)$. ::: Because $\mathcal{D}$ is unknown, we cannot evaluate $L(\theta)$ directly. We instead minimize its sample average, the **empirical risk** $$ \hat{L}(\theta) = \frac{1}{n} \sum_{i=1}^n \ell\big(f_\theta(x_i), y_i\big), $$ and hope that small empirical risk implies small population risk. This principle is **empirical risk minimization (ERM)**. Its validity is not automatic. 
The quantity that controls whether ERM succeeds is the **generalization gap**, $$ \mathrm{gap}(\theta) = L(\theta) - \hat{L}(\theta), $$ the difference between true and observed performance. A central message of statistical learning theory is that the gap can be bounded uniformly over the hypothesis class in terms of a complexity measure of $\mathcal{F}$ (for example its Vapnik Chervonenkis dimension or Rademacher complexity) and the sample size $n$, with the bound shrinking roughly like $1/\sqrt{n}$ [@vapnik1998statistical; @shalevshwartz2014understanding]. We develop these tools in Part I; here we only stress the conclusion: **fitting the training data is necessary but never sufficient. The object we actually care about is the population risk.** ### Decomposing the error It helps to know where error comes from. Fix a loss and let $f^\star$ be the **Bayes-optimal predictor**, the function (over all measurable functions, not just those in $\mathcal{F}$) that minimizes the population risk. Let $f_{\mathcal{F}}^\star$ minimize risk within the chosen class $\mathcal{F}$. Then the excess risk of a learned model $f_{\hat\theta}$ splits cleanly: $$ \underbrace{L(\hat\theta) - L(f^\star)}_{\text{excess risk}} = \underbrace{\big(L(f_{\mathcal{F}}^\star) - L(f^\star)\big)}_{\text{approximation error}} + \underbrace{\big(L(\hat\theta) - L(f_{\mathcal{F}}^\star)\big)}_{\text{estimation error}}. $$ The **approximation error** measures how much we lose by restricting attention to $\mathcal{F}$. It shrinks as the class grows richer. The **estimation error** measures how much we lose by choosing $\hat\theta$ from finite data rather than knowing $\mathcal{D}$. It tends to grow as the class grows richer, because a more expressive class is easier to overfit. This tension is the **bias variance** or **approximation estimation** tradeoff, and balancing it (through model size, regularization, and the amount of data) recurs in every chapter. 
In modern overparameterized models the classical picture is refined by phenomena such as double descent [@belkin2019reconciling], which we revisit in Part I, but the decomposition above remains the right starting point.

From a practical perspective, we also treat AI systems as **software artifacts** that live in real infrastructure:

- They must be trained within budgets of time, energy, and money.
- They must be deployed as services that can fail, degrade, drift, and require monitoring.
- They must be evaluated not only on accuracy, but also on safety, fairness, robustness, privacy, and overall impact.

The book therefore covers not only mathematical foundations and algorithms, but also:

- dataset curation and versioning,
- experiment management,
- reproducible code organization,
- deployment patterns,
- evaluation and monitoring.

The two views, the statistical object $L(\theta)$ and the deployed software artifact, are not separate concerns. They form a single loop. Data feeds training, training produces a model, the model is served, serving generates new data and surfaces distribution shift, and that feedback drives the next round. Holding only one half of this loop in view is a common source of systems that score well offline yet fail in production.

```{mermaid}
flowchart LR
  A["Data collection"] --> B["Training (minimize empirical risk)"]
  B --> C["Evaluation (estimate population risk)"]
  C --> D["Deployment and serving"]
  D --> E["Monitoring and drift detection"]
  E --> A
  C --> B
```

The arrow from monitoring back to data collection is what distinguishes a living machine learning system from a one-time fit. We return to each node of this loop in Part III.

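
As one concrete illustration of the monitoring node, the sketch below compares a feature's training-time sample with a recent serving-time sample using the two-sample Kolmogorov-Smirnov statistic, the largest gap between the two empirical distribution functions. The function name `ks_statistic`, the synthetic Gaussian data, and the alert threshold of 0.1 are hypothetical choices made for this example. A real system would calibrate its thresholds and would usually watch many features as well as the model's outputs.

```python
import numpy as np


def ks_statistic(reference: np.ndarray, current: np.ndarray) -> float:
    """Two-sample Kolmogorov-Smirnov statistic: max |F_ref(t) - F_cur(t)|."""
    reference = np.sort(np.asarray(reference, dtype=float))
    current = np.sort(np.asarray(current, dtype=float))
    # The largest gap between two empirical CDFs is attained at a data point,
    # so it suffices to evaluate both CDFs on the pooled sample.
    grid = np.concatenate([reference, current])
    cdf_ref = np.searchsorted(reference, grid, side="right") / reference.size
    cdf_cur = np.searchsorted(current, grid, side="right") / current.size
    return float(np.max(np.abs(cdf_ref - cdf_cur)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    training_feature = rng.normal(0.0, 1.0, size=5_000)
    same_distribution = rng.normal(0.0, 1.0, size=5_000)
    shifted = rng.normal(0.5, 1.0, size=5_000)  # simulated drift in the mean

    threshold = 0.1  # hypothetical alert level, not a calibrated value
    for name, sample in (("no shift", same_distribution), ("mean shift", shifted)):
        stat = ks_statistic(training_feature, sample)
        print(f"{name}: KS = {stat:.3f}, alert = {stat > threshold}")
```

A check of this kind only flags a change in the input distribution. Deciding whether that change hurts the population risk still requires labeled data or a proxy metric, which is why monitoring feeds back into data collection in the loop above.
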

------------------------------------------------------------------------

## Who this book is for

You are likely to benefit from this book if you see yourself in at least one of these profiles:

- **Research-oriented reader**

  You know that "it works" is not enough; you want to understand why it works, when it breaks, and how to propose new methods. You might be a graduate student, a postdoctoral researcher, or a scientist in industry.

- **Engineer building AI systems**

  You are responsible for running models in production, managing latency and cost constraints, designing metrics, and responding when systems misbehave. You need code that is idiomatic, testable, and maintainable.

- **Scientist in another field using AI as a tool**

  You might work in physics, biology, economics, or the humanities and use machine learning as part of your research workflow. You want to understand what your models are really doing, rather than treating them as opaque black boxes.

- **Advanced learner transitioning from introductory material**

  You have already read a standard machine learning textbook or taken a first course and now want to connect theory, code, and practical systems design in a coherent way.

This book assumes that you have:

- intermediate Python programming skills,
- basic familiarity with linear algebra (vectors, matrices, eigenvalues),
- basic probability and statistics (random variables, expectations, variances, conditional probability),
- some exposure to calculus (derivatives, gradients, simple integrals).

------------------------------------------------------------------------

## How this book is structured

The book is organized into three large arcs: **Foundations**, **Modalities**, and **Systems, Decisions, and Society**. Within each arc, chapters follow a consistent internal structure designed to tie together theory, implementation, and practice.

### The three recurring layers in every chapter

Each technical chapter is divided into three conceptual layers.

1.
    **Conceptual foundations**

    - Clear problem statement, including inputs, outputs, and evaluation criteria.
    - Mathematical formalization, including definitions, theorems, and proofs.
    - Intuitive explanations: geometric, probabilistic, or information-theoretic perspectives.
    - Discussion of assumptions and failure cases.

2. **Python implementation**

    - A reference implementation using modern Python, type hints, and widely used libraries.
    - Emphasis on clarity and correctness instead of clever "tricks".
    - Integration points with numerical libraries such as NumPy, JAX, or PyTorch.
    - Unit tests and simple benchmarks.

3. **Line-by-line walkthrough and practice**

    - Explanation of each major code block and important lines.
    - Analysis of computational complexity and memory usage.
    - Comments on numerical stability and common implementation bugs.
    - Variants and extensions, often linked to recent research papers.

A typical chapter will also include:

- **Worked examples** on synthetic datasets to isolate specific phenomena.
- **Case studies** on real datasets or tasks.
- **Exercises** ranging from straightforward implementation to open-ended research directions.

This pattern is designed so that you can:

- skim conceptual sections to gain orientation,
- focus on the implementation layer when you want to build systems,
- return to the proofs later to deepen understanding.

------------------------------------------------------------------------

## Overview of the three arcs

### Part I: Foundations

Part I develops the mathematical and algorithmic scaffolding on which the rest of the book rests.

1. **Mathematical bedrock**

    We revisit probability, information, and optimization from the perspective of machine learning.


    - random variables, distributions, and expectations,
    - the law of large numbers and concentration of measure,
    - bias-variance decomposition,
    - basic information measures such as entropy and mutual information,
    - geometry of parameter spaces and loss surfaces, including convexity, local minima, and saddle points.

    The aim is not to prove every theorem in full generality, but to provide just enough rigor that later use of these concepts is well grounded.

2. **Optimization and numerical methods**

    Training most modern AI models is an optimization problem. We study:

    - first-order methods such as stochastic gradient descent and its variants,
    - second-order ideas such as curvature and preconditioning,
    - adaptive learning rates and their tradeoffs,
    - line searches and trust-region methods in small- and medium-scale problems,
    - techniques for mixed-precision training and numerical stability.

    We also discuss automatic differentiation, computational graphs, and practical issues such as exploding and vanishing gradients, gradient clipping, and checkpointing.

3. **Representations and embeddings**

    Almost every model in this book relies on representing complex objects (words, images, audio, graphs) as vectors in relatively low-dimensional spaces. In this chapter, we study:

    - linear and nonlinear feature maps,
    - distributional and embedding-based representations of words and tokens,
    - manifold hypotheses about high-dimensional data,
    - information bottleneck ideas and compression,
    - geometric properties of learned representations.

    This chapter builds conceptual links that reappear when we discuss attention, contrastive learning, and generative models.

### Part II: Modalities

Part II dives into specific data modalities and the corresponding model families.

4. **Large language models**

    Here we focus on transformers for text and code.


    - self-attention, positional encodings, and tokenization,
    - masked language modeling and next-token prediction objectives,
    - scaling laws for model size, data, and compute,
    - retrieval-augmented generation,
    - in-context learning and prompting strategies,
    - instruction tuning and alignment.

    We analyze language models both as probabilistic models of text and as components in larger systems such as agents and tools.

5. **Vision, video, and 3D perception**

    Visual perception models are responsible for interpreting static images, video streams, and three-dimensional scenes.

    - convolutional neural networks and their inductive biases,
    - modern vision transformers and hybrid architectures,
    - object detection and segmentation,
    - representations of depth and geometry,
    - neural radiance fields and related 3D modeling techniques,
    - temporal modeling for video.

    We examine tradeoffs in accuracy, latency, and memory in vision models and show how they can be adapted to new tasks.

6. **Speech and audio modeling**

    This chapter focuses on waveforms and spectrograms for speech, music, and other acoustic signals.

    - short-time Fourier transforms and time-frequency representations,
    - autoregressive audio models,
    - self-supervised pretraining for speech and audio,
    - text-to-speech and speech-to-text systems,
    - alignment between audio and text.

    We study how the same architectural motifs such as attention and convolution reappear in the acoustic domain.

7. **Multimodal fusion**

    Real-world data is rarely purely textual, visual, or acoustic. Multimodal models attempt to connect multiple streams.

    - joint and factorized representations of multiple modalities,
    - contrastive learning objectives that align encoders, for example image-text pairs,
    - cross-attention mechanisms that allow one modality to guide another,
    - multimodal retrieval,
    - models that generate one modality from another, such as text-to-image or text-to-video.

8.
    **Generative intelligence**

    Generative models aim to model data distributions in a way that allows both sampling and density estimation.

    - autoregressive models for text, images, and audio,
    - variational autoencoders and latent variable models,
    - flows and invertible architectures,
    - score-based and diffusion models,
    - evaluation metrics for generative models and their limitations.

    We also discuss creative and scientific uses of generative models, along with their risks.

### Part III: Systems, decisions, and society

The final arc connects modeling with decision making, systems engineering, and societal impact.

9. **Probabilistic programming and causal modeling**

    Here we treat model structure and uncertainty explicitly.

    - probabilistic graphical models and structured factorizations,
    - variational inference and Monte Carlo methods,
    - differentiable programming and probabilistic programming languages,
    - basic causal concepts such as interventions and counterfactuals,
    - algorithms for causal discovery under assumptions.

    This chapter highlights how causal questions differ from purely predictive ones.

10. **Sequential decision making and agents**

    Reinforcement learning and related paradigms concern agents that act in environments to maximize expected return.

    - Markov decision processes and value functions,
    - policy optimization algorithms,
    - off-policy evaluation and data efficiency,
    - exploration strategies,
    - multi-agent interactions and game-theoretic perspectives,
    - connections between supervised learning and decision transformers.

    We consider both tabular settings and function approximation with deep networks.

11. **Systems, scaling, and deployment**

    Even the most elegant models must eventually be implemented as systems that train and serve under cost and latency constraints.


    - distributed training strategies such as data, model, and pipeline parallelism,
    - hardware accelerators and their constraints,
    - quantization and pruning for efficient inference,
    - serving architectures for real-time and batch predictions,
    - observability, logging, and monitoring,
    - experimentation frameworks and A/B testing.

    We emphasize reproducibility and the organization of codebases for long-lived projects.

12. **Ethics, safety, and governance**

    AI systems are embedded in social, economic, and political contexts. This chapter surveys:

    - bias and fairness notions and their tradeoffs,
    - interpretability and explanation methods,
    - robustness under distribution shift and adversarial perturbations,
    - privacy-preserving training, including differential privacy and federated learning,
    - regulatory frameworks and compliance.

    The goal is not to provide final answers, but to equip you with a language and a toolkit for reasoning about responsibility and impact.

------------------------------------------------------------------------

## How to read this book

There are several productive ways to navigate the material, depending on your goals.

### The research path

If your goal is to work on new methods or theoretical analysis:

1. Read Part I in order, including proofs and exercises.
2. For each subsequent chapter, focus on:
    - the problem formulation,
    - the main algorithms,
    - the sections that connect to known theorems or open questions.
3. Reimplement algorithms from scratch in minimal frameworks, without relying on high-level libraries where possible.
4. Use the references to dig into original research papers.

### The engineering path

If you work in industry building systems and need to make design decisions:

1. Skim the conceptual parts of Part I to refresh key ideas.
2. In each later chapter, pay particular attention to:
    - implementation details,
    - complexity and resource analysis,
    - failure modes and debugging strategies.
3.
    Work through case studies and exercises that involve modifying and extending provided code.
4. Map presented design patterns to your own infrastructure.

### The practitioner path

If you mainly want to use existing methods effectively:

1. Read the motivation and overview sections of each chapter.
2. Focus on code examples that show "canonical" usage patterns.
3. Use the exercises to practice interpreting model outputs and diagnosing basic issues.
4. Gradually return to more mathematical sections as specific needs arise.

You do not need to read the book strictly sequentially, but Part I provides language and tools that every later chapter uses. Skipping it entirely is not recommended unless you already have a solid mathematical and statistical background.

------------------------------------------------------------------------

## A small example: concept, code, explanation

To illustrate the style of the book, consider a very simple supervised learning problem. Given pairs $(x_i, y_i)$ with scalar inputs $x_i$ and outputs $y_i$, we want to fit an affine model

$$
f_\theta(x) = w x + b,
$$

where $\theta = (w, b)$ are the parameters. We will choose $\theta$ to minimize the mean squared error on a dataset of size $n$:

$$
\hat{L}(\theta) = \frac{1}{n} \sum_{i=1}^n (f_\theta(x_i) - y_i)^2.
$$

This is the empirical risk from the definition above, with $\ell(\hat y, y) = (\hat y - y)^2$. Stacking the inputs into a design matrix $X \in \mathbb{R}^{n \times 2}$ whose $i$-th row is $(x_i, 1)$ and the targets into $y \in \mathbb{R}^n$, the empirical risk is $\hat{L}(\theta) = \tfrac{1}{n}\lVert X\theta - y\rVert_2^2$. Because this objective is convex and quadratic in $\theta$, setting its gradient to zero gives the **normal equations**

$$
X^\top X\,\theta = X^\top y,
$$

whose solution $\hat\theta = (X^\top X)^{-1} X^\top y$ is the global minimizer whenever $X^\top X$ is invertible, which here means that the inputs $x_i$ take at least two distinct values. This is one of the rare cases in the book where the optimum is available in closed form.
Most later models replace this single linear solve with iterative stochastic optimization, but the logic (write down a risk, then minimize it) is identical.

In Python, we might write:

``` python
from dataclasses import dataclass
from typing import Sequence

import numpy as np

Array = np.ndarray


@dataclass
class LinearModel:
    """Simple scalar-input linear regression model f(x) = w * x + b."""

    w: float
    b: float

    def __call__(self, x: Array) -> Array:
        # Accept any array-like input and evaluate the model elementwise.
        x = np.asarray(x, dtype=float)
        return self.w * x + self.b


def mean_squared_error(y_pred: Array, y_true: Array) -> float:
    """Average squared difference between predictions and targets."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    residuals = y_pred - y_true
    return float(np.mean(residuals ** 2))


def fit_linear_regression(
    xs: Sequence[float],
    ys: Sequence[float],
) -> LinearModel:
    """Fit w and b by solving the normal equations."""
    x = np.asarray(xs, dtype=float)
    y = np.asarray(ys, dtype=float)

    # Design matrix with a column of ones for the bias term.
    X = np.stack([x, np.ones_like(x)], axis=1)

    # Solve (X^T X) theta = X^T y for theta = [w, b].
    XtX = X.T @ X
    Xty = X.T @ y
    theta = np.linalg.solve(XtX, Xty)

    # Convert NumPy scalars to plain floats to match the dataclass annotations.
    w, b = float(theta[0]), float(theta[1])
    return LinearModel(w=w, b=b)


def evaluate_model(
    model: LinearModel,
    xs: Sequence[float],
    ys: Sequence[float],
) -> float:
    """Mean squared error of the model on the pairs (xs, ys)."""
    preds = model(xs)
    return mean_squared_error(preds, ys)
```

Even this basic code illustrates several recurring themes in the book:

- **Explicit parameterization**

  The model is a small data class with named parameters `w` and `b`. Throughout the book, we favor explicit parameterization and clear naming over anonymous layers or magical factory functions.

- **Separation of concerns**

  The loss function `mean_squared_error` is independent of the model class. This mirrors how, in more complex models, we separate architecture, objective, and optimization.

- **Matrix formulation**

  The function `fit_linear_regression` uses the design matrix $X$ and solves the normal equations.
  This highlights the connection between least squares and linear algebra, which generalizes directly to higher dimensions.

- **Numerical considerations**

  For this tiny example, directly solving $X^\top X\theta = X^\top y$ is adequate. For larger or ill-conditioned problems, forming $X^\top X$ squares the condition number of $X$ and amplifies rounding error, so we prefer alternative methods such as QR factorization, the singular value decomposition, or iterative optimization. The free, open-source NumPy and SciPy stack provides all of these (for instance `numpy.linalg.lstsq`, which uses an SVD-based solver) and we lean on such mature libraries throughout.

::: {.callout-tip title="When to use, and pitfalls"}
The closed-form least squares solution is the right tool when the model is linear in its parameters, the loss is squared error, the number of features is modest, and $X^\top X$ is well conditioned. Reach for iterative or regularized methods instead when any of these break down. Common pitfalls, all revisited in later chapters, are: reporting the training error $\hat{L}(\hat\theta)$ as if it were the population risk $L(\hat\theta)$, which is optimistically biased; a near-singular $X^\top X$ caused by collinear features; and silent extrapolation when test inputs fall outside the range of the training inputs.
:::

As chapters progress, this pattern scales up: from linear models to deep networks, from closed-form solutions to stochastic optimization, from direct calls to `numpy.linalg` to full training loops with automatic differentiation and distributed computation.

------------------------------------------------------------------------

## Notation and conventions

To reduce cognitive load, we try to maintain consistent notation across the book. The following table collects some of the most frequently used symbols.


| Symbol               | Meaning                                              |
|:---------------------|:-----------------------------------------------------|
| $x$                  | Input example (feature vector, token sequence, etc.) |
| $y$                  | Target output (label, regression target, next token) |
| $\mathcal{D}$        | Data distribution over pairs $(x, y)$                |
| $(x_i, y_i)$         | The $i$-th training example                          |
| $n$                  | Number of training examples                          |
| $f_\theta$           | Model with parameters $\theta$                       |
| $\ell(\cdot, \cdot)$ | Loss function for a single example                   |
| $L(\theta)$          | Population (expected) loss                           |
| $\hat{L}(\theta)$    | Empirical loss on a finite dataset                   |
| $\eta$               | Learning rate or step size                           |
| $\nabla_\theta$      | Gradient with respect to parameters                  |

When we discuss vectors and matrices, we generally use bold letters in prose (for example, "vector **w**" and "matrix **X**") and rely on context, rather than heavy typographical conventions.

Python code follows a few stylistic conventions that you will see repeatedly:

- Pure functions with type hints wherever possible.
- Small classes that encapsulate well-defined state and behavior.
- Separation between library-style code and experiment scripts.
- Use of tests to pin down edge cases.

------------------------------------------------------------------------

## Reproducibility and execution model

All examples in this book are designed to run as part of a literate programming workflow. The source for this text is written in a format that allows:

- rendering as a book with narrative, equations, and figures,
- executing code blocks in order, with their outputs captured,
- exporting code to separate scripts or notebooks.

In practice, this means:

- Each chapter corresponds to a file that can be executed in a notebook environment.
- Code examples are self-contained and include imports and definitions.
- When an example relies on substantial external data or models, we clearly describe how to obtain them or provide synthetic alternatives.


You are encouraged to treat the book not merely as reading material, but as a collection of experiments that you can run, modify, and extend.

------------------------------------------------------------------------

## Historical context in one page

The technical details in later chapters sit within a history that is useful to keep in mind. Very briefly:

- Early work in symbolic AI focused on explicit rules and logic.
- Statistical learning theory introduced the idea of learning from samples with guarantees about generalization.
- Neural networks evolved from simple perceptrons to deep architectures capable of approximating highly complex functions.
- Advances in hardware and software stacks enabled training models with billions or trillions of parameters.
- Scalability, data availability, and algorithmic innovations combined to produce current families of models such as large language models and diffusion generators.

Throughout the book, we occasionally pause to connect a method to its historical roots. Understanding where ideas come from often clarifies where they might go next.

------------------------------------------------------------------------

## What you will be able to do by the end

By the time you reach the end of the book, you should be able to:

- Design and analyze learning algorithms for new problems, starting from clear formal problem statements.
- Implement a wide range of models and training procedures from first principles in Python.
- Critically evaluate the strengths and limitations of standard architectures and techniques.
- Build and maintain systems that train, deploy, monitor, and update models in realistic environments.
- Engage with the research literature with enough background to understand both technical details and broader implications.
- Contribute to conversations about the ethical and societal aspects of AI with an informed technical perspective.


Most importantly, you will have a mental framework in which new developments fit naturally, rather than appearing as isolated tricks. The field will continue to change rapidly, but the underlying principles in probability, optimization, representation, and systems design will remain highly stable.

------------------------------------------------------------------------

Artificial intelligence is both a scientific discipline and an engineering craft. It sits at the intersection of mathematics, code, and human values. This book invites you into that intersection. The next chapters build the foundations; from there, we will gradually climb toward current frontiers, always with one eye on theory and the other on implementation.

Let us begin!
