13  AI Software Ecosystems

13.1 1. Introduction

Modern artificial intelligence is as much a software achievement as a mathematical one. The transformer architecture, the diffusion model, and the policy gradient are ideas, but ideas only become reproducible engineering when a layered stack of libraries turns them into running code on accelerators. This chapter examines that stack: the language that hosts it (Python), the numerical foundations (NumPy, SciPy), the deep learning frameworks (PyTorch, JAX, TensorFlow), the automatic differentiation machinery they share, the model and dataset distribution layer (Hugging Face), the data wrangling tools (pandas, Polars, Arrow), the experiment and reproducibility tooling, and finally the packaging systems that hold it all together (uv, conda).

The recurring theme is that each layer embodies a set of design tradeoffs, and that understanding those tradeoffs matters more than memorizing any single API. Ecosystems are not chosen; they accrete. The reasons one tool displaced another are usually social and ergonomic as much as technical, and a graduate practitioner benefits from seeing both dimensions.

13.2 2. Why Python Won

Python was not designed for numerical computing. Its dynamic typing, interpreter overhead, and global interpreter lock make it, on paper, a poor candidate for high-performance work. Yet it became the lingua franca of machine learning. The explanation lies in a division of labor: Python is a coordination language, not a computation language.

13.2.1 2.1 The two-language strategy

The performance-critical inner loops of numerical code live in compiled C, C++, Fortran, or CUDA. Python orchestrates these kernels. A practitioner writes high-level, readable glue, and the heavy arithmetic happens below the interpreter in optimized libraries. The arrangement has a cost, often called the two-language problem: anyone who needs a new fast kernel must leave Python to write it. It is nonetheless Python’s superpower: the productivity of a scripting language at the top, the speed of compiled code at the bottom.
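As a rough illustration of this division of labor, the sketch below computes the same sum of squares twice: once with the loop running in the interpreter, once with the loop delegated to NumPy's compiled code. The function names and the array size are arbitrary choices made for this example, and the timings will vary by machine.

```python
import time
import numpy as np

def sum_of_squares_python(values):
    """Inner loop runs in the interpreter, one element at a time."""
    total = 0.0
    for v in values:
        total += v * v
    return total

def sum_of_squares_numpy(array):
    """Same arithmetic, but the loop runs inside NumPy's compiled code."""
    return float(np.dot(array, array))

data = np.arange(200_000, dtype=np.float64)

start = time.perf_counter()
slow = sum_of_squares_python(data.tolist())
python_seconds = time.perf_counter() - start

start = time.perf_counter()
fast = sum_of_squares_numpy(data)
numpy_seconds = time.perf_counter() - start

print(f"pure Python: {python_seconds:.4f}s, NumPy: {numpy_seconds:.4f}s")
```

Both paths return the same number up to floating-point rounding; only the place where the loop executes differs.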

13.2.2 2.2 Ergonomics and community

Python’s readability lowered the barrier to entry for scientists who were not primarily programmers. Interactive computing through IPython and later Jupyter made exploratory work natural (run a cell, inspect an array, plot a result). The result was a virtuous cycle: more users attracted more library authors, which attracted more users. By the time deep learning arrived around 2012, Python already had a mature numerical stack waiting, and the frameworks that defined the era chose Python bindings because that is where the researchers were.

Competitors existed. R dominated statistics, Julia promised to solve the two-language problem outright with a single fast language, and MATLAB held engineering. None matched Python’s breadth across data engineering, web services, and general scripting, which let the same language carry a model from a notebook to a production endpoint.

13.3 3. The Numerical Core

13.3.1 3.1 NumPy

NumPy is the substrate on which nearly everything else rests. Its central abstraction is the ndarray, a homogeneous, n-dimensional array that describes a block of memory through a shape and a set of strides, so that slices and transposes can be views onto the same buffer rather than copies. Two ideas make it powerful. The first is vectorization: operations apply to whole arrays in compiled loops rather than element by element in Python, eliminating interpreter overhead. The second is broadcasting, a set of rules for combining arrays of different shapes without copying data, which lets a vector be added to every row of a matrix with no explicit loop.

import numpy as np
A = np.random.randn(1000, 50)
mu = A.mean(axis=0)          # shape (50,)
centered = A - mu            # broadcasting subtracts mu from each row
cov = centered.T @ centered / (A.shape[0] - 1)

NumPy also defines an informal contract: the array interface and the universal function (ufunc) protocol. Downstream libraries agree to speak ndarray, which is why a SciPy routine, a Matplotlib plot, and a pandas column interoperate. The NumPy 2.0 release in 2024 modernized the C API and dtype system while preserving this contract, a reminder that backward compatibility is itself a feature in foundational software.
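One small piece of that contract can be sketched with the `__array__` hook, which lets an object that is not an ndarray be consumed by NumPy functions. The `Readings` class is hypothetical and exists only for this illustration; the `dtype` and `copy` parameters are included so the hook can be called either with or without them.

```python
import numpy as np

class Readings:
    """Hypothetical container that is not an ndarray but can become one."""

    def __init__(self, values):
        self._values = list(values)

    def __array__(self, dtype=None, copy=None):
        # NumPy calls this hook when it needs an ndarray built from the object.
        return np.array(self._values, dtype=dtype)

r = Readings([1.0, 4.0, 9.0])
roots = np.sqrt(r)                 # a ufunc accepts the object via __array__
total = float(np.asarray(r).sum())
```

Because the conversion happens inside NumPy, any downstream routine that begins by calling `np.asarray` on its input can accept such an object without knowing its type.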

13.3.2 3.2 SciPy

SciPy builds a library of scientific algorithms atop NumPy arrays: optimization, numerical integration, interpolation, signal processing, sparse matrices, and a large statistics module. Where NumPy supplies the data structure and elementary operations, SciPy supplies the textbook methods. Much of it wraps battle-tested Fortran and C libraries such as LAPACK, BLAS, and ARPACK, which is precisely the two-language strategy in action. For AI practitioners, SciPy is often invisible infrastructure (sparse linear algebra under a recommender, special functions under a probability distribution), but it remains the reference implementation for classical numerical methods.
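A minimal sketch of the "textbook methods" role: `scipy.integrate.quad` evaluates the Gaussian integral over the whole real line, whose closed form is the square root of pi. The integrand and the comparison are choices made for this example.

```python
import math
from scipy.integrate import quad

def gaussian(t):
    """Unnormalized Gaussian; its integral over the real line is sqrt(pi)."""
    return math.exp(-t * t)

# quad returns the integral estimate and an estimate of the absolute error.
value, abs_error = quad(gaussian, -math.inf, math.inf)
expected = math.sqrt(math.pi)
print(value, expected, abs_error)
```

The caller writes a plain Python function; the adaptive quadrature itself runs in the compiled routines that SciPy wraps.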

13.4 4. Deep Learning Frameworks

A deep learning framework provides three things: an array library that runs on accelerators, an automatic differentiation engine, and a collection of neural network building blocks and optimizers. The three dominant frameworks differ chiefly in how they schedule computation and how they express differentiation.

13.4.1 4.1 The execution model divide

The central design axis is static versus dynamic graphs. A static graph framework asks you to define the entire computation up front, compiles it, then feeds data through. This enables aggressive optimization and easy deployment but makes debugging painful, because the Python code that builds the graph runs only once and ordinary print statements and breakpoints do not see the running computation. A dynamic (eager) framework executes operations immediately as the Python interpreter reaches them, so the graph is built implicitly on every forward pass. This is slower in principle but vastly more pleasant: control flow is ordinary Python, errors point at the offending line, and the mental model matches NumPy.
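The contrast can be made concrete with a toy, framework-free sketch. The `Node`, `placeholder`, and `run` names are hypothetical and stand in for the far richer graph machinery of a real framework; the point is only that the static version builds a description first and computes later, while the eager version computes as each line executes.

```python
# --- Dynamic (eager): each operation runs as soon as Python reaches it ---
def eager_forward(x, w, b):
    h = x * w          # computed immediately; can be printed or inspected here
    return h + b

# --- Static (deferred): build a symbolic graph first, run it later ---
class Node:
    """One symbolic operation; nothing is computed at construction time."""

    def __init__(self, op, inputs=(), name=None):
        self.op, self.inputs, self.name = op, inputs, name

    def __mul__(self, other):
        return Node("mul", (self, other))

    def __add__(self, other):
        return Node("add", (self, other))

def placeholder(name):
    return Node("placeholder", name=name)

def run(node, feed):
    """Evaluate the graph for one set of concrete inputs."""
    if node.op == "placeholder":
        return feed[node.name]
    left, right = (run(i, feed) for i in node.inputs)
    return left * right if node.op == "mul" else left + right

x, w, b = placeholder("x"), placeholder("w"), placeholder("b")
graph = x * w + b                      # builds nodes only; no arithmetic yet
static_result = run(graph, {"x": 3.0, "w": 2.0, "b": 1.0})
eager_result = eager_forward(3.0, 2.0, 1.0)
```

A print statement placed after `graph = x * w + b` would show a `Node` object rather than a number, which is the debugging friction described above in miniature.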

13.4.2 4.2 TensorFlow’s history

TensorFlow, released by Google in 2015, was the first framework to reach mass adoption. It used a static graph: you constructed a symbolic graph with placeholders, then ran it inside a session. This bought excellent production tooling (graph serialization, TensorFlow Serving, mobile and browser runtimes, and first-class support for Google’s TPUs) but imposed a steep learning curve. The friction was real enough that TensorFlow 2.0 (2019) switched the default to eager execution and adopted Keras as its high-level API, an implicit acknowledgment that the dynamic model had won the research community. Despite enormous engineering investment, TensorFlow steadily lost research mindshare, illustrating that developer experience can outweigh raw capability in deciding which tool people actually reach for.

13.4.3 4.3 PyTorch and its dominance

PyTorch, released by Facebook AI Research in 2017 and descended from the Lua-based Torch, embraced dynamic graphs from the start, an approach known as define-by-run. Its tensor API mirrors NumPy closely, its autograd is transparent, and its error messages are legible. Researchers adopted it rapidly because prototyping felt like writing ordinary Python. By the late 2010s the majority of papers at major machine learning conferences reported PyTorch implementations, and that research dominance pulled the rest of the ecosystem toward it.

PyTorch’s later evolution addressed its one structural weakness, the performance cost of eager execution. The torch.compile system introduced in PyTorch 2.0 (2023) traces eager code, captures it into a graph, and hands it to a backend compiler, recovering much of the speed of a static framework without forcing the user to abandon the eager programming model. This represents a convergence: rather than choosing graphs versus eager, the field now wants eager authoring with optional graph compilation.

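import torch
from torch import nn

model = nn.Linear(10, 1)             # stand-in for a real network such as MyNetwork()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()
x, y = torch.randn(32, 10), torch.randn(32, 1)   # dummy batch

model = torch.compile(model)         # transparent graph capture and fusion
loss = loss_fn(model(x), y)
loss.backward()                      # autograd populates .grad on every parameter
optimizer.step()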

13.4.4 4.4 JAX and functional transforms

JAX, from Google, takes a different philosophical stance. Where PyTorch is object-oriented and stateful (a module owns its parameters, a tensor accumulates gradients), JAX is functional. It exposes a NumPy-compatible API and a set of composable function transformations: grad for differentiation, jit for just-in-time compilation through XLA, vmap for automatic vectorization, and pmap or shard_map for parallelism across devices. Because these transforms are functions that take functions and return functions, they compose freely: you can take the gradient of a vectorized, JIT-compiled function and get exactly what the algebra says you should.

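import jax, jax.numpy as jnp

def predict(params, x):              # illustrative linear model
    w, b = params
    return x @ w + b

def loss(params, x, y):
    pred = predict(params, x)
    return jnp.mean((pred - y) ** 2)

params = (jnp.zeros(3), jnp.zeros(()))
x_batch, y_batch = jnp.ones((8, 3)), jnp.ones(8)   # dummy batch
grad_fn = jax.jit(jax.grad(loss))    # compiled gradient function
g = grad_fn(params, x_batch, y_batch)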

The cost of this elegance is discipline. JAX requires functional purity (no side effects inside transformed functions), explicit handling of random number state through splittable keys, and immutable arrays. These constraints feel restrictive to newcomers, but they are exactly what make the transforms sound and the parallelism scalable. JAX found a strong following in research that pushes scale and in scientific computing, where its mathematical cleanliness is prized. The tradeoff is a smaller ecosystem and a steeper conceptual entry than PyTorch.
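The three constraints named here, along with the composability of transforms, can be seen in a few lines. This is a small sketch under assumed shapes and a made-up per-example loss, not a recommended training setup.

```python
import jax
import jax.numpy as jnp

def loss(w, x, y):
    """Squared error for a single example; pure, with no side effects."""
    return (jnp.dot(w, x) - y) ** 2

# Randomness is explicit: split one key into independent subkeys.
key = jax.random.PRNGKey(0)
key_x, key_y = jax.random.split(key)
xs = jax.random.normal(key_x, (4, 3))
ys = jax.random.normal(key_y, (4,))
w = jnp.zeros(3)

# Compose transforms: gradient w.r.t. w, vectorized over the batch, then compiled.
per_example_grads = jax.jit(jax.vmap(jax.grad(loss), in_axes=(None, 0, 0)))
grads = per_example_grads(w, xs, ys)          # shape (4, 3)

# Arrays are immutable: .at[...].set(...) returns an updated copy.
w_new = w.at[0].set(1.0)
```

With `w` at zero, each row of `grads` equals `-2 * y * x` for that example, which is what differentiating the squared error by hand gives.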

13.5 5. Automatic Differentiation

Automatic differentiation (autodiff) is the engine beneath every framework, and understanding it demystifies the whole stack. Autodiff is neither symbolic differentiation (which manipulates expressions and can explode in size) nor numerical differentiation (finite differences, which suffer truncation and rounding error). It computes exact derivatives by applying the chain rule mechanically to the sequence of elementary operations the program actually executed.
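The difference from finite differences can be shown with a deliberately tiny sketch that carries a derivative alongside each value and applies the chain rule at every elementary operation. The `Dual` class, the `sin` wrapper, and the test function `f` are all hypothetical names for this illustration; real frameworks support far more operations.

```python
import math

class Dual:
    """A value paired with its derivative; arithmetic applies the chain rule."""

    def __init__(self, value, deriv=0.0):
        self.value, self.deriv = value, deriv

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

def sin(x):
    """sin that works on floats and on Dual numbers."""
    if isinstance(x, Dual):
        return Dual(math.sin(x.value), math.cos(x.value) * x.deriv)
    return math.sin(x)

def f(x):
    return x * sin(x) + 3.0 * x          # f'(x) = sin(x) + x*cos(x) + 3

x0 = 1.5
exact = math.sin(x0) + x0 * math.cos(x0) + 3.0
autodiff = f(Dual(x0, 1.0)).deriv        # seed dx/dx = 1
h = 1e-5
finite_diff = (f(x0 + h) - f(x0)) / h    # forward difference approximation
```

The autodiff result matches the analytic derivative to rounding error, while the finite-difference estimate carries a truncation error that depends on the step size `h`.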

13.5.1 5.1 Forward and reverse mode

There are two modes. Forward mode propagates derivatives alongside values from inputs to outputs, and is efficient when there are few inputs and many outputs. Reverse mode first runs the computation forward while recording operations onto a tape, then walks backward accumulating gradients, and is efficient when there are many inputs and few outputs. Neural network training is exactly the latter case: millions of parameters (inputs) and a single scalar loss (output). This is why reverse mode, known in this setting as backpropagation, is the workhorse, and why every framework’s autograd defaults to it.
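Reverse mode can be sketched in the same spirit: record each operation's inputs and local derivatives during the forward pass, then sweep backward once from the scalar output. The `Var` class and `backward` function below are a hypothetical toy for scalars only, not any framework's implementation.

```python
import math

class Var:
    """Scalar that records how it was computed so gradients can flow backward."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents      # pairs of (input Var, local derivative)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

def tanh(x):
    t = math.tanh(x.value)
    return Var(t, ((x, 1.0 - t * t),))

def backward(output):
    """Visit nodes in reverse topological order, accumulating gradients."""
    order, seen = [], set()

    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)

    visit(output)
    output.grad = 1.0
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += local * node.grad

# Several inputs, one scalar output: one backward pass yields every gradient.
w1, w2, x, b = Var(0.5), Var(-1.0), Var(2.0), Var(0.1)
loss = tanh(w1 * x + b) * w2
backward(loss)
```

A single call to `backward` fills in `grad` for `w1`, `w2`, `x`, and `b` at once, which is the property that makes reverse mode the natural fit for a model with many parameters and one loss.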

13.5.2 5.2 How frameworks realize it

PyTorch builds the tape dynamically during the forward pass: each operation on a tensor that requires gradients records itself, and loss.backward() traverses that recorded graph. JAX instead traces the function once to an intermediate representation and transforms that representation, which is why grad is a function transform rather than a method call. Both compute the same mathematical object; the difference is when and how the graph is captured. The conceptual payoff is that practitioners can write arbitrary differentiable programs (not just fixed layer stacks) and trust that gradients will be correct, which is what made research into novel architectures so productive.

13.6 6. The Hugging Face Ecosystem

If frameworks made models trainable, Hugging Face made them shareable. The company’s libraries and hub became the de facto distribution layer for the transformer era, doing for models and datasets what package registries did for code.

13.6.1 6.1 transformers

The transformers library provides a unified API across thousands of model architectures. Its design insight was to reduce every model to the same three parts, a configuration, a tokenizer, and a model, each loadable by name with a single call, with weights downloaded and cached automatically. A practitioner can load a pretrained model for classification, generation, or embedding without reimplementing the architecture or hunting for checkpoints.

from transformers import AutoTokenizer, AutoModelForCausalLM
tok = AutoTokenizer.from_pretrained("org/model-name")
model = AutoModelForCausalLM.from_pretrained("org/model-name")
out = model.generate(**tok("Once upon a time", return_tensors="pt"))

A notable design choice is the deliberate avoidance of deep inheritance: each model is largely self-contained in a single file rather than woven through a tall abstraction hierarchy. This repeats code across models but makes any one model easy to read, fork, and modify, a tradeoff that favors researchers over framework purists.

13.6.2 6.2 datasets and the hub

The datasets library handles data at the scale models now demand. It is backed by Apache Arrow and uses memory-mapping so that corpora larger than RAM can be processed without loading them entirely, with lazy transformations and efficient streaming. The Hugging Face Hub is the registry tying it together: a git-based store, using git-LFS for large files, that versions models, datasets, and demo applications, complete with model cards documenting intended use and limitations. The cultural effect was significant: it normalized open distribution of weights and made reproducing a result a matter of a download rather than an email to the authors.

13.7 7. Data Tooling

Before a model trains, data must be loaded, cleaned, joined, and reshaped. This tabular layer has its own evolving ecosystem.

13.7.1 7.1 pandas

pandas defined dataframe manipulation in Python. Its DataFrame, a labeled, heterogeneous table built on NumPy, became ubiquitous in data science. Its strengths are expressiveness and a vast surface of conveniences for real-world messy data (time series, missing values, group-wise operations, joins). Its weaknesses, well known to anyone who has scaled it, are memory inefficiency (operations often copy), single-threaded execution for most work, an eager evaluation model that cannot optimize across steps, and an API grown sprawling over more than a decade.
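A short sketch of the eager model, using a made-up `events` table: each statement executes immediately and materializes an intermediate result, so no optimizer ever sees the filter and the aggregation together.

```python
import pandas as pd

events = pd.DataFrame({
    "user": ["a", "b", "a", "c", "b"],
    "amount": [10.0, -5.0, 7.5, None, 20.0],   # None is stored as NaN
})

# Runs now and produces a new frame; NaN and non-positive rows drop out.
positive = events[events["amount"] > 0]

# Runs now on the intermediate frame built by the previous statement.
totals = positive.groupby("user")["amount"].sum()
print(totals)
```

The missing value is handled without any special casing, which is the convenience side of the tradeoff; the intermediate `positive` frame is the cost side.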

13.7.2 7.2 Polars

Polars is a newer dataframe library written in Rust that targets exactly those weaknesses. It offers a lazy execution mode in which a query is built as a plan, optimized as a whole (predicate pushdown, projection pruning), then executed across all cores. It uses Apache Arrow as its in-memory format and is built for parallelism from the ground up. The result is order-of-magnitude speedups on many workloads and a more consistent, composable expression API. The tradeoff is a smaller ecosystem and a different mental model: users must think in terms of expressions and query plans rather than in-place mutation.

import polars as pl
result = (
    pl.scan_parquet("events.parquet")      # lazy, nothing read yet
      .filter(pl.col("amount") > 0)
      .group_by("user")
      .agg(pl.col("amount").sum())
      .collect()                            # optimize the whole plan, then run
)

13.7.3 7.3 Arrow as the connective tissue

Apache Arrow underlies both Polars and Hugging Face datasets, and increasingly pandas. Arrow defines a standardized, language-independent columnar memory format. Its importance is interoperability: when two systems both speak Arrow, they can share data with zero-copy handoffs rather than serializing and deserializing across a boundary. Arrow is the modern analog of the NumPy array contract, raised to the level of cross-language, cross-process data exchange, and it quietly removes a large class of conversion overhead from the data pipeline.

13.8 8. Experiment Tracking and Reproducibility

Training runs produce a flood of artifacts: hyperparameters, metrics over time, model checkpoints, and the environment that produced them. Without discipline, results become irreproducible and comparisons become guesswork.

13.8.1 8.1 Tracking tools

Experiment trackers such as the hosted Weights & Biases and the open-source MLflow and TensorBoard log metrics, configurations, and artifacts to a central store and render them for comparison. MLflow additionally offers a model registry and packaging format aimed at the path to production, while Weights & Biases emphasizes hosted collaboration and rich visualization. The common abstraction is the run: a single execution tagged with its configuration and the time series of values it emitted, queryable and comparable after the fact.
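The run abstraction is simple enough to sketch with the standard library. The `Run` class and `best_run` helper are hypothetical and mirror no particular tracker's API; they only show the shape of the data that such tools store.

```python
import json
import time
from dataclasses import dataclass, field

@dataclass
class Run:
    """Hypothetical minimal run record: a config plus metric time series."""
    name: str
    config: dict
    metrics: dict = field(default_factory=dict)
    started_at: float = field(default_factory=time.time)

    def log(self, step, **values):
        """Append one (step, value) point to each named metric series."""
        for key, value in values.items():
            self.metrics.setdefault(key, []).append((step, value))

    def to_json(self):
        return json.dumps({"name": self.name, "config": self.config,
                           "metrics": self.metrics}, sort_keys=True)

def best_run(runs, metric):
    """Compare runs after the fact by the final value of one metric (lower wins)."""
    return min(runs, key=lambda r: r.metrics[metric][-1][1])

run_a = Run("lr-0.1", {"lr": 0.1})
run_b = Run("lr-0.01", {"lr": 0.01})
for step in range(3):
    run_a.log(step, loss=1.0 / (step + 1))
    run_b.log(step, loss=2.0 / (step + 1))
winner = best_run([run_a, run_b], "loss")
```

Real trackers add persistence, artifact storage, and dashboards on top, but the queryable unit remains a configuration paired with its emitted series.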

13.8.2 8.2 Configuration and orchestration

Reproducibility also depends on managing configuration cleanly. Tools such as Hydra compose hierarchical configuration files and sweep over them, separating the description of an experiment from its code. The deeper challenge is that reproducibility is multi-layered: identical code can produce different results due to nondeterminism in parallel GPU kernels, differing library versions, or unseeded randomness. Genuine reproducibility therefore requires pinning the data, the random seeds, the library versions, and ideally the hardware, which is why the packaging layer discussed next is not a side concern but part of the scientific method.
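Two of these layers, configuration and seeding, can be sketched without any external tool. The `compose` function below is a simplified stand-in for the idea of hierarchical overrides, not Hydra's actual API, and it seeds only the standard-library generator; NumPy and the deep learning frameworks each need their own seeding calls.

```python
import copy
import hashlib
import json
import random

def compose(base, overrides):
    """Recursively merge overrides into a copy of base, leaving base untouched."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = compose(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged

def config_hash(config):
    """Stable short hash of a config dict, usable as a run identifier."""
    blob = json.dumps(config, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:12]

base = {"optimizer": {"name": "adam", "lr": 0.01}, "seed": 0}
config = compose(base, {"optimizer": {"lr": 0.001}, "seed": 42})

random.seed(config["seed"])
first_draws = [random.random() for _ in range(3)]
random.seed(config["seed"])
second_draws = [random.random() for _ in range(3)]   # identical to first_draws
run_id = config_hash(config)
```

Hashing the fully composed configuration gives every experiment an identifier that changes whenever any setting changes, which makes silent configuration drift easier to spot.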

13.9 9. Packaging and Environments

The final layer answers a deceptively hard question: how do you install a consistent set of libraries that actually work together? AI dependencies are unusually difficult because they mix Python packages with compiled extensions, CUDA toolkits, and system libraries that must all agree.

13.9.1 9.1 The conda lineage

conda emerged to solve a problem that pip historically could not: managing non-Python dependencies. conda is a language-agnostic package and environment manager that installs precompiled binaries, including CUDA runtimes and system libraries, into isolated environments. For scientific stacks with heavy native code, this was transformative. Its costs are a slow classical dependency solver (partially addressed by faster solvers) and an ecosystem split between channels, with licensing considerations around the default channel that pushed many users toward the community conda-forge channel.

13.9.2 9.2 The pip and uv lineage

The mainstream Python path is pip installing from the Python Package Index, with virtual environments for isolation. Historically this was fragmented across many tools for locking, virtual environments, and building. uv, written in Rust, consolidated these roles into a single fast tool that creates environments, resolves and locks dependencies, and installs packages, often an order of magnitude faster than the tools it replaces. Its lockfile produces deterministic, reproducible installs, directly serving the reproducibility goals of the previous section. The wheel format and the increasing availability of GPU-enabled wheels have narrowed conda’s former advantage, so that for many AI projects a pip or uv workflow now suffices where conda was once mandatory.
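What a lockfile guarantees can be checked from inside Python. The sketch below compares a dictionary of pinned versions against what is actually installed; the `check_pins` helper and the package name are hypothetical, and a real project would read its pins from the lockfile rather than hard-coding them.

```python
from importlib import metadata

def check_pins(pins):
    """Return {package: (pinned, installed)} for every pin that is not satisfied."""
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None               # package is absent from this environment
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches

# Hypothetical pin for a package that is assumed not to exist.
pins = {"definitely-not-a-real-package-xyz": "1.0.0"}
problems = check_pins(pins)
print(problems)
```

An empty result means the environment matches the pins exactly, which is the property a deterministic install is meant to deliver.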

13.9.3 9.3 The underlying tradeoff

The choice between these tools is a tradeoff between scope and speed. conda manages the broadest set of dependencies, including non-Python ones, at the cost of complexity and historically of speed. uv and pip are faster and simpler but assume the Python packaging ecosystem can supply what you need, which is increasingly but not universally true. The pragmatic resolution many teams adopt is to use conda only when a stubborn native dependency demands it, and a fast pip or uv workflow everywhere else.

13.10 10. Conclusion

The AI software ecosystem is a layered settlement, each stratum built on the contracts exposed by the one below: compiled kernels under NumPy, NumPy under the frameworks, the frameworks under model hubs and data tools, and packaging holding the whole structure in a reproducible state. The throughline is that the winning tools were rarely the most powerful in the abstract. They were the ones that got the tradeoffs right for the people using them, favoring legibility (PyTorch’s eager model), interoperability (Arrow, the NumPy contract), and ergonomics (Python itself). For the practitioner, fluency means understanding not just how to call these libraries but why each made the choices it did, because the next shift in the ecosystem will be driven by the same forces that produced this one.

13.11 References

  1. Harris, C. R., et al. “Array programming with NumPy.” Nature 585 (2020): 357–362. https://www.nature.com/articles/s41586-020-2649-2
  2. Virtanen, P., et al. “SciPy 1.0: fundamental algorithms for scientific computing in Python.” Nature Methods 17 (2020): 261–272. https://www.nature.com/articles/s41592-019-0686-2
  3. Paszke, A., et al. “PyTorch: An Imperative Style, High-Performance Deep Learning Library.” NeurIPS 2019. https://papers.nips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library
  4. Ansel, J., et al. “PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation.” ASPLOS 2024. https://pytorch.org/assets/pytorch2-2.pdf
  5. Abadi, M., et al. “TensorFlow: A System for Large-Scale Machine Learning.” OSDI 2016. https://www.usenix.org/system/files/conference/osdi16/osdi16-abadi.pdf
  6. Bradbury, J., et al. “JAX: composable transformations of Python and NumPy programs.” 2018. https://github.com/jax-ml/jax
  7. Baydin, A. G., et al. “Automatic Differentiation in Machine Learning: a Survey.” Journal of Machine Learning Research 18 (2018): 1–43. https://jmlr.org/papers/v18/17-468.html
  8. Wolf, T., et al. “Transformers: State-of-the-Art Natural Language Processing.” EMNLP 2020 (System Demonstrations). https://aclanthology.org/2020.emnlp-demos.6/
  9. Lhoest, Q., et al. “Datasets: A Community Library for Natural Language Processing.” EMNLP 2021. https://aclanthology.org/2021.emnlp-demo.21/
  10. McKinney, W. “Data Structures for Statistical Computing in Python.” Proceedings of the 9th Python in Science Conference (2010). https://conference.scipy.org/proceedings/scipy2010/mckinney.html
  11. Polars documentation. https://docs.pola.rs/
  12. Apache Arrow project. https://arrow.apache.org/
  13. Hugging Face Hub documentation. https://huggingface.co/docs/hub/index
  14. MLflow documentation. https://mlflow.org/docs/latest/index.html
  15. Astral. “uv: An extremely fast Python package and project manager.” https://docs.astral.sh/uv/
  16. conda documentation. https://docs.conda.io/