2 The History of Artificial Intelligence

Artificial intelligence did not arrive as a single invention. It emerged from a long sequence of intellectual bets, technical breakthroughs, and disappointments, each of which reshaped what researchers believed machines could do. This chapter traces that history chronologically, from the formal foundations laid in the 1930s and 1940s through the large-language-model era of the mid 2020s. The central argument is that every major shift in the field was driven by some combination of three forces: available compute, available data, and new algorithmic ideas. When all three aligned, progress was explosive. When one was missing, the field stalled, sometimes for a decade.

This is a conceptual chapter, so it favors precise statements over runnable code. Where a result has an exact mathematical core, that core is given explicitly, because the equations explain why a transition happened rather than merely that it happened. Three small pieces of mathematics carry most of the explanatory weight in the modern story: the limitation of single-layer perceptrons (which caused a winter), the backpropagation update (which ended it), and the self-attention operation plus the empirical scaling laws (which produced the current era). Each is stated where it belongs in the timeline.

A note on the three forces

Throughout the chapter, “compute” means the arithmetic throughput available to train and run models, usually measured in floating-point operations per second; “data” means machine-readable examples a system can learn from; and “algorithms” means the model architectures and training procedures that turn compute and data into capability. The recurring claim is multiplicative rather than additive: capability is large only when all three factors are simultaneously large, so a near-zero value in any one factor drives the product toward zero. This is why isolated brilliant ideas, such as backpropagation in 1986 or convolutional networks in 1998, had to wait years for data and compute to catch up before they reshaped the field.

2.1 1. A Chronological Overview

Before examining the eras in detail, it helps to see them laid out as a timeline. The diagram below marks the events that anchor each section of this chapter.

timeline
    title A Chronological Overview of AI
    1936 : Turing's "On Computable Numbers" defines computation
    1943 : McCulloch and Pitts model an artificial neuron
    1950 : Turing proposes the Imitation Game
    1956 : Dartmouth Summer Research Project names the field
    1957 : Rosenblatt builds the Perceptron
    1966 : "ELIZA; early machine translation falters"
    1969 : Minsky and Papert publish "Perceptrons"
    1973 : Lighthill Report triggers the first AI winter
    1980 : XCON expert system deployed at DEC
    1986 : Backpropagation popularized; connectionism revives
    1987 : LISP machine market collapses; second AI winter
    1997 : Deep Blue defeats Kasparov
    1998 : "LeNet-5; gradient learning for documents"
    2006 : Hinton's deep belief nets revive "deep" learning
    2009 : ImageNet dataset released
    2012 : AlexNet wins ImageNet by a wide margin
    2014 : "GANs; sequence-to-sequence learning"
    2017 : "Attention Is All You Need" introduces the Transformer
    2018 : BERT and GPT show transfer learning at scale
    2020 : GPT-3 demonstrates in-context learning
    2022 : ChatGPT reaches mainstream users
    2023 : GPT-4 and multimodal frontier models
    2024 : Reasoning-focused and agentic systems mature

The sections that follow walk through these markers and, more importantly, explain why each transition occurred.

2.2 2. Early Foundations

2.2.1 2.1 The Formal Idea of Computation

The intellectual prerequisite for artificial intelligence was a precise definition of computation itself. In 1936 Alan Turing introduced the abstract machine that now bears his name, showing that a simple device manipulating symbols on a tape could carry out any effective procedure. The Church-Turing thesis, which emerged from this work and Alonzo Church’s parallel formulation, holds that any function computable by an effective procedure is computable by a Turing machine. This result established that reasoning, if it could be reduced to symbol manipulation, was in principle mechanizable. The Turing machine gave later researchers a reason to believe that thought might be a computational process rather than something irreducibly biological.

A few years later, in 1943, Warren McCulloch and Walter Pitts published a paper showing that networks of simplified neurons, each firing according to a threshold rule, could compute logical functions. Their neuron is worth stating precisely, because it is the direct ancestor of every artificial neuron used today. Given binary inputs $x_1, \dots, x_n$, a McCulloch-Pitts unit outputs

\[ y = \mathbb{1}\!\left[\sum_{i=1}^{n} w_i x_i \ge \theta\right], \]

where $w_i$ are fixed weights, $\theta$ is a threshold, and $\mathbb{1}[\cdot]$ is the indicator function that returns 1 when its condition holds and 0 otherwise. With suitable weights and thresholds, single units realize logical AND, OR, and NOT, and networks of them realize any Boolean function. This was the first bridge between brains and logic, and it planted the seed for both the symbolic and connectionist traditions that would later diverge. The crucial limitation, invisible at the time, is that McCulloch and Pitts gave no procedure for learning the weights; they had to be set by hand. Supplying that missing learning procedure is the through-line of the rest of this chapter.
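The threshold rule is small enough to write out directly. The following is a minimal sketch, assuming hand-picked weights and thresholds chosen for illustration (they are not taken from the 1943 paper); the helper name `mp_unit` is hypothetical.

```python
def mp_unit(weights, theta):
    """Build a McCulloch-Pitts unit: output 1 when the weighted sum reaches theta."""
    def fire(*inputs):
        total = sum(w * x for w, x in zip(weights, inputs))
        return 1 if total >= theta else 0
    return fire

# Weights and thresholds are set by hand (illustrative choices);
# nothing here is learned, which is exactly the limitation noted above.
AND = mp_unit([1, 1], theta=2)   # needs both inputs on
OR = mp_unit([1, 1], theta=1)    # needs at least one input on
NOT = mp_unit([-1], theta=0)     # an inhibitory weight flips the input

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "AND:", AND(a, b), "OR:", OR(a, b))
print("NOT 0:", NOT(0), "NOT 1:", NOT(1))
```

Changing a logical function means editing the numbers by hand, which is why a learning procedure was the missing piece.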

2.2.2 2.2 Turing’s Imitation Game

In 1950 Turing published “Computing Machinery and Intelligence,” which recast the unanswerable question “Can machines think?” as an operational test. In the Imitation Game, now called the Turing Test, a human interrogator converses through text with a machine and another human, and the machine succeeds if the interrogator cannot reliably tell which is which. The proposal mattered less as a benchmark than as a philosophical move: it shifted the conversation from metaphysics toward observable behavior, which is the stance most of the field has taken ever since.

2.2.3 2.3 The Dartmouth Workshop

The field acquired its name and its founding agenda at the Dartmouth Summer Research Project on Artificial Intelligence in 1956. John McCarthy, Marvin Minsky, Nathaniel Rochester, and Claude Shannon organized the gathering on the premise that “every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it.” The proposal was strikingly confident, and that confidence set the tone for the next two decades. Dartmouth did not solve any technical problem, but it gathered the people who would, defined a shared vocabulary, and secured a sense of legitimacy that attracted funding.

2.3 3. Symbolic AI and the Logic Era

2.3.1 3.1 Physical Symbol Systems

The dominant paradigm from the late 1950s into the 1980s was symbolic AI, sometimes called good old-fashioned AI. Its guiding hypothesis, articulated by Allen Newell and Herbert Simon, held that a physical symbol system has the necessary and sufficient means for general intelligent action. On this view, intelligence is the manipulation of symbols according to formal rules, and the path to thinking machines runs through logic, search, and explicit knowledge representation.

Early systems gave the approach real momentum. Newell and Simon’s Logic Theorist (1956) proved theorems from Whitehead and Russell’s Principia Mathematica, and their later General Problem Solver attempted to model human means-ends reasoning. McCarthy invented LISP in 1958, a language whose treatment of code as data made it the natural home of symbolic programming for decades.

2.3.2 3.2 The Limits of Pure Reasoning

Symbolic systems excelled in narrow, well-structured domains but struggled wherever the world was messy. Joseph Weizenbaum’s ELIZA (1966) produced strikingly human dialogue by pattern matching alone, which revealed how easily people attribute understanding to shallow systems, an effect now called the ELIZA effect. Meanwhile, early machine translation efforts foundered on ambiguity and context. A 1966 report by the Automatic Language Processing Advisory Committee concluded that the work had not delivered, and funding was cut sharply as a result. The deeper problem was that encoding common sense by hand proved to be an enormous undertaking, a difficulty later known as the knowledge acquisition bottleneck.

Reasoning algorithms also faced combinatorial explosion, and the reason is quantitative. Many symbolic methods cast a problem as search through a state space, the set of all configurations reachable by applying legal operators from a start state. If each state has, on average, $b$ successor states (the branching factor) and a solution lies at depth $d$, then an uninformed search such as breadth-first search may examine on the order of $b^d$ states. This growth is exponential in $d$, so even a modest branching factor defeats brute force: with $b = 10$ and $d = 15$, the count reaches $10^{15}$, far beyond the memory and clock speed of 1970s hardware. Heuristic search such as A* prunes this space when a good heuristic is available, but for open-ended reasoning no such heuristic existed. Exponential blow-up, not any single failed project, was the structural obstacle the field kept hitting.
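The arithmetic behind the explosion can be checked in a few lines. This is a sketch under a simplifying assumption, namely a uniform tree in which every state has exactly $b$ successors; the function name `states_examined` is hypothetical.

```python
def states_examined(b, d):
    """Count states at depths 0..d in a uniform tree with branching factor b.

    Assumes every state has exactly b successors, which real problems
    only approximate. The total is dominated by the last level, b ** d.
    """
    return sum(b ** depth for depth in range(d + 1))

# Each extra five levels of depth multiplies the work by 10 ** 5.
for d in (5, 10, 15):
    print("depth", d, "states", states_examined(10, d))
```

The sum is only slightly larger than $b^d$ itself, which is why the text can treat the deepest level as the whole cost.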

2.4 4. The First AI Winter

By the early 1970s the gap between promises and results had become impossible to ignore. In Britain, the 1973 Lighthill Report assessed AI research for the Science Research Council and concluded that the field had failed to achieve its grand objectives, singling out the combinatorial explosion as a fundamental obstacle. The report led to deep cuts in British funding. In the United States, agencies that had supported open-ended research grew impatient and redirected money toward projects with concrete deliverables.

The causes were structural rather than accidental. Compute was scarce and expensive, so algorithms that scaled poorly hit hard walls. Data in machine-readable form barely existed, so systems could not learn from experience. And the reigning algorithmic philosophy, hand-built symbolic rules, did not degrade gracefully when faced with novelty. With all three enabling forces weak, the first AI winter set in and lasted through much of the 1970s.

2.5 5. Expert Systems and Their Collapse

2.5.1 5.1 Knowledge as a Product

AI returned to favor in the late 1970s and early 1980s through expert systems, programs that captured the decision rules of human specialists in a narrow domain. MYCIN, developed at Stanford, diagnosed bacterial infections and recommended antibiotics, often matching specialist physicians. The commercial breakthrough was XCON (also called R1), deployed at Digital Equipment Corporation in 1980 to configure computer orders. XCON reportedly saved the company tens of millions of dollars a year, and it convinced industry that AI could pay for itself.

A whole sector grew around this idea, including companies that sold specialized LISP machines optimized for symbolic computation. Japan’s ambitious Fifth Generation Computer Systems project, launched in 1982, poured national resources into logic programming hardware, and Western governments responded with programs of their own.

2.5.2 5.2 Why It Did Not Last

The expert-system boom carried the seeds of the second winter. The systems were brittle: they performed well inside their narrow rule sets but failed unpredictably at the edges, and they could not learn or update themselves. Maintaining large rule bases became costly, since every new case risked conflicting with existing rules. Most damaging, the specialized hardware lost its reason to exist when general-purpose workstations from companies like Sun, along with new desktop machines from Apple and those built on Intel processors, became cheap and fast enough to run the same software. The LISP machine market collapsed around 1987, the Fifth Generation project ended without meeting its goals, and funding evaporated again. This second AI winter ran roughly from the late 1980s into the mid 1990s.

2.6 6. Connectionism and Statistical Machine Learning

2.6.1 6.1 The Neural Network Revival

A different tradition had been quietly maturing alongside symbolic AI. Frank Rosenblatt’s Perceptron (1957) was an early trainable neural model. Unlike the McCulloch-Pitts unit, it came with a learning rule: present an example, and if the prediction is wrong, nudge the weights toward the correct answer. The perceptron convergence theorem guarantees that this rule finds a separating boundary in finitely many steps whenever one exists. The catch is the final clause: the guarantee holds only when a separating boundary exists.

Marvin Minsky and Seymour Papert’s 1969 book “Perceptrons” showed that a single-layer perceptron computes only linearly separable functions, those whose positive and negative examples can be split by a single hyperplane $w^\top x + b = 0$. The exclusive-or (XOR) function is the canonical counterexample. Its four cases,

\[ \text{XOR}(0,0)=0,\quad \text{XOR}(0,1)=1,\quad \text{XOR}(1,0)=1,\quad \text{XOR}(1,1)=0, \]

place the two outputs that equal 1 on one diagonal of the unit square and the two that equal 0 on the other. No single straight line separates the diagonals, so no single-layer perceptron, whatever its weights, can compute XOR. The result was mathematically narrow (it concerned one-layer networks) but was widely read as a verdict on neural networks in general, and it chilled the field for years.
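The contrast between a separable and a non-separable problem can be observed directly. Below is a minimal sketch of a perceptron-style mistake-driven update, assuming zero initial weights, a learning rate of 1, and a cap of 100 passes over the data; these settings and the name `train_perceptron` are illustrative choices, not Rosenblatt's original configuration.

```python
def train_perceptron(examples, epochs=100, lr=1.0):
    """Mistake-driven update: on an error, move the weights toward the target.

    Returns (weights, bias, converged), where converged means one full pass
    over the examples produced no mistakes.
    """
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for (x1, x2), target in examples:
            pred = 1 if w[0] * x1 + w[1] * x2 + b >= 0 else 0
            err = target - pred          # +1, 0, or -1
            if err != 0:
                w[0] += lr * err * x1
                w[1] += lr * err * x2
                b += lr * err
                mistakes += 1
        if mistakes == 0:
            return w, b, True
    return w, b, False

AND_DATA = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

print("AND converged:", train_perceptron(AND_DATA)[2])
print("XOR converged:", train_perceptron(XOR_DATA)[2])
```

AND is linearly separable, so the rule settles after a handful of passes. On XOR the weights keep cycling however many passes are allowed, because no setting of a single unit's weights classifies all four cases.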

The escape route was known in principle: stacking layers removes the linearity barrier, because a hidden layer can first remap the inputs into a space where the classes are separable. XOR, for instance, is solved by a two-layer network that internally computes the features “at least one input is on” (OR) and “both inputs are on” (AND) and then outputs their difference. What was missing was an efficient way to train the hidden weights. The thaw came in 1986, when David Rumelhart, Geoffrey Hinton, and Ronald Williams popularized backpropagation, an efficient method for training multi-layer networks by propagating error gradients backward through the layers.
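The two-layer construction described above can be written with hand-set weights. This is a sketch, assuming step activations and thresholds chosen for illustration; the weights are fixed by hand rather than trained, which is precisely the gap backpropagation filled.

```python
def step(z):
    """Threshold activation: 1 when z is non-negative, else 0."""
    return 1 if z >= 0 else 0

def xor_net(x1, x2):
    # Hidden layer: two features of the inputs (hand-set, illustrative weights).
    h_or = step(x1 + x2 - 1)    # "at least one input is on"
    h_and = step(x1 + x2 - 2)   # "both inputs are on"
    # Output layer: OR minus AND, firing only when the difference reaches 1.
    return step(h_or - h_and - 1)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_net(a, b))
```

In the hidden-feature space the four cases become linearly separable, so a single output unit suffices.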

Backpropagation is simply the chain rule applied systematically. For a network with loss $L$, the gradient of the loss with respect to a weight $w$ in an early layer factors as a product of local derivatives along the path from that weight to the output,

\[ \frac{\partial L}{\partial w} = \frac{\partial L}{\partial a_n}\,\frac{\partial a_n}{\partial a_{n-1}}\cdots\frac{\partial a_{k+1}}{\partial w}, \]

where each $a_j$ is the activation at layer $j$. Computing these factors from the output backward lets the algorithm reuse intermediate results, so one evaluation of the full gradient costs a small constant multiple of one forward pass, rather than one extra pass per weight. Each weight is then updated by gradient descent, $w \leftarrow w - \eta\,\partial L / \partial w$, with a learning rate $\eta$. Multi-layer networks could now represent and learn the functions single-layer ones could not, and connectionism revived as a serious research program. The same factored-product form also explains a difficulty that would stall deep networks for two more decades. When many factors are each smaller than one, their product shrinks toward zero; this is the vanishing-gradient problem, and it is part of why the breakthroughs of the 2010s depended on activation functions and initialization schemes that keep these factors near one.
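Both the chain-rule product and its tendency to vanish can be seen on the smallest possible network. The sketch below assumes a chain of scalar sigmoid units, $a_j = \sigma(w_j a_{j-1})$, with a squared-error loss; the network shape, the weight value 0.5, and all function names are illustrative assumptions rather than anything from the 1986 paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(weights, x):
    """Scalar chain: a_0 = x and a_j = sigmoid(w_j * a_{j-1}). Returns all activations."""
    acts = [x]
    for w in weights:
        acts.append(sigmoid(w * acts[-1]))
    return acts

def loss(weights, x, target):
    return 0.5 * (forward(weights, x)[-1] - target) ** 2

def grad_first_weight(weights, x, target):
    """dL/dw_1 from one backward sweep, multiplying local derivatives."""
    acts = forward(weights, x)
    upstream = acts[-1] - target                          # dL/da_n
    for j in range(len(weights), 1, -1):                  # layers n down to 2
        a_j = acts[j]
        upstream *= a_j * (1.0 - a_j) * weights[j - 1]    # da_j/da_{j-1}
    a_1 = acts[1]
    return upstream * a_1 * (1.0 - a_1) * acts[0]         # da_1/dw_1

def numeric_grad(weights, x, target, eps=1e-6):
    """Central finite difference on the first weight, as an independent check."""
    up = [weights[0] + eps] + weights[1:]
    down = [weights[0] - eps] + weights[1:]
    return (loss(up, x, target) - loss(down, x, target)) / (2 * eps)

# The gradient reaching the first weight shrinks as the chain gets deeper.
for depth in (2, 5, 10, 20):
    ws = [0.5] * depth
    print(depth, abs(grad_first_weight(ws, x=1.0, target=0.0)))
```

Each sigmoid factor is at most 0.25, so every added layer multiplies the early-layer gradient by a number well below one. The finite-difference helper is there only to confirm that the backward sweep computes the same derivative.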

2.6.2 6.2 The Statistical Turn

The 1990s brought a broader shift from hand-coded rules toward learning from data. In speech recognition and machine translation, probabilistic models such as hidden Markov models and statistical translation systems began to outperform their rule-based predecessors, in large part because growing digital corpora finally provided enough training material. Methods like support vector machines, introduced by Cortes and Vapnik in 1995, and ensemble techniques such as random forests gave practitioners powerful, well-understood tools with strong theoretical guarantees. The lesson of this period, sometimes summarized as “more data beats clever rules,” reoriented the field toward statistics and optimization. Symbolic AI did not vanish, but the center of gravity moved.

This was also the era of high-profile demonstrations. In 1997 IBM’s Deep Blue defeated world chess champion Garry Kasparov, a milestone that, although built on search and specialized hardware rather than learning, showed the public that machines could surpass humans in a domain long held up as a test of intellect.

2.7 7. The Deep Learning Revolution

2.7.1 7.1 The 2012 Inflection Point

The modern era of AI began in earnest in 2012. At the ImageNet Large Scale Visual Recognition Challenge, a deep convolutional neural network called AlexNet, built by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, cut the image classification error rate dramatically, beating the runner-up by a margin that stunned the computer vision community. The architecture itself was not entirely new; convolutional networks traced back to Yann LeCun’s LeNet of the late 1990s. What changed was the convergence of the three enabling forces.

First, data: the ImageNet dataset, organized by Fei-Fei Li and released in 2009, provided more than a million labeled images, enough to train a large network without it simply memorizing. Second, compute: AlexNet was trained on consumer graphics processing units, which offered the massive parallelism that neural network training demands at a fraction of the cost of specialized hardware. Third, algorithmic refinements such as the rectified linear activation and dropout regularization made deep networks practical to train. When all three lined up, depth suddenly paid off.

2.7.2 7.2 A Cascade of Capabilities

After 2012 progress came quickly across domains. Deep networks took over speech recognition, then natural language processing, then reinforcement learning. In 2014 Ian Goodfellow and colleagues introduced generative adversarial networks, which learned to synthesize realistic images, and the same year sequence-to-sequence models showed that neural networks could map one sequence to another for tasks like translation. In 2016 DeepMind’s AlphaGo defeated the professional Go player Lee Sedol, combining deep neural networks with tree search to master a game whose branching factor had long been considered out of reach. The common thread was representation learning: instead of engineers crafting features by hand, the networks discovered useful internal representations directly from raw data.

2.8 8. The Transformer and Large-Language-Model Era

2.8.1 8.1 Attention and the Transformer

A pivotal architectural change arrived in 2017 with the paper “Attention Is All You Need” by Vaswani and colleagues at Google. The Transformer replaced the recurrent and convolutional structures that had dominated sequence modeling with a mechanism called self-attention, which lets every position in a sequence directly attend to every other position.

The mechanism has a compact closed form. Each input position is projected into three vectors, a query $q$, a key $k$, and a value $v$. Stacking these over all positions into matrices $Q$, $K$, and $V$, scaled dot-product attention computes

\[ \text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V, \]

where $d_k$ is the dimension of the key vectors and the softmax is taken row by row. Reading the formula left to right clarifies why it works. The product $QK^\top$ scores how well each query matches each key, so entry $(i, j)$ measures the relevance of position $j$ to position $i$. The division by $\sqrt{d_k}$ keeps these scores from growing with dimension and pushing the softmax into a region of vanishing gradients. The softmax turns each row of scores into a probability distribution over positions, and multiplying by $V$ returns a weighted average of the value vectors. Every output is thus a content-addressed blend of the entire sequence.
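The formula translates almost line for line into code. The following is a minimal sketch in plain Python lists, assuming a single attention head with no masking and no learned projections; production implementations use batched tensor libraries, and the function names here are illustrative.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention; returns (outputs, attention weights)."""
    d_k = len(K[0])
    # scores[i][j]: how well query i matches key j, scaled by sqrt(d_k)
    scores = [
        [sum(qc * kc for qc, kc in zip(q, k)) / math.sqrt(d_k) for k in K]
        for q in Q
    ]
    weights = [softmax(row) for row in scores]   # each row sums to 1
    # each output is a weighted average of the value vectors
    outputs = [
        [sum(w * v[c] for w, v in zip(row, V)) for c in range(len(V[0]))]
        for row in weights
    ]
    return outputs, weights

# Three positions with 2-dimensional queries, keys, and values (toy numbers).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

outputs, weights = attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

The `weights` table has one row per query and one column per key, so its size grows with the square of the sequence length.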

This had two consequences that proved decisive. First, it captured long-range dependencies far better than recurrent networks: any two positions interact through a single attention step, a constant path length, whereas a recurrent network must pass information through a chain of intermediate steps that grows with distance and tends to forget. Second, the dominant computation is the matrix product $QK^\top$, which is highly parallel and maps directly onto the dense linear algebra that modern accelerators execute fastest. The price is that this product is quadratic in sequence length $n$: it costs $O(n^2 d)$ operations and produces an $n \times n$ score matrix that takes $O(n^2)$ memory, which is why long-context efficiency became an active research area. The Transformer was, in effect, an architecture designed to scale.

2.8.2 8.2 Pretraining and Transfer

The next insight was that a single large model could be pretrained on vast amounts of unlabeled text and then adapted to many tasks. In 2018 Google’s BERT used a masked-language objective to learn deep bidirectional representations, while OpenAI’s GPT used a left-to-right objective suited to generation. Both demonstrated transfer learning at scale: pretrain once on a broad corpus, then fine-tune cheaply for specific tasks. The economic logic was compelling, since the expensive part of training was amortized across countless downstream applications.

2.8.3 8.3 Scaling Laws and Emergence

Researchers then discovered that performance improved smoothly and predictably as models, data, and compute grew together, a relationship captured in empirical scaling laws. The cross-entropy test loss $L$ falls as a power law in each resource. Holding the others abundant, the loss as a function of parameter count $N$ takes the form

\[ L(N) \approx L_\infty + \left(\frac{N_c}{N}\right)^{\alpha_N}, \]

with an analogous expression in dataset size, where $L_\infty$ is an irreducible loss floor, $N_c$ is a constant, and the exponent $\alpha_N$ (empirically a small positive number well under one) sets how fast returns diminish. Because a power law is a straight line on log-log axes, a few small training runs let researchers fit the line and extrapolate the loss of a far larger run before paying for it. This turned model building into something closer to engineering: given a compute budget, one could forecast the gains from spending it, and later work (the Chinchilla analysis) refined how to split that budget between more parameters and more data, finding that earlier models were undertrained for their size.
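The fit-then-extrapolate procedure can be sketched on synthetic data. The example below assumes the loss floor $L_\infty$ is zero so that the power law is an exact straight line on log-log axes, and it uses made-up constants (`TRUE_ALPHA`, `TRUE_NC`) purely for illustration; they are not measured values from any published study.

```python
import math

def fit_power_law(ns, losses):
    """Least-squares line through (log N, log L).

    Returns (alpha, n_c) for the model L = (n_c / N) ** alpha,
    which assumes the irreducible loss floor is zero.
    """
    xs = [math.log(n) for n in ns]
    ys = [math.log(l) for l in losses]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    intercept = y_mean - slope * x_mean
    alpha = -slope                        # log L = alpha*log(n_c) - alpha*log(N)
    n_c = math.exp(intercept / alpha)
    return alpha, n_c

# Synthetic "small runs" generated from illustrative, made-up constants.
TRUE_ALPHA, TRUE_NC = 0.08, 1e13
small_ns = [1e6, 1e7, 1e8]
small_losses = [(TRUE_NC / n) ** TRUE_ALPHA for n in small_ns]

alpha, n_c = fit_power_law(small_ns, small_losses)
big_n = 1e11
predicted = (n_c / big_n) ** alpha
print("fitted alpha:", round(alpha, 4))
print("predicted loss at N = 1e11:", round(predicted, 4))
```

With real training runs the points are noisy and the floor is not zero, so the fit is an estimate with error bars rather than an exact recovery as it is here.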

GPT-3, released in 2020 with 175 billion parameters, validated the bet. It showed in-context learning, the ability to perform new tasks from a few examples in the prompt without any weight updates, a behavior that had not been explicitly designed. Capabilities that appeared abruptly at certain scales were described as emergent, although later analysis showed that how emergence is measured affects how sharp it appears: a metric that scores only exact answers can look like a sudden jump, while a smoother metric over the same runs reveals gradual improvement. The distinction matters, because it determines whether a given capability can be forecast in advance or only discovered after the fact.
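The measurement effect is easy to reproduce with a toy model. The sketch below assumes, purely for illustration, that a model gets each token of an answer right independently with the same probability; under that assumption a smoothly rising per-token accuracy produces an exact-match score that stays near zero and then climbs steeply.

```python
def exact_match(per_token_acc, answer_len):
    """Probability that every token of an answer is correct.

    Assumes independent, equally likely per-token errors (a simplification).
    """
    return per_token_acc ** answer_len

# Per-token accuracy improves gradually; exact match on a 10-token answer does not.
for p in (0.5, 0.7, 0.9, 0.95, 0.99):
    print("per-token:", p, "exact-match:", round(exact_match(p, 10), 4))
```

The underlying capability is the same in both columns. Only the scoring rule differs, which is why the choice of metric affects whether a capability looks forecastable.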

2.8.4 8.4 Alignment, Chat Interfaces, and the Public Era

Raw scale produced capable but unruly models, so attention turned to making them useful and safe. Reinforcement learning from human feedback, refined in work on InstructGPT in 2022, trained models to follow instructions and respect human preferences. When OpenAI wrapped such a model in a conversational interface and released ChatGPT in late 2022, adoption was unprecedented, reaching an enormous user base within weeks and bringing AI into mainstream awareness. In 2023 GPT-4 and competing frontier models from Anthropic, Google, and others extended capabilities to multimodal input and stronger reasoning. Through 2024 and into the mid 2020s, the frontier shifted toward systems that reason through problems step by step and act as agents, calling tools and executing multi-step plans, while open-weight model families made strong capabilities broadly available.

2.8.5 8.5 Why This Era Happened

The large-language-model era is the clearest illustration of the chapter’s thesis. The Transformer supplied an algorithm that scaled with hardware. The internet supplied training data at a scale no curated dataset could match. And a decade of investment in accelerators supplied the compute to train models with hundreds of billions of parameters. None of the three alone would have sufficed. Their simultaneous availability, together with the discovery that scaling reliably improved capability, produced the most rapid expansion of AI capability in the field’s history.

2.9 9. A Worked Example: Reading One Transition Through the Three Forces

The chapter’s thesis is easiest to trust when applied to a single transition in detail. Consider the convolutional network. Yann LeCun’s LeNet-5 reached strong handwritten-digit accuracy in 1998, using essentially the same convolutional structure, gradient training, and weight sharing that AlexNet would use in 2012. The algorithm, in other words, existed fourteen years before the breakthrough. Why did the field not surge in 1998?

Walk the three forces in turn. The algorithm was present and largely sufficient; AlexNet’s additions, mainly the rectified linear activation and dropout, were refinements rather than reinventions. The data was not present at the needed scale: the standard digit benchmark held tens of thousands of small grayscale images, whereas ImageNet, released in 2009, held more than a million labeled natural images across a thousand categories, enough to train a large network without it memorizing its training set. The compute was not present at the needed throughput or price: training a network of AlexNet’s size on the central processors of 1998 would have taken an impractically long time, while by 2012 commodity graphics processors delivered the parallel arithmetic that convolution demands at a fraction of the cost. Only when data and compute both crossed their thresholds did the latent algorithm convert into a result that reset the field. This is the multiplicative claim made concrete: two factors near zero held the product near zero for over a decade, despite a mature algorithm.

2.10 10. Patterns Across the History

Looking across seven decades, several patterns recur. Progress has alternated between symbolic and statistical approaches, each correcting the other’s weaknesses, and modern research increasingly seeks to combine the explicit structure of the former with the learning power of the latter. Hype cycles have repeatedly outrun delivery, and the two AI winters are a reminder that inflated expectations carry real costs when they collapse. Above all, the enabling forces of compute, data, and algorithms have set the pace. Whenever a new algorithmic idea met sufficient data and sufficient compute, the field surged forward, and whenever any of the three was missing, even brilliant ideas had to wait. Understanding this dynamic is the best guide to reasoning about where artificial intelligence may go next.

2.10.1 10.1 Reading the History Well: Pitfalls of Hindsight

A history of a fast-moving field is easy to misread, so a few cautions are in order.

Do not treat dates as discoveries. The years in the timeline mark when an idea became visible or dominant, not when it was first conceived. Backpropagation, convolution, and attention all predate the moments they are famous for. Asking “what changed around the idea” usually points at data or compute, not at the algorithm alone.
Beware survivorship and the smooth-curve illusion. The narrative naturally foregrounds the approaches that won. Many reasonable contemporaries lost, and the winners often looked unpromising at the time. A clean line drawn through the survivors can make the path look more inevitable than it was.
Distinguish capability from understanding. ELIZA in 1966 and large language models today both invite observers to attribute more comprehension than a system demonstrably has. The behavioral framing of the Turing Test was meant to sidestep this question, not to settle it.
Treat extrapolation with discipline. Scaling laws are empirical fits over a finite range, valid until a resource saturates or a new bottleneck appears. They have held remarkably well, but a power law is a description of past runs, not a guarantee about future ones, and the choice of metric can make progress look either smooth or abrupt.

The honest summary is that the three forces explain the timing and shape of past transitions better than they predict the content of the next one. The next surge will likely again require an algorithmic idea meeting sufficient data and compute, but which idea, and which bottleneck it relieves, is exactly what the history cannot tell us in advance.

2.11 References

Turing, A. M. (1936). “On Computable Numbers, with an Application to the Entscheidungsproblem.” Proceedings of the London Mathematical Society. https://doi.org/10.1112/plms/s2-42.1.230
McCulloch, W. S., and Pitts, W. (1943). “A Logical Calculus of the Ideas Immanent in Nervous Activity.” Bulletin of Mathematical Biophysics. https://doi.org/10.1007/BF02478259
Turing, A. M. (1950). “Computing Machinery and Intelligence.” Mind. https://doi.org/10.1093/mind/LIX.236.433
McCarthy, J., Minsky, M., Rochester, N., and Shannon, C. (1955). “A Proposal for the Dartmouth Summer Research Project on Artificial Intelligence.” http://jmc.stanford.edu/articles/dartmouth/dartmouth.pdf
Newell, A., and Simon, H. A. (1976). “Computer Science as Empirical Inquiry: Symbols and Search.” Communications of the ACM. https://doi.org/10.1145/360018.360022
Weizenbaum, J. (1966). “ELIZA: A Computer Program for the Study of Natural Language Communication Between Man and Machine.” Communications of the ACM. https://doi.org/10.1145/365153.365168
Lighthill, J. (1973). “Artificial Intelligence: A General Survey.” Science Research Council. https://www.chilton-computing.org.uk/inf/literature/reports/lighthill_report/p001.htm
Buchanan, B. G., and Shortliffe, E. H. (1984). “Rule-Based Expert Systems: The MYCIN Experiments of the Stanford Heuristic Programming Project.” Addison-Wesley. https://www.shortliffe.net/Buchanan-Shortliffe-1984/MYCIN%20Book.htm
Minsky, M., and Papert, S. (1969). “Perceptrons: An Introduction to Computational Geometry.” MIT Press.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). “Learning Representations by Back-Propagating Errors.” Nature. https://doi.org/10.1038/323533a0
Cortes, C., and Vapnik, V. (1995). “Support-Vector Networks.” Machine Learning. https://doi.org/10.1007/BF00994018
Campbell, M., Hoane, A. J., and Hsu, F. (2002). “Deep Blue.” Artificial Intelligence. https://doi.org/10.1016/S0004-3702(01)00129-1
Deng, J., Dong, W., Socher, R., Li, L., Li, K., and Fei-Fei, L. (2009). “ImageNet: A Large-Scale Hierarchical Image Database.” IEEE CVPR. https://doi.org/10.1109/CVPR.2009.5206848
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). “ImageNet Classification with Deep Convolutional Neural Networks.” NeurIPS. https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks
Goodfellow, I., et al. (2014). “Generative Adversarial Networks.” NeurIPS. https://arxiv.org/abs/1406.2661
Silver, D., et al. (2016). “Mastering the Game of Go with Deep Neural Networks and Tree Search.” Nature. https://doi.org/10.1038/nature16961
Vaswani, A., et al. (2017). “Attention Is All You Need.” NeurIPS. https://arxiv.org/abs/1706.03762
Devlin, J., Chang, M., Lee, K., and Toutanova, K. (2018). “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” https://arxiv.org/abs/1810.04805
Brown, T., et al. (2020). “Language Models Are Few-Shot Learners.” NeurIPS. https://arxiv.org/abs/2005.14165
Ouyang, L., et al. (2022). “Training Language Models to Follow Instructions with Human Feedback.” NeurIPS. https://arxiv.org/abs/2203.02155
Kaplan, J., et al. (2020). “Scaling Laws for Neural Language Models.” https://arxiv.org/abs/2001.08361
OpenAI (2023). “GPT-4 Technical Report.” https://arxiv.org/abs/2303.08774
Russell, S., and Norvig, P. (2021). “Artificial Intelligence: A Modern Approach,” 4th ed. Pearson. https://aima.cs.berkeley.edu/
Rosenblatt, F. (1958). “The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain.” Psychological Review. https://doi.org/10.1037/h0042519
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). “Gradient-Based Learning Applied to Document Recognition.” Proceedings of the IEEE. https://doi.org/10.1109/5.726791
Hoffmann, J., et al. (2022). “Training Compute-Optimal Large Language Models.” NeurIPS. https://arxiv.org/abs/2203.15556
Wei, J., et al. (2022). “Emergent Abilities of Large Language Models.” Transactions on Machine Learning Research. https://arxiv.org/abs/2206.07682
