2  The History of Artificial Intelligence

Artificial intelligence did not arrive as a single invention. It emerged from a long sequence of intellectual bets, technical breakthroughs, and disappointments, each of which reshaped what researchers believed machines could do. This chapter traces that history chronologically, from the formal foundations laid in the 1930s and 1940s through the large-language-model era of the mid-2020s. The central argument is that every major shift in the field was driven by some combination of three forces: available compute, available data, and new algorithmic ideas. When all three aligned, progress was explosive. When one was missing, the field stalled, sometimes for a decade.

2.1 A Chronological Overview

Before examining the eras in detail, it helps to see them laid out as a timeline. The diagram below marks the events that anchor each section of this chapter.

timeline
    title A Chronological Overview of AI
    1936 : Turing's "On Computable Numbers" defines computation
    1943 : McCulloch and Pitts model an artificial neuron
    1950 : Turing proposes the Imitation Game
    1956 : Dartmouth Summer Research Project names the field
    1957 : Rosenblatt builds the Perceptron
    1966 : "ELIZA; early machine translation falters"
    1969 : Minsky and Papert publish "Perceptrons"
    1973 : Lighthill Report triggers the first AI winter
    1980 : XCON expert system deployed at DEC
    1986 : Backpropagation popularized; connectionism revives
    1987 : LISP machine market collapses; second AI winter
    1997 : Deep Blue defeats Kasparov
    1998 : "LeNet-5; gradient learning for documents"
    2006 : Hinton's deep belief nets revive "deep" learning
    2009 : ImageNet dataset released
    2012 : AlexNet wins ImageNet by a wide margin
    2014 : "GANs; sequence-to-sequence learning"
    2017 : "Attention Is All You Need" introduces the Transformer
    2018 : BERT and GPT show transfer learning at scale
    2020 : GPT-3 demonstrates in-context learning
    2022 : ChatGPT reaches mainstream users
    2023 : GPT-4 and multimodal frontier models
    2024 : Reasoning-focused and agentic systems mature

The sections that follow walk through these markers and, more importantly, explain why each transition occurred.

2.2 Early Foundations

2.2.1 The Formal Idea of Computation

The intellectual prerequisite for artificial intelligence was a precise definition of computation itself. In 1936 Alan Turing introduced the abstract machine that now bears his name and argued that a simple device manipulating symbols on a tape could carry out any effective procedure, a claim now known as the Church-Turing thesis. It implied that reasoning, if it could be reduced to symbol manipulation, was in principle mechanizable. The Turing machine gave later researchers a reason to believe that thought might be a computational process rather than something irreducibly biological.
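To make the tape-and-rules picture concrete, here is a minimal sketch of a one-tape machine simulator in Python. The transition table, the state names `carry` and `halt`, and the binary-increment task are illustrative assumptions chosen for brevity; they are not drawn from Turing's paper.

```python
def run_turing_machine(tape, rules, state, head, blank="_", max_steps=1000):
    """Run a one-tape machine until it reaches 'halt' or hits max_steps."""
    cells = dict(enumerate(tape))  # sparse tape: position -> symbol
    for _ in range(max_steps):
        if state == "halt":
            break
        symbol = cells.get(head, blank)
        write, move, state = rules[(state, symbol)]
        cells[head] = write
        head += {"L": -1, "R": 1, "N": 0}[move]
    lo, hi = min(cells), max(cells)
    return "".join(cells.get(i, blank) for i in range(lo, hi + 1)).strip(blank)

# Hypothetical transition table for binary increment:
# (state, symbol read) -> (symbol to write, head move, next state)
INCREMENT = {
    ("carry", "1"): ("0", "L", "carry"),  # 1 + carry = 0, keep carrying left
    ("carry", "0"): ("1", "N", "halt"),   # 0 + carry = 1, done
    ("carry", "_"): ("1", "N", "halt"),   # ran off the left edge: new digit
}

def increment(bits):
    """Start on the rightmost bit in the 'carry' state and run to halt."""
    return run_turing_machine(bits, INCREMENT, state="carry", head=len(bits) - 1)

print(increment("1011"))  # 1100
print(increment("111"))   # 1000
```

The whole "program" is the three-entry table. Everything else is the generic read, write, and move loop, which is the sense in which a fixed, simple device can carry out different procedures.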

A few years later, in 1943, Warren McCulloch and Walter Pitts published a paper showing that networks of simplified neurons, each firing according to a threshold rule, could compute logical functions. This was the first bridge between brains and logic, and it planted the seed for both the symbolic and connectionist traditions that would later diverge.
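A threshold unit of this kind is easy to sketch. The weights and thresholds below are hand-chosen for illustration and are not taken from the 1943 paper; the function name `threshold_unit` is likewise an assumption.

```python
from itertools import product

def threshold_unit(inputs, weights, threshold):
    """Fire (return 1) when the weighted input sum reaches the threshold."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= threshold else 0

# Illustrative weight and threshold choices for three logical functions.
def AND(a, b):
    return threshold_unit((a, b), (1, 1), 2)

def OR(a, b):
    return threshold_unit((a, b), (1, 1), 1)

def NOT(a):
    return threshold_unit((a,), (-1,), 0)

for a, b in product((0, 1), repeat=2):
    print(a, b, "AND:", AND(a, b), "OR:", OR(a, b))
print("NOT 0:", NOT(0), "NOT 1:", NOT(1))
```

Because AND, OR, and NOT suffice to build any Boolean function, networks of such units can in principle compute any finite logical expression.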

2.2.2 Turing’s Imitation Game

In 1950 Turing published “Computing Machinery and Intelligence,” which replaced the ill-defined question “Can machines think?” with an operational test. In the Imitation Game, now called the Turing Test, a human interrogator converses through text with a machine and another human, and the machine succeeds if the interrogator cannot reliably tell which is which. The proposal mattered less as a benchmark than as a philosophical move: it shifted the conversation from metaphysics toward observable behavior, which is the stance most of the field has taken ever since.

2.2.3 The Dartmouth Workshop

The field acquired its name and its founding agenda at the Dartmouth Summer Research Project on Artificial Intelligence in 1956. John McCarthy, Marvin Minsky, Nathaniel Rochester, and Claude Shannon organized the gathering on the premise that “every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it.” The proposal was strikingly confident, and that confidence set the tone for the next two decades. Dartmouth did not solve any technical problem, but it gathered the people who would, defined a shared vocabulary, and secured a sense of legitimacy that attracted funding.

2.3 Symbolic AI and the Logic Era

2.3.1 Physical Symbol Systems

The dominant paradigm from the late 1950s into the 1980s was symbolic AI, sometimes called good old-fashioned AI. Its guiding hypothesis, articulated by Allen Newell and Herbert Simon, held that a physical symbol system has the necessary and sufficient means for general intelligent action. On this view, intelligence is the manipulation of symbols according to formal rules, and the path to thinking machines runs through logic, search, and explicit knowledge representation.

Early systems gave the approach real momentum. Newell and Simon’s Logic Theorist (1956) proved theorems from Whitehead and Russell’s Principia Mathematica, and their later General Problem Solver attempted to model human means-ends reasoning. McCarthy invented LISP in 1958, a language whose treatment of code as data made it the natural home of symbolic programming for decades.

2.3.2 The Limits of Pure Reasoning

Symbolic systems excelled in narrow, well-structured domains but struggled wherever the world was messy. Joseph Weizenbaum’s ELIZA (1966) produced strikingly human-seeming dialogue by pattern matching alone, which revealed how easily people attribute understanding to shallow systems. Meanwhile, early machine translation efforts foundered on ambiguity and context. A 1966 report by the Automatic Language Processing Advisory Committee (ALPAC) concluded that the work had not delivered, and funding was cut sharply. The deeper problem was that the task of encoding common sense by hand proved enormous, a difficulty later known as the knowledge acquisition bottleneck. Reasoning algorithms also faced combinatorial explosion: search spaces grew exponentially with problem size, faster than any computer of the era could handle.
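The pattern-matching trick behind ELIZA-style dialogue can be sketched in a few lines. The rules and templates below are hypothetical examples written for this illustration; they are not Weizenbaum's original script, which was considerably richer.

```python
import re

# Hypothetical rules in the spirit of ELIZA: (pattern, reply template).
RULES = [
    (re.compile(r"\bI need (.+)", re.IGNORECASE), "Why do you need {0}?"),
    (re.compile(r"\bI am (.+)", re.IGNORECASE), "How long have you been {0}?"),
    (re.compile(r"\bmy (mother|father)\b", re.IGNORECASE), "Tell me more about your {0}."),
]
FALLBACK = "Please go on."

def respond(utterance):
    """Return the template for the first matching pattern, else a stock reply."""
    text = utterance.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.search(text)
        if match:
            return template.format(*match.groups())
    return FALLBACK

print(respond("I am tired of debugging."))
print(respond("The weather is nice."))
```

The program holds no model of the conversation at all. It echoes fragments of the input back inside canned templates, which is why the apparent understanding disappears as soon as the input falls outside the patterns.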

2.4 The First AI Winter

By the early 1970s the gap between promises and results had become impossible to ignore. In Britain, the 1973 Lighthill Report assessed AI research for the Science Research Council and concluded that the field had failed to achieve its grand objectives, singling out the combinatorial explosion as a fundamental obstacle. The report led to deep cuts in British funding. In the United States, agencies that had supported open-ended research grew impatient and redirected money toward projects with concrete deliverables.

The causes were structural rather than accidental. Compute was scarce and expensive, so algorithms that scaled poorly hit hard walls. Data in machine-readable form barely existed, so systems could not learn from experience. And the reigning algorithmic philosophy, hand-built symbolic rules, did not degrade gracefully when faced with novelty. With all three enabling forces weak, the first AI winter set in and lasted through much of the 1970s.

2.5 Expert Systems and Their Collapse

2.5.1 Knowledge as a Product

AI returned to favor in the late 1970s and early 1980s through expert systems, programs that captured the decision rules of human specialists in a narrow domain. MYCIN, developed at Stanford, diagnosed bacterial infections and recommended antibiotics, often matching specialist physicians. The commercial breakthrough was XCON (also called R1), deployed at Digital Equipment Corporation in 1980 to configure computer orders. XCON reportedly saved the company tens of millions of dollars a year, and it convinced industry that AI could pay for itself.
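As a rough illustration of what "capturing decision rules" means in practice, the sketch below runs a tiny forward-chaining rule engine. Forward chaining is only one of several inference strategies used in expert systems, and the rules and fact names here are invented for the example; none come from XCON or MYCIN.

```python
# Hypothetical configuration rules: (set of required facts, derived fact).
RULES = [
    ({"has_disk_drive"}, "needs_disk_controller"),
    ({"needs_disk_controller"}, "needs_expansion_slot"),
    ({"needs_expansion_slot", "cabinet_full"}, "needs_second_cabinet"),
]

def forward_chain(facts, rules):
    """Fire rules whose conditions hold until no new fact can be derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in rules:
            if conditions <= derived and conclusion not in derived:
                derived.add(conclusion)
                changed = True
    return derived

order = {"has_disk_drive", "cabinet_full"}
print(sorted(forward_chain(order, RULES) - order))
```

All of the system's competence lives in the `RULES` list. An input that no rule anticipates simply derives nothing, and every added rule can interact with the existing ones.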

A whole sector grew around this idea, including companies that sold specialized LISP machines optimized for symbolic computation. Japan’s ambitious Fifth Generation Computer Systems project, launched in 1982, poured national resources into logic programming hardware, and Western governments responded with programs of their own.

2.5.2 Why It Did Not Last

The expert-system boom carried the seeds of the second winter. The systems were brittle: they performed well inside their narrow rule sets but failed unpredictably at the edges, and they could not learn or update themselves. Maintaining large rule bases became costly, since every new case risked conflicting with existing rules. Most damaging, the specialized hardware lost its reason to exist when general-purpose workstations from companies like Sun, along with Intel-based PCs and Apple desktops, became cheap and fast enough to run the same software. The LISP machine market collapsed around 1987, the Fifth Generation project ended without meeting its goals, and funding evaporated again. This second AI winter ran roughly from the late 1980s into the mid-1990s.

2.6 Connectionism and Statistical Machine Learning

2.6.1 The Neural Network Revival

A different tradition had been quietly maturing alongside symbolic AI. Frank Rosenblatt’s Perceptron (1957) was an early trainable neural model, but Marvin Minsky and Seymour Papert’s 1969 book “Perceptrons” showed that a single-layer perceptron could not represent simple functions such as exclusive-or, and the result chilled neural network research for years. The thaw came in 1986, when David Rumelhart, Geoffrey Hinton, and Ronald Williams popularized backpropagation, an efficient method for training multi-layer networks by propagating error gradients backward through the layers. Multi-layer networks could represent the functions single-layer ones could not, and connectionism revived as a serious research program.
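The exclusive-or limitation can be checked directly. The sketch below trains a single-layer perceptron with the classic error-correction rule on AND and on XOR, then shows a two-layer network of threshold units that does compute XOR. The two-layer weights are set by hand purely to demonstrate representability; they are not learned by backpropagation, and the epoch limit is an arbitrary assumption.

```python
from itertools import product

def step(z):
    return 1 if z > 0 else 0

def train_perceptron(samples, epochs=100):
    """Classic perceptron rule; returns (weights, bias, converged)."""
    w, b = [0, 0], 0
    for _ in range(epochs):
        errors = 0
        for (x1, x2), target in samples:
            error = target - step(w[0] * x1 + w[1] * x2 + b)
            if error:
                w[0] += error * x1
                w[1] += error * x2
                b += error
                errors += 1
        if errors == 0:  # a full pass with no mistakes: a separating line exists
            return w, b, True
    return w, b, False

inputs = list(product((0, 1), repeat=2))
AND = [(x, x[0] & x[1]) for x in inputs]
XOR = [(x, x[0] ^ x[1]) for x in inputs]

print("AND converged:", train_perceptron(AND)[2])
print("XOR converged:", train_perceptron(XOR)[2])

def xor_two_layer(x1, x2):
    """Two hidden threshold units (OR and NAND) feeding an AND output unit."""
    h_or = step(x1 + x2 - 0.5)
    h_nand = step(-x1 - x2 + 1.5)
    return step(h_or + h_nand - 1.5)

print([xor_two_layer(*x) for x in inputs])
```

No single line separates the XOR-positive inputs from the XOR-negative ones, so the single-layer rule never completes an error-free pass. The hidden layer fixes this by re-encoding the inputs into a form the output unit can separate.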

2.6.2 The Statistical Turn

The 1990s brought a broader shift from hand-coded rules toward learning from data. In speech recognition and machine translation, probabilistic models such as hidden Markov models and statistical translation systems began to outperform their rule-based predecessors, in large part because growing digital corpora finally provided enough training material. Methods like support vector machines, introduced by Cortes and Vapnik in 1995, and ensemble techniques such as random forests gave practitioners powerful, well-understood tools with strong theoretical guarantees. The lesson of this period, sometimes summarized as “more data beats clever rules,” reoriented the field toward statistics and optimization. Symbolic AI did not vanish, but the center of gravity moved.

This was also the era of high-profile demonstrations. In 1997 IBM’s Deep Blue defeated world chess champion Garry Kasparov, a milestone that, although built on search and specialized hardware rather than learning, showed the public that machines could surpass humans in a domain long held up as a test of intellect.

2.7 The Deep Learning Revolution

2.7.1 The 2012 Inflection Point

The modern era of AI began in earnest in 2012. At the ImageNet Large Scale Visual Recognition Challenge, a deep convolutional neural network called AlexNet, built by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, cut the top-5 image classification error to about 15 percent, against roughly 26 percent for the runner-up, a margin that stunned the computer vision community. The architecture itself was not entirely new; convolutional networks traced back to Yann LeCun’s work in the late 1980s and to LeNet-5 in 1998. What changed was the convergence of the three enabling forces.

First, data: the ImageNet dataset, organized by Fei-Fei Li and released in 2009, provided more than a million labeled images, enough to train a large network without it simply memorizing. Second, compute: AlexNet was trained on consumer graphics processing units, which offered the massive parallelism that neural network training demands at a fraction of the cost of specialized hardware. Third, algorithmic refinements such as the rectified linear activation and dropout regularization made deep networks practical to train. When all three lined up, depth suddenly paid off.

2.7.2 A Cascade of Capabilities

After 2012 progress came quickly across domains. Deep networks took over speech recognition, then natural language processing, then reinforcement learning. In 2014 Ian Goodfellow and colleagues introduced generative adversarial networks, which learned to synthesize realistic images, and the same year sequence-to-sequence models showed that neural networks could map one sequence to another for tasks like translation. In 2016 DeepMind’s AlphaGo defeated the professional Go player Lee Sedol, combining deep neural networks with tree search to master a game whose branching factor had long been considered out of reach. The common thread was representation learning: instead of engineers crafting features by hand, the networks discovered useful internal representations directly from raw data.

2.8 The Transformer and Large-Language-Model Era

2.8.1 Attention and the Transformer

A pivotal architectural change arrived in 2017 with the paper “Attention Is All You Need” by Vaswani and colleagues at Google. The Transformer replaced the recurrent and convolutional structures that had dominated sequence modeling with a mechanism called self-attention, which lets every position in a sequence directly attend to every other position. This had two consequences that proved decisive. It captured long-range dependencies far better than recurrent networks, and, because the computation was highly parallel, it made full use of modern accelerators. The Transformer was, in effect, an architecture designed to scale.
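The core of self-attention fits in a short sketch. The version below implements scaled dot-product attention in plain Python on a toy sequence. For brevity it uses the token vectors directly as queries, keys, and values, whereas a real Transformer first applies learned linear projections and runs several attention heads in parallel; the token values are made up.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def self_attention(queries, keys, values):
    """Scaled dot-product attention: each position mixes all value vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for q in queries:
        # One row of attention weights: how much this position attends to each other one.
        row = softmax([dot(q, k) / scale for k in keys])
        weights.append(row)
        outputs.append([
            sum(w * v[j] for w, v in zip(row, values))
            for j in range(len(values[0]))
        ])
    return outputs, weights

# Toy sequence of three 2-dimensional token vectors (illustrative numbers).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` is computed independently of the others, which is what makes the operation easy to parallelize. Every position sees every other position in a single step, with no recurrence to carry information along the sequence.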

2.8.2 Pretraining and Transfer

The next insight was that a single large model could be pretrained on vast amounts of unlabeled text and then adapted to many tasks. In 2018 Google’s BERT used a masked-language objective to learn deep bidirectional representations, while OpenAI’s GPT used a left-to-right objective suited to generation. Both demonstrated transfer learning at scale: pretrain once on a broad corpus, then fine-tune cheaply for specific tasks. The economic logic was compelling, since the expensive part of training was amortized across countless downstream applications.

2.8.3 Scaling Laws and Emergence

Researchers then discovered that performance improved smoothly and predictably as models, data, and compute grew together, a relationship captured in empirical scaling laws. This turned model building into something closer to engineering: given a compute budget, one could forecast the gains from spending it. GPT-3, released in 2020 with 175 billion parameters, validated the bet. It showed in-context learning, the ability to perform new tasks from a few examples in the prompt without any weight updates, a behavior that had not been explicitly designed. Capabilities that appeared abruptly at certain scales were described as emergent, although later analysis showed that how emergence is measured affects how sharp it appears.
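The forecasting idea can be illustrated with a small sketch, assuming the power-law form loss = a * compute^(-b) commonly reported in the scaling-law literature. The data points below are synthetic, generated from made-up constants, and do not come from any published study; the function name `fit_power_law` is also an assumption.

```python
import math

def fit_power_law(compute, loss):
    """Least-squares fit of loss = a * compute**(-b) in log-log space."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    a = math.exp(mean_y - slope * mean_x)
    return a, -slope

# Synthetic measurements from made-up constants (a=10, b=0.05);
# these are NOT values from any real model family.
compute = [1e3, 1e4, 1e5, 1e6]
loss = [10.0 * c ** -0.05 for c in compute]

a, b = fit_power_law(compute, loss)
forecast = a * 1e8 ** (-b)  # extrapolate to a larger, unseen budget
print(f"a={a:.3f}, b={b:.3f}, forecast loss at 1e8: {forecast:.3f}")
```

A power law is a straight line on log-log axes, so a handful of small training runs is enough to estimate the slope and extrapolate. Real measurements are noisy and the fitted trend need not hold indefinitely, so such forecasts are estimates, not guarantees.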

2.8.4 Alignment, Chat Interfaces, and the Public Era

Raw scale produced capable but unruly models, so attention turned to making them useful and safe. Reinforcement learning from human feedback, refined in work on InstructGPT in 2022, trained models to follow instructions and respect human preferences. When OpenAI wrapped such a model in a conversational interface and released ChatGPT in late 2022, adoption was unprecedented, reaching a reported one million users within days and bringing AI into mainstream awareness. In 2023 GPT-4 and competing frontier models from Anthropic, Google, and others extended capabilities to multimodal input and stronger reasoning. Through 2024 and into the mid-2020s, the frontier shifted toward systems that reason through problems step by step and act as agents, calling tools and executing multi-step plans, while open-weight model families made strong capabilities broadly available.

2.8.5 Why This Era Happened

The large-language-model era is the clearest illustration of the chapter’s thesis. The Transformer supplied an algorithm that scaled with hardware. The internet supplied training data at a scale no curated dataset could match. And a decade of investment in accelerators supplied the compute to train models with hundreds of billions of parameters. None of the three alone would have sufficed. Their simultaneous availability, together with the discovery that scaling reliably improved capability, produced the most rapid expansion of AI capability in the field’s history.

2.9 Patterns Across the History

Looking across seven decades, several patterns recur. Progress has alternated between symbolic and statistical approaches, each correcting the other’s weaknesses, and modern research increasingly seeks to combine the explicit structure of the former with the learning power of the latter. Hype cycles have repeatedly outrun delivery, and the two AI winters are a reminder that inflated expectations carry real costs when they collapse. Above all, the enabling forces of compute, data, and algorithms have set the pace. Whenever a new algorithmic idea met sufficient data and sufficient compute, the field surged forward, and whenever any of the three was missing, even brilliant ideas had to wait. Understanding this dynamic is the best guide to reasoning about where artificial intelligence may go next.

2.10 References

  1. Turing, A. M. (1936). “On Computable Numbers, with an Application to the Entscheidungsproblem.” Proceedings of the London Mathematical Society. https://doi.org/10.1112/plms/s2-42.1.230
  2. McCulloch, W. S., and Pitts, W. (1943). “A Logical Calculus of the Ideas Immanent in Nervous Activity.” Bulletin of Mathematical Biophysics. https://doi.org/10.1007/BF02478259
  3. Turing, A. M. (1950). “Computing Machinery and Intelligence.” Mind. https://doi.org/10.1093/mind/LIX.236.433
  4. McCarthy, J., Minsky, M., Rochester, N., and Shannon, C. (1955). “A Proposal for the Dartmouth Summer Research Project on Artificial Intelligence.” http://jmc.stanford.edu/articles/dartmouth/dartmouth.pdf
  5. Newell, A., and Simon, H. A. (1976). “Computer Science as Empirical Inquiry: Symbols and Search.” Communications of the ACM. https://doi.org/10.1145/360018.360022
  6. Weizenbaum, J. (1966). “ELIZA: A Computer Program for the Study of Natural Language Communication Between Man and Machine.” Communications of the ACM. https://doi.org/10.1145/365153.365168
  7. Lighthill, J. (1973). “Artificial Intelligence: A General Survey.” Science Research Council. https://www.chilton-computing.org.uk/inf/literature/reports/lighthill_report/p001.htm
  8. Buchanan, B. G., and Shortliffe, E. H. (1984). “Rule-Based Expert Systems: The MYCIN Experiments of the Stanford Heuristic Programming Project.” Addison-Wesley. https://www.shortliffe.net/Buchanan-Shortliffe-1984/MYCIN%20Book.htm
  9. Minsky, M., and Papert, S. (1969). “Perceptrons: An Introduction to Computational Geometry.” MIT Press.
  10. Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). “Learning Representations by Back-Propagating Errors.” Nature. https://doi.org/10.1038/323533a0
  11. Cortes, C., and Vapnik, V. (1995). “Support-Vector Networks.” Machine Learning. https://doi.org/10.1007/BF00994018
  12. Campbell, M., Hoane, A. J., and Hsu, F. (2002). “Deep Blue.” Artificial Intelligence. https://doi.org/10.1016/S0004-3702(01)00129-1
  13. Deng, J., Dong, W., Socher, R., Li, L., Li, K., and Fei-Fei, L. (2009). “ImageNet: A Large-Scale Hierarchical Image Database.” IEEE CVPR. https://doi.org/10.1109/CVPR.2009.5206848
  14. Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). “ImageNet Classification with Deep Convolutional Neural Networks.” NeurIPS. https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks
  15. Goodfellow, I., et al. (2014). “Generative Adversarial Networks.” NeurIPS. https://arxiv.org/abs/1406.2661
  16. Silver, D., et al. (2016). “Mastering the Game of Go with Deep Neural Networks and Tree Search.” Nature. https://doi.org/10.1038/nature16961
  17. Vaswani, A., et al. (2017). “Attention Is All You Need.” NeurIPS. https://arxiv.org/abs/1706.03762
  18. Devlin, J., Chang, M., Lee, K., and Toutanova, K. (2018). “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” https://arxiv.org/abs/1810.04805
  19. Brown, T., et al. (2020). “Language Models Are Few-Shot Learners.” NeurIPS. https://arxiv.org/abs/2005.14165
  20. Ouyang, L., et al. (2022). “Training Language Models to Follow Instructions with Human Feedback.” NeurIPS. https://arxiv.org/abs/2203.02155
  21. Kaplan, J., et al. (2020). “Scaling Laws for Neural Language Models.” https://arxiv.org/abs/2001.08361
  22. OpenAI (2023). “GPT-4 Technical Report.” https://arxiv.org/abs/2303.08774
  23. Russell, S., and Norvig, P. (2021). “Artificial Intelligence: A Modern Approach,” 4th ed. Pearson. https://aima.cs.berkeley.edu/