3 The Philosophy of Machine Intelligence

The question of whether machines can think is older than the digital computer itself, and it remains stubbornly unresolved even as large language models produce text that many readers cannot distinguish from human writing. This chapter examines the conceptual foundations beneath that question. It asks what we mean by intelligence, surveys the major thought experiments and arguments that have shaped the debate, and shows how these decades-old disputes structure current arguments about whether systems such as GPT, Claude, and Gemini understand anything at all. The goal is not to settle the matter but to give you the conceptual vocabulary to reason about it carefully, to recognize which disputes are verbal and which are substantive, and to identify what would actually count as evidence either way.

A note on method before we begin. Philosophy of mind proceeds largely by analysis of concepts and by thought experiments, idealized scenarios that isolate one variable so that an intuition can be tested. A thought experiment is not a proof. It is a probe of our concepts, and its force depends on whether the intuition it elicits is reliable and whether the scenario it describes is coherent. Throughout this chapter we will treat each famous argument both as a claim and as an object to be examined, asking not only what it concludes but where its premises could be resisted.

3.1 1. What Intelligence Means and Why It Resists Definition

3.1.1 1.1 The Problem of Definition

Intelligence is one of those concepts that everyone uses confidently and no one can pin down. Psychologists have offered operational definitions tied to test performance, biologists have tied it to adaptive behavior, and computer scientists have often defaulted to task competence. Each definition captures something while excluding something else. A definition narrow enough to be measurable (for example, performance on a fixed benchmark) tends to miss the open ended flexibility we associate with genuine intelligence, while a definition broad enough to include that flexibility tends to become untestable.

It helps to distinguish three kinds of definition that often get conflated. An operational definition fixes a procedure that produces a number, such as a score on a test. A stipulative definition simply declares how a term will be used in a given discussion, so that arguments can proceed without ambiguity. A real definition purports to state the essence of the thing, the property that makes something intelligence rather than merely correlated with it. Most disputes about machine intelligence are confused because one party offers an operational definition (passing a benchmark) while another demands a real definition (genuine understanding), and the two then talk past each other. Being explicit about which kind of definition is in play resolves a surprising number of apparent disagreements.

One influential attempt to formalize a general notion is the proposal of Legg and Hutter (2007), who define the intelligence of an agent as its expected performance, suitably weighted, across the space of all computable reward-giving environments, with simpler environments weighted more heavily by an Occam-style prior. The definition is precise and captures the intuition that intelligence is general competence rather than narrow skill, but it is uncomputable, so it can at best be approximated in practice, and it presupposes a reward signal, which already builds in a contestable view of what intelligence is for. It is best read as a clarifying idealization rather than a usable yardstick.
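For readers who want to see the shape of the definition, it is usually written as the weighted sum below. The symbols follow the common presentation of Legg and Hutter's measure and are not notation used elsewhere in this chapter: \(\pi\) is the agent, \(E\) the set of computable reward-giving environments, \(K(\mu)\) the Kolmogorov complexity of environment \(\mu\), and \(V^{\pi}_{\mu}\) the expected total reward the agent earns in \(\mu\).

```latex
\Upsilon(\pi) \;=\; \sum_{\mu \in E} 2^{-K(\mu)} \, V^{\pi}_{\mu}
```

The factor \(2^{-K(\mu)}\) is the Occam-style prior: environments with shorter descriptions count for more. Because \(K\) is itself uncomputable, so is the sum.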

3.1.2 1.2 Intelligence as a Cluster Concept

Part of the difficulty is that intelligence is a cluster concept. It bundles together perception, memory, reasoning, learning, planning, language, and the capacity to transfer skill from one domain to another. These capacities can come apart. A calculator exceeds any human at arithmetic yet plans nothing, while a crow solves novel physical puzzles without arithmetic. Because the bundle is loose, any single yardstick will look arbitrary to someone who weights the components differently.

A useful way to make this precise is to think of intelligence as a profile rather than a scalar. We can imagine a vector of capacities, each measured on its own axis: perceptual discrimination, memory capacity, deductive reasoning, sample efficient learning, planning depth, linguistic competence, and transfer. Two systems with the same overall summary score can have radically different profiles. A modern language model sits very high on linguistic competence and broad recall while sitting low on reliable multi step planning and on sample efficiency, since it required orders of magnitude more text than a child sees to reach its competence. A bee sits low on language and high on robust real time sensorimotor control. Reducing such profiles to a single ranking discards exactly the information that matters, which is why headline claims of the form “system X is now smarter than a human” are almost always too coarse to evaluate.
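The contrast between a profile and a scalar can be made concrete with a small sketch. The axis names come from the paragraph above, but the scores are invented for illustration on a 0 to 10 scale and are not measurements of any real system; `summary_score` and `largest_differences` are hypothetical helper names.

```python
AXES = ["perception", "memory", "deduction", "sample_efficiency",
        "planning", "language", "transfer"]

# Invented scores (0-10), chosen only so that the two averages coincide.
language_model = dict(zip(AXES, [3, 9, 6, 1, 3, 9, 4]))
bee = dict(zip(AXES, [9, 3, 2, 8, 6, 0, 7]))

def summary_score(profile):
    """Collapse a profile to one number by averaging its axes."""
    return sum(profile.values()) / len(profile)

def largest_differences(a, b, top=3):
    """Axes on which two profiles differ most, largest gap first."""
    return sorted(AXES, key=lambda axis: abs(a[axis] - b[axis]), reverse=True)[:top]

print(summary_score(language_model), summary_score(bee))
print(largest_differences(language_model, bee))
```

The two averages are identical, yet the profiles disagree sharply on several axes, which is exactly the information a single ranking throws away.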

3.1.3 1.3 Behavioral Versus Internal Criteria

A second fault line runs between behavioral and internal accounts. A behavioral account says that intelligence is as intelligence does: if a system behaves in the ways an intelligent agent would, that settles the matter. An internal account insists that behavior is merely evidence, and that what makes behavior intelligent is the kind of process that produces it.

A clean way to see the stakes is Ned Block’s blockhead, a hypothetical machine that holds a gigantic lookup table storing a sensible response to every conversation a human interlocutor could produce within some bounded length (Block, 1981). By construction the blockhead would behave exactly as an intelligent agent does, yet intuitively it understands nothing, because all of its competence was front-loaded by whoever filled in the table, and at run time it performs no reasoning at all. The blockhead is physically impossible, since the table would be astronomically larger than the observable universe, but it is logically coherent, and that is enough to make the philosophical point: behavioral indistinguishability over any finite test does not by itself guarantee that the right kind of process is occurring. This tension between what a system does and how it does it recurs throughout the chapter, and it is the hinge on which most of the classic arguments turn.

The blockhead also teaches a methodological lesson that recurs with language models. What separates the blockhead from a genuine reasoner is not its outputs but the structure of its competence: whether the right answers are computed compositionally and generalize to inputs never anticipated by a designer, or merely retrieved. This shifts the empirical question from “does it produce the right outputs” to “does it produce them in a way that generalizes systematically,” which is a question we can sometimes probe rather than merely assert.
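A toy version of the contrast between retrieval and computation may help. The sketch below is an illustration under obvious simplifications, not a model of any real system: the "conversation" is single-digit addition, and the function names `blockhead` and `compositional` are hypothetical.

```python
from typing import Optional

# Blockhead: the designer precomputes every answer for a bounded input space.
# The arithmetic happens here, when the table is built, not at run time.
TABLE = {f"{a} + {b}": str(a + b) for a in range(10) for b in range(10)}

def blockhead(question: str) -> Optional[str]:
    """Pure retrieval; returns None for anything the designer did not anticipate."""
    return TABLE.get(question)

def compositional(question: str) -> Optional[str]:
    """Parses the question and computes the answer from its parts."""
    parts = question.split()
    if len(parts) == 3 and parts[1] == "+" and parts[0].isdigit() and parts[2].isdigit():
        return str(int(parts[0]) + int(parts[2]))
    return None

for q in ["3 + 4", "12 + 30"]:
    print(q, "->", blockhead(q), compositional(q))
```

On questions inside the table the two are indistinguishable. They come apart only on inputs the table never anticipated, which is why generalization, rather than raw accuracy on familiar cases, is the informative measurement.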

3.1.4 1.4 Why the Definitional Problem Matters for AI

The definitional problem is not idle philosophizing. When researchers claim a system has reached human level intelligence, or when critics deny it, they are often disagreeing about definitions rather than about facts. Clarifying what we are asking lets us see that some apparent disputes are verbal, while others are substantive. As we will see, the modern debate about whether language models understand is partly a rerun of the behavioral versus internal disagreement under new conditions, now with the added wrinkle that we can, to a limited degree, look inside the systems and ask what their internal representations actually encode.

3.2 2. The Turing Test

3.2.1 2.1 Turing’s Proposal

In his 1950 paper “Computing Machinery and Intelligence,” Alan Turing sidestepped the unanswerable question “Can machines think?” and replaced it with an operational one (Turing, 1950). He described the imitation game: a human interrogator converses by text with two hidden participants, one human and one machine, and tries to tell which is which. If the machine is taken for the human about as often as the human is, Turing proposed, we should be prepared to say it thinks. The move is deliberately behavioral. Turing was skeptical that “thinking” could be defined in a way that would command agreement, so he offered a test with an observable outcome that any machine could in principle pass or fail.

It is worth stating the criterion with some care, because casual summaries distort it. Turing did not propose that a single interrogator be fooled once. The natural reading is statistical: across many trials with competent judges, the machine should be judged human about as often as a real human is. Formally, let \(p_M\) be the probability that an interrogator, after a fixed length of conversation, judges the machine to be human, and let \(p_H\) be the probability that the interrogator judges the actual human to be human. The machine passes when \(p_M\) is statistically indistinguishable from \(p_H\) over a large sample of qualified judges. Read this way, the test is demanding precisely because it compares the machine to a human baseline rather than to a fixed and gameable threshold.
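One way to cash out "statistically indistinguishable" is a standard two-proportion z-test on the judges' verdicts. This is a sketch of one reasonable reading, not a procedure Turing specified; the function name and the verdict counts are invented for illustration.

```python
import math

def two_proportion_z(k_m, n_m, k_h, n_h):
    """z statistic and two-sided p-value for the null hypothesis p_M == p_H.

    k_m of n_m trials judged the machine human;
    k_h of n_h trials judged the real human human.
    """
    p_m, p_h = k_m / n_m, k_h / n_h
    pooled = (k_m + k_h) / (n_m + n_h)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_m + 1 / n_h))
    if se == 0:
        raise ValueError("verdicts are all identical; the test is undefined")
    z = (p_m - p_h) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value

# Invented counts: a machine judged human in a third of trials is easy to tell
# apart from a human baseline of about two thirds.
print(two_proportion_z(330, 1000, 660, 1000))
# A machine close to the baseline is not distinguishable at this sample size.
print(two_proportion_z(640, 1000, 660, 1000))
```

A caution on interpretation: failing to reject the null hypothesis is weaker than demonstrating equivalence. A stricter protocol would fix in advance how small a difference counts as negligible and test for equivalence within that margin.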

3.2.2 2.2 What the Test Gets Right

The test has real virtues. It is medium neutral, judging the system on conversation rather than on appearance or substrate, and so it forecloses prejudice based on a machine being made of metal rather than carbon. It also sets a demanding bar, because open ended conversation draws on reasoning, world knowledge, humor, and the ability to handle the unexpected. Turing anticipated many objections, including the claim that machines could never be creative or original, and he answered them with arguments that still read freshly. His deeper point was epistemological. We attribute minds to other people on exactly the same behavioral basis, since we have no direct access to anyone else’s inner life, so to demand more of a machine than we demand of a fellow human is to apply a double standard.

3.2.3 2.3 Critiques of the Test

The test has nonetheless drawn sustained criticism, which falls into three broad families.

It tests deception rather than intelligence. A system might pass by exploiting human gullibility, deflecting hard questions, or imitating the typing errors and evasions of a person, none of which require genuine understanding. The 2014 episode in which a chatbot posing as a thirteen-year-old Ukrainian boy reportedly fooled a third of judges illustrated how a low bar and a clever persona can substitute for substance. The lesson is that the test is only as strong as its judges and its protocol, and that adversarial judging matters: an interrogator who probes for compositional reasoning and consistency over a long exchange is far harder to fool than one making small talk.

It is anthropocentric. The test treats human conversation as the gold standard and thereby risks missing forms of intelligence that are real but nonhuman, while also rewarding a system for hiding capabilities a human lacks, such as instant arithmetic. A superhuman system, paradoxically, must dumb itself down to pass.

It is behavioral, and behavior may underdetermine understanding. Passing the test shows competence at producing the right outputs without showing that anything inside the system understands those outputs. This is the blockhead worry of Section 1.3 in conversational form, and it is the thread that the next sections pull.

Modern language models sharpen all three worries, because they are explicitly trained on human text and can produce fluent conversation while their internal grasp of meaning is precisely what is in dispute. Partly for these reasons the field has largely moved from the Turing Test toward targeted probes designed to be hard to pass by surface mimicry, such as the Winograd Schema Challenge, which uses sentences whose correct interpretation hinges on world knowledge and cannot be settled by word statistics alone (Levesque, Davis, and Morgenstern, 2012).
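The design idea behind a Winograd schema can be shown in a few lines. The sentence pair below is a widely quoted example from the Winograd Schema literature; the resolver `nearest_noun` is a deliberately naive heuristic invented here to show why position-based shortcuts cannot beat chance on a schema pair.

```python
# One schema pair: the two sentences differ in a single word, and that word
# flips which noun the pronoun "it" refers to.
TEMPLATE = "The trophy does not fit in the brown suitcase because it is too {word}."
CANDIDATES = ("trophy", "suitcase")
ANSWERS = {"big": "trophy", "small": "suitcase"}

def nearest_noun(sentence: str) -> str:
    """Surface heuristic: pick the candidate mentioned last before the pronoun."""
    before_pronoun = sentence.split(" it ")[0]
    return max(CANDIDATES, key=before_pronoun.rfind)

def score(resolver) -> float:
    """Fraction of the pair that a resolver gets right."""
    hits = sum(resolver(TEMPLATE.format(word=w)) == answer
               for w, answer in ANSWERS.items())
    return hits / len(ANSWERS)

print(score(nearest_noun))
```

Because the two sentences are identical up to the pronoun, any resolver that ignores the final adjective gives the same answer to both and is therefore right on exactly one of them. Doing better requires knowing something about trophies, suitcases, and fitting.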

3.3 3. Searle’s Chinese Room

3.3.1 3.1 The Argument

In 1980 John Searle introduced a thought experiment designed to show that passing a behavioral test, even a perfect one, does not establish understanding (Searle, 1980). Imagine Searle locked in a room. Speakers of Chinese pass written questions under the door. Searle, who knows no Chinese, consults a vast rulebook written in English that tells him, purely in terms of the shapes of the symbols, which Chinese symbols to write in response. By following the rules he produces answers indistinguishable from those of a native speaker. To those outside, the room appears to understand Chinese. Yet Searle, the only one who understands anything in the room, understands not a word of Chinese. He is merely manipulating symbols by their form.

3.3.2 3.2 The Argument Made Precise

Stripped to its logical skeleton, the Chinese Room is a short deductive argument. State it as three premises and a conclusion.

1. Programs are formal (syntactic). A program is defined entirely by rules that operate on the shapes of symbols, without reference to what those symbols mean.
2. Minds have mental contents (semantics). To understand is to grasp meaning, to have states that are about things in the world.
3. Syntax is neither constitutive of nor sufficient for semantics. Manipulating symbols by their form never, by itself, gives those symbols meaning. The room is the existence proof: it runs the syntax perfectly yet no understanding of Chinese appears anywhere in it.

Conclusion. Therefore, running a program is not sufficient for, and does not constitute, understanding.

The validity of the argument is not really in question; if the premises hold, the conclusion follows. The entire debate is about premise 3, the claim that syntax can never yield semantics. Every major reply to Searle is, in the end, a way of resisting premise 3, either by relocating where the semantics is supposed to come from (the systems and robot replies) or by denying that the relevant program is merely syntactic in the impoverished sense the argument assumes (the brain simulator reply).

3.3.3 3.3 The Target

Searle’s target is what he called strong AI, the thesis that a suitably programmed computer would thereby have a mind and genuinely understand. His claim is that a digital computer is exactly like the person in the room: it manipulates symbols according to formal rules (its program) without any access to what those symbols mean. The argument is meant to apply no matter how sophisticated the program, which is why it bears directly on systems far more capable than anything that existed in 1980. Notice what the argument does not claim. It does not say machines cannot think; Searle held that brains are machines and that some artifact might one day think if it reproduced the relevant causal powers of biological brains. It says only that running the right program is not, by itself, enough.

3.4 4. Strong and Weak AI, Functionalism, and the Computational Theory of Mind

3.4.1 4.1 The Strong and Weak Distinction

Searle drew a distinction that organizes much of the field. Weak AI treats the computer as a powerful tool for studying the mind and for performing tasks that would require intelligence if a human did them, without any claim that the computer itself has a mind. Strong AI claims that the right program is a mind, that mental states just are computational states of the appropriate kind. Almost no one disputes weak AI. The philosophical action is entirely about the strong claim, and it is the strong claim that Searle attacks.

3.4.2 4.2 Functionalism

The theoretical backbone of strong AI is functionalism, the dominant position in the philosophy of mind through the late twentieth century, given its canonical statement by Hilary Putnam (Putnam, 1967). Functionalism holds that mental states are defined not by what they are made of but by their causal role, by how they relate to sensory inputs, to other mental states, and to behavioral outputs. Pain, on this view, is whatever state is typically caused by bodily damage and typically causes avoidance and complaint, regardless of whether it is realized in neurons or in silicon. This thesis of multiple realizability is attractive because it explains how creatures with very different brains could share mental states, and it opens the door to minds implemented in hardware utterly unlike our own.

The functionalist’s strongest move against Searle is to insist that the relevant level of description is the causal organization of the whole computation, not the lonely symbol shuffling of any one component. On this view Searle has simply pointed at the wrong thing inside the room.

3.4.3 4.3 The Computational Theory of Mind

The computational theory of mind takes functionalism a step further by specifying the relevant functional organization as computation. On this view, thinking is information processing, the rule governed transformation of internal representations, and the mind stands to the brain roughly as software stands to hardware. If that picture is correct, then a computer running the right program would not merely simulate thought but instantiate it, because thought just is that kind of computation.

The Chinese Room is precisely an attack on this inference. Searle grants that the room (or the computer) carries out the right computation, and insists that understanding still fails to appear, so computation cannot be sufficient for mind. The functionalist and the computationalist must therefore either deny that understanding fails to appear (the systems reply says it appears at the level of the whole) or deny that the room really implements the same functional organization as a Chinese speaker (the brain simulator reply says a genuinely mind realizing program is far richer than a lookup of symbol shapes). This is the precise point on which the field divides, and it is worth keeping in view as we turn to the replies.

3.5 5. Symbol Grounding

3.5.1 5.1 The Problem

Closely related to Searle’s worry is the symbol grounding problem, articulated by Stevan Harnad in 1990 (Harnad, 1990). The symbols inside a classical AI system are meaningful only to us, the interpreters who read them. To the system they are bare tokens, defined entirely by their relations to other equally meaningless tokens. Harnad’s image is of trying to learn Chinese from a Chinese-to-Chinese dictionary alone: every definition sends you to more symbols, and you never break out of the circle into the world. How, then, could a symbol-manipulating system ever connect its symbols to the things they are supposed to be about?

The problem can be stated as a regress. Suppose the meaning of each symbol is given by its definition in terms of other symbols. Then to understand any symbol you must already understand the symbols in its definition, and to understand those you must understand the symbols in theirs, and so on without end. Either the regress is infinite, in which case nothing is ever understood, or it terminates in symbols whose meaning is fixed by something other than further symbols. Harnad’s claim is that the terminating anchor must be non symbolic, namely a direct causal link between certain symbols and the perceptual categories the system forms from its sensory contact with the world.
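The regress can be pictured as a question about a graph of definitions: does every chain of definitions starting from a symbol eventually reach a symbol anchored outside the dictionary? The sketch below is a minimal illustration. The function `is_grounded`, the toy dictionaries, and the choice of anchors are all hypothetical, and a set of "anchored" strings only stands in for the perceptual link Harnad has in mind.

```python
def is_grounded(symbol, definitions, anchors, visiting=frozenset()):
    """True if every definitional chain from `symbol` ends in an anchored symbol.

    definitions: dict mapping a symbol to the symbols used to define it
    anchors:     symbols whose meaning is fixed by something non-symbolic
    """
    if symbol in anchors:
        return True
    if symbol in visiting or not definitions.get(symbol):
        return False  # circular, undefined, or defined by nothing
    visiting = visiting | {symbol}
    return all(is_grounded(s, definitions, anchors, visiting)
               for s in definitions[symbol])

# A dictionary-only system: every definition leads to more symbols.
circular = {"a": ["b"], "b": ["c"], "c": ["a"]}
print(is_grounded("a", circular, anchors=set()))

# A system with a perceptually anchored base from which other symbols are built.
built_up = {"zebra": ["horse", "stripes"]}
print(is_grounded("zebra", built_up, anchors={"horse", "stripes"}))
```

With no anchors, nothing in the circular dictionary comes out grounded, however many entries it has. Once some base symbols are anchored, symbols defined in terms of them inherit a route out of the circle.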

3.5.2 5.2 Proposed Solutions and Their Limits

Harnad’s suggested remedy was to ground at least some symbols in the system’s sensory interactions with the world, so that the token for “horse” is tied to the system’s own perceptual capacity to detect horses, with other symbols built up from this grounded base. This line of thought motivates robotics and embodied approaches to AI, which hold that genuine understanding requires a body that perceives and acts, giving symbols a causal anchor in the environment.

The grounding problem presses hard on systems trained only on text. A language model learns from a vast corpus of symbols and their statistical relations, which is exactly the dictionary go round Harnad warned about, and this is one reason critics doubt that such models understand the words they so fluently arrange. Two qualifications complicate the verdict, however, and a careful reader should hold both in mind. First, the text a model ingests is not arbitrary noise: it was produced by grounded humans and carries the imprint of a world, so the statistical structure of language is itself a low resolution shadow of the world’s structure, which a model can partially reconstruct. Whether reconstructing that shadow counts as grounding or merely as a more elaborate dictionary is exactly the contested question. Second, multimodal models trained jointly on images, audio, and text, and embodied agents trained through interaction, change the picture by giving some symbols a perceptual anchor of the kind Harnad demanded, which is why the grounding objection is strongest against pure text models and weakest against richly multimodal embodied ones.

3.6 6. Consciousness and Whether It Matters

3.6.1 6.1 Two Questions, Not One

It is essential to separate two questions that are easily conflated. The first is whether a machine can be intelligent, can solve problems, reason, and use language. The second is whether a machine can be conscious, can have subjective experience, such that there is something it is like to be that machine. These come apart in principle. A system might be highly intelligent with no inner experience whatsoever (a so called philosophical zombie), and conversely a creature might have rich experience with modest intelligence. Much confusion in public discussion comes from sliding between the two, for example by treating fluent and emotionally apt language as evidence of feeling, when it is at most evidence of the capacity to produce such language.

3.6.2 6.2 The Hard Problem

David Chalmers famously distinguished the easy problems of consciousness, which concern explaining cognitive functions such as discrimination, integration, and reportability, from the hard problem, which is explaining why any of this functioning is accompanied by subjective experience at all (Chalmers, 1995). The easy problems are easy only by comparison: they are the kind of thing a complete cognitive science could in principle solve. The hard problem is hard because even a complete functional account seems to leave open why there is felt experience rather than mere processing in the dark. For AI, the hard problem implies that even a system that perfectly replicated human cognitive function might still face an open question about whether it feels anything, and that no behavioral or even mechanistic test could obviously close that question.

3.6.3 6.3 Does It Matter for AI?

Whether consciousness matters depends on what we want from AI. For most practical and scientific purposes, intelligence is what we are after, and a system that reasons and acts effectively serves us whether or not it has experience. Consciousness becomes central, however, for moral status. If a system can suffer, then how we treat it raises ethical questions, and the difficulty of detecting consciousness from the outside (the same difficulty the Turing Test cannot resolve) means we may face hard moral uncertainty under conditions where the cost of error is high in both directions. For the narrow question of whether a machine understands, many philosophers hold that understanding is a cognitive achievement that need not require consciousness, though Searle himself ties intentionality closely to the biological character of brains. Keeping the intelligence question and the consciousness question on separate tracks is the single most useful discipline a practitioner can adopt when reading sensational claims about AI.

3.7 7. The Systems Reply and Other Responses to the Chinese Room

3.7.1 7.1 The Systems Reply

The most influential response to Searle is the systems reply. It concedes that the person in the room does not understand Chinese but denies that this is the relevant point. Understanding, the reply says, is a property of the whole system, the person together with the rulebook, the scratch paper, the symbols, and the procedures, not of the person alone. Searle is merely the central processing unit of a larger system, and there is no reason to expect a component to possess the understanding of the whole, any more than a single neuron understands English. In the vocabulary of Section 4, the systems reply says Searle has located the computation at the wrong level: the mind, if there is one, is realized by the organization of the entire process, not by the part that happens to shuffle the symbols.

3.7.2 7.2 Searle’s Rejoinder and the Counter

Searle’s reply is to internalize the system. Let the person memorize the entire rulebook and do all the computation in his head, dispensing with the room and the paper entirely. Now the person is the whole system, and he still understands no Chinese, so where is the understanding supposed to reside? Critics counter that internalization smuggles in an intuition pump: it asks us to imagine something cognitively impossible (memorizing and executing a program vast enough to sustain fluent conversation) and then trusts our untrained intuition about that impossible scenario. The defender of the systems reply argues that the person who has internalized the program would be implementing a second cognitive system, distinct from his ordinary self, and that this second system might understand Chinese even though the host person does not, just as a single brain can in unusual cases sustain two streams of awareness. The disagreement here is genuine and unresolved, and recognizing that it turns on the reliability of intuitions about humanly impossible scenarios is itself progress.

3.7.3 7.3 The Robot and Other Replies

Other replies push in different directions. The robot reply grants Searle’s point about a disembodied program but argues that a system embedded in a robot that perceives and acts would have its symbols grounded in the world, addressing the very objection Harnad later formalized. The brain simulator reply imagines a program that simulates the exact firing of a Chinese speaker’s neurons and asks how Searle could deny understanding to that without also denying it to the original brain. Searle resists each, replying to the robot that adding sensors merely adds more uninterpreted symbols, and to the brain simulator that simulating the causal structure of a process is not the same as reproducing its causal powers, just as a simulation of fire does not burn. Whether that last analogy holds for minds, or whether minds are precisely the kind of thing that a sufficiently detailed simulation would reproduce, is one of the deepest open questions in the area. The proliferation of replies shows that the thought experiment, however vivid, does not command universal assent. What it does accomplish is to make undeniable the gap between behaving as if one understands and understanding, which is exactly the gap at issue in contemporary debates.

3.7.4 7.4 A Map of the Debate

The arguments in this chapter form a connected structure rather than a list. The behavioral tradition, anchored by Turing, holds that the right outputs settle the question. The internalist challenge, anchored by Searle, holds that the manner of operation is what matters. Functionalism and the computational theory of mind supply the positive account that the internalist attacks, and the various replies attempt to repair that account. Symbol grounding and consciousness then add two further dimensions that cut across the behavioral and internalist camps alike.

flowchart TD
    Q["Can machines think and understand"]
    Q --> BEH["Behavioral view: right outputs suffice (Turing)"]
    Q --> INT["Internalist view: the process matters (Searle)"]
    BEH --> FUNC["Functionalism and computational theory of mind: mind is the right causal or computational organization"]
    INT --> CR["Chinese Room: syntax is not sufficient for semantics"]
    CR --> SYS["Systems reply: the whole system understands"]
    CR --> ROB["Robot reply: embodiment grounds the symbols"]
    CR --> BRAIN["Brain simulator reply: reproduce the neural causation"]
    FUNC --> GROUND["Symbol grounding: how do symbols connect to the world"]
    ROB --> GROUND
    Q --> CONS["Consciousness: is there subjective experience"]
    CONS --> HARD["Hard problem: function may not explain felt experience"]

3.8 8. How These Debates Inform Modern LLM Discussions

3.8.1 8.1 Stochastic Parrots

The phrase “stochastic parrots,” from a 2021 paper by Emily Bender, Timnit Gebru, and colleagues, crystallized the skeptical position for the language model era (Bender et al., 2021). The argument is that a model trained to predict the next token learns the statistical distribution of word forms without any access to meaning or communicative intent, and so it stitches together plausible sequences without understanding them, much as a parrot repeats sounds. This is the symbol grounding problem and the Chinese Room recast for systems trained on text. The model, on this view, manipulates form (syntax) with no grip on content (semantics), and its fluency is precisely what makes the absence of understanding hard to notice.

3.8.2 8.2 Understanding Versus Prediction

The opposing view holds that the dichotomy between prediction and understanding is too crude. Predicting the next token well across a sufficiently rich corpus may require the model to build internal structure that functions like understanding: representations of entities, relations, and even the state of an unfolding situation. The argument has a clean intuition behind it. To assign high probability to the continuation of a paragraph that says “the keys are in the drawer; she opened it and took them out,” a predictor must implicitly track that “it” is the drawer, that “them” is the keys, that the keys were inside, and that once they are taken out they are no longer in the drawer. Doing this reliably across endlessly varied text is hard to achieve by surface statistics alone and easier to achieve by maintaining a model of the described situation.

Empirical work probing the internals of trained models gives this intuition some teeth. A controlled and frequently cited example trained a transformer purely to predict the next move in the board game Othello, with no built-in notion of a board, and then found, by training probes on its activations, that the network had developed an internal representation of the current board state, and that intervening on that representation changed the model’s predictions in the way one would expect if it were genuinely steering by the board (Li et al., 2023). This does not prove that language models understand in any full sense, but it does undercut the strong claim that next-token prediction can only ever yield surface form, since here prediction pressure produced a structured world model as a byproduct. A functionalist will say that if a system reliably exhibits the right input-output relations and the right internal information processing, then withholding the word “understanding” begins to look like substrate prejudice, the very prejudice Turing warned against. A Searlean will reply that no amount of internal structure converts syntax into semantics, and that grounding through training on text alone remains a dictionary go-round. The careful position notes that “understanding” is not all-or-nothing, and that a system can possess genuine partial competence, a real internal model of some domains, while lacking the grounded, consciously available understanding humans enjoy.
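The probing method mentioned above can be sketched on synthetic data. Everything in the block is an assumption made for illustration: the "activations" are generated by hand so that a hidden binary state is written along one direction, the probe is a plain perceptron, and none of this reproduces the data, model, or probe architecture of the Othello study.

```python
import random

DIRECTION = [1.0, -1.0, 0.5, 0.0]  # hypothetical direction encoding the state

def make_activations(n, rng):
    """Synthetic activations: state in {-1, +1} written along DIRECTION plus noise."""
    data = []
    for _ in range(n):
        state = rng.choice([-1, 1])  # e.g. a square is empty vs. occupied
        vec = [state * d + rng.uniform(-0.2, 0.2) for d in DIRECTION]
        data.append((vec, state))
    return data

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def train_probe(data, epochs=10):
    """Perceptron probe: a single linear read-out trained on (activation, state)."""
    w = [0.0] * len(data[0][0])
    for _ in range(epochs):
        for vec, state in data:
            if dot(w, vec) * state <= 0:  # misclassified, so nudge the weights
                w = [wi + state * xi for wi, xi in zip(w, vec)]
    return w

def accuracy(w, data):
    return sum((dot(w, vec) > 0) == (state > 0) for vec, state in data) / len(data)

rng = random.Random(0)
train, held_out = make_activations(200, rng), make_activations(200, rng)
probe = train_probe(train)
print(accuracy(probe, held_out))
```

High held-out accuracy shows only that the state is decodable from the activations. The stronger evidence in the study described above came from intervention: editing the represented state and observing that the model's predictions changed accordingly. A careful probing study also includes controls, such as probing an untrained network, to rule out the possibility that the probe itself is doing the work.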

3.8.3 8.3 A Worked Example: Diagnosing a Disputed Claim

Suppose a colleague asserts, after a striking demonstration, that a particular model “truly understands physics.” Rather than agreeing or scoffing, use the apparatus of this chapter to take the claim apart.

First, fix the definition. Is the claim operational (the model answers physics questions at expert level), or is it a real definition claim (the model has genuine semantic grasp of physical concepts)? These need different evidence, and conflating them is the most common error.

Second, separate the questions. The claim is about understanding, not consciousness; nothing about answering physics questions bears on whether the model feels anything, so set the consciousness question aside rather than letting it contaminate the discussion.

Third, identify the camp and its evidence standard. A behavioral defender will point to benchmark performance. A skeptic will invoke stochastic parrots and ask whether the competence generalizes or merely interpolates the training distribution. An internalist will ask what the model’s representations encode.

Fourth, propose a discriminating test, one whose result the two camps predict differently. Memorized competence and genuine modeling diverge on novel, compositional, out-of-distribution problems: give the model a physically coherent scenario phrased unlike anything in its training data and see whether it tracks the consequences correctly, and whether probing or intervening on its internals reveals a state variable that behaves like the physical quantity in question, as in the Othello case. If competence collapses off distribution and no such internal structure is found, the parrot reading gains support; if competence is robust and structured representations are present, the modeling reading gains support.
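
The behavioral half of this test can be sketched with a deliberately tiny example. Everything here is an assumption made for illustration: the task (distance fallen from rest, with no drag), the two stand-in systems (`memorizer`, a nearest-neighbor lookup over seen examples, and `modeler`, which fits a single coefficient in an assumed quadratic form), and the 5 percent tolerance. Neither stand-in resembles a language model; the point is only to show how in-distribution and out-of-distribution scores come apart for a system that stores answers and one that has captured the governing relation.

```python
G = 9.81  # gravitational acceleration in m/s^2


def fall_distance(t):
    """Ground truth: distance fallen from rest after t seconds, no drag."""
    return 0.5 * G * t * t


# "Training distribution": fall times between 1 s and 5 s on a fine grid.
train_times = [i / 100 for i in range(100, 501)]
train_pairs = [(t, fall_distance(t)) for t in train_times]


def memorizer(t):
    """Parrot-like stand-in: return the stored answer for the nearest seen input."""
    _, nearest_distance = min(train_pairs, key=lambda pair: abs(pair[0] - t))
    return nearest_distance


# Modeling stand-in: assumes the form d = a * t^2 and fits a by least squares.
fitted_a = (sum(d * t * t for t, d in train_pairs)
            / sum(t ** 4 for t, _ in train_pairs))


def modeler(t):
    return fitted_a * t * t


def accuracy(system, times, tolerance=0.05):
    """Fraction of answers within a relative tolerance of the truth."""
    hits = sum(abs(system(t) - fall_distance(t)) <= tolerance * fall_distance(t)
               for t in times)
    return hits / len(times)


in_distribution = [1.005 + 0.4 * k for k in range(10)]  # inside 1-5 s, off the grid
out_of_distribution = [10.0 + k for k in range(10)]     # 10-19 s, never seen

results = {
    name: (accuracy(system, in_distribution),
           accuracy(system, out_of_distribution))
    for name, system in [("memorizer", memorizer), ("modeler", modeler)]
}

for name, (in_acc, ood_acc) in results.items():
    print(f"{name:10s} in-distribution={in_acc:.1f} out-of-distribution={ood_acc:.1f}")
```

In this toy setup both systems score 1.0 in distribution, so a benchmark drawn from the training range cannot tell them apart; only the out-of-distribution probe separates them (0.0 for the memorizer, 1.0 for the modeler). The comparison is rigged in one respect worth keeping in mind: the modeler was handed the correct functional form, which is exactly what is in dispute for a real system.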

This procedure will rarely deliver a final verdict, because the conceptual question of what understanding ultimately requires remains open. What it delivers instead is the discipline to say exactly what is being claimed, what evidence bears on it, and where reasonable people still disagree, which is the most a practitioner can honestly offer.

3.8.4 8.4 Why the Old Arguments Still Bind

It is no accident that these look like old arguments in new clothes. The language model debate inherits the exact structure of the older one (Mitchell and Krakauer 2023). The behavioral camp points to performance and asks what more could be wanted. The internalist camp points to the manner of operation and insists that performance is not the point. The grounding camp asks how text-trained symbols could be about anything. Progress on the empirical questions, such as what representations models actually form and whether multimodal grounding changes the picture, is real and ongoing. But the conceptual questions, what understanding is, whether it requires consciousness, and whether the right functional organization suffices for mind, remain genuinely open. A responsible practitioner should therefore resist both the temptation to declare these systems minds and the temptation to dismiss them as mere parrots, and should instead hold the distinctions this chapter has drawn clearly in view.

3.9 9. When to Use These Concepts, and Common Pitfalls

This chapter is conceptual, but its payoff is practical. The following short guide turns its distinctions into working habits.

Reach for these concepts when you must evaluate a claim that an AI system does or does not understand, is or is not intelligent, or is or is not conscious; when you design an evaluation and must choose between behavioral benchmarks and mechanistic probes; or when you communicate about system capabilities to non-specialists who may over-read fluency as understanding.

Common pitfalls to avoid:

Conflating intelligence with consciousness. Fluent, empathetic-sounding output is evidence about language production, not about inner experience. Keep the two questions on separate tracks.
Trusting a single number. A scalar score hides the capability profile of Section 1.2. Always ask which capacities a benchmark exercises and which it ignores.
Mistaking a verbal dispute for a factual one. When two people disagree about whether a system understands, first ask whether they are using “understand” in the operational or the real sense. Often the disagreement dissolves.
Over-reading benchmark success. In-distribution success can reflect interpolation rather than genuine modeling. Out-of-distribution, compositional probes are far more diagnostic, and where feasible, inspecting internal representations adds independent evidence.
Treating thought experiments as proofs. The Chinese Room and the blockhead sharpen intuitions; they do not settle the empirical facts about any actual system. Use them to clarify what is at stake, not to foreclose inquiry.
Anthropocentrism. Demanding that a system match human conversation can both flatter mimicry and obscure genuinely nonhuman forms of competence.
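
The "single number" pitfall above is easy to see with made-up figures. The capacity names and scores below are hypothetical, invented purely for illustration, and the unweighted mean is only one of many ways a benchmark might aggregate. The sketch shows two systems whose scalar scores are identical although their capability profiles differ sharply.

```python
# Hypothetical per-capacity scores (percent correct); invented for illustration.
profiles = {
    "system_a": {"recall": 90, "arithmetic": 90, "planning": 30, "novel_composition": 30},
    "system_b": {"recall": 60, "arithmetic": 60, "planning": 60, "novel_composition": 60},
}


def scalar_score(profile):
    """Unweighted mean across capacities: the single headline number."""
    return sum(profile.values()) / len(profile)


def spread(profile):
    """Gap between the strongest and weakest capacity."""
    return max(profile.values()) - min(profile.values())


summary = {name: (scalar_score(p), spread(p)) for name, p in profiles.items()}

for name, (score, gap) in summary.items():
    print(f"{name}: headline score={score:.1f}, best-to-worst gap={gap}")
```

Both systems report a headline score of 60.0, yet one has a 60-point gap between its strongest and weakest capacity and the other has none. Reporting the profile alongside the mean keeps that difference visible.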

For tooling, the empirical side of these questions is increasingly tractable with mature, free, open-source instruments. Interpretability libraries built on PyTorch, such as the widely used TransformerLens and the Captum attribution library, let practitioners train probes on model activations and run intervention experiments of the kind described in Section 8.2, turning some philosophical questions into testable ones. Open evaluation harnesses, such as the community-maintained lm-evaluation-harness, support the out-of-distribution and compositional testing recommended above. None of these tools answers the conceptual questions, but they are how a careful practitioner gathers the evidence that the conceptual questions tell us to look for.

3.10 10. Conclusion

The philosophy of machine intelligence supplies the questions that capability benchmarks cannot answer. Intelligence resists definition because it bundles many capacities and because we disagree about whether behavior or its underlying process is what counts. The Turing Test made the question operational and behavioral, Searle’s Chinese Room challenged the sufficiency of behavior and computation for understanding, and functionalism and the computational theory of mind supplied the framework Searle attacked. Symbol grounding asks how any of these systems could connect to the world, consciousness raises a further question that intelligence alone does not settle, and the systems reply keeps the central dispute alive. When you read a claim that a language model does or does not understand, you are reading a move in this long argument. Knowing the moves will not tell you who is right, but it will let you see clearly what is being claimed, what would count as evidence, and where reasonable people still disagree.

3.11 References

Abate, Pietro, Roberto Di Cosmo, Ralf Treinen, and Stefano Zacchiroli. 2015. “A Modular Package Manager Architecture.” Information and Software Technology 62: 179–92. https://doi.org/10.1016/j.infsof.2015.02.002.

Agüera y Arcas, Blaise, Margaret Mitchell, and Alexander Todorov. 2017. “Physiognomy’s New Clothes.” Medium. https://medium.com/@blaisea/physiognomys-new-clothes-f2d4b59fdd6a.

Ali, Mehdi, Max Berrendorf, Charles Tapley Hoyt, Laurent Vermue, Sahand Sharifzadeh, Volker Tresp, and Jens Lehmann. 2021. “PyKEEN 1.0: A Python Library for Training and Evaluating Knowledge Graph Embeddings.” Journal of Machine Learning Research 22 (82): 1–6.

Barnard, John, and Donald B Rubin. 1999. “Miscellanea. Small-Sample Degrees of Freedom with Multiple Imputation.” Biometrika 86 (4): 948–55.

Bassani, Elias. 2022. “Ranx: A Blazing-Fast Python Library for Ranking Evaluation and Comparison.” In European Conference on Information Retrieval, 259–64. Springer.

Belkin, Mikhail, Daniel Hsu, Siyuan Ma, and Soumik Mandal. 2019. “Reconciling Modern Machine-Learning Practice and the Classical Bias–Variance Trade-Off.” Proceedings of the National Academy of Sciences 116 (32): 15849–54.

Benavoli, Alessio, Giorgio Corani, Janez Demšar, and Marco Zaffalon. 2017. “Time for a Change: A Tutorial for Comparing Multiple Classifiers Through Bayesian Analysis.” Journal of Machine Learning Research 18 (77): 1–36. https://www.jmlr.org/papers/v18/16-305.html.

Bender, Emily M., Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?” In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (FAccT ’21), 610–23. Association for Computing Machinery. https://doi.org/10.1145/3442188.3445922.

Bengio, Yoshua, Aaron Courville, and Pascal Vincent. 2013. “Representation Learning: A Review and New Perspectives.” IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (8): 1798–828.

Block, Ned. 1981. “Psychologism and Behaviorism.” The Philosophical Review 90 (1): 5–43. https://doi.org/10.2307/2184371.

Bodner, Todd E. 2008. “What Improves with Increased Missing Data Imputations?” Structural Equation Modeling: A Multidisciplinary Journal 15 (4): 651–75.

Bottou, Léon, Frank E Curtis, and Jorge Nocedal. 2018. “Optimization Methods for Large-Scale Machine Learning.” SIAM Review 60 (2): 223–311.

Bouthillier, Xavier, Pierre Delaunay, Mirko Bronzi, Assya Trofimov, Brennan Nichyporuk, Justin Szeto, Naz Mohammadi Sepahvand, et al. 2021. “Accounting for Variance in Machine Learning Benchmarks.” In Proceedings of Machine Learning and Systems (MLSys), 3:747–69.

Buonaccorsi, John P. 2010. Measurement Error: Models, Methods, and Applications. Chapman & Hall/CRC Interdisciplinary Statistics. Boca Raton, FL: Chapman & Hall/CRC. https://doi.org/10.1201/9781420066586.

Buslaev, Alexander, Vladimir I. Iglovikov, Eugene Khvedchenya, Alex Parinov, Mikhail Druzhinin, and Alexandr A. Kalinin. 2020. “Albumentations: Fast and Flexible Image Augmentations.” Information 11 (2): 125. https://doi.org/10.3390/info11020125.

Buuren, Stef van, Jaap P. L. Brand, Catharina G. M. Groothuis-Oudshoorn, and Donald B. Rubin. 2006. “Fully Conditional Specification in Multivariate Imputation.” Journal of Statistical Computation and Simulation 76 (12): 1049–64. https://doi.org/10.1080/10629360600810434.

Ceron, Tanise, Neele Falk, Ana Barić, Dmitry Nikolaev, and Sebastian Padó. 2024. “Beyond Prompt Brittleness: Evaluating the Reliability and Consistency of Political Worldviews in LLMs.” arXiv Preprint arXiv:2402.17649. https://arxiv.org/abs/2402.17649.

Cerqueira, Vitor, Luis Torgo, and Igor Mozetič. 2020. “Evaluating Time Series Forecasting Models: An Empirical Study on Performance Estimation Methods.” Machine Learning 109 (11): 1997–2028.

Chalmers, David J. 1995. “Facing up to the Problem of Consciousness.” Journal of Consciousness Studies 2 (3): 200–219. https://consc.net/papers/facing.html.

Chapelle, Olivier, Jason Weston, Léon Bottou, and Vladimir Vapnik. 2000. “Vicinal Risk Minimization.” In Advances in Neural Information Processing Systems 13 (NeurIPS 2000), edited by T. Leen, T. Dietterich, and V. Tresp, 416–22. MIT Press. https://proceedings.neurips.cc/paper/2000/hash/ba9a56ce0a9bfa26e8ed9e10b2cc8f46-Abstract.html.

Chen, Banghao, Zhaofeng Zhang, Nicolas Langrené, and Shengxin Zhu. 2025. “Unleashing the Potential of Prompt Engineering for Large Language Models.” arXiv Preprint arXiv:2310.14735. https://arxiv.org/abs/2310.14735.

Chen, Ting, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. “A Simple Framework for Contrastive Learning of Visual Representations.” In International Conference on Machine Learning, 1597–607. PMLR.

Christen, Peter. 2012. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Data-Centric Systems and Applications. Springer. https://doi.org/10.1007/978-3-642-31164-2.

Cook, Stephen A. 1971. “The Complexity of Theorem-Proving Procedures.” In Proceedings of the Third Annual ACM Symposium on Theory of Computing, 151–58. ACM. https://doi.org/10.1145/800157.805047.

Council, National Research, Committee on National Statistics, and Panel on Handling Missing Data in Clinical Trials. 2011. “The Prevention and Treatment of Missing Data in Clinical Trials.”

Dempster, Arthur P, Nan M Laird, and Donald B Rubin. 1977. “Maximum Likelihood from Incomplete Data via the EM Algorithm.” Journal of the Royal Statistical Society: Series B (Methodological) 39 (1): 1–22.

Demšar, Janez. 2006. “Statistical Comparisons of Classifiers over Multiple Data Sets.” Journal of Machine Learning Research 7: 1–30. https://www.jmlr.org/papers/v7/demsar06a.html.

Detlefsen, Nicki Skafte, Jiri Borovec, Justus Schock, Ananya Harsh Jha, Teddy Koker, Luca Di Liello, Daniel Stancl, Changsheng Quan, Maxim Grechkin, and William Falcon. 2022. “TorchMetrics - Measuring Reproducibility in PyTorch.” Journal of Open Source Software 7 (70): 4101.

Ding, Peng, and Fan Li. 2018. “Causal Inference: A Missing Data Perspective.” Statistical Science 33 (2): 214–37. https://doi.org/10.1214/18-STS645.

Ethayarajh, Kawin. 2019. “How Contextual Are Contextualized Word Representations? Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings.” arXiv Preprint arXiv:1909.00512.

Farrell, Joseph, and Garth Saloner. 1985. “Standardization, Compatibility, and Innovation.” The RAND Journal of Economics 16 (1): 70–83. https://doi.org/10.2307/2555589.

Glorot, Xavier, and Yoshua Bengio. 2010. “Understanding the Difficulty of Training Deep Feedforward Neural Networks.” In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS), 249–56.

Gower, John C, and Garmt B Dijksterhuis. 2004. Procrustes Problems. Vol. 30. Oxford University Press.

Graham, John W, Allison E Olchowski, and Tamika D Gilreath. 2007. “How Many Imputations Are Really Needed? Some Practical Clarifications of Multiple Imputation Theory.” Prevention Science 8 (3): 206–13.

Grimmett, Geoffrey, and Dominic JA Welsh. 2014. Probability: An Introduction. Oxford University Press.

Guha, Neel, Julian Nyarko, Daniel Ho, Christopher Ré, Adam Chilton, Alex Chohlas-Wood, Austin Peters, et al. 2023. “LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models.” Advances in Neural Information Processing Systems 36: 44123–279.

Gutmann, Michael, and Aapo Hyvärinen. 2010. “Noise-Contrastive Estimation: A New Estimation Principle for Unnormalized Statistical Models.” In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 297–304. JMLR Workshop and Conference Proceedings.

Harnad, Stevan. 1990. “The Symbol Grounding Problem.” Physica D: Nonlinear Phenomena 42 (1–3): 335–46. https://doi.org/10.1016/0167-2789(90)90087-6.

Hazineh, Dean S., Zechen Zhang, and Jeffrey Chiu. 2023. “Linear Latent World Models in Simple Transformers: A Case Study on Othello-GPT.” arXiv Preprint arXiv:2310.07582. https://arxiv.org/abs/2310.07582.

He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016a. “Deep Residual Learning for Image Recognition.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770–78. https://doi.org/10.1109/CVPR.2016.90.

———. 2016b. “Identity Mappings in Deep Residual Networks.” In Computer Vision – ECCV 2016, 630–45. Springer. https://doi.org/10.1007/978-3-319-46493-0_38.

He, Kevin, Ran Shorrer, and Mengjia Xia. 2025. “Human Misperception of Generative-AI Alignment: A Laboratory Experiment.” arXiv Preprint arXiv:2502.14708.

Heckman, James J. 1979. “Sample Selection Bias as a Specification Error.” Econometrica 47 (1): 153–61.

Hedges, Larry V. 1981. “Distribution Theory for Glass’s Estimator of Effect Size and Related Estimators.” Journal of Educational Statistics 6 (2): 107–28. https://doi.org/10.3102/10769986006002107.

Huang, Gao, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. 2017. “Densely Connected Convolutional Networks.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 4700–4708. https://doi.org/10.1109/CVPR.2017.243.

Hughes, Rachael A., Ian R. White, Shaun R. Seaman, James R. Carpenter, Kate Tilling, and Jonathan A. C. Sterne. 2014. “Joint Modelling Rationale for Chained Equations.” BMC Medical Research Methodology 14 (1): 28. https://doi.org/10.1186/1471-2288-14-28.

Ioffe, Sergey, and Christian Szegedy. 2015. “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.” In Proceedings of the 32nd International Conference on Machine Learning (ICML), 448–56.

Ipsen, Niels Bruun, Pierre-Alexandre Mattei, and Jes Frellsen. 2021. “Not-MIWAE: Deep Generative Modelling with Missing Not at Random Data.” In International Conference on Learning Representations (ICLR). https://openreview.net/forum?id=tu29GQT0JFy.

———. 2022. “How to Deal with Missing Data in Supervised Deep Learning?” In International Conference on Learning Representations.

Jing, Li, Pascal Vincent, Yann LeCun, and Yuandong Tian. 2021. “Understanding Dimensional Collapse in Contrastive Self-Supervised Learning.” arXiv Preprint arXiv:2110.09348.

Johnsen, Pål VB, Eivind Bøhn, Sølve Eidnes, Filippo Remonato, and Signe Riemer-Sørensen. 2025. “Recency-Weighted Temporally-Segmented Ensemble for Time Series Modeling.” Journal of Artificial Intelligence Research 84.

Kalton, Graham, and Daniel Kasprzyk. 1986. “The Treatment of Missing Survey Data.” Survey Methodology 12 (1): 1–16.

Katz, Michael L., and Carl Shapiro. 1985. “Network Externalities, Competition, and Compatibility.” The American Economic Review 75 (3): 424–40. https://doi.org/10.2307/1814809.

Kaufman, Shachar, Saharon Rosset, Claudia Perlich, and Ori Stitelman. 2012. “Leakage in Data Mining: Formulation, Detection, and Avoidance.” ACM Transactions on Knowledge Discovery from Data (TKDD) 6 (4): 1–21.

Khashabi, Daniel, Xinxi Lyu, Sewon Min, Lianhui Qin, Kyle Richardson, Sean Welleck, Hannaneh Hajishirzi, et al. 2022. “Prompt Waywardness: The Curious Case of Discretized Interpretation of Continuous Prompts.” In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 3631–43.

Kohavi, Ron, Diane Tang, and Ya Xu. 2020. Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. Cambridge University Press. https://doi.org/10.1017/9781108653985.

Lakens, Daniel. 2013. “Calculating and Reporting Effect Sizes to Facilitate Cumulative Science: A Practical Primer for t-Tests and ANOVAs.” Frontiers in Psychology 4: 863. https://doi.org/10.3389/fpsyg.2013.00863.

Legg, Shane, and Marcus Hutter. 2007. “Universal Intelligence: A Definition of Machine Intelligence.” Minds and Machines 17 (4): 391–444. https://doi.org/10.1007/s11023-007-9079-x.

Levesque, Hector J., Ernest Davis, and Leora Morgenstern. 2012. “The Winograd Schema Challenge.” In Proceedings of the Thirteenth International Conference on Principles of Knowledge Representation and Reasoning (KR 2012), 552–61. AAAI Press. https://cdn.aaai.org/ocs/4492/4492-21843-1-PB.pdf.

Li, Kenneth, Aspen K. Hopkins, David Bau, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. 2023. “Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task.” In The Eleventh International Conference on Learning Representations (ICLR 2023). https://openreview.net/forum?id=DeG07_TcZvT.

Little, Roderick J. A. 1986. “Survey Nonresponse Adjustments for Estimates of Means.” International Statistical Review 54 (2): 139–57. https://doi.org/10.2307/1403140.

———. 1993. “Pattern-Mixture Models for Multivariate Incomplete Data.” Journal of the American Statistical Association 88 (421): 125–34. https://doi.org/10.1080/01621459.1993.10594302.

Little, Roderick JA. 1988. “A Test of Missing Completely at Random for Multivariate Data with Missing Values.” Journal of the American Statistical Association 83 (404): 1198–1202.

Little, Roderick JA, and Donald B Rubin. 2019. Statistical Analysis with Missing Data. John Wiley & Sons.

Liu, Jingchen, Andrew Gelman, Jennifer Hill, Yu-Sung Su, and Jonathan Kropko. 2014. “On the Stationary Distribution of Iterative Imputations.” Biometrika 101 (1): 155–73. https://doi.org/10.1093/biomet/ast044.

Liu, Pengfei, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2023. “Pre-Train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing.” ACM Computing Surveys 55 (9): 1–35. https://doi.org/10.1145/3560815.

Loog, Marco, Tom Viering, Alexander Mey, Jesse H Krijthe, and David MJ Tax. 2020. “A Brief Prehistory of Double Descent.” Proceedings of the National Academy of Sciences 117 (20): 10625–26.

Mallinckrodt, Craig H, Peter W Lane, Dan Schnell, Yahong Peng, and James P Mancuso. 2008. “Recommendations for the Primary Analysis of Continuous Endpoints in Longitudinal Clinical Trials.” Drug Information Journal 42 (4): 303–19.

Mattei, Pierre-Alexandre, and Jes Frellsen. 2019. “MIWAE: Deep Generative Modelling and Imputation of Incomplete Data Sets.” In Proceedings of the 36th International Conference on Machine Learning (ICML), 97:4413–23. Proceedings of Machine Learning Research. PMLR. https://proceedings.mlr.press/v97/mattei19a.html.

Mayer, Imke, Erik Sverdrup, Tobias Gauss, Jean-Denis Moyer, Stefan Wager, and Julie Josse. 2020. “Doubly Robust Treatment Effect Estimation with Missing Attributes.” The Annals of Applied Statistics 14 (3). https://doi.org/10.1214/20-aoas1356.

McGraw, Kenneth O., and S. P. Wong. 1992. “A Common Language Effect Size Statistic.” Psychological Bulletin 111 (2): 361–65. https://doi.org/10.1037/0033-2909.111.2.361.

Meng, Xiao-Li. 1994. “Multiple-Imputation Inferences with Uncongenial Sources of Input.” Statistical Science 9 (4): 538–58. https://doi.org/10.1214/ss/1177010269.

Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. “Distributed Representations of Words and Phrases and Their Compositionality.” Advances in Neural Information Processing Systems 26.

Min, Sewon, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2022. “Rethinking the Role of Demonstrations: What Makes in-Context Learning Work?” arXiv Preprint arXiv:2202.12837.

Mitchell, Melanie, and David C. Krakauer. 2023. “The Debate over Understanding in AI’s Large Language Models.” Proceedings of the National Academy of Sciences 120 (13): e2215907120. https://doi.org/10.1073/pnas.2215907120.

Mohan, Karthika, and Judea Pearl. 2021. “Graphical Models for Processing Missing Data.” Journal of the American Statistical Association 116 (534): 1023–37.

Mohan, Karthika, Judea Pearl, and Jin Tian. 2013. “Graphical Models for Inference with Missing Data.” Advances in Neural Information Processing Systems 26.

Mozes, Maximilian. 2024. “Understanding and Guarding Against Natural Language Adversarial Examples.” PhD thesis, University College London. https://discovery.ucl.ac.uk/id/eprint/10190224/.

Mu, Jiaqi, Suma Bhat, and Pramod Viswanath. 2017. “All-but-the-Top: Simple and Effective Postprocessing for Word Representations.” arXiv Preprint arXiv:1702.01417.

Muzellec, Boris, Julie Josse, Claire Boyer, and Marco Cuturi. 2020. “Missing Data Imputation Using Optimal Transport.” In Proceedings of the 37th International Conference on Machine Learning (ICML), 119:7130–40. Proceedings of Machine Learning Research. PMLR. https://proceedings.mlr.press/v119/muzellec20a.html.

Nanda, Neel, Andrew Lee, and Martin Wattenberg. 2023. “Emergent Linear Representations in World Models of Self-Supervised Sequence Models.” In Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, 16–30. Association for Computational Linguistics. https://aclanthology.org/2023.blackboxnlp-1.2/.

Pearl, Judea. 2009. Causality. Cambridge University Press.

Pedregosa, Fabian, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, et al. 2011. “Scikit-Learn: Machine Learning in Python.” The Journal of Machine Learning Research 12: 2825–30.

Poggio, Tomaso, Gil Kur, and Andrzej Banburski. 2019. “Double Descent in the Condition Number.” arXiv Preprint arXiv:1912.06190.

Putnam, Hilary. 1967. “Psychological Predicates.” In Art, Mind, and Religion, edited by W. H. Capitan and D. D. Merrill, 37–48. University of Pittsburgh Press. https://doi.org/10.2307/j.ctt6wrc73.6.

Radovanovic, Milos, Alexandros Nanopoulos, and Mirjana Ivanovic. 2010. “Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data.” Journal of Machine Learning Research 11: 2487–2531.

Rizopoulos, Dimitris. 2012. Joint Models for Longitudinal and Time-to-Event Data: With Applications in R. Chapman & Hall/CRC Biostatistics Series. Boca Raton, FL: Chapman & Hall/CRC. https://doi.org/10.1201/b12208.

Robins, James M., Andrea Rotnitzky, and Lue Ping Zhao. 1994. “Estimation of Regression Coefficients When Some Regressors Are Not Always Observed.” Journal of the American Statistical Association 89 (427): 846–66. https://doi.org/10.1080/01621459.1994.10476818.

Roediger, Henry L, and Jeffrey D Karpicke. 2006. “Test-Enhanced Learning: Taking Memory Tests Improves Long-Term Retention.” Psychological Science 17 (3): 249–55. https://doi.org/10.1111/j.1467-9280.2006.01693.x.

Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. 2015. “U-Net: Convolutional Networks for Biomedical Image Segmentation.” In Medical Image Computing and Computer-Assisted Intervention (MICCAI 2015), 9351:234–41. Lecture Notes in Computer Science. Springer. https://doi.org/10.1007/978-3-319-24574-4_28.

Rosenbaum, Paul R., and Donald B. Rubin. 1983. “The Central Role of the Propensity Score in Observational Studies for Causal Effects.” Biometrika 70 (1): 41–55. https://doi.org/10.1093/biomet/70.1.41.

Rubin, Donald B. 1976. “Inference and Missing Data.” Biometrika 63 (3): 581–92.

———. 1987. Multiple Imputation for Nonresponse in Surveys. New York: John Wiley & Sons.

Scharfstein, Daniel O., Andrea Rotnitzky, and James M. Robins. 1999. “Adjusting for Nonignorable Drop-Out Using Semiparametric Nonresponse Models.” Journal of the American Statistical Association 94 (448): 1096–1120. https://doi.org/10.1080/01621459.1999.10473862.

Schönemann, Peter H. 1966. “A Generalized Solution of the Orthogonal Procrustes Problem.” Psychometrika 31 (1): 1–10.

Searle, John R. 1980. “Minds, Brains, and Programs.” Behavioral and Brain Sciences 3 (3): 417–57. https://doi.org/10.1017/S0140525X00005756.

Shalev-Shwartz, Shai, and Shai Ben-David. 2014. Understanding Machine Learning: From Theory to Algorithms. Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9781107298019.

Simard, Patrice Y., Yann A. LeCun, John S. Denker, and Bernard Victorri. 1998. “Transformation Invariance in Pattern Recognition: Tangent Distance and Tangent Propagation.” In Neural Networks: Tricks of the Trade, 1524:239–74. Lecture Notes in Computer Science. Springer. https://doi.org/10.1007/3-540-49430-8_13.

Simard, Patrice Y., Dave Steinkraus, and John C. Platt. 2003. “Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis.” In Seventh International Conference on Document Analysis and Recognition (ICDAR 2003), 958–63. IEEE. https://doi.org/10.1109/ICDAR.2003.1227801.

Smith, Leslie N. 2017. “Cyclical Learning Rates for Training Neural Networks.” In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), 464–72. IEEE.

Srivastava, Rupesh Kumar, Klaus Greff, and Jürgen Schmidhuber. 2015. “Highway Networks.” arXiv Preprint arXiv:1505.00387. https://doi.org/10.48550/arXiv.1505.00387.

Stekhoven, Daniel J, and Peter Bühlmann. 2012. “MissForest—Non-Parametric Missing Value Imputation for Mixed-Type Data.” Bioinformatics 28 (1): 112–18.

Sullivan, Thomas R., Ian R. White, Amy B. Salter, Philip Ryan, and Katherine J. Lee. 2018. “Should Multiple Imputation Be the Method of Choice for Handling Missing Data in Randomized Trials?” Statistical Methods in Medical Research 27 (9): 2610–26. https://doi.org/10.1177/0962280216683570.

Sun, BaoLuo, Lan Liu, Wang Miao, Kathleen Wirth, James Robins, and Eric J. Tchetgen Tchetgen. 2018. “Semiparametric Estimation with Data Missing Not at Random Using an Instrumental Variable.” Statistica Sinica 28 (4): 1965–83. https://doi.org/10.5705/ss.202016.0324.

Tashman, Leonard J. 2000. “Out-of-Sample Tests of Forecasting Accuracy: An Analysis and Review.” International Journal of Forecasting 16 (4): 437–50.

Tchetgen Tchetgen, Eric J., and Ilya Shpitser. 2012. “Semiparametric Theory for Causal Mediation Analysis: Efficiency Bounds, Multiple Robustness and Sensitivity Analysis.” The Annals of Statistics 40 (3): 1816–45. https://doi.org/10.1214/12-AOS990.

Tsiatis, Anastasios A. 2006. Semiparametric Theory and Missing Data. Springer Series in Statistics. New York: Springer. https://doi.org/10.1007/0-387-37345-4.

Turing, A. M. 1950. “Computing Machinery and Intelligence.” Mind 59 (236): 433–60. https://doi.org/10.1093/mind/LIX.236.433.

Vafa, Keyon, Peter G. Chang, Ashesh Rambachan, and Sendhil Mullainathan. 2025. “What Has a Foundation Model Found? Using Inductive Bias to Probe for World Models.” In Proceedings of the 42nd International Conference on Machine Learning (ICML), 267:60727–47. Proceedings of Machine Learning Research. PMLR. https://proceedings.mlr.press/v267/vafa25a.html.

Vafa, Keyon, Justin Y. Chen, Ashesh Rambachan, Jon Kleinberg, and Sendhil Mullainathan. 2024. “Evaluating the World Model Implicit in a Generative Model.” In Advances in Neural Information Processing Systems 37 (NeurIPS 2024). https://proceedings.neurips.cc/paper_files/paper/2024/hash/2f6a6317bada76b26a4f61bb70a7db59-Abstract-Conference.html.

Vafa, Keyon, Ashesh Rambachan, and Sendhil Mullainathan. 2024. “Do Large Language Models Perform the Way People Expect? Measuring the Human Generalization Function.” arXiv Preprint arXiv:2406.01382.

Van Buuren, Stef, and Karin Groothuis-Oudshoorn. 2011. “mice: Multivariate Imputation by Chained Equations in R.” Journal of Statistical Software 45: 1–67.

Vapnik, Vladimir N. 1998. Statistical Learning Theory. New York: Wiley.

Veit, Andreas, Michael J. Wilber, and Serge Belongie. 2016. “Residual Networks Behave Like Ensembles of Relatively Shallow Networks.” In Advances in Neural Information Processing Systems (NeurIPS). Vol. 29.

Vershynin, Roman. 2018. High-Dimensional Probability: An Introduction with Applications in Data Science. Vol. 47. Cambridge University Press.

Wang, Tongzhou, and Phillip Isola. 2020. “Understanding Contrastive Representation Learning Through Alignment and Uniformity on the Hypersphere.” In International Conference on Machine Learning, 9929–39. PMLR.

Wasserstein, Ronald L., and Nicole A. Lazar. 2016. “The ASA Statement on p-Values: Context, Process, and Purpose.” The American Statistician 70 (2): 129–33. https://doi.org/10.1080/00031305.2016.1154108.

White, Ian R, and John B Carlin. 2010. “Bias and Efficiency of Multiple Imputation Compared with Complete-Case Analysis for Missing Covariate Values.” Statistics in Medicine 29 (28): 2920–31.

White, Ian R, Patrick Royston, and Angela M Wood. 2011. “Multiple Imputation Using Chained Equations: Issues and Guidance for Practice.” Statistics in Medicine 30 (4): 377–99.

Williams, Samuel, Andrew Waterman, and David Patterson. 2009. “Roofline: An Insightful Visual Performance Model for Multicore Architectures.” Communications of the ACM 52 (4): 65–76. https://doi.org/10.1145/1498765.1498785.

Xu, Da, Chuanwei Ruan, Evren Korpeoglu, Sushant Kumar, and Kannan Achan. 2020. “Inductive Representation Learning on Temporal Graphs.” arXiv Preprint arXiv:2002.07962.

Yoon, Jinsung, James Jordon, and Mihaela van der Schaar. 2018. “GAIN: Missing Data Imputation Using Generative Adversarial Nets.” In Proceedings of the 35th International Conference on Machine Learning (ICML), 80:5689–98. Proceedings of Machine Learning Research. PMLR. https://proceedings.mlr.press/v80/yoon18a.html.