227  Reasoning Models and Test-Time Compute

227.1 1. Introduction

Between the original transformer and the instruction-tuned chat models of 2023, the dominant lever for improving large language models was scale at training time: more parameters, more data, more pre-training compute. In late 2024 a second lever appeared, and it changed the frontier. Reasoning models are trained to spend variable inference-time compute, generating an explicit, often long internal chain of thought before committing to an answer, trading latency and tokens for accuracy. This opened a second scaling axis, test-time compute, alongside pre-training scale, and essentially every frontier model shipped through 2025 sits on it.

The shift matters enough that a treatment of LLMs that stops at instruction-tuned chat models is a generation behind. This chapter develops what reasoning models are, the empirical scaling behaviour that justifies them, the two roads to building them (OpenAI’s o-series and DeepSeek’s open R1), the test-time compute methods that operationalize “thinking,” the economics of a thinking budget, and the honest limits, including the benchmarks that reveal what reasoning models still cannot do.

227.2 2. The Core Idea: A Second Scaling Axis

A standard chat model maps a prompt to an answer in a single forward pass per token, emitting the answer directly. A reasoning model is trained instead to first produce a long reasoning trace, exploring approaches, checking intermediate steps, backtracking, and only then emit a final answer. Chain-of-thought prompting had shown since 2022 that eliciting such traces helps; the new idea is to make extended, self-correcting reasoning a trained behaviour reinforced by outcome rewards, rather than something coaxed out by prompt engineering.

The empirical payoff is an inference-time scaling law: within a model, accuracy on hard reasoning tasks rises smoothly as you allow more thinking tokens, much as accuracy rose with parameters and data at training time. This gives a genuinely new control knob, the same deployed model can be made more capable on a hard problem simply by letting it think longer, without retraining.

Accuracy on hard reasoning tasks rises with compute (log scale) along two complementary axes. Pre-training scale (bigger model or more data) and test-time compute (thinking longer per query) each push accuracy upward, with test-time compute extending the curve at inference for a fixed model.

xychart-beta
    title "Two scaling axes: accuracy vs compute (log)"
    x-axis "Compute (log)" [low, medium, high, "very high"]
    y-axis "Accuracy" 0 --> 100
    line "Pre-training scale" [30, 50, 64, 72]
    line "Test-time compute" [30, 58, 78, 90]

The two axes are complementary, not competing: a strong base model trained at scale is what makes the reasoning RL stage effective, and reasoning then extracts more capability from that base at inference.

227.3 3. The o-Series: Reasoning as a Product

OpenAI’s o1, released September 12, 2024, was the first widely deployed reasoning model. It was trained to “think before answering,” hid its raw chain of thought behind a summarized version, and demonstrated the inference-scaling curve on competition mathematics, coding, and science.

o3 and o4-mini followed: o3 was previewed in December 2024 and reached general availability with o4-mini in April 2025. The headline results made the paradigm impossible to ignore. On the ARC-AGI-1 abstraction benchmark, designed specifically to resist memorization, o3 scored about 87.5% in a high-compute configuration (at a cost reported on the order of thousands of dollars per task), versus the low single digits typical of prior LLMs. On FrontierMath, a set of research-level problems where earlier models scored around 2%, o3 reached roughly 25%. These were not incremental gains; they were the signature of a new capability regime, bought with inference compute.

The o-series also crystallized the thinking budget as a product control: users and developers can request “low,” “medium,” or “high” reasoning effort, directly trading cost and latency for accuracy, the inference-time scaling law exposed as an API parameter.

227.4 4. DeepSeek-R1: Reasoning from Reinforcement Learning, in the Open

The o-series demonstrated that reasoning worked but disclosed little about how. DeepSeek-R1, released in January 2025 and later peer-reviewed in Nature (September 2025), opened the recipe and delivered a striking scientific result: sophisticated reasoning can emerge from reinforcement learning alone.

The key artifact was R1-Zero, trained by applying RL directly to a base model with no supervised fine-tuning cold-start. Given only a reward for producing correct, verifiable answers (and a format reward), the model spontaneously learned to generate long chains of thought, to allocate more steps to harder problems, and to exhibit self-correction, the paper’s much-quoted “aha moment,” where the model learns to pause and re-evaluate. The reinforcement-learning machinery behind this, Group Relative Policy Optimization (GRPO), which estimates its baseline from a group of sampled answers and dispenses with a separate value network, is developed fully in the next chapter on reinforcement learning for LLMs; here the point is the result: reasoning need not be hand-taught through demonstrations, it can be incentivized and allowed to emerge.

The full R1 added a small amount of cold-start data and a multi-stage pipeline to fix R1-Zero’s readability and language-mixing issues, but the scientific headline stands. Because DeepSeek released open weights and a detailed report, R1 became the reference point for an entire wave of open reasoning models and reasoning research.

227.5 5. Test-Time Compute Methods

“Let the model think longer” can be realized in several distinct ways, which it is worth separating because they have different cost and reliability profiles.

  • Long chain-of-thought (sequential). The model generates one long, self-correcting trace. Capability comes from depth: revisiting, checking, and backtracking within a single sample. This is what o1/o3 and R1 primarily do, and it is what RL trains.
  • Parallel sampling with selection. Draw N independent answers and choose among them, best-of-N against a verifier or reward model, or self-consistency, which marginalizes over reasoning paths by majority-voting the final answers. Cost scales linearly in N; accuracy improves then plateaus.
  • Verifier-guided search. Use a process reward model (which scores intermediate steps, not just final answers) to guide a search, beam search or tree search, over reasoning steps, expanding promising partial traces. More compute-efficient per unit of accuracy than blind sampling, at the cost of needing a good step-level verifier.
  • Sequential revision. Have the model critique and revise its own answer over multiple rounds.

A useful framing is parallel versus sequential test-time compute: parallel methods (sampling, search) explore breadth; sequential methods (long CoT, revision) exploit depth. The research consensus through 2025 is that the two compose, and that the optimal allocation between them depends on the problem, easy problems waste compute under heavy search, while the hardest benefit from both depth and breadth.

227.6 6. Reasoning Distillation

A practical and important finding from the R1 work: the long reasoning traces produced by a strong reasoning model can be used as supervised training data to teach smaller, cheaper, dense models to reason. DeepSeek distilled R1’s traces into a range of small models that substantially outperformed same-size models trained conventionally. This reasoning distillation matters economically, it pushes capable reasoning down the model-size curve, so that the expensive reasoning RL need only be run once on a large model and its behaviour can then be transferred cheaply. It also reframes reasoning traces themselves as a valuable synthetic-data asset, a theme picked up in the post-training chapter.

227.7 7. The Economics of a Thinking Budget

Reasoning is not free, and its cost structure differs from ordinary inference. A reasoning model may emit thousands of hidden thinking tokens per query, Qwen3, for instance, exposed thinking budgets of tens of thousands of tokens, and the user typically pays for those tokens and waits for them. This reshapes the LLM cost model covered earlier in the book: the relevant unit is no longer “tokens in the answer” but “tokens spent thinking to reach the answer,” and the right amount varies by query.

The operational implications are concrete. Latency-sensitive applications cannot afford maximum reasoning on every call; the discipline is to route, spend heavy reasoning only on queries that need it, and answer easy queries directly. “Overthinking,” where a model burns budget on trivial questions, is a real failure mode and an active area of work (adaptive or budget-aware reasoning). The thinking budget is thus both a capability knob and a cost-control problem, and treating it as either alone is a mistake.

227.8 8. When Reasoning Helps, and When It Does Not

Reasoning models shine on problems with verifiable structure and multiple steps: competition mathematics, algorithmic coding, formal logic, and scientific problems where intermediate work can be checked. The reason is partly architectural and partly training-driven, RL with verifiable rewards (next chapter) requires a checkable answer, so the capability is sharpest exactly where answers are checkable.

The gains are smaller, and sometimes negative, on tasks that are not decomposable into verifiable steps: open-ended writing, simple factual lookup, or tasks dominated by knowledge rather than inference. On easy queries, extended reasoning can reduce quality by overthinking, and it always costs more. A mature system therefore treats reasoning as a tool to be applied selectively, not a universal upgrade.

227.9 9. Evaluation and the Limits of Reasoning

The reasoning era forced new benchmarks, because the old ones saturated. As models approached ceiling on MMLU and similar tests, evaluation migrated to harder, less saturable sets: GPQA-Diamond (graduate-level science), FrontierMath (research mathematics), SWE-bench Verified (real software issues), and Humanity’s Last Exam.

The most instructive case is ARC-AGI. o3’s ~87.5% on ARC-AGI-1 was hailed as a breakthrough in fluid, on-the-fly abstraction. But when ARC-AGI-2 was released in 2025, redesigned to defeat brute-force search and to require more compositional adaptation, the same class of models that saturated version 1 scored in the low single digits at moderate compute. The gap is the lesson: reasoning models made enormous progress on problems amenable to search and verification, yet a measurable adaptation gap to human-style generalization persists. High accuracy on a benchmark a model was effectively optimized against is not evidence of general intelligence, and ARC-AGI-2 is the cleanest demonstration that the frontier, however impressive, has not closed that gap.

A second caution concerns the reasoning trace itself. The visible chain of thought is not guaranteed to be a faithful account of the computation that produced the answer; treating it as a transparent explanation, for audit or safety purposes, is unsound without separate evidence of faithfulness. The trace is a means to an answer, not a certified justification of it.

227.10 10. Conclusion

Reasoning models introduced the first genuinely new scaling axis since the transformer: test-time compute, the ability to make a fixed model more capable by letting it think longer. OpenAI’s o-series proved the paradigm commercially and exposed the thinking budget as a control; DeepSeek-R1 opened the recipe and showed, in a peer-reviewed result, that reasoning can emerge from reinforcement learning alone, a finding whose machinery is the subject of the next chapter. The methods (long chain-of-thought, parallel sampling, verifier-guided search), the economics (paying to think, and routing to avoid overthinking), and the honest limits (the ARC-AGI-2 adaptation gap, unfaithful traces) together define the regime in which all current frontier systems operate. Reasoning did not make the older lessons obsolete; it added a second dimension on top of them, and understanding that dimension is now prerequisite to understanding the field.

227.11 References

  1. OpenAI. “Learning to Reason with LLMs” (o1). September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/
  2. OpenAI. “Introducing OpenAI o3 and o4-mini.” April 16, 2025. https://openai.com/index/introducing-o3-and-o4-mini/
  3. DeepSeek-AI. “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.” Nature, September 2025. https://www.nature.com/articles/s41586-025-09422-z
  4. Chollet, F. et al. “ARC Prize 2024 / o3 breakthrough on ARC-AGI-1.” https://arcprize.org/blog/oai-o3-pub-breakthrough
  5. ARC Prize. “Announcing ARC-AGI-2 and ARC Prize 2025.” 2025. https://arcprize.org/blog/announcing-arc-agi-2-and-arc-prize-2025
  6. Snell, C. et al. “Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters.” 2024. https://arxiv.org/abs/2408.03314
  7. Wang, X. et al. “Self-Consistency Improves Chain of Thought Reasoning in Language Models.” 2022. https://arxiv.org/abs/2203.11171
  8. Lightman, H. et al. “Let’s Verify Step by Step” (process reward models). 2023. https://arxiv.org/abs/2305.20050
  9. Glazer, E. et al. “FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI.” 2024. https://arxiv.org/abs/2411.04872
  10. Qwen Team. “Qwen3 Technical Report” (thinking budgets). 2025. https://github.com/QwenLM/Qwen3