Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Reasoning Effort

Pattern

A named solution to a recurring problem.

Choose how much inference-time thinking a hybrid model spends before it answers, matching the dial to the shape of the task rather than its apparent difficulty.

Also known as: Thinking Budget, Extended Thinking, Reasoning Mode

You’ve already turned this dial, maybe without noticing the name. You ask Claude Code a quick question and the answer comes back in a second. You ask it to untangle a gnarly bug and it sits there “thinking” for half a minute before the first word appears. That pause is the model spending extra inference-time compute on internal reasoning before it commits to an answer. In 2026, how much of that thinking happens is a parameter you control, and choosing it well is one of the biggest per-call decisions you make.

Understand This First

  • Model — the dial only exists because hybrid models pair a fast mode with an extended-thinking mode.
  • Tradeoff — every effort setting is a cost, latency, and quality tradeoff.

Context

At the agentic level, the model underneath your workflow is almost certainly a hybrid: it carries both a fast mode that answers immediately and an extended-thinking mode that reasons internally before responding. The frontier converged on this design through 2025, and with it came a runtime dial that selects how much thinking to allow.

Each vendor ships the dial under a different name. OpenAI exposes reasoning_effort with tiers minimal, low, medium, and high. Anthropic ships an extended-thinking budget on Claude Opus 4.5, set as a token allowance or an effort level. Google’s Gemini 2.5 takes a thinkingBudget. xAI’s models expose a reasoning_mode, and DeepSeek toggles an internal chain of thought. The names differ; the primitive is the same. You are deciding how long the model gets to think before it speaks.

This sits one layer below model routing. Routing chooses which model handles a task. Reasoning effort tunes within a single hybrid model once you have chosen it. The two decisions compose: route to a model, then set its effort.

Problem

How much thinking should the model do before it answers? Spend too little, and it rushes a hard problem and ships a wrong solution that costs you more to fix than the thinking would have cost. Spend too much, and you pay a multiple of the price, wait several times as long for the first token, and on some tasks get a worse answer because the model over-engineers what should have been simple.

The naive rule, “harder problem, more thinking,” feels right and it’s wrong often enough to matter. The optimal setting depends on the task’s structure, not just its difficulty, and the dial has real costs in both money and latency that you pay whether or not the extra thinking helped.

Forces

  • Quality rises with effort on hard reasoning, but not everywhere. On math and science benchmarks, more thinking buys real accuracy. On code, the curve flattens and can even bend down.
  • Cost scales steeply. Higher effort generates more internal reasoning tokens, which you pay for. The fee inflation runs roughly 4 to 17 times between the lowest and highest tiers.
  • Latency scales worse. Extended thinking pushes time-to-first-token up by 5 to 60 times. For interactive work where you wait on each response, that delay compounds across a session.
  • Task structure matters more than task difficulty. A task with a clear specification and a tight feedback loop rarely needs deep thinking even when it’s “hard”; an underspecified open-ended task may need it even when it looks small.

Solution

Match the dial to the task’s structure, and treat medium as the default for code. Reach for high effort only when the task genuinely rewards extended reasoning, and drop to minimal when the work is mechanical.

A useful split runs across task classes. Planning and analysis rewards high effort. Architecture decisions, root-causing a subtle bug, weighing a design tradeoff: this is exactly the multi-step reasoning extended thinking was built for. Mechanical execution wants minimal or low effort. Applying an agreed change across files, formatting, generating boilerplate from a clear spec: there’s no reasoning to do, so every extra thinking token is waste. Verification and review sits in the middle, where medium effort is usually enough to catch real problems without over-thinking clean code.

The non-obvious finding deserves its own callout, because most practitioners get it wrong. They reach for the highest setting on the hardest task and assume that’s the safe choice. For code, it isn’t.

On code, medium usually beats high

For coding tasks, medium effort is the sweet spot, not high. On Expert-SWE-style benchmarks, medium lands around 71 to 73 percent pass rate while high regresses by three to five points: the model spends its extra thinking second-guessing a correct approach, gold-plating, or introducing complexity the task never needed. Default to medium for code and escalate to high only when medium visibly struggles, not as a reflex.

The benchmark picture, drawn from cross-vendor cost-quality measurements published in 2026, makes the shape concrete. On AIME 2026 (competition math), moving from low to high effort buys 18 to 22 percentage points: this is where thinking pays. On GPQA Diamond (graduate science), medium to high buys a smaller 3 to 7 points. On Expert-SWE (real-world coding), medium is the peak and high regresses. The pattern is clear once you see it: the more a task resembles a self-contained reasoning puzzle, the more effort helps; the more it resembles software engineering against a real codebase with tests, the less.

Reasoning effort is not chain-of-thought prompting, and conflating them dates you. Chain-of-thought is a prompting technique from the 2022 to 2024 era: you write “think step by step” into the prompt and coax the model into showing its work. Reasoning effort is an inference-time parameter: you set a knob and the model allocates internal compute, with the thinking happening in a reasoning pass you do not write and often do not see. The old vocabulary was a prompt move; the current vocabulary is a runtime dial.

How It Plays Out

A developer is working interactively in Claude Code on a payments service. She starts a session diagnosing why a refund occasionally double-fires under concurrent requests, and sets effort high: the bug is a reasoning problem, and she wants the model holding the full set of race conditions in mind at once. The model traces the concurrent access paths, finds the missing idempotency check, and proposes a fix. With the approach settled, she drops effort to minimal and has the model apply the same guard across the other twelve mutation endpoints. The mechanical part flies; the hard part got the thinking it needed. Her bill for the session is a fraction of what running everything at high effort would have cost, and the mechanical edits were no worse for the lower setting.

A team builds the dial into their harness instead of choosing by hand. Their agent pipeline classifies each task and sets effort programmatically: planning steps and design reviews run at high, code generation from an approved spec runs at minimal, and the verification loop that checks each change runs at medium. A subagent doing file searches runs at minimal, since there’s nothing to reason about. Over a month, the team’s spend drops sharply against their earlier “everything at high, to be safe” baseline, and their coding pass rate ticks up a couple of points, because the high-effort regression on code is no longer hitting their mechanical edits.

Warning

“To be safe, use high effort everywhere” is the expensive mistake. It inflates your bill several times over, slows every interaction, and on code it actively lowers quality. Safety on hard tasks comes from a verification loop, not from a maxed-out dial.

Consequences

Benefits. Matching effort to task structure is one of the largest cost levers in agentic coding, often cutting spend on a mixed workload by a wide margin while improving results on the tasks that were silently being hurt by over-thinking. Interactive loops tighten when mechanical work stops waiting on thinking it never needed. And naming the dial gives a team shared vocabulary: “run that at minimal” or “this one wants high effort” is a precise instruction in a code review or a harness config.

Liabilities. Every task now carries an effort decision, which is one more judgment to get right or encode. A bad call cuts both ways: too little effort on a genuine reasoning problem ships a wrong answer; too much on mechanical work burns money and time for nothing, and on code can degrade the result. The optimal settings also drift as models change, so a harness’s effort policy needs periodic review the same way a routing strategy does. And high effort’s latency tax makes it a poor fit for tight interactive loops even when it would improve the answer: sometimes the right move is a lower setting plus a verification pass rather than a long wait on a single deep response.

Sources

  • OpenAI introduced the reasoning_effort parameter with its reasoning models; the reasoning guide documents the minimal/low/medium/high tiers and how the model allocates internal reasoning tokens.
  • Anthropic’s extended thinking documentation describes the thinking-budget control on Claude Opus 4.5, where a token allowance governs how much the model reasons before answering.
  • Google’s Gemini thinking documentation specifies the thinkingBudget control on Gemini 2.5, the equivalent dial in that vendor’s API.
  • Jason Wei et al., “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models” (2022), established the prompting-era technique that the inference-time reasoning dial later subsumed; the distinction between the two is what dates the older vocabulary.
  • The cross-vendor cost-quality curves (AIME 2026, GPQA Diamond, Expert-SWE) and the code-specific finding that medium effort outperforms high were measured and published by independent practitioner benchmarking in early 2026; the numbers cited here reflect that body of measurement rather than any single vendor’s claims.