Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Model

Concept

A foundational idea to recognize and understand.

The inference engine underneath every agentic coding workflow — a large language model whose properties shape what you can ask it to do.

What It Is

A model is a large language model (LLM): the inference engine that powers agents, coding assistants, and every other agentic workflow. When you interact with an AI coding assistant, the model is the part that reads your prompt, processes it within a context window, and produces a response.

At its foundation, a model is a neural network trained on vast amounts of text and code that has learned statistical patterns in language. That description undersells what modern models actually do. Frontier models decompose multi-step problems, plan solutions, self-correct when they notice errors, and generate working code for tasks they haven’t seen expressed in exactly that form. The “just predicts the next word” framing is like saying a chess engine “just evaluates board positions.” Technically accurate, practically misleading.

A model has intrinsic properties that hold no matter how it’s used:

Models are stateless between calls. Each request starts fresh. The model doesn’t remember your last conversation unless previous context is explicitly included. This is why instruction files and memory patterns exist.

Models have knowledge cutoffs. They were trained on data up to a specific date. They don’t know about libraries released last week or APIs that changed last month. In agentic settings, tools partially compensate: an agent with web search, file reading, and documentation retrieval can look up current information rather than relying on stale training data. The model still can’t know what it doesn’t know, so providing current documentation for recent technologies remains good practice.

Models optimize for plausibility. When uncertain, a model produces the most likely-sounding response, not an admission of uncertainty. This is why AI smells exist and why verification loops matter.

Models process more than text. Frontier models accept images alongside text universally. Several (including GPT-5 and Gemini 2.5) accept native audio and video as well, though support varies by vendor: Claude Opus 4.5, for example, handles text and images but not audio or video. For agentic coding, this means a model can examine screenshots of a broken UI, read diagrams and architecture sketches, inspect visual test output, and (when the chosen model supports it) listen to a developer’s recorded explanation or watch a screencast of a failing test. Multimodal input expands what you can communicate in a prompt beyond what words alone can express.

Why It Matters

People new to agentic coding often treat the model as either a magic oracle (it knows everything) or a simple autocomplete (it just predicts the next word). Both framings lead to poor results.

The oracle framing leads to uncritical acceptance of output. You ask, the model answers, you ship. When the answer is wrong, you find out the hard way: a fabricated API call that doesn’t exist, a confidently-cited library function that’s two versions out of date, a security check that looks careful but skips the case that actually matters. The autocomplete framing leads to the opposite failure: underusing the model’s genuine capacity for reasoning, planning, and synthesis. You ask only for keystroke completions and miss that the same model could have read your test failures, traced the offending code path, and proposed a fix.

The accurate framing is in the middle and has texture. Models are highly capable but context-dependent collaborators. They reason well within their context window but can’t access information outside it. They generate plausible output by default and correct output when given sufficient context and clear constraints. They respond to framing: the same question asked differently produces different quality responses, which is the entire basis of prompt engineering and context engineering. Carrying this mental model into every interaction is what separates working with the system from fighting it.

How to Recognize It

You’ll see model nature in the texture of its output and in the failure modes you hit when you treat it as something it isn’t.

  • Fluency that’s independent of correctness. Model output sounds authoritative regardless of whether it’s right. The same confident prose carries a correct quicksort and a fabricated API. Trust calibration is your job, not the model’s.
  • Training-data shaped knowledge. The model is fluent on libraries that existed at training time and silent or wrong on libraries that didn’t. Sources reflect what was prevalent in the training corpus, including its biases and errors.
  • Broad competence with uneven depth. A single model handles many languages, frameworks, and domains, but depth varies by how much of each appeared in training. Popular topics get strong responses; obscure ones get plausible-sounding guesses.
  • Stochasticity at every level. The same prompt can produce different outputs on different runs. Agent harnesses often drop the temperature to near zero to reduce variance on deterministic-feeling tasks, but bit-for-bit reproducibility is rarely achievable in practice. GPU floating-point ordering, tie-breaking at the top logit, and serving-layer batching each leak small amounts of non-determinism even at temperature zero. As of late 2025, a known engineering recipe (batch-invariant kernels combined with deterministic serving stacks like SGLang) can deliver bit-identical output across runs, but most production APIs still do not enable it.
  • A capability spectrum, not a single point. No single model is best at everything. Fast models, reasoning models, and specialized coding models each suit different tasks. The frontier has converged on hybrid models that combine a fast mode and an extended-thinking mode in the same model, with a router or an effort parameter selecting per call. GPT-5 has a runtime router and a reasoning_effort API knob. Claude Opus 4.5 ships hybrid reasoning with an effort parameter. Gemini 2.5 exposes a thinkingBudget. Smaller and older models still ship as separate fast and reasoning SKUs, and specialized coding models can still beat general-purpose models on cost or local-deployment constraints (though on raw capability the gap has narrowed: Claude Opus 4.5 hit 80.9% on SWE-bench Verified at launch). Matching effort to task remains a practical skill. Spending high reasoning effort on string formatting wastes time and money; using minimal effort on a tricky concurrency bug wastes attempts.

How It Plays Out

A developer asks a model to implement a sorting algorithm. The model produces a clean, correct quicksort. Encouraged, the developer asks it to integrate with a proprietary internal API. The model produces confident-looking code that calls endpoints and uses data structures that don’t exist. It has no knowledge of this private API. The developer learns to provide API documentation in the context when asking for integration work.

A team uses a model to review a pull request. The model identifies a potential race condition that three human reviewers missed, because it systematically traced the concurrent access paths. The same model, in the same review, suggests a “best practice” that’s actually outdated advice from a deprecated framework. The team learns that model output requires verification even when parts of it are excellent.

Example Prompt

“I need you to integrate with our internal inventory API. Here is the full API documentation: read it before generating any code, because you won’t have training data on this private system.”

Consequences

Carrying an accurate mental model of the model lets you work with it productively rather than fighting its limitations. You learn to provide the context it needs, verify the output it produces, and choose the right model for each task. Routine work moves faster; harder work gets the deeper variant of the same model and a more carefully constructed context.

The cost is dual awareness. You appreciate the model’s capabilities and remain skeptical of any individual output, both at once. This is a cognitive skill that takes practice to develop. Over time, it becomes second nature, similar to how experienced developers learn to trust a compiler’s output while distrusting their own assumptions.

Sources

  • The concept of the large language model traces to Vaswani et al., “Attention Is All You Need” (2017), which introduced the transformer architecture underlying all modern LLMs.
  • Jason Wei et al., “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models” (2022), demonstrated that models can perform multi-step reasoning when prompted appropriately, challenging the “just predicts the next word” framing.
  • OpenAI’s release of o1 (September 2024) marked the emergence of dedicated reasoning models that spend compute on extended thinking before responding, establishing the fast-vs-reasoning model distinction as a practical concern for practitioners. The split it defined was later subsumed by hybrid models (GPT-5 in August 2025, Claude Opus 4.5 in November 2025, Gemini 2.5) that combine both modes in a single model with a runtime router or an effort dial.
  • Bartosz Mikulski, “The Temperature=0 Myth: Why Your LLM Still Isn’t Deterministic (And How to Fix It)”, explains why temperature zero gives greedy sampling rather than true determinism, and catalogs the non-determinism sources (GPU floating-point ordering, batching, mixture-of-experts routing) that persist below the sampling layer.
  • Horace He and Thinking Machines Lab, “Defeating Nondeterminism in LLM Inference” (September 2025), identified batch-invariance (not floating-point ordering) as the dominant practical cause of non-determinism in LLM inference, and shipped a companion library of batch-invariant kernels for matmul, RMSNorm, and attention that achieved bit-identical output across 1,000 runs even under dynamic batching.