Structured Outputs
Constrain a language model’s response to a known schema so the next program in the pipeline can parse it without guessing.
Also known as: JSON Mode, Constrained Decoding (the implementation technique), Response Format
Understand This First
- Tool — the dominant consumer of structured outputs; tool calls only work when the model returns parseable arguments.
- Schema (Serialization) — the vocabulary (JSON Schema, Pydantic, Zod) that a structured-output contract is written in.
- Agent-Computer Interface (ACI) — the design surface where response-shape decisions are made.
Context
A language model emits text. The program that called the model usually wants something else: a tool invocation, a typed record, a list of extracted entities, a routing decision, a graded score. Somewhere between the model’s free-form text and the next program’s typed input, the gap has to close.
The original move was to ask the model nicely. “Reply with a JSON object that has these three fields.” The model would mostly comply. Then it would helpfully add a markdown code fence, or apologize before answering, or invent a fourth field, or omit a comma, and the downstream JSON.parse would crash. Logs filled up with retry loops and regex patches. OpenAI’s own data shows compliance with a target schema hovering under 40% when the shape was requested in the prompt and left to the model’s discretion.
Structured outputs close that gap at the model layer. The caller declares a schema; the provider constrains generation so the response is guaranteed to conform. The downstream program no longer guesses. The pattern is now standard across OpenAI, Anthropic, Google, Cohere, vLLM, and the cross-provider routing layers (LangChain, LiteLLM, OpenRouter) that wrap them.
Problem
How do you connect a model that produces tokens to a program that needs typed values, without spending the rest of your career writing parser fallbacks?
The intermediate step has to be reliable enough that the calling code can treat the model’s response as a typed result, not as an untrusted blob to defensively parse. It has to be cheap enough to use on every call. And it has to leave the model enough room to actually think: a schema so tight that it suppresses reasoning is worse than a schema that occasionally fails.
Forces
- Reliability versus expressiveness. A strict schema rules out malformed responses, but it can also rule out useful answers the schema-author didn’t anticipate. The right shape lets the model say what it needs to say while ruling out shapes the caller can’t handle.
- Latency cost of constrained decoding. Constraining generation at the token-sampling layer adds work to each step. On short responses the cost is invisible; on long ones it shows up in the wall clock.
- Reasoning quality versus structural rigor. Practitioners report that very tight schemas sometimes degrade the model’s chain of thought, because the model can’t write its way to the answer. Leaving a free-form reasoning field, or doing the thinking in a separate unconstrained call, often outperforms forcing the whole response into a strict shape.
- Schema drift between client and model. When the schema lives in two places (the calling code and the request body) it will eventually fall out of sync. The team that doesn’t generate one from the other will spend an afternoon a quarter chasing the divergence.
- The wrong required field. A required field the model can’t fill cleanly produces a fabricated value rather than an honest gap. This is one of the most common ways structured outputs go wrong, and it’s invisible until you read the data.
Solution
Declare a schema, hand it to the provider, and let the provider’s constrained-decoding layer guarantee the response conforms. The schema is part of the request, not the prompt. The model’s natural-language instruction can still describe what to fill the fields with; the shape is no longer the model’s responsibility.
Three implementation styles are common, and most production systems use a mix:
Provider-native JSON Schema. OpenAI, Anthropic, Google, and Cohere all accept a JSON Schema (or a Pydantic / Zod model that compiles to one) on the request. The provider runs constrained decoding under the hood: at each token-sampling step, the candidate next-tokens are filtered to those that keep the response on a path that can still satisfy the schema. OpenAI calls this response_format: { type: "json_schema", strict: true }; Anthropic exposes it through tool-use input schemas; Google through responseSchema. Strict mode is what closes the 40%-compliance gap: with the schema enforced at the sampling layer, conformance reaches 100% on the same evaluations.
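A minimal sketch of the OpenAI flavor, in Python (the model name, prompt, and field names are illustrative; the other providers’ request shapes differ in detail but not in kind):

```python
import json
from openai import OpenAI

client = OpenAI()
contract_text = "..."  # the document to extract from

# Illustrative schema; strict mode requires additionalProperties: false
# and every property listed in required.
schema = {
    "type": "object",
    "properties": {
        "party_a": {"type": "string"},
        "party_b": {"type": "string"},
        "effective_date": {"type": "string", "description": "ISO-8601 date"},
        "term_months": {"type": "integer"},
    },
    "required": ["party_a", "party_b", "effective_date", "term_months"],
    "additionalProperties": False,
}

response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system", "content": "Extract the contract fields."},
        {"role": "user", "content": contract_text},
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "contract_record", "strict": True, "schema": schema},
    },
)

record = json.loads(response.choices[0].message.content)  # conforms by construction
```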
Tool-call schemas. Every tool the model can call is declared with an input schema. When the model decides to call a tool, the response is structurally a tool invocation: a tool name plus arguments that satisfy the schema. Tool use is structured outputs in disguise — the schema happens to live on the tool definition rather than on the request itself, but the constraint mechanism is the same. This is the path most agentic systems use most of the time.
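A sketch of the tool-definition flavor, using the Anthropic request shape (the tool name, schema, and model name are illustrative):

```python
from anthropic import Anthropic

client = Anthropic()

# The schema lives on the tool definition, not on the request.
tools = [{
    "name": "search_codebase",
    "description": "Search the repository for a pattern.",
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string"},
            "max_results": {"type": "integer"},
        },
        "required": ["query"],
    },
}]

message = client.messages.create(
    model="claude-sonnet-4-20250514",  # illustrative model name
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "Where is retry logic implemented?"}],
)

for block in message.content:
    if block.type == "tool_use":
        # block.input is already a dict that satisfies the schema;
        # no parsing, no regex, no code-fence stripping.
        print(block.name, block.input)
```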
Validate-and-retry frameworks. Libraries like Instructor, LangChain’s structured output, and Pydantic AI wrap any model behind a typed interface: the caller passes a target type; the library serializes a schema, sends the request, validates the response, and retries on failure with the validation error injected back into the next prompt. This is the right answer when working across providers that don’t all support native constrained decoding, or when the schema is too dynamic to express in the provider’s format.
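The core of the pattern is a short loop. A hand-rolled sketch, assuming a generic complete() function that sends messages to any provider and returns raw text (the named libraries do the same with more polish, streaming support, and smarter repair prompts):

```python
from pydantic import BaseModel, ValidationError

class Record(BaseModel):  # illustrative target type
    title: str
    year: int

def structured_call(prompt: str, target: type[BaseModel], max_retries: int = 3):
    messages = [{
        "role": "user",
        "content": f"{prompt}\n\nReply with JSON matching this schema:\n{target.model_json_schema()}",
    }]
    for _ in range(max_retries):
        raw = complete(messages)  # assumed: any provider's text-generation call
        try:
            return target.model_validate_json(raw)
        except ValidationError as err:
            # Feed the validation error back so the next attempt can self-correct.
            messages.append({"role": "assistant", "content": raw})
            messages.append({
                "role": "user",
                "content": f"That failed validation: {err}. Reply again with valid JSON only.",
            })
    raise RuntimeError("no valid response within retry budget")
```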
The cross-cutting discipline is the same in all three: schema is contract, prompt is intent. Keep the what to fill in the prompt and the how it must look in the schema. Don’t restate the shape in the prompt; the model can already see it. Don’t try to enforce shape from the prompt; the model is no longer the right enforcement layer.
Leave the model room to think. If a schema requires the model to commit to a final answer in one field with no scratch space, consider adding a reasoning (or analysis, or thinking) string field before the answer field. The model fills it on its way to the answer, and the cost is a few extra tokens. Strict-schema-only responses tend to underperform on tasks where the answer is genuinely a conclusion rather than a lookup.
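In schema terms this is one extra field, declared ahead of the answer so that left-to-right decoding produces the thinking first; a Pydantic sketch with illustrative names:

```python
from pydantic import BaseModel, Field

class GradedAnswer(BaseModel):
    # Scratch space: generated before the model commits to an answer,
    # because decoding runs left to right through the schema's fields.
    reasoning: str = Field(description="Work through the problem here.")
    answer: str
```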
How It Plays Out
A team builds an extraction pipeline that pulls structured records out of inbound contracts. The first pass uses prompt-only instruction: “Reply with a JSON object containing party_a, party_b, effective_date, term_months.” It works most of the time. Once a week, the model returns a date in a format the parser doesn’t recognize, or wraps the JSON in a markdown code fence, or apologizes that one of the fields wasn’t visible in the document. The downstream pipeline catches the parse error and retries. After three months the retry rate is 6% and the retry log is the team’s largest unread Slack channel.
The second pass switches to provider-native structured outputs. The schema declares effective_date as an ISO-8601 date string and term_months as an integer. The team adds a notes string field for the model to flag fields it couldn’t extract cleanly, replacing the missing-data fabrication problem with an honest “field not present in document” annotation. Parse-error rate drops to roughly zero. The Slack channel goes quiet.
A few weeks in, the team notices a problem they hadn’t seen before: contracts written with relative dates (“the third Tuesday of next month”) show up as fabricated absolute dates, because the schema is too tight on the date format to admit anything else. They add a date_is_relative boolean field and a relative_date_text string field; the model now surfaces the cases the parser was previously hiding.
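The evolved schema from the scenario, sketched as a Pydantic model (the field names come from the story above; the rest is illustrative):

```python
from pydantic import BaseModel

class ContractRecord(BaseModel):
    party_a: str
    party_b: str
    effective_date: str | None   # ISO-8601, or None when not extractable
    date_is_relative: bool       # True for "the third Tuesday of next month"
    relative_date_text: str | None
    term_months: int | None
    notes: str                   # honest "field not present in document" annotations
```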
A coding agent uses tool-call schemas as its primary interaction surface. Every action (read a file, run a test, search the codebase, write a patch) is exposed as a tool with a typed input schema. When the model decides to read a file, it doesn’t emit text describing what it wants to do; it emits a structured tool call with path, start_line, and end_line arguments that the harness can dispatch directly. The agent never has to worry about whether its action is parseable, because the model can’t emit one that isn’t. The harness logs are clean tool invocations rather than free-form text the harness has to interpret. The whole stack downstream of the model gets simpler.
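Harness-side, dispatch reduces to unpacking the structured call; a sketch assuming a hypothetical tool registry:

```python
# What the provider SDK delivers for the read-file action: a name plus
# arguments already guaranteed to satisfy the tool's input schema.
tool_call = {
    "name": "read_file",
    "input": {"path": "src/parser.py", "start_line": 40, "end_line": 80},
}

def read_file(path: str, start_line: int = 1, end_line: int | None = None) -> str:
    with open(path) as f:
        return "".join(f.readlines()[start_line - 1 : end_line])

HANDLERS = {"read_file": read_file}  # hypothetical registry

result = HANDLERS[tool_call["name"]](**tool_call["input"])
```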
A generator-evaluator loop has the evaluator return a structured judgment: a numeric score (0–10), a categorical verdict (accept, revise, reject), and a free-text rationale. Without a schema, the evaluator’s responses ranged across formats; the loop spent more time normalizing the verdict than acting on it. With a strict schema, the verdict is reliably one of three enum values and the score is reliably an integer in range. The next stage of the loop can be a simple switch statement.
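The judgment schema and the switch it enables, sketched in Python (publish, regenerate, and escalate are hypothetical next-stage handlers):

```python
from enum import Enum
from pydantic import BaseModel, Field

class Verdict(str, Enum):
    accept = "accept"
    revise = "revise"
    reject = "reject"

class Judgment(BaseModel):
    score: int = Field(ge=0, le=10)
    verdict: Verdict
    rationale: str

def next_step(j: Judgment):
    match j.verdict:  # reliably one of three values, so this is exhaustive
        case Verdict.accept:
            return publish()
        case Verdict.revise:
            return regenerate(feedback=j.rationale)
        case Verdict.reject:
            return escalate()
```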
A system prompt that follows the schema-is-contract discipline keeps the intent in prose and the shape in the tool definition:
“You are an extraction agent. The user will paste a meeting transcript. Use the extract_actions tool, which has a schema requiring action_text, owner_name, and due_date_iso. For each action item that doesn’t have a clear owner or due date, set the corresponding field to null and add a one-line note in the rationale field explaining what was unclear. Don’t fabricate names or dates.”
Consequences
The wins show up immediately. Parse-error rates collapse: providers that publish numbers report 100% schema compliance on strict mode versus 30–40% on prompt-only instruction. The downstream pipeline gets simpler because every defensive parser branch can be deleted. Tool use becomes practical at scale because the model can’t emit an unparseable tool call. The whole agent ecosystem rests on this foundation; without it, the harness would spend more code on response normalization than on doing actual work.
The cost is a discipline, not an outage. Constrained decoding adds latency on long responses. Strict schemas occasionally degrade reasoning quality, which is usually fixable by adding a free-form thinking field but requires the engineer to notice. The most subtle failure mode is fabricated values for required fields the model can’t honestly fill: the schema validates but the data is wrong. Make absent-data values explicit in the schema (nullable fields, or a confidence field, or a structured missing_reason enum) and the model will use them; force the field as required and unbounded and the model will invent.
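One way to make the honest gap expressible in the schema itself, with illustrative names:

```python
from enum import Enum
from pydantic import BaseModel

class MissingReason(str, Enum):
    not_present = "not_present_in_source"
    ambiguous = "ambiguous"
    illegible = "illegible"

class ExtractedField(BaseModel):
    value: str | None                     # null is a legal, honest answer
    missing_reason: MissingReason | None  # set only when value is null
    confidence: float                     # lets downstream code apply a threshold
```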
A second cost is the architectural commitment. Once the schema is in production, changing it has the same cost as any other API change. Versioning structured-output contracts the way you version any other interface (additive changes only, deprecate before remove, never reuse a field name with a different type) pays off as soon as more than one consumer reads the data.
A third is portability. Provider-native structured outputs work brilliantly inside one provider’s stack. Cross-provider abstractions (LiteLLM, OpenRouter) flatten the differences but at the cost of dropping to the lowest common denominator on schema features. Teams that need to swap providers at the model-routing layer eventually pick a validate-and-retry framework as the portable substrate and accept the extra round-trip cost on responses that fail validation.
Structured outputs also shrink certain attack surfaces. A response constrained to a fixed schema can’t smuggle arbitrary control-flow text into a downstream parser, which closes off some prompt-injection routes that depend on the response containing free text. They are not a substitute for output encoding at the human-facing surface, which is a separate problem with its own discipline. The schema constrains what the model can say; encoding constrains what the rendering layer can do with what was said.
Related Patterns
| Relation | Pattern | Note |
|---|---|---|
| Complements | MCP (Model Context Protocol) | MCP standardizes how tools are exposed; structured outputs are how the model returns parseable arguments to call them. |
| Complements | Verification Loop | Validation failures on structured output are the most common signal that drives the loop's next iteration. |
| Contrasts with | Output Encoding | Output encoding is downstream sanitization for human-facing surfaces; structured outputs constrain the model itself, upstream of any sanitizer. |
| Depends on | Code Mode | An agent that writes code calling tools needs structured-output discipline at every call site. |
| Detects | Smell (AI Smell) | When a strict schema forces the model to fabricate values for fields it cannot fill, the result is a recognizable AI smell. |
| Enabled by | Generator-Evaluator | An evaluator that returns a graded JSON judgment depends on the model's response conforming to a fixed shape. |
| Enables | Tool | Tool calls only work because the model's response can be constrained to the tool's input schema. |
| Mitigates | Prompt Injection | A response forced into a fixed schema cannot smuggle arbitrary control-flow text into a downstream parser. |
| Shaped by | Agent-Computer Interface (ACI) | The shape of a structured response is one of the ACI's central design surfaces. |
| Uses | Schema (Serialization) | Structured outputs ride on JSON Schema, the same wire-format vocabulary used to describe data in transit. |
Sources
The mechanism draws on two decades of constrained-decoding research, ported to the autoregressive language-model setting. The vocabulary “Structured Outputs” stabilized across the industry in late 2024 and early 2025, as OpenAI, Anthropic, Google, and Cohere converged on the same provider-side feature under the same name.
Brandon Willard and Rémi Louf’s Outlines (2023) described an efficient algorithm for constrained generation against arbitrary regular expressions and context-free grammars, and showed that the cost of constraining generation can be made nearly free with the right pre-processing. The technique sits underneath several of the major providers’ implementations.
Jason Liu’s Instructor library popularized the validate-and-retry pattern in the Python ecosystem from 2023 onward. Instructor’s framing (“ask for a Pydantic model, get a Pydantic model back”) became the dominant developer-facing abstraction even in environments that later got native structured-output support, because the typed-interface ergonomics matter independently of the underlying mechanism.
JSON Schema itself, originally drafted by Kris Zyp in 2010 and steered through IETF since, is the substrate every native implementation reads. The fact that the same vocabulary already had a decade of tooling around it is part of why the industry standardized on it rather than inventing a new schema language for LLM outputs.
The “leave room to think” practice (adding free-form reasoning fields inside an otherwise strict schema) emerged from the agentic-coding practitioner community through 2024 and 2025 as teams discovered that strict-schema-only responses underperformed on reasoning-heavy tasks. The technique has no single canonical author; it converged independently in multiple frameworks.
Further Reading
- OpenAI, “Structured model outputs” — the canonical vendor reference for response_format: json_schema with strict mode, including the published 40%-to-100% compliance result.
- Cohere, “How do Structured Outputs Work?” — a vendor-neutral explanation of the mechanism that’s useful even if you’re not using Cohere.
- LiteLLM, “Structured Outputs (JSON Mode)” — the cross-provider abstraction’s documentation, which is also the most concise inventory of which providers support which features.
- Snyk, “Building Safer AI Agents with Structured Outputs” — the security framing: how a constrained response shape closes injection-attack surfaces that a free-text response leaves open.