Model Routing

Pattern

A reusable solution you can apply to your work.

Match the model to the task so you spend your budget where it matters and your time where it counts.

Understand This First

  • Model – the capability spectrum that makes routing necessary.
  • Tradeoff – routing is a cost/capability/latency tradeoff made at the system level.

Context

At the agentic level, you rarely use just one model for everything. Models vary in cost, speed, and capability. A frontier reasoning model might charge ten times what a fast general-purpose model charges, and take ten times longer to respond. For tasks that need deep reasoning — debugging a subtle concurrency bug, reviewing an architectural decision, writing a security audit — that cost is worth it. For generating boilerplate, formatting code, or filling in documentation from an outline, it’s waste.

Model routing is the practice of directing different tasks to different models based on what each task actually requires. It applies whether you’re a single developer choosing which model to use for a given prompt, a harness that selects models automatically, or an agent team where each member runs on a model matched to its role.

Problem

How do you get good results across a wide range of tasks without burning through your budget on work that doesn’t need your most expensive model?

Using a single frontier model for everything is simple but costly. Using only cheap models saves money but produces worse results on hard tasks. You end up either overspending on routine work or underinvesting in the work that actually needs strong reasoning.

Forces

  • Cost scales with capability. More capable models cost more per token. Using a reasoning model for string formatting is like hiring a surgeon to apply a bandage.
  • Latency scales with capability. Reasoning models with extended thinking take longer to respond. For interactive work where you’re waiting on each response, that delay compounds.
  • Task difficulty varies within a single session. You might move from renaming a variable across files to designing a caching strategy and back. The model that’s right for one is wrong for the other.
  • Quality thresholds differ. A first draft of a test file can tolerate rough edges that a production security review can’t.

Solution

Route each task to the cheapest model that can handle it well. Develop a sense for which tasks need strong reasoning and which don’t, then select models accordingly.

Most developers who’ve tuned their workflow converge on a similar split: a capable but affordable model (Sonnet-class) handles 70-80% of coding interactions, with a frontier reasoning model (Opus-class) reserved for the rest. That ratio alone can cut costs by 60% or more without meaningful quality loss on routine work.

Two questions drive the routing decision. First, does this task require multi-step reasoning? Architecture decisions, complex debugging, and security analysis benefit from a reasoning model. Code generation from a clear spec, mechanical refactoring, and documentation formatting don’t. Second, how much does a mistake cost? Output that goes straight to production or informs an irreversible decision warrants a stronger model. Output that will be reviewed, tested, or used as a rough draft can come from a lighter one.

Latency matters too, though it cuts differently. For interactive work where you’re blocked until the model responds, a faster model keeps you in flow. For background tasks — a subagent running tests, a batch of file searches — cost matters more than speed.
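The decision logic above can be sketched as a small helper that folds the two questions (reasoning depth, cost of a mistake) together with the latency consideration. This is a minimal sketch; the tier names are illustrative, not tied to any provider.

```python
def choose_model(needs_deep_reasoning: bool,
                 mistake_is_costly: bool,
                 interactive: bool) -> str:
    """Pick a model tier from the task's requirements.

    Tier names are placeholders for whatever models you actually use.
    """
    if needs_deep_reasoning or mistake_is_costly:
        return "reasoning-model"   # worth the cost and the wait
    if interactive:
        return "fast-model"        # keep the feedback loop tight
    return "cheap-batch-model"     # background work: optimize for cost
```

Note the ordering: quality requirements trump latency, and latency only matters once you know a cheap model is good enough.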

At the system level, routing takes several forms:

Manual routing is the simplest. You pick the model yourself, switching mid-session or per-task as the work shifts between easy and hard. Most individual developers start here and many stay here. The overhead is low, and the judgment improves with practice.

Rule-based routing moves the decision into the harness or orchestration layer. Code reviews go to the reasoning model; test execution goes to the fast model; documentation goes to the mid-tier. The rules are explicit, predictable, and easy to audit — but brittle when tasks don’t fit the categories cleanly.
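At its simplest, rule-based routing is a lookup table with a fallback. The categories and model names below are illustrative assumptions, not a specific provider's API:

```python
# Explicit routing rules: easy to read, audit, and change -- but brittle
# when a task doesn't fit any category cleanly.
ROUTES = {
    "code_review": "reasoning-model",    # deep multi-step analysis
    "architecture": "reasoning-model",
    "test_execution": "fast-model",      # mechanical, low-stakes
    "boilerplate": "fast-model",
    "documentation": "mid-tier-model",
}

DEFAULT_MODEL = "mid-tier-model"  # fallback for uncategorized tasks

def route_by_rule(task_category: str) -> str:
    """Return the model name for a task category, with a safe default."""
    return ROUTES.get(task_category, DEFAULT_MODEL)
```

The fallback choice matters: defaulting to the mid-tier bounds the damage in both directions when a task slips through the rules.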

Cascading automates the “try cheap first” instinct. The system sends every request to the cheapest viable model and checks the result against a quality gate (a confidence score, a schema validation, a secondary evaluation prompt). If the gate fails, the same request escalates to the next tier. Because most requests pass at the cheap tier, the system spends frontier-model prices only when it has to. One customer support platform cut monthly LLM spend from $42,000 to $18,000 this way, routing simple queries to a fast model and escalating only the complex ones.
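The cascade loop is short enough to sketch in full. This assumes each model tier is a callable and that `passes_gate` is a quality check you supply (schema validation, a confidence threshold, or a secondary evaluation prompt); all names are illustrative:

```python
from typing import Callable, Sequence

def cascade(
    request: str,
    tiers: Sequence[Callable[[str], str]],  # cheapest model first
    passes_gate: Callable[[str], bool],
) -> str:
    """Try each tier in order; return the first answer that passes the gate."""
    answer = ""
    for model in tiers:
        answer = model(request)
        if passes_gate(answer):
            return answer
    # Every tier failed the gate: return the strongest model's attempt
    # rather than nothing, and let downstream review catch it.
    return answer
```

The economics work because the gate check is cheap relative to a frontier-model call, and most requests never reach the expensive tiers.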

Learned routing uses a lightweight classifier — often itself a small model — to examine each request and choose the best model dynamically. The classifier adds a small overhead per request but can reduce total cost by 40-80% compared to a single model. This is the approach large-scale agent systems use when optimizing across thousands of daily requests.
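The shape of learned routing can be sketched as follows. In a real system the classifier is a small model or a trained text classifier; here a keyword stub stands in for it, and the thresholds and model names are illustrative assumptions:

```python
def classify_difficulty(request: str) -> float:
    """Stub for a learned classifier: return a difficulty score in [0, 1].

    A production system would call a small model or a trained classifier
    here; this keyword heuristic only illustrates the interface.
    """
    hard_signals = ("deadlock", "security", "architecture", "race condition")
    score = sum(0.3 for signal in hard_signals if signal in request.lower())
    return min(score, 1.0)

def route_request(request: str) -> str:
    """Map the difficulty score onto model tiers."""
    score = classify_difficulty(request)
    if score >= 0.6:
        return "reasoning-model"
    if score >= 0.3:
        return "mid-tier-model"
    return "fast-model"
```

The classifier call is overhead on every request, so it only pays off when it is much cheaper than the models it routes between, which is why the classifier itself is kept small.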

Tip

When you’re unsure which model a task needs, start with a lighter model. If the result isn’t good enough, escalate to a stronger one. The few extra seconds you spend on hard tasks are repaid many times over by the savings on easy ones.

How It Plays Out

A developer building a REST API uses a fast model for scaffolding endpoint stubs, generating request/response types, and writing the initial test harness. She hits a tricky validation problem involving nested transactions and switches to a reasoning model that can hold the full constraint set in working memory. Once she has a solution, she drops back to the fast model for implementing it across the remaining endpoints. Her total cost for the session is a third of what the reasoning model alone would have charged.

An engineering team configures their agent pipeline with three tiers: a small, fast model for formatting and boilerplate; a mid-range model for feature implementation and test writing; and a frontier reasoning model for architecture reviews and complex debugging. A lightweight router classifies incoming tasks based on keywords and context. Over the first month, API costs drop by 65%. Quality on high-stakes tasks actually improves — the reasoning model’s context window is no longer cluttered with routine work that belonged at a lower tier.

Consequences

The most visible benefit is cost. Teams that route intelligently report 40-80% reductions in model API spending. Those savings change what’s economically viable: tasks that weren’t worth running through a frontier model become affordable when routed to the right tier.

Speed improves in lockstep. When routine tasks zip through a lightweight model, your interactive development loop tightens and background pipelines finish sooner.

The tradeoff is complexity. Every task now carries a routing decision, whether you’re making it yourself, encoding it in rules, or delegating it to a classifier. A bad routing call — sending a hard task to a weak model — produces output that costs more to fix than the routing saved. Over-routing in the other direction (“use the big model just to be safe”) erases the savings entirely. Getting the split right takes experimentation, and the split itself drifts as models improve and pricing changes.

The model field moves fast enough that your routing strategy needs periodic review. A model that was frontier-class six months ago may sit in the mid-tier today, and a new release from a different provider may outperform your current favorite on specific task types.

  • Depends on: Model – the capability spectrum across models is what makes routing necessary.
  • Uses: Tradeoff – every routing decision is a cost/capability/latency tradeoff.
  • Enables: Subagent – subagents are a natural routing boundary; delegate to a cheaper model for focused subtasks.
  • Complements: Agent Teams – team members can run on different models matched to their roles.
  • Informed by: Harness (Agentic) – the harness implements the routing logic.
  • Related: Parallelization – cost savings from routing make large-scale parallelization economically viable.

Sources

  • Micheal Lanham published “The Model Routing Playbook” (February 2026), one of the first practitioner guides organizing routing strategies by task type and providing cost-optimization benchmarks for multi-model workflows.
  • The CLEAR framework for enterprise agentic evaluation (2026) quantified the cost of ignoring routing: systems optimized solely for accuracy were 4.4 to 10.8 times more expensive than cost-aware alternatives that achieved comparable performance.
  • Addy Osmani’s “The Code Agent Orchestra” (2026) documented model tiering in multi-agent setups, where orchestrator agents use reasoning-class models while worker agents use faster, cheaper models for execution-level tasks.