Tool Sprawl
A single agent’s tool catalog grows past the model’s ability to choose among its members, and accuracy collapses even as the list of capabilities keeps expanding.
Symptoms
- The agent picks the wrong tool for an obvious task, or invents a tool call that doesn’t exist.
- Accuracy drops as the catalog grows. Adding tool number seventeen makes the agent worse at the first sixteen.
- The system prompt balloons. Tool descriptions dominate every turn’s context budget before the user’s message is even considered.
- Step counts rise without the work getting harder. The agent chains three lookups where one would do, because the narrow tools invite chaining.
- Two tools do almost the same thing with different names and slightly different arguments. The agent has to disambiguate every time, and sometimes guesses wrong.
- Nobody on the team can recite the full catalog from memory. New tools get added; old tools never get removed.
- Latency creeps up. Each turn spends more time reading tool descriptions than producing output.
Why It Happens
Every new capability feels free. A narrow tool takes an afternoon to write, solves the immediate problem, and ships. The incremental cost to the catalog looks like zero because no existing tool had to change. Repeat this across a team and a year, and the catalog grows by addition because nothing in the process ever says “retire one first.”
The underlying belief is that models handle tool selection gracefully at any scale. That belief is wrong in an important direction. Tool descriptions sit in the context window and compete with the user’s task for attention. At small catalog sizes the cost is invisible. Past some threshold that no one warns you about, the model’s selection quality degrades faster than each new tool adds value, and the break-even point is much lower than intuition suggests.
Organizational pressure makes this worse. A request for a new capability is easier to answer with “I’ll add a tool” than with “let me redesign two of the existing ones.” Refactoring a tool catalog requires convincing colleagues to change what they depend on. Adding a tool requires convincing no one. The path of least resistance is addition, and sprawl is what that path accumulates into.
The habit of copy-pasting tool definitions from examples compounds the drift. Example catalogs are designed for demos, not production. When a team copies a six-tool starter kit and then adds its own tools on top, the original six become load-bearing because nobody audits whether they still earn their slot.
The Harm
The headline number is accuracy. The most widely discussed 2026 case study came from an engineering team that pared its agent’s catalog from sixteen tools down to one. The reported success rate jumped from 80% to 100%, latency fell from roughly four and a half minutes per task to about a minute and a quarter, and token use dropped by around 40%. Same model, same prompts, same tasks; the only change was the tool surface. The agent got dramatically better at its job by losing capabilities.
That result sounds unreasonable until you look at the mechanism. Tool descriptions are prose the model reads on every turn, and they compete with the user’s request for the model’s attention. Past a threshold, the model starts confusing tools that sound alike, invoking the wrong one, or calling something that doesn’t exist because its cached pattern of “call a tool” is stronger than its memory of which tools the current catalog actually contains. This is context rot with a specific cause: the rot is coming from inside the agent-computer interface, not from the user’s history.
Token cost is the visible tax. Every turn pays for the entire tool catalog’s description whether the task needs it or not. A catalog with forty tools and three-paragraph descriptions can burn a substantial fraction of a modern context window before the agent starts working. For teams running thousands of sessions a day, the arithmetic bites.
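The arithmetic is worth making concrete. A back-of-envelope sketch, in which every number is an illustrative assumption rather than a measurement:

```python
# Back-of-envelope cost of shipping the whole catalog on every turn.
# Every number below is an illustrative assumption, not a measurement.
TOOLS = 40
PARAGRAPHS_PER_TOOL = 3
TOKENS_PER_PARAGRAPH = 75        # rough prose density
CONTEXT_WINDOW = 200_000         # a modern long-context model
SESSIONS_PER_DAY = 5_000
TURNS_PER_SESSION = 8

catalog_tokens = TOOLS * PARAGRAPHS_PER_TOOL * TOKENS_PER_PARAGRAPH
print(f"per-turn overhead: {catalog_tokens} tokens "
      f"({catalog_tokens / CONTEXT_WINDOW:.1%} of the window)")
# → per-turn overhead: 9000 tokens (4.5% of the window)

daily = catalog_tokens * SESSIONS_PER_DAY * TURNS_PER_SESSION
print(f"daily overhead: {daily:,} tokens, before any task work happens")
# → daily overhead: 360,000,000 tokens, before any task work happens
```

Even with these modest assumptions, the catalog bills hundreds of millions of tokens a day that do no task work at all.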
Latency follows token cost, and the step-count inflation piles on top. A catalog split into narrow, single-purpose tools invites chaining, and each chain step costs a full round-trip to the model. Broad, well-designed tools finish work in one or two calls. Narrow, sprawling tools turn the same work into five or eight.
There’s a security dimension the accuracy numbers don’t capture. Every registered tool is a surface that least privilege has to bound. A catalog that exceeds anyone’s working memory also exceeds anyone’s ability to reason about its blast radius. Prompt-injection attacks have more tools to misuse; privilege-escalation chains have more links to find. Sprawl widens the attack surface not because any one tool is bad but because nobody can fit the whole set in their head.
Maintenance cost is the quiet compounding harm. Each tool needs descriptions, schemas, error messages, and tests, and each of those drifts as the catalog grows. The drift isn’t uniform; the tools that get attention get better, and the long tail rots. When the agent’s accuracy drops, diagnosis is expensive because any of forty tools could be the cause.
The Way Out
The corrective habit isn’t minimalism for its own sake. It’s treating the tool catalog like a product surface rather than an append-only list.
Start with the smallest possible tool surface and add only on measured need. Begin with one broad tool if you can — bash, filesystem, a single search — and watch where the agent fails. Add a narrow tool only when the data says the general-purpose one is actually costing accuracy or tokens at meaningful scale. Reverse the default: tools have to earn their seat, not occupy one until someone removes them.
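The reversed default can literally start as a one-entry catalog. A minimal sketch, with all names illustrative and the schema shape following common function-calling conventions:

```python
# The whole catalog on day one: a single broad tool. All names and the
# schema shape (common function-calling style) are illustrative.
CATALOG = [
    {
        "name": "bash",
        "description": "Run a shell command in the project sandbox and "
                       "return stdout, stderr, and the exit code.",
        "parameters": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    }
]

def propose_tool(name: str, description: str,
                 failure_evidence: list[str]) -> bool:
    """A tool earns a seat only with measured failures it would fix."""
    if not failure_evidence:
        return False  # no data, no seat: the default stays reversed
    CATALOG.append({"name": name, "description": description,
                    "parameters": {"type": "object", "properties": {}}})
    return True
```

The gate is the point: an addition with no failure evidence behind it simply doesn’t happen.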
Treat a tool addition like a dependency addition. Before adding, ask whether an existing tool could cover the case with a small schema change. Ask whether two existing tools could consolidate. Ask what the model’s attention budget looks like after this change. Apply bounded autonomy and least privilege from the start; if this tool would be the seventeenth, it had better justify the seat.
Prefer one well-designed tool over many narrow ones when the domain allows. The sixteen-to-one story is an extreme; the general lesson is that consolidated tools with typed schemas often outperform narrow tools with overlapping responsibilities. This is the ACI lesson applied at the catalog level: good interface design reduces the number of choices the agent has to make per turn.
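At the schema level, consolidation can look like collapsing three overlapping lookup tools into one typed tool, where the choice the model used to make between tool names becomes a validated enum argument. A sketch with hypothetical names and fields:

```python
# Before: three overlapping tools the model must disambiguate on every turn.
narrow = ["search_by_ticket_id", "search_by_customer_email", "search_full_text"]

# After: one consolidated tool with a typed schema. The selection the model
# used to make between tools becomes a validated enum argument instead.
# All names and fields here are hypothetical.
consolidated = {
    "name": "search_records",
    "description": "Search the record store by ticket id, customer email, "
                   "or free text. Returns at most `limit` matches.",
    "parameters": {
        "type": "object",
        "properties": {
            "mode": {"type": "string",
                     "enum": ["ticket_id", "customer_email", "full_text"]},
            "query": {"type": "string"},
            "limit": {"type": "integer", "default": 10},
        },
        "required": ["mode", "query"],
    },
}
```

A wrong `mode` value is now a schema validation error the harness can catch, rather than a silent call to the wrong tool.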
Use tool search or on-demand loading for catalogs that genuinely have to be large. Some domains legitimately need dozens of tools, like orchestrators that cross four system boundaries. For those cases, don’t ship the whole catalog into every turn. Load tools into context only when the agent asks for them by name or category. Anthropic’s MCP tool search feature exists for exactly this reason: it’s the infrastructure response to catalogs that outgrew the ship-everything-every-turn approach.
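A minimal sketch of the on-demand pattern: a server-side registry grouped by toolkit, and a single meta-tool that is the only thing always visible. All names are hypothetical, and real implementations such as MCP tool search will differ in the details:

```python
# The full tool registry lives server-side; only a slice is visible per turn.
# Every name here is hypothetical.
REGISTRY = {
    "aws":     [{"name": "aws_describe_instances"}, {"name": "aws_tail_logs"}],
    "tickets": [{"name": "ticket_create"}, {"name": "ticket_comment"}],
    "cmdb":    [{"name": "cmdb_lookup_host"}],
}

# The one tool the agent always sees, regardless of what is loaded.
LOAD_TOOLS_META = {
    "name": "load_tools",
    "description": "Load a named toolkit into context for the current task. "
                   "Available toolkits: " + ", ".join(sorted(REGISTRY)),
}

def visible_catalog(loaded: list[str]) -> list[dict]:
    """What the model sees next turn: the meta-tool plus loaded kits only."""
    tools = [LOAD_TOOLS_META]
    for kit in loaded:
        tools.extend(REGISTRY.get(kit, []))
    return tools
```

The registry can keep growing without the per-turn surface growing with it; the catalog the organization maintains and the catalog the model reads are decoupled.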
Filter tools by mode or phase. An agent that plans in one phase and executes in another doesn’t need the execution tools visible while planning. Separate the catalogs by the work the agent is currently doing. A smaller catalog per phase selects better even if the total tool count is unchanged.
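Phase filtering can be as simple as keying catalogs by the current mode. A sketch with illustrative tool and phase names:

```python
# One logical catalog, split by the work the agent is currently doing.
# Tool and phase names are illustrative.
PHASE_CATALOGS = {
    "plan":    ["read_file", "search_code", "list_directory"],
    "execute": ["read_file", "edit_file", "run_tests", "bash"],
    "review":  ["read_file", "run_tests", "git_diff"],
}

def tools_for_phase(phase: str) -> list[str]:
    """Expose only the current phase's tools; fail loudly on unknown phases."""
    try:
        return PHASE_CATALOGS[phase]
    except KeyError:
        raise ValueError(f"unknown phase: {phase!r}") from None
```

Note that the planning phase never sees `edit_file` at all, which also buys a least-privilege win for free: a planning turn cannot mutate the workspace even if prompted to.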
Run periodic tool garbage collection. Instrument the catalog. Count how often each tool fires across a month of real traffic. Retire the tools that no one calls. Retire the tools that call each other in predictable chains and replace them with one consolidated tool. Treat this as a recurring habit, not a one-time cleanup, the same way Garbage Collection treats the agent fleet. A catalog without pruning is a catalog that sprawls.
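The instrumentation side of this is small. A sketch that flags retirement candidates from an invocation log, with an illustrative threshold:

```python
from collections import Counter

def gc_report(call_log: list[str], catalog: list[str],
              min_calls: int = 5) -> dict[str, list[str]]:
    """Flag retirement candidates from a period of real traffic.

    call_log holds one entry per tool invocation (e.g. from telemetry);
    the min_calls threshold is illustrative and should be tuned to volume.
    """
    counts = Counter(call_log)
    return {
        "retire": [t for t in catalog if counts[t] == 0],
        "review": [t for t in catalog if 0 < counts[t] < min_calls],
    }

# Example: a month in which legacy_read never fired at all.
log = ["search_records"] * 40 + ["edit_file"] * 3
report = gc_report(log, ["search_records", "edit_file", "legacy_read"])
# → {'retire': ['legacy_read'], 'review': ['edit_file']}
```

A second pass worth adding once this exists: look for tools that fire in fixed sequences, since a predictable chain is a consolidation candidate.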
Before you ship a new tool, print the full tool manifest your agent will see on its next turn and count the tokens. If the answer is “more than 10% of the context window before the user says anything,” the catalog is already large enough that adding another tool is likely to make the agent worse, not better.
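That pre-ship check is easy to automate. A sketch using a crude four-characters-per-token estimate (an assumption; substitute your model’s actual tokenizer and window size before trusting the number):

```python
def manifest_within_budget(manifest: str, context_window: int = 200_000,
                           budget: float = 0.10) -> bool:
    """Pre-ship check: does the serialized tool manifest stay under the
    budget share of the window? len/4 is a crude token estimate; swap in
    your model's real tokenizer for anything load-bearing."""
    est_tokens = len(manifest) / 4
    return est_tokens <= budget * context_window
```

Wired into CI, this turns the catalog budget into a failing build: a pull request that adds the tool that crosses the line has to argue for its seat out loud.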
How It Plays Out
A platform team at a mid-sized SaaS company has built what they consider a capable coding agent. Over fifteen months, their in-house catalog has grown from three tools to thirty-one, tracking capabilities requested by product teams. The agent’s benchmark accuracy has been flat for a quarter and declining on newer tasks. Engineers have started adding prompt suffixes like “use read_file_v2, not read_file” to work around confusion. An intern, running an ablation on a whim, discovers that removing twenty-three of the tools and replacing them with a consolidated search and a consolidated edit lifts the same benchmark by eleven points. The team spends a sprint consolidating, retires eighteen tools outright, and finds that their production error rate drops by roughly a third. The budget they thought they needed to train on a larger model was being spent on tool descriptions the model was drowning in.
A DevOps consultancy is building an agent that has to touch six different cloud providers, a ticketing system, a chat platform, and an internal CMDB. They try the consolidation playbook and find it doesn’t transfer: their agent genuinely crosses nine system boundaries, and the “one bash tool” story doesn’t apply because there’s no shell that spans those nine worlds. Instead, they adopt on-demand tool loading: the agent starts with a short catalog of orchestration tools and a single load_tools meta-tool, and it pulls in a cloud-specific or system-specific toolkit only when the current task requires it. The total number of tools the company maintains stays large, but the number visible on any single turn stays small. Accuracy recovers, and the catalog becomes something their platform team can keep extending without fearing that every addition will degrade the fleet.
A solo developer notices their coding agent has gotten flakier over three months. They haven’t touched the agent’s instructions. They have, however, enabled four MCP servers that colleagues recommended, and between the servers and their own custom tools the agent now sees fifty-two tools on every turn. They disable three of the MCP servers to test the hypothesis. The agent becomes noticeably better immediately, and the failure modes they had been blaming on the model (“it keeps forgetting the project conventions”) turn out to have been attention dilution from the tool catalog. They re-enable one of the servers with only the tools they actually use, leave the others off, and make a note to review the catalog quarterly.
Related Patterns
- Scaled up from: Tool – each tool is innocuous on its own; sprawl is what the population of them looks like when no one is pruning.
- Localizes: Agent Sprawl – agent sprawl is the organizational variant (too many agents, nobody tracking); tool sprawl is the single-agent variant (too many tools, the model can’t choose among them).
- Violates: Agent-Computer Interface – ACI design treats the tool surface as a product; sprawl treats it as an append-only log.
- Causes: Context Rot – tool descriptions live in the context window; sprawl is one of the fastest ways to degrade the quality of everything the agent sees.
- Prevented by: Context Engineering – deliberate choices about what enters the context window, including which tools and when.
- Prevented by: Bounded Autonomy – each new tool widens the set of actions an agent can take; bounded autonomy forces the team to ask whether that widening is scoped.
- Prevented by: Least Privilege – a tool that can’t justify its scope can’t justify its seat in the catalog.
- Countered by: Garbage Collection – ongoing sweeps that retire unused or drifted tools before the catalog compounds.
- Related: Harnessability – a well-harnessable environment keeps the agent’s surface small enough to reason about.
- Related: Attack Surface – every tool in the catalog is an entry point; sprawl widens the surface faster than anyone can bound it.
Sources
The term “tool sprawl” entered software vocabulary well before the agentic era. IT operations teams used it through the 2010s to describe organizations accumulating overlapping monitoring, security, and build tools faster than anyone could consolidate. Industry analysts treated it as a governance problem: too many tools mean too many bills, too many dashboards, and too many gaps nobody owns. The agentic usage inherits the word and the diagnosis, then points them at a different surface: the catalog a single agent carries rather than the catalog an organization runs.
The empirical case for aggressive consolidation crystallized in early 2026, when an engineering team cut its agent’s tool count by roughly an order of magnitude and published the before-and-after numbers: accuracy up, latency down, token use down, all on the same model. That widely circulated report gave the pattern a reproducible shape rather than just a slogan, and a cluster of practitioner writing through the first half of 2026 converged on the same name and the same remedy. Independent treatments frame the problem from security, operations, and agent-accuracy angles; they agree that additive catalogs degrade faster than additive thinking expects.
The infrastructure response followed the diagnosis. Frontier labs released on-demand tool loading features so catalogs that must be large can still present small surfaces per turn. That choice validated the framing: the problem isn’t “agents can’t use many tools,” it’s “models can’t choose well among many tools presented all at once,” and the fix is to change what the model sees, not what the organization offers.
The broader lineage is Donald Norman’s line that bad interfaces make users look stupid — Yang et al.’s SWE-agent paper applied that to language-model agents and coined agent-computer interface as the discipline that takes the model’s perceptual limits seriously. Tool sprawl is the failure mode that discipline exists to prevent at the catalog level.
Further Reading
- Anthropic, “Writing tools for agents” – practitioner guide to tool description, consolidation, and response shaping, with an emphasis on the attention budget argument.
- Yang et al., SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering (NeurIPS 2024) – the paper that named ACI and established the empirical pattern that a smaller, better-designed tool surface outperforms a larger, raw one on real software tasks.