Back-Pressure (Agent)
Back-pressure is the set of pacing mechanisms that keep an agent from overwhelming itself, its tools, or the humans and systems around it.
Also known as: Agent Throttling, Pacing, Rate Control
Understand This First
- Tool – the surface most back-pressure applies to: the calls an agent makes outward.
- Subagent – parallel sub-agents are the most common saturation source.
- Feedback Sensor – back-pressure decisions are driven by sensor signals (latency, error rate, queue depth).
Context
You’re running an agent that can do a lot in a short window. It can fan out parallel sub-agents, hammer an MCP server, retry a flaky tool, fire hooks on every file change, and ask you to approve actions faster than you can read them. Most of the time that throughput is the point. Some of the time it’s the bug.
Back-pressure sits at the agentic and operational level, alongside the other configuration surfaces a harness tunes. Approval Policy and Bounded Autonomy decide what the agent is allowed to do. Back-pressure decides how fast and how often it’s allowed to do it. The two questions look similar from a distance and are answered by completely different mechanisms.
The vocabulary comes from reactive systems. In a streaming pipeline, back-pressure is the signal that flows upstream from a slow consumer back to a faster producer, telling it to slow down before it overruns the buffer. The Reactive Streams specification, Akka, RxJava, and TCP windowing all encode the same idea: the only safe way to couple a fast producer to a slower consumer is to let the consumer push back. Agents are the new fast producers. The tools, APIs, downstream services, and humans they touch are the consumers. The pattern transfers directly.
Problem
How do you keep an agent’s throughput, the very thing it was built to deliver, from becoming its failure mode?
Crank an agent up and characteristic failures appear that don’t look like classical software bugs. A parallel-subagent fan-out hits an API quota in seconds and locks the whole team out for an hour. A Ralph Wiggum Loop spins on a flaky MCP call, racking up token cost without progress. A pre-write hook fires on every edit until the build server can’t keep up. A confirmation-fatigued reviewer (Approval Fatigue) gets buried by approval prompts arriving faster than she can read them and starts pattern-matching her way through. Each one is a rate problem. None of them are caught by the gates that ask whether the action is permitted; the action is permitted, just not at this rate.
Forces
- Throughput is a feature until it isn’t. The same parallel fan-out that finishes a refactor in ten minutes can drain a quota or melt a downstream service. The line between “fast” and “out of control” is rate, not capability.
- Downstream limits are unevenly visible. Some consumers (a rate-limited API) tell you exactly when to slow down. Others (a flaky internal tool, a tired human reviewer) degrade silently and you have to infer the limit.
- Pacing and permission look similar but aren’t. An approval policy that requires sign-off on each destructive command doesn’t slow a benign-looking burst of 200 file edits. A back-pressure cap of five edits per minute does, without changing what’s permitted.
- Static rate limits go stale. A cap that was generous last month can be brittle this month as the codebase, the model, or the tool ecosystem changes. Back-pressure is most useful when it responds to live signals, not just to hard-coded numbers.
- Over-throttling is its own failure. A harness with aggressive back-pressure feels sluggish, drives the human to bypass it, and earns a reputation for getting in the way. The point isn’t to be slow; it’s to be sustainable.
Solution
Treat pacing as a first-class harness surface, separate from permission. For every place the agent talks to something (a tool, an API, a sub-agent pool, a human), name the rate signal you’d use to know it’s saturated, and the response you’d take when it is.
The mechanisms cluster into a few categories:
- Rate limits cap how often a specific tool or API can be called within a window. Useful when the downstream limit is known and stable. Cheap to express; brittle if the limit moves.
- Concurrency caps limit how many things run at once: maximum parallel sub-agents, maximum simultaneous tool invocations, maximum open file handles. The right setting tracks the bottleneck, not the budget.
- Cooldowns insert a minimum gap between successive actions. They smooth bursts and give downstream systems room to breathe. Especially useful between writes, between commits, and between approval prompts shown to a human.
- Queueing with bounded depth lets a producer stay busy while a slower consumer catches up, but caps the queue so a runaway producer can’t accumulate work indefinitely. When the queue fills, the producer blocks.
- Adaptive throttling raises and lowers limits based on observed signals: latency creep, error-rate spikes, 429 responses, sub-agent failure rates. The signal sources come from feedback sensors and AgentOps telemetry.
- Circuit breakers stop a call path entirely once it crosses an error threshold, then probe periodically to see if it has recovered. They’re the last-resort form of back-pressure: when slowing down isn’t enough, stop until something changes. Cascade Failure covers the systemic version of this; the agentic application is the same mechanism scoped to a single tool or sub-agent.
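The first of these mechanisms is small enough to sketch. Here is a minimal token-bucket rate cap in Python; the class and its names are illustrative, not taken from any particular harness. Refilling continuously rather than per-window means a short burst drains smoothly instead of slamming into a hard reset:

```python
import time

class TokenBucket:
    """Caps call rate; refills continuously so bursts drain smoothly."""

    def __init__(self, rate_per_sec: float, burst: int, clock=time.monotonic):
        self.rate = rate_per_sec        # tokens added per second
        self.capacity = burst           # maximum burst size
        self.tokens = float(burst)
        self.clock = clock              # injectable for testing
        self.last = clock()

    def try_acquire(self) -> bool:
        now = self.clock()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # caller should wait or queue, not drop silently
```

The injectable clock matters for the harness itself: a throttle you can’t unit-test with a fake clock is a throttle you’ll never tune with confidence. A cooldown is the degenerate case of the same bucket, with `burst=1`.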
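Bounded queueing usually needs no custom code at all. A sketch of the idea using only Python’s standard library, with a stand-in list in place of a real slow tool call:

```python
import queue
import threading

DEPTH = 4
work = queue.Queue(maxsize=DEPTH)   # producer blocks once 4 items are pending
done = []

def consumer():
    while True:
        item = work.get()
        if item is None:            # sentinel: shut down
            break
        done.append(item)           # stands in for a slow tool call
        work.task_done()

t = threading.Thread(target=consumer)
t.start()
for i in range(100):
    work.put(i)                     # blocks when full: the back-pressure signal
work.put(None)
t.join()
```

The `put` that blocks on a full queue is the whole pattern in one line: the consumer’s slowness propagates upstream and paces the producer automatically, with no counters or timers in sight.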
A useful question when you’re designing the harness: if this part of the agent ran twice as fast tomorrow, what would break first? The answer names where back-pressure belongs.
Don’t tune back-pressure in the abstract. Tune it after a near miss. The shape of the failure tells you which mechanism fits: a rate-limit response from a vendor wants a rate cap, a thrashing Ralph Wiggum loop wants an error-rate circuit breaker, a buried human wants a cooldown on approval prompts. Generic global limits set in advance tend to be either too loose to help or too tight to live with.
How It Plays Out
A small team builds a refactoring agent that fans out into eight parallel sub-agents, one per module. The first run finishes in twelve minutes and feels like magic. The second run, on a larger refactor, fires off the same eight sub-agents and they collectively make 2,400 calls to the team’s GitHub MCP server in under a minute. GitHub’s secondary rate limit kicks in and locks every developer on the team out of the API for the next hour. The fix isn’t to give up on parallel sub-agents; it’s to add a concurrency cap (no more than three sub-agents holding a GitHub-MCP slot at once) and a per-sub-agent rate cap (one MCP call per second). The next big refactor takes seventeen minutes instead of twelve. Nobody loses their afternoon.
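The fix in that story can be sketched with a semaphore. The slot count and the one-call-per-second gap are the values from the anecdote; everything else (`run_subagent`, `call_mcp`) is a hypothetical name for illustration:

```python
import threading
import time

GITHUB_SLOTS = threading.Semaphore(3)   # at most 3 sub-agents hold a GitHub-MCP slot
MIN_GAP_SEC = 1.0                       # one MCP call per second per sub-agent

def run_subagent(tasks, call_mcp, sleep=time.sleep):
    with GITHUB_SLOTS:                  # blocks until a slot frees up
        for task in tasks:
            call_mcp(task)
            sleep(MIN_GAP_SEC)          # per-sub-agent rate cap
```

Note that the semaphore guards the scarce resource (the MCP server), not the sub-agent pool itself: all eight sub-agents can still exist, but only three at a time can be talking to GitHub.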
A solo developer leaves a Ralph Wiggum Loop running overnight on a long migration. One of the tools the agent calls is a flaky third-party API that succeeds about 40% of the time. By morning the agent has burned through $90 of model spend, made no real progress beyond the fifth task in the plan, and the tool is in a worse state than when it started, with a poisoned-cache pattern of half-completed retries. The retrofit is two pieces: a per-tool error-rate sensor that notices the API has dropped below 60% success over the last twenty calls, and a circuit breaker that pauses calls to that tool for thirty minutes once the threshold trips. The next morning the loop finishes the migration, having paused twice when the tool went bad and resumed when it recovered.
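The retrofit described above can be sketched as a sliding-window breaker. The thresholds (60% success over the last twenty calls, thirty-minute pause) are the ones from the story; the class and its method names are invented for the sketch:

```python
import time
from collections import deque

class ErrorRateBreaker:
    """Opens when success rate over the last N calls drops below a floor,
    then stays open for a cooldown before allowing calls again."""

    def __init__(self, window=20, min_success=0.6, open_secs=1800,
                 clock=time.monotonic):
        self.results = deque(maxlen=window)   # rolling record of call outcomes
        self.min_success = min_success
        self.open_secs = open_secs
        self.clock = clock                    # injectable for testing
        self.open_until = 0.0

    def allow(self) -> bool:
        return self.clock() >= self.open_until

    def record(self, ok: bool):
        self.results.append(ok)
        if len(self.results) == self.results.maxlen:
            rate = sum(self.results) / len(self.results)
            if rate < self.min_success:
                self.open_until = self.clock() + self.open_secs
                self.results.clear()          # start fresh after the pause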
A reviewer using a harness with aggressive approval policy gates finds himself approving thirty changes an hour and starting to rubber-stamp. The right response isn’t to weaken the policy; the changes really do want sign-off. The right response is to add back-pressure to the prompt rate. The harness queues approval requests, batches them into review windows every fifteen minutes, and shows them in a single diff view rather than as individual interruptions. Same approvals, different cadence. The reviewer’s accuracy comes back, Approval Fatigue recedes, and the agent doesn’t notice. It sees the same gate, just answered in batches.
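The batching side of that fix can be sketched the same way. The fifteen-minute window is from the story; the class and its method names are hypothetical:

```python
import time

class ApprovalBatcher:
    """Collects approval requests and releases them in windows,
    so the reviewer sees one batched diff instead of a drip of prompts."""

    def __init__(self, window_secs=900, clock=time.monotonic):
        self.pending = []
        self.window_secs = window_secs        # 900s = the 15-minute window
        self.clock = clock                    # injectable for testing
        self.window_start = clock()

    def submit(self, request):
        self.pending.append(request)          # agent-side: never blocks

    def due_batch(self):
        """Return the pending batch if the window has elapsed, else None."""
        if self.clock() - self.window_start < self.window_secs:
            return None
        batch, self.pending = self.pending, []
        self.window_start = self.clock()
        return batch
```

The design choice worth noticing: the agent’s `submit` never blocks, because the thing being paced here is the human-facing prompt rate, not the agent’s production of approval requests.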
Consequences
When back-pressure is in place, an agent’s failure modes change shape. Saturation incidents stop being surprises and become observable events: latency creeps, the throttle engages, the agent slows, telemetry surfaces the cause. Cost becomes more predictable because the worst-case rate is bounded by design rather than by hoping the agent stays well-behaved. Human reviewers stop being a leakage point in the steering loop, because the prompts hit them at a rate they can actually process. And paradoxically, well-tuned back-pressure often increases end-to-end throughput on long tasks, because the agent stops triggering the recovery delays (rate-limit lockouts, retried failed calls, cleanup of half-finished work) that swallow more time than the original throttle would have cost.
The costs are real. Back-pressure is another harness surface to design, monitor, and prune as the codebase and tools change. Static caps go stale and need attention. Adaptive throttling needs reliable feedback signals, and getting those signals wrong (counting transient errors as real ones, missing latency creep) makes the throttle either too eager or asleep. There’s a discoverability problem too: when the agent gets slow because back-pressure engaged, the cause has to surface clearly, or the next person looking at the harness will be debugging a phantom. Logging when a throttle activates, and why, is part of the pattern, not an afterthought.
There’s also a cultural risk. A team that adds back-pressure aggressively without naming the underlying constraint can end up with a harness that feels arbitrary: full of caps and cooldowns whose original justifications were lost. Every back-pressure mechanism should have a one-line note explaining what saturation it’s protecting against. When the protected resource changes, the cap can change with it. When the resource is gone, the cap goes too. Garbage Collection applies here as much as it does to memory.
Related Patterns
- Distinct from: Approval Policy – approval policy gates what the agent can do; back-pressure paces how fast and how often.
- Distinct from: Bounded Autonomy – bounded autonomy scopes the kinds of action permitted; back-pressure regulates the rate of permitted actions.
- Configured by: Harness Engineering – back-pressure is one of the named configuration surfaces a harness engineer tunes.
- Applies to: Tool, Hook, MCP – the call paths most often in need of pacing.
- Applies to: Subagent, Parallelization – parallel sub-agents are the most common source of accidental saturation.
- Senses with: Feedback Sensor – back-pressure decisions need live signals: latency, error rate, queue depth, downstream 429s.
- Monitored by: AgentOps – the operational telemetry that surfaces saturation and the throttles that engaged in response.
- Detects: Ralph Wiggum Loop failure modes – error-rate-based back-pressure catches loops thrashing on a bad tool before they burn the budget.
- Prevents: Approval Fatigue – pacing and batching the rate of approval prompts protects the human reviewer from the volume that defeats them.
- Companion to: Cascade Failure – cascade failure is the systemic outcome back-pressure prevents at the agent scale; the same mechanisms (circuit breakers, bulkheads, queue limits) appear in both.
- Cooperates with: Garbage Collection – pacing rules need pruning when the resource they protected goes away.
Sources
The conceptual ancestor is the reactive-systems literature. The Reactive Streams specification, published in 2014 and 2015 by a consortium of JVM-platform vendors, established back-pressure as a first-class signal in async data pipelines, a response to Erik Meijer’s argument that asynchronous boundaries can’t be made safe without explicit back-pressure. Akka and RxJava are the most widely used reference implementations; TCP’s sliding-window flow control is the same idea expressed at the network layer.
Michael Nygard’s Release It! (Pragmatic Bookshelf, second edition 2018) is the canonical practitioner treatment of how rate-related failures actually look in distributed systems and what to do about them. The “Stability Patterns” chapter introduces circuit breakers, bulkheads, and timeouts as the working vocabulary; this article treats them as the agent-scoped applications of the same ideas.
The naming of back-pressure as a distinct configuration surface for coding agents is newer. It emerged in the agentic coding practitioner literature of early 2026, as writers working on harness engineering started listing pacing alongside instructions, tools, sub-agents, hooks, and governance rather than folding it into one of those categories. That enumeration is still unsettled; this article treats back-pressure as its own surface for the same reason the reactive-systems community did — the mechanisms don’t fit anywhere else cleanly.
The “alert fatigue” framing for the human-pacing case (and the resulting need to throttle approval prompts rather than approval scope) comes out of the clinical decision-support and security-operations literatures, where reviewers facing high-volume repetitive alerts were the first populations studied at scale. Goddard, Roudsari, and Wyatt’s 2012 paper on automation bias in clinical decision-support systems is the most-cited academic anchor.
Further Reading
- Reactive Streams Specification – the canonical articulation of back-pressure as a first-class signal in async pipelines, and the source of the vocabulary this article borrows.
- Michael Nygard, Release It! (2nd ed., 2018) – the practitioner reference for the failure modes back-pressure protects against, with circuit breakers and bulkheads as core tools.
- Erik Meijer, “Your Mouse is a Database” – the 2012 ACM Queue piece that argued back-pressure is what makes async composition safe.