Circuit Breaker

Pattern

A named solution to a recurring problem.

A circuit breaker is a stateful guard that stops calling a failing dependency or runaway loop once it crosses a failure threshold, fails fast while the path is broken, and probes for recovery before letting traffic through again.

Also known as: Trip Switch, Fail-Fast Breaker

Where the name comes from

The name is borrowed from the panel in your house. A household circuit breaker trips when current crosses a safe threshold, cutting power before the wiring overheats, and you flip it back on once you’ve fixed whatever caused the overload. The software version does the same job for calls instead of current: it cuts the path when failures cross a threshold, then lets you (or a timer) close it again once the trouble clears. The metaphor carries the whole pattern, so the rest of this article is mostly about what “trips” it and how it decides the path has recovered.

Understand This First

  • Back-Pressure (Agent) — back-pressure paces a call path; the circuit breaker is the last-resort mechanism that stops it entirely.
  • Fail Fast and Loud — the breaker’s open state is failing fast: refuse immediately rather than hang on a dead path.
  • Cascade Failure — the systemic failure the breaker exists to interrupt.

Context

This is an operational pattern that lives wherever an agent repeatedly calls something that can stop working: a tool, a sub-agent, an external API, or the agent’s own act-observe loop. It sits next to Back-Pressure, Retry Budget, and Cascade Failure, and you reach for it once an agent runs beyond a single supervised session.

The idea comes straight from distributed systems, where it’s one of the standard stability patterns for protecting a service from a dependency that has gone bad. If you’ve worked with one before, the agent version will feel familiar, because it’s the same mechanism scoped down to a single tool or loop. What’s new is the failure it has to catch. A distributed-systems breaker mostly worries about a slow or erroring dependency. An agent breaker worries about that too, plus a class of failure the agent generates itself: a loop that keeps making the same doomed call all night, with a meter running on every attempt.

Problem

How do you stop an agent from grinding away on something that isn’t working, before the grinding does real damage?

An agent left to run unsupervised has a characteristic failure that classical software doesn’t. It calls a tool, the tool fails, the agent reasons that it should try a slightly different approach, calls again, fails again, and settles into a tight loop making thousands of near-identical failing calls. Nobody is watching at 3 a.m. By morning the agent has burned a week’s model budget, made no progress, and possibly left the target system in a worse state than it found it.

A retry budget caps the retries of any one operation, but the agent isn’t retrying one operation. It’s failing across a hundred slightly different ones against the same broken dependency, and each individual attempt looks reasonable. What’s missing is something that watches the path, the whole stream of calls to that tool, and says: stop, this isn’t working, come back later.

Forces

  • Transient failures clear; persistent ones don’t. A breaker that trips on the first error is useless, and one that never trips is just an unbounded loop. The threshold has to tell ongoing trouble apart from noise.
  • The agent can’t be trusted to police itself. A confused agent will reason its way around a soft “please stop” instruction in its own context. The guard has to live outside the agent’s control or it isn’t a guard.
  • Stopping has a cost too. Every call the breaker refuses while open is a call that might have succeeded. Trip too eagerly and you’ve made the system flaky on purpose.
  • Recovery is a guess. The breaker can’t know the dependency is healthy again without testing it. The test might fail, or worse, it might stampede a service that’s barely back on its feet.
  • The trip signal isn’t always “errors.” For an agent, the runaway condition is sometimes cost per hour, or the same call repeated with near-identical arguments, not an error rate at all.

Solution

Wrap the failing path in an out-of-band state machine. It trips open on an aggregate failure signal, fails fast while open, and probes once before closing. The classic shape has three states.

  • Closed. Calls flow normally. The breaker counts failures (and, for an agent, watches cost velocity and repeated-call signals). This is the healthy default.
  • Open. The path has crossed its threshold. The breaker refuses every call immediately and returns a fast failure instead of letting the agent wait on a dead dependency. It stays open for a cooldown period.
  • Half-open. After the cooldown, the breaker lets exactly one trial call through. If it succeeds, the breaker closes and traffic resumes. If it fails, the breaker snaps back to open and waits again.

The same transitions as a Mermaid state diagram:

stateDiagram-v2
    [*] --> Closed
    Closed --> Open: failure threshold crossed
    Open --> HalfOpen: cooldown elapsed
    HalfOpen --> Closed: trial call succeeds
    HalfOpen --> Open: trial call fails
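
To make the three states concrete, here is a minimal Python sketch of the state machine. It assumes the simplest trip signal (consecutive failures) and a single sequential caller. The names `CircuitBreaker`, `CircuitOpenError`, `failure_threshold`, and `cooldown_seconds` are illustrative choices, not taken from any particular library, and the default values are placeholders.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(RuntimeError):
    """Raised instead of calling the dependency while the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, cooldown_seconds=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # illustrative default
        self.cooldown_seconds = cooldown_seconds    # illustrative default
        self.clock = clock                          # injectable so tests can control time
        self.state = State.CLOSED
        self.consecutive_failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.cooldown_seconds:
                # Fail fast: the dependency is never touched.
                raise CircuitOpenError("breaker open; failing fast")
            # Cooldown elapsed: let this one call through as the trial.
            self.state = State.HALF_OPEN
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        self.consecutive_failures = 0
        self.state = State.CLOSED

    def _on_failure(self):
        self.consecutive_failures += 1
        # A failed trial call re-opens immediately; otherwise wait for the threshold.
        if self.state is State.HALF_OPEN or self.consecutive_failures >= self.failure_threshold:
            self.state = State.OPEN
            self.opened_at = self.clock()
```

This sketch is not thread-safe. With concurrent callers, the half-open transition needs a lock so that only one trial call gets through.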

Four decisions turn that skeleton into a working breaker:

  • Pick the trip signal. Consecutive failures, an error rate over a sliding window, spend-per-hour over a ceiling, repeated near-identical calls, or an iteration count that keeps climbing without progress can all signal that the path is broken.
  • Pick the scope. A breaker can wrap one tool, one sub-agent, one act-observe loop, or the whole agent. Finer scope isolates better but multiplies the breakers you maintain.
  • Decide what the agent does while open. Fall back to an alternative, escalate to a human, or stop and report. An open breaker that leaves the agent spinning on the refusals has solved nothing.
  • Enforce it out of band. The breaker must run in the harness or the tool layer, not as an instruction in the agent’s prompt. A confused agent can’t disable the thing that’s protecting it.
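
One of the agent-specific trip signals above, repeated near-identical calls, can be sketched with the standard library alone. This is one possible reading of "near-identical", not a standard definition: it compares the newest call against every call in a short window using `difflib.SequenceMatcher`. The class name, the window of 10, and the 0.9 similarity cutoff are all assumptions to tune.

```python
import json
from collections import deque
from difflib import SequenceMatcher


class RepeatedCallDetector:
    """Flags a tool whose recent calls all look nearly the same as the newest one."""

    def __init__(self, window=10, similarity=0.9):
        self.window = window          # assumed window size
        self.similarity = similarity  # assumed cutoff, 1.0 means identical
        self.recent = deque(maxlen=window)

    def record(self, tool_name, arguments):
        """Record one call; return True if the breaker should trip."""
        # Canonical JSON so that key order does not affect the comparison.
        key = tool_name + " " + json.dumps(arguments, sort_keys=True)
        self.recent.append(key)
        if len(self.recent) < self.window:
            return False  # not enough history yet
        return all(
            SequenceMatcher(None, key, earlier).ratio() >= self.similarity
            for earlier in self.recent
        )
```

The harness would call `record` on every tool invocation and open the breaker when it returns `True`, regardless of whether the individual calls succeeded or failed.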

Warning

A circuit breaker written as a soft instruction in the system prompt (“if a tool fails repeatedly, stop calling it”) is not a circuit breaker. It’s a suggestion the agent can and will reason past once it’s confused, which is exactly when you need the guard to hold. The breaker only works if it lives where the agent’s reasoning can’t reach it.

How It Plays Out

A developer leaves a coding agent running overnight on a long migration. One of the tools it calls is a flaky internal service that has started returning 500s on every request after a bad deploy. With no breaker, the agent treats each failure as a fresh problem to reason about, rewords the call, and tries again, about four times a second. By morning it’s logged 90,000 failed calls and spent the better part of a hundred dollars going nowhere. The fix is a per-tool breaker that trips after twenty consecutive failures, refuses calls for fifteen minutes, then probes. The next time that service breaks, the agent makes twenty calls, the breaker opens, the agent escalates the blocked step to the developer’s queue, and the run pauses instead of burning the night.
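
The fix in that scenario might look roughly like the following harness-side wrapper. It is a sketch under assumptions: `ToolGuard`, the `escalate` callback, and the shape of the returned dictionaries are hypothetical, while the threshold of twenty and the fifteen-minute cooldown come from the scenario. The agent never sees the counters. It only receives the result or a refusal.

```python
import time

FAILURE_THRESHOLD = 20      # consecutive failures before tripping (from the scenario)
COOLDOWN_SECONDS = 15 * 60  # fifteen-minute cooldown (from the scenario)


class ToolGuard:
    """Per-tool breaker that lives in the harness, outside the agent's context."""

    def __init__(self, escalate, clock=time.monotonic):
        self.escalate = escalate  # hypothetical callback, e.g. push to a review queue
        self.clock = clock
        self.failures = {}        # tool name -> consecutive failure count
        self.open_until = {}      # tool name -> time at which a probe is allowed

    def call(self, tool_name, tool_fn, **arguments):
        now = self.clock()
        if now < self.open_until.get(tool_name, 0.0):
            # Open: refuse immediately without touching the tool.
            return {"ok": False, "blocked": True, "tool": tool_name, "error": "circuit open"}
        try:
            result = tool_fn(**arguments)
        except Exception as exc:
            count = self.failures.get(tool_name, 0) + 1
            self.failures[tool_name] = count
            probing = tool_name in self.open_until  # cooldown elapsed, this was the trial call
            if probing or count >= FAILURE_THRESHOLD:
                self.open_until[tool_name] = now + COOLDOWN_SECONDS
                if not probing:
                    # Escalate once, when the breaker first trips.
                    self.escalate(tool_name, repr(exc))
            return {"ok": False, "blocked": False, "tool": tool_name, "error": repr(exc)}
        # Success closes the breaker and clears the counters.
        self.failures[tool_name] = 0
        self.open_until.pop(tool_name, None)
        return {"ok": True, "result": result}
```

A harness would treat `blocked: True` as a signal to pause the run or route around the tool, rather than handing the refusal back to the agent to retry.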

A team runs a research agent that fans out to a paid search API. A pricing change quietly drops their rate limit, and the agent’s normal fan-out now trips the limit constantly, with every rejected call still billing a fraction of a cent. Error rate alone wouldn’t catch it fast enough, so they wire the breaker to a cost-velocity signal: if spend on that API crosses a dollar a minute, trip open for five minutes. The breaker now stops a runaway before it becomes a runaway bill, and the five-minute cooldown gives the limit window time to reset before the half-open probe tries again.
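
A cost-velocity signal like the one in that scenario can be sketched as a sliding window over billed amounts. The class and parameter names here are illustrative, and the sketch assumes the harness can see the cost of each call as it happens. The dollar-a-minute ceiling is the scenario's own number.

```python
from collections import deque


class CostVelocityTrip:
    """Trips when spend inside a sliding time window crosses a ceiling."""

    def __init__(self, ceiling_dollars=1.00, window_seconds=60.0):
        self.ceiling = ceiling_dollars
        self.window = window_seconds
        self.charges = deque()  # (timestamp, dollars), oldest first

    def record(self, timestamp, dollars):
        """Record one billed call; return True if the breaker should trip."""
        self.charges.append((timestamp, dollars))
        # Drop charges that have aged out of the window.
        while self.charges and self.charges[0][0] <= timestamp - self.window:
            self.charges.popleft()
        # Re-summing each time keeps the sketch simple and avoids running-total drift.
        spent = sum(amount for _, amount in self.charges)
        return spent > self.ceiling
```

Rejected calls that still bill are recorded like any other call, which is why this signal catches the runaway even when an error-rate signal would be slow to react.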

A platform team protects a shared code-execution sandbox that several agents hit at once. When the sandbox host gets overloaded, calls start timing out, and the agents’ retries pile more load onto a host that’s already drowning: a textbook cascade. A breaker scoped to the sandbox, tripping on an error rate above 50% over the last forty calls, opens the path the moment the host degrades. The agents fail fast and back off, the sandbox gets room to recover, and the single trial call on half-open keeps every agent from stampeding it the instant the cooldown ends.
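
The error-rate signal in that scenario fits in a few lines. The window of forty calls and the 50% ceiling come from the scenario. Waiting for a full window before evaluating is an assumption of this sketch, chosen so that one early failure does not read as a 100% error rate.

```python
from collections import deque


class ErrorRateTrip:
    """Trips when the failure share of the last `window` calls exceeds a ceiling."""

    def __init__(self, window=40, max_error_rate=0.5):
        self.window = window
        self.max_error_rate = max_error_rate
        self.outcomes = deque(maxlen=window)  # True marks a failed call

    def record(self, failed):
        """Record one call outcome; return True if the breaker should trip."""
        self.outcomes.append(bool(failed))
        if len(self.outcomes) < self.window:
            return False  # assumption: do not judge a partly filled window
        return sum(self.outcomes) / self.window > self.max_error_rate
```

Because the deque has a fixed `maxlen`, each new outcome pushes the oldest one out, so the rate always reflects the most recent forty calls.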

Consequences

Benefits. A breaker turns the silent overnight grind on a broken path into a bounded, observable event. The path fails a known number of times, the breaker opens, and the cause shows up in telemetry instead of in next month’s bill. Cost is capped by design rather than by hoping the agent behaves. A breaker also stops a local failure from cascading: the open state is the brake that keeps one bad dependency from dragging its callers down. The half-open probe gives you automatic recovery without a human in the loop, so a dependency that comes back on its own gets picked back up on its own.

Liabilities. The thresholds are tuning parameters, and bad values cause their own failures. Trip too tight and the breaker opens on normal transient noise, making a healthy path flaky. Trip too loose and it never engages when it should. The half-open probe is a genuine hazard. If more than one call gets through on recovery, a fleet of waiting agents can stampede a service the instant it comes back and knock it over again, so the single-probe discipline matters. A breaker is also one more piece of harness to build, monitor, and prune. When it opens and the agent goes quiet, the reason has to surface clearly, or the next person debugging the agent will chase a phantom stall with no idea a breaker is holding the line.
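
The single-probe discipline can be enforced with a small gate that hands the trial call to exactly one caller. This sketch only covers threads inside one process. Agents running in separate processes or on separate hosts would need the same check backed by shared state, which is outside what is shown here. `ProbeGate` and its method names are hypothetical.

```python
import threading


class ProbeGate:
    """Lets exactly one caller make the half-open trial call at a time."""

    def __init__(self):
        self._lock = threading.Lock()
        self._probe_in_flight = False

    def try_acquire(self):
        """Return True for the one caller allowed to probe, False for everyone else."""
        with self._lock:
            if self._probe_in_flight:
                return False
            self._probe_in_flight = True
            return True

    def release(self):
        """Call once the probe has finished, whether it succeeded or failed."""
        with self._lock:
            self._probe_in_flight = False
```

Callers that get `False` should be treated as if the breaker were still open and fail fast, rather than queueing up behind the probe.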

Sources

  • Michael T. Nygard, Release It! Design and Deploy Production-Ready Software (Pragmatic Bookshelf, 2007; 2nd ed. 2018), introduced the circuit breaker as a named stability pattern in its “Stability Patterns” chapter, establishing the closed, open, and half-open states and the failure-threshold-and-cooldown shape this article scopes down to a single tool or loop.
  • Martin Fowler’s “CircuitBreaker” write-up gives the reference articulation of the state machine, including the half-open trial call and the choice of trip signal, and is the most-cited practitioner treatment of how to implement one.
  • Agent-scoped trip conditions such as cost velocity, repeated near-identical calls, iteration count without progress, and out-of-band enforcement emerged from the agentic coding practitioner community through 2026, as teams running unsupervised agents found that error-rate breakers alone missed the runaway-loop and runaway-spend failures unique to autonomous agents.

Further Reading