--- slug: retry-budget type: pattern summary: "Cap how often a failed operation may be retried, under what backoff, for which failures, and what happens when the attempts run out." created: 2026-06-20 updated: 2026-06-20 related: idempotency: relation: uses note: "A retry is only safe when the operation is idempotent, so repeated attempts don't duplicate side effects." back-pressure: relation: complements note: "Back-pressure slows the producer; a retry budget caps how much extra load retries are allowed to add." cascade-failure: relation: mitigates note: "Unbounded retries are a classic cascade trigger; a retry budget is one of the brakes that stops a retry storm." checkpoint: relation: uses note: "Checkpoints let a retried run resume from the last good step instead of replaying completed work." externalized-state: relation: uses note: "Persisted step results let the loop skip already-finished work when it retries after a failure." verification-loop: relation: contrasts-with note: "A verification loop acts on a failure signal to improve the next attempt; a retry budget decides whether the same attempt is even worth repeating." autonomous-remediation: relation: used-by note: "The remediation loop's stop condition is a retry budget: repair, verify, and escalate when the budget runs out." agentops: relation: supports note: "Retry counts, exhaustion events, and failure classifications are operational signals AgentOps watches." runbook: relation: complements note: "A runbook names the escalation path a retry budget hands work to once attempts are exhausted." --- # Retry Budget > **Pattern** > > A named solution to a recurring problem. *A retry budget bounds how many times a failed operation may be retried, under what backoff, for which failures, and what happens when the attempts run out.* A tool call times out. A model returns a malformed response. A sandbox command exits non-zero because a dependency wasn't ready yet. The cheapest fix is often to try again, and frequently the second attempt works. So agent frameworks make retries easy, and that ease is the trap. "Try again" with no limit turns a transient hiccup into a runaway loop that burns tokens, duplicates side effects, and pounds on a service that was already struggling. A retry budget is the policy that says how much retrying is allowed before the system gives up and does something else. The word *budget* is doing real work. A budget is finite, it is decided in advance, and when it's gone you stop spending. The same is true here: you set the attempt count, the backoff schedule, and the set of failures worth retrying *before* the loop runs, not in the middle of an incident. ## Understand This First - [Idempotency](idempotency.md) — retrying is only safe when repeating the operation doesn't repeat its side effects. - [Back-Pressure](back-pressure.md) — retries add load, and a budget keeps that added load bounded. - [Cascade Failure](cascade-failure.md) — retry storms are one of the fastest ways a small failure becomes a system-wide one. ## Context This is an **operational** pattern that lives at every boundary where an agent calls something that can fail: a model endpoint, a tool, an API, a queue, a sandboxed command, a sub-agent. It sits next to [Cascade Failure](cascade-failure.md), [Runbook](runbook.md), and [Autonomous Remediation](autonomous-remediation.md), and it depends on the data-and-state patterns that make a repeat attempt safe. The idea isn't new. Site reliability engineering has carried retry budgets for years as a defense against retry amplification, the situation where a struggling backend gets buried under retries from every client at once. What's new is where the pattern now has to apply. Agent runtimes retry at several nested levels at the same time: a single tool call retries, the node that owns the tool call retries, and the whole run can be replayed from a checkpoint. Without a budget at each level, the levels multiply. Three retries at the tool, times three at the node, times two at the run, is eighteen attempts at one operation, and nobody chose that number on purpose. ## Problem Agents fail transiently all the time, and most transient failures do clear on a second try. That makes blind retry tempting and locally correct: each individual retry looks reasonable. The damage is in the aggregate. Unbounded retries against a slow dependency multiply the load exactly when the dependency can least afford it. Retries of a non-idempotent operation duplicate the side effect: two charges, two emails, two rows. Retries of a failure that will *never* clear (a bad credential, a malformed request, a logic bug) waste the entire budget on attempts that were doomed from the first one. So the question isn't "should I retry?" It's sharper than that. Which failures deserve another attempt, how many attempts, how far apart, and what happens to the work that's still failing when the attempts are used up? ## Forces - **Most transient failures clear on retry, but some never will.** A retry policy that can't tell the two apart wastes effort on the hopeless cases. - **Retries help one caller and hurt the shared system.** Every retry is extra load on a dependency that may already be the bottleneck. - **Backoff trades latency for stability.** Waiting between attempts protects the backend but slows the path that would have recovered quickly. - **Synchronized retries are worse than the original failure.** Many clients retrying on the same schedule produce a thundering herd; jitter is what breaks the synchronization. - **Side effects make retries dangerous unless the operation is idempotent.** The safety of the whole pattern rests on being able to repeat the call without repeating its consequences. ## Solution **Classify the failure, retry only the transient classes with bounded attempts and jittered backoff, keep the work idempotent, and route exhausted attempts to compensation or human review.** A retry budget is four decisions made up front. First, classify. Split failures into retryable (timeouts, rate limits, transient network errors, a degraded model response) and non-retryable (authentication failures, malformed input, validation errors, anything that signals a permanent condition). Retry the first group. Fail fast on the second: retrying a bad credential just spends the budget for nothing. Second, cap the attempts. Pick a maximum: often three to five for a tool call, lower for an expensive model call. The cap is per logical operation, and when retries nest you set a budget at each level rather than letting them compound silently. Third, back off with jitter. Space attempts out on an increasing schedule (commonly exponential: wait one second, then two, then four) so a struggling backend gets room to recover. Add randomness to each wait so that many clients failing at once don't all retry in lockstep. The jitter is not optional polish; without it, a synchronized retry wave can be the thing that finishes off the backend. Fourth, decide what happens at exhaustion. When the budget runs out, the operation has failed for real. The loop must now do something other than try again: roll back, run a compensating action, hand the work to a [runbook](runbook.md) or a human, or record it as a known failure for later. Combine the budget with [Checkpoint](checkpoint.md) and [Externalized State](externalized-state.md) so a retried run resumes from the last good step instead of replaying work that already succeeded. > **⚠️ Warning** > > Never retry a non-idempotent operation without a guard. If a tool call charges a card, sends a message, or appends a row, a retry after a timeout can fire it twice. The first attempt may have succeeded before the network dropped the response. Make the operation idempotent (an idempotency key, an upsert, a dedup check) before you let anything retry it. ## How It Plays Out An agent runs a nightly data pipeline that calls an enrichment API for each record. The provider has a brief outage and starts returning 503s. The pipeline's retry logic kicks in with no cap, and every worker hammers the recovering API in a tight loop, which keeps it from recovering. The team adds a retry budget: five attempts per call, exponential backoff with jitter, and treat a persistent 503 after the budget as a deferred record written to a dead-letter queue. The next outage barely registers. The recovering API gets breathing room, the budgeted records drain on the following run, and the pipeline never spirals. A coding agent runs a build-and-test step in a sandbox. The first run fails because a container dependency hadn't finished starting, a transient and retryable failure. A retry budget of three with a short backoff catches exactly this case, and the second attempt passes. But when the build fails because the code doesn't compile, the same budget should *not* burn three attempts on a deterministic failure. The fix is classification: a non-zero exit from a flaky-infrastructure signal is retryable; a compiler error is not. Without that split, the agent wastes two extra builds proving the code still doesn't compile, then escalates anyway. A multi-step agent workflow checkpoints after each node. Step three calls a payment service and times out. Because the call is keyed with an idempotency token, the retry is safe: the service recognizes the token, returns the original result, and the workflow continues. The retry budget bounds how long step three may keep trying; when it's exhausted, the run doesn't restart from step one. The checkpoint lets it resume at the failed node, and the budget hands the still-failing step to a human with the full trace attached. > **💡 Tip** > > Emit a metric every time a retry fires and every time a budget is exhausted. A rising retry rate is an early warning that a dependency is degrading before it fails outright, and a spike in exhaustion events tells you a failure class has stopped being transient. Watching retry counts turns a hidden coping mechanism into an operational signal. ## Consequences **Benefits.** A retry budget recovers the cheap, common case (the transient failure that clears on a second try) without letting that convenience turn into a liability. It caps the blast radius of retries, so a struggling dependency isn't buried by the very clients waiting on it. It makes the system's failure behavior legible: there's a known attempt count, a known backoff, and a known escalation path, instead of an unbounded loop nobody chose. And because the budget produces retry counts and exhaustion events, it hands [AgentOps](agentops.md) a leading indicator of trouble that raw error rates miss. **Liabilities.** The pattern adds parameters that have to be tuned, and bad values cause their own problems. Too few attempts and you fail on hiccups that would have cleared; too many and you've reinvented the unbounded loop with extra steps. Backoff adds latency to the recovery path, which is the wrong trade for a user-facing operation that needs to fail fast. The whole thing rests on idempotency, so a team that adds retries without auditing for duplicate side effects has built a duplication machine. And a budget that always quietly escalates to a human can mask a dependency that's been broken for weeks — the retries succeed often enough that nobody notices the failure rate climbing underneath. ## Sources - Google's *[Site Reliability Engineering](https://sre.google/sre-book/handling-overload/)* (O'Reilly, 2016), in its chapter on handling overload, introduced the per-request and per-client retry-budget framing as a defense against retry amplification, including propagating an "overloaded; don't retry" signal so a struggling backend isn't buried by its own clients. - Michael T. Nygard, *[Release It! Design and Deploy Production-Ready Software](https://openlibrary.org/works/OL8848832W/Release_It!)* (Pragmatic Bookshelf, 2007; 2nd ed. 2018), established the operational patterns — circuit breakers, timeouts, bulkheads — that bound failure propagation and that a retry budget composes with at the call boundary. - The exponential-backoff-with-jitter technique is long-standing distributed-systems practice for breaking up synchronized retry waves; it predates any single framework and is treated here as common engineering knowledge. - The pattern's reappearance at node, step, and run boundaries follows from the spread of durable workflow and agent runtimes, where retries, timeouts, and failure handlers are attached as first-class policies around each unit of work. --- - [Next: Socio-Technical Systems](socio-technical-systems.md) - [Previous: Cascade Failure](cascade-failure.md)