Failure Mode

Concept

Vocabulary that names a phenomenon.

A failure mode is a specific, named way a system breaks or degrades, and the word is what lets a team talk about a single way of failing instead of “the system went wrong.”

What It Is

A failure mode is one identifiable way a system breaks, degrades, or produces the wrong result. A database that becomes unreachable is one failure mode. A request that times out before its work completes is another. A disk that fills to capacity is a third. Each names a distinct failure path through the same system, with its own trigger, its own evidence, and its own appropriate response.

The phrase is singular on purpose. Every nontrivial system has many failure modes, and the discipline is to keep them separate rather than collapsing them into “it broke.” A crash and a Byzantine return value are both failures, but the operational response to each is different: a crash you restart from, a Byzantine result you have to detect before it corrupts everything downstream. Without separate names, the team conflates them and ends up with one generic “alert on errors” rule that misses every mode that never raises an error, Byzantine and silent failures among them.

The common failure-mode vocabulary covers the categories practitioners see most often:

  • Crash — the process or component terminates unexpectedly.
  • Timeout — an operation runs longer than its budget and is abandoned.
  • Resource exhaustion — memory, disk, file handles, connections, or threads run out.
  • Data corruption — stored state becomes inconsistent or invalid.
  • Dependency failure — a service or library the system relies on stops working.
  • Byzantine failure — a component returns wrong results while reporting success.
  • Silent failure — work fails but no observable signal records that it did.

The set isn’t exhaustive and isn’t meant to be; a real system catalogs its own failure modes at the level of detail it needs to operate. The vocabulary’s value is that it gives the team a shared starting set of buckets to sort failures into and to argue about.
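As a rough illustration, the vocabulary above can be written down as an enumeration. The sketch below is one possible encoding, not a standard one: the `FailureMode` names mirror the list, and `classify` is a hypothetical helper whose mapping from built-in Python exceptions to modes is an assumption made for the example.

```python
import errno
from enum import Enum
from typing import Optional


class FailureMode(Enum):
    """The seven categories from the list above; a real catalog adds its own."""
    CRASH = "process or component terminates unexpectedly"
    TIMEOUT = "operation exceeds its budget and is abandoned"
    RESOURCE_EXHAUSTION = "memory, disk, handles, connections, or threads run out"
    DATA_CORRUPTION = "stored state becomes inconsistent or invalid"
    DEPENDENCY_FAILURE = "a service or library the system relies on stops working"
    BYZANTINE = "component returns wrong results while reporting success"
    SILENT = "work fails but no observable signal records it"


def classify(exc: BaseException) -> Optional[FailureMode]:
    """Map a few built-in exceptions to a mode (illustrative mapping only).

    Returns None when nothing fits, which marks a failure the catalog
    does not have a name for yet.
    """
    if isinstance(exc, TimeoutError):
        return FailureMode.TIMEOUT
    if isinstance(exc, ConnectionError):
        return FailureMode.DEPENDENCY_FAILURE
    if isinstance(exc, MemoryError):
        return FailureMode.RESOURCE_EXHAUSTION
    if isinstance(exc, OSError) and exc.errno == errno.ENOSPC:
        # "No space left on device": the disk-full case from the text.
        return FailureMode.RESOURCE_EXHAUSTION
    return None
```

Byzantine and silent failures have no branch in `classify` on purpose: by definition they do not raise, so they need detection of a different kind.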

In agentic coding, the same vocabulary applies to the agent. An agent can time out by exhausting its context window, return Byzantine output by hallucinating with confidence, fail silently by claiming a change was made that wasn’t, or crash by hitting a tool error mid-task. Treating the agent as a component with named failure modes is what lets a team design safeguards for each one separately, rather than treating “the agent was wrong” as a single undifferentiated event.

Why It Matters

Without the word, failures get described by their symptoms rather than by what they are: “the dashboard is slow,” “users are getting errors,” “the deploy didn’t work.” Symptoms are not failure modes. Two very different failures can present the same symptom, and the same failure mode can present different symptoms in different contexts. A team that doesn’t separate the failure mode from its presentation ends up debugging blindly: it treats every slow dashboard the same way, when one slow dashboard is a timeout against a dependency and another is resource exhaustion on the server itself.

Naming the mode is also what makes a response designable. A timeout has a different response than a crash: timeouts want retry-with-backoff or a graceful fallback; crashes want a restart and a postmortem. Resource exhaustion wants backpressure or capacity; data corruption wants a rollback and an audit. Until the mode is named, the team can’t argue about which response is appropriate, because they’re not yet talking about the same thing.
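A minimal sketch of a response designed for one named mode: retry with exponential backoff on timeouts, then a graceful fallback. The function name, parameters, and default values are assumptions chosen for illustration. The part that matters is that only `TimeoutError` is handled here; any other exception is a different mode and propagates to whatever response was designed for it.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_timeout_policy(
    operation: Callable[[], T],
    fallback: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry on TimeoutError with exponential backoff, then fall back.

    Exceptions other than TimeoutError are not caught: they belong to
    another failure mode and should not get the timeout response.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except TimeoutError:
            # Back off before the next try, but not after the last one.
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))
    return fallback()
```

The `sleep` parameter exists only so that a test can observe the delays without actually waiting.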

There’s a second-order effect that matters more over time. A team that catalogs its failure modes builds an institutional memory of how its system breaks. Each named mode is a hypothesis the team has tested against reality: sometimes confirmed by a real incident, sometimes refuted by one that didn’t fit any existing category. The catalog grows. A new engineer can read it and learn what this system has actually done in the wild, instead of inheriting “be careful, things can break” as their only briefing.

The concept also reframes what “reliability work” means. Reliability is not the absence of failure; it’s the deliberate handling of known failure modes. A system that has thought through its failure modes and chosen a response for each one is reliable; a system that has only thought about its happy path is fragile, even if it hasn’t broken yet. The word makes that distinction visible.

How to Recognize It

You’re looking at a named failure mode when three things hold: there’s a specific trigger, there’s specific evidence, and there’s a specific category. “The service went down” is not a failure mode; it’s a symptom. “The service crashed because the database connection pool was exhausted, evidenced by ConnectionTimeoutException in the logs and zero successful queries for the next two minutes” is a failure mode (resource exhaustion of the connection pool, triggered by load above provisioned capacity).
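The three-part test can be captured in a small record type. This is a sketch under assumptions: `FailureModeEntry` and `is_named` are hypothetical names, and the check only confirms that all three parts are present, not that they are accurate.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class FailureModeEntry:
    category: str                   # e.g. "resource exhaustion (connection pool)"
    trigger: str                    # what sets the failure off
    evidence: tuple[str, ...] = ()  # signals that show it fired

    def is_named(self) -> bool:
        """True only when category, trigger, and evidence are all present."""
        return bool(self.category.strip() and self.trigger.strip() and self.evidence)


# The connection-pool example from the paragraph above, written as an entry.
pool_exhaustion = FailureModeEntry(
    category="resource exhaustion (connection pool)",
    trigger="load above provisioned capacity",
    evidence=(
        "ConnectionTimeoutException in the logs",
        "zero successful queries for the next two minutes",
    ),
)
```

An entry that carries only an observation such as “the service went down”, with no category or trigger, fails `is_named()`: it is still a symptom.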

Signs that a team is reasoning in failure-mode vocabulary rather than around it:

  • Postmortems classify the incident. The writeup names which failure mode fired, not just what happened. “This was a Byzantine failure: the upstream returned 200 OK with garbage in the response body” is the language of a team that has the vocabulary.
  • Monitors are named by mode, not by service. A team that has “alert: payment-service down” is monitoring symptoms. A team that has “alert: payment-service timeout rate above 1%” and “alert: payment-service 5xx rate above 0.1%” is monitoring modes.
  • Tests exercise specific modes. Chaos tests inject a particular failure (kill the process, drop the connection, fill the disk) rather than just “make something fail.” Each test names which mode it’s exercising.
  • Runbooks branch by mode. The on-call runbook starts with “what kind of failure is this?” and routes to a different response for each category. A runbook that says “if anything looks wrong, page the SRE lead” is one that hasn’t catalogued failure modes yet.
  • Architecture conversations name the modes the design defends against. “We’re choosing a queue here because it lets us survive a downstream timeout without dropping work” names the mode the design is built around. “We’re using a queue because queues are reliable” doesn’t.

The deeper signal is what the team says when a flake or an outage shows up. If the response is “the system was just having a bad day,” the team is missing the vocabulary. If the response is “that was the third disk-full event this quarter; we need to add a watchdog or move the logs off-volume,” the team is reasoning in failure modes.
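To make the monitoring sign concrete, here is a sketch of per-mode alerting over a window of request outcomes. The thresholds come from the example alerts in the list above (1% for timeouts, 0.1% for 5xx); the outcome labels and the `mode_alerts` function are assumptions for illustration, not any particular monitoring system's API.

```python
from __future__ import annotations

from collections import Counter
from typing import Iterable

# Rates above which each mode alerts; values taken from the example alerts.
THRESHOLDS = {"timeout": 0.01, "5xx": 0.001}


def mode_alerts(
    outcomes: Iterable[str],
    thresholds: dict[str, float] = THRESHOLDS,
) -> list[str]:
    """Return the modes whose rate in this window exceeds its threshold."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total == 0:
        return []  # no traffic in the window, so no rate to evaluate
    return [
        mode
        for mode, limit in thresholds.items()
        if counts[mode] / total > limit
    ]
```

Each mode gets its own threshold, so a window can fire the 5xx alert while the timeout alert stays quiet, which a single “service down” check cannot express.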

Note

The most dangerous failure modes aren’t the obvious ones (crash, timeout) but the subtle ones: data that is almost correct, responses that are slightly wrong, processes that succeed but produce garbage. These are the failures that survive testing and reach users, and they’re often the ones the team doesn’t have a name for yet.

How It Plays Out

A weather application depends on a third-party API for forecast data. The team enumerates the failure modes for this dependency before launch: the API can be unreachable (timeout), it can return stale data (data quality), or it can return an error (explicit failure). For timeouts, the app shows the last cached forecast with a “data may be outdated” banner. For stale data, it checks the timestamp on the response and warns the user if the freshness budget is exceeded. For errors, it falls back to a simplified forecast from a secondary source. None of these responses is perfect, but each is a designed answer to a specific mode, and the team can talk about them separately when one of them turns out to be wrong.
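The weather example might look roughly like the sketch below. Everything named here is hypothetical: `Forecast`, `ForecastApiError`, `get_forecast`, the wording of the stale-data warning, and the one-hour `FRESHNESS_BUDGET` (the text does not give a value). The `primary` and `secondary` arguments stand in for whatever client calls the real application makes.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable, Optional

FRESHNESS_BUDGET = timedelta(hours=1)  # assumed value for illustration


@dataclass
class Forecast:
    summary: str
    issued_at: datetime
    warning: Optional[str] = None


class ForecastApiError(Exception):
    """Hypothetical: the primary API answered with an explicit error."""


def get_forecast(
    primary: Callable[[], Forecast],
    secondary: Callable[[], Forecast],
    cache: Optional[Forecast],
    now: datetime,
) -> Forecast:
    try:
        forecast = primary()
    except TimeoutError:
        # Mode: timeout. Show the last cached forecast with a banner.
        if cache is None:
            raise
        return Forecast(cache.summary, cache.issued_at, "data may be outdated")
    except ForecastApiError:
        # Mode: explicit error. Use the simplified secondary source.
        return secondary()
    if now - forecast.issued_at > FRESHNESS_BUDGET:
        # Mode: stale data. Keep the data but warn the user.
        return Forecast(forecast.summary, forecast.issued_at,
                        "forecast is older than the freshness budget")
    return forecast
```

Each branch corresponds to one enumerated mode, so if a response turns out to be wrong only that branch changes.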

A payments team runs a postmortem on a partial outage. The summary line in the writeup names the mode: “Byzantine failure of the rate-limiter: the limiter returned allowed: true for requests it should have blocked, with no error logged.” The writeup separates that mode from the symptoms it produced (duplicate charges on a small fraction of customers) and from the cascading mode it set off downstream (database corruption when duplicate inserts collided on a unique key). Naming the three things separately (the Byzantine mode, the duplicate-charge symptom, and the downstream corruption mode) is what lets the follow-ups split into three separate fixes rather than one vague “make the rate-limiter better.”

A platform team running a coding agent against a large refactor begins to catalog the agent’s failure modes the same way they catalog the rest of their system. The agent silently failing (returning “done” when no change was made) gets a verification gate: every claimed file change is checked against git diff before the agent’s report is accepted. The agent Byzantine-failing (returning code that compiles but is semantically wrong) gets a test-suite gate. The agent timing out (running out of context before finishing) gets a checkpoint-and-resume protocol. The team isn’t trying to make the agent reliable in the abstract; they’re handling each named mode with a specific safeguard.
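The verification gate for the silent-failure mode could be sketched as follows. The function names are hypothetical, and the sketch assumes the agent's report can be reduced to a set of claimed file paths. It relies on `git diff --name-only`, which lists tracked files that differ from the given base; newly created untracked files do not appear in that output, so a real gate would need to account for them separately.

```python
from __future__ import annotations

import subprocess


def changed_files(repo_dir: str, base: str = "HEAD") -> set[str]:
    """Paths that differ from `base` according to `git diff --name-only`.

    Untracked files are not reported by `git diff`.
    """
    result = subprocess.run(
        ["git", "diff", "--name-only", base],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return {line for line in result.stdout.splitlines() if line}


def unverified_claims(claimed: set[str], actually_changed: set[str]) -> set[str]:
    """Files the agent says it changed that show no diff."""
    return claimed - actually_changed
```

A non-empty result from `unverified_claims` means the report is rejected. Keeping the comparison separate from the `git` call keeps the gate's logic testable without a repository.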

Example Prompt

“For our dependency on the payments API, list the failure modes we should plan for: timeout, stale read, 5xx error, rate-limit response, and Byzantine response (200 OK with malformed body). For each mode, propose a detection signal and a fallback behavior. Add a test that simulates each mode.”
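For one of the modes in this prompt, the Byzantine response, a detection signal might be a strict check on the body of every 200 response. The sketch below is an assumption throughout: the payments API's real schema is not given in the text, so `REQUIRED_FIELDS`, `ByzantineResponseError`, and `parse_payment` are hypothetical.

```python
import json

# Hypothetical schema; the real payments API fields are not specified here.
REQUIRED_FIELDS = {"id": str, "amount_cents": int}


class ByzantineResponseError(Exception):
    """The dependency reported success but the body is not usable."""


def parse_payment(status: int, body: str) -> dict:
    if status != 200:
        # An explicit failure is a different mode with its own handling.
        raise RuntimeError(f"explicit failure: HTTP {status}")
    try:
        data = json.loads(body)
    except json.JSONDecodeError as exc:
        raise ByzantineResponseError("200 OK with a non-JSON body") from exc
    if not isinstance(data, dict):
        raise ByzantineResponseError("200 OK but the body is not an object")
    for name, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(name), expected_type):
            raise ByzantineResponseError(
                f"200 OK but field {name!r} is missing or has the wrong type"
            )
    return data
```

Raising a dedicated exception type turns a Byzantine response into an observable signal that a monitor or a test can count.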

Consequences

A team that catalogs and names failure modes ends up with a more honest picture of its system. Each named mode is a place where the design has been thought through and a place where the team can argue about whether the current response is good enough. The picture isn’t comforting (most systems have more failure modes than anyone expected), but it’s actionable: you can decide which modes to prevent, which to detect, which to mitigate, and which to accept.

The catalog also changes what “broken” means in conversation. Once modes have names, “the system is broken” stops being a single thing; it’s “we’re seeing the database-connection-exhaustion mode again” or “this is a new mode we don’t have a name for.” That precision compounds over time: incidents teach the team something specific rather than reinforcing a vague sense of fragility.

The cost is the analysis itself. Enumerating failure modes takes time, the catalog needs maintenance as the system changes, and there’s a real judgment call in how granular to be. A failure-mode catalog with 200 entries and no monitors is theater; one with five entries and live monitors on each is real reliability work. Most teams err toward over-categorizing on paper and under-instrumenting in production. The discipline is to keep the catalog small enough to act on and tied to actual signals.

There’s also a coverage limit nobody escapes. The catalog only contains modes you’ve thought of; the next incident is often a mode you hadn’t named. That’s not a refutation of the discipline; it’s why the catalog has to grow with the system. The point isn’t to enumerate every possible failure in advance; it’s to have a working vocabulary that lets the team absorb new failures as additions to the catalog rather than as undifferentiated chaos.

Sources

  • The technique of systematically enumerating ways a system can fail comes from Failure Mode and Effects Analysis (FMEA), codified by the U.S. military in the 1949 procedure MIL-P-1629: Procedures for Performing a Failure Mode, Effects and Criticality Analysis and adopted by NASA contractors during the Apollo program in the 1960s. The catalog-of-modes approach used here — list each way the component can break, then choose a response — is the software engineer’s inheritance from that tradition.
  • Charles Perrow’s Normal Accidents: Living with High-Risk Technologies (Basic Books, 1984; updated edition, Princeton University Press, 1999) supplied the framing that failures in tightly coupled, complex systems are not exceptional events but expected outcomes, and that they tend to cascade through component interactions in ways no single designer foresaw. The cascading-modes intuition in this article is Perrow’s argument compressed to a sentence.
  • The Byzantine failure category named above comes from Leslie Lamport, Robert Shostak, and Marshall Pease’s The Byzantine Generals Problem (ACM Transactions on Programming Languages and Systems, 1982), which formalized the worst-case mode in which a component reports success while producing arbitrary or contradictory results, the failure that survives most testing because the component never says it failed.
  • Werner Vogels’s “Everything Fails All the Time” (Communications of the ACM, February 2020) is the modern statement of the design-for-failure mindset behind this article: in distributed systems, dependencies will fail, and the engineering job is to plan responses for each mode rather than to prevent failure outright. Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy, eds., Site Reliability Engineering: How Google Runs Production Systems (O’Reilly, 2016) is the standard practitioner reference for catalogs of failure modes and the response patterns — graceful degradation, fallback, fail-fast, alerting — sketched above.