Flaky Test Quarantine
Isolate a test that fails intermittently regardless of code changes — tag it, move it to a non-blocking suite, and put it on a fix-or-delete deadline — so its noise stops blocking merges while the failure stays tracked rather than ignored.
Also known as: Test Quarantine, Flaky Test Isolation
You’ve seen the build go red on a change that couldn’t possibly have caused it. You re-run the job, it goes green, and you merge. That re-run is the moment the suite starts dying. A test that fails for no reason teaches everyone (every engineer, and now every agent) that red doesn’t mean broken. Once the team learns to click “rerun” on red, a real failure and a fake one look identical, and the suite has stopped being a gate. Quarantine is the named policy for handling the fake failures without losing the ability to trust the real ones.
Understand This First
- Regression — the article that draws the line between a real broken behavior and a flake; quarantine is what you do once you’ve decided it’s the latter.
- Determinism — flakiness is non-determinism that has leaked into the suite, so the cure for the test is usually a cure for some hidden non-determinism.
- Merge Queue — the machinery a flaky test sabotages, and the reason quarantine is worth the trouble.
Context
This is a tactical pattern that lives inside the test suite and the merge pipeline, next to Test, Smoke Test, and Regression. It applies the moment a suite is large enough and run often enough that some of its tests fail intermittently, which in practice is every suite past a few hundred tests. A flaky test is one that passes and fails on the same code, depending on timing, ordering, shared state, network weather, or the phase of the moon. It is not a regression: nothing real broke. It is not a new bug: the behavior under test is fine. The test itself is unreliable.
The agentic angle sharpens the stakes. When agents generate and run tests at high speed, two things change at once. First, the volume of test runs goes up, so the absolute number of flaky failures goes up with it. Second, a non-deterministic test wedged inside a generator-evaluator or Verification Loop can stall the agent or, worse, send it off to “fix” working code because a flake told it something was broken. The new frontier cuts the other way too: agents are now good at the detection and even the repair, which is what makes quarantine worth formalizing rather than leaving to folklore.
Problem
A flaky test sits in the worst possible spot: it is in the gating suite, so it blocks merges, but it carries no real signal, so blocking is pure cost. Leaving it there trains the team to ignore red, which destroys the value of every other test. Deleting it on the spot throws away whatever real coverage it had and hides the underlying non-determinism it was accidentally detecting. Neither “keep it gating” nor “rip it out” is right.
So what do you do with a test you can’t trust but can’t yet fix or justify deleting? You need a third option: somewhere it can keep running, keep being visible, and keep being on the hook for a fix, without holding the merge queue hostage in the meantime.
Forces
- A flaky failure looks exactly like a real failure until someone investigates, and investigation is expensive.
- Every ignored red build erodes trust in every other test; tolerance for flakes is contagious.
- Deleting a flaky test removes a false alarm but also removes whatever real coverage it had.
- A quarantine that has no exit becomes a graveyard, a place tests go to be forgotten, which is deletion by neglect with extra steps.
- Agents run the suite far more often than humans do, so the cost of each flake is multiplied, and an agent can misread a flake as a defect to repair.
Solution
Move the flaky test out of the gating suite, mark it as quarantined, and put it on a clock to be fixed or deleted. Four steps hold the policy together, and skipping any one of them turns quarantine into either theater or a graveyard.
Detect on evidence, not on vibes. A test is flaky when its pass/fail history shows both outcomes on the same code, not when one engineer remembers it failing once. Keep a rolling history of each test’s results across runs and flag any test whose pass rate sits between “always passes” and “always fails.” This is the detection step agents are now good at: a pass-rate monitor over recent runs is a mechanical signal, and that signal is what justifies quarantine rather than a hunch.
Quarantine by tagging and rerouting. Tag the test (@flaky, a quarantine annotation, a skip-list entry, whatever your framework supports) and route it to a separate suite that still runs but does not gate the merge. The “still runs” half is essential. A quarantined test that stops executing tells you nothing about when it gets fixed or when it gets worse. It runs, its results are recorded, but a failure there does not eject a good change from the queue.
Bound the population and the time. Cap how much of the suite may sit in quarantine at once (a small single-digit percentage is a common ceiling) and put every entry on a deadline, often a sprint or two. The population cap stops quarantine from becoming the place tests go to die; if you are over the cap, you have a suite-health problem to fix before you quarantine one more. The per-test deadline is what makes the next step inevitable.
Resolve at the deadline: fix it or delete it. When a quarantined test’s clock runs out, it has exactly two exits. Either someone fixes the flake (usually by removing the non-determinism, since the test is normally correct and the system or the test’s setup is what’s unstable) and it returns to the gating suite, or it’s deleted outright because nobody could justify the cost of the fix. There is no third exit. A test that is allowed to sit in quarantine indefinitely has been deleted in effect while still consuming compute and attention, which is the worst of both options.
The discipline that ties these together is treating the quarantine population as a health signal, not a parking lot. A growing quarantine, or a quarantine full of overdue tests, is the suite telling you that non-determinism is accumulating faster than you are paying it down. Watch the number the way you watch any other Metric.
Never let an agent “fix” a flaky test by weakening the assertion it guards. Loosening a check until it stops failing converts a noisy-but-honest test into a quiet-and-useless one, and the next real regression sails straight through. A flake fix removes the source of non-determinism; it does not lower the bar the test was holding.
How It Plays Out
A 200-engineer org runs a merge queue that batches changes and tests them together before landing. One end-to-end test fails about one run in twenty because of a race between a background job and the assertion that reads its output. Because the queue tests batches, a single flaky failure kicks the whole batch back, and good changes get re-queued behind a fake failure. The platform team’s CI dashboard flags the test from its pass-rate history, tags it @flaky, and reroutes it to a non-blocking nightly suite. The queue stops ejecting good changes overnight. The test stays on a two-sprint clock; an engineer fixes the race by having the test wait on the job’s completion signal instead of a fixed sleep, and the test returns to the gate. Total quarantine population never exceeds three tests, which is the cap the team set.
A smaller team lets quarantine rot. They tag flaky tests and route them to a skipped suite that never runs, with no deadline. Eighteen months later the “quarantine” holds 140 tests, several of which were guarding behaviors that have since regressed silently. Nobody noticed, because the tests that would have caught it had been quietly disabled. The lesson is the one the policy is built to prevent: a quarantine without a running suite and a deadline is just a slow delete with a paper trail.
A coding agent is working a refactor and the post-change suite goes red on a test that has nothing to do with the change. Without a flake policy, the agent does the dangerous thing: it reads the red as a defect it introduced and starts “fixing” the working code around the test, sometimes succeeding only in breaking it. With a quarantine list the agent can read, the failing test is already tagged @flaky, so the agent knows to treat that result as non-blocking, report it, and proceed, exactly as a seasoned engineer would. The frontier version goes one step further: a repair agent picks the quarantined test off the queue, reproduces the flake by running it many times, isolates the non-determinism, proposes a patch, and routes the patch through an LLM-as-Judge check and human review before the test rejoins the gate. The human stays in the loop precisely because the failure mode (weakening the assertion to make red go away) is one an unsupervised agent will reach for.
“This test fails intermittently on unchanged code; its last 20 runs are 18 green, 2 red. Treat it as flaky, not as a regression: tag it @flaky, move it to the non-blocking suite so it stops gating the merge queue, and open a fix-or-delete ticket with a two-sprint deadline. Do not change the assertion to make it pass. If you can reproduce the flake by rerunning it in a loop, identify the source of non-determinism and propose a fix that removes it.”
Consequences
Benefits. The gating suite goes back to meaning what it says: red means broken. The merge queue stops ejecting good changes over false alarms, which is often the single largest source of wasted developer time in a flaky-suite org. The flaky test keeps running and stays visible, so its eventual fix is tracked rather than forgotten, and the deadline forces a decision instead of letting the test rot. Agents gain a machine-readable signal, the quarantine list, that lets them correctly ignore a known flake instead of chasing it, which removes one of the more dangerous ways an agent wastes a loop or damages working code. And the quarantine population becomes a real-time gauge of how much non-determinism the suite is carrying.
Liabilities. A quarantined test is, by definition, coverage that isn’t gating, so for as long as it sits there the behavior it guards can regress without blocking a merge, which is exactly why the population cap and the deadline aren’t optional. Quarantine also adds machinery: a tag convention, a separate suite, a pass-rate monitor, and a ticketing discipline to enforce the deadline. The worst failure mode is the graveyard, where quarantine becomes a permanent skip-list and the deadlines are never enforced; at that point you’ve institutionalized ignoring red, which is the precise disease quarantine was meant to cure. And the policy creates a standing temptation, for humans and especially for agents, to “resolve” a flake by weakening its assertion, which converts a noisy honest test into a quiet useless one.
When It Fails
The graveyard. Tests get tagged and routed to a suite that never runs, with no deadline. The quarantine grows without bound and becomes a list of silently disabled tests. Remedy: enforce both the population cap and the per-test deadline, and make the quarantine count a visible metric the team reviews.
Assertion-weakening “fixes.” Someone, often an agent, makes the flake stop failing by loosening the check until it always passes. The test now guards nothing. Remedy: require that a flake fix remove a named source of non-determinism, reviewed by a human, and reject any fix that only relaxes the assertion.
Quarantine as the default. The team reaches for quarantine on the first red instead of investigating, so real regressions get tagged @flaky and waved through. Remedy: quarantine only on pass-rate evidence of intermittency, never on a single failure, and treat a test that fails consistently as a regression, not a flake.
Flaky smoke. A check in the Smoke Test suite goes flaky and gets quarantined, quietly hollowing out the fast trusted signal the whole pipeline depends on. Remedy: treat a flaky smoke check as a same-day P1; fix or remove it now, because smoke is the one suite that can’t afford a tolerance for flakes.
Related Articles
Sources
- The three-way framing of flaky-test responses (retry, quarantine, delete) and the mechanics of quarantine (tag the test, move it to a non-blocking job, fix it within a deadline) are standard across modern continuous-integration practice. The discipline of capping the quarantine population and putting every entry on a fix-or-delete clock is what separates a working quarantine from a graveyard.
- The cost case for taking flakiness seriously comes from large-scale engineering practice. Google’s testing teams have written extensively on flaky tests as a tax on every developer, and Microsoft’s published policy of a short fix-or-remove window for flaky tests is the operational ancestor of the deadline in this article; both established that intermittent failures, left alone, erode trust in the entire suite.
- The merge-queue interaction, where a single flaky failure in a batched queue ejects good changes, is documented in the engineering practice of teams running merge queues at scale, where quarantine is wired directly into the queue so flaky tests stop blocking landings while still being reported.
- The agentic frontier, which covers automated detection from pass-rate history and LLM-driven reproduction and repair of flaky tests, is an active research direction. Recent work on agentic test repair (for example, the FlakyGuard line of research, arXiv 2511.14002) reports that an LLM agent can reproduce and repair a substantial fraction of reproducible flaky tests at industry scale, with a meaningful share of its fixes accepted by developers. The standing caution in this article, that an agent must never be allowed to fix a flake by weakening the assertion, is the discipline that keeps that automation honest.