Regression

A regression is a behavior that used to work and no longer does, broken not by an intentional requirements change but by an unrelated edit, and the word is what lets a team distinguish that category of defect from every other way software fails.

Concept

Vocabulary that names a phenomenon.

What It Is

A regression is the reappearance of incorrect behavior in software that previously behaved correctly. The defining feature isn’t the bug itself; it’s the prior baseline. Yesterday the feature worked, today it doesn’t, and nobody asked for the change. Something in the codebase moved, and a behavior that was working got dragged along with it.

The word has two related meanings in practice, and they’re worth keeping straight.

The defect category. A “regression” is any bug that fits the pattern above: a previously working behavior that’s broken, where the breakage is a side effect of an unrelated change rather than an intended one. The change might be a feature addition, a refactor, a dependency upgrade, a configuration tweak, or an agent-generated edit. The class is defined by what happened to a working behavior, not by what was being changed.
The defense. “Regression testing” is the practice of rerunning an existing test suite after a change to verify that previously working behaviors still work. In practice the phrase covers two slightly different things: running the whole suite protectively after every change, and writing specific tests that lock in a fix so the same bug can’t return.

Regressions sit next to several adjacent concepts that aren’t quite the same thing. A new bug is a defect in newly-written behavior; it was never working, so there’s no baseline to regress against. A breaking change is an intentional behavior change that breaks consumers; the change is deliberate, just unwelcome. A flaky test is a test that fails intermittently regardless of changes; the baseline is unstable rather than regressed. The word regression is reserved for the case where something that was working isn’t anymore, and the change responsible didn’t mean to touch it.

In agentic coding the same vocabulary applies, with an additional dimension. When an agent edits a codebase, every behavior outside the part the agent is changing is at risk of a regression, because the agent’s mental model of the system’s interconnections is partial at best. An agent that confidently refactors a payment helper and breaks an unrelated reporting job has produced a regression in exactly the classical sense. What shifts is the rate: changes arrive much faster, so potential regressions do too, and the only credible defense is an automated one.

Why It Matters

Software is interconnected. A change to the payment module can break the email notification system. A performance optimization in the database layer can subtly alter query results. An updated dependency can change behavior in ways the changelog didn’t mention. The larger and older the codebase, the more likely that any change will perturb something the change-author wasn’t thinking about. Without a name for that category of defect, teams describe each occurrence as a one-off (“the search feature broke the cart”) rather than as an instance of a recurring class with a known defense.

Naming the class is what makes the defense designable. A team that has the word “regression” in its vocabulary doesn’t argue about whether to invest in automated tests; the tests are the response to a recognized phenomenon. The team writes them as a matter of course, runs them after every change, and treats a failing previously-passing test as the high-signal event it is: not just “a test failed” but “we just broke a behavior that was working.” The vocabulary changes what the failure means.

There’s a second-order effect. Once a team is fluent in regression as a category, it starts to treat every shipped bug as a missing test rather than as a finished story. The fix lands with a new test that would have caught it, and that test joins the suite permanently. The suite becomes a record of every category of failure the team has ever seen, and the rate of repeat-incidents drops accordingly. Without the vocabulary, each incident gets fixed but doesn’t compound into institutional defense.

For agentic workflows the stakes are larger, because the rate of change is larger. A team running coding agents against a substantial codebase will see many more edits per day than a team of humans alone would produce. If those edits are not gated by a credible regression defense, the regression rate scales with the edit rate and the codebase becomes progressively less trustworthy. The vocabulary matters here in a specific way: the team’s job is to keep the agent’s edits inside a feedback loop that catches regressions before they ship, not to make the agent careful enough that regressions don’t happen. The first job is achievable; the second is not.

How to Recognize It

You’re looking at a regression when three things hold together: there’s a behavior that was working at some prior point, that behavior is now broken or wrong, and the change responsible wasn’t aiming at that behavior. All three matter. A behavior that’s “broken” but was never tested may have been broken all along; the prior baseline has to be real. A behavior the team intentionally changed is a breaking change, not a regression. And a behavior that “broke” even though nothing in the codebase changed is usually a flake or an environmental issue, not a regression.

Concrete signs that a regression has occurred:

A previously-green test is now red. The most direct signal. The test passed yesterday and fails today, and the only thing between the two runs is a commit. That commit is the suspect.
A user reports “it worked yesterday.” The phrase is diagnostic. Users who say this are reporting a baseline shift, which is what a regression is. Triage starts by trusting that the baseline was real.
A bisect lands on a non-obvious commit. When git bisect (or its equivalent for agent-edited branches) traces a broken behavior to a commit that doesn’t appear to touch the broken code path, the team has found a regression by definition: the change wasn’t aimed at the broken behavior.
Production telemetry shifts after a deploy. Error rate ticks up, latency P99 climbs, a previously-quiet log line starts firing. The shape is “something changed in deploy X” rather than “something is new.” That shape is regression-shaped.
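
The bisect signal above rests on a simple search. What follows is a minimal sketch of that search, not git’s implementation: it assumes a linear history ordered oldest to newest, a known-good first commit, a known-bad last commit, and a behavior that flips exactly once. The function name, the commit labels, and the is_bad predicate are all hypothetical; in a real repository the predicate would be “check out this commit and run the failing test.”

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the earliest commit for which is_bad() is true.

    Assumes commits are ordered oldest to newest, the first one is known
    good, the last one is known bad, and the behavior flips exactly once.
    """
    good, bad = 0, len(commits) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(commits[mid]):
            bad = mid   # the break happened at mid or earlier
        else:
            good = mid  # the break happened after mid
    return commits[bad]


# Hypothetical history in which the behavior broke at "e5".
history = ["a1", "b2", "c3", "d4", "e5", "f6", "g7"]
broken_since = history.index("e5")
culprit = first_bad_commit(history, lambda c: history.index(c) >= broken_since)
print(culprit)  # e5
```

Because the search halves the range at every step, the number of commits to test grows only logarithmically with the length of the history, which is why a reproducing test is worth writing before the hunt starts.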

A few common patterns are worth recognizing as regression-prone in their own right:

Shared mutable state. A new feature that reuses an existing global, cache, or session store is the textbook regression generator. The new code is correct in isolation; it breaks the old code by changing what the shared state contains or when.
Behavior that was implicit. A function that “happened to” do the right thing because of an undocumented invariant is one refactor away from regressing. The invariant wasn’t enforced, just observed.
Dependency upgrades. A bumped library version is a change to thousands of behaviors, almost all of which the team won’t notice. The ones that matter become regressions.
Agent-generated edits across module boundaries. When the agent edits one file and the broken behavior is in another, the change wasn’t aimed at the broken behavior, which is the definition of the category.
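
The shared-mutable-state pattern in the list above is the easiest one to show in a few lines. This is an illustrative sketch only: the settings dict and both functions are hypothetical, and the point is that the new function is correct when read on its own.

```python
# Module-level state shared by every caller (hypothetical).
settings = {"decimals": 2}


def format_price(amount: float) -> str:
    # Existing behavior: prices render with the configured precision.
    return f"{amount:.{settings['decimals']}f}"


def format_yen(amount: float) -> str:
    # New feature, correct in isolation: yen has no minor unit.
    # The bug is that it mutates the shared setting and never restores it.
    settings["decimals"] = 0
    return format_price(amount)


before = format_price(19.99)  # "19.99"
yen = format_yen(1500)        # "1500"
after = format_price(19.99)   # "20": the old behavior has regressed
```

Nothing in format_price was edited, yet its output changed. A test that pins format_price(19.99) to "19.99" only catches this if it runs after the new code path has been exercised, which is one reason order-dependent shared state is hard to defend.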

The deeper signal is what the team says when something breaks. “We must have regressed that” is the language of a team that has the vocabulary; the response is to find the commit, write the test, and move on. “It’s just acting weird” is the language of a team that doesn’t, and the response is to guess.

Warning

A regression found by a user is a process failure, not just a code failure. If the suite didn’t catch it, ask why, and add the missing test.

How It Plays Out

A team ships a new search feature. Two days later, users report that the shopping cart is dropping items. Investigation shows the search feature introduced a session-handling change that conflicts with the cart’s session logic. Nothing about the cart was edited; it just happened to depend on the session shape the search feature reworked. The team fixes the bug and adds a test (“after adding three items to the cart, the cart contains three items”) that locks in the prior baseline. The next agent or human who edits the session layer will trip that test before the cart breaks again.
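
The cart test from this story might look something like the sketch below. The Cart class and the cart_items session key are hypothetical stand-ins, since the real cart would be imported from the application; only the shape of the test is the point.

```python
class Cart:
    """Hypothetical stand-in for the real cart, backed by a session dict."""

    def __init__(self, session: dict) -> None:
        self._session = session

    def add(self, item: str) -> None:
        self._session.setdefault("cart_items", []).append(item)

    def items(self) -> list:
        return list(self._session.get("cart_items", []))


def test_cart_keeps_all_added_items() -> None:
    # Locks in the baseline: three items in, three items out.
    cart = Cart(session={})
    for item in ("book", "lamp", "pen"):
        cart.add(item)
    assert cart.items() == ["book", "lamp", "pen"]
```

The test says nothing about search. It states the behavior the team has committed to preserving, so it will fail for any future change that breaks that behavior, whatever the change was aiming at.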

A platform team upgrades a JSON-parsing library across the monorepo. The new version is strictly faster, which is why the upgrade was approved, and the suite passes. A week later, an integration partner reports that timestamps in webhook payloads are now off by an hour. The library’s date-parsing default changed between versions; the changelog mentioned it; nobody on the upgrade ticket read that line. The team’s response isn’t “be more careful next time”; it’s to write a test that pins the wire format of timestamps in outbound webhooks, so the next dependency upgrade that changes the format will be a red test instead of a silent regression.
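
A test that pins a wire format compares the exact serialized output, not a parsed value. The sketch below assumes the webhook carries a UTC timestamp in ISO 8601 form with a trailing Z; that format, the payload fields, and the function names are assumptions for illustration, not details from the incident.

```python
import json
from datetime import datetime, timezone


def webhook_payload(event: str, occurred_at: datetime) -> str:
    """Hypothetical outbound webhook body; the real serializer may differ."""
    # Normalize to UTC and format explicitly rather than relying on a
    # library default that an upgrade could change.
    stamp = occurred_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return json.dumps({"event": event, "occurred_at": stamp}, sort_keys=True)


def test_webhook_timestamp_wire_format() -> None:
    occurred_at = datetime(2024, 3, 10, 7, 30, 0, tzinfo=timezone.utc)
    body = webhook_payload("order.paid", occurred_at)
    # Compare against the exact text partners receive, not a parsed value.
    assert body == '{"event": "order.paid", "occurred_at": "2024-03-10T07:30:00Z"}'
```

Comparing the literal string is deliberate. A round-trip test (serialize, parse, compare) would keep passing if the serializer and parser changed their defaults together, and that is the silent failure this test exists to prevent.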

A coding agent working a refactor pass renames a helper used in a payment flow. The agent updates every call site it can find, and the suite passes. The agent didn’t find the call sites in a downstream reporting service that imports the helper through a dynamic import string the agent’s static analysis missed. The reporting job fails silently in the next nightly run, producing empty reports for two days before an analyst notices. The fix is the missing test for “the reporting job produces non-empty output for last week’s data,” but the regression itself is the high-signal artifact: it tells the team where the agent’s reach exceeded its sight. The team responds by adding a pre-merge gate that runs the downstream service’s tests against any branch that touches the shared helper module, which is a regression-defense investment proportional to the agent’s edit rate.
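
To make the blind spot concrete, here is a sketch of how a helper referenced through a string escapes a rename. Every name in it is hypothetical, and the swallowed exception is an assumption about why the job failed silently; the story only says that it did.

```python
import importlib
import sys
import types

# Stand-in for the shared helper module after the rename (hypothetical names).
payments = types.ModuleType("payments_helpers")
payments.format_amount_cents = lambda cents: f"{cents / 100:.2f}"  # was: format_cents
sys.modules["payments_helpers"] = payments

# The reporting service names the helper in a string, so neither a text
# search for call sites nor a static rename tool sees this reference.
HELPER_PATH = "payments_helpers:format_cents"


def run_report(amounts_cents: list, helper_path: str = HELPER_PATH) -> list:
    module_name, func_name = helper_path.split(":")
    try:
        helper = getattr(importlib.import_module(module_name), func_name)
    except (ImportError, AttributeError):
        return []  # swallowed error: the job "succeeds" with an empty report
    return [helper(cents) for cents in amounts_cents]


def report_is_non_empty(amounts_cents: list) -> bool:
    # The check the missing regression test would have made.
    return len(run_report(amounts_cents)) > 0
```

With the stale path, report_is_non_empty([1250]) is false, which is the red test the team never had. Asserting on the job’s output rather than on the helper’s name keeps the test valid through future renames.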

Example Prompt

“A user reported that adding items to the cart sometimes drops existing items. Treat this as a suspected regression: write a test that reproduces it (add three items, verify all three are present), find the recent commit that broke the behavior with git bisect if needed, fix the bug, and keep the test in the suite.”

Consequences

Treating regression as a named category, rather than as one more way bugs happen, changes what a team’s testing investment is for. The suite stops being a quality-assurance gate at release time and starts being a continuous record of what the team has decided must keep working. Each new test is a behavior the team has committed to preserving against future changes; the suite is the inventory of those commitments.

Benefits. A team that takes regression seriously can change code without fear. Refactors land routinely, dependency upgrades happen on schedule, and agent-generated edits clear merge in minutes instead of waiting for a human to spot-check the whole system. The cost of change drops, which compounds over time because lower change-cost permits more changes. The suite also becomes a teaching artifact: a new engineer reading the tests learns what the codebase considers important, not just what it considers possible. And the discipline of “every fix lands with a test” turns every incident into a permanent piece of the defense, which is the slow-compounding asset that separates a one-year-old codebase from a ten-year-old one that still ships weekly.

Liabilities. The suite costs something to maintain. Tests have to be updated when behavior intentionally changes, or they become obstacles to legitimate work. A team that hasn’t internalized the difference between a regression (the change wasn’t aimed at this behavior) and an intentional change (this behavior was supposed to move) will end up either rubber-stamping test updates (defeating the defense) or fighting every update (paralyzing the work). The discipline is judgment-heavy: deciding which behaviors are permanent commitments and which are negotiable is the central editorial call of running a test suite, and it doesn’t have a mechanical answer.

There’s also a coverage limit no team escapes. The suite only catches regressions for behaviors the suite tests; the next regression will probably be in a behavior nobody wrote a test for. That isn’t a refutation of the discipline; it’s why the suite has to grow with the system, and why every shipped regression becomes a new test rather than just a fix. The point isn’t to enumerate every behavior in advance; it’s to convert each surprise into a permanent defense against its return.

For agentic workflows the calculus shifts in one specific way. The agent’s edit rate is high, so the suite has to be fast enough to run on every edit, and rich enough to catch the regressions the agent’s blind spots produce. A team with a slow, narrow suite and a fast, broad agent is producing regressions faster than it’s catching them, which is the failure mode the vocabulary is meant to make visible. The remedy is investment in the suite, not in restraining the agent.

Sources

The category of “regression” as a software defect predates the word. The practice of re-running existing tests after a change goes back to early software engineering, but the term became standard through the U.S. military’s discipline of regression testing in mission-critical systems. The IEEE 610.12-1990 Standard Glossary of Software Engineering Terminology gave regression testing its widely cited working definition: “selective retesting of a system or component to verify that modifications have not caused unintended effects and that the system or component still complies with its specified requirements.”
The argument that every fix should land with a test that locks in the corrected behavior is the operational core of Kent Beck’s Test-Driven Development: By Example (Addison-Wesley, 2002). Beck’s framing of the test suite as a living artifact that records the team’s behavioral commitments — rather than a one-time quality gate — is the move that makes regression a manageable category instead of a recurring surprise.
The defense at scale comes from the practice literature. Jez Humble and David Farley’s Continuous Delivery (Addison-Wesley, 2010) made the case that a credible automated suite is the only mechanism that lets teams deploy often without accumulating regressions; the deployment-pipeline argument throughout the book is, in effect, an argument that regression-defense is the entire point of pre-release automation. Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy, eds., Site Reliability Engineering: How Google Runs Production Systems (O’Reilly, 2016) extended the same intuition into production: regressions don’t all show up in pre-merge testing, so the same vocabulary applies to telemetry and rollback strategy at runtime.
The agent-specific framing — that an automated suite is the credible defense against high-edit-rate change from coding agents — is implicit in the broader argument running through Lauri Apple’s coverage of agentic engineering practice on the GitHub blog and explicit in the Anthropic engineering team’s discussions of agent feedback loops. The pattern that converts each shipped regression into a permanent test is the same one those teams describe under names like “guardrails” and “evals”; the vocabulary in this book treats them as agentic-specific instances of a discipline software engineering has been building for thirty years.
