
Garbage Collection

Pattern

Recurring, agent-driven sweeps that find where a codebase has drifted from its standards and fix the drift before it compounds.

Also known as: Codebase Hygiene Loop, Drift Remediation

Understand This First

  • Feedback Sensor – garbage collection uses feedback sensors (linters, type checkers, tests) to detect drift.
  • Steering Loop – the recurring sweep is itself a steering loop operating on a longer cadence than per-change checks.
  • Harnessability – the codebase needs codified standards for the agent to enforce.

Context

At the agentic level, garbage collection addresses a problem that emerges after the inner loops are working well. Your feedforward controls steer each change. Your feedback sensors catch mistakes before they merge. Your steering loop converges on correct output for each task. None of these operate at the scale of the whole codebase over time.

Codebases drift. A naming convention followed consistently six months ago now has exceptions in three modules. Documentation that matched the implementation at release has fallen behind. A dependency that was current when added is now two major versions old. These aren’t bugs. No single change introduced them. They accumulated through hundreds of small, individually correct decisions that collectively moved the codebase away from its own standards.

OpenAI named this pattern while describing how a small team used Codex agents to build and maintain a product exceeding one million lines of code. The third pillar of their harness, alongside architectural constraints and context engineering, was recurring background tasks that scanned for drift and opened targeted fixes.

Problem

How do you keep a fast-moving codebase from accumulating the kind of slow rot that no individual change introduces but that makes every future change harder?

Code review catches problems in new code. Tests catch regressions in existing behavior. Linters catch style violations at commit time. But none of these look at the codebase as a whole and ask: are we still following our own rules? The answer, in any codebase older than a few months, is almost always “mostly, with growing exceptions.”

The problem is worse with agent-generated code. SlopCodeBench, a 2026 benchmark that tracked code quality across iterative agent tasks, found that structural erosion increased in 80% of agent trajectories and verbosity grew in nearly 90%. Human-maintained codebases stayed flat over the same period. Agents don’t just fail to clean up drift. They amplify it, because they replicate whatever patterns they find locally, including the drifted ones.

Forces

  • Drift is invisible at the per-change level. Each commit follows the rules. The drift emerges from the aggregate over weeks and months.
  • Manual audits don’t scale. A human reviewing the entire codebase for convention compliance is expensive and boring. It happens rarely, if ever.
  • Agents amplify existing patterns. An AI agent generating new code in a drifted area will follow the local patterns it finds, including the drifted ones. Drift begets more drift.
  • Standards evolve. The rules themselves change. A logging convention adopted in January gets replaced by a better one in March. The old convention lingers in every file that hasn’t been touched since.

Solution

Run recurring agent tasks that scan the codebase against its codified standards, flag deviations, and open targeted pull requests to fix them. Think of memory garbage collection in programming languages: a background process that reclaims order from accumulated entropy. This pattern applies the same idea to the codebase itself.

Codified standards. The agent needs a machine-readable definition of what “correct” looks like. Linter configurations, architectural boundary rules in an instruction file, a style guide the agent can reference, or a set of “golden principles” checked into the repository all qualify. Without codified standards, the agent has nothing to enforce. Your garbage collection is only as good as your rules.
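A "golden principles" file can be as simple as a set of machine-checkable rules the agent loads before each sweep. The sketch below is a minimal, hypothetical illustration; the rule names, patterns, and structure are invented, and a real setup would more likely delegate to an existing linter.

```python
import re
from dataclasses import dataclass

@dataclass
class Rule:
    name: str
    pattern: str   # regex that flags a violation when it matches a line
    message: str

# A tiny, illustrative "golden principles" set an agent could enforce.
GOLDEN_RULES = [
    Rule("no-print-logging", r"\bprint\(", "use the structured logger, not print()"),
    Rule("no-wildcard-import", r"^from .+ import \*", "wildcard imports are banned"),
]

def scan_text(text, rules=GOLDEN_RULES):
    """Return (rule_name, line_number) pairs for every violation found."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for rule in rules:
            if re.search(rule.pattern, line):
                findings.append((rule.name, lineno))
    return findings
```

The point is not the regexes; it is that the standards live in the repository in a form the agent can execute against the code, rather than in someone's head.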

Scheduled scans. The agent runs on a recurring cadence, not triggered by a specific change. It reads the standards, examines some portion of the codebase, and identifies where reality has diverged from intent. The scan doesn’t need to cover everything every time. Sampling a subset of files per run and rotating through the codebase keeps each sweep focused and the pull requests reviewable.
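The rotation logic can be a deterministic slice over the sorted file list, so successive runs cover the whole codebase without repeating work. This is a sketch under assumed parameters (the function name and slice size are illustrative):

```python
def sweep_slice(files, run_index, slice_size):
    """Pick `slice_size` files for this run, rotating through the sorted list."""
    ordered = sorted(files)
    if not ordered:
        return []
    start = (run_index * slice_size) % len(ordered)
    # Concatenate the list to itself so a slice near the end wraps around.
    doubled = ordered + ordered
    return doubled[start:start + slice_size]
```

Run 0 scans the first slice, run 1 the next, and so on; after enough runs the rotation wraps and the sweep starts over on files that may have drifted again since their last visit.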

Targeted fixes. When the agent finds drift, it opens small, focused pull requests that address one category of deviation at a time. “Update 12 files to use the new logging convention.” “Replace deprecated API calls in the payments module.” Each fix is narrow enough to review quickly and safe enough to merge with confidence. The agent isn’t refactoring architecture. It’s picking up litter.
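Keeping each pull request single-category and small can be enforced mechanically when batching the sweep's findings. A minimal sketch, assuming findings arrive as (category, file path) pairs and a hypothetical cap of 15 files per PR:

```python
from collections import defaultdict

def group_into_prs(findings, max_files_per_pr=15):
    """Batch (category, path) findings into small, single-category PR groups."""
    by_category = defaultdict(set)
    for category, path in findings:
        by_category[category].add(path)
    prs = []
    for category, paths in sorted(by_category.items()):
        ordered = sorted(paths)
        # Split large categories into several reviewable PRs.
        for i in range(0, len(ordered), max_files_per_pr):
            prs.append({"category": category, "files": ordered[i:i + max_files_per_pr]})
    return prs
```

Each resulting batch maps directly to one narrow pull request title of the "Update N files to use the new logging convention" kind.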

Measurement. Track what the sweeps find. If the same category of drift keeps appearing, your standards aren’t reaching developers (or agents) at the point of creation. If sweep findings drop over time, the loop is working. Without this feedback, garbage collection becomes ritual instead of remedy.
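The measurement can be as lightweight as comparing per-category finding counts between consecutive sweeps. A hypothetical sketch, assuming each sweep logs a dict of category counts:

```python
def drift_trend(history):
    """Compare the last two sweeps' finding counts and report the direction
    of each drift category: shrinking, growing, or flat."""
    if len(history) < 2:
        return {}
    prev, curr = history[-2], history[-1]
    trend = {}
    for category in set(prev) | set(curr):
        before, after = prev.get(category, 0), curr.get(category, 0)
        if after < before:
            trend[category] = "shrinking"
        elif after > before:
            trend[category] = "growing"
        else:
            trend[category] = "flat"
    return trend
```

A category that stays "growing" across several sweeps is exactly the signal that the standard isn't reaching the point of creation.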

Tip

Start with the cheapest signals. Linter violations, outdated imports, and naming inconsistencies are safe for agents to fix autonomously. Architectural drift and design-level deviations need human review before the agent acts.
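That risk split can be encoded directly in the sweep so routing is never left to the agent's judgment. The tier names and categories below are invented for illustration:

```python
# Hypothetical risk tiering: cheap, mechanical drift is safe to fix
# autonomously; design-level drift is only reported for human review.
AUTO_FIX = {"lint-violation", "outdated-import", "naming-inconsistency"}

def route_finding(category):
    """Return how a sweep finding is handled, defaulting to human review."""
    return "fix-autonomously" if category in AUTO_FIX else "flag-for-review"
```

Defaulting unknown categories to review keeps the agent's autonomy bounded as new drift categories appear.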

The cadence depends on the pace of change. A team shipping dozens of PRs per day might run garbage collection nightly. A slower project might run it weekly. The key is regularity: drift compounds, and the longer you wait between sweeps, the bigger and harder each cleanup becomes.

How It Plays Out

A platform team maintains a service with 200,000 lines of TypeScript. They adopted a new error-handling convention in February: all service-layer functions return a Result type instead of throwing exceptions. New code follows the convention. Old code doesn’t. By April, 60% of the service layer uses Result types and 40% still throws. New developers can’t tell which pattern to follow. Their AI agent, asked to add a feature, finds both patterns in the same codebase and picks whichever appears in the file it happens to be working in.

The team sets up a weekly garbage collection sweep. The agent scans all service-layer files, identifies functions that still throw instead of returning Result, and opens one PR per module with the conversions. Each PR is small, tested, and reviewable in minutes. Over three weeks, the convention reaches 100% adoption without anyone scheduling a “tech debt sprint.”
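The detection step of a sweep like this can start crude. The sketch below finds files that still `throw` rather than returning a Result; a real sweep would parse the TypeScript AST rather than use a regex, and the paths and helper name are invented:

```python
import re

# Matches `throw new ...` statements in TypeScript source.
THROW_RE = re.compile(r"\bthrow\s+new\b")

def files_still_throwing(sources):
    """`sources` maps file path -> file contents; return paths that throw."""
    return sorted(path for path, text in sources.items() if THROW_RE.search(text))
```

Each file on the returned list becomes a candidate for the next per-module conversion PR.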

A solo developer uses an AI agent to build a side project. She writes an instruction file describing her naming conventions, directory structure, and test expectations. Over two months and 400 commits, the project grows to 30,000 lines. She notices the agent has started placing utility functions in three different directories, depending on which existing file it used as a model. She adds a garbage collection task to her workflow: every Sunday, the agent audits the project against the instruction file, reports deviations, and proposes reorganization. The first run finds 14 misplaced files and two modules that violate the dependency rules. The fixes take the agent ten minutes. Without the sweep, the inconsistencies would have kept multiplying.

Six months into an agentic migration, a fintech company checks their sweep logs and notices something. The same three categories of drift keep appearing: inconsistent date formatting across API responses, mixed use of camelCase and snake_case in internal interfaces, and stale feature flags that were never cleaned up after launch. The first two are agent-amplified: the agents find both conventions in the codebase and propagate whichever they encounter first. The third is a human problem that the sweeps make visible.

The team responds at the source. They add date format and casing rules to their linter configuration, catching future drift at commit time. For feature flags, they write a sweep rule that flags any flag older than 30 days with no conditional references. The sweeps didn’t just clean up the codebase. They surfaced the root causes.
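The stale-flag rule described above is easy to sketch: a flag is stale when it is older than the threshold and no conditional still references it. The data shapes here are assumptions for illustration:

```python
from datetime import date

def stale_flags(flags, referenced, today, max_age_days=30):
    """`flags` maps flag name -> launch date; `referenced` is the set of
    flag names still appearing in conditional checks. Return stale flags."""
    stale = []
    for name, launched in flags.items():
        age_days = (today - launched).days
        if age_days > max_age_days and name not in referenced:
            stale.append(name)
    return sorted(stale)
```

Note the two conditions together: age alone isn't enough, since a long-lived flag that still gates behavior is doing its job.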

Consequences

Benefits. Drift gets caught early, when fixing it is cheap. Standards stay real instead of aspirational. Agents working in the codebase find consistent patterns to follow, which improves the quality of their generated code. Cleanup happens continuously instead of in expensive “tech debt sprints.”

Liabilities. The agent needs accurate, up-to-date standards to enforce. Outdated or rigid rules produce false positives that waste reviewer attention and erode trust. Running scans costs tokens and compute. Automated fixes can introduce regressions if tests are insufficient, especially for changes that are syntactically simple but semantically risky. There’s also a governance question: who reviews the garbage collection PRs? If nobody does, you’ve given the agent unsupervised write access to the entire codebase. If everyone does, you’ve created a stream of low-priority review requests that contribute to approval fatigue.

  • Uses: Feedback Sensor – the scan phase relies on computational sensors (linters, type checkers, test suites) to detect deviations.
  • Uses: Instruction File – codified standards often live in instruction files that the agent reads before scanning.
  • Extends: Steering Loop – garbage collection is a steering loop operating on a longer time horizon (days or weeks rather than seconds).
  • Complements: Harnessability – garbage collection maintains the harnessable properties that make agents effective.
  • Reduces: Technical Debt – recurring sweeps prevent drift from compounding into structural debt.
  • Risk of: Approval Fatigue – a high-frequency sweep can flood reviewers with low-stakes PRs if not calibrated carefully.
  • Informed by: Bounded Autonomy – the scope of what the agent can fix autonomously should match the risk tier of each change.

Sources

OpenAI’s “Harness Engineering” article (2026) named garbage collection as the third pillar of their agent-driven development process. Their team used Codex agents to build and maintain a codebase exceeding one million lines across roughly 1,500 automated pull requests, with recurring background sweeps enforcing “golden principles” that kept the codebase legible for future agent runs.

Martin Fowler’s companion essay on harness engineering for coding agent users placed the concept within his feedforward/feedback taxonomy, distinguishing the recurring maintenance loop from both pre-action controls and post-action checks.

The SlopCodeBench benchmark (Sprocket Lab, March 2026) provided empirical evidence for the drift problem this pattern addresses. Across 20 iterative coding tasks, structural erosion increased in 80% of agent trajectories while human-maintained code stayed flat, confirming that agents without active maintenance processes degrade the codebases they work in.