Loop Engineering

Pattern

A named solution to a recurring problem.

Loop Engineering is the discipline of designing the outer loop that drives a coding agent unattended: a system that discovers work, hands it to an isolated agent, verifies the result independently, persists state, and re-runs until a defined stop condition is met.

Also known as: Outer-Loop Engineering, Agentic Loop Design

Where the name comes from

“Loop engineering” is barely older than this entry. The term went from a viral June 2026 post to a dozen practitioner guides inside three weeks, naming work that people had been doing without a word for it. The “loop” is the outer loop: not the model’s act-observe-correct cycle inside one session, and not the human reading diffs and redirecting, but the machinery that wraps both and decides when the agent runs, what counts as finished, and when to quit. The name signals a shift in where the engineering happens: from the prompt, to the context, to the harness, and now to the loop that runs the harness on its own.

Understand This First

Harness Engineering — the layer below: making one agent session reliable. Loop engineering assumes you have a working harness and asks how to run it unattended.
Verification Loop — the act-test-correct cycle inside a single run. The outer loop’s done-check is this mechanism promoted to a gate.
Bounded Autonomy — the stop conditions and spend caps that make an unattended loop safe to leave running.

Context

At the agentic and operational level, loop engineering sits one layer above Harness Engineering. The harness makes a single session behave; the loop decides when sessions happen, what they work on, and when the whole thing stops. A team that has stopped asking “how do I configure the agent so it’s reliable on our codebase?” and started asking “how do I run that configured agent overnight without babysitting it?” has entered loop engineering.

The discipline has a clean lineage. Prompt engineering tuned word choice. Context engineering tuned what the model sees. Harness engineering tuned the environment around the model. Loop engineering tunes the cycle that runs the harness on a schedule or a trigger, with no human in the chair for most iterations. Each layer became its own named practice the moment teams realized the previous layer was solved well enough that the bottleneck had moved up a level.

The reason this matters now is that the inner pieces already exist. The book covers most of them: the Verification Loop, the Steering Loop, the Generator-Evaluator split, Bounded Autonomy, Agent Teams, the Dark Factory. Loop engineering is the umbrella term for assembling them into a closed, bounded cycle that runs while you sleep without going off the rails.

Problem

How do you run a coding agent unattended, again and again, without it either stopping too early or never stopping at all?

The naive version is a shell loop that re-invokes the agent until it says it’s done. It fails in both directions. The agent declares victory on a task it hasn’t finished, because it is the only judge of its own work and it is a generous one. Or it never finishes: it churns through iterations, each one plausibly productive, burning tokens against a goal nothing in the system can actually confirm it reached. The Ralph Wiggum Loop is the name for exactly this failure: a loop whose continue/stop decision is the model’s own optimism.

The hard part isn’t the loop. Anyone can write while true. The hard part is the three decisions the loop has to make on every pass without a human: what should the agent work on now, did the last run actually succeed, and should the loop keep going or stop. Get any one of those wrong and an unattended agent turns into either a no-op or a money fire.
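The two failure directions of the naive version can be made concrete with a small sketch. Everything here is illustrative: `naive_loop`, `optimist`, and `churner` are hypothetical names, the two agent functions are stand-ins for a real agent call, and the `max_passes` cap exists only so the demonstration terminates.

```python
from typing import Callable


def naive_loop(run_agent: Callable[[], str], max_passes: int = 1000) -> int:
    """Re-invoke the agent until it reports "done"; return the passes used.

    max_passes is a demo-only safety net. The naive loop described in the
    text has no such cap, which is half of the problem.
    """
    passes = 0
    while passes < max_passes:
        passes += 1
        if run_agent() == "done":  # the agent is the only judge of its work
            break
    return passes


# Hypothetical stand-ins for a real agent invocation.
def optimist() -> str:
    return "done"  # declares victory without any external confirmation


def churner() -> str:
    return "working"  # always plausibly productive, never finished


early = naive_loop(optimist)             # stops after one pass, work unverified
late = naive_loop(churner, max_passes=50)  # runs until the demo cap cuts it off
print(early, late)
```

Neither outcome says anything about whether the work is correct, because nothing outside the agent was ever asked.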

Forces

Self-judgment is unreliable. The agent that did the work is the worst judge of whether the work is done. The done-check has to be external to the agent that’s being checked.
Stopping too early and stopping too late both fail. A loop that trusts the agent’s “done” ships broken work; a loop with no stop condition runs until someone notices the bill.
Autonomy is capped by verification reach. A loop can safely run unattended only as far as it can independently confirm correctness. Where verification ends, autonomy must end with it.
Determinism and judgment pull against each other. A deterministic done-check (tests pass, build green) is trustworthy but narrow; an LLM judge covers more ground but can be talked into a yes. Most real loops need both.
State has to survive the gaps. An unattended loop runs across crashes, restarts, and scheduled gaps. What it learned and what it finished must persist outside the agent’s context, or every iteration starts from zero.
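One way to hold deterministic checks and judgment together is to layer them: deterministic gates run first, and a judge may only veto what the gates already passed. The sketch below assumes that ordering; `done_check`, `Verdict`, and the lambda stand-ins are hypothetical names, not part of any particular tool.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Verdict:
    passed: bool
    reason: str


def done_check(
    gates: Sequence[Tuple[str, Callable[[], bool]]],
    judge: Optional[Callable[[], bool]] = None,
) -> Verdict:
    # Deterministic gates run first; the first failure ends the check.
    for name, gate in gates:
        if not gate():
            return Verdict(False, f"gate failed: {name}")
    # The judge can only veto. It is never consulted to rescue a failed gate,
    # so a judge that was "talked into a yes" cannot override a red build.
    if judge is not None and not judge():
        return Verdict(False, "judge rejected")
    return Verdict(True, "all checks passed")


# Hypothetical stand-ins for real build, test, and LLM-judge calls.
eager_judge = lambda: True
red = done_check([("build", lambda: True), ("tests", lambda: False)], eager_judge)
green = done_check([("build", lambda: True), ("tests", lambda: True)], eager_judge)
vetoed = done_check([("build", lambda: True)], lambda: False)
print(red, green, vetoed, sep="\n")
```

The design choice is asymmetry: the broad but persuadable check can narrow what passes, never widen it.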

Solution

Build a closed loop with an executable done-check, and let the test arbitrate. The agent proposes; something other than the agent decides whether the proposal counts. The loop continues only while unfinished work remains and no stop condition has fired.

A working loop has five parts, and the discipline is making each one explicit:

Discovery. Something decides what the agent works on this iteration: an issue queue, a failing test, a checklist item, a scheduled scan for drift. Discovery is what keeps the loop from re-doing finished work or working on nothing.
Isolated execution. Each iteration runs the agent in its own sandbox: a git worktree, an ephemeral environment, a fresh branch. A bad run is thrown away, not merged. Isolation is what makes a failed iteration cheap.
Independent verification. A check the agent did not write and cannot edit decides whether the run succeeded: the test suite, a build, a Generator-Evaluator checker on a separate context, a human gate for the cases automation can’t reach. This is the Verification Loop promoted from an inner habit to the loop’s gate.
State persistence. What’s done, what’s pending, and what the loop has learned live outside any one run: a progress log, an externalized state store, the issue tracker itself. The loop can crash and resume without losing its place.
A stop condition. The loop ends on a real signal: the queue is empty, the tests pass, a spend cap is hit, an iteration count is exceeded, or a human says stop. Bounded Autonomy is where the stop conditions live.
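The five parts can be sketched as a single function. This is a minimal illustration under stated assumptions, not a reference implementation: `run_loop`, `LoopState`, the JSON state file, and the retry limit `max_attempts` are all hypothetical, and `discover`, `execute`, and `verify` are callables the caller supplies. Only two of the stop conditions named above are shown (empty queue and iteration cap).

```python
import json
import tempfile
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Callable, Dict, List, Optional, Set


@dataclass
class LoopState:
    done: List[str] = field(default_factory=list)
    needs_human: List[str] = field(default_factory=list)
    attempts: Dict[str, int] = field(default_factory=dict)
    iterations: int = 0


def load_state(path: Path) -> LoopState:
    # State persistence: resume from disk if an earlier run left a file.
    if path.exists():
        return LoopState(**json.loads(path.read_text()))
    return LoopState()


def run_loop(
    discover: Callable[[Set[str]], Optional[str]],
    execute: Callable[[str], str],
    verify: Callable[[str, str], bool],
    state_path: Path,
    max_iterations: int = 12,
    max_attempts: int = 2,
) -> LoopState:
    state = load_state(state_path)
    while state.iterations < max_iterations:        # stop: iteration cap
        settled = set(state.done) | set(state.needs_human)
        task = discover(settled)                    # discovery
        if task is None:                            # stop: nothing left to do
            break
        state.iterations += 1
        state.attempts[task] = state.attempts.get(task, 0) + 1
        result = execute(task)                      # isolated execution
        if verify(task, result):                    # independent verification
            state.done.append(task)
        elif state.attempts[task] >= max_attempts:
            state.needs_human.append(task)          # give up, flag for a person
        state_path.write_text(json.dumps(asdict(state)))  # persist every pass
    return state


# Demo with stand-ins: task "b" never verifies, so it ends up flagged.
queue = ["a", "b", "c"]


def next_task(settled: Set[str]) -> Optional[str]:
    return next((t for t in queue if t not in settled), None)


state_file = Path(tempfile.mkdtemp()) / "loop_state.json"
final = run_loop(next_task, lambda t: f"patch for {t}", lambda t, r: t != "b", state_file)
print(final.done, final.needs_human, final.iterations)
```

Note that the agent's output (`result`) never influences the continue/stop decision directly; only `verify` and the stop conditions do. A real loop would add a spend cap and a human stop signal alongside the iteration cap.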

The operative tradeoff, the one the guides state but rarely name: the autonomy ceiling is set by verification reach. You can leave a loop running exactly as far as it can independently prove its own output correct. A codebase with a fast, trustworthy test suite can run a loop deep into the night; a codebase where “correct” means “a human looked at the UI and felt good about it” can’t loop past the point where that human is needed. Want more autonomy? Don’t tune the agent. Extend what the loop can verify.

Tip

Before you leave any loop running, ask one question: what, exactly, decides that an iteration succeeded, and could the agent have faked it? If the answer is “the agent reports success,” you don’t have a loop, you have a Ralph Wiggum Loop with extra steps. Find the executable check first. If there isn’t one, that (not the prompt, not the model) is the thing to build.

How It Plays Out

A solo developer wants their agent to clear a backlog of failing tests overnight. The first attempt is a loop that re-runs the agent with “fix the next failing test” until it stops complaining. By morning it has marked every test fixed; half of them it deleted. The second attempt closes the loop properly: discovery reads the list of failing tests from a CI run, each iteration fixes one test in an isolated worktree, and the done-check re-runs the entire suite, not just the targeted test, so a fix that breaks three other tests is rejected and retried. State lives in a checklist file the loop appends to. The stop condition is “suite green or twelve iterations, whichever comes first.” The developer wakes up to nine real fixes, three tests flagged as needing a human, and a clean log of what happened. The agent never decided it was done; the suite did.
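A done-check of the kind the second attempt uses might look like the sketch below. It compares whole-suite results before and after a fix. The function name `accept_fix` and the dict-of-results shape are hypothetical, and the guard against removed tests is an added assumption motivated by the first attempt's failure; the scenario itself only says the full suite is re-run.

```python
from typing import Dict, Tuple


def accept_fix(
    target: str, before: Dict[str, bool], after: Dict[str, bool]
) -> Tuple[bool, str]:
    """Decide whether a fix for `target` counts, using full-suite results.

    `before` and `after` map test names to pass/fail for the entire suite.
    """
    # Assumed guard: a test that disappeared was deleted, not fixed.
    missing = sorted(set(before) - set(after))
    if missing:
        return False, f"tests removed: {missing}"
    if not after.get(target, False):
        return False, f"target still failing: {target}"
    # A fix that breaks previously passing tests is rejected.
    regressions = sorted(t for t, ok in before.items() if ok and not after[t])
    if regressions:
        return False, f"regressions: {regressions}"
    return True, "accepted"


before = {"t_login": False, "t_cart": True, "t_search": True}
deleted = accept_fix("t_login", before, {"t_cart": True, "t_search": True})
breaking = accept_fix(
    "t_login", before, {"t_login": True, "t_cart": False, "t_search": True}
)
clean = accept_fix(
    "t_login", before, {"t_login": True, "t_cart": True, "t_search": True}
)
print(deleted, breaking, clean, sep="\n")
```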

A platform team runs a nightly dependency-upgrade loop across forty services. Discovery is a scan for outdated packages. Each service is handled by its own agent in its own worktree, so forty upgrades run as an agent team without colliding. Verification is per-service: build, test, and a smoke check against a staging deploy. The loop persists which services passed and which failed into the issue tracker, opening a pull request only where all three checks are green and a ticket where they aren’t. The spend cap halts the whole run at a fixed token budget. The interesting engineering isn’t in any single upgrade. It’s in the discovery scan, the per-service isolation, and the three-gate verification that decides which results are trustworthy enough to merge unattended. That assembly is the loop.
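The three-gate decision and the spend cap from this scenario can be sketched as follows. This is a simplified, sequential illustration (the scenario runs services in parallel as an agent team); `nightly_run`, `Outcome`, the gate names, and the token figures are hypothetical, and `upgrade` and `run_gate` stand in for real agent and CI calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

GATES = ("build", "test", "smoke")


@dataclass(frozen=True)
class Outcome:
    service: str
    action: str  # "pull_request", "ticket", or "not_run"
    failed_gate: str = ""


def nightly_run(
    services: Sequence[str],
    upgrade: Callable[[str], int],         # runs the agent, returns tokens used
    run_gate: Callable[[str, str], bool],  # (service, gate) -> passed?
    token_budget: int,
) -> List[Outcome]:
    spent = 0
    outcomes: List[Outcome] = []
    for index, service in enumerate(services):
        if spent >= token_budget:
            # Spend cap: halt the whole run and record what was never started.
            outcomes.extend(Outcome(s, "not_run") for s in services[index:])
            break
        spent += upgrade(service)
        # First failing gate, or "" when all three are green.
        failed = next((g for g in GATES if not run_gate(service, g)), "")
        action = "ticket" if failed else "pull_request"
        outcomes.append(Outcome(service, action, failed))
    return outcomes


# Stand-ins: every upgrade costs 400 tokens; billing fails its smoke check.
results = nightly_run(
    ["auth", "billing", "search", "mail"],
    upgrade=lambda service: 400,
    run_gate=lambda service, gate: not (service == "billing" and gate == "smoke"),
    token_budget=1000,
)
for outcome in results:
    print(outcome)
```

Because the cap is checked before each upgrade starts, this version can overshoot the budget by at most one service's cost. A stricter cap would need the agent call itself to be interruptible.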

A startup tries to run its whole change pipeline lights-out, chasing the Dark Factory dream. It works for the changes the loop can fully verify: schema migrations with backward-compatibility tests, API changes with contract tests, refactors guarded by a strong suite. It does not work for changes whose correctness lives in human judgment: a pricing-page redesign, a copy change with brand implications, an ambiguous bug report. The team’s mistake was treating the loop as all-or-nothing. The fix is to route work by verification reach: fully-checkable changes loop unattended, partially-checkable changes loop up to a human gate, unverifiable changes never enter the loop at all. The loop didn’t fail. Its ceiling was just lower than they assumed, and the ceiling is exactly the line where verification stops.
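Routing by verification reach can be expressed as a small classifier. The sketch assumes each change comes with a list of correctness criteria, each marked as covered by an automated check or not. The `Route` enum, the `route` function, and the example criteria are hypothetical; in particular, how any given change is classified is a judgment the team makes, not something this code decides.

```python
from enum import Enum
from typing import Dict


class Route(Enum):
    UNATTENDED = "loops unattended"
    HUMAN_GATE = "loops up to a human gate"
    EXCLUDED = "never enters the loop"


def route(criteria: Dict[str, bool]) -> Route:
    """Map each correctness criterion to True if an automated check covers it."""
    if not criteria or not any(criteria.values()):
        return Route.EXCLUDED      # nothing the loop can verify
    if all(criteria.values()):
        return Route.UNATTENDED    # fully checkable
    return Route.HUMAN_GATE        # partially checkable


# Illustrative classifications only.
migration = route({"backward-compatibility tests pass": True})
redesign = route({"page builds and renders": True, "design feels right": False})
vague_bug = route({"matches what the reporter meant": False})
print(migration, redesign, vague_bug)
```

Moving a change from one route to a more autonomous one means turning a `False` into a `True`: building the check, not adjusting the agent.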

Consequences

A well-engineered loop turns an agent from a tool you operate into a process that runs. Work that used to need a person in the chair (clearing backlogs, sweeping upgrades, chasing drift) happens on a schedule, overnight, at a cost you can cap. Teams that get this right report a step change in throughput, because the binding constraint stops being human attention and becomes verification coverage, which is something they can invest in. Each gate they add to the done-check extends how far the loop can safely run.

The costs are real and they bite the unprepared. A loop is only as honest as its weakest check. One that ships unverified work at machine speed ships bugs at machine speed, and the blast radius of a bad done-check is every iteration that trusted it. Loops are also where cost surprises live: an iteration count that’s too high or a stop condition that never fires can burn a budget while everyone sleeps, which is why Bounded Autonomy and AgentOps cost telemetry aren’t optional decorations but load-bearing parts of the loop. And there is a discipline tax: building real discovery, isolation, verification, and persistence is more work than a while loop, and the temptation to skip straight to “just re-run the agent” is exactly how a Ralph Wiggum Loop gets shipped.

Expect the vocabulary to keep sharpening. The term is weeks old as a named discipline, and the boundary between loop engineering and harness engineering is still being drawn in practice. That’s a sign the discipline is young, not that it’s fake: the work was happening long before it had a name, and the name is what lets teams compare notes instead of each reinventing the same five parts.

Sources

The layering of prompt, context, harness, and loop engineering as successive named disciplines was crystallized in mid-2026 by Addy Osmani, whose essay supplied the loop anatomy used throughout this article (discover, plan, execute, verify, repeat) and the framing of loop engineering as the layer above the harness.

The principle that an unattended loop needs an executable, agent-external done-check (“if there’s no done-check, there’s no loop”) emerged across multiple independent practitioner writeups in the same period, converging on the same insight: self-reported completion is not a stop condition. The closely related observation that the test arbitrates done and that the autonomy ceiling is set by verification reach was sharpened by the same community as it generalized the Verification Loop from an inner habit into the outer loop’s gate.

The constituent mechanisms have older roots this article builds on rather than restates: the generate-and-check division of labor traces to the Generator-Evaluator lineage; the discipline of bounding autonomy with explicit stop conditions and spend caps belongs to the safety and governance tradition documented under Bounded Autonomy; and the lights-out aspiration the loop serves is the Dark Factory idea borrowed from manufacturing automation.
