Harness Engineering
Harness Engineering is the discipline of designing the configuration surfaces around a coding agent so that a fixed model produces reliable outcomes in a specific codebase.
“The harness is becoming its own engineering discipline.” — Martin Fowler
Also known as: Agent Harness Design, Coding-Agent Configuration, Agent Runtime Engineering
Understand This First
- Harness (Agentic) – the mechanism this discipline works on.
- Harnessability – the codebase-side counterpart that determines what a harness has to work with.
- Context Engineering – harness engineering is, among other things, context engineering done across many sessions.
Context
At the agentic and operational level, harness engineering sits one layer above day-to-day agent use. Where the Harness (Agentic) article defines what a harness is, this one defines the practice of engineering one. A team that’s stopped asking “which tool should I buy?” and started asking “how do I configure Claude Code (or Codex, or Cursor) so it’s reliable on our codebase?” has entered harness engineering.
The shift matters because the frontier of agentic coding is no longer raw model capability. When LangChain ran Terminal Bench 2.0, they moved a coding agent from 52.8% to 66.5% on the same underlying model by changing only the harness. OpenAI spent two years and more than a million lines of production code on an internal harness that sits around Codex, because they found that harness decisions (instructions, tools, sub-agent topology, approval policy) drive more of the result than model choice does. The model is roughly fixed for any given team in any given week; the harness isn’t. Everything a team can still tune lives here.
Harness engineering is what you do with that room.
Problem
How do you turn a capable general-purpose model into an agent that reliably does your work on your codebase?
Out of the box, a coding agent will produce plausible-looking changes that miss your conventions, forget your constraints, and over- or under-use the tools it has. Give it too much latitude and it writes too much, approves too freely, or burns tokens thrashing on a flaky tool. Give it too little and it becomes a slow autocomplete. The knobs that move the agent between those extremes (which tools it sees, which instructions it reads, which sub-agents it spawns, which hooks fire, how much it’s allowed to do before asking) aren’t incidental settings. They’re the system.
Without a name for the work, teams treat each knob as a configuration detail and each incident as a surprise. With a name, the knobs become a designed surface and the surprises become testable hypotheses.
Forces
- The model is a fixed input; the harness isn’t. You can’t cheaply retrain a foundation model for your codebase, but you can redesign the surface around it this afternoon.
- Surfaces interact. A change to instructions affects what tools get called; a new hook affects what context fills the window; a sub-agent policy affects cost and latency. You can’t tune one surface in isolation.
- Under-configuration and over-configuration fail differently. A thin harness produces generic output and frustrated users. A thick harness produces rigid output and maintenance debt, because the harness itself becomes a project.
- Harness quality has a ceiling set by the codebase. No amount of configuration fixes an untyped, untested, undocumented codebase. Harness engineering and Harnessability are paired disciplines.
- The surfaces are still being named. The vocabulary is younger than the practice. Early adopters have to translate between their tool’s terminology and the concepts.
Solution
Treat the configuration around the agent as an engineered surface, not a pile of dotfiles. Name each surface. Reason about what it’s for. Change it with the same discipline you apply to the code itself.
The surfaces that have stabilized as first-class objects in most modern harnesses are:
- Instruction files – durable, project-scoped guidance (Instruction File). The agent reads them at the start of every session; they are the cheapest surface to change and usually the one that pays back most.
- Tools – the callable capabilities the agent can reach (Tool). Too few and the agent is helpless. Too many and it picks wrong or causes damage.
- MCP servers – the standard protocol for wiring in external systems (MCP). Each server adds capability and cost; choose them the way you would choose runtime dependencies.
- Skills – packaged workflows loaded on demand (Skill). They let the harness carry expertise without bloating the main context window.
- Sub-agents – delegated workers with their own scoped contexts (Subagent). They isolate noisy investigations from the parent, separate specialties, and parallelize work.
- Hooks – automation bound to lifecycle points (Hook). A formatter that fires after every write, a linter that fires before commit, a safety check that fires before a destructive command.
- Approval and governance policy – the rules that gate what the agent can do without asking (Approval Policy, Bounded Autonomy).
- Memory – what the agent carries across sessions (Memory). A surface that compounds: a well-tended memory gets better over time; a sloppy one accumulates contradictory noise.
- Compaction strategy – how the harness shortens history when the window fills (Compaction). The strategy is tunable, and a bad strategy silently erases the context your other surfaces worked to build.
- Back-pressure – the pacing mechanisms that keep the agent from saturating itself, its tools, or its humans. Concurrency caps on sub-agents, rate limits on parallel tool calls, cooldowns between writes, queueing when downstream systems signal stress. Classical reactive-systems vocabulary, now load-bearing for agents.
- Isolation – filesystem and environment boundaries for risky or parallel work (Worktree Isolation, Externalized State).
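Of the surfaces above, back-pressure is the most mechanical, so it is the easiest to sketch. The following is a minimal illustration of a concurrency cap on sub-agent fan-out; the names (`SubagentPool`, `run_subagent`) are hypothetical, not any vendor’s API, and the sub-agent call is a stub.

```python
# Hypothetical sketch of the back-pressure surface: a concurrency cap
# that keeps sub-agent fan-out from saturating downstream tools.
import asyncio

class SubagentPool:
    def __init__(self, max_concurrent: int = 3):
        # The cap is the tunable knob: raise it for throughput,
        # lower it when downstream systems signal stress.
        self._gate = asyncio.Semaphore(max_concurrent)

    async def run(self, task: str) -> str:
        async with self._gate:  # waits here if the pool is saturated
            return await self._run_subagent(task)

    async def _run_subagent(self, task: str) -> str:
        await asyncio.sleep(0.01)  # stand-in for a real sub-agent call
        return f"done: {task}"

async def main() -> list[str]:
    pool = SubagentPool(max_concurrent=2)
    # Ten investigations queued, but never more than two in flight.
    return await asyncio.gather(*(pool.run(f"explore dir {i}") for i in range(10)))

results = asyncio.run(main())
```

The same shape covers the other pacing mechanisms: swap the semaphore for a token bucket to get rate limits, or for a queue with a depth threshold to get stress signalling.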
A useful mental model is three nested loops. The inner loop is the agent in the code: the model calling tools, reading files, proposing edits. The middle loop is a human steering the agent: reading diffs, redirecting, approving (Steering Loop). The outer loop is harness engineering: the human, between sessions or between weeks, changing the surfaces so the inner and middle loops go better next time. Each loop has its own feedback signal. The outer loop’s signals come from AgentOps telemetry and from the team’s own observations about where agents keep stumbling.
When an agent session goes sideways, ask at which loop the fix belongs. A one-off prompt tweak lives in the inner loop. A “next time, steer earlier” lives in the middle loop. A pattern that keeps recurring (the agent keeps forgetting a convention, keeps overrunning a quota, keeps calling the wrong tool) belongs in the outer loop, and should change a surface: an instruction file, a hook, a tool list, a policy. The best harness work starts by noticing which loop you keep patching.
How It Plays Out
A team inherits a medium-sized TypeScript monorepo and starts using Claude Code. The first week, they use it out of the box: the agent produces code that compiles and passes tests but uses the wrong logging library and the wrong error-handling convention, and it proposes migrations that violate a soft-deprecation rule the team never wrote down. Instead of treating each incident as a correction, the lead engineer opens an AGENTS.md and starts writing. She codifies the logging library, the error-handling pattern, the module boundaries, and the soft-deprecation rule. She adds a pre-commit hook that runs the repo’s type checker, and a tool allowlist that keeps the agent from reaching for random npm scripts. She configures a sub-agent specifically for “explore this unfamiliar directory” and gives it a short-lived memory so exploration noise doesn’t pollute the main context. Two weeks later, she reviews sessions and finds the agent is self-correcting in the ways she used to intervene for. She hasn’t changed the model, the prompt style, or the team. She has done harness engineering.
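The pre-commit hook in that story has a simple contract: run a check, and return nonzero to block the commit while surfacing the checker’s output so the agent can self-correct. A hedged sketch, assuming a hypothetical TypeScript monorepo where the check is `npx tsc --noEmit` (the specific command is an assumption, not a default):

```python
# Sketch of the hook surface: a pre-commit check whose nonzero exit
# code tells the harness to refuse the commit.
import subprocess
import sys

# Hypothetical repo-specific command; any fast, side-effect-free
# check works here.
TYPE_CHECK = ["npx", "tsc", "--noEmit"]

def run_check(cmd: list[str]) -> int:
    """Run one check; a nonzero return code blocks the commit."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        # Surfacing the checker's output is what lets the agent
        # fix the actual error instead of retrying blind.
        sys.stderr.write(result.stdout + result.stderr)
    return result.returncode

# Demonstrate with a harmless command; in the repo the hook would
# run TYPE_CHECK instead.
ok = run_check([sys.executable, "-c", "pass"])
```

The discipline point is that the hook is a surface, not a script: it should be fast enough to fire on every write, and its failure output is part of the agent’s context budget.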
A small startup that ships a web app runs every production change through a harness built on top of the Codex API. The first version is a single agent with broad tool access; it moves fast and occasionally destroys test fixtures. The team refactors it into a three-agent topology: a planner that produces the change plan and never writes files, a writer that executes the plan in a worktree-isolated branch, and a critic that reviews the diff against the plan and the repo’s invariants. A hook fires after every write to run the repo’s fast suite; a back-pressure cap prevents the writer from making more than ten file changes without the critic agreeing. Token cost drops 30% because the planner and critic run on a cheaper model. Incident rate drops further because the critic catches the same mistakes the humans used to catch. The interesting engineering here isn’t inside any single agent. It’s in the topology, the rate limits, and the hook schedule. That’s the harness.
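The back-pressure cap in that topology can be sketched as a write budget that the critic must reset. Everything below is illustrative (the class names and the budget of ten are from the story, not a real harness); a production version would wire these methods to model calls.

```python
# Illustrative sketch of the planner/writer/critic back-pressure cap:
# the writer may land at most `limit` file changes before the critic
# signs off and resets the budget.
from dataclasses import dataclass, field

@dataclass
class WriteBudget:
    limit: int = 10  # the ten-file cap from the story above
    used: int = 0

    def spend(self) -> bool:
        if self.used >= self.limit:
            return False  # writer must stop and wait for the critic
        self.used += 1
        return True

    def reset(self) -> None:  # called when the critic approves
        self.used = 0

@dataclass
class Topology:
    budget: WriteBudget = field(default_factory=WriteBudget)
    written: list[str] = field(default_factory=list)

    def write_file(self, path: str) -> bool:
        if not self.budget.spend():
            return False  # refused: budget exhausted
        self.written.append(path)
        return True

    def critic_approves(self) -> None:
        self.budget.reset()

topo = Topology(budget=WriteBudget(limit=3))
accepted = [topo.write_file(f"src/f{i}.ts") for i in range(5)]
# The first three writes land; the fourth and fifth are refused
# until critic_approves() resets the budget.
```

Note where the safety lives: not in the writer’s prompt, but in a mechanism the writer cannot talk its way past.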
Two engineers working alone on separate projects keep complaining to each other about how often their agents lose context on long tasks. One is running with default compaction; the other is manually truncating. Neither has named the surface they’re tuning. Once they do (“oh, the compaction strategy is the problem, and the progress log is how we route around it”), they stop arguing about model versions and start sharing compaction prompts and Progress Log templates. Ninety percent of harness engineering is noticing that a surface exists and giving it a name. The other ten percent is changing it.
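The “progress log routes around compaction” idea those two engineers converge on is also easy to sketch. A minimal version, with a stub summarizer standing in for the model call a real harness would make:

```python
# Minimal sketch of compaction with a pinned progress log: when
# history exceeds the window, summarize the oldest turns, but always
# re-pin the progress log verbatim so compaction can never erase it.
def compact(history: list[str], progress_log: str, max_turns: int = 6) -> list[str]:
    if len(history) <= max_turns:
        return history
    old, recent = history[:-max_turns], history[-max_turns:]
    summary = f"[summary of {len(old)} earlier turns]"  # stub summarizer
    # The progress log sits ahead of the recent turns, so it survives
    # every compaction pass unchanged.
    return [summary, f"PROGRESS LOG:\n{progress_log}", *recent]

history = [f"turn {i}" for i in range(20)]
window = compact(history, progress_log="- migrated auth module\n- tests green")
```

The tunable parts are exactly the surfaces named above: the summarizer prompt (Compaction) and the log format (Progress Log), which is why sharing templates for them beats arguing about model versions.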
Consequences
A deliberately engineered harness makes agents behave more like a senior teammate and less like a powerful stranger. The agent’s output becomes more consistent with the team’s conventions, the interventions it needs cluster in predictable places, and reviewers develop calibrated trust: they know where to read carefully and where to skim. Teams report compounding gains: each surface you tune pays out on every future session until the surface itself goes stale.
The costs are real. Harness engineering is work, and the harness becomes a project with its own maintenance burden. Instruction files drift as the codebase evolves. Tool lists accumulate dead entries. Hooks get slower as they pick up more checks. Sub-agent topologies grow overnight and rarely get pruned. A team that invests in a harness without a plan for keeping it healthy ends up with a lump of configuration nobody understands — a failure mode that Agent Sprawl names on the agent side and that applies to the configuration surfaces too. Garbage Collection matters as much for harnesses as it does for memory.
There’s also a portability question. A harness tuned for your repo is, almost by definition, less useful on someone else’s. Vendors and communities publish reasonable defaults, but the harness engineering work is where the local advantage lives, and teams that treat it as a trade-secret layer tend to outperform teams that treat it as something to share wholesale. Expect the practice to professionalize: new roles, named checklists, and a small but growing body of practitioner writing. The vocabulary in this article will probably be sharper in a year; that’s a sign the discipline is young, not that it’s fake.
Related Patterns
- Depends on: Harness (Agentic) – the object this discipline operates on.
- Pairs with: Harnessability – harness engineering works on the agent’s side of the seam; harnessability works on the codebase’s side. Both matter; neither alone is enough.
- Specializes: Context Engineering – harness engineering is context engineering generalized across sessions, tools, and governance surfaces.
- Uses: Instruction File, Tool, MCP, Skill, Subagent, Hook, Memory, Compaction, Externalized State, Worktree Isolation – the configuration surfaces a harness engineer tunes.
- Uses: Approval Policy, Bounded Autonomy – the governance layer of the harness.
- Informs: Steering Loop – the middle loop whose signals feed back into the outer harness-engineering loop.
- Senses with: Feedforward, Feedback Sensor – the control primitives the harness configures.
- Monitored by: AgentOps – the operational telemetry that tells you which surfaces need tuning.
- Enables: Dark Factory – a sufficiently mature harness is a precondition for running at Level 5.
- Protected by: Architecture Fitness Function – fitness functions keep the properties the harness depends on from decaying.
- Maintained by: Garbage Collection – harness surfaces need pruning the same way memory does.
Sources
Birgitta Boeckeler and Martin Fowler’s work on harness engineering at ThoughtWorks is the canonical framing, positioning the harness as a distinct engineering discipline rather than a vendor setting. The three-loop mental model used above builds on their “Humans and Agents in Software Engineering Loops” essay.
OpenAI’s two public writeups on the Codex harness (the 2024 philosophy post introducing harness engineering as a named practice, and the 2026 “Unlocking the Codex harness” case study on the internal App Server that shipped roughly a million lines across 1,500 pull requests) are the fullest published account of what engineering a harness at production scale actually involves.
The LangChain Terminal Bench 2.0 result (52.8% to 66.5% from harness changes alone, same underlying model) is the empirical anchor cited throughout this article. It’s the clearest public demonstration that harness work, not model work, is where current gains live.
The enumeration of configuration surfaces (instruction files, MCP, skills, sub-agents, hooks, back-pressure) emerged from the agentic coding practitioner community in early 2026, with multiple independent writers converging on roughly the same list. The six-surface version in particular was sharpened by practitioners writing up their internal harness designs publicly during that period.
Stuart Russell and Peter Norvig’s perceive-reason-act framing from Artificial Intelligence: A Modern Approach (1995) remains the intellectual ancestor: a harness is what supplies the sensors and actuators that turn a reasoner into an agent. Harness engineering is Russell and Norvig’s sensor-and-actuator design problem applied to a model whose reasoning layer you don’t control.
Further Reading
- Martin Fowler and Birgitta Boeckeler, “Harness engineering for coding agent users” – the canonical essay on the discipline, with a careful distinction between the harness and the code it operates on.
- OpenAI, “Harness engineering: leveraging Codex in an agent-first world” – the original philosophy post that names harness engineering as a practice.
- OpenAI, “Unlocking the Codex harness: how we built the App Server” – the 2026 follow-up, a concrete case study of what an industrial-scale harness looks like.
- Martin Fowler, “Humans and Agents in Software Engineering Loops” – the three-loop mental model that frames where harness engineering happens in a team’s workflow.