Task Horizon

Concept

A foundational idea to recognize and understand.

The length of task an agent can complete reliably on its own, measured against the same work done by a human expert.

Also known as: Time Horizon, Long-Horizon Task Capability

Understand This First

  • Agent – task horizon is a capability of an agent, not of a bare model.
  • Context Window – horizon and context are related but distinct capacities; the window bounds input size, the horizon bounds end-to-end task length.

What It Is

Every agent has a duration past which it starts to come apart. Under an hour, a frontier coding agent in 2026 can hold a multi-file refactor together. Give it eight hours and the same agent might drift, forget its plan, or quietly give up on a test that kept failing. The longest run it can actually close out without a human catching it is its task horizon.

Two precise versions of the number are in common use, both pioneered by METR (the Model Evaluation & Threat Research nonprofit). The 50%-time horizon is the task length, in human-expert hours, that the agent completes with 50% success. The 80%-time horizon is the stricter threshold: the length at which the agent still finishes four times out of five. Practitioners care more about the 80% number. Benchmarks report the 50% number because it’s statistically cleaner.
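
The two thresholds fall out of a single fitted curve. Here is a minimal sketch of the arithmetic, assuming a METR-style logistic fit of success probability against the log of human-expert completion time (see Sources); the coefficients are illustrative, not METR's published values:

```python
import math

def horizon(a: float, b: float, p: float) -> float:
    """Task length (human-expert hours) at which a fitted logistic
    success curve  P(success) = sigmoid(a - b * log2(t))  crosses p.

    Solving sigmoid(a - b * log2(t)) = p for t gives
    t = 2 ** ((a - logit(p)) / b), where logit(p) = ln(p / (1 - p)).
    """
    logit = math.log(p / (1 - p))
    return 2 ** ((a - logit) / b)

# Illustrative coefficients only -- not METR's published fit.
a, b = 1.2, 1.0           # intercept and (positive) slope vs log2 hours
t50 = horizon(a, b, 0.5)  # length the agent finishes half the time
t80 = horizon(a, b, 0.8)  # the stricter four-in-five threshold
print(f"50% horizon: {t50:.1f} h, 80% horizon: {t80:.1f} h")
```

The gap between the two numbers is set entirely by the slope of the fitted curve: a shallow curve spreads the thresholds apart, a steep one pulls them together.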

Horizon is not throughput. An agent that burns through 5,000 tokens a second can still have a short horizon if it loses the plot after twenty minutes. And horizon is not context window size. A million-token window can hold a week of transcripts, but the agent’s ability to stay coherent inside that window is a separate measurement. Horizon is the one that tells you whether to kick off an overnight run or stay at the keyboard.

Why It Matters

Scoping is the dominant planning question in agentic coding. “Can I let this run overnight?” “Is this task the kind of thing the agent finishes, or the kind where I need to be checking in every half hour?” Before horizon had a name, the answer was a guess calibrated against the last time you tried a task this size. With a name and a number, the decision becomes routine.

Horizon is also one of the few places in the field with a rigorous public leaderboard behind it. METR’s benchmark curves give a shared reality check: the frontier has roughly doubled every seven months since 2019, standing near two autonomous hours in early 2026 and reaching into the tens of hours with human-scheduled checkpoints. Teams can check their own scoping intuitions against those numbers instead of relying on vibes or vendor marketing.

There’s a subtler reason horizon deserves a name: it motivates every pattern in this section that exists to stretch the envelope. Compaction trades older context for a longer run. Checkpoint breaks a long task into verified stages so one missed step doesn’t rot the rest. Task Decomposition is the mitigation you reach for when the work you want is past the horizon you have. Without the horizon concept, those patterns look like scattered techniques. With it, they’re a toolkit for pushing one number up.

How to Recognize It

You can tell a task is near or past the agent’s horizon by the way the work fails. A task safely inside the horizon either finishes or errors out loudly. A task at the edge goes wrong in three characteristic ways:

Silent drift. The agent is still producing output that looks plausible, but it’s drifted off the plan it wrote an hour ago. Code compiles, tests pass, but the feature it’s shipping is not the feature it was asked for. This is the canonical long-horizon failure mode and the reason verification at the boundary matters more than at the start.

Plan loss. The agent started with a six-step plan, finished steps one through three, then dropped into ad-hoc mode for steps four and five and never came back to step six. A Progress Log or Externalized State would have caught it. Without one, you find out at the end.

Repeated surrender. The agent hits a problem, tries twice, can’t solve it, and quietly routes around it with a TODO comment or a mock. On a short task you’d have noticed. At hour six, you didn’t.

The benchmark numbers give you a shape for what to expect. As of early 2026, a frontier coding agent like Claude Opus or GPT-5 has a 50% horizon measured in hours and an 80% horizon a few times shorter. A mid-tier model sits in the tens of minutes. An agent shipped two years ago sat at the five-minute mark. The specific numbers keep moving, but the shape is stable: the 50% horizon runs a few times longer than the 80%, and both roughly double every seven months.
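
If you want to pencil that trend forward, the arithmetic is one line. A sketch of the seven-month doubling curve; treat the output as the shape of the trend, not a forecast for your repo:

```python
def projected_horizon(h_now: float, months_ahead: float,
                      doubling_months: float = 7.0) -> float:
    """Extrapolate a horizon (in hours) along the ~7-month doubling trend."""
    return h_now * 2 ** (months_ahead / doubling_months)

# If the 80% horizon on your work is ~45 minutes today, the trend line
# puts it near 3 hours in 14 months (two doublings):
print(f"{projected_horizon(0.75, 14):.1f} h")  # 3.0 h
```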

A practical field test: pick three tasks you’d give the agent, sized at what you guess is 30 minutes, 2 hours, and 8 hours of human-expert work. Run each one cold, without intervening. The longest one it finishes cleanly is your working estimate of its 80% horizon on your kind of work. Your codebase and your task shape will move the number. The METR leaderboard is the ceiling; your lived horizon on your repo is the number that matters.
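
The bookkeeping for that field test is worth writing down. A sketch, with hypothetical task names and a made-up `ran_clean` flag standing in for "finished cleanly without intervention":

```python
# Hypothetical field-test log. `hours` is your human-expert estimate;
# `ran_clean` records whether the cold run finished without help.
trials = [
    {"task": "rename config module",    "hours": 0.5, "ran_clean": True},
    {"task": "extract billing service", "hours": 2.0, "ran_clean": True},
    {"task": "migrate auth to OIDC",    "hours": 8.0, "ran_clean": False},
]

clean = [t["hours"] for t in trials if t["ran_clean"]]
working_horizon = max(clean) if clean else 0.0
print(f"Working 80% horizon estimate: ~{working_horizon} h")  # ~2.0 h
```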

Tip

When a long-running agent task fails, don’t just ask what broke. Ask when it broke. A failure at minute 45 of a two-hour run is a different story from a failure at minute 110. The first suggests a tooling or context issue; the second is usually horizon hitting its ceiling.

How It Plays Out

A developer has a half-day refactor in mind: extract a domain module from a tangled service, wire up the call sites, and back it with tests. She’s used to chunking this kind of work into two-hour sessions. Before kicking off, she checks her notes from last month: the agent she’s running handled a similar refactor cleanly in one pass, just under three hours. She hands it the whole task with a Progress Log and a checkpoint after each call-site batch. It lands in two hours forty minutes. The move that made the call wasn’t heroic agent-wrangling. It was knowing the work fit inside the horizon.

A team tries the same move with a database migration that their past experience says is a full day of careful work. They kick it off overnight, no checkpoints. They come in the next morning to find the agent reached hour five, started a migration step, failed a constraint check, silently retried with a relaxed version of the constraint, kept going, and wrote seven more steps on top. The lesson isn’t that the agent is broken. The lesson is that they overshot the horizon and didn’t put in the scaffolding (checkpoints, a plan file, a human gate at the midway mark) to survive the overshoot.

A platform team runs a nightly agent job that audits the last 24 hours of commits against the team’s architectural rules. The job is structured as 30 short runs, one per commit, each well inside the agent’s short-task horizon. They get reliable results every night. A competing team tries to do the same audit as one long sweep across all commits. It succeeds half the time, and the failures look like the agent “forgot” the rule for half the commits. The difference is decomposition: 30 horizon-sized tasks are more reliable than one task that exceeds horizon, even if the total work is the same.

Consequences

Once you have the concept, scoping decisions get cheaper. A planned task is either inside the horizon (trust the loop; keep scaffolding light), near the horizon (add checkpoints and a plan file; stay available), or past the horizon (decompose, or don’t run it autonomously at all). The decision tree is three branches and a number.
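
The three branches are simple enough to encode. A sketch, with an illustrative "near" margin of half the 80% horizon; the margin is an assumption, not something the sources specify:

```python
def scoping_policy(task_hours: float, horizon_80: float) -> str:
    """Three-branch scoping call. `task_hours` is your human-expert
    estimate; `horizon_80` is the agent's 80% horizon on your kind of
    work. The 0.5 margin is an illustrative threshold, not a measured one.
    """
    if task_hours <= 0.5 * horizon_80:
        return "inside: trust the loop, keep scaffolding light"
    if task_hours <= horizon_80:
        return "near: add checkpoints and a plan file, stay available"
    return "past: decompose, or don't run it autonomously"

print(scoping_policy(task_hours=3.0, horizon_80=2.0))  # past the horizon
```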

Budget planning gets clearer too. Long horizons are expensive: tokens, time, and the coordination cost of the scaffolding that keeps a long run honest. If a task can be done in one in-horizon run, the simple loop is cheaper than an elaborate multi-stage harness. If it can’t, the scaffolding is the price of admission. The concept lets you price these options against each other instead of treating them as matters of taste.

The downside is that horizon is a moving target and easy to misread. The frontier doubles every seven months on a curated benchmark, but your horizon on your repo moves differently: it depends on language, test suite quality, documentation, how legible your code is to an agent, and how much tacit knowledge sits outside the repo. Reading the benchmark number as a direct prediction for your work overstates the agent’s reach. Use the public numbers as the shape of the curve; calibrate the level from your own runs.

And the horizon metric elides cost. A 30-hour run that succeeds once is a datapoint on the leaderboard; it may or may not be something you’d want to pay for. Horizon answers “can the agent do this at all?” not “should I let it?” Model Routing is the companion question: once you know the work fits, you still have to pick the cheapest agent that fits it.

  • Depends on: Agent – horizon is a measurement of the full agent system (model plus harness plus tools), not the bare model.
  • Contrasts with: Context Window – window bounds input size; horizon bounds end-to-end task length. A task can fit in the window and still exceed the horizon.
  • Mitigated by: Task Decomposition – breaking a past-horizon task into in-horizon pieces is the primary mitigation when you want work the agent can’t do in one pass.
  • Mitigated by: Checkpoint – checkpoints at stage boundaries catch the silent-drift and repeated-surrender failure modes before they compound.
  • Mitigated by: Compaction – compaction trades older context for headroom, stretching the effective horizon of a single run.
  • Mitigated by: Externalized State – moving the plan and progress into files keeps them reachable at hour six even when they’ve scrolled out of useful attention.
  • Scopes: Bounded Autonomy – autonomy policies set how far the agent may go; horizon sets how long the run can coherently last. The two together define the safe operating envelope.
  • Feeds into: Ralph Wiggum Loop – Ralph is a deliberate way to operate past horizon by restarting with fresh context at every unit of work.
  • Feeds into: Back-Pressure (Agent) – near the horizon, slowing the agent down and tightening verification loops is more useful than adding parallelism.

Sources

  • METR (Model Evaluation & Threat Research) introduced the time-horizon metric and its 50%-success formulation in Measuring AI Ability to Complete Long Tasks (2025), fitting a logistic regression of success probability against the log of human-expert completion time. This is the paper that turned horizon from a loose intuition into a measurable quantity.
  • Anthropic’s 2026 Agentic Coding Trends Report named task horizon as one of the defining trends of the year, giving the term a vendor-neutral anchor outside the benchmark community.
  • The AI Digest essay A New Moore’s Law for AI Agents popularized the ~7-month doubling observation drawn from METR’s data, making the curve’s shape the part of the concept most practitioners encounter first.
  • The Epoch AI benchmark leaderboard publishes the continuing measurements, which is where the per-model numbers quoted in practitioner conversation come from.
  • METR’s Clarifying Limitations of Time Horizon (2026) sets the honest boundaries of the metric (curated task sets, elided cost, variance in human baselines) and is the source for the How to Recognize It section’s caution about reading leaderboard numbers as direct predictions.

Further Reading

  • METR, Task-Completion Time Horizons of Frontier AI Models (https://metr.org/time-horizons/) – the benchmark homepage, with methodology and current leaderboard.
  • OpenAI, Run Long Horizon Tasks with Codex (2026) – a product-facing walkthrough of designing work to fit an agent’s horizon, with concrete patterns that map directly to Decomposition and Checkpoint.
  • Philipp Schmid, Agents 2.0: From Shallow Loops to Deep Agents – frames the architectural shift that lets agents reach longer horizons in the first place.