Jagged Frontier

AI capability is shaped like a coastline, not a horizon: tasks that look equally hard to a human can fall on opposite sides of an invisible, irregular boundary between “the agent nails it” and “the agent fails confidently.”

Concept

A foundational idea to recognize and understand.

Understand This First

  • Model – the underlying capability whose shape the frontier describes.
  • AI Smell – the surface signal that a task sat just outside the frontier.

What It Is

The Jagged Frontier is the observation that AI capability is uneven in ways that don’t track human intuition about task difficulty. Inside the frontier, an agent is reliably and often spectacularly competent. Just outside it, the same agent fails in ways that look confidently correct but are wrong. The boundary between the two is not a smooth curve running from “easy” to “hard.” It has spikes, pockets, and gaps that you can only discover by probing.

The term comes from a 2023 Harvard Business School working paper, Navigating the Jagged Technological Frontier, by Dell’Acqua, McFowland, Mollick, and colleagues. They ran a field experiment with 758 consultants at Boston Consulting Group. Consultants given access to GPT-4 finished 12% more tasks and did them 25% faster, with 40% higher quality, when the work fell inside the frontier. On a task just outside the frontier, those same consultants were 19 percentage points less likely to produce a correct solution than the control group that used no AI at all. Same consultants. Same model. Opposite results, determined by which side of an invisible line the task happened to fall on.

Ethan Mollick popularized the metaphor in his One Useful Thing essays and in Co-Intelligence. The shape matters: a frontier with spikes and bays is harder to map than a straight wall. You don’t know where the line is until you find it, usually by crossing it and watching something break.

Why It Matters

Half of what this Encyclopedia teaches exists because of the jagged frontier. Verification Loop, Eval, Bounded Autonomy, Approval Policy, Generator-Evaluator, Human in the Loop: every one of these scaffolds exists because capability is unreliable in ways you cannot predict ahead of time. If the frontier were smooth, so that an agent which handled a hard task yesterday could be trusted on a slightly harder one today, most of that scaffolding would be unnecessary.

Naming the concept turns an implicit assumption into something you can cite. Readers new to agentic coding often arrive with the wrong mental model: they assume capability is like a person’s, where doing a harder task predicts the ability to do easier ones in the same area. It isn’t. The agent that just refactored a thousand-line module may fail at counting the functions it refactored. The agent that wrote a correct SQL query may botch a simpler one the next prompt. Expecting smooth capability is the biggest source of misplaced trust in an agent.

There is also a 2026-specific reason to name it now. Models are getting better, which closes off the obvious failures. The confident-but-wrong outputs that made the concept vivid in 2023 have mostly been retrained out. What remains are subtler jags: the agent that seems to understand your codebase until you ask it to count occurrences of a symbol; the plausible migration that looks correct until you reason about concurrent writes. The heuristic matters more, not less, once the easy failures are gone.

How to Recognize It

You can’t map the frontier in advance. You can only detect it empirically. Watch for these signals:

Tasks that look similar have dissimilar outcomes. You ask the agent to rename a symbol across a codebase and it succeeds. You ask it to count how many times that symbol appears and it gets it wrong. Same codebase, same kind of text processing, opposite result. This is the frontier talking.

The agent’s confidence doesn’t vary with its accuracy. On a task inside the frontier and on a task just outside it, the output looks equally assured. There is no tremor in the prose, no “I’m not sure here.” If the agent’s confidence is uniform across tasks where your own estimate of difficulty varies wildly, capability is not tracking difficulty and the frontier is active.

Performance collapses in a specific direction. Many frontiers run along predictable seams. Token-level tasks (counting letters, finding positions in a string) underperform relative to surface difficulty. Tasks requiring numeric reasoning, cross-referencing across long contexts, or inferring invariants from code fall on the harder side more often than they “should.” When you notice a seam, mark it.

Small changes produce big quality swings. Prompting the agent to solve a problem in Python versus in Haskell, or in a popular framework versus an obscure one, shouldn’t change its underlying reasoning. It does. A model that handles React fluently may stumble on the structurally similar Svelte. Capability is distributed across training data, not across concepts.

Why the Frontier Is Jagged

A model’s capability reflects the distribution of its training data more than the structure of the underlying problem. The surface difficulty of a task (how hard a human finds it) and its distributional difficulty (how well-represented it is in the training corpus) are only loosely correlated. Tokenization adds its own jags: “how many r’s in strawberry” is trivial for a human and historically hard for models because letters are not the unit the model thinks in. Leaky abstractions, where a framework’s hidden internals bleed through into the code that calls it, add more. The frontier has the shape it has because each model has its own uneven map of what it has seen, and your task has to land on a patch that was densely represented.

This is also why frontiers differ by model. Claude and GPT-4 and Gemini each have their own coastline. Model Routing is one response to this fact: pick the model whose frontier includes the task at hand. It is also why an agent that handled something well last week is not reliable evidence it will handle this week’s task. Different tasks, different patches of the map.

Warning

The most dangerous jag is the one that isn’t visible until you are already past it. The agent generates a migration script that looks clean, the tests pass, and the deploy goes out. Three hours later the first lock-contention incident surfaces. The script was fine under sequential writes and broken under concurrent ones, and the frontier ran right through “concurrency-aware reasoning.” Treat anything you can’t verify mechanically as potentially outside the frontier until proven otherwise.

How It Plays Out

A senior engineer asks an agent to rename every use of currentUser to authenticatedPrincipal across a TypeScript monorepo. The agent handles it cleanly: imports, tests, JSDoc comments, even string templates in a couple of places. A week later she asks the same agent, on the same codebase, how many files still reference the old name. The agent says “zero.” She runs grep. The answer is seven. The rename was inside the frontier; the count was outside. Nothing about the difficulty of those two tasks, from her point of view, predicted the gap. The rename required understanding structure. The count required keeping faithful arithmetic while reading tool output. Training distribution was kind to the first and cruel to the second.
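Her grep generalizes into a policy: recompute any claim you can check mechanically rather than trusting the agent’s arithmetic. A minimal sketch in Python, with the helper names (`count_references`, `verify_claim`) invented for illustration:

```python
# Sketch: never trust an agent's count of anything; recompute it mechanically.
from pathlib import Path

def count_references(root: str, symbol: str, suffix: str = ".ts") -> int:
    """Count files under `root` that still mention `symbol`."""
    return sum(
        1
        for path in Path(root).rglob(f"*{suffix}")
        if symbol in path.read_text(errors="ignore")
    )

def verify_claim(agent_count: int, actual_count: int) -> bool:
    """Accept the agent's number only if ground truth agrees."""
    return agent_count == actual_count
```

The point is not the three lines of file-walking; it is that the count task sits on the wrong side of the frontier while the rename sits on the right side, so the cheap mechanical check is what carries the trust.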

A product team delegates the first draft of a database migration to an agent. The resulting SQL is syntactically clean, uses the right data types, and includes an up-and-down script. The migration runs fine in staging. In production, it deadlocks under load because the agent wrote it as a single transaction holding locks on four tables that are normally accessed in a different order. The failure mode (concurrent-access reasoning) was far outside the frontier even though the surface task (write a migration) was well inside it. The team adds an Eval that simulates concurrent load against any agent-generated migration. They have mapped one jag. There are more.
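One shape such an eval might take, sketched in Python rather than against a real database: a static lock-order check that flags transactions acquiring the same tables in conflicting orders, which is the classic precondition for the deadlock the team hit. The table names and function names here are invented for illustration.

```python
# Sketch: flag deadlock risk by looking for cycles in the lock-order graph.
from itertools import combinations

def lock_order_edges(transactions):
    """Each transaction is an ordered list of tables it locks.
    Emit a directed edge (a, b) for every 'a locked before b' pair."""
    edges = set()
    for txn in transactions:
        for a, b in combinations(txn, 2):
            edges.add((a, b))
    return edges

def has_deadlock_risk(transactions) -> bool:
    """True if the lock-order graph contains a cycle, i.e. some set of
    tables is acquired in conflicting orders across transactions."""
    graph = {}
    for a, b in lock_order_edges(transactions):
        graph.setdefault(a, set()).add(b)
    visiting, done = set(), set()

    def cyclic(node):
        if node in visiting:
            return True
        if node in done:
            return False
        visiting.add(node)
        if any(cyclic(nxt) for nxt in graph.get(node, ())):
            return True
        visiting.remove(node)
        done.add(node)
        return False

    return any(cyclic(node) for node in graph)
```

A check like this is one mapped jag, exactly as the team found: it catches conflicting lock orders, and says nothing about the other concurrency failures still outside the map.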

A founder discovers that his agent is terrific at writing new features against his existing codebase and terrible at deleting them. Ask for a new endpoint, flawless. Ask for the correct set of files to delete when retiring an old endpoint, and the agent either misses files or proposes deleting active code. He realizes the asymmetry: creating new things is “generate text similar to other code you’ve seen”; retiring things requires reasoning about what depends on what, which is closer to Local Reasoning and farther from pattern-matching. He stops delegating deletions. That single policy change eliminates most of the incidents he used to spend his weekends recovering from.

Consequences

Internalizing the jagged frontier changes how you decide what to delegate. You stop asking “is this task hard?” and start asking “does this task live in a part of the map the agent has seen densely?” You develop a personal catalog of jags: the specific task shapes where your specific agents reliably fail. Over time this catalog is worth more than any abstract advice about when to use AI.

The cost is that there is no universal rulebook. Your catalog is yours, built from your stack, your codebase, your agents, your prompts. A teammate’s mental map of the frontier will overlap yours but won’t match it. This is uncomfortable for organizations that want a single delegation policy. The honest answer is that the policy has to be local and empirical.

The frontier also shifts under you. A model upgrade can close an old jag and open a new one. A new capability (longer context, better tool use, a different routing policy) redraws the coastline. Maps go stale. The discipline of re-probing, of running the same evals against a new model version, becomes part of the job. This is one of the strongest arguments for investing in a durable Eval suite: evals are the instrument that tells you where your current frontier runs.
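A minimal sketch of what re-probing could look like: the same probe cases run against each model version, with the result diff showing where the coastline moved. Here `agent` stands in for whatever calls your real model, and both probe cases are invented.

```python
# Sketch: a durable eval suite as a frontier-mapping instrument.
def run_evals(agent, cases):
    """Run named probes; return {name: passed} for one model version.
    Each case is (prompt, check), where check judges the agent's output."""
    return {name: check(agent(prompt)) for name, (prompt, check) in cases.items()}

def frontier_diff(old, new):
    """Jags that opened (newly failing) or closed (newly passing)
    between two model versions."""
    opened = [k for k in new if old.get(k) and not new[k]]
    closed = [k for k in new if not old.get(k, True) and new[k]]
    return {"opened": opened, "closed": closed}
```

The valuable artifact is the history of these diffs over model versions: it is your local, empirical map of the frontier, which no vendor changelog will give you.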

There is a deeper consequence for how you think about working with agents at all. Mollick identifies two strategies, which he calls Centaur and Cyborg. A centaur keeps a clear division of labor: the human handles work that is outside the frontier, the agent handles work inside it, and the line between them is explicit. A cyborg interleaves more tightly: the human and agent weave back and forth within a single task, the human nudging when the agent drifts toward an edge. Both strategies are responses to the same underlying fact. The wrong strategy is pretending the frontier isn’t there.

  • Motivates: Verification Loop – verification exists because capability is unreliable in ways you cannot predict.
  • Motivates: Eval – evals are the empirical instrument for mapping the frontier for your domain and your model.
  • Motivates: Bounded Autonomy – how much autonomy to grant depends on whether the work falls inside the frontier.
  • Motivates: Approval Policy – approval policies exist to catch work that drifted outside the frontier before it causes harm.
  • Motivates: Generator-Evaluator – the architecture assumes the generator will sometimes be outside the frontier and the evaluator’s job is to notice.
  • Motivates: Human in the Loop – a human is in the loop precisely to catch the failures the frontier produces.
  • Explains: AI Smell – AI smells are the surface signal that a task landed just outside the frontier.
  • Contrasts with: Vibe Coding – vibe coding assumes the frontier is smooth and wide; jagged-frontier thinking assumes it is not and builds verification accordingly.
  • Related to: Model – each model has its own frontier, shaped by its training data and architecture.
  • Related to: Model Routing – routing tasks to the model whose frontier includes them is one response to jagged capability.
  • Related to: Local Reasoning – reasoning that must integrate information across a large system frequently sits outside the frontier.

Sources

  • Fabrizio Dell’Acqua, Edward McFowland III, Ethan Mollick, Hila Lifshitz-Assaf, Katherine Kellogg, Saran Rajendran, Lisa Krayer, François Candelon, and Karim Lakhani introduced the term in Navigating the Jagged Technological Frontier: Field Experimental Evidence of the Effects of Artificial Intelligence on Knowledge Worker Productivity and Quality (Harvard Business School working paper 24-013, 2023). The BCG consultant experiment they report is the empirical foundation for the concept and the source of the inside/outside-the-frontier performance numbers.
  • Ethan Mollick developed and popularized the metaphor in his One Useful Thing essays, particularly “Centaurs and Cyborgs on the Jagged Frontier” (2023) and “The Shape of AI: Jaggedness, Bottlenecks and Salients” (2024), as well as in Co-Intelligence: Living and Working with AI (Portfolio, 2024). The centaur and cyborg vocabulary for working with a jagged frontier comes from these essays.
  • The tokenization explanation for why the frontier is jagged rather than smooth is a standard observation in the NLP community going back to Karpathy’s discussions of byte-pair encoding; the “strawberry” class of failures that made it famous was documented across practitioner communities in 2023-2024.