Deep Agents
The composite recipe behind every production coding agent: explicit planning, sub-agent delegation, persistent memory, and an extreme context-engineering layer. Together they turn a model in a loop into a harness that survives long tasks.
Also known as: Agents 2.0
Understand This First
- Agent – a model in a loop; the shallow building block a deep agent extends.
- Plan Mode – explicit planning is one of the four pillars.
- Subagent – delegated workers are another pillar.
- Memory – persistent state across steps and sessions is the third pillar.
- Context Engineering – the instruction and context layer is the fourth pillar.
Context
At the agentic level, “Deep Agents” names the composite architecture that Claude Code, Codex, Manus, Deep Research, and their peers all share. It is not a single feature but a recipe of four pillars applied together: the agent makes a plan and writes it down, delegates focused work to sub-agents with isolated context, persists state to an external store so nothing important lives only in the context window, and runs under a long, carefully authored system prompt that governs thousands of small decisions.
The name crystallized in 2026. Philipp Schmid framed the shift as “Agents 2.0: From Shallow Loops to Deep Agents,” LangChain shipped a deepagents SDK that generalizes the Claude Code architecture, and the 2026 practitioner literature converged on the same four pillars. Shallow agents are the agent primitive: a model in a loop with a handful of tools, an implicit plan, and a single conversation as its only memory. Deep agents are what that primitive becomes once you engineer it hard enough to survive a multi-hour refactor. Naming the composite lets you recognize it when you meet it, reason about what each pillar buys, and reach for the full recipe deliberately rather than reinventing pieces of it under pressure.
Problem
Why does Claude Code feel qualitatively different from a naked GPT-4 loop? Why does a shallow agent fall apart after twenty tool calls on a real codebase while a production harness keeps going for hours?
A single-loop agent has no plan it can re-read, no way to hand off focused work, no memory beyond its context window, and a short system prompt that can’t cover the thousand small decisions a real task requires. Each of those gaps is survivable on a five-step task. All of them at once, on a multi-hour task, are fatal. The agent forgets its own goal, saturates its context with tool output, loses the thread after one dead end, and produces confidently wrong results because nothing reminded it of the constraints that applied twenty turns ago. Patching one pillar in isolation doesn’t help much: planning without memory forgets the plan, memory without delegation saturates the orchestrator, delegation without a careful system prompt produces chaotic sub-agent behavior. The question isn’t which pillar to add first; it’s how the four compose into something that holds together.
Forces
- Task length vs. context budget. Long tasks generate more tool output, plans, and partial results than any single context window can hold.
- Goal persistence vs. step locality. Each step needs focused attention on its own work, but the overall goal must survive across steps without rereading everything.
- Specialization vs. coherence. Different subtasks (research, design, implementation, review) want different prompts and tools, but the final result must still cohere.
- Flexibility vs. reliability. The agent needs to adapt to whatever the task demands, but it also needs to behave predictably enough that a human can trust it unattended.
- Power vs. cost. Every pillar adds tokens, latency, and moving parts; the recipe has to earn its overhead on tasks where a shallow loop would fail.
Solution
Build the agent around four pillars, applied together.
1. Explicit planning. The agent writes a plan before it acts, and the plan is an inspectable artifact, not a chat message. Claude Code’s TodoWrite is the canonical example: a structured list the agent can re-read, update, and check off. LangChain’s deepagents exposes a planning_tool that does the same job. The plan survives compaction, it survives hand-offs to sub-agents, and it survives the reader who wants to know what the agent thinks it’s doing.
2. Sub-agent delegation. Focused work happens in sub-agents with isolated context windows, invoked through a delegation tool (Claude Code’s Task, LangChain’s sub_agents). The orchestrator doesn’t read the codebase itself; it asks a research sub-agent to read the codebase and summarize. The orchestrator doesn’t write the fifteen-file refactor; it dispatches implementation sub-agents that return diffs. Each sub-agent keeps its own working memory out of the orchestrator’s window. See Orchestrator-Workers for the hierarchical composition and Subagent for the primitive.
3. Persistent memory. State lives outside the context window: on the filesystem, in a vector store, in a scratchpad directory, in the project’s own files. The agent writes notes, intermediate results, tool outputs, and the plan itself to files it can re-read. Compaction is safe because the important stuff isn’t lost when the window compresses; it was already on disk. Sessions can end and resume because the next session starts by reading the plan file and the scratchpad. See Externalized State and Memory.
4. Extreme context engineering. The system prompt is long, specific, and load-bearing. Claude Code’s system prompt runs past twenty thousand tokens. It names the tools, defines when to plan and when to act, specifies how to name files, dictates how to handle refusals, enumerates the failure modes to watch for. The instruction file extends the system prompt with project-specific conventions, and skills package reusable expertise on top. The agent isn’t clever because the model is clever; the agent is clever because the prompt told it how to think about this particular kind of work.
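The layering that pillar four describes can be sketched as plain string assembly. This is a hypothetical illustration, not Claude Code's actual loading logic: the CLAUDE.md name comes from the text, while the skills directory layout and function name are invented for the example.

```python
import tempfile
from pathlib import Path

def build_system_prompt(base_prompt: str, repo_root: Path) -> str:
    """Layer the context: base harness prompt, then the project's
    instruction file, then any packaged skills on top."""
    parts = [base_prompt]
    instruction_file = repo_root / "CLAUDE.md"  # project-scoped conventions
    if instruction_file.exists():
        parts.append(instruction_file.read_text())
    for skill in sorted((repo_root / "skills").glob("*.md")):  # hypothetical layout
        parts.append(skill.read_text())
    return "\n\n".join(parts)

# Demonstration against a throwaway repo.
repo = Path(tempfile.mkdtemp())
(repo / "CLAUDE.md").write_text("Run poetry run pytest, never bare pytest.")
(repo / "skills").mkdir()
(repo / "skills" / "review.md").write_text("When reviewing, check migrations first.")
prompt = build_system_prompt("You are a careful coding agent.", repo)
```

The point of the sketch is that the "cleverness" is additive text: swap the instruction file or the skills and the same model behaves like a different agent.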
Each pillar addresses a specific shallow-agent failure mode. Planning fixes goal loss. Sub-agents fix context saturation. Memory fixes amnesia. Context engineering fixes the thousand small decisions the model would otherwise guess at. Remove any one pillar and the others can’t cover for it. That’s why the composite matters more than any single technique.
If you are building an agent from scratch, add the pillars in the order they will bite you. A short task can survive without memory. A medium task can survive without sub-agents. A long task can survive without a careful system prompt for a while. But none of them survive without a plan you can re-read, so that is the pillar to install first.
How It Plays Out
A developer asks Claude Code to migrate a Python service from SQLAlchemy 1.4 to 2.0. The model doesn’t start editing. It runs the planning tool and writes out a seven-step plan: audit current usage, identify breaking changes, design the migration order, update the models, update the queries, run the tests, patch anything the tests catch. The plan lives as a TodoWrite artifact the agent re-reads between steps.
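A TodoWrite-style plan artifact can be approximated in a few lines: the plan is a file the agent rewrites, not a message it hopes stays in context. The file location and status values here are invented for illustration.

```python
import json
import tempfile
from pathlib import Path

PLAN_FILE = Path(tempfile.mkdtemp()) / "plan.json"  # hypothetical scratchpad path

def write_plan(steps):
    # The plan is an inspectable artifact on disk, not a chat message.
    plan = [{"step": s, "status": "pending"} for s in steps]
    PLAN_FILE.write_text(json.dumps(plan, indent=2))
    return plan

def check_off(step_name):
    # Re-read, update, rewrite: the file is the source of truth,
    # so the plan survives compaction and session restarts.
    plan = json.loads(PLAN_FILE.read_text())
    for item in plan:
        if item["step"] == step_name:
            item["status"] = "done"
    PLAN_FILE.write_text(json.dumps(plan, indent=2))
    return plan

write_plan(["audit current usage", "identify breaking changes", "run the tests"])
plan = check_off("audit current usage")
```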
For the audit step, the agent dispatches a sub-agent with a focused prompt: “find every SQLAlchemy import and the call sites that will break under 2.0.” The sub-agent runs grep and file reads in its own context window and returns a one-screen summary. The orchestrator’s window stays clean. The audit results go into a scratchpad file the agent updates as it works.
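The dispatch pattern above can be sketched with plain functions: the sub-agent runs against its own transcript and returns only a summary, so the orchestrator never sees the raw tool output. The model call is stubbed here; in a real harness it would be the LLM loop itself.

```python
def run_subagent(prompt: str, model_call) -> str:
    # The sub-agent owns its own transcript; everything it reads or
    # greps stays here. Only the final summary crosses back.
    transcript = [{"role": "user", "content": prompt}]
    result = model_call(transcript)  # isolated agent loop, elided
    return result["summary"]

def orchestrator(task: str, model_call) -> dict:
    # The orchestrator's context accumulates one-screen summaries,
    # never raw grep output or full file contents.
    audit = run_subagent(
        f"Find every call site that will break for: {task}", model_call
    )
    return {"task": task, "audit": audit}

def fake_model(transcript):
    # Stand-in for the real model, for illustration only.
    return {"summary": "15 call sites use removed 1.4 Query APIs"}

report = orchestrator("SQLAlchemy 1.4 -> 2.0 migration", fake_model)
```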
When the context window fills up on step five, compaction runs, but the plan, the audit results, and the in-progress diffs are all on disk. The agent rereads them and keeps going. The CLAUDE.md file in the repo told it to run poetry run pytest rather than pytest directly, and it did, because the long system prompt told it to read CLAUDE.md before assuming anything about the test runner. Four hours in, the migration lands.
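The resume behavior reduces to a simple contract: anything worth keeping goes to disk before the window compresses, and a fresh turn rebuilds its working set by reading it back. The scratchpad layout below is invented for the sketch.

```python
import tempfile
from pathlib import Path

SCRATCH = Path(tempfile.mkdtemp()) / "scratchpad"  # hypothetical layout

def save_note(name: str, content: str) -> None:
    # Write before compaction: the file, not the window, is durable.
    SCRATCH.mkdir(parents=True, exist_ok=True)
    (SCRATCH / f"{name}.md").write_text(content)

def resume() -> dict:
    # A post-compaction turn (or a new session) rebuilds state from disk.
    return {p.stem: p.read_text() for p in SCRATCH.glob("*.md")}

save_note("audit", "15 call sites break under 2.0")
save_note("progress", "models updated; query layer still pending")
state = resume()
```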
Now picture the same task given to a shallow agent: a single loop with file-reading and shell tools, no sub-agents, no scratchpad, a three-hundred-token system prompt. The agent starts editing files immediately because it has no planning discipline. The audit runs inline and fills the context with grep output. By the fifth model file, the window is saturated with earlier diffs and tool responses, and the agent forgets that the query layer also needs updating. It runs pytest from the wrong directory, misreads the failure, and confidently reports success on a test suite that never actually ran. The task fails not because the model was weak but because the harness around it was shallow.
Here is the same four-pillar recipe visible in LangChain’s deepagents SDK:
from deepagents import create_deep_agent

agent = create_deep_agent(
    tools=[search_web, read_file, write_file, run_shell],
    instructions=long_system_prompt,           # pillar 4
    subagents=[research_agent, review_agent],  # pillar 2
    # planning_tool is built in (pillar 1)
    # filesystem_backend is built in (pillar 3)
)
The names are different from Claude Code’s, but the pillars are the same. A planning_tool for the TodoWrite equivalent, a subagents parameter for delegation, a filesystem backend for persistence, and a long instructions string for the context-engineering layer. Recognizing the shape makes switching frameworks a matter of translation, not re-architecture.
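The translation the paragraph describes is literally a lookup table. The names below are taken from this text; treat them as a mnemonic for the shape, not an exhaustive API reference for either harness.

```python
# One pillar per row; two harnesses' names for the same job.
PILLAR_NAMES = {
    "planning":            {"claude_code": "TodoWrite", "deepagents": "planning_tool"},
    "delegation":          {"claude_code": "Task",      "deepagents": "subagents"},
    "memory":              {"claude_code": "scratchpad files / repo state",
                            "deepagents": "filesystem backend"},
    "context_engineering": {"claude_code": "long system prompt + CLAUDE.md",
                            "deepagents": "instructions"},
}

def translate(pillar: str, target: str) -> str:
    # Switching frameworks is a rename, not a re-architecture.
    return PILLAR_NAMES[pillar][target]
```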
The long system prompt is load-bearing and fragile. Every behavior you rely on from a deep agent is written somewhere in those twenty thousand tokens. Delete the wrong sentence and the agent stops planning, or stops delegating, or starts over-editing. Treat the system prompt like production code: review changes, keep a changelog, test before shipping.
Consequences
Benefits. The recipe extends the task horizon by an order of magnitude. A shallow agent that fails at thirty minutes becomes a deep agent that works for four hours. Sub-agent delegation keeps the orchestrator’s context clean even on tasks that touch hundreds of files. Persistent memory turns interruptions and compaction events into non-events rather than disasters.
The long system prompt lets a fixed model behave dramatically differently across domains: the same Claude model writes Python one hour and reviews contracts the next, because the prompt told it how. Readers who recognize the recipe can reason about why a given harness works, evaluate frameworks by whether they support all four pillars, and notice when their own agent is shallow on the dimension that’s about to bite them.
Liabilities. Deep agents are expensive. Every planning step, every sub-agent dispatch, every file write, and every twenty-thousand-token system prompt costs tokens and wall-clock time. They over-engineer small tasks: asking a deep agent to add a one-line import is absurd when a shallow loop would finish before the plan was written. They also accumulate filesystem cruft: scratchpad files, stale plan artifacts, and abandoned sub-agent outputs pile up unless someone prunes them.
The orchestrator’s context can still saturate if sub-agent responses aren’t summarized aggressively, and sub-agents can scope-creep when their prompts don’t constrain them tightly. The long system prompt becomes a maintenance burden that no single engineer understands end-to-end, and observability gets harder: tracing why a sub-agent two levels down made a given choice requires logging at every level. The recipe’s power is its own trap, because a team that always reaches for deep agents stops learning when a shallow loop would have been the right answer.
Related Patterns
- Depends on: Agent – a deep agent is an engineered agent, not a different species.
- Depends on: Plan Mode – the planning pillar.
- Depends on: Subagent – the delegation pillar.
- Depends on: Memory – the persistence pillar.
- Depends on: Externalized State – how memory lives on disk.
- Depends on: Context Engineering – the broad discipline the fourth pillar applies.
- Depends on: Instruction File – the project-scoped extension of the system prompt.
- Uses: Compaction – deep agents can compact aggressively because persistence keeps compaction cheap; nothing important lives only in the window.
- Uses: Skill – packaged expertise layered on top of the system prompt.
- Uses: Progress Log – one common shape for persistent memory.
- Refines: Harness (Agentic) – a deep agent is a specific, opinionated harness.
- Refines: Orchestrator-Workers – the delegation pillar instantiates orchestrator-workers; the deep agent is the four-pillar composite.
- Refines: Research, Plan, Implement – a three-phase workflow that fits naturally inside a deep agent.
- Contrasts with: Ralph Wiggum Loop – a shallow loop that restarts on each iteration; the opposite engineering bet from a deep agent.
- Extends: Task Horizon – the recipe exists to push the horizon further out.
Sources
- Philipp Schmid’s Agents 2.0: From Shallow Loops to Deep Agents (2026) crystallized the framing and named the architectural generation shift. The four-pillar decomposition used here matches his taxonomy.
- LangChain’s deepagents SDK and the accompanying blog series (Deep Agents, Building Multi-Agent Applications with Deep Agents, Deep Agents v0.5) formalized the recipe in code and generalized it beyond Claude Code. The SDK’s parameter names (planning_tool, sub_agents, filesystem_backend, system_prompt) are the clearest external evidence that the four-pillar decomposition is the pattern.
- Anthropic’s Claude Code team produced the exemplar. The long system prompt, TodoWrite, Task delegation, and CLAUDE.md conventions are the canonical reference implementation of each pillar, even though Anthropic did not publish a paper naming the composite.
- The DAIR.AI Prompt Engineering Guide added a dedicated Deep Agents page that codified the term for a pedagogical audience.
- The shift is continuous with the broader multi-agent systems literature going back to the 1990s (Wooldridge, Jennings). What’s new in 2026 is the convergence on a specific four-pillar recipe and the engineering maturity to build it on top of commercial LLMs.
Further Reading
- Agents 2.0: From Shallow Loops to Deep Agents by Philipp Schmid – the clearest single introduction to the framing: https://www.philschmid.de/agents-2.0-deep-agents
- Deep Agents on the LangChain blog – the original product announcement and motivation: https://blog.langchain.com/deep-agents/
- Deep Agents v0.5 on the LangChain blog – the evolution toward async sub-agents and remote delegation: https://blog.langchain.com/deep-agents-v0-5/
- LangChain deepagents documentation – the reference implementation in code: https://docs.langchain.com/oss/python/deepagents/overview
- Deep Agents in the DAIR.AI Prompt Engineering Guide – a pedagogical summary with the four-pillar decomposition: https://www.promptingguide.ai/agents/deep-agents