Context Offloading
Route large tool results to the filesystem and pass the agent a summary plus a reference, so the active context stays small while the full payload remains retrievable.
Also known as: Offload Context, Filesystem Scratchpad, Dynamic Context Discovery.
Understand This First
- Context Window — the finite resource that makes offloading worth doing.
- Context Rot — the failure mode that tool exhaust accelerates.
- Tool — the surface where offloading is implemented.
Context
At the agentic level, context offloading is a discipline for handling tool output. When a tool returns more material than the agent needs to reason about right now, you write the full payload to a file and hand the agent a short summary plus a reference. The agent reads the file only if the summary turns out to be insufficient. The active context window stays focused on the work; the bulky payload sits on disk, available on demand.
The pattern crystallized around 2025 as practitioners building production coding agents hit the same wall from several directions. Manus described treating the file system as infinite memory and writing old tool results out to keep working memory clean. Cursor wrote about “dynamic context discovery,” where the agent gets a head and tail of long output and pulls the rest as needed. LangChain catalogued “offload context” as one of seven core context-engineering moves. Anthropic’s Claude Code bakes the pattern into its built-in tools: `Read` returns a slice with the rest available by `offset` and `limit`, and `Bash` will redirect long output to a file the agent can revisit. The names differ; the move is the same.
Problem
How do you let an agent call powerful tools without letting the volume of their output crowd out everything else the agent needs to think about?
A grep returns 2,000 lines and the conversation now has 2,000 lines of code in it, none of which the agent has decided are relevant yet. A database query returns 5,000 rows and every subsequent message carries 5,000 rows of cognitive overhead. A Read of a large file fills 30,000 tokens with material the agent will scan once and never look at again. An MCP server registers fifty tools, each with a 500-token description; the agent now sees 25,000 tokens of catalogue before it has even thought about what to do.
You can feel the problem within a single afternoon of running an agent on a real task. The window fills with tool exhaust the agent never asked the model to read carefully. By the time the agent has to reason about the next step, the relevant context is buried in noise, and the conversation tips into Context Rot: the agent’s outputs get vaguer, repeats start creeping in, earlier decisions get forgotten. Loading less isn’t the answer either, because the agent genuinely needed that grep, that query, that file. You need a way to call powerful tools without paying for them in working memory.
Forces
- Variable payload size. Tool outputs vary by orders of magnitude across the same session — sometimes a one-line answer, sometimes ten thousand rows. You cannot tune the window for the average case.
- Reasoning quality vs. retrieval cost. Pulling a payload back from disk costs a tool call. Letting it sit in the active context costs reasoning quality across every subsequent turn. The second cost is bigger and easier to underestimate.
- The agent has to know to come back. A summary that is too lossy hides the fact that the agent should re-read; a summary that is too generous defeats the purpose. Summary design is load-bearing.
- Auditability. A human reviewing the conversation may want to see exactly what the agent saw. If the payload only ever lived on disk, that audit trail has to point at the file, not at the chat.
- Cleanup. Files written during a session accumulate. Without gardening, the scratch space turns into clutter that the agent stumbles over later.
Solution
Wrap your tools so they write large outputs to a file and return a structured summary plus a reference, instead of returning the raw payload. The agent’s next turn sees the summary; it reads the file only if it decides the summary is not enough.
The minimum viable shape has two fields:

```json
{
  "summary": "2,043 matches for `parse_ast` across 87 files. Top files by match count: src/parser/core.rs (412), src/ast/walker.rs (188), src/lint/rules.rs (104). Full results in /tmp/agent/grep_47.txt.",
  "ref": "/tmp/agent/grep_47.txt"
}
```
The summary is the agent’s decision surface. Write it so the agent can answer the obvious follow-ups (“which file should I look at first?”, “is the term I expected even present?”) without paying for the full payload. Where helpful, structure the summary itself as a small index: top results, distribution by category, anything that supports the next reasoning step. If the agent decides it needs the full file, it reads it on the next turn.
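A minimal sketch of such a wrapper, assuming a grep-like tool whose raw matches arrive as `(path, line, text)` tuples; the function name, directory layout, and two-field return shape are illustrative, not any real harness API:

```python
import tempfile
from collections import Counter
from pathlib import Path

def offload_grep(matches, scratch_dir, call_id):
    """Write the full match list to disk; return a summary plus a reference."""
    scratch = Path(scratch_dir)
    scratch.mkdir(parents=True, exist_ok=True)
    ref = scratch / f"grep_{call_id}.txt"
    ref.write_text("\n".join(f"{p}:{n}: {t}" for p, n, t in matches))

    # The summary doubles as a small index: total count plus top files,
    # so the agent can pick its next read without the full payload.
    by_file = Counter(p for p, _, _ in matches)
    top = ", ".join(f"{p} ({c})" for p, c in by_file.most_common(3))
    return {
        "summary": f"{len(matches)} matches across {len(by_file)} files. "
                   f"Top files: {top}. Full results in {ref}.",
        "ref": str(ref),
    }

result = offload_grep(
    [("src/a.rs", 10, "parse_ast(x)"), ("src/a.rs", 42, "parse_ast(y)"),
     ("src/b.rs", 7, "parse_ast(z)")],
    tempfile.mkdtemp(), 47)
```

The agent's next turn sees only `result["summary"]`; the file behind `result["ref"]` costs nothing until the agent decides to read it.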
Apply the same shape across the tool surface, not just to one tool. Long file reads return a slice plus a path and offset so the agent can page in more. Long shell commands redirect to a logfile and return the head and tail. MCP server discovery returns one-line tool descriptions with a fetch-by-name for the full schema. Conversation history older than N turns gets checkpointed to disk and replaced in the window with a one-paragraph summary. The pattern is uniform: the wrapper, not the model, owns the decision about what to keep in the active context.
Two practices make offloading work in production.
Make the summary trustworthy. A summary that omits a critical detail will silently steer the agent wrong. The agent doesn’t know what was dropped. Where you cannot summarize without losing fidelity (close textual comparison, regulatory text, a diff that has to be read line by line), don’t offload; return the payload. Offloading is for material the agent samples, not material it has to read end-to-end.
Garden the scratch space. Files written during a session are session-scoped. Use predictable paths (/tmp/agent/<session>/<tool>_<n>.<ext>), and let the harness clean them up at session end. If the agent has to navigate a folder of stale files from previous runs to find the one it just wrote, you have made the problem worse, not better.
When you wrap a tool, write the summary first and the file path second, then review the summary as if you were the model deciding whether to read the file. If you wouldn’t know whether to open it, neither will the agent.
How It Plays Out
A coding agent is refactoring a parser in a large Rust repo. It calls Read on src/parser/core.rs, which is 4,200 lines. The wrapped tool returns the first 200 lines and a one-line summary: "src/parser/core.rs (4,247 lines): top-level pub items include Parser, ParseError, parse_module, parse_expr; the rest available with offset/limit." The agent sees the public surface in 200 lines, decides it needs the body of parse_expr, and calls Read again with offset: 1240, limit: 180. It never reads the unrelated lexer at the bottom of the file. The window cost of touching this file is around 400 lines instead of 4,200.
A research agent has been working through a question for ninety minutes across forty turns. The earliest turns were exploration that has long since been superseded. The harness rolls all turns older than the last fifteen into a single summary: "Earlier turns (1-25, checkpointed at /tmp/agent/sess_b/history_1.json): explored three hypotheses (A, B, C). A and B ruled out by experiments in turns 12 and 18. C is the live thread; current focus is verifying its corollary." The active window now carries one paragraph instead of twenty-five turns of dead exploration, and the agent’s next move is grounded in what’s still relevant.
An MCP-heavy agent connects to a server with fifty registered tools. Instead of accepting fifty 500-token descriptions on every turn, the harness returns a single index (one line per tool, name and one-sentence purpose) plus a fetch_tool_schema(name) call. The agent reads the index, picks the three tools it needs, and pulls their full schemas only as it’s about to call them. Tool registration cost drops from 25,000 tokens to roughly 600.
Offloading does not work for tasks where every detail must flow through the model in full. Close legal-text comparison, line-by-line diff review, and audits that depend on noticing the one anomaly in a long list all require the payload in the window. Offloading those tasks risks the model deciding the summary is good enough when it is not.
Consequences
Offloading turns tool output from a tax on the active window into a resource the agent can sample on demand. The window stays available for reasoning, planning, and the parts of payloads that genuinely matter. Long sessions hold their coherence further into the task; tool-heavy workflows stop choking on their own success. Offloaded payloads also become a side-effect audit trail: the human reviewing the conversation can open the same file the agent saw, instead of trying to reconstruct what was in a window that has since been compacted.
The costs are real. The summary becomes load-bearing: a poorly designed summary silently steers the agent toward the wrong conclusion, and unlike a missing tool call this failure leaves no obvious trace. The agent has to know it can re-read; if your harness offloads but doesn’t teach the agent how to fetch back, you’ve just hidden the data. The scratch space accretes files that need cleanup. And there’s a category of task (close-reading work where every word has to be in the window) where offloading is the wrong move, and you have to recognize that case before you reach for the wrapper.
The reframe worth keeping: offloading is a discipline for which failures dominate, not a guarantee against failure. It trades “the window fills up and reasoning degrades” for “the summary occasionally hides something the agent needed.” The first failure is gradual and silent and accumulates over a long session. The second is local, debuggable, and visible the moment the agent’s answer is wrong. That trade is almost always worth making.
Related Patterns
| Relation | Pattern | Note |
|---|---|---|
| Complements | Externalized State | Both write to disk, but externalized state is for humans (plan, progress, audit); offloading is for the agent itself (payloads it may want to revisit). |
| Complements | Progressive Disclosure | Output-side sibling to input-side disclosure: progressive disclosure tiers what arrives, offloading tiers what departs. |
| Depends on | Context Window | Offloading exists because the window is finite; on an infinite window the pattern would have no purpose. |
| Enables | Thread-per-Task | Keeping the active window lean lets a single thread stay coherent for longer before it has to be retired. |
| Extends | Compaction | Compaction is the lossy recovery move when the window is already full; offloading is the lossless prevention that keeps it from filling. |
| Mitigates | Context Rot | Tool exhaust crowding the window is one of the main drivers of context rot; offloading removes it from the active context. |
| Specializes | Context Engineering | Context offloading is one specific technique within the broader context-engineering discipline. |
| Uses | Tool | Offloading is implemented at the tool layer: the wrapper writes the payload and returns a summary plus a reference. |
Sources
- The pattern of treating the filesystem as the agent’s overflow memory was developed publicly by the Manus team in their context-engineering writeups (October 2025), which named the move and made the case for it as the central discipline of long-running agent sessions.
- Cursor documented an equivalent mechanism under “Dynamic Context Discovery,” reframing the same idea around the agent paging through long output with `tail`- and `head`-style reads instead of swallowing the whole payload.
- The LangChain team catalogued “Offload Context” as one of seven core agent-design patterns alongside Cache, Isolate, Evolve, Progressive Disclosure, Multi-Layer Action Space, and Give Agents A Computer, framing offloading as a peer to the other context-engineering moves rather than a special case.
- Anthropic’s Claude Code bakes the pattern into its built-in tool API: `Read` returns a slice with the rest available by `offset` and `limit`; `Bash` will redirect long output to a logfile that the agent can revisit. The tool API itself is the pattern’s clearest production reference.
- The broader observation that bloated context windows degrade reasoning quality before they hit any hard limit is the through-line of the Context Rot literature; offloading is one of the discipline-level responses that line of work motivates.