Context Window
The bounded working memory inside which a model sees everything it knows about the current task.
Understand This First
- Model — the context window is a property of the model.
What It Is
The context window is the bounded working memory available to a model during a single interaction. Everything the model can “see” (the system prompt, the conversation history, files or documents you’ve handed it, and its own previous responses) has to fit inside this window. The window is measured in tokens (roughly, word fragments), and its size is a property of the model, not the harness.
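The "does it fit" question can be made concrete with a rough budget check. This is a minimal sketch, not a real tokenizer: the four-characters-per-token ratio is a common rule of thumb for English text, the names `estimate_tokens` and `WindowBudget` are hypothetical, and it assumes the model's reply shares the window with the input, which is common but worth checking for your model.

```python
from dataclasses import dataclass

# Assumption: ~4 characters per token is a rough rule of thumb for English
# text; real tokenizers vary by model and by content.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Return a rough token estimate for text (heuristic, not a tokenizer)."""
    return max(1, len(text) // CHARS_PER_TOKEN) if text else 0


@dataclass
class WindowBudget:
    capacity: int          # window size in tokens (a property of the model)
    reserved_output: int   # room kept free for the model's reply

    def fits(self, parts: dict[str, str]) -> tuple[bool, int]:
        """Return (fits, tokens_used) for the named pieces of context."""
        used = sum(estimate_tokens(p) for p in parts.values())
        return used + self.reserved_output <= self.capacity, used
```

Calling `WindowBudget(capacity=128_000, reserved_output=4_000).fits({...})` with the system prompt, history, and files as separate entries shows which piece is consuming the budget. For real decisions, use the token counts your provider or harness reports.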
As of 2026, frontier models commonly offer one million tokens of context, with some reaching ten million. Mid-tier models start at 128K. The numbers keep climbing, but the shape of the constraint doesn’t change: everything the model knows about right now has to live in the window, and anything that falls outside the window is invisible to the model from that turn onward. The model doesn’t experience the missing content as a gap. It works from what it has and generates plausible output for the rest.
Inside the window, attention is also uneven. The same one-million-token capacity does not give every token equal weight: information near the beginning and the end of the context gets more attention than information in the middle. Researchers call this the “Lost in the Middle” effect, and it is still present in production models at million-token scale. A larger window buys capacity; it does not buy uniform comprehension.
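One way practitioners respond to uneven attention is to place the most important material at the edges of the context and let the least important material sit in the middle. The sketch below illustrates that ordering only; the function name and the priority scores are hypothetical, and whether reordering helps a given model on a given task is an assumption to test, not a guarantee.

```python
def place_at_edges(items: list[tuple[str, float]]) -> list[str]:
    """Order items so the highest-priority ones sit at the start and end.

    items: (text, priority) pairs; higher priority means more important.
    """
    ranked = sorted(items, key=lambda pair: pair[1], reverse=True)
    front: list[str] = []
    back: list[str] = []
    for index, (text, _priority) in enumerate(ranked):
        # Alternate: 1st goes to the front, 2nd to the back, and so on,
        # so the least important items end up in the middle.
        (front if index % 2 == 0 else back).append(text)
    return front + back[::-1]
```

With five items ranked a through e, this yields the order a, c, e, d, b: the two most important items sit first and last, and the least important sits in the center.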
The window is distinct from a few adjacent ideas that often get confused with it:
- It is not the model’s training data. Training data is what the model learned from before deployment; the window is what it is looking at right now.
- It is not memory. Memory is a mechanism for carrying information across windows. The window itself is single-session.
- It is not the prompt. The prompt is a particular thing you write into the window; the window is the container that holds it alongside everything else.
Why It Matters
The window is the single most consequential constraint in agentic coding. It governs how much code an agent can reason over in one pass, how long a conversation can run before it starts losing the thread, and how much room is left for instruction files, retrieved snippets, and tool output. Once you have the term, decisions that used to feel ad hoc resolve into the same question: does this fit, and what’s the cost of making it fit?
The window also creates an asymmetry the practitioner has to feel in their bones. You can walk away from a long session, sleep on it, come back, and recover your full understanding. The model can’t. Once a fact leaves the window, the agent doesn’t know its picture has become partial; it works from what looks to it like a complete picture that happens to be missing pieces. Without vocabulary for the window, that failure mode reads as the model being flaky. With the vocabulary, it reads as a resource exhaustion problem you can manage.
The term also locates a real tradeoff that practitioners argue about constantly. A bigger window costs more per call, slows responses, and dilutes attention; a smaller window forces sharper choices about what to include. There’s no universal right setting. The vocabulary makes the argument productive: instead of “the agent is dumber than yesterday,” you can say “the window is saturated” or “the relevant context is past the attention sweet spot” and act on it.
How to Recognize It
A few signals tell you the window is the thing you’re looking at:
The agent quietly stops honoring earlier instructions. You spent the first message establishing that the project uses TypeScript with strict null checks. Sixty messages later, the agent returns JavaScript with loose typing. Your instructions haven’t changed; they’ve scrolled out of the model’s effective attention. If the model’s behavior shifts and nothing in the recent turns explains it, suspect the window.
The agent contradicts itself across turns. Early in the session, the agent recommended approach A and explained why B was a dead end. An hour later, it proposes B again as if it were a fresh idea. That’s the tell that the earlier reasoning is gone from the window — or at least far enough from the attention sweet spot that the model can’t reach it.
Quality degrades smoothly with conversation length. Replies that started crisp and specific become hedgier, vaguer, more boilerplate. The agent starts producing the kind of answer it would have given on turn one if you’d asked the question with no context. It’s reverting to its training-data baseline because the session-specific context is no longer in effective reach.
The token meter creeps toward the cap. When the harness exposes a token count (Claude Code, API instrumentation, a sidebar in the IDE), watch it. Steady growth past 80% of the window with no compaction firing usually means you’re heading for a hard cutoff rather than a graceful handoff.
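The token-meter signal reduces to a small check. This is a sketch under stated assumptions: `window_pressure` is a hypothetical helper, the 80% default comes from the guidance above, and the inputs are whatever counts and compaction status your harness actually reports.

```python
def window_pressure(used_tokens: int, capacity: int,
                    compaction_fired: bool, warn_at: float = 0.80) -> str:
    """Classify window pressure from a token count the harness exposes."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    fraction = used_tokens / capacity
    if fraction >= 1.0:
        return "full"
    # Past the threshold with no compaction yet: time to compact or reset.
    if fraction >= warn_at and not compaction_fired:
        return "warn"
    return "ok"
```

A "warn" result is a prompt to compact or start a fresh thread, not proof that quality has already degraded.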
Two diagnostic moves separate window pressure from other failure modes. First, restate the lost instruction in the most recent turn; if behavior snaps back, the constraint was the window. Second, start a fresh thread with the current state summarized up front; if quality returns, the prior session was saturated.
When an agent starts ignoring conventions it followed earlier in the session, the cheapest test is to restate them in the next turn. If the agent immediately complies, the instructions were pushed past the model’s effective attention; treat that as a signal to compact or start a fresh thread before the rest of your work goes the same way.
How It Plays Out
A developer is ninety minutes into a refactor with an agent. The first message established a project convention for error handling: specific exception types, no bare try/except. By the time the agent is wiring up the fifth module, it starts emitting bare try/except blocks. The developer restates the convention; the agent apologizes and corrects. Restating works for one turn. By the next message, the bare blocks are back. The convention has reached the point where it can be retrieved when explicitly cited but is no longer informing the agent’s defaults. The session needs compaction or a thread reset, not another scolding.
A platform team works on a tangled legacy module where one function pulls in five files of context to reason about. The agent works, but slowly, and it spends most of its window on navigation. They restructure the module to support local reasoning: the function’s dependencies get narrowed, types get tightened, and a short module-level comment names the invariants. Afterward, the agent can hold the complete picture of that function in a fraction of the window it used to need. The window didn’t change. The amount of context the work required did.
“Read src/auth/middleware.ts and src/auth/types.ts, then add rate limiting to the login endpoint. Don’t read other files unless you need to check an import.”
A four-hour code-audit agent runs against a 200-file repository. Without window discipline it would either drown on the first file or burn its context on directory listings. The team sets up the harness to fetch files on demand via tools, compact whenever usage reaches 60% of capacity, and write summaries into a durable progress log outside the window. The window is still finite. The work is no longer bounded by it.
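The audit harness in that scenario might be sketched as follows. Everything here is illustrative: `Session` and its fields are hypothetical names, the `summarize` callable stands in for whatever compaction step a real harness performs (typically a model call), the in-memory `progress_log` stands in for a file or database outside the window, and token use is estimated with a rough four-characters-per-token heuristic.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Session:
    capacity: int                      # window size in tokens
    compact_at: float = 0.60           # threshold from the scenario above
    window: list[str] = field(default_factory=list)        # in-window text
    progress_log: list[str] = field(default_factory=list)  # durable, outside

    def used(self) -> int:
        # Assumption: ~4 characters per token, a rough heuristic only.
        return sum(len(entry) // 4 for entry in self.window)

    def add(self, entry: str, summarize: Callable[[list[str]], str]) -> None:
        """Append an entry, then compact if usage crosses the threshold."""
        self.window.append(entry)
        if self.used() >= self.compact_at * self.capacity:
            summary = summarize(self.window)
            self.progress_log.append(summary)   # survives outside the window
            self.window = [summary]             # window restarts from summary
```

The design choice worth noticing is that the summary is written to the log before the window is replaced, so a crash or reset after compaction still leaves a record of what the session had established.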
Consequences
Once the window is in your vocabulary, you stop blaming the model for what is really a resource constraint. You provide focused context, you start fresh threads when quality drifts, and you structure code so each unit fits comfortably inside a single session’s working set. The agent gets more useful and your debugging gets cheaper because you’re diagnosing the right thing.
The cost is ongoing attention management. Every long session forces decisions about what to include, what to leave out, and when to compact. Those decisions have real consequences for output quality, and they don’t go away as windows get larger; bigger windows just push the failure mode further out and make it harder to notice when it arrives. The “Lost in the Middle” effect compounds the problem: even when the window technically fits everything, attention can’t.
Mechanisms that help — compaction, instruction files, memory, tools, well-decomposed code that supports local reasoning — are themselves patterns and concepts the practitioner has to learn. None of them removes the constraint. They give the practitioner ways to work productively against it.
Sources
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin introduced the Transformer architecture in “Attention Is All You Need” (2017). The fixed-length input sequence processed by self-attention is the architectural origin of the context window as a hard constraint.
Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang demonstrated a U-shaped performance curve in “Lost in the Middle: How Language Models Use Long Contexts” (2023). Their finding that models use information most reliably when it appears at the beginning or end of the context, and least reliably when it appears in the middle, is the empirical basis for the recognition guidance above. As of 2026, no production model has fully eliminated this position bias, even at million-token scale.
The term “context engineering” gained traction through Tobi Lütke, who proposed it as a better name than “prompt engineering” for the skill of assembling the right context for a task. Simon Willison championed the term in a widely circulated note (June 2025), helping it enter common usage.