Context Rot
An LLM’s output quality degrades as its input grows longer, even when the context window is nowhere near full.
Understand This First
- Context Window – context rot is the quality curve inside that window.
- Model – rot is a property of how transformer attention handles long inputs.
Context
Context rot is the measurable decline in an LLM’s output quality as the amount of material packed into its context window grows, even when the window’s advertised capacity isn’t close to full. A 1M-token window doesn’t give you a 1M-token working memory. It gives you a soft, uneven curve where the first few thousand tokens get sharp attention, the middle sags, and the tail gets some attention back. The middle of that curve is where quiet mistakes live.
Every modern coding agent runs inside this curve. The question isn’t whether your agent’s model rots. It’s how fast, at what lengths, and what you’re going to do about it.
What It Is
Context rot is an architectural property of transformer models, not a training artifact or a capacity bug. The attention mechanism at the heart of every current frontier model uses a softmax over the input tokens to decide how much each one influences the next token. Softmax normalizes: every token’s attention weight is a share of a fixed budget that always sums to one. Add more tokens and every token’s share shrinks. The model hasn’t forgotten the input; the signal for any specific token just becomes fainter as the input grows.
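The dilution is easy to see with a toy softmax, no transformer required. The scores below are illustrative numbers, not measurements from any real model: one “relevant” token keeps the same raw score while distractors pile up around it, and its normalized share collapses anyway.

```python
import math

def softmax(scores):
    """Standard softmax: exponentiate each score, then normalize so the weights sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# One relevant token with a fixed raw score, surrounded by increasingly
# many distractor tokens with a lower score. Scores are illustrative.
RELEVANT_SCORE = 3.0
DISTRACTOR_SCORE = 1.0

for n_distractors in (10, 100, 1_000, 10_000):
    weights = softmax([RELEVANT_SCORE] + [DISTRACTOR_SCORE] * n_distractors)
    print(f"{n_distractors:>6} distractors -> relevant token's share: {weights[0]:.4f}")
```

The relevant token’s raw score never changes; only the crowd around it grows, and its share of the fixed budget falls by orders of magnitude. Real attention is far more structured than this flat toy, but the normalization constraint is the same.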
The empirical shape has a name: “Lost in the Middle.” Nelson Liu and colleagues at Stanford published the first widely cited result in 2023, showing that language models answer questions most accurately when the relevant passage sits at the start or the end of the input. Put the same fact in the middle of a long document and recall drops, even though the words are identical. The curve looks like a U: high on the ends, a noticeable dip in the middle.
Chroma Research tested 18 frontier models in 2025 (GPT, Claude, Gemini, Qwen, and Llama families) and found the same shape in every one. Every model tested degrades as input grows, regardless of its advertised window size. The rot is faster for some than others, but the direction is universal.
The word “rot” is precise. The information hasn’t been deleted; the model isn’t out of memory. What has changed is the model’s ability to find and weigh the relevant tokens, and that ability falls off gradually, not at a cliff. A model that’s brilliant at 2K tokens is pretty good at 32K, average at 128K, and quietly wrong at 500K, even when the “answer” is sitting in the input the whole time.
Why It Matters
Start with diagnosis. Without a name for the phenomenon, a degrading agent session feels like an unlucky day. “The model is being stupid.” “It must be the heat.” “Let me try again with the same prompt.” Once you name it, the pattern becomes visible: the longer the session runs, the more files you dump, the larger the instruction block, the more the agent starts missing things it used to catch. The fix isn’t a better prompt. The fix is a shorter, sharper context.
Then look at design. Several existing patterns in this book only make sense once you know that attention thins as input grows. Compaction fights rot on a long task by shrinking the history. Retrieval keeps working inputs small by fetching on demand instead of preloading everything. Thread-per-Task resets the attention curve with a fresh window. Subagents split a task into pieces that each fit in the steep part of the curve. Context engineering is the whole discipline you practice because rot exists — if it didn’t, you could load the entire codebase and let the model sort it out. You can’t, so you have to choose.
There’s also a buyer-beware reason. A model advertised at 1M tokens is a model that technically accepts 1M tokens of input. That is not the same as a model that stays equally sharp at 1M tokens. Teams that load giant codebases into giant windows and expect a giant increase in understanding often get the opposite: an agent that looks confident and is subtly, persistently wrong about things it was shown. The agent isn’t lying. It’s looking through a fog that the token count didn’t warn it about.
How to Recognize It
Context rot rarely announces itself. The signs are all second-order, which is why so many teams miss them.
Forgotten instructions. You told the agent in the project’s instruction file to always include a correlation ID in error messages. Twenty turns into a session, it stops including them. You can search the conversation and see that your instruction is still there. It hasn’t been removed. It’s just slid into the sag.
Wrong file, right problem. You asked the agent to investigate a bug. It read eight files. It correctly identified that the bug is in one of them. It wrote a fix for a different one. All eight files were in the input. The relevant file was in position four of eight. This is the coding-agent signature of the “Lost in the Middle” curve: the agent is treating the middle of its input as if it were lightly out of focus.
Regression to generic code. Early in a session the agent produces code that matches your conventions exactly, because your conventions are fresh at the top of its context. Hours in, the same agent produces code that looks like an average open-source project. Your conventions are still in its input. They’re just no longer the loudest voice.
Confidence without grounding. The agent cites a function that is almost but not quite what you wrote, or refers to a field that is close to but not the same as one of your real fields. You can find the real thing in the context it was given. The closer-than-random mistake is a fingerprint of attention spread too thin: the model saw the token, failed to weight it, and interpolated.
If you want to measure rot instead of just noticing it, the tools exist. Evaluation suites like “needle in a haystack” tests (a single fact hidden in a long input, measured for recall) and the RULER benchmark give you a rough curve for a given model at a given length. They don’t capture coding-agent workloads perfectly, but they tell you where the curve bends down hardest for the model you’re using.
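A minimal needle-in-a-haystack harness is only a few lines. The sketch below is illustrative, not any benchmark’s official code: you supply `ask_model`, whatever function calls the model you’re testing with a prompt string and returns its reply; everything else is plain string assembly.

```python
def build_haystack(needle, filler_lines, depth):
    """Bury the needle at a relative depth (0.0 = start, 1.0 = end)
    inside a pile of filler text."""
    lines = list(filler_lines)
    lines.insert(int(depth * len(lines)), needle)
    return "\n".join(lines)

def recall_curve(ask_model, needle, question, answer, filler_lines, depths):
    """Check recall at each depth. `ask_model` is your model call --
    a placeholder here, since the real call depends on your provider."""
    results = {}
    for depth in depths:
        prompt = build_haystack(needle, filler_lines, depth) + "\n\n" + question
        reply = ask_model(prompt)
        results[depth] = answer.lower() in reply.lower()
    return results
```

Run it at depths like 0.0, 0.25, 0.5, 0.75, and 1.0, at a few input lengths, and the middle dip shows up on your own model with your own filler. Substring matching on the answer is crude; it’s enough to see the shape of the curve.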
How It Plays Out
A developer dumps a 60K-token service module into the context and asks the agent to find the cause of a slow endpoint. The agent reads carefully, names three suspicious functions, and recommends a fix in the second one. The fix is plausible. It’s also wrong: the real bottleneck is in a helper that the service module calls through an import, defined in a different file that the developer never included. The agent didn’t ask for that file. Why would it? Its immediate input was enormous, and from inside the fog of that input, it looked like the answer must be in there somewhere. A fresh session with only the call graph and the relevant helper (4K tokens total) catches the real bottleneck in one pass.
A team builds a long-running agent session for a complex refactor. For the first ninety minutes, the agent is crisp: it names the modules, respects the contracts, remembers the team’s naming conventions. Around minute 120, it starts producing output that looks great but quietly drops a constraint that the team established at minute 10. The team used to call this “the agent getting tired.” Now they call it rot, and they respond structurally: they compact the session every forty minutes, re-anchoring the constraints at the top of the new window. The agent stops drifting.
A product manager uses a 200K-token window to paste in a product spec, a customer interview transcript, three screenshots of competitor UIs, and a high-level request. The agent produces a design that makes sense for the request but ignores a specific constraint from the interview transcript (“must work offline”). The constraint was on page 12 of the transcript. It was in the input. It was in the middle.
When a session starts degrading, do not restate the instructions for the fourth time. Compact, summarize the current state, and open a fresh thread with the compacted summary and only the files you actually need. Fighting rot by adding more tokens is like fighting a fire by adding more air.
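The compact-and-restart move can be mechanical rather than a judgment call. Here is a minimal sketch of the trigger, assuming a `summarize` function (an LLM call in practice, a stub here) and a `token_count` function for whatever turn format your agent uses; the names are illustrative, not any framework’s API.

```python
def maybe_compact(history, summarize, token_count, budget, keep_recent=5):
    """When the running history outgrows the budget, replace everything
    but the most recent turns with a summary of the older ones.
    The summary lands at position zero -- the sharp end of the curve --
    so constraints get re-anchored instead of sagging in the middle."""
    if token_count(history) <= budget:
        return history
    old, recent = history[:-keep_recent], history[-keep_recent:]
    return [{"role": "system", "content": summarize(old)}] + recent
```

The important design choice is that the summary goes first, not last: the whole point of compacting is to move the constraints back into the part of the input that gets sharp attention.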
Consequences
Naming context rot changes how you build agent workflows. You stop treating the context window as a bag you dump things into and start treating it as a stage where only the most relevant material gets to stand in the bright spot. You get honest about how long a session can run before it needs to be reset. You stop blaming the model for faults that live in the input you gave it.
The main liability is over-correction. A team that’s just learned about rot can swing too far the other way, ruthlessly trimming context until the agent doesn’t have what it genuinely needs, then blaming the trimming when the agent guesses wrong. Rot is a curve, not a threshold. The goal is to keep the material that matters in the sharp part of the curve, not to minimize input for its own sake. Good context engineering is about signal concentration, not token counting.
A deeper consequence is that your agent strategy now depends on which model you use and what you’re asking it to do. Some models rot faster than others. Tasks that require holding many things in mind at once (large refactors, multi-file bug hunts) hit the rot curve harder than tasks that need a single clean answer. Choosing a model, sizing a context budget, and deciding when to spawn a subagent all become rot-aware decisions rather than window-size decisions.
Related Patterns
- Depends on: Context Window – rot is the quality curve inside the window.
- Depends on: Model – rot is a property of transformer attention, and different models rot at different rates.
- Mitigated by: Context Engineering – context engineering is the discipline you practice because rot exists.
- Mitigated by: Compaction – compaction is how you recover sharpness on a long task.
- Mitigated by: Retrieval – retrieval keeps working inputs small by fetching on demand.
- Mitigated by: Thread-per-Task – a fresh thread resets the attention curve.
- Mitigated by: Subagent – subagents split work into pieces that each fit in the sharp part of the curve.
- Related to: Memory – memory moves durable context out of the rotting window into persistent storage.
Sources
- Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang, “Lost in the Middle: How Language Models Use Long Contexts” (2023), gave the phenomenon its first widely cited empirical curve. Their finding that accuracy dips when the answer sits in the middle of a long input is the load-bearing result this article builds on.
- Chroma Research’s 2025 study tested 18 frontier models across the GPT, Claude, Gemini, Qwen, and Llama families and established that every model tested degrades with longer inputs regardless of advertised window size. Their work popularized the term “context rot” and turned it into a cross-model claim rather than a single-paper observation.
- Ashish Vaswani and co-authors, “Attention Is All You Need” (2017), introduced the transformer’s softmax attention mechanism. The mathematical reason rot exists at all (a fixed attention budget being spread across more tokens as input grows) is a direct consequence of that architectural choice.
- The broader conversation around context engineering at major labs and in the practitioner community during 2025 and early 2026 connected rot to the design of coding agents: it is what compaction, retrieval, subagents, and thread isolation are all, at bottom, fighting.
Further Reading
- Lost in the Middle: How Language Models Use Long Contexts – the foundational paper; the U-shaped accuracy curve is worth seeing in its original form.
- Context Rot: How Increasing Input Tokens Impacts LLM Performance – Chroma Research’s 18-model study, with charts that make the cross-model consistency unmistakable.