Agentic Context Engineering
Treat the agent’s working context as an evolving structured playbook of discrete tagged bullets, updated incrementally by three specialized roles instead of monolithic rewrites.
Also known as: ACE, Evolving Playbook.
Understand This First
- Context Engineering — ACE is one specific architecture inside this broader discipline.
- Reflexion — single-agent verbal self-critique; the ancestor that ACE generalizes.
- Memory — the substrate the playbook lives in.
Context
At the agentic level, Agentic Context Engineering is what you reach for when an agent should learn from its own execution and you want that learning to compound rather than evaporate. The agent runs a task. Some attempts work, some don’t. You want the next attempt to be sharper than the last, and you want the run after that to be sharper still, across days, sessions, and personnel changes. The naive answer is to have the agent rewrite its own instructions: edit CLAUDE.md, update the system prompt, summarize what it learned. ACE is the pattern that says: don’t rewrite. Itemize.
The architecture is one of several in the Context Engineering family. Where the parent pattern names the four operations (select, compress, order, isolate) at the level of “what does the model see this turn,” ACE answers a narrower question one floor up: how do you accumulate useful, durable knowledge into that context over time without breaking it? The pattern was published by Qizheng Zhang and colleagues at SambaNova, Stanford, and UC Berkeley in late 2025 and accepted at ICLR 2026. Two open-source implementations and a SambaNova industry blog post have followed.
Problem
How do you let an agent learn from its own runs and have the learning stick, when the obvious approach (let the agent rewrite its own working instructions) quietly destroys what it knows?
Two failure modes show up within weeks of trying naive self-rewriting. The first is brevity bias: every rewrite drops domain-specific detail in favor of cleaner, shorter summaries, so the agent gets vaguer over time. The second is context collapse: after enough rewrites, the accumulated knowledge degrades into a small generic blob. It’s the cassette-tape problem. Copy a copy of a copy and the signal goes flat. By the tenth iteration, the playbook reads like a tutorial introduction; the project-specific edge cases that actually mattered have been smoothed away.
Both modes are well-named in the ACE paper, and once you have words for them you start seeing them in agents that try to teach themselves. The pattern exists because the cure isn’t “reflect more” or “summarize less.” It’s a structural change to how the working knowledge is represented and how it gets edited.
Forces
- The model is the same model on every call. Whatever learning you do has to live in what the model sees, not what it is.
- An evolving playbook needs to grow without going stale. Add new lessons, but don’t lose the old ones that still apply.
- Rewriting is cheap and tempting. Asking the model to “produce the new version of the playbook with this lesson incorporated” works once and decays under iteration.
- Structured edits are more expensive per learning step than monolithic rewrites: more roles, more inference, more bookkeeping.
- You need to know which entries are paying their way and which are dead weight, or the playbook becomes a junk drawer.
Solution
Represent the agent’s accumulated knowledge as an itemized, tagged playbook rather than a freeform document, then use three specialized roles to update it incrementally.
The playbook is a structured document organized into named sections (typical examples from the reference implementation: STRATEGIES & INSIGHTS, FORMULAS & CALCULATIONS, COMMON MISTAKES). Each entry inside a section is a discrete tagged bullet that carries provenance and usefulness counters:
[strategies-00042] helpful=7 harmful=0 :: When the schema migration touches
both `users` and `profiles`, run them in one transaction. Splitting the two
breaks the foreign-key check during the brief window between commits.
The tag is stable across edits. The helpful and harmful counters track how often the entry contributed to a successful or failed run when surfaced to the agent. The :: separator and the surface format are the reference implementation’s choice, not a standard. What matters is that entries are addressable, replaceable, and individually scored.
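As a concrete illustration, the entry format above might map onto a data structure like the following. This is a minimal Python sketch, not the reference implementation’s actual schema; the class and field names mirror the surface format and are otherwise this article’s invention.

```python
from dataclasses import dataclass

@dataclass
class PlaybookEntry:
    """One tagged bullet: addressable by tag, replaceable, individually scored."""
    tag: str          # stable identifier across edits, e.g. "strategies-00042"
    text: str         # the lesson itself
    helpful: int = 0  # runs where this entry was surfaced and the run succeeded
    harmful: int = 0  # runs where this entry was surfaced and the run failed

    def render(self) -> str:
        """Serialize back to the tagged-bullet surface format shown above."""
        return f"[{self.tag}] helpful={self.helpful} harmful={self.harmful} :: {self.text}"

entry = PlaybookEntry(
    tag="strategies-00042",
    text="When the schema migration touches both `users` and `profiles`, run them in one transaction.",
    helpful=7,
)
print(entry.render())
```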
Updates flow through three roles:
- Generator. The agent that actually does the task. It produces reasoning paths and surfaces what worked, including which playbook entries it consulted on the way to a result.
- Reflector. A separate role that reads the trace after the fact and extracts candidate lessons. The reflection here is third-person analysis of someone else’s run, not the Generator looking at its own work, and that separation is the move that makes ACE more robust than naive Reflexion.
- Curator. The role that decides what to do with each candidate lesson. Add a new entry, refine an existing one, increment counters, retire a stale entry, but always as a small, targeted edit, never as a rewrite of the whole document.
The three roles can be three separate model calls, three different prompts to the same model, or even three personas inside a longer pipeline. What matters is that the target of the edit shifts from “the document” to “this specific entry,” and the author of the edit is no longer the agent that just used it.
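A minimal sketch of how one learning step might compose the three roles, assuming a generic `call_model(prompt) -> str` helper; the prompts, the JSON lesson format, and the edit operations (`ADD`, `REFINE`, `SKIP`) are illustrative choices, not interfaces from the paper or the reference implementation.

```python
import json

def generator(task: str, playbook: str, call_model) -> str:
    """Do the task with the current playbook in context; return the full trace."""
    return call_model(
        f"Playbook:\n{playbook}\n\nTask: {task}\n"
        "Note which playbook entries you consulted along the way."
    )

def reflector(trace: str, call_model) -> list[dict]:
    """Third-person pass over someone else's trace: extract candidate lessons."""
    raw = call_model(
        "Read this execution trace and list concrete, reusable lessons as a "
        f"JSON array of objects with 'section' and 'lesson' keys:\n{trace}"
    )
    return json.loads(raw)

def curator(lessons: list[dict], call_model) -> list[dict]:
    """Turn candidate lessons into small targeted edits, never a full rewrite."""
    raw = call_model(
        "For each lesson, emit one edit against the playbook: ADD a new tagged "
        "entry, REFINE an existing tag, or SKIP. Return a JSON array of edits:\n"
        + json.dumps(lessons)
    )
    return json.loads(raw)  # e.g. [{"op": "ADD", "section": "pitfalls", "text": "..."}]

def learning_step(task: str, playbook: str, call_model) -> list[dict]:
    """One full pass: run the task, reflect on the trace, curate targeted edits."""
    trace = generator(task, playbook, call_model)
    lessons = reflector(trace, call_model)
    return curator(lessons, call_model)
```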
Start with the data structure, not the roles. Pick a tag scheme, decide where the playbook is stored (a markdown file in the repo is fine), and define the entry format. The three-role pipeline is easy to add once the playbook itself is addressable. If you start by orchestrating roles against a freeform document, you’ll end up reinventing brevity bias.
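One hedged sketch of what “addressable” buys you once the playbook lives in a markdown file: entries parse into a dictionary keyed by tag, so the Curator can touch exactly one entry. It assumes one entry per line in the `[tag] helpful=N harmful=N :: text` format shown earlier; the file path is illustrative.

```python
import re
from pathlib import Path

# Matches the surface format used earlier: [tag] helpful=N harmful=N :: text
ENTRY_RE = re.compile(
    r"^\[(?P<tag>[\w-]+)\] helpful=(?P<helpful>\d+) harmful=(?P<harmful>\d+) :: (?P<text>.+)$"
)

def load_playbook(path: Path) -> dict[str, dict]:
    """Parse a playbook markdown file into entries keyed by their stable tag."""
    entries: dict[str, dict] = {}
    for line in path.read_text().splitlines():
        match = ENTRY_RE.match(line.strip())
        if match:
            fields = match.groupdict()
            entries[fields["tag"]] = {
                "text": fields["text"],
                "helpful": int(fields["helpful"]),
                "harmful": int(fields["harmful"]),
            }
    return entries

# Addressable means the Curator can target one entry without touching the rest:
#   entries = load_playbook(Path("playbook/strategies.md"))
#   entries["strategies-00042"]["helpful"] += 1
```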
The published evaluation is encouraging. On the AppWorld agent benchmark, the paper reports a 10.6-point improvement over the strongest baseline. On the finance benchmark, 8.6 points. Most striking: a 17.1-point gain on AppWorld when the agent learned purely from execution feedback, with no ground-truth labels available. The numbers are specific to those benchmarks and to the reference implementation; treat them as evidence the architecture moves the needle, not as a guarantee for any particular task.
How It Plays Out
A team builds a coding agent that pairs with engineers on a large internal codebase. They start with a single CLAUDE.md and ask the agent to update it after each session with anything useful it learned. Within a week the file is shorter, blander, and missing the specific things that made it useful: the import-path conventions, the legacy column names, the test-runner quirks. They restructure. The agent now writes into a playbook/ directory of tagged bullets organized into conventions, pitfalls, commands. A nightly job runs a Reflector pass over the day’s session traces and proposes additions. A Curator pass merges them, increments helpful counters when an entry contributed to a passing test, and retires entries with harmful >= 3 && helpful == 0. After a month the playbook has more than three hundred entries, but it’s getting sharper, not vaguer. New engineers report the agent feels project-aware on their first session.
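The counter and retirement policy from this scenario could be as small as the following sketch, reusing the tag-keyed entry dictionary from the earlier parsing example; the thresholds come from the scenario above, not from the paper.

```python
def update_counters(entries: dict[str, dict], consulted: list[str], run_passed: bool) -> None:
    """Credit or blame every entry that was surfaced during a run."""
    for tag in consulted:
        if tag in entries:
            key = "helpful" if run_passed else "harmful"
            entries[tag][key] += 1

def retire_dead_weight(entries: dict[str, dict]) -> list[str]:
    """Drop entries that have only ever hurt: harmful >= 3 and helpful == 0."""
    retired = [tag for tag, entry in entries.items()
               if entry["harmful"] >= 3 and entry["helpful"] == 0]
    for tag in retired:
        del entries[tag]
    return retired
```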
A domain agent works in a regulated industry (finance, legal, medical) where the value is in capturing and compounding expert insight without losing it on the next iteration. Each case the agent handles surfaces something specific: a regulatory edge case, a common drafting mistake, a calculation formula. The freeform-rewrite approach loses these within a few cycles because the language they require is irregular and verbose. The structured playbook keeps each as its own tagged bullet under precedents or formulas, with provenance back to the case that produced it. Six months in, the playbook is the team’s living institutional knowledge. When a new model version ships, the playbook moves over unchanged; the agent gets smarter without forgetting what it already knew.
A solo developer running a long-horizon refactor loop notices the agent makes the same three categorical mistakes across different files. The naive reaction is to expand the system prompt with more rules, which makes the prompt longer and the agent slower without obviously helping. With an ACE-style playbook, those three mistakes become three tagged common-mistakes entries with concrete contrastive examples. The Generator surfaces the relevant ones into context only when the file being edited matches the trigger pattern. The agent’s per-step prompt stays small. The accumulated knowledge stays addressable.
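A sketch of the trigger-pattern idea in that last scenario. The `trigger` glob on each entry is this article’s illustration of conditional surfacing, not a field the paper or reference implementation defines.

```python
from fnmatch import fnmatch

def select_entries(entries: dict[str, dict], current_file: str, budget: int = 10) -> list[str]:
    """Surface only the entries whose trigger glob matches the file being edited,
    so the per-step prompt stays small while the full playbook stays on disk."""
    matching = [
        f"[{tag}] {entry['text']}"
        for tag, entry in entries.items()
        if fnmatch(current_file, entry.get("trigger", "*"))
    ]
    return matching[:budget]

# An entry carrying {"trigger": "src/db/migrations/*.sql", ...} is surfaced
# only when the agent is editing a migration file.
```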
Where ACE Doesn’t Fit
ACE assumes the agent runs enough times for the counters to mean something. On a one-off task, the bookkeeping is overhead with nothing to amortize against. The pattern also assumes you can run a Reflector pass over traces, which means traces have to be captured and stored, and “what the Reflector should look for” has to be defined well enough that it doesn’t fill the playbook with noise. Teams that adopt ACE prematurely tend to ship a beautiful empty playbook and quietly stop using it.
The three-role pipeline also costs more inference per learning step than a monolithic rewrite. If your task volume is low, the per-task cost ratio of “learn” to “do” can flip the wrong way. Measure before adopting at scale.
Consequences
The benefit is durable: the agent’s accumulated knowledge stops degrading under iteration. Each new lesson lands in a specific addressable place. Old lessons can be inspected, scored, and retired. A new team member can read the playbook and understand what the agent knows, which is the kind of legibility that monolithic rewriting destroys. Cross-session learning becomes a property of the system rather than a hope.
The cost is real and worth naming. The three-role pipeline raises the floor of complexity. At minimum you’re maintaining a structured playbook, a Reflector prompt, a Curator policy, and the bookkeeping for usefulness counters. The structured format makes debugging and pruning much easier than freeform documents, but only after you’ve built the tooling to inspect the playbook and roll back bad edits. Token cost per learning step is higher than naive self-rewriting, although total token cost over the agent’s lifetime usually drops because retries on the same mistake go down.
ACE is a lever on how well the agent uses what its model can already do, not a way to raise the model’s quality ceiling. It will not turn a model that can’t solve a task into one that can. If your agent is failing because the underlying capability isn’t there, more structured learning won’t rescue it, and the more visible the playbook gets, the more obvious that mismatch becomes.
When to reach for ACE: you have an agent that runs many times against similar tasks, you have signal on which runs succeeded, and the freeform “have the agent update its own instructions” loop has started to drift. When not to reach for it: you’re shipping a one-shot agent, or you don’t yet have a way to capture and replay traces, or the underlying task isn’t repeating often enough to make the bookkeeping pay back.
Related Patterns
| Relationship | Pattern | Note |
|---|---|---|
| Complements | Garbage Collection | Garbage Collection scans for drift in the agent's working state; ACE accumulates positive learnings into it. Both are recurring maintenance loops on the same surface. |
| Complements | Steering Loop | The steering loop closes feedforward and feedback signals; ACE specifies how the learnings from those signals get accumulated. |
| Depends on | Feedback Sensor | The Generator's traces and the Reflector's lesson extraction depend on usable execution feedback. |
| Extends | Generator-Evaluator | ACE adds a third Curator role and shifts the target from judging output to merging lessons into the working context. |
| Extends | Reflexion | ACE is the multi-role, structured-playbook generalization of Reflexion's single-agent verbal self-critique. |
| Implements | Feedback Flywheel | ACE is one concrete mechanism for running a feedback flywheel; the playbook captures and compounds signal across runs. |
| Mitigates | Context Rot | Structured incremental updates with usefulness counters fight the same erosion the Context Rot article names. |
| Specializes | Context Engineering | ACE is one specific architecture within the broader Context Engineering discipline. |
| Uses | Eval | Agent benchmarks are the validation surface for any context-improvement architecture; ACE's reported gains come from eval-driven measurement. |
| Uses | Memory | The evolving playbook is stored in memory; ACE imposes an opinionated structure on top. |
Sources
- Qizheng Zhang and colleagues at SambaNova, Stanford, and UC Berkeley introduced the pattern, the name, and the three-role architecture in Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models (arXiv:2510.04618, ICLR 2026). The paper named both failure modes (brevity bias, context collapse), gave the playbook data structure, and reported the AppWorld and finance benchmark results.
- The reference implementation `ace-agent/ace` makes the architecture concrete: Generator, Reflector, and Curator scripts; the tagged-bullet playbook with helpful/harmful counters; and the AppWorld and finance benchmark harnesses. A second independent implementation, `kayba-ai/agentic-context-engine`, reproduces the architecture from the same paper. Two unrelated teams converging on the same shape is a useful signal that the pattern is portable rather than implementation-coupled.
- The framing of context collapse as a named failure mode reached general circulation through industry-press coverage in late 2025 and early 2026; once the term existed, practitioner blogs picked it up to describe symptoms they had already been seeing in agents that rewrote their own instructions. The ACE paper is the canonical reference for both the symptom and the architectural answer.
- The pattern positions itself explicitly against Reflexion (Shinn et al., NeurIPS 2023): same goal of within-system learning from execution, but with a structured incremental playbook in place of monolithic verbal self-critique, and with the reflection role separated from the agent doing the work.
Further Reading
- The OpenReview discussion thread for the ICLR 2026 paper collects reviewer questions and author responses; a useful complement to the paper for readers who want to see the architecture stress-tested.
- The Hugging Face Papers page aggregates community discussion of the paper and links to derivative implementations as they appear.