
Reflexion

Pattern


Force the agent to articulate why its last attempt failed, store that reflection as memory, and feed it back as context for the next try.

Also known as: Self-Reflection, Verbal Reinforcement Learning, the Reflection Pattern.

Understand This First

  • Verification Loop – Reflexion sits on top of a verification loop; it needs a real failure signal to reflect on.
  • ReAct – the inner thought-action-observation loop that Reflexion wraps.
  • Memory – verbal reflections are stored as memory and retrieved on the next attempt.

Context

At the agentic level, Reflexion is the named upgrade from “try again” to “think about why that didn’t work, then try again.” You have an agent that can run a task, fail, and retry. You want the retry to be smarter than the original attempt, not just another roll of the same dice. Reflexion is the mechanism: between the failure and the next attempt, the agent writes a short natural-language post-mortem, and that post-mortem becomes part of the next attempt’s context.

The pattern sits between naive retry and full multi-agent review. No second agent, no new model, no fine-tuning. All it needs is one extra prompt between attempts: “Your last attempt failed for these reasons. What went wrong?” The agent’s own answer is the learning signal.

Problem

How do you get an agent to improve across attempts, when gradient updates and model retraining are off the table?

Coding agents fail often and retry often. A test fails, the agent edits the code, runs the test again. Without any reflection step, each retry starts from the same prior state: same model, same prompt, same weights. If the first attempt was wrong because the agent misread the test’s expectations, the second attempt will likely make the same mistake for the same reason. The agent is trying, but it isn’t learning.

You need a way to turn within-session failure into within-session learning. You can’t update the model. You can update what the model sees on the next step.

Forces

  • Models are stateless. Each attempt begins from whatever context you give it; nothing carries over automatically.
  • Tests, linters, and type-checkers produce pass/fail signals, but the signal alone does not explain why something failed in terms the model can reason about on its next attempt.
  • Raw retry loops are cheap but flat: they repeat the same errors because the model has no record of what it already tried.
  • Full multi-agent review catches more errors but doubles the model cost and adds orchestration overhead.
  • Natural language is the one medium the model already produces fluently. It is also the medium that fits into the model’s own context window without translation.

Solution

Wrap the agent’s task loop with an explicit reflection step. On every failure:

  1. Attempt. The agent tries the task: writes the code, calls the tool, produces the output.
  2. Evaluate. A machine-checkable oracle (tests, a linter, a type-checker, a build step) decides whether the attempt succeeded. This is the feedback signal.
  3. Reflect. If the attempt failed, the agent is prompted to write a short natural-language explanation of what went wrong. Not a summary of the error message: an analysis. “The test expected None for the empty case; I returned -1 because I assumed a numeric sentinel was acceptable. I should return None.”
  4. Store. The reflection is appended to a memory buffer that persists across attempts within the task.
  5. Retry. The next attempt sees the original prompt plus the stored reflections. The agent is now trying the task with an explicit record of what it already got wrong.
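The five steps can be sketched as a single loop. Everything here is illustrative: `llm` and `run_tests` are hypothetical stand-ins for your model call and your oracle, and the prompt wording is an assumption, not the paper's.

```python
# Minimal sketch of the attempt -> evaluate -> reflect -> store -> retry loop.
# `llm`, `run_tests`, and the prompt wording are hypothetical stand-ins.

def reflexion_loop(task, llm, run_tests, max_attempts=4):
    reflections = []  # memory buffer; persists across attempts within the task
    for attempt in range(1, max_attempts + 1):
        # 1. Attempt: the original prompt plus every stored reflection.
        context = task
        if reflections:
            context += "\n\nLessons from previous failed attempts:\n" + "\n".join(
                f"- {r}" for r in reflections
            )
        solution = llm(context)

        # 2. Evaluate: a machine-checkable oracle decides pass/fail.
        passed, failure_output = run_tests(solution)
        if passed:
            return solution

        # 3. Reflect: ask for an analysis, not a restatement of the error.
        reflection = llm(
            "Your attempt failed with this output:\n"
            + failure_output
            + "\nName the specific assumption or decision that caused the "
            "failure and state what you will do differently."
        )
        # 4. Store: append to the buffer the next attempt (step 5) will see.
        reflections.append(reflection)
    return None  # attempt cap hit: escalate rather than loop forever
```

Note that the only state threaded between attempts is the `reflections` list; the model itself never changes.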

Shinn and colleagues at Northeastern and MIT introduced this pattern in 2023 under the name Reflexion, and framed it as verbal reinforcement learning. The key claim: the model’s own reflection, expressed in natural language and added to context, is the learning signal. No gradient updates, no fine-tuning. The reflection buffer is the only thing that changes between attempts, and it’s enough to move the needle.

The original paper reported GPT-4’s pass rate on the HumanEval coding benchmark climbing from 80% to 91% when Reflexion was added on top of a baseline agent. The gains generalize: whenever a task has a machine-checkable oracle and room for more than one attempt, Reflexion almost always beats naive retry.

Tip

The reflection prompt matters. “Why did that fail?” is the minimum. Better: “Describe the failure concretely, name the specific assumption or decision that caused it, and state what you will do differently.” Vague reflection produces vague retries. Specific reflection produces specific corrections.
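One way to spell out those three asks as a reusable template (the wording is illustrative, not from the Reflexion paper):

```python
# Illustrative reflection prompt; the specific wording is an assumption.
REFLECTION_PROMPT = """Your last attempt failed. Here is the failure output:

{failure_output}

1. Describe the failure concretely, quoting the relevant part of the output.
2. Name the specific assumption or decision in your attempt that caused it.
3. State what you will do differently on the next attempt."""

def build_reflection_prompt(failure_output: str) -> str:
    # Embedding the raw failure output guards against confabulated
    # reflections that ignore what actually went wrong.
    return REFLECTION_PROMPT.format(failure_output=failure_output)
```

Quoting the raw failure output in the prompt is deliberate: it anchors the reflection to evidence rather than to the model's guess about what probably happened.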

How It Plays Out

An agent is fixing a bug in a date-parsing function. The first attempt strips whitespace and runs the parser, but the test suite rejects the output because the test expected timezone information to be preserved and the agent dropped it. Without Reflexion, the agent would retry: maybe strip differently, maybe add a try-except. With Reflexion, the agent writes: “The test expects 2024-01-01T00:00:00+05:00 as the output; I returned 2024-01-01 00:00:00. I dropped the timezone by calling .replace(tzinfo=None) in the middle of parsing. I should preserve the timezone through the full pipeline.” The second attempt handles timezones correctly on the first try.

A team runs a nightly migration loop that moves deprecated API calls to their replacements. Each iteration picks one call site, rewrites it, runs the affected tests, and commits if green. Early in the migration, about a third of attempts fail on the first pass. The team adds a reflection step: on failure, the agent writes a two-sentence note about what went wrong before retrying. After a week of operation, the reflections start to cluster. The same three edge cases (retries, timeouts, custom serializers) account for most of the failures. The team uses the clustered reflections to rewrite the migration prompt itself, which cuts the failure rate in half. The reflections turned into compiled knowledge. This is the bridge from Reflexion (within-task) to Feedback Flywheel (across-session).

An engineer is debugging an intermittent integration test. The agent tries a fix, the test passes locally, CI fails. The engineer adds a Reflexion step keyed specifically to “works locally, fails in CI.” The reflection prompt asks the agent to list every assumption about the local environment that might not hold in CI. The agent produces a list: filesystem case sensitivity, timezone, Python minor version, presence of a .env file. The next attempt accounts for each. The fix lands on the second try instead of the seventh.

Where Reflexion Breaks

Reflexion is powerful but not foolproof. The recurring failure modes:

  • Confabulated reflection. The agent fails, the reflection prompt fires, and the agent produces a plausible-sounding explanation that has nothing to do with the actual cause. The test failure was a stale cache; the agent’s reflection blames its own algorithm choice. The next attempt fixes the wrong thing. Guard: the reflection should quote or reference the actual failure output, not reason purely from the task description.
  • Reinforced wrong hypothesis. An early reflection fixates on a bad theory and subsequent reflections refine the bad theory instead of abandoning it. The agent gets stuck chasing the same ghost across five attempts. Guard: cap the reflection memory at a small number of entries and prune aggressively when a new failure contradicts an older reflection.
  • Infinite loop without a real oracle. If the evaluation step is itself an LLM judge with no ground truth, the agent and the judge can collude: the agent gets better at satisfying the judge without getting better at the task. Guard: Reflexion works best when the oracle is machine-checkable (tests, lints, types). For subjective tasks, reach for Generator-Evaluator instead; the separate evaluator agent breaks the collusion.
  • Cost blow-up. Every failed attempt spends tokens on the reflection step in addition to the retry itself. On tasks with high failure rates, the reflection overhead dominates. Cap the total attempts, and switch to Ralph Wiggum Loop or human escalation when the cap is hit.
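The memory-cap guard against a reinforced wrong hypothesis can be sketched as a bounded buffer that keeps only the most recent reflections. The cap size here is illustrative, not a recommendation from the paper:

```python
class ReflectionBuffer:
    """Bounded reflection memory. Older entries are pruned so an early
    wrong hypothesis cannot dominate later attempts. Cap is illustrative."""

    def __init__(self, max_entries: int = 3):
        self.max_entries = max_entries
        self.entries: list = []

    def add(self, reflection: str) -> None:
        self.entries.append(reflection)
        # Prune oldest first: newer reflections supersede older theories.
        if len(self.entries) > self.max_entries:
            self.entries = self.entries[-self.max_entries:]

    def as_context(self) -> str:
        if not self.entries:
            return ""
        return "Previous failed attempts:\n" + "\n".join(
            f"- {e}" for e in self.entries
        )
```

A recency-only prune is the simplest policy; a stricter guard would also drop any stored reflection that a newer failure directly contradicts.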

Consequences

Reflexion converts the agent’s failure log into part of its working context. That’s the whole mechanism, and its benefits follow directly from it. The agent stops repeating the same error in the same way. Cost per task rises somewhat, because every failure adds a reflection round, but total cost usually drops: fewer total attempts are needed to reach success.

The pattern also reshapes what “memory” means in an agentic system. Memory stops being “the transcript” or “a scratchpad” and becomes “the record of what I tried and why it did not work.” That is a more useful kind of memory. It also composes naturally with other patterns: reflections generated within a task can be surfaced across tasks via Feedback Flywheel, and individual reflections can be promoted into permanent instruction file guidance when they capture a recurring lesson.

The liabilities are real but bounded. Reflexion is a within-session pattern. The reflections live in the context window, and they disappear when the session ends unless you explicitly persist them. Their quality is bounded by the quality of the underlying model and the feedback signal. And the pattern does not solve the underlying problem that the model is the same model: if the task is beyond the model’s capability, more reflection won’t fix it. It will only produce more articulate confusion.

When to reach for Reflexion: you have a retry loop, you have a real pass/fail oracle, and the retries aren’t converging. When not to reach for it: you have no oracle (use Generator-Evaluator with an independent judge), the task needs multi-agent independence (also Generator-Evaluator), or the agent is succeeding on the first try anyway (the reflection step just adds cost).

  • Depends on: Verification Loop – Reflexion needs a real failure signal; without verification it’s just the agent talking to itself.
  • Depends on: Memory – reflections are stored in memory, retrieved on the next attempt.
  • Used by: Feedback Flywheel – within-session reflections can be lifted into cross-session learning.
  • Contrasts with: Generator-Evaluator – Reflexion is single-agent self-critique; Generator-Evaluator is two-agent independent review. Choose Reflexion when you have a machine-checkable oracle; choose Generator-Evaluator when you do not.
  • Corrects: Ralph Wiggum Loop – naive one-track retry becomes structured self-correction once a reflection step is added between attempts.
  • Related: Feedback Sensor – the oracle that triggers the reflection step is a feedback sensor.
  • Related: ReAct – Reflexion wraps a ReAct loop; the reflection becomes part of the next loop’s initial context.
  • Related: Instruction File – reflections that capture a recurring lesson can be promoted into durable guidance.

Sources

  • Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao introduced the pattern and its name in Reflexion: Language Agents with Verbal Reinforcement Learning (arXiv:2303.11366, NeurIPS 2023). The paper gave the three-role architecture (Actor, Evaluator, Self-Reflector), the HumanEval benchmark result, and the framing of verbal reflection as a learning signal.
  • Noah Shinn and Ashwin Gopinath’s follow-up essay Reflecting on Reflexion laid out the practitioner-facing summary of what the pattern does and does not do, and clarified the distinction between the three-role reference architecture and the simpler two-role collapse most implementations adopt.
  • The DAIR.AI Prompt Engineering Guide’s Reflexion entry became the standard reference for practitioners adopting the pattern, connecting it to the broader family of self-correction techniques that followed.
  • Andrew Ng’s Agentic Design Patterns series named Reflection as one of four core patterns of agentic design (alongside Tool Use, Planning, and Multi-Agent Collaboration), which cemented the pattern in practitioner pedagogy.
  • The 2024-2026 descendant line (LATS tree search, Self-Refine, process reward models, and many production agent frameworks) all trace back to the Shinn et al. formulation and treat it as the canonical ancestor for within-task self-correction.