Generator-Evaluator
Split code creation and code critique into separate agents so that neither role can blind the other.
Understand This First
- Verification Loop – the single-agent feedback cycle that Generator-Evaluator extends across two agents.
- Subagent – the generator and evaluator are specialized subagents with distinct roles.
- Feedback Sensor – the evaluator is a feedback sensor with judgment authority.
Context
At the agentic level, Generator-Evaluator is a multi-agent architecture for producing higher-quality output than any single agent achieves alone. It sits above the Verification Loop, which runs generate-test-fix inside one agent’s context. Generator-Evaluator separates those responsibilities into two agents with independent context windows: one writes, one judges.
The pattern draws on a principle that predates AI: the person who creates the work shouldn’t be the only one who reviews it. Code review, editorial review, adversarial red-teaming, peer grading in education — they all exploit the same structural insight. When the critic is separate from the creator, the critique is harder to dismiss and harder to game.
Problem
How do you get reliable quality from an agent when the agent can’t evaluate its own output honestly?
LLMs exhibit a consistent self-review bias. Ask a model to generate code, then ask it whether that code is correct, and it will tend to say yes. The same context window that produced the output also produces the review, so the model’s reasoning stays anchored to its own prior choices. It finds reasons to defend what it wrote rather than reasons to doubt it. The output looks confident. It reads well. But it hides bugs, missed requirements, and architectural drift behind fluent prose.
Forces
- Self-review bias means a single agent rates its own work too favorably.
- Context contamination makes it hard for one agent to both generate and critique, because the generation reasoning occupies the same window as the critique.
- Quality thresholds are easier to enforce when the judge can’t be swayed by the author’s intent.
- Cost and latency increase with every additional agent in the loop, so the architecture must earn its overhead.
Solution
Assign two agents distinct, non-overlapping roles. The generator writes code, builds features, or produces whatever artifact the task requires. The evaluator grades the output against explicit criteria, produces structured critique, and decides whether the work meets the bar.
The two agents operate in a loop:
- The generator produces output based on the task specification and any prior feedback.
- The evaluator inspects the output against acceptance criteria and returns a structured verdict: pass or fail, with specific reasons.
- If the evaluator fails the work, the generator receives the critique and tries again.
- The loop repeats until the evaluator passes the output or a maximum iteration count is reached.
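The loop above can be sketched in a few lines. The `generate` and `evaluate` callables here are stand-ins for calls into the two agents; their names and signatures are illustrative, not part of the pattern:

```python
from dataclasses import dataclass, field

@dataclass
class Verdict:
    """Structured verdict from the evaluator: pass/fail plus specific reasons."""
    passed: bool
    reasons: list[str] = field(default_factory=list)

def generator_evaluator_loop(task, generate, evaluate, max_iterations=3):
    """Run generate/evaluate until the evaluator passes the output
    or the iteration budget is exhausted."""
    feedback: list[str] = []
    output, verdict = None, Verdict(passed=False, reasons=["no iterations ran"])
    for _ in range(max_iterations):
        # The generator sees the task plus any prior critique,
        # never the evaluator's internal state.
        output = generate(task, feedback)
        verdict = evaluate(output, task)
        if verdict.passed:
            break
        # Failed: carry the structured critique into the next pass.
        feedback = verdict.reasons
    return output, verdict
```

Note that the loop returns the last output and verdict even on failure, so a caller (or a human) can inspect why the work never converged.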
A planner agent often sits upstream of both. The planner breaks a high-level goal into discrete tasks with explicit acceptance criteria, giving the evaluator something concrete to grade against. Without clear criteria, the evaluator defaults to vague judgments (“looks good”) that don’t drive improvement.
Three design choices matter most:
Independent context windows. The generator and evaluator each get their own context. The evaluator never sees the generator’s internal reasoning, draft attempts, or abandoned approaches. It sees only the finished artifact and the acceptance criteria. This prevents the evaluator from rationalizing the generator’s mistakes.
Structured feedback. The evaluator doesn’t just say “try again.” It returns specific, actionable critique: which tests failed, which requirements weren’t met, which edge cases were missed. The generator treats this feedback as its primary input for the next iteration, not its own self-assessment.
Concrete grading criteria. The acceptance criteria should be as specific as possible: expected behavior, required test coverage, edge cases to handle, constraints to satisfy. Vague criteria produce vague evaluations. When the evaluator can run tests, check types, or interact with a live application, the grading gets sharper.
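Structured feedback and concrete criteria can be made mechanical by pinning down the evaluator's input and output formats. The sketch below assumes a JSON verdict schema and a sample rubric; both are illustrative, not a fixed interface from the pattern:

```python
import json

# Hypothetical acceptance criteria for one task; the specific items
# are examples of "concrete" rather than a required schema.
ACCEPTANCE_CRITERIA = [
    "POST /register rejects an empty email field with HTTP 422",
    "Every new code path is covered by at least one test",
    "No function exceeds 50 lines",
]

def build_evaluator_prompt(artifact: str) -> str:
    """The evaluator sees only the finished artifact and the criteria,
    never the generator's reasoning, drafts, or abandoned approaches."""
    criteria = "\n".join(f"- {c}" for c in ACCEPTANCE_CRITERIA)
    return (
        "Grade the artifact against each criterion. Return JSON:\n"
        '{"passed": bool, "failures": ["criterion: specific reason", ...]}\n\n'
        f"Criteria:\n{criteria}\n\nArtifact:\n{artifact}"
    )

def parse_verdict(raw: str) -> tuple[bool, list[str]]:
    """Parse the evaluator's JSON verdict; a malformed reply counts as a fail
    so the loop never mistakes garbage for approval."""
    try:
        data = json.loads(raw)
        return bool(data["passed"]), list(data.get("failures", []))
    except (json.JSONDecodeError, KeyError, TypeError):
        return False, ["evaluator returned a malformed verdict"]
```

Treating a malformed verdict as a failure is a deliberate design choice: the quality gate should fail closed, not open.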
The evaluator doesn’t have to be a more capable model. It can be the same model, or even a cheaper one, running in a fresh context with a grading rubric. What matters is the separation of roles and context, not the evaluator’s raw intelligence.
How It Plays Out
A team builds an internal tool using a three-agent harness. The planner reads the product spec and decomposes it into feature tasks, each with a checklist of acceptance criteria: required endpoints, expected UI behavior, error handling requirements. The generator picks up each task and writes the implementation. The evaluator loads the running application through a browser automation tool, navigates the pages, fills out forms, clicks buttons, and checks whether the behavior matches the spec. When the evaluator finds that a form submission silently drops validation errors, it returns a structured report: “The /register endpoint accepts empty email fields. Expected: validation error with HTTP 422.” The generator reads the critique, adds the validation, and resubmits. On the next pass, the evaluator confirms the fix and moves on.
A solo developer working on a data pipeline separates generation from evaluation without a framework. She uses one agent conversation to write transformation functions and a second conversation to review them. The review conversation gets only the function signatures, the docstrings, and a set of sample inputs with expected outputs. The review agent runs the samples, flags two functions that produce incorrect output on edge cases, and returns the failures. She pastes the feedback into the generation conversation, which fixes the issues. The separation is manual, but it catches bugs that the generation agent missed on its own.
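The review side of this manual workflow amounts to a small test harness: run each function against its samples and return structured failures. A sketch, with an illustrative data shape (`functions` maps name to callable, `samples` maps name to `(args, expected)` pairs):

```python
def review_functions(functions, samples):
    """Run each function against its sample inputs and collect structured
    failure messages, the feedback the review conversation hands back
    to the generation conversation."""
    failures = []
    for name, fn in functions.items():
        for args, expected in samples.get(name, []):
            try:
                actual = fn(*args)
            except Exception as exc:
                failures.append(f"{name}{args}: raised {exc!r}, "
                                f"expected {expected!r}")
                continue
            if actual != expected:
                failures.append(f"{name}{args}: got {actual!r}, "
                                f"expected {expected!r}")
    return failures  # empty list means every sample passed
```

The messages carry the same ingredients as any good evaluator verdict: which case failed, what happened, and what was expected.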
Consequences
Benefits:
- Output quality improves because critique comes from an independent context that can’t be biased by the generation process.
- Failure modes become visible. The evaluator’s structured feedback creates an audit trail of what went wrong and when, making debugging easier for humans.
- The pattern scales naturally. You can increase iteration depth (more passes through the loop) or tighten evaluator rigor (stricter criteria, more tools) without changing the architecture.
Liabilities:
- Cost and latency roughly double at minimum, since every piece of work goes through at least two agent passes. For simple tasks where a single agent gets it right on the first try, the evaluator pass is pure overhead.
- The pattern requires well-defined acceptance criteria. If the criteria are vague, the evaluator can’t grade meaningfully and the loop degenerates into wasted iterations.
- Iteration limits need tuning. Too few passes and the generator can’t converge. Too many and you burn tokens on diminishing improvements, or the generator starts cycling between equally mediocre alternatives.
Related Patterns
- Extends: Verification Loop – the verification loop runs generate-test-fix in one agent; Generator-Evaluator distributes those roles across two agents for stronger critique.
- Uses: Subagent – the generator and evaluator are specialized subagents with distinct contexts and roles.
- Uses: Feedback Sensor – the evaluator is a feedback sensor that grades output against criteria.
- Complements: Plan Mode – a planner agent produces the specs and acceptance criteria that the evaluator grades against.
- Contrasts with: Ralph Wiggum Loop – where Generator-Evaluator uses structural separation to produce honest critique, the Ralph Wiggum Loop is what happens when self-review has no teeth.
- Related: Agent Teams – agent teams coordinate multiple agents across many tasks; Generator-Evaluator is a specific two-role architecture for quality within a single task.
- Related: Steering Loop – the steering loop decides whether to continue, adjust, or stop; Generator-Evaluator is a specific instantiation focused on output quality.
Sources
- Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Yoshua Bengio, Aaron Courville, and Sherjil Ozair introduced Generative Adversarial Networks in “Generative Adversarial Nets” (NeurIPS 2014). The GAN’s core insight, that pairing a generator against a discriminator produces stronger output than either alone, inspired the adversarial structure adapted here for code generation.
- Anthropic described a three-agent harness (planner, generator, evaluator) for long-running application development in “Harness design for long-running application development” (March 2026). The evaluator used browser automation to interact with live applications and grade output against spec-derived criteria, demonstrating the pattern at production scale.
- Dong Huang, Jie M. Zhang, Michael Luck, Qingwen Bu, Yuhao Qing, and Heming Cui introduced AgentCoder in “AgentCoder: Multi-Agent-Based Code Generation with Iterative Testing and Optimisation” (2023). Their framework split code generation into three specialized agents (programmer, test designer, test executor) and showed that multi-agent separation outperformed single-agent generation on competitive coding benchmarks.
- The separation of code authoring from code review is a longstanding software engineering practice. Michael Fagan’s software inspection process (1976) established that independent review by someone other than the author catches defects that self-review misses, a principle that Generator-Evaluator applies to autonomous agents.