Eval
Understand This First
- Agent – evals measure agent performance.
- Testing – many eval criteria rely on existing test infrastructure.
Context
At the agentic level, an eval (evaluation) is a repeatable suite that measures how well an agentic workflow performs. Evals apply the same principle as testing in traditional software (you need an objective, automated way to know whether things are working), but to the agent itself rather than to the code it produces.
As agentic workflows become more sophisticated, the question shifts from “does the code work?” to “does the agent produce good code, consistently, across a range of tasks?” Evals answer that question with data rather than impressions.
Problem
How do you measure whether your agentic workflow is actually effective, and how do you detect when it regresses?
Without measurement, assessments of agent quality rely on anecdotes: “it seemed to work well yesterday” or “it struggled with that refactoring.” Anecdotes are unreliable. They’re biased toward recent experience, dramatic failures, and tasks that happened to be easy or hard. You need a systematic way to evaluate agent performance across a representative range of tasks.
Forces
- Subjectivity: “good output” is hard to define precisely for creative tasks like code generation.
- Variability: the same prompt can produce different results on different runs due to model stochasticity.
- Scope: evaluating one task tells you little about general capability; you need a diverse suite.
- Cost: running eval suites consumes time and API credits.
- Moving targets: model updates, harness changes, and prompt modifications all affect results.
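The variability force above can be quantified directly: run the same task several times and report a pass rate rather than a single pass/fail. A minimal sketch, where the zero-argument callable stands in for one full agent run against a task's success criterion (the 70% flaky task is illustrative, not from the text):

```python
import random

def pass_rate(run_task, n_runs=5):
    """Run one eval task n_runs times and return the fraction of passes.

    run_task is any zero-argument callable returning True when the
    agent's output meets the task's success criterion.
    """
    passes = sum(1 for _ in range(n_runs) if run_task())
    return passes / n_runs

# Hypothetical stand-in for a real agent run: passes roughly 70% of the time.
random.seed(42)
flaky_task = lambda: random.random() < 0.7
print(pass_rate(flaky_task, n_runs=20))
```

Reporting a rate over several runs also softens the cost trade-off: you pay for repeated runs only on the tasks where consistency actually matters.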
Solution
Build a suite of representative tasks that cover the range of work you expect the agent to handle. Each task in the suite has:
- A defined input: the prompt, context files, and instruction files the agent receives.
- A defined success criterion: how to tell whether the agent’s output is acceptable. This can be automated (tests pass, linter is clean, type checker succeeds) or semi-automated (a human rates the output on a scale, checked against a rubric).
- Repeatability: the task can be run multiple times to measure consistency.
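One way to make those three properties concrete is a small task record: a defined input, an automated success check, and a run count. A sketch under assumed names (`EvalTask`, `run_eval`, and the stubbed agent are all illustrative, not part of any particular harness):

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class EvalTask:
    name: str
    prompt: str                                    # defined input
    context_files: list = field(default_factory=list)
    check: Callable[[str], bool] = lambda out: False  # success criterion
    runs: int = 3                                  # repeatability

def run_eval(task, run_agent):
    """Run the task `task.runs` times; return per-run pass/fail results."""
    results = []
    for _ in range(task.runs):
        output = run_agent(task.prompt, task.context_files)
        results.append(task.check(output))
    return results

# Hypothetical usage with a stubbed agent in place of a real harness:
task = EvalTask(
    name="fix-off-by-one",
    prompt="Fix the off-by-one bug in pagination.py",
    check=lambda out: "range(0, n)" in out,
)
stub_agent = lambda prompt, files: "for i in range(0, n): ..."
print(run_eval(task, stub_agent))  # one boolean per run
```

The `check` callable is where the automated criteria from the list above plug in: a test runner, a linter invocation, or a rubric-backed human rating wrapped in a function.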
Common eval dimensions include:
- Correctness: Does the generated code pass its tests?
- Convention adherence: Does the output follow project coding standards?
- Efficiency: How many tool calls and iterations did the agent need?
- Robustness: Does the agent handle edge cases, ambiguous instructions, and incomplete context gracefully?
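These dimensions can be scored per task and averaged across the suite, so one run of the suite yields a single number per dimension. A minimal aggregation sketch (dimension names and score values are illustrative):

```python
from statistics import mean

# Per-task scores in [0, 1] for each dimension (illustrative values).
suite_results = [
    {"correctness": 1.0, "convention": 0.9, "efficiency": 0.7, "robustness": 0.8},
    {"correctness": 0.0, "convention": 1.0, "efficiency": 0.9, "robustness": 0.5},
    {"correctness": 1.0, "convention": 0.6, "efficiency": 0.8, "robustness": 1.0},
]

def suite_summary(results):
    """Average each dimension across all tasks in the suite."""
    dims = results[0].keys()
    return {d: round(mean(r[d] for r in results), 2) for d in dims}

print(suite_summary(suite_results))
```

Keeping the dimensions separate, rather than collapsing them into one score, is what lets you see trade-offs like the one in the scenario below: correctness up, convention adherence down.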
Run evals whenever you change something that affects agent behavior: updating the model, modifying instruction files, changing prompts, adding tools, or adjusting approval policies. Compare results against a baseline to detect regressions.
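Comparing against a baseline can be as simple as diffing per-task scores from two runs. A sketch, assuming scores are stored as plain dicts keyed by task name (the tolerance parameter absorbs harmless run-to-run noise):

```python
def find_regressions(baseline, current, tolerance=0.0):
    """Return tasks whose current score fell below baseline minus tolerance.

    baseline and current map task name -> score in [0, 1].
    """
    regressions = {}
    for name, old in baseline.items():
        new = current.get(name)
        if new is not None and new < old - tolerance:
            regressions[name] = (old, new)
    return regressions

baseline = {"bug-fix-1": 1.0, "feature-2": 0.8, "refactor-3": 0.6}
current  = {"bug-fix-1": 1.0, "feature-2": 0.4, "refactor-3": 0.6}
print(find_regressions(baseline, current))  # {'feature-2': (0.8, 0.4)}
```

Store the baseline scores alongside the suite (e.g. in version control) so that every model update, prompt change, or instruction-file edit can be checked against the same reference point.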
Start with a small eval suite (five to ten representative tasks) rather than trying to be thorough from the start. A small suite you actually run is far more useful than a large suite you never get around to building.
How It Plays Out
A team uses a coding agent daily. They build an eval suite of fifteen tasks: five bug fixes, five feature implementations, and five refactorings, drawn from their actual project history. Each task has a known-good solution for comparison. When a new model version is released, they run the suite and discover that correctness improved overall but convention adherence dropped. The new model ignores their instruction file’s indentation rules more often. They adjust the instruction file’s wording and re-run until the results are acceptable.
A developer notices that her agent seems to produce worse code on Mondays. She runs the eval suite and discovers the results are consistent across days. Her perception was biased by the harder tasks she tends to tackle at the start of the week. The eval replaced a subjective impression with objective data.
“Run our eval suite against the new model version. Compare correctness, convention adherence, and test pass rates against the baseline from last month. Flag any tasks where the new model scored lower.”
One of the best-known model evals in the agentic coding community is Simon Willison’s pelican riding a bicycle. The task sounds easy: generate an SVG of a pelican on a bike. But it tests spatial reasoning, compositional ability, and attention to physical detail, which makes it a surprisingly sharp discriminator between models. Robert Glaser extended it into an agentic version where models iterate on their own output. His finding: most models tweak incrementally rather than rethink their approach, which tells you something useful about how agentic loops actually behave.
Consequences
Evals replace gut feelings with data. They let you make informed decisions about model selection, prompt engineering, and workflow configuration. They catch regressions before they accumulate into visible quality drops. And they provide a shared benchmark for team discussions about agentic workflow quality.
The cost is building and maintaining the suite. Evals are software: they need to be designed, implemented, and updated as the project evolves. Tasks that were representative six months ago may not be representative today. The investment is worthwhile for teams that rely heavily on agentic workflows, but may be overkill for occasional or simple use cases.
Related Patterns
- Depends on: Agent — evals measure agent performance.
- Refines: Verification Loop — evals are verification applied to the workflow itself rather than to individual changes.
- Uses: Prompt — eval results guide prompt refinement.
- Uses: Instruction File — eval results reveal whether instruction files are effective.
- Depends on: Testing — many eval criteria rely on existing test infrastructure.
- Related: Approval Fatigue — automated evals reduce the approval burden on humans.