Eval

Pattern

A reusable solution you can apply to your work.

Understand This First

  • Agent – evals measure agent performance.
  • Testing – many eval criteria rely on existing test infrastructure.

Context

At the agentic level, an eval (evaluation) is a repeatable suite that measures how well an agentic workflow performs. Evals apply the same principle as testing in traditional software (you need an objective, automated way to know whether things are working), but to the agent itself rather than to the code it produces.

As agentic workflows become more sophisticated, the question shifts from “does the code work?” to “does the agent produce good code, consistently, across a range of tasks?” Evals answer that question with data rather than impressions.

Problem

How do you measure whether your agentic workflow is actually effective, and how do you detect when it regresses?

Without measurement, assessments of agent quality rely on anecdotes: “it seemed to work well yesterday” or “it struggled with that refactoring.” Anecdotes are unreliable. They’re biased toward recent experience, dramatic failures, and tasks that happened to be easy or hard. You need a systematic way to evaluate agent performance across a representative range of tasks.

Forces

  • Subjectivity: “good output” is hard to define precisely for creative tasks like code generation.
  • Variability: the same prompt can produce different results on different runs due to model stochasticity.
  • Scope: evaluating one task tells you little about general capability; you need a diverse suite.
  • Cost: running eval suites consumes time and API credits.
  • Moving targets: model updates, harness changes, and prompt modifications all affect results.

Solution

Build a suite of representative tasks that cover the range of work you expect the agent to handle. Each task in the suite has:

  • A defined input: the prompt, context files, and instruction files the agent receives.
  • A defined success criterion: how to tell whether the agent’s output is acceptable. This can be automated (tests pass, linter is clean, type checker succeeds) or semi-automated (a human rates the output on a scale, checked against a rubric).
  • Repeatability: the task can be run multiple times to measure consistency.
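These three ingredients can be captured in a small amount of code. The sketch below is illustrative, not tied to any real framework: `EvalTask`, `run_eval`, and the `run_agent` callback are hypothetical names, and the success criterion is assumed to be a shell command (such as a test runner) whose exit code signals pass or fail.

```python
import subprocess
from dataclasses import dataclass

@dataclass
class EvalTask:
    """One task in the eval suite (illustrative structure)."""
    name: str
    prompt: str                # the defined input handed to the agent
    context_files: list[str]   # files the agent receives as context
    check_command: list[str]   # automated success criterion, e.g. the test runner

def run_eval(task: EvalTask, run_agent, runs: int = 3) -> float:
    """Run the task several times and return the pass rate (repeatability)."""
    passes = 0
    for _ in range(runs):
        # run_agent is a caller-supplied callback that invokes the agent
        run_agent(task.prompt, task.context_files)
        # exit code 0 from the check command counts as a pass
        result = subprocess.run(task.check_command, capture_output=True)
        passes += (result.returncode == 0)
    return passes / runs
```

Running each task more than once is what turns a single anecdote into a consistency measurement: a task that passes two runs out of three tells you something a single run cannot.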

Common eval dimensions include:

  • Correctness: Does the generated code pass its tests?
  • Convention adherence: Does the output follow project coding standards?
  • Efficiency: How many tool calls and iterations did the agent need?
  • Robustness: Does the agent handle edge cases, ambiguous instructions, and incomplete context gracefully?

Run evals whenever you change something that affects agent behavior: updating the model, modifying instruction files, changing prompts, adding tools, or adjusting approval policies. Compare results against a baseline to detect regressions.
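Baseline comparison can be as simple as a per-task diff of scores. A minimal sketch, assuming scores are stored as a mapping from task name to pass rate and that a fixed drop threshold is acceptable (both assumptions, not prescriptions):

```python
def find_regressions(baseline: dict[str, float],
                     current: dict[str, float],
                     threshold: float = 0.05) -> list[str]:
    """Return the names of tasks whose score dropped by more than
    `threshold` relative to the stored baseline."""
    return [name for name, score in current.items()
            if name in baseline and baseline[name] - score > threshold]
```

A threshold matters because of model stochasticity: a drop from 0.90 to 0.88 on one task may be noise, while a drop from 0.90 to 0.60 is a regression worth investigating.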

Tip

Start with a small eval suite (five to ten representative tasks) rather than trying to be thorough from the start. A small suite you actually run is far more useful than a large suite you never get around to building.

How It Plays Out

A team uses a coding agent daily. They build an eval suite of fifteen tasks: five bug fixes, five feature implementations, and five refactorings, drawn from their actual project history. Each task has a known-good solution for comparison. When a new model version is released, they run the suite and discover that correctness improved overall but convention adherence dropped. The new model ignores their instruction file’s indentation rules more often. They adjust the instruction file’s wording and re-run until the results are acceptable.

A developer notices that her agent seems to produce worse code on Mondays. She runs the eval suite and discovers the results are consistent across days. Her perception was biased by the harder tasks she tends to tackle at the start of the week. The eval replaced a subjective impression with objective data.

Example Prompt

“Run our eval suite against the new model version. Compare correctness, convention adherence, and test pass rates against the baseline from last month. Flag any tasks where the new model scored lower.”

The Pelican Benchmark

One of the best-known model evals in the agentic coding community is Simon Willison’s pelican riding a bicycle. The task sounds easy: generate an SVG of a pelican on a bike. But it tests spatial reasoning, compositional ability, and attention to physical detail, which makes it a surprisingly sharp discriminator between models. Robert Glaser extended it into an agentic version where models iterate on their own output. His finding: most models tweak incrementally rather than rethink their approach, which tells you something useful about how agentic loops actually behave.

Consequences

Evals replace gut feelings with data. They let you make informed decisions about model selection, prompt engineering, and workflow configuration. They catch regressions before they accumulate into visible quality drops. And they provide a shared benchmark for team discussions about agentic workflow quality.

The cost is building and maintaining the suite. Evals are software: they need to be designed, implemented, and updated as the project evolves. Tasks that were representative six months ago may not be representative today. The investment is worthwhile for teams that rely heavily on agentic workflows, but may be overkill for occasional or simple use cases.

  • Depends on: Agent — evals measure agent performance.
  • Refines: Verification Loop — evals are verification applied to the workflow itself rather than to individual changes.
  • Uses: Prompt — eval results guide prompt refinement.
  • Uses: Instruction File — eval results reveal whether instruction files are effective.
  • Depends on: Testing — many eval criteria rely on existing test infrastructure.
  • Related: Approval Fatigue — automated evals reduce the approval burden on humans.
  • Consumed by: Feedback Sensor — evals are feedback sensors applied to the agent’s overall performance across tasks.
  • Complements: Metric — evals measure agent performance on tasks; metrics track cumulative system effects over time.

Sources

  • OpenAI popularized the term “evals” in the LLM community by open-sourcing their Evals framework in March 2023, providing both a standard library for evaluating language models and a public registry of benchmarks that others could extend.
  • Mark Chen et al. introduced HumanEval in “Evaluating Large Language Models Trained on Code” (2021), the first major benchmark for measuring code generation correctness. HumanEval’s pass@k metric became the standard way to report how often a model produces working code.
  • Carlos Jimenez, John Yang, and colleagues at Princeton created SWE-bench (2023), which moved coding evals from isolated function synthesis to real-world GitHub issue resolution. The benchmark now ships in multiple variants: SWE-bench Verified, a 500-instance human-curated subset developed with OpenAI that became the de-facto scoreboard cited in major model announcements, and SWE-bench Pro, a harder variant where even frontier models score in the low 20s — a sharper discriminator as agentic coding scores on Verified have saturated above 90%.
  • Simon Willison’s pelican-on-a-bicycle eval and Robert Glaser’s agentic extension of it (both referenced in the article) demonstrated that effective evals don’t need to be large or formal — a single well-chosen task can reveal meaningful differences between models and workflows.
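The pass@k metric mentioned above has a standard unbiased estimator, given in the HumanEval paper: from n samples of which c passed, it estimates the probability that at least one of k randomly drawn samples passes.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    where n = total samples, c = correct samples."""
    if n - c < k:
        return 1.0  # fewer than k failures, so any k-sample contains a pass
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 10 samples of which 4 pass, pass@1 is 0.4 while pass@5 is considerably higher, which is why reported pass@k numbers always state both k and the sampling setup.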