---
slug: evaluator-driven-code-search
type: pattern
summary: "Use an automated evaluator as selection pressure so agents search a program space instead of betting on one generated answer."
created: 2026-06-14
updated: 2026-06-14
related:
  generator-evaluator:
    relation: extends
    note: "Evaluator-Driven Code Search extends the generate-evaluate loop from one artifact into a population search over candidate programs."
  verification-loop:
    relation: extends
    note: "Each candidate program goes through verification, and the aggregate scores decide which candidates survive."
  test-oracle:
    relation: depends-on
    note: "The evaluator is a test oracle with teeth; if it is weak or gameable, the search optimizes the wrong thing."
  eval:
    relation: uses
    note: "The evaluator is an eval embedded inside the optimization loop rather than a one-time benchmark after the fact."
  benchmark-mirage:
    relation: risk-of
    note: "An evaluator that rewards superficial benchmark wins turns the search into Benchmark Mirage at machine speed."
  architecture-fitness-function:
    relation: uses
    note: "Fitness functions can become evaluator checks when the search is optimizing structural qualities rather than behavior alone."
  determinism:
    relation: depends-on
    note: "Repeatable scoring makes candidate comparisons meaningful; nondeterministic evaluators blur the selection signal."
---
# Evaluator-Driven Code Search

> **Pattern**
>
> A named solution to a recurring problem.

*Turn a coding agent into a search system by asking it for candidate programs, scoring those programs with an automated evaluator, and retaining the variants that improve the objective.*

*Also known as: Evolutionary Coding Agent, LLM-Guided Program Search.*

If "code search" sounds like finding a file in a repository, this is a different kind of search. The search space is the set of possible programs. The agent proposes variants, the evaluator scores them, and the system keeps enough history to make the next proposal less random. Reach for this pattern when "try one answer" leaves too much value on the table and "run the score again" is cheap enough to automate.

## Understand This First

- [Generator-Evaluator](generator-evaluator.md) — the generate-and-judge loop that this pattern scales from one artifact to a population.
- [Verification Loop](verification-loop.md) — the per-candidate check that supplies the score.
- [Test Oracle](test-oracle.md) — the source of truth that decides whether the score means anything.

## Context

At the **agentic** level, Evaluator-Driven Code Search applies when you can express progress as a score. You are asking an agent to explore a large space of possible programs, run each candidate, measure the result, and keep the variants that move the score in the right direction.

Google DeepMind's AlphaEvolve made the pattern visible in 2025 and 2026. The system used large language models to propose program changes, automated evaluators to run and score them, and an evolutionary loop to preserve promising candidates across a program history. The reported applications were the kind of problems where a crisp evaluator exists: scheduling, matrix multiplication, chip design, compiler and storage heuristics, and other optimization tasks where better code can be measured automatically.

Open-source follow-on work such as CodeEvolve shows the same shape outside DeepMind's closed system: LLMs generate code, a genetic algorithm maintains populations, and automated scoring decides which variants survive.

The important part isn't AlphaEvolve as a product but the shape of the loop: a model proposes code, an evaluator scores it, and search pressure accumulates across many attempts. That makes it different from ordinary eval-driven development, where the evaluator tells you whether one system is good enough. Here the evaluator is part of the search machinery.

## Problem

How do you use an agent when the best answer is unlikely to appear in a single generation, but you can cheaply tell whether one candidate is better than another?

One-shot code generation is a poor fit for optimization problems. A model may produce a reasonable heuristic for a scheduler, a kernel, or a ranking function, but "reasonable" is not the target. The target is a measurable improvement: lower latency, fewer scalar multiplications, better packing, less write amplification, higher pass rate under fixed constraints. You need the agent to search, not guess.

Manual iteration leaves the useful signal on the floor. A human can ask for a candidate, run it, read the score, and ask for another. That works for three attempts. It doesn't work for thousands. The evaluator reports which candidates improved, but unless those scores feed back into a maintained population, the search forgets what it has learned.

## Forces

- **Objective clarity** is the hard gate. If you can't score candidates cheaply and honestly, you don't have selection pressure.
- **Search spaces are huge.** The useful variant may be several mutations away from the model's first answer.
- **Evaluator cost compounds.** A search loop may run hundreds or thousands of candidates, so each score must be cheap enough to repeat.
- **Score gaming is easy.** A model will optimize what the evaluator rewards, even when that stops matching the real goal.
- **Human-readable code still matters.** The winning candidate must be inspectable, deployable, and maintainable after the search ends.

## Solution

Build the coding workflow as a search loop. The generator proposes code variants. The evaluator runs each variant and returns a score. A controller stores the candidates, keeps the best and most diverse ones, and uses that history to prompt the next generation.

The loop has five parts:

1. **Seed.** Start from a working baseline, reference implementation, or minimal skeleton.
2. **Generate.** Ask the model for candidate program changes, not for prose answers.
3. **Evaluate.** Run each candidate against automated checks: correctness, performance, resource use, or another objective score.
4. **Select.** Keep candidates that improve the score or preserve useful diversity.
5. **Mutate.** Prompt the model with the current history and ask for variants that exploit what worked or explore nearby alternatives.
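
The five parts above can be sketched as a small controller. This is a minimal illustration under stated assumptions, not AlphaEvolve's or CodeEvolve's design: `search`, `toy_propose`, and `toy_evaluate` are hypothetical names, the model call is replaced by a random mutation, candidates are integer pairs rather than source code, and selection keeps only the top scorers without any diversity preservation.

```python
import random
from typing import Callable, List, Optional, Tuple, TypeVar

C = TypeVar("C")  # a candidate program, in whatever form the harness uses


def search(
    seed: C,
    propose: Callable[[List[Tuple[float, C]], random.Random], C],
    evaluate: Callable[[C], Optional[float]],
    generations: int,
    children_per_generation: int,
    population_size: int,
    rng: random.Random,
) -> Tuple[float, C]:
    """Return the best (score, candidate) found. Higher scores are better;
    evaluate returns None for candidates that fail the checks."""
    seed_score = evaluate(seed)
    if seed_score is None:
        raise ValueError("the seed must pass the evaluator")
    population = [(seed_score, seed)]                # 1. Seed
    for _ in range(generations):
        for _ in range(children_per_generation):
            child = propose(population, rng)         # 2. Generate / 5. Mutate
            score = evaluate(child)                  # 3. Evaluate
            if score is not None:
                population.append((score, child))
        # 4. Select: keep only the top scorers (no diversity term here)
        population.sort(key=lambda pair: pair[0], reverse=True)
        population = population[:population_size]
    return population[0]


# Toy stand-ins (assumptions for illustration only).
def toy_evaluate(candidate: Tuple[int, int]) -> Optional[float]:
    x, y = candidate
    if x < 0 or y < 0:
        return None  # stands in for "failed the correctness tests"
    return -float((x - 7) ** 2 + (y - 3) ** 2)


def toy_propose(population, rng):
    # A real generator would prompt a model with the scored history.
    _, (x, y) = rng.choice(population)
    return (x + rng.choice([-1, 0, 1]), y + rng.choice([-1, 0, 1]))


best_score, best = search(
    (0, 0), toy_propose, toy_evaluate,
    generations=50, children_per_generation=8,
    population_size=4, rng=random.Random(0),
)
```

Because selection never discards the current best, the best score cannot get worse from one generation to the next. A production controller would also want a diversity criterion so the population does not collapse onto near-duplicates.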

This differs from [Generator-Evaluator](generator-evaluator.md). Generator-Evaluator asks one agent to create an artifact and another to judge whether that artifact is good enough. Evaluator-Driven Code Search turns evaluation into selection pressure across a population. The evaluator isn't a reviewer; it's the fitness function.

Evaluator design is the whole pattern. A good evaluator is fast, deterministic enough for comparisons, hard to game, and aligned with the real objective. For an algorithmic task, that may mean checking correctness against reference outputs and then measuring speed. For an architecture task, it may mean running an [Architecture Fitness Function](architecture-fitness-function.md) that scores whether a structural property improved without breaking behavior.
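
For the algorithmic case, one way to sketch "correctness first, then speed" is a gate followed by a timing score. The names here (`make_evaluator`, `Sorter`, the sorting task itself) are assumptions chosen for illustration. The score is wall-clock seconds, so lower is better, and it is only as repeatable as the machine it runs on.

```python
import time
from typing import Callable, List, Optional, Sequence

Sorter = Callable[[List[int]], List[int]]


def make_evaluator(
    reference: Sorter, cases: Sequence[List[int]], repeats: int = 5
) -> Callable[[Sorter], Optional[float]]:
    # Reference outputs are computed once and act as the oracle.
    expected = [reference(list(case)) for case in cases]

    def evaluate(candidate: Sorter) -> Optional[float]:
        # Gate: a crash or a wrong answer disqualifies the candidate.
        for case, want in zip(cases, expected):
            try:
                got = candidate(list(case))
            except Exception:
                return None
            if got != want:
                return None
        # Score: best-of-N wall time across all cases, in seconds.
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            for case in cases:
                candidate(list(case))
            timings.append(time.perf_counter() - start)
        return min(timings)

    return evaluate


cases = [[3, 1, 2], [5, 4], []]
evaluate = make_evaluator(sorted, cases)
```

Here `evaluate(lambda xs: xs)` returns `None` because the unsorted output fails the gate, so a fast but wrong candidate never reaches the ranking step.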

> **⚠️ Warning**
>
> Don't use this pattern when the evaluator is vague. If the score is "looks clean" or "the model judge likes it," the search will learn to satisfy that judge. You get optimized theater instead of better code.

## How It Plays Out

A performance team has a hot path in a tensor library. The existing kernel is correct, but expensive. They give the agent a benchmark harness, correctness tests, and a seed implementation. The agent proposes variants, the harness rejects incorrect ones, and the scorer ranks the survivors by runtime. After hundreds of trials, the best candidate is not a rewrite a human would have tried first. It is a small change in tiling and memory access that passes the tests and cuts runtime by 9%. The team reviews the code, adds comments explaining the shape, and ships it behind the normal benchmark gate.

A platform group wants to tune a cache replacement policy. The objective is explicit: reduce misses on a representative trace without increasing memory use. A one-shot agent suggests a textbook policy. The evaluator-driven loop does better: it mutates small policy fragments, runs them against saved traces, keeps variants that improve the miss rate, and feeds those variants back into the next prompts. The final result is not "the model's answer." It is the best candidate found by a search process the model helped drive.
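
A trace-replay evaluator for a scenario like this could look roughly as follows. It is a hedged sketch: `miss_rate`, the `Policy` signature, the per-key stats, and the tiny trace are all assumptions made for the example. Memory use is held constant by fixing `capacity`, so the only thing a candidate policy can change is which key it evicts.

```python
from typing import Callable, Dict, Hashable, Sequence, Tuple

Stats = Tuple[int, int]  # (last_used_tick, hit_count) for a resident key
Policy = Callable[[Dict[Hashable, Stats]], Hashable]


def miss_rate(policy: Policy, trace: Sequence[Hashable], capacity: int) -> float:
    """Replay a saved trace and return the fraction of accesses that miss."""
    if not trace:
        raise ValueError("trace must not be empty")
    resident: Dict[Hashable, Stats] = {}
    misses = 0
    for tick, key in enumerate(trace):
        if key in resident:
            _, hits = resident[key]
            resident[key] = (tick, hits + 1)
            continue
        misses += 1
        if len(resident) >= capacity:
            # The policy sees a copy; an invalid victim raises KeyError,
            # which a harness would treat as a rejected candidate.
            victim = policy(dict(resident))
            resident.pop(victim)
        resident[key] = (tick, 0)
    return misses / len(trace)


def lru(resident):
    # Evict the least recently used key.
    return min(resident, key=lambda k: resident[k][0])


def lfu(resident):
    # Evict the least frequently hit key, breaking ties by age.
    return min(resident, key=lambda k: (resident[k][1], resident[k][0]))


trace = ["a", "a", "a", "b", "c", "a", "d", "e", "a"]
lru_rate = miss_rate(lru, trace, capacity=2)  # 7 misses out of 9
lfu_rate = miss_rate(lfu, trace, capacity=2)  # 5 misses out of 9
```

On this toy trace the frequency-aware policy keeps the hot key through the scan and scores better. A trace this small says nothing about a real workload, which is why the scenario calls for representative saved traces.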

A team tries the same approach on code style and fails. They ask an evaluator to score "maintainability" with an LLM judge, then let the search optimize against it. Within a few rounds, the candidates become verbose, over-commented, and oddly formatted because the judge rewarded visible effort. Nothing gets easier to maintain. The team stops the loop and replaces the evaluator with concrete checks: maximum function length, dependency boundaries, test coverage, and a human review for the parts that still require judgment.
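
As an illustration of what a concrete check looks like next to a judge's opinion, here is a sketch of a maximum-function-length check built on Python's standard `ast` module. The function name `long_functions` and the line limit are assumptions. A check like this is deterministic, but it can still be gamed (for example, by splitting functions arbitrarily), which is why the team keeps human review in the loop.

```python
import ast
from typing import List


def long_functions(source: str, max_lines: int) -> List[str]:
    """List functions whose definition spans more than max_lines lines."""
    offenders = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # end_lineno is available on Python 3.8 and later.
            length = node.end_lineno - node.lineno + 1
            if length > max_lines:
                offenders.append(f"{node.name} ({length} lines)")
    return offenders


sample = (
    "def short_one():\n"
    "    return 1\n"
    "\n"
    "def long_one():\n"
    "    a = 1\n"
    "    b = 2\n"
    "    c = 3\n"
    "    return a + b + c\n"
)
report = long_functions(sample, max_lines=3)  # ['long_one (5 lines)']
```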

## Consequences

**Benefits.** Evaluator-Driven Code Search turns agentic coding from single-shot generation into controlled exploration. It is especially strong when the code's quality can be measured: algorithms, heuristics, schedulers, kernels, search procedures, and other domains where correctness and performance can be checked automatically. It also produces an audit trail. You can inspect the candidate history, see which mutations improved the score, and understand why the final program survived.

**Liabilities.** The pattern inherits every weakness of its evaluator. A thin oracle rewards hacks. A slow oracle makes the search too expensive. A nondeterministic oracle adds noise, so candidates appear better or worse by chance. The search can also overfit to the benchmark trace, producing code that wins the harness and fails the real workload.
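
One common way to blunt evaluator noise is to score each candidate several times and only accept a challenger that wins by a margin. The sketch below assumes positive, lower-is-better scores such as runtimes. `clearly_better` and the 2% default margin are illustrative choices, not values from any cited system.

```python
import statistics
from typing import List


def clearly_better(
    challenger: List[float], incumbent: List[float], margin: float = 0.02
) -> bool:
    """Lower is better. The challenger's median must beat the incumbent's
    median by at least the relative margin to count as an improvement."""
    return statistics.median(challenger) < statistics.median(incumbent) * (1.0 - margin)


incumbent_runs = [10.0, 10.1, 9.9]
lucky_once = [8.0, 10.2, 10.3]      # one fast run, median 10.2
steady_gain = [9.0, 9.1, 12.0]      # one slow outlier, median 9.1

accept_lucky = clearly_better(lucky_once, incumbent_runs)    # False
accept_steady = clearly_better(steady_gain, incumbent_runs)  # True
```

The median ignores a single lucky or unlucky run, and the margin keeps the search from promoting candidates whose advantage is within measurement noise. The cost is more evaluator runs per candidate.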

The operational burden is higher than ordinary agentic coding. You need a harness, a score, storage for candidates, a controller that chooses what to preserve, and a stopping rule. For many product features, that machinery is waste. Reach for this pattern when the problem has a clear objective function and enough value at stake to justify search.
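
A stopping rule can be as simple as a budget plus a plateau check. The following is a sketch under assumptions: `should_stop`, `patience`, and `min_gain` are hypothetical names, scores are higher-is-better, and the history holds the best score after each completed generation.

```python
from typing import Sequence


def should_stop(
    best_by_generation: Sequence[float],
    max_generations: int,
    patience: int,
    min_gain: float,
) -> bool:
    """Stop when the budget is spent or the best score has improved by
    less than min_gain over the last `patience` generations."""
    done = len(best_by_generation)
    if done >= max_generations:
        return True            # budget exhausted
    if done <= patience:
        return False           # not enough history to judge a plateau
    recent_gain = best_by_generation[-1] - best_by_generation[-1 - patience]
    return recent_gain < min_gain


plateaued = should_stop([1.0, 2.0, 2.0, 2.0, 2.0], 10, patience=3, min_gain=0.1)
still_climbing = should_stop([1.0, 2.0, 3.0, 4.0, 5.0], 10, patience=3, min_gain=0.1)
```

Here `plateaued` is `True` and `still_climbing` is `False`. The right `patience` and `min_gain` depend on evaluator cost and on how much a marginal improvement is worth.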

## Sources

- Google DeepMind introduced AlphaEvolve in *[AlphaEvolve: A Gemini-powered coding agent for designing advanced algorithms](https://deepmind.google/blog/alphaevolve-a-gemini-powered-coding-agent-for-designing-advanced-algorithms/)* (May 2025), describing a coding agent that combines large language model proposals with automated evaluators and an evolutionary framework.
- Alexander Novikov and colleagues documented the technical design and reported results in *[AlphaEvolve: A coding agent for scientific and algorithmic discovery](https://arxiv.org/abs/2506.13131)* (arXiv:2506.13131, 2025), including applications to data-center scheduling, hardware design, AI-training kernels, matrix multiplication, and mathematical search problems.
- Google DeepMind's follow-up *[AlphaEvolve: How our Gemini-powered coding agent is scaling impact across fields](https://deepmind.google/blog/alphaevolve-impact/)* (May 2026) showed the same pattern moving from initial algorithm discovery into broader infrastructure, research, and commercial optimization tasks.
- Henrique Assumpção, Diego Ferreira, Leandro Campos, and Fabricio Murai introduced CodeEvolve in *[CodeEvolve: An open source evolutionary coding agent for algorithm discovery and optimization](https://arxiv.org/abs/2510.14150)* (arXiv:2510.14150, 2025), showing an open-source implementation with island populations, LLM-generated mutations, crossover, and evaluator feedback.

---

- [Next: Model Routing](model-routing.md)
- [Previous: Generator-Evaluator](generator-evaluator.md)