LLM-as-Judge

Use one model to score another’s output against a written rubric, so you can evaluate non-deterministic agent work at machine cost without giving up most of the signal a human reviewer would provide.

Also known as: LLM-as-a-Judge, Model-Graded Eval. The agentic generalization is called Agent-as-a-Judge.

Understand This First

  • Test Oracle — LLM-as-Judge is a probabilistic oracle, distinct from the deterministic oracles that handle exact-match cases.
  • Test — the unit being judged; an LLM-as-Judge run is one kind of test.
  • Feedback Sensor — a judge is one kind of inferential sensor inside a larger feedback loop.

Context

You are building or running an agent that produces non-deterministic, open-ended output. A summary. A code review comment. A customer-support reply. A generated test plan. Exact-match assertions cannot tell you whether the output is good, because there is no single “right answer” to compare against. This is a tactical evaluation pattern. It sits beside Test Oracle as one of the answers to the question “how do we know the output is correct?” when the answer is not a string equality check.

Human review is the gold standard for output like this, and it doesn’t scale. A senior engineer can read fifty agent code reviews in a careful afternoon. The agent generates fifty in an hour. So the team has to choose between a complete signal that arrives too late and no signal at all.

LLM-as-Judge sits in that gap. It uses a separate model call (typically a strong instruction-following model, often from a different vendor than the one being evaluated) to grade the output against a written rubric. The judgment is probabilistic and imperfect, but in published research it agrees with human reviewers about 80% of the time at roughly 1% of the cost. That ratio is what makes continuous quality monitoring of agent output economically possible.

Problem

Deterministic test oracles cover a vanishing fraction of real agent output. You can assert that a JSON response parses, that a number falls in a range, that a returned URL is reachable. You cannot assert that a generated summary is faithful, that a code review comment is useful, or that a chatbot reply is tactful.

So how do you measure the quality of agent output that has many valid forms, when human review burns hours per evaluation and you have hundreds of new outputs per day?

Forces

  • Deterministic checks are cheap and trustworthy but only cover a narrow band of correctness.
  • Human review is the trustworthy gold standard, but it does not scale past a few hundred examples per release.
  • A model judging another model is fast and cheap, but it introduces its own systematic biases. The judge is not a neutral instrument.
  • Continuous quality monitoring on production traffic requires an evaluator that runs nightly without a human in the loop.
  • Rubric design is real engineering work; a vague rubric produces vague scores that drive nothing.

Solution

Use a separate LLM call to score the output against an explicit, written rubric. The judge gets the input, the output, and the rubric. It returns a score (or a winner, in pairwise mode) and a short reasoning trace. Three canonical modes cover almost every real use:

Single-output rubric scoring. The judge sees one output and assigns a score on each rubric dimension, typically pass/fail or a small integer scale (1–5). This is the workhorse mode for regression dashboards and nightly batch evaluation.

Pairwise comparison. The judge sees two outputs for the same input and picks the winner. Always run both orderings and aggregate; never trust a one-way result. Pairwise is the right mode for prompt A/B tests and for choosing among small candidate sets in a Generator-Evaluator loop.

Group ranking. The judge orders three or more candidates from best to worst. Useful when you need to pick the top result from a beam search or a fan-out, and the relative order matters more than absolute scores.

The judge prompt itself has a load-bearing structure. Give it a role (“you are an expert reviewer of customer-support replies”). State the rubric in plain language, with one criterion per line. Ask for the reasoning before the final score, so the model commits to its analysis before committing to a number. Specify a strict output format the calling code can parse, usually a small JSON object with the score, the reasoning, and any flags. Keep the rubric short. A judge prompt that runs to two pages is one the judge will not actually follow.
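
A minimal sketch of that structure in the single-output mode. The rubric wording, the JSON shape, and the `call_llm` hook are illustrative assumptions, not part of the pattern; swap in whatever client and criteria you actually use.

```python
import json

JUDGE_PROMPT = """You are an expert reviewer of customer-support replies.

Judge the reply against each criterion:
- faithful: every claim is supported by the customer's message
- concise: no padding; length is proportional to the question
- tone: polite, direct, makes no promises the company cannot keep

Respond with a single JSON object, reasoning first, then the scores:
{{"reasoning": "<2-3 sentences>", "faithful": "pass|fail", "concise": "pass|fail", "tone": "pass|fail"}}

Customer message:
{input}

Reply to judge:
{output}
"""

def judge_single(call_llm, customer_msg: str, reply: str) -> dict:
    """Score one output against the rubric. `call_llm` is any prompt -> text function."""
    raw = call_llm(JUDGE_PROMPT.format(input=customer_msg, output=reply))
    # The prompt demands a bare JSON object; tolerate stray text before it.
    return json.loads(raw[raw.find("{"):])
```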

Two design choices then determine whether the judge produces signal or noise.

Pick a different model family from the one you are evaluating. Self-preference bias is real and measurable: judges over-rate output from their own family. If the agent runs on Claude, judge with GPT or Gemini. If it runs on GPT, judge with Claude. When that is not possible, rotate judges across runs and average.

Calibrate against a small human-labeled gold set. Before you trust a judge’s nightly numbers, label fifty to a hundred examples by hand and confirm the judge agrees with you most of the time. The gold set also catches rubric drift later: when the rubric the judge uses today no longer matches the rubric the team agreed on six months ago, agreement on the gold set drops first, before any production metric moves.
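
A sketch of the calibration check, assuming you have already run the judge over the gold set and collected its pass/fail verdicts alongside the human labels; what counts as "agrees most of the time" is a bar the team has to set for itself.

```python
def gold_set_agreement(judge_labels: list[str], human_labels: list[str]) -> float:
    """Fraction of gold-set examples where the judge's verdict matches the human label."""
    assert len(judge_labels) == len(human_labels) > 0
    hits = sum(j == h for j, h in zip(judge_labels, human_labels))
    return hits / len(human_labels)

# Re-run against the same gold set whenever the rubric, the judge prompt, or the
# judge model changes. A drop here is the early warning for rubric drift described above.
```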

How It Plays Out

A team running a production summarization agent wires LLM-as-Judge into their nightly pipeline. They sample 1% of the prior day’s outputs, send each through a judge prompt that scores faithfulness, conciseness, and tone-match on a 1–5 scale, and write the scores to a dashboard with a 7-day moving average. When faithfulness drops below 4.0 for two consecutive days, the on-call engineer is paged. Two weeks after a routine model upgrade, the dashboard catches a silent regression: the new model is faster and cheaper but hallucinates more. Without the nightly judge, the team would have learned about it from customer support tickets a month later.
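
The alerting rule from this scenario, sketched in code; the 7-day window, the 4.0 threshold, and the two-day rule are the numbers from the story above, and the input is assumed to be one mean faithfulness score per day.

```python
from statistics import mean

WINDOW = 7        # days in the moving average
THRESHOLD = 4.0   # minimum acceptable faithfulness on the 1-5 scale
STREAK = 2        # consecutive days below threshold before paging

def should_page(daily_faithfulness: list[float]) -> bool:
    """daily_faithfulness: mean judge faithfulness score per day, oldest first."""
    if len(daily_faithfulness) < STREAK:
        return False
    moving_avg = [
        mean(daily_faithfulness[max(0, i - WINDOW + 1): i + 1])
        for i in range(len(daily_faithfulness))
    ]
    return all(m < THRESHOLD for m in moving_avg[-STREAK:])
```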

A solo developer working on a code-review agent wants to A/B test two prompt variants. She has 200 historical pull requests, each with a known good review verified by a senior engineer. She runs both variants on every PR, then runs a pairwise judge (“which of these two reviews better matches the gold review?”) in both orderings. After 400 judgments, variant B wins 137–63 with both orderings agreeing on 89% of pairs. The 89% agreement number is the signal she actually trusts; if the orderings had disagreed half the time, she would know position bias was driving the result and the test would be inconclusive.

A team at a third company adopts pairwise judging without running both orderings. Six weeks later a confused engineer working on something else discovers the team has been “shipping” whichever prompt variant happened to be listed first in the harness. The 60–40 result that justified each rollout was almost entirely position bias. The fix is one line of code (run both orderings, average), but the lesson sticks for the next hire: a judge is a real measurement instrument with real instrumentation problems.

Tip

Start every new judge with a binary pass/fail rubric and graduate to a small integer scale only when you need it. Continuous floats sound more precise but produce noisier scores than judges actually deserve, and they invite false confidence in tiny score differences.

Where It Breaks

Four well-documented biases will trip any team using LLM-as-Judge. They aren’t exotic edge cases. They’re the default behavior of every model that has been studied. Plan for them from the start.

Position bias. In pairwise comparison, judges systematically prefer one position, usually the first candidate and sometimes the last. The effect is large enough to flip results entirely. The mitigation is mechanical: always run both orderings, aggregate the scores, and treat disagreement between orderings as a signal that the comparison is too close to call.
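
The mitigation as code, a minimal sketch; `judge_once` stands in for whatever single pairwise judge call you already have, and the "tie" outcome is how disagreement between orderings surfaces to the caller.

```python
def pairwise_verdict(judge_once, task_input: str, output_a: str, output_b: str) -> str:
    """
    judge_once(task_input, first, second) returns "first" or "second".
    Run both orderings; disagreement between them means too close to call.
    """
    run1 = judge_once(task_input, output_a, output_b)  # A in the first position
    run2 = judge_once(task_input, output_b, output_a)  # B in the first position
    if run1 == "first" and run2 == "second":
        return "A"
    if run1 == "second" and run2 == "first":
        return "B"
    return "tie"  # orderings disagree: likely position bias, exclude from the tally
```

The fraction of non-tie verdicts across all pairs is the ordering-agreement number the A/B-test vignette above treats as the signal worth trusting.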

Verbosity bias. Judges over-rate longer outputs even when the extra length is padding or nonsense. A confident, wordy wrong answer often beats a terse correct one. Mitigations: include “conciseness counts” explicitly in the rubric; track length as a separate metric so verbosity changes are visible; for hard cases, add an independent length-penalty term to the aggregate score.
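
One plausible shape for that independent length-penalty term, sketched here; the target length and the penalty weight are tuning knobs, not values from the sources.

```python
def length_adjusted_score(score: float, output_tokens: int,
                          target_tokens: int, weight: float = 0.5) -> float:
    """Penalize only the overshoot past the target length, in proportion to it."""
    overshoot = max(0, output_tokens - target_tokens) / max(1, target_tokens)
    return score - weight * overshoot
```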

Self-preference bias. Judges over-rate outputs from their own model family. The strongest evidence is in pairwise studies, but the effect shows up in single-output scoring too. The mitigation is to judge with a different family from the one being evaluated; when that is not possible, rotate judges and watch for any one judge consistently scoring its family higher.

Authority bias. Judges over-weight confident-sounding language even when the underlying content is wrong. A reply that hedges appropriately (“I’m not sure, but I think…”) often loses to a reply that asserts a wrong answer with conviction. Mitigations: write rubric language that explicitly de-couples confidence from correctness; require the judge to cite specific evidence in its reasoning before producing the score.

A fifth, broader failure mode doesn’t have a tidy name. The judge will confabulate a coherent-sounding score on output it doesn’t actually understand. The deeper the domain, the more the judge needs the same context the generator had: the source statute, the customer’s prior history, the relevant section of the spec. A judge scoring a legal summary without seeing the underlying statute is a confident liar; a judge scoring a code review comment without seeing the code is the same.

The deepest failure mode is Goodhart’s Law. Once a judge becomes the metric the team ships against, the agent gets optimized to please the judge, which means the agent’s specialty becomes the judge’s blind spots. The mitigation is to keep recalibrating against human-labeled examples and to rotate judges periodically, so the agent never gets too comfortable pleasing one particular grader.

Consequences

Benefits. Continuous quality monitoring on non-deterministic output becomes economically possible at scale. Regressions get caught nightly instead of in customer support tickets two weeks later. Prompt A/B tests can run on hundreds of examples in minutes, with statistically meaningful results from a single afternoon of work. The judge prompt becomes a living artifact of what the team thinks “good” actually means, often the most useful side effect because it forces tacit quality standards to become explicit.

Liabilities. The judge is a real cost line on every evaluation: cents to dollars per call, multiplied by every output you grade. Rubric design takes real engineering and iteration; the first rubric is rarely the right one. The four biases will trip the team at least once, usually painfully, before the de-biasing playbook becomes muscle memory. And the judge has to be calibrated against human-labeled examples, which still requires human work upfront, just less of it than reviewing every output by hand.

Failure modes worth naming. Judging without the source context the generator had (confabulation). Using the same model family as judge and judged (self-preference collapses signal). Rubric drift when someone tweaks the rubric without updating the gold set. Goodhart’s Law: the agent gets optimized to the judge’s blind spots and the underlying user is no longer being served, even though the dashboard looks great.

Sources

  • Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica formalized LLM-as-a-Judge in Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena (NeurIPS 2023). Their study established the ~80% agreement-with-humans figure and named the position-bias and verbosity-bias problems that every later treatment builds on.
  • Mingchen Zhuge, Changsheng Zhao, Dylan R. Ashley, Wenyi Wang, Dmitrii Khizbullin, Yunyang Xiong, Zechun Liu, Ernie Chang, Raghuraman Krishnamoorthi, Yuandong Tian, Yangyang Shi, Vikas Chandra, and Jürgen Schmidhuber generalized the technique to multi-step agents in Agent-as-a-Judge: Evaluate Agents with Agents (2024), where the judge has tools, memory, and planning rather than a single completion.
  • The Hugging Face cookbook entry Using LLM-as-a-judge for an automated and versatile evaluation turned the academic technique into a practitioner walkthrough, including the rubric-design checklist that most teams now follow.
  • Michael Fagan’s software inspection work (1976) established the older principle the entire pattern depends on: independent review by someone other than the author catches defects that self-review misses. LLM-as-Judge is what happens when you apply that principle to non-deterministic output at machine speed.

Further Reading