---
slug: evaluation-gate
type: pattern
summary: "Run the agent's eval suite in CI and block merge or deploy when quality, latency, cost, or safety regress past agreed thresholds."
created: 2026-06-20
updated: 2026-06-20
related:
  eval:
    relation: depends-on
    note: "An evaluation gate is the CI enforcement point for an eval suite."
  shift-left-feedback:
    relation: specializes
    note: "The gate moves agent-quality feedback to the pull request, where the change is still cheap to fix."
  pipeline-as-code:
    relation: uses
    note: "The gate normally lives as a job in the versioned CI/CD pipeline."
  continuous-integration:
    relation: uses
    note: "CI supplies the per-change trigger that runs the eval suite."
  regression:
    relation: detects
    note: "The gate catches behavior that used to score acceptably and no longer does."
  llm-as-judge:
    relation: uses
    note: "Many gates use a judge model as one scorer for open-ended output."
  test-oracle:
    relation: uses
    note: "The gate needs one or more oracles to decide whether each eval case passed."
  runtime-governance:
    relation: complements
    note: "Runtime governance blocks risky actions while the agent runs; evaluation gates block risky releases before they merge or deploy."
  agentops:
    relation: complements
    note: "AgentOps turns production traces into future eval cases; the gate keeps those cases consequential."
  verification-loop:
    relation: contrasts-with
    note: "A verification loop is the agent correcting itself inside a task; an evaluation gate is an external release barrier."
---

# Evaluation Gate

> **Pattern**
>
> A named solution to a recurring problem.

*Run the agent's eval suite in CI and block merge or deploy when quality, latency, cost, or safety regress past agreed thresholds.*

*Also known as: CI Gate, Eval Gate, Quality Gate, Release Gate*

An eval dashboard that nobody has to obey is a report. An evaluation gate is where the report becomes a control. The pull request changes a prompt, model, retrieval rule, tool, or instruction file.
CI runs the agent against a fixed eval set. If the score falls below the bar, the merge stops. That is the move that matters. The gate doesn't make the eval suite smarter. It makes the result binding.

## Understand This First

- [Eval](eval.md) — the repeatable suite the gate runs.
- [Continuous Integration](continuous-integration.md) — the trigger that runs checks on each change.
- [Pipeline as Code](pipeline-as-code.md) — the versioned file where the gate usually lives.
- [Regression](regression.md) — the class of failure the gate is built to catch.

## Context

This is an **operational governance** pattern. It sits at the boundary between agent evaluation and delivery automation: after the team has an [Eval](eval.md) suite, before the change reaches the main branch or production.

Traditional CI gates already block broken code. A test fails, the build turns red, the merge waits. Agentic systems need the same enforcement for behavior that ordinary tests can't see: an assistant gives less faithful answers, a coding agent ignores the style guide more often, a retrieval change improves one query class and damages another, or a judge score drops even though the code compiles.

The gate belongs in the same place as the change. If a pull request edits a prompt, the eval result should appear on that pull request. If a deploy swaps models, the gate should run before that deploy is promoted. A nightly report is useful for trend watching, but it doesn't stop a bad change while the author still has the context to fix it.

## Problem

How do you stop a prompt, model, retrieval, or agent-configuration change from shipping when it quietly makes the system worse?

Without a gate, evals become advisory. Someone may look at the dashboard after the merge. Someone may notice the score dipped. Someone may remember which change caused it. That is not a release control. It is archaeology.

Agent regressions are especially good at slipping through code-shaped gates.
The tests pass because the functions still run. The schema validates because the JSON still parses. The model still answers. What changed is quality: faithfulness, helpfulness, safety, latency, cost, tool-use accuracy, or task completion. The delivery path needs a check that can fail on those dimensions before the change lands.

## Forces

- **The signal is soft.** Many agent outcomes require scores, rubrics, judges, or human-calibrated examples rather than exact assertions.
- **The gate must be fast enough for CI.** A full production-scale eval run may be too slow or expensive for every pull request.
- **Thresholds are policy, not math.** A refund agent deserves a stricter bar than a draft summarizer.
- **Flakes can paralyze delivery.** Non-deterministic scorers and model variance can turn a real gate into a random merge blocker.
- **Aggregate scores hide local damage.** A healthy average can mask one business-critical flow collapsing.

## Solution

**Put a small, representative eval run on the delivery path and make it merge-blocking.**

Treat prompts, models, retrieval settings, tool definitions, and agent instructions as release inputs. When any of them changes, CI runs the eval suite against a fixed golden set: curated examples whose expected behavior is known. The gate compares the result to a baseline.

Start with a narrow gate. Pick cases that represent real failures, high-value flows, and safety boundaries. A useful CI set is usually smaller than the full offline suite: tens or low hundreds of examples, not every trace the team has ever collected. The full suite can still run nightly or before major releases. The gate's job is to catch regressions too costly to let through a pull request.

Score more than one dimension. A gate that checks only "average quality >= 0.85" will miss the change that keeps quality high by doubling latency, tripling cost, or breaking one critical flow.
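
The blind spot of a single aggregate check can be illustrated with a minimal sketch. Everything here is hypothetical: the `Threshold` class, the `evaluate_gate` function, the metric names, and the numeric bars are assumptions chosen for illustration, not part of any particular eval framework. The run below clears the "average quality >= 0.85" bar but still fails, because each dimension is checked on its own. In a CI job, a non-empty failure list would typically translate into a nonzero exit code so the merge is blocked.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """A hypothetical per-dimension bar; names and limits are illustrative."""
    name: str
    limit: float
    higher_is_better: bool = True

    def passes(self, value: float) -> bool:
        # Quality-style metrics must stay at or above the bar;
        # latency- and cost-style metrics must stay at or below it.
        return value >= self.limit if self.higher_is_better else value <= self.limit


# Assumed thresholds for illustration only; real bars are a team policy decision.
THRESHOLDS = [
    Threshold("avg_quality", 0.85),
    Threshold("refund_flow_faithfulness", 0.85),
    Threshold("p95_latency_s", 4.0, higher_is_better=False),
    Threshold("cost_per_run_usd", 0.05, higher_is_better=False),
]


def evaluate_gate(metrics: dict[str, float], thresholds=THRESHOLDS) -> list[str]:
    """Return the names of failed dimensions; an empty list means the gate passes."""
    failures = []
    for t in thresholds:
        value = metrics.get(t.name)
        # A missing metric counts as a failure so a broken scorer cannot pass silently.
        if value is None or not t.passes(value):
            failures.append(t.name)
    return failures


# Made-up eval results: the average looks healthy, one flow and the cost do not.
run = {
    "avg_quality": 0.88,
    "refund_flow_faithfulness": 0.70,
    "p95_latency_s": 3.1,
    "cost_per_run_usd": 0.11,
}
failed = evaluate_gate(run)
print("FAIL: " + ", ".join(failed) if failed else "PASS")
# prints "FAIL: refund_flow_faithfulness, cost_per_run_usd"
```
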
Use per-dimension thresholds: task success, rubric score, groundedness, safety, P95 latency, cost per run, tool-use correctness, or whatever the agent owes the product. When one dimension is a hard safety requirement, make it a hard fail rather than a weighted average.

Make the result visible where the author works. Post the per-scorer diff on the pull request: which cases improved, which regressed, which thresholds failed, and how the run compares with the current baseline. A red check without a useful diff teaches the team to rerun until it passes. A red check with named failing cases teaches the team what to fix.

Treat threshold changes like code. The gate is a policy artifact, not a magic number in CI YAML. Version the threshold file, review changes to it, and require a rationale when someone loosens a bar. If the team needs to bypass the gate, make the bypass explicit, logged, and owned by a human. Hidden bypasses turn the gate back into a report.

## How It Plays Out

A team changes the prompt for a support agent. The new prompt is shorter and cheaper, and local smoke tests look fine. The pull request triggers the evaluation gate. Fifty golden support cases run through the agent, a judge scores faithfulness and tone, and the gate compares the result to last week's baseline. Average score is still healthy, but refund-flow faithfulness drops from 0.91 to 0.74. The gate fails, posts the failing cases on the pull request, and the author sees that the shorter prompt removed a policy sentence the refund flow needed. The fix happens before merge.

A platform team swaps its default coding model. Unit tests pass because the generated code is never committed by the model-swap pull request. The evaluation gate runs a small suite of historical coding tasks. Correctness rises, but tool-use cost doubles and one task starts calling the search tool in a loop. The gate blocks on cost per successful run, not on correctness.
The team changes the routing rule so only difficult tasks reach the new model.

A third team adds an LLM-as-Judge scorer to its gate without calibrating it. For two weeks the gate fails randomly, and engineers start rerunning the workflow instead of fixing the product. The team repairs the gate by freezing judge temperature, running each borderline case twice, tracking confidence bands, and calibrating against a small human-labeled set. The gate becomes boring again. That's what you want from a gate: it fails rarely, but when it fails, people believe it.

> **⚠️ Warning**
>
> Do not ship a single aggregate threshold as your only gate. Averages hide exactly the regressions that matter: one flow gets worse, one safety class slips, one latency tail blows out. Gate the dimensions the product cannot afford to lose.

## Where It Breaks

The first failure mode is **threshold theater**. The team sets a bar, watches it fail, and quietly lowers it until delivery feels easy again. The cure is governance: threshold changes are reviewed, justified, and visible in the same history as the code.

The second is **flaky enforcement**. If a gate fails for reasons the author can't reproduce, the team learns to distrust it. Keep the CI set small, deterministic where possible, and stable. For model-graded checks, use fixed judge settings, record prompts and outputs, and quarantine flaky cases until they are repaired.

The third is **dataset staleness**. A golden set that never absorbs production failures becomes a museum of last quarter's problems. Feed production traces, support incidents, and postmortem examples back into the offline suite, then promote the highest-value cases into the CI gate. [AgentOps](agentops.md) is the upstream discipline that keeps this loop supplied.

The fourth is **gate capture**. Once a metric blocks releases, teams optimize for the metric. Agents get tuned to please the judge, not the user.
Mitigate this with human calibration, rotated scorers, occasional blind review, and a habit of adding counterexamples whenever the gate misses something users care about.

## Consequences

**Benefits.** Evaluation stops being a dashboard and becomes a release control. Prompt and model changes get the same discipline as code changes. Regressions are caught while the author still has the pull request open and the context in mind. The gate also forces the team to say what "good enough" means in operational terms: which dimensions matter, which thresholds are hard, and who owns exceptions.

**Liabilities.** The gate costs money and time on every relevant change. It needs a curated dataset, stable scorers, baseline management, and a process for repairing flaky cases. A strict gate can slow delivery; a loose one becomes decoration. And because the gate turns quality into policy, it creates organizational pressure: when a release is late, someone will ask to lower the bar. The team needs to decide in advance who can do that and what evidence they owe.

An evaluation gate also changes the eval suite's social meaning. Before the gate, the suite was a measurement tool. After the gate, it is a contract: these behaviors must keep working, or the change does not land. That contract is powerful, but only if the team maintains it.

## Sources

- Jez Humble and David Farley's *[Continuous Delivery](https://martinfowler.com/books/continuousDelivery.html)* (Addison-Wesley, 2010) established the deployment pipeline as the place where automated tests, acceptance checks, and release decisions become one repeatable path from commit to production. Evaluation Gate applies that deployment-pipeline idea to agent behavior.
- OpenAI's [Evals framework](https://github.com/openai/evals) helped popularize eval suites as reusable measurement artifacts for language models. This article treats the suite as the thing being enforced, not merely reported.
- Alexandre Cristovão Maiorano's 2026 preprint *[Automated Self-Testing as a Quality Gate: Evidence-Driven Release Management for LLM Applications](https://arxiv.org/abs/2603.15676)* describes multi-dimensional quality gates for LLM applications, including promote, hold, and rollback decisions across task success, evidence coverage, latency, and safety.
- Braintrust's [Quality gate](https://www.braintrust.dev/encyclopedia/quality-gate) and [CI/CD integration](https://www.braintrust.dev/encyclopedia/ci-cd-integration) entries give the same compact release-control shape: run evals on changes and block deployment when scores miss thresholds.
- Practitioner guides from Braintrust, Kinde, Traceloop, and Digital Applied describe the same delivery path in longer form: a fixed dataset, scored outputs, pull-request or deploy-time reporting, and a build that fails when thresholds are missed.

## Further Reading

- [Braintrust, "Best AI Eval Tools for CI/CD Pipelines (2026 Review)"](https://www.braintrust.dev/articles/best-ai-evals-tools-cicd-2025) — a tool-oriented survey, useful for seeing which CI features teams now expect from eval platforms.
- [Kinde, "CI/CD for Evals: Running Prompt & Agent Regression Tests in GitHub Actions"](https://www.kinde.com/learn/ai-for-software-engineering/ai-devops/ci-cd-for-evals-running-prompt-and-agent-regression-tests-in-github-actions/) — a practical GitHub Actions walkthrough for turning prompt and agent regression tests into a merge-blocking check.
- [Traceloop, "Automated Prompt Regression Testing with LLM-as-a-Judge and CI/CD"](https://www.traceloop.com/blog/automated-prompt-regression-testing-with-llm-as-a-judge-and-ci-cd) — a concise treatment of prompt regression tests, golden datasets, judge rubrics, and CI failure conditions.
- [Digital Applied, "Building an AI Agent Evaluation Pipeline: 2026 Methodology"](https://www.digitalapplied.com/blog/ai-agent-evaluation-pipeline-2026-testing-methodology) — a pipeline-level guide that separates PR-triggered eval runs, threshold scoring, PR reporting, and trace-to-eval feedback.

---

- [Next: Human in the Loop](human-in-the-loop.md)
- [Previous: Eval](eval.md)