Test Pyramid
“The test pyramid is a way of thinking about how different kinds of automated tests should be used to create a balanced portfolio.” — Martin Fowler
A heuristic for allocating testing effort: many fast, cheap tests at the base, fewer slow, expensive tests at the top.
Understand This First
- Test – the basic unit whose allocation this pattern governs.
- Test Oracle – different oracle kinds live at different pyramid layers.
Context
Every project with more than a handful of tests faces the same question: where should the effort go? A team can write ten thousand unit tests, fifty end-to-end browser tests, or any mix in between. The choice looks like a matter of taste until the bill arrives: a test suite that takes forty minutes to run won’t get run; one dominated by flaky browser tests will train everyone to ignore failures. This is a tactical pattern. It sits above individual Tests and shapes a whole test suite.
The pyramid is the classic answer. Mike Cohn sketched it in Succeeding with Agile (2009); Ham Vocke’s “The Practical Test Pyramid” (2018), hosted on martinfowler.com, made it canonical; and the 2026 wave of agentic coding has given it a second, parallel life.
Problem
Not all tests cost the same. A unit test against a pure function runs in microseconds, has no dependencies, and almost never flakes. An end-to-end test that drives a browser against a staging environment takes tens of seconds, depends on a dozen services being healthy, and fails intermittently for reasons unrelated to the code under test. If you treat them as equivalent (counting “tests” as a single number), you end up with a suite that is slow, flaky, and expensive to maintain, yet somehow misses the bugs that matter.
How do you decide how many of each kind to write, given that end-to-end tests feel more convincing but cost orders of magnitude more per assertion than unit tests?
Forces
- Fast tests give fast feedback; slow tests give realistic feedback.
- End-to-end tests catch integration bugs that unit tests cannot see.
- End-to-end tests flake, and a flaky suite trains people to ignore red builds.
- Every test has a maintenance cost that compounds as the codebase changes.
- With AI agents now generating tests at high volume, the suite can balloon quickly into something nobody can run locally.
Solution
Shape your test suite like a pyramid. Put many fast, isolated tests at the base, fewer integration tests in the middle, and a small number of end-to-end tests at the top. The widths are proportions, not fixed ratios, but the rough guidance holds: if unit tests are not the majority by count, something is wrong.
The classic three layers:
- Unit. One function, class, or module in isolation. No network, no database, no filesystem. Runs in milliseconds. You write hundreds or thousands of these.
- Integration. A real component talking to one or two real collaborators: your code against a real database, a module against a real filesystem, an API handler exercised end-to-end inside a single process. Runs in hundreds of milliseconds. You write tens or low hundreds.
- End-to-end. The whole system exercised from the outside, as a user or client would use it. A browser against a running server, a deploy against a staging environment. Runs in seconds or tens of seconds. You write only the handful you cannot live without.
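The bottom two layers can be sketched in a few lines. This is an illustrative example, not from the original text: `apply_discount` is a hypothetical pure function standing in for the base layer, and an in-memory SQLite database stands in for the "real collaborator" of the integration layer.

```python
# Base and middle layers of the pyramid, sketched with plain asserts.
# apply_discount and the orders table are hypothetical examples.
import sqlite3

def apply_discount(price_cents: int, percent: int) -> int:
    """Pure logic: no I/O, no collaborators. Ideal base-layer material."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return price_cents * (100 - percent) // 100

# --- Unit test: one function in isolation, runs in microseconds ---
def test_apply_discount():
    assert apply_discount(1000, 25) == 750
    assert apply_discount(999, 0) == 999

# --- Integration test: the same logic against a real (in-memory) database ---
def test_orders_roundtrip():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total INTEGER)")
    conn.execute("INSERT INTO orders (total) VALUES (?)",
                 (apply_discount(1000, 10),))
    (total,) = conn.execute("SELECT total FROM orders").fetchone()
    assert total == 900

test_apply_discount()
test_orders_roundtrip()
```

Notice the asymmetry: the unit test needs nothing but the function, while the integration test needs a schema and a connection. That setup cost, multiplied across hundreds of tests, is why the base of the pyramid is wide and the middle is not.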
The shape follows from economics. A bug caught at the base is cheap to find and cheap to fix, because the failing test points directly at the code. A bug caught at the top is still caught, which is better than an escape, but the diagnosis is harder, and the test was more expensive to write and is more expensive to run. You want the cheapest layer that could have caught each bug to be the one that does.
The opposite shape, the “ice cream cone,” with a few unit tests propping up a mountain of end-to-end tests, is the anti-shape. It signals that the team either distrusts unit tests or could not figure out how to write them, and it leads to slow builds, random flakes, and the quiet abandonment of CI as a source of truth.
The Agentic Pyramid
In 2026, a second pyramid has emerged alongside the classical one, shaped by the same economic logic but aimed at systems that include non-deterministic components like LLMs. Practitioners building agent evaluation pipelines have converged on reorganizing the layers by uncertainty tolerance rather than test type:
- Base: deterministic tests. Traditional unit and integration tests over the non-LLM parts of the system. Tool handlers, prompt builders, schema validators, state machines. These must be reproducible and fast, because the layers above them won’t be.
- Middle: recorded interactions and LLM-as-judge evaluations. Record-and-replay tests that pin down an agent’s interaction with a tool or MCP server so that the integration is deterministic in CI. Above those sit rubric-based evaluations where one LLM scores another’s output on dimensions like accuracy, helpfulness, and safety.
- Top: end-to-end simulations and human review. A small number of realistic agent runs against a staging environment, plus periodic human spot-checks. Expensive to run, impossible to fully automate, irreplaceable for catching the failures only a human will notice.
The principle is the same: push determinism as low as you can, because that is where tests are cheap, fast, and trustworthy. Reserve the expensive probabilistic layers for what deterministic tests genuinely cannot reach.
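The record-and-replay idea in the middle layer can be sketched in a few lines. The `ReplayClient` class and the cassette format here are illustrative, not a real library: the point is only that once an interaction is recorded, replaying it in CI is deterministic.

```python
# A minimal sketch of record-and-replay for an agent's tool calls.
# ReplayClient and the cassette format are hypothetical, for illustration.
import json

class ReplayClient:
    """Replays recorded tool responses so the integration is deterministic."""
    def __init__(self, cassette: dict):
        self.cassette = cassette

    def call(self, tool: str, args: dict) -> dict:
        # Canonicalize the request so the same call always hits the same key.
        key = json.dumps({"tool": tool, "args": args}, sort_keys=True)
        if key not in self.cassette:
            raise KeyError(f"unrecorded tool call: {key}")
        return self.cassette[key]

# A "cassette" captured from one real run against the tool server.
cassette = {
    json.dumps({"tool": "lookup_order", "args": {"order_id": "A17"}},
               sort_keys=True): {"status": "shipped", "eta_days": 2},
}

client = ReplayClient(cassette)
assert client.call("lookup_order", {"order_id": "A17"})["status"] == "shipped"
```

A replay test like this fails loudly when the agent starts making a tool call that was never recorded, which is exactly the kind of drift you want surfaced in CI rather than in production.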
How It Plays Out
A payments team has a test suite of 180 browser tests that run for 35 minutes in CI and fail at least once a week for reasons nobody can reproduce. They set aside a sprint to rebuild the suite. The 180 browser tests become 14 end-to-end tests covering the critical flows (new-card checkout, saved-card checkout, refund, dispute), 60 integration tests that hit a real database and a real Stripe test account, and roughly 900 unit tests that cover pricing logic, tax rules, retry handling, and input validation. CI time drops to eight minutes. Flakes drop to roughly one per month, and when they occur, they are almost always genuine bugs in timing-sensitive code. The team ships more confidently because the signal is finally reliable.
An engineer is building a customer-support agent. Early on, she writes a handful of end-to-end scenarios in which the agent handles whole conversations against a mock CRM. They pass, she ships, and within two weeks the agent is failing in production on inputs the scenarios never covered. She rebuilds the testing story as a pyramid. At the base she puts deterministic tests over the tool handlers, the prompt assembly code, and the escalation logic. In the middle she records fifty representative tool-call traces and replays them in CI, plus a panel of rubric-graded eval prompts scored by a cheaper model. At the top she keeps three live conversations against a staging environment, run nightly. Now a regression in prompt formatting fails in the base layer in milliseconds instead of showing up as a mysterious quality drop three days later.
When an agent writes tests for you, ask explicitly for pyramid-shaped output. “Start with unit tests for the pure logic; add two integration tests for the database path; add one end-to-end scenario for the happy path.” Left to themselves, agents often default to end-to-end tests because that’s what’s most visible in the scenario description.
The pyramid is a heuristic, not a quota. If a system has genuinely little logic at the base (say, a thin orchestration layer over a SaaS API), its suite will not look like a textbook pyramid, and that is fine. Chase proportions when they serve you, and stop when they do not.
Consequences
Benefits. You get fast feedback most of the time. The suite runs quickly enough that developers run it before pushing. Failures point at specific code, which makes debugging straightforward. The suite survives refactoring, because most tests check behavior of small units that are stable under internal change. And the economics are legible: you can look at a layer and ask whether it is pulling its weight.
Liabilities. A disciplined pyramid takes design effort. You have to structure code so that units are testable in isolation, which means separating pure logic from I/O. Teams that have not internalized that discipline will find the base layer hard to populate and will default upward into integration and end-to-end tests. The pyramid also creates a temptation to over-test at the base, chasing 100% line coverage by testing trivial getters and setters, which wastes effort without catching real bugs. The goal is not more tests; it is the right tests at the right layer.
Related Patterns
- Depends on: Test – the pyramid is about how to allocate individual tests, not how to write one.
- Uses: Test Oracle – different layers rely on different oracle kinds (assertion at the base, full-system judgment at the top).
- Uses: Fixture, Harness – the base layer needs small, fast fixtures; the top layer needs real environments.
- Complements: Happy Path – the base layer often covers happy paths cheaply; the higher layers earn their cost by going beyond them.
- Complements: Regression – a pyramid-shaped suite catches most regressions quickly and cheaply at the base, reserving the expensive top layer for the few that slip through.
- Complements: Eval – in the agentic pyramid, evals occupy the middle and top layers.
- Enables: Red/Green TDD – the tight TDD loop depends on the base layer being fast enough to run on every change.
- Enables: Shift-Left Feedback – pushing signal to the base layer is shift-left in action.
- Related: Feedback Sensor – tests at each layer are sensors reporting at different granularities and latencies.
- Related: Verification Loop – the pyramid shapes what an agent sees in each verification cycle.
Sources
- Mike Cohn named and drew the pyramid in Succeeding with Agile: Software Development Using Scrum (Addison-Wesley, 2009). His original sketch of many unit tests, fewer service tests, and a handful of UI tests is still the reference picture most teams carry in their heads.
- Ham Vocke’s “The Practical Test Pyramid” (martinfowler.com, 2018) is the definitive modern treatment. Vocke reframed the layers around scope rather than tooling and emphasized that proportions, not specific tool names, are what matter.
- The agentic variant emerged in early 2026 from practitioners who needed a way to reason about testing systems that combine deterministic code with non-deterministic model calls. The key reorganizing insight (layering by uncertainty tolerance rather than test type) was developed in parallel across several engineering blogs and has since become a shared idiom.
- Lisa Crispin and Janet Gregory’s Agile Testing (2009) and More Agile Testing (2014) gave the pyramid much of its early practical vocabulary, especially around the integration layer and the economics of slow tests.
Further Reading
- Ham Vocke, “The Practical Test Pyramid” (2018) – the canonical contemporary treatment, walking through a real service with tests at each layer. Concrete examples in Java, but the reasoning applies everywhere: https://martinfowler.com/articles/practical-test-pyramid.html
- Mike Cohn, Succeeding with Agile (Addison-Wesley, 2009) – chapter 16 is where the pyramid was first drawn. Worth reading in its original context even though the canonical online treatment has since surpassed it.