Smoke Test
A small, deliberately broad-but-shallow set of checks that verify a build is not catastrophically broken before any time is invested in deeper testing.
Also known as: Build Verification Test (BVT), Confidence Test, Sniff Test
Understand This First
- Test — the parent concept; smoke is one kind of test in the family.
- Test Pyramid — positions smoke within the broader test taxonomy.
- Happy Path — the single golden path; smoke is a multi-path version of the same idea.
- Fail Fast and Loud — the discipline smoke embodies at the build-verification layer.
Context
Every change to a system produces a new build, and every new build could be subtly or catastrophically broken. The team, or the agent, that just made the change has finite attention to spend on verification. Spending an hour running deep regression tests on a build that fails to start is wasted time. The decision is how to spend the first thirty seconds of verification budget so the next thirty minutes are well spent.
This is a tactical pattern. It sits inside the test suite and the deployment pipeline, between Tests (the individual unit) and the larger machinery of Continuous Delivery. The agentic angle is sharp: when a coding agent can produce hundreds of lines of code per minute, the only verification step that scales is one that runs in seconds.
The name comes from outside software. Hardware engineers powering on a new circuit board would watch for literal smoke; if any appeared, the device was broken enough that further testing was a waste of time. Plumbers ran smoke through new pipes to find leaks fast. The software analog kept the name because the discipline is the same: cheapest possible signal first, deeper tests later.
Problem
Without a cheap, broad-but-shallow verification step between "built" and "deeply tested," two failure modes recur.
In the first, teams or agents run deep regression suites against builds that are so broken the deep suite fails on setup. The deep failure obscures the actual catastrophic breakage, and the team spends an hour debugging the wrong thing.
In the second, teams skip verification entirely on the assumption that “if it builds, it works.” Catastrophic breakage then surfaces in production when a customer reports a blank screen.
Both failures share a root cause. There is no verification step optimized for the question “is anything obviously on fire?” — only steps optimized for “does every detail behave correctly?”
Forces
- Verification budgets are finite; the deeper a test, the more of the budget it consumes.
- Catastrophic bugs are rare in absolute terms but expensive in consequence, and they hide behind every commit.
- Deep test suites take minutes to hours; nobody runs them on every change.
- Flaky tests teach the team to ignore failures, which is worse than no test at all.
- Agentic code production runs orders of magnitude faster than human review; verification has to keep up.
Solution
Run a small, fast, broad-but-shallow check on every build to prove the system is not catastrophically broken before any deeper testing starts. Five disciplines hold a smoke suite together.
Optimize for breadth, not depth. A smoke suite touches every major surface (auth, primary user flow, primary data write, primary external call, primary background job) at the most superficial possible level. It does not exercise edge cases, error paths, or unusual states. If a surface is so important that breaking it is a showstopper, it gets one smoke check. If it isn’t, it doesn’t.
Optimize for runtime. A smoke suite that takes thirty minutes isn’t a smoke suite; it’s a regression suite with a different name. Target under one minute for build-time smoke; under thirty seconds for in-pipeline smoke. The runtime constraint is what makes smoke valuable. It is the only verification step you can afford to run on every commit.
Make pass/fail unambiguous. Smoke produces one bit of information: did the build clear the bar? Flaky smoke tests, the kind that sometimes pass and sometimes fail, are worse than no smoke tests, because they train the team to ignore the signal. Treat a smoke flake as a P1: fix the test or remove it the same day.
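The first three disciplines can be sketched as a tiny runner: a handful of named, shallow checks, a hard runtime budget, and a single pass/fail bit at the end. This is an illustrative sketch, not a prescribed implementation; the check bodies here are placeholder stubs standing in for real probes.

```python
import time

def run_smoke(checks, budget_seconds=60.0):
    """Run every check; return one pass/fail bit plus the list of failures."""
    start = time.monotonic()
    failures = []
    for name, check in checks:
        try:
            if not check():
                failures.append(name)
        except Exception as exc:           # a crashing check is a failing check
            failures.append(f"{name} ({exc})")
    elapsed = time.monotonic() - start
    if elapsed > budget_seconds:           # a slow smoke suite is a broken smoke suite
        failures.append(f"runtime {elapsed:.1f}s exceeded budget")
    return (not failures), failures

# One shallow check per catastrophic surface; the lambdas are stubs.
checks = [
    ("app starts",        lambda: True),
    ("auth issues token", lambda: True),
    ("primary read",      lambda: True),
    ("primary write",     lambda: True),
]
ok, failures = run_smoke(checks)
print("PASS" if ok else f"FAIL: {failures}")  # → PASS
```

Note that the runtime budget is itself a check: a suite that drifts past its budget fails, which is the mechanical enforcement of "optimize for runtime."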
Stage it correctly. Smoke runs after unit tests (which gate at the function level) and before deep integration or end-to-end suites (which gate at the feature level). A typical pipeline looks like: build → unit tests → smoke → deep suites → deploy → post-deploy smoke → trust. The two distinct smoke stages matter: pre-deploy smoke (does the build work in test?) and post-deploy smoke (does the deployed system serve real traffic?).
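The stage ordering above can be sketched as a fail-fast sequence. The stage functions are placeholders; a real pipeline would shell out to the build tool and test runners at each step.

```python
# Sketch of the pipeline ordering described above. Stage bodies are
# placeholders; each would invoke the real build or test tooling.
def build():             return True
def unit_tests():        return True
def smoke():             return True   # seconds: gates entry to the deep suites
def deep_suites():       return True   # minutes to hours: only smoke-clean builds
def deploy():            return True
def post_deploy_smoke(): return True   # against the live environment, gates traffic

PIPELINE = [build, unit_tests, smoke, deep_suites, deploy, post_deploy_smoke]

def run_pipeline(stages):
    """Run stages in order, stopping at the first failure (fail fast)."""
    for stage in stages:
        if not stage():
            return f"failed at {stage.__name__}"
    return "trusted"

print(run_pipeline(PIPELINE))  # → trusted
```

The point of the ordering is cost: a failure at the smoke stage costs seconds and prevents the deep suites from ever starting on a build that was already broken.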
Distinguish smoke from its cousins. Industry confusion between smoke, sanity, and regression is endemic. The distinction is mechanical:
| Test type | Breadth | Depth | Question it answers |
|---|---|---|---|
| Smoke | Broad | Shallow | Is the build catastrophically broken? |
| Sanity | Narrow | Deep | Did my specific change actually work? |
| Regression | Broad | Deep | Has any previously-known failure returned? |
Smoke and sanity are inverses on the breadth-depth axis. Smoke and regression both run broad, but regression goes deep on every known failure mode and takes hours; smoke stays shallow and runs in seconds. All three are valuable; they belong at different stages of the pipeline.
For agents specifically, smoke is the cheapest verification primitive available. When an agent makes a code change, the fastest signal of “did I break something fundamental?” is the smoke suite. Agents should run smoke after every meaningful change, before their own further self-review. Skipping smoke is the agentic equivalent of pushing to main without running tests. It works until it doesn’t.
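The agent's change loop can be sketched as follows. All three helpers (`apply_change`, `revert`, `run_smoke_suite`) are hypothetical placeholders for whatever editing tools and test runner the agent actually has; the point is the shape of the loop, not the names.

```python
# Sketch of the agent loop described above. All helpers are hypothetical
# stand-ins for the agent's real editing tools and test runner.
def agent_change_loop(apply_change, revert, run_smoke_suite, max_attempts=2):
    """Apply a change, verify with smoke; self-correct on failure, else roll back."""
    for attempt in range(max_attempts):
        apply_change(attempt)
        if run_smoke_suite():              # cheapest verification primitive
            return "surface diff for human review"
    revert()                               # could not produce a smoke-clean change
    return "rolled back; reporting failure"

# Usage sketch: the first attempt breaks smoke, the second fixes it.
state = {"attempts": []}
result = agent_change_loop(
    apply_change=lambda n: state["attempts"].append(n),
    revert=lambda: state["attempts"].clear(),
    run_smoke_suite=lambda: len(state["attempts"]) == 2,
)
print(result)  # → surface diff for human review
```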
How It Plays Out
A 12-developer team has a CI pipeline that runs unit tests in 90 seconds, smoke in 20 seconds, and the full end-to-end suite in 22 minutes. Every commit runs unit and smoke; merge is gated on both passing. The full suite runs nightly. When a developer accidentally commits a typo that breaks app startup, smoke catches it in 20 seconds, the developer pushes the fix within five minutes, and the team carries on. Without the smoke stage, the breakage would have ridden into the nightly run and blocked the team for half a day the next morning.
A small startup ships a new version of their API service with a progressive rollout. A post-deploy smoke suite runs against the new instance the moment it accepts traffic. Three checks: GET /health returns 200, POST /login with a known-good user returns a token, GET /profile with that token returns the expected user record. If any of the three fails, the deploy is rolled back automatically. This is smoke as a deploy gate, not a build gate, and it has caught two production-bound config drift bugs in the last quarter.
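The three checks in that deploy gate can be sketched as below. The HTTP layer is stubbed out so the sketch is self-contained; a real suite would issue requests against the freshly deployed instance, and the endpoint paths and user names are the illustrative ones from the example.

```python
# The three post-deploy checks from the example above, with HTTP stubbed out.
def gate_deploy(http_get, http_post):
    """Return True to shift traffic, False to trigger automatic rollback."""
    status, _ = http_get("/health")
    if status != 200:
        return False
    status, body = http_post("/login", {"user": "smoke-user"})
    if status != 200 or "token" not in body:
        return False
    status, body = http_get("/profile", token=body["token"])
    return status == 200 and body.get("user") == "smoke-user"

# Stub transport standing in for the deployed instance.
def fake_get(path, token=None):
    if path == "/health":
        return 200, {}
    if path == "/profile" and token == "t0k3n":
        return 200, {"user": "smoke-user"}
    return 403, {}

def fake_post(path, payload):
    if path == "/login" and payload.get("user") == "smoke-user":
        return 200, {"token": "t0k3n"}
    return 401, {}

print("shift traffic" if gate_deploy(fake_get, fake_post) else "roll back")  # → shift traffic
```

Notice the checks chain: the login check produces the token the profile check consumes. That keeps the suite at three calls while still proving the auth surface end to end, shallowly.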
A coding agent is asked to refactor a payment service module. The agent makes the change and, before reporting completion, runs the smoke suite: app starts, health check returns OK, one canonical payment-creation call returns the expected response. Smoke passes; the agent surfaces the diff for human review. Had smoke failed, the agent would have either self-corrected and rerun, or rolled back and reported the failure. Without that primitive, the agent has no fast way to know whether its change broke something fundamental, which forces a choice between over-confidence (silent regression) and over-caution (running a 22-minute deep suite for a one-line change).
When designing a smoke suite, write down the answer to one question for each candidate check: “If this surface broke and we shipped, would we roll back immediately?” If yes, it’s a smoke check. If “we’d file a bug and fix it in the next release,” it belongs in the deeper suite, not in smoke.
Consequences
Benefits. You catch catastrophic breakage in seconds, on every commit, for almost no compute cost. Deep suites stop wasting time on builds that were already broken at startup. Deploys gain a safe automated gate that doesn’t depend on human attention. Agents gain a fast, cheap verification primitive they can call inside any change loop. The team’s signal-to-noise ratio on CI failures improves, because smoke is small enough to keep flake-free.
Liabilities. Smoke suites tend to drift. A check that was once “is the system on fire?” gets joined by a check that’s “does this specific edge case work?” and another that’s “did we regress that one bug from last quarter?” Within six months the smoke suite is twelve minutes long and nobody runs it on every commit anymore. Resisting that drift is a continuous discipline.
A smoke suite that doesn’t fail when something is broken is worse than no smoke suite, because it produces false confidence. Coverage gaps are easy to introduce: a new endpoint ships without a smoke check, breaks at deploy, and the team is surprised because “smoke passed.”
Any non-trivial pipeline needs two smoke suites (pre-deploy and post-deploy), which is an extra surface to maintain. Teams that treat them as one suite end up with checks that work in CI but fail in production, or the reverse.
When It Fails
Smoke that has rotted into regression. The smoke suite started at 20 seconds and grew to 12 minutes as engineers added “just one more check.” It no longer runs on every commit, and developers have started skipping it. Remedy: prune aggressively. Anything not in the top-five-most-catastrophic category gets moved to the deeper suite.
Flaky smoke. Smoke fails 5% of the time for environmental reasons. Developers learn to rerun on red and the signal value goes to zero. Remedy: any flake gets fixed or removed within 24 hours. Flake tolerance is what kills smoke as a discipline.
Smoke confused with sanity. The team thinks “did my bug fix work?” is smoke, when it’s actually sanity. They write narrow-deep tests and call them smoke; the suite no longer protects against the catastrophic-breakage failure mode it was supposed to. Remedy: an explicit definition in the team’s testing handbook (this article).
No post-deploy smoke. Pre-deploy smoke passes, deploy succeeds, but the deployed environment differs from test (config drift, missing secret, wrong DB connection string), and the system is broken in production until the first customer reports it. Remedy: a separate, smaller smoke suite that runs against the live environment immediately after deploy and gates traffic shift.
Agent skips smoke. An agent makes changes and reports completion without running the smoke suite, on the assumption that “the change is small enough not to need verification.” This is the agentic version of “it compiles, ship it.” Remedy: encode the smoke run as a non-skippable step in the agent’s workflow, at whatever layer makes that possible (project instructions, hook, verification-loop primitive).
Designing a Smoke Suite
Five questions to answer before you write a single check:
- What surfaces are catastrophic if broken? Auth, primary read, primary write, primary external call, primary background job. Five candidates, often fewer than five smoke checks.
- What is the simplest possible check for each? Not the thorough check, the simplest one. A 200 response is enough; you don’t need to assert the whole payload.
- Can the whole suite run in under one minute? If not, prune. The runtime constraint is the point.
- Is every check pass/fail with no flakes? If a check sometimes fails for environmental reasons, fix it or remove it. Flaky smoke is worse than no smoke.
- Where in the pipeline does it run? Pre-deploy smoke and post-deploy smoke are different suites against different environments. Don’t conflate them.
If your answers add up to more than ten checks, or more than a minute of runtime, you’re no longer writing smoke. You’re writing regression with a faster name on it.
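The triage can be made mechanical. Each candidate surface is tagged with the answer to the one question that matters ("would we roll back immediately?"), and the check-count budget is enforced as an assertion. The surface names here are illustrative, not prescribed.

```python
# Sketch of the triage described above; surface names are illustrative.
# The boolean answers "would we roll back immediately if this broke?"
candidates = [
    ("auth login",            True),
    ("primary read",          True),
    ("primary write",         True),
    ("payment webhook",       True),
    ("nightly export job",    False),   # fix-in-next-release: deep suite
    ("profile avatar upload", False),   # fix-in-next-release: deep suite
]

smoke = [name for name, rollback_worthy in candidates if rollback_worthy]
deep  = [name for name, rollback_worthy in candidates if not rollback_worthy]

assert len(smoke) <= 10, "more than ten checks: this is regression, not smoke"
print("smoke:", smoke)
print("deep: ", deep)
```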
Related Patterns
| Relationship | Pattern | Note |
|---|---|---|
| Complements | Observability | Post-deploy smoke is the discrete-signal complement to continuous observability. |
| Complements | Service Level Objective | Post-deploy smoke is the discrete-signal complement to continuous observability. |
| Contrasts with | Exploratory Testing | Smoke is automated and prescriptive; exploratory is human and open-ended. |
| Contrasts with | Regression | Both are broad; regression goes deep, smoke stays shallow. |
| Depends on | Test | Smoke is one kind of test in the family. |
| Detects | Failure Mode | Smoke catches the catastrophic failure modes a silent failure would otherwise hide. |
| Detects | Silent Failure | Smoke catches the catastrophic failure modes a silent failure would otherwise hide. |
| Documented in | Runbook | When post-deploy smoke fails, the runbook says what to do next. |
| Embodies | Fail Fast and Loud | Smoke surfaces catastrophic breakage at the earliest possible point. |
| Gates | Deployment | Post-deploy smoke is the discrete signal that triggers rollback. |
| Gates | Rollback | Post-deploy smoke is the discrete signal that triggers rollback. |
| Refines | Test Pyramid | Smoke is the broad-shallow extreme of the test taxonomy. |
| Related | Feedback Loop | Smoke is one of the fastest possible feedback loops in the build pipeline. |
| Related | Happy Path | Smoke is happy-path verification across multiple critical surfaces at once. |
| Stages of | Continuous Delivery | Smoke is a named pipeline stage between build and deep test, and between deploy and traffic shift. |
| Stages of | Continuous Deployment | Smoke is a named pipeline stage between build and deep test, and between deploy and traffic shift. |
| Stages of | Continuous Integration | Smoke is a named pipeline stage between build and deep test, and between deploy and traffic shift. |
| Used by | Generator-Evaluator | Smoke is the cheapest verification primitive an agent can call inside its loop. |
| Used by | Reflexion | Smoke is the cheapest verification primitive an agent can call inside its loop. |
| Used by | Subagent | Smoke is the cheapest verification primitive an agent can call inside its loop. |
| Used by | Verification Loop | Smoke is the cheapest verification primitive an agent can call inside its loop. |
| Uses | Fixture | The smoke suite needs minimal but reliable setup. |
| Uses | Harness | The smoke suite needs minimal but reliable setup. |
| Uses | Test Oracle | Every smoke check needs a clear pass/fail oracle. |
Sources
- The term entered software from hardware engineering, where engineers literally watched a powered-on circuit board for smoke before any further testing, and from plumbing, where smoke was forced through new pipes to find leaks. The metaphor carried into early software practice in the 1970s and 1980s as testers borrowed the discipline of cheapest-signal-first.
- Glenford Myers, The Art of Software Testing (Wiley, 1979; 3rd ed. 2011), gave software testing much of its early formal vocabulary, including the breadth-versus-depth framing that smoke embodies.
- Microsoft’s internal testing practice popularized the formal name “Build Verification Test” (BVT) in the 1990s, where the BVT suite was the gate every nightly build had to clear before broader QA would even look at it. The BVT lineage is where many enterprise teams still encounter the discipline.
- The IEEE 829 testing standard and the ISTQB glossary both document smoke testing formally, treating it as a recognized phase of build verification rather than an informal practice.
- Martin Fowler’s “Smoke Test Your Continuous Delivery Pipeline” reframed smoke for the CI/CD era, arguing that the pipeline itself needs a smoke check (not just the application) and that post-deploy smoke is what makes safe automated rollback possible.
Further Reading
- Wikipedia, “Smoke testing (software)” — concise survey of the term’s origin in hardware and plumbing, the BVT lineage, and modern usage. A good first stop.
- Lisa Crispin and Janet Gregory, Agile Testing (Addison-Wesley, 2009) — situates smoke testing within an agile pipeline and gives practical advice on keeping the suite fast and trustworthy.