Correctness, Testing, and Evolution
Software isn’t a static thing. It changes constantly: new features arrive, bugs get fixed, requirements shift, and the world it operates in evolves. The patterns in this section live at the tactical level. They address how you know your software is correct, how you keep it correct as it changes, and how you detect when something goes wrong.
Correctness starts with knowing what “right” looks like. An Invariant is a condition that must always hold. A Test is an executable claim about behavior. A Test Oracle tells you whether the output you got is the output you should have gotten. Around every test sits a Harness, the machinery that runs it, and within that harness, Fixtures provide the controlled data and environment the test needs.
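These pieces compose in even the smallest test. A hedged sketch in Python using the standard library's `unittest` as the harness; the `slugify` function and its expected outputs are invented for illustration, not taken from the text:

```python
import unittest

def slugify(title: str) -> str:
    """Hypothetical function under test: turn a title into a URL slug."""
    return "-".join(title.lower().split())

class SlugifyTest(unittest.TestCase):
    def setUp(self):
        # Fixture: fixed input data every test method can rely on.
        self.title = "Correctness, Testing, and Evolution"

    def test_slug_is_lowercase_and_hyphenated(self):
        # The expected string is the test oracle: it says what "right" looks like.
        self.assertEqual(slugify(self.title), "correctness,-testing,-and-evolution")

    def test_invariant_no_whitespace(self):
        # Invariant: no slug ever contains a space, for any input.
        for title in ["a b", "  padded  ", "Already-Slugged"]:
            self.assertNotIn(" ", slugify(title))

if __name__ == "__main__":
    unittest.main(exit=False)  # unittest is the harness that discovers and runs the tests
```

The harness, fixture, oracle, and invariant are all visible here in under twenty lines; the same roles exist, at larger scale, in any test suite.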
Testing isn’t just verification; it can drive design itself. Test-Driven Development uses tests as a design tool, and Red/Green TDD gives that idea a tight, repeatable loop. Once tests pass, Refactoring lets you improve internal structure without breaking what works. When something does break unexpectedly, that’s a Regression, and catching regressions early is one of the highest-value activities in software development.
Not all problems announce themselves. Observability is the degree to which you can see what’s happening inside a running system, and Logging is the primary mechanism for achieving it. When a bug resists reading and reasoning, Printf Debugging lets you make runtime values visible with nothing more than a print statement and a hypothesis. Every system has Failure Modes, specific ways it can break, and the most dangerous are Silent Failures, where something goes wrong and nobody notices. Finally, every system operates within a Performance Envelope, the range of conditions under which it still behaves acceptably.
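Printf debugging really is that simple. A hedged Python sketch; the `median` function and its suspected bug are invented for illustration. You form a hypothesis, print the values that would confirm or refute it, and delete the print once you have your answer:

```python
def median(values):
    """Hypothetical function with a suspected bug for even-length input."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    # Hypothesis: for even-length lists we pick the wrong element.
    # Temporary print to make the runtime values visible; remove once answered.
    print(f"values={ordered}, mid={mid}, picked={ordered[mid]}")
    return ordered[mid]

median([1, 2, 3, 4])  # prints values=[1, 2, 3, 4], mid=2, picked=3 — hypothesis confirmed
```

The true median of `[1, 2, 3, 4]` is 2.5, so the printed internals confirm the hypothesis in one run, with no debugger required.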
In an agentic coding world, where AI agents generate and modify code at high speed, these patterns become guardrails. An agent can write a function in seconds, but only tests can tell you whether that function does what it should. The faster you change code, the more you need the safety net these patterns provide.
Defining Correctness
What “right” means: the foundations for knowing whether your software does what it should.
- Invariant — A condition that must remain true for the system to be valid.
- Test — An executable claim about behavior.
- Test Oracle — The source of truth that tells you whether an output is correct.
- Harness — The surrounding machinery used to exercise software in a controlled way.
- Fixture — The fixed setup, data, or environment used by a test or harness.
- Happy Path — The default scenario where everything works as expected; the baseline that gives every other kind of testing its meaning.
- Code Review — Having someone other than the code’s author examine changes before they merge, catching what tests and the author’s own eyes miss.
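The invariant at the top of this list can be made executable rather than left as documentation. A hedged Python sketch; the bounded counter is a hypothetical example of a class that re-checks its own invariant after every state change:

```python
class BoundedCounter:
    """Hypothetical example: the count must always stay within [0, limit]."""

    def __init__(self, limit: int):
        self.limit = limit
        self.count = 0
        self._check_invariant()

    def _check_invariant(self):
        # The invariant: a condition that must hold at every observable moment.
        assert 0 <= self.count <= self.limit, f"invariant violated: {self.count}"

    def increment(self):
        if self.count < self.limit:
            self.count += 1
        self._check_invariant()  # re-verify after every mutation

c = BoundedCounter(limit=2)
c.increment()
c.increment()
c.increment()  # saturates at the limit; the invariant still holds
```

An invariant checked in code fails loudly at the moment it is violated, which is exactly the opposite of a silent failure.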
Test-Driven Workflows
Using tests to drive design and catch breakage before it ships.
- Test-Driven Development — Tests written to define expected behavior before or alongside implementation.
- Red/Green TDD — The core TDD loop: write a failing test (red), then write just enough code to make it pass (green).
- Refactor — Changing internal structure without changing external behavior.
- Regression — A previously working behavior that stops working after a change.
- Test Pyramid — Shape a test suite with many fast unit tests at the base, fewer integration tests in the middle, and a small number of end-to-end tests at the top.
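The red/green loop above can be shown in miniature. A hedged Python sketch; `leading_zeros` and its specification are invented for the example. First the failing test is written (red), then the simplest implementation that makes it pass (green):

```python
# Red: the test is written first, and fails because leading_zeros does not exist yet.
def test_leading_zeros():
    assert leading_zeros("007") == 2
    assert leading_zeros("700") == 0
    assert leading_zeros("000") == 3

# Green: the simplest implementation that makes the test pass.
def leading_zeros(s: str) -> int:
    return len(s) - len(s.lstrip("0"))

test_leading_zeros()  # passes — now it is safe to refactor
```

With the test green, any refactoring that keeps it green preserves external behavior, and any change that turns it red again is, by definition, a regression.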
Observability and Debugging
Seeing what your system is doing, measuring how well it works, and finding out why it broke.
- Observability — The degree to which you can infer internal state from outputs.
- Failure Mode — A specific way a system can break or degrade.
- Silent Failure — A failure that produces no clear signal.
- Performance Envelope — The range of operating conditions within which a system remains acceptable.
- Logging — Record what your software does as it runs, so you can understand its behavior after the fact.
- Printf Debugging — Insert temporary output statements to test a hypothesis about code behavior, then remove them once you’ve found the answer.
- Metric — A quantified signal, tracked over time, that tells you whether your software, team, or process is improving or degrading.
- Feedback Loop — Any arrangement where a system’s output circles back to influence its next action, enabling self-correction or self-reinforcement.
- Service Level Objective — A committed reliability target with a matching error budget that governs how much risk the team can spend on change.
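Several of these patterns meet in a few lines of code. A hedged sketch using Python's standard `logging` module; the billing scenario and names are invented. The point: an exception swallowed without a log line is a silent failure, while one that is logged remains observable:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("billing")  # hypothetical subsystem name

def charge(account: str, cents: int) -> bool:
    """Hypothetical operation with a known failure mode: bad amounts."""
    try:
        if cents <= 0:
            raise ValueError(f"non-positive amount: {cents}")
        log.info("charged %s %d cents", account, cents)  # observable success
        return True
    except ValueError:
        # A bare `return False` here would be a silent failure.
        # Logging the exception keeps this failure mode observable.
        log.exception("charge failed for %s", account)
        return False

charge("acct-42", 500)   # success, and the log says so
charge("acct-42", -1)    # failure, but a visible one
```

Counting how often the second branch fires over time turns this log line into a metric, and a threshold on that metric is the seed of a service level objective.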
Managing Change
Evolving a system safely over time without breaking what works.
- Technical Debt — Shortcuts in code act like financial debt, letting you ship faster now and charging interest on every future change.
- Strangler Fig — Replace a legacy system incrementally by building new functionality alongside it, routing traffic piece by piece, until the old system can be switched off.
- Parallel Change — Change an interface by adding the new form first, migrating callers at their own pace, and removing the old form last, so consumers never see a breaking change.
- Deprecation — Announce the removal of a feature on a specific future date, keep it working in the meantime, watch who still uses it, and remove it only once usage has actually gone to zero.
- Evolutionary Modernization — Treat modernization as a continuous, guided process of small replacements with working software at every step, rather than a bounded project that ends in a single cutover.
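Parallel change, for instance, is visible even at the scale of a single function signature. A hedged Python sketch; the `resize` function and its parameters are invented. The new keyword form is added first, the old positional form keeps working while warning its remaining callers, and only once usage reaches zero is the old form deleted:

```python
import warnings

def resize(image, size=None, *, width=None, height=None):
    """Hypothetical API mid-migration: old `size` tuple, new width/height kwargs."""
    if size is not None:
        # Expand phase: the old form still works, but tells callers to migrate.
        warnings.warn("pass width= and height= instead of size=",
                      DeprecationWarning, stacklevel=2)
        width, height = size
    # Contract phase (later): delete the `size` branch once no caller uses it.
    return (width, height)  # stand-in for the real resizing work

assert resize(None, size=(640, 480)) == (640, 480)        # old callers: unchanged
assert resize(None, width=640, height=480) == (640, 480)  # migrated callers
```

The same expand-migrate-contract rhythm scales up from one parameter to whole services, which is where it shades into the strangler fig and deprecation patterns above.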