
Metric

Concept

A foundational idea to recognize and understand.

A metric is a quantified signal that tells you whether your software, your team, or your process is improving, degrading, or standing still.

Understand This First

  • Observability – you need to see inside your system before you can measure it.
  • Test – tests verify correctness; metrics track behavior over time.

What It Is

A metric is a number that measures something you care about, tracked over time so you can spot trends. Response latency, error rate, deployment frequency, test coverage, defect count, time to resolve incidents. Each one compresses a complex reality into a signal you can watch, compare, and act on.

A one-time measurement tells you where you are today. Tracking that same measurement weekly tells you whether last month’s refactoring helped or whether the new feature is dragging performance toward the edge of your Performance Envelope. Metrics earn their value through repetition: the same measurement, taken consistently, revealing change.

Not every number qualifies. A metric requires a definition (what exactly are you counting?), a collection method (how do you gather the data?), and a purpose (what decision does this number inform?). A number without a purpose is trivia. A number tied to a decision is a metric.
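The three requirements can be made concrete in code. This is an illustrative sketch, not any particular library's API: the `Metric` class and its fields are hypothetical, and the lambda stands in for a real measurement source.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Metric:
    name: str                     # definition: what exactly is being counted
    collect: Callable[[], float]  # collection method: how the data is gathered
    decision: str                 # purpose: what decision this number informs

# A number tied to a decision is a metric; without the third field,
# it would just be trivia.
checkout_latency = Metric(
    name="p95 latency (ms) for the /checkout endpoint",
    collect=lambda: 182.0,  # stand-in for a real query against monitoring data
    decision="investigate if above 300 ms for three consecutive samples",
)

print(checkout_latency.collect())  # → 182.0
```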

Why It Matters

Software teams drown in opinions. “The system feels slow.” “Deployments seem risky.” “Code quality is declining.” Metrics replace feelings with evidence. They don’t settle every argument, but they shift the conversation from anecdotes to data.

This matters even more when AI agents generate and modify code at high speed. The 2025 DORA report found that individual developers using AI tools completed 21% more tasks and merged 98% more pull requests. Organizational delivery metrics stayed flat. Code review time increased 91%. Pull request size grew 154%. Bug rates climbed 9%. Traditional metrics like deployment frequency can actually mislead in this context: a team might celebrate shipping twice as fast while the codebase grows harder to maintain. The metrics didn’t break, but they stopped measuring what matters most when the bottleneck shifts from writing code to reviewing it.

Metrics also make agentic workflows governable. When an agent handles routine deployments, generates test suites, or refactors modules, you need a way to know whether its work is improving the codebase or degrading it. Evals measure agent performance on specific tasks. Metrics measure the cumulative effect of agent work on the system over weeks and months.

How to Recognize It

You’re working with metrics when you can answer three questions about a number: What does it measure? How is it collected? What do we do when it changes?

Good metrics share four properties. They’re specific: “p95 API latency for the /checkout endpoint” rather than “performance.” They’re comparable: today’s value means something relative to last week’s. They’re actionable: if the number moves, someone knows what to investigate. And they’re resistant to gaming: measuring lines of code written encourages bloat, not quality.
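The "specific and comparable" properties above can be sketched with a small percentile computation. The sample values are invented for illustration; the nearest-rank method shown is one common way to compute p95, not the only one.

```python
def p95(samples):
    """95th-percentile latency via the nearest-rank method."""
    ordered = sorted(samples)
    rank = max(1, -(-95 * len(ordered) // 100))  # ceil(0.95 * n)
    return ordered[rank - 1]

# Hypothetical /checkout latency samples in milliseconds.
this_week = [120, 95, 310, 88, 140, 101, 97, 230, 115, 99]
last_week = [110, 90, 150, 85, 120, 100, 95, 130, 105, 98]

# Specific ("p95 latency", not "performance") and comparable:
# the same computation, week over week, reveals the trend.
print(p95(last_week), "->", p95(this_week))  # → 150 -> 310
```

If the number moves like this, someone knows what to investigate: the tail of checkout requests, not "performance" in the abstract.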

Watch for vanity metrics. Total page views, raw commit counts, and “number of AI-generated pull requests merged” can all move in the right direction while the product gets worse. The antidote is to tie every metric to a question that matters: Are users succeeding at their tasks? Is the system reliable? Can we ship changes safely?

How It Plays Out

A startup tracks three metrics: deployment frequency, change failure rate, and mean time to recovery. For six months, all three improve steadily. Then the team adopts a coding agent and starts shipping twice as fast. Deployment frequency doubles, but change failure rate creeps from 5% to 12%, and recovery time lengthens because the failures are harder to diagnose. The metric dashboard makes the tradeoff visible before customers start complaining. The team slows down, adds integration tests to the agent’s Verification Loop, and watches the failure rate stabilize before resuming the faster cadence.
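The tradeoff the dashboard surfaced can be reduced to a few lines. The deployment counts below are hypothetical, chosen to match the 5% and roughly 12% figures in the scenario; a real pipeline would derive them from deploy logs.

```python
# Hypothetical deployment records for the two periods in the scenario.
periods = {
    "before_agent": {"deploys": 40, "failures": 2},
    "after_agent":  {"deploys": 80, "failures": 10},
}

def change_failure_rate(period):
    """Fraction of deployments that caused a failure in production."""
    return period["failures"] / period["deploys"]

# Deployment frequency doubled (40 -> 80), but so did the failure rate.
for name, p in periods.items():
    print(f"{name}: {p['deploys']} deploys, CFR {change_failure_rate(p):.1%}")
```

Neither number alone tells the story; watched together, they make the speed-versus-stability tradeoff visible before customers feel it.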

A platform engineering team builds a dashboard tracking token consumption, tool call counts, and task completion rates across their fleet of coding agents. One agent consistently uses 3x more tokens than others for similar tasks. Investigation reveals that its Instruction File is poorly structured, causing the agent to re-read large files repeatedly. Fixing the instruction file cuts token costs by 60% and improves completion time. Without the metric, the waste would have been invisible. The agent still produced correct output, just expensively.
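A fleet-level check like the one described might look like this sketch. The agent names, token counts, and the 2x-of-median threshold are all made up for illustration.

```python
from statistics import median

# Hypothetical average token spend per task, per agent, for similar tasks.
tokens_per_task = {
    "agent-a": 12_000,
    "agent-b": 11_500,
    "agent-c": 36_000,   # roughly 3x its peers
    "agent-d": 13_000,
}

def flag_outliers(usage, factor=2.0):
    """Return agents whose token spend exceeds `factor` times the fleet median."""
    baseline = median(usage.values())
    return [name for name, tokens in usage.items() if tokens > factor * baseline]

print(flag_outliers(tokens_per_task))  # → ['agent-c']
```

The flagged agent is where to look first: its output may still be correct, just expensive, which is exactly the kind of waste that stays invisible without the metric.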

Tip

When measuring agentic workflows, track both the agent’s direct output (task completion, test pass rate) and its second-order effects (code review burden, defect rate in agent-generated code, token cost per task). The direct output often looks good while the second-order effects tell the real story.

Consequences

Metrics create a shared language for discussing system health. Instead of debating whether the codebase is “getting worse,” you can point to defect density trends, test coverage changes, or deployment lead times. This shared language is especially valuable when agents are involved, because agent output is too voluminous for any human to review line by line.

The costs are real. Metric infrastructure takes time to build and maintain. Poorly chosen metrics distort behavior: if you measure velocity, people optimize for velocity at the expense of quality. This is Goodhart’s Law in action (“when a measure becomes a target, it ceases to be a good measure”), and it applies to agent-generated code just as much as human-written code. Metrics can also create false confidence. A green dashboard doesn’t mean everything is fine, only that the things you’re measuring are within bounds. The failures you haven’t thought to measure are the ones that surprise you.

The hardest part is choosing what to measure. Start with metrics tied to user outcomes (are they succeeding?) and system reliability (is it working?), then add process metrics (are we shipping safely?) as the team matures. Resist the urge to measure everything. A small set of well-understood metrics beats a sprawling dashboard that nobody reads.

  • Depends on: Observability – metrics come from observable systems; you can’t measure what you can’t see.
  • Quantifies: Performance Envelope – latency, throughput, and resource metrics define and monitor the envelope.
  • Complements: Eval – evals measure agent performance on tasks; metrics track cumulative system effects over time.
  • Informs: Regression – metrics detect performance and quality regressions before users do.
  • Related: Feedback Sensor – metrics are a category of feedback signal that sensors collect.
  • Related: Steering Loop – metrics provide the data that steering loops act on.
  • Related: Silent Failure – failures in unmonitored areas produce no metric signal.
  • Contrasts with: Test – tests give a binary pass/fail on specific behaviors; metrics track continuous quantities over time.

Sources

The DORA team (originally at Google, now the DevOps Research and Assessment program) established deployment frequency, lead time for changes, change failure rate, and mean time to recovery as the canonical software delivery metrics. Their 2025 report introduced rework rate as a fifth metric, replaced the four-tier performance model with seven team archetypes, and documented the AI amplification effect (individual gains, organizational flatness).

Martin Fowler’s writings on metrics emphasize the distinction between vanity metrics and actionable ones. Charles Goodhart formulated his law in 1975 (later popularized by Marilyn Strathern’s pithier version: “when a measure becomes a target, it ceases to be a good measure”), which remains the central warning in metric design.

Google’s HEART framework (Happiness, Engagement, Adoption, Retention, Task Success) provides a structured approach to user-centered metrics, with the Goals-Signals-Metrics model for connecting business goals to measurable quantities.

Further Reading

  • DORA | Get Better at Getting Better – the DORA program’s site, including the annual State of DevOps reports and the quick check tool for benchmarking your team.
  • Accelerate by Nicole Forsgren, Jez Humble, and Gene Kim – the book-length treatment of DORA’s research linking delivery metrics to organizational performance.