Service Level Objective

Pattern

A recurring solution to a recurring problem.

Pick a reliability target you will defend, measure how often you meet it, and use the slack between that target and perfection to decide when to ship and when to slow down.

Also known as: SLO, SLI/SLO/Error Budget

Understand This First

  • Metric – an SLO is a metric with a target attached and a consequence for missing it.
  • Observability – you cannot measure a service level you cannot see.

Context

Every service your users touch has some expected level of quality. A checkout endpoint should usually return in under a second. A login service should almost always say yes to correct passwords. A file upload should almost never lose bytes. “Usually,” “almost always,” and “almost never” are the interesting words in those sentences. Nobody seriously expects a production system to be perfect forever, but nobody has a shared definition of “good enough” either. This is a tactical pattern, rooted in Google’s site reliability engineering practice and now central to how teams reason about reliability, release risk, and the limits of agent-driven deployment.

Problem

Teams argue about reliability without a shared yardstick. One engineer says the service is “stable.” Another says it “feels slow.” A product manager promises customers “high availability.” An on-call rotation burns out chasing every alert, because no one has agreed which failures are worth waking up for and which are background noise. Meanwhile, the pressure to ship new features never lets up. Without a number everyone has signed off on, every reliability decision devolves into a judgment call — often made by whoever is most exhausted at 2 a.m.

Forces

  • Perfect reliability is infinitely expensive; users rarely need it and cannot tell the difference above a certain point.
  • Shipping fast and shipping safely pull against each other, and neither side has a principled way to concede ground.
  • Reliability is meaningful only to the degree you measure it; without measurement, every outage is a surprise.
  • Teams need a trigger for slowing down that does not depend on anyone’s mood or seniority.
  • The target has to be low enough that you can actually meet it, and high enough that users stay happy.

Solution

Define a Service Level Indicator (SLI), set a Service Level Objective (SLO) on it, and manage the gap between the SLO and 100% as an error budget.

The three pieces work as a system.

An SLI is a ratio of good events to total events. “Successful HTTP requests divided by total HTTP requests.” “Requests completed under 500ms divided by total requests.” The ratio matters more than the raw count, because it scales with traffic and stays meaningful under load. Pick SLIs that actually reflect user experience — what breaks the user’s task, not what’s easiest to graph.
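To make the ratio concrete, here is a minimal sketch of computing the two SLIs quoted above from a batch of request records. The record shape, field names, and the 500ms threshold are illustrative, not from any particular monitoring system.

```python
def availability_sli(requests):
    """Fraction of requests that succeeded (status < 500)."""
    good = sum(1 for r in requests if r["status"] < 500)
    return good / len(requests)

def latency_sli(requests, threshold_ms=500):
    """Fraction of requests completed under the latency threshold."""
    good = sum(1 for r in requests if r["latency_ms"] < threshold_ms)
    return good / len(requests)

requests = [
    {"status": 200, "latency_ms": 120},
    {"status": 200, "latency_ms": 640},
    {"status": 503, "latency_ms": 520},
    {"status": 200, "latency_ms": 210},
]
print(availability_sli(requests))  # 0.75 — one request in four failed
print(latency_sli(requests))       # 0.5  — two requests in four were slow
```

Note that the two SLIs disagree about which requests were "good" — which is the point of picking the SLI that reflects the user's task rather than the one that is easiest to graph.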

An SLO is a target value for an SLI over a time window. “99.9% of login requests will succeed over any rolling 30-day window.” The 99.9% is not sacred; it is a deliberately chosen number the team commits to defending. Lower it if you cannot meet it; raise it only if users genuinely need more than you are giving them. A good SLO is slightly tighter than what users would tolerate and slightly looser than what engineering can deliver with unlimited budget.

The error budget is the arithmetic complement: 100% minus the SLO. A 99.9% SLO gives you a 0.1% error budget — about 43 minutes of downtime per month. That budget is real currency. When you have budget left, you spend it on risky work: feature launches, infrastructure migrations, experimental changes. When the budget is gone, you stop shipping anything that is not a reliability fix until the budget replenishes in the next window. This resolves the tension between shipping fast and shipping safely without a shouting match, because the number does the arguing for you.
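The arithmetic is simple enough to sketch directly; the function name and the 30-day window are illustrative.

```python
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes in a rolling window

def budget_minutes(slo, window_minutes=MINUTES_PER_30_DAYS):
    """Error budget for a window, expressed as minutes of full downtime."""
    return (1.0 - slo) * window_minutes

print(round(budget_minutes(0.999), 1))   # 43.2 minutes per 30-day window
print(round(budget_minutes(0.9995), 1))  # 21.6 minutes
print(round(budget_minutes(0.9999), 1))  # 4.3 minutes

# Remaining budget after a 15-minute outage against a 99.9% SLO:
print(round(budget_minutes(0.999) - 15, 1))  # 28.2 minutes left to spend
```

Notice how quickly the budget shrinks as nines are added: each extra nine divides the allowable downtime by ten, which is why the target has to be chosen deliberately rather than by picking the most impressive-sounding number.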

The whole system only works if the SLO is genuinely defended. If you exhaust the budget and keep shipping features anyway, you have redefined the SLO downward without saying so, and everyone will stop trusting the number within a month.

How It Plays Out

A payments team runs a transaction API with a 99.95% success SLO measured over 30 rolling days. For the first three weeks of the month, things go smoothly and the error budget sits mostly untouched. Then a bad deploy causes a 40-minute partial outage that eats most of the remaining budget. The team’s policy kicks in automatically: no new feature deploys until the next window opens. Engineers spend the last week writing regression tests, improving the canary analysis, and hardening the deployment pipeline. By the time the budget resets, the root cause is fixed and shipping resumes. Nobody had to argue about whether it was “safe” to deploy — the budget answered the question.

A small team discovers their first attempt at an SLO is too ambitious. They set 99.99% availability on a service running on a single cloud region, then spend two months failing to meet it every window. The retrospective concludes that 99.99% is not achievable without multi-region failover, which the team has neither the budget nor the staffing for. They lower the SLO to 99.9%, write down why, and communicate the change to stakeholders. The new target is meetable, the on-call rotation stops living in perpetual burndown, and the team can have an honest conversation about what it would cost to raise the number later.

A platform team operates a fleet of coding agents that deploy to production via automated pipelines. Each deployment advances a workflow through stages — plan, implement, verify, release — and the release stage is gated by a real-time error-budget check. If the budget for the target service is healthy, the agent ships; if the budget is below a threshold, the agent pauses the workflow and opens an incident for human review instead. The same rule that governs human deploys governs agent deploys, which means the team does not need a separate policy for machine-driven changes. The error budget is the trust boundary.
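The gate in that story can be sketched in a few lines. Everything here is hypothetical — the threshold, the function names, and the idea that the measured SLI arrives as a single number — but the shape is the point: one rule, applied identically to human and agent deploys.

```python
def remaining_budget_fraction(slo, measured_sli):
    """Fraction of the error budget still unspent (1.0 = untouched, <= 0 = exhausted)."""
    budget = 1.0 - slo          # e.g. 0.001 for a 99.9% SLO
    spent = 1.0 - measured_sli  # failure rate observed so far this window
    return 1.0 - spent / budget

def release_gate(slo, measured_sli, threshold=0.1):
    """Ship when enough budget remains; otherwise pause and escalate to a human."""
    if remaining_budget_fraction(slo, measured_sli) > threshold:
        return "ship"
    return "pause-and-open-incident"

print(release_gate(0.999, 0.9995))  # "ship" — roughly half the budget remains
print(release_gate(0.999, 0.9989))  # "pause-and-open-incident" — budget blown
```

The design choice worth noting is that the gate returns a decision, not an opinion: the agent does not weigh risk, it reads the budget.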

Tip

When you introduce SLOs to a service that has never had them, resist the urge to pick round numbers like 99.9% because they sound professional. Instead, measure the service for two or three weeks, see what it actually delivers, and set the SLO at a level you’re already close to meeting. You can tighten it later as the service improves. Setting an aspirational SLO you cannot meet teaches the team to ignore the number, which is worse than having no SLO at all.
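One way to operationalize that tip, sketched below: round the observed SLI down to the nearest standard "nines" step the service already clears. The candidate list and the rounding rule are one convention among several, not a prescription.

```python
# Common availability targets, loosest to tightest.
NINES_STEPS = [0.99, 0.995, 0.999, 0.9995, 0.9999]

def suggest_slo(observed_sli):
    """Pick the tightest standard target the service already meets."""
    candidates = [s for s in NINES_STEPS if s <= observed_sli]
    return max(candidates) if candidates else min(NINES_STEPS)

print(suggest_slo(0.9993))  # 0.999 — achievable today, tighten later
print(suggest_slo(0.9972))  # 0.995 — don't pretend to 99.9% yet
```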

Consequences

Benefits. SLOs give the team a shared definition of “good enough” that survives personnel changes and shifting priorities. The error budget turns reliability from a moral argument into an accounting exercise: you either have slack or you don’t, and what you do next follows from that. On-call engineers stop chasing noise because only SLO-threatening failures are worth paging for. Product and engineering can negotiate feature velocity against reliability in a language both sides understand. In agentic workflows, SLOs give automated release gates a principled trigger — agents can ship when budget permits and pause when it doesn’t, without requiring a human to translate “is this risky” into a policy.

Liabilities. Picking a meaningful SLI is harder than it looks; the wrong ratio measures what’s easy to count instead of what users feel. Setting the SLO too high creates permanent budget exhaustion and teaches the team to ignore it. Setting it too low creates slack that absorbs real incidents invisibly, hiding problems that should surface. Error budgets also tempt teams into reckless spending — “we have 20 minutes of budget left, let’s ship the risky thing” — which is a misread of what the budget is for. And SLOs only cover what you chose to measure. A service with a green SLO dashboard can still be failing its users in ways your SLIs do not capture, which is why SLOs pair with Observability and User Story work rather than replacing them.

  • Depends on: Metric – SLIs are metrics with a target and a consequence; the SLO pattern is what a metric becomes when it earns a commitment.
  • Depends on: Observability – SLIs are computed from the signals observability makes visible.
  • Quantifies: Performance Envelope – the envelope describes the range of acceptable behavior; the SLO makes “acceptable” a specific target you defend.
  • Enables: Rollback – a blown error budget is the most defensible trigger for rolling back a release.
  • Feeds: Steering Loop – SLO burn rate is one of the clearest signals a steering loop can act on, whether the loop’s actuator is a human or an agent.
  • Related: Feedback Loop – error-budget policies form a closed loop: deploy, measure, compare to target, adjust release rate.
  • Related: Failure Mode – SLOs define which failure modes count against the budget and which are tolerated.
  • Contrasts with: Test – tests prove specific behaviors work in a controlled environment; SLOs measure whether the running system keeps working under real conditions.

Sources

Google’s Site Reliability Engineering team formalized the SLI/SLO/error-budget triangle in its public SRE book (Beyer, Jones, Petoff, and Murphy, 2016) and the follow-up SRE Workbook (2018). Benjamin Treynor Sloss defined site reliability engineering as “what happens when you ask a software engineer to design an operations team,” and the error-budget idea is the most quotable piece of that worldview: spend reliability the way you spend any other budget.

Alex Hidalgo’s Implementing Service Level Objectives (O’Reilly, 2020) is the best single reference for teams that want to actually adopt SLOs rather than just read about them. It covers SLI selection, target setting, burn-rate alerting, and the organizational politics that make SLO adoption succeed or fail.

The SRE community has continued adapting the pattern for agentic systems. Industry writing in 2026 (under the “Error Budgets 2.0” label) explores continuous burn-rate monitoring, adaptive release governance based on live SLO health, and automated mitigation — scaling, traffic shaping, rollback — triggered by budget thresholds rather than by humans reading dashboards. The core definitions are unchanged; the governance layer above them has become increasingly agent-driven.

Further Reading

  • Site Reliability Engineering — Google’s free online edition of the SRE book, including the chapters on service level objectives and error budgets.
  • The SRE Workbook — the practical companion volume, with detailed worked examples of SLI selection and SLO tuning.
  • Implementing Service Level Objectives by Alex Hidalgo — a book-length treatment aimed at teams putting SLOs into practice for the first time.