Runbook
Also known as: Operations Playbook, Incident Response Procedure
Understand This First
- Configuration – runbooks reference configuration values and how to change them.
Context
This is an operational pattern that captures hard-won knowledge about how to handle recurring situations. A runbook is a documented procedure for a specific operational task or incident type. When the database runs out of disk space at 3 a.m., when the payment processor goes down, when a deployment goes sideways, a runbook tells the on-call engineer exactly what to do, step by step.
In agentic coding, runbooks serve a dual purpose. They guide human operators during incidents, and they double as structured instructions for AI agents: an agent that can read a runbook can assist with diagnosis, suggest next steps, or even execute parts of the procedure.
Problem
Operational knowledge lives in people’s heads. When those people are asleep, on vacation, or have left the company, the knowledge is unavailable. Even when the right person is around, they may be stressed, sleep-deprived, and making decisions under time pressure during an incident. How do you make sure operational procedures are available, reliable, and executable regardless of who’s on call?
Forces
- People forget steps under pressure, especially at 3 a.m. during an incident.
- Operational procedures change as the system evolves, and outdated runbooks are worse than no runbooks.
- Writing runbooks takes time that could be spent building features.
- Every incident is slightly different. A runbook can’t anticipate every variation.
Solution
Document your recurring operational procedures as step-by-step runbooks. Store them alongside your code in Version Control, or in a team wiki that is easily searchable. Write them for an audience that is competent but stressed: clear steps, no ambiguity, explicit commands they can copy and paste.
A good runbook includes:
- Title: what situation this runbook addresses.
- Symptoms: how to recognize that this runbook is the right one.
- Prerequisites: access, tools, or permissions needed.
- Steps: numbered, concrete actions. Include actual commands, URLs, and expected outputs.
- Verification: how to confirm the situation is resolved.
- Escalation: what to do if the runbook does not work.
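One way to keep these sections from being skipped is to treat a runbook as structured data and check it for gaps. The following is a minimal sketch, not a prescribed format: the `Runbook` class, the `missing_sections` helper, and the sample symptom text are all hypothetical names and values chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Runbook:
    """One runbook, with a field per section in the list above (hypothetical schema)."""
    title: str
    symptoms: list[str] = field(default_factory=list)
    prerequisites: list[str] = field(default_factory=list)
    steps: list[str] = field(default_factory=list)
    verification: list[str] = field(default_factory=list)
    escalation: str = ""


def missing_sections(runbook: Runbook) -> list[str]:
    """Return the names of empty sections, in the order they are declared."""
    sections = {
        "title": runbook.title.strip(),
        "symptoms": runbook.symptoms,
        "prerequisites": runbook.prerequisites,
        "steps": runbook.steps,
        "verification": runbook.verification,
        "escalation": runbook.escalation.strip(),
    }
    return [name for name, value in sections.items() if not value]


# A draft that still lacks prerequisites, verification, and escalation.
draft = Runbook(
    title="Database Disk Space Emergency",
    symptoms=["Disk usage alert fires on the primary database host"],  # assumed symptom
    steps=["Identify the largest tables", "Run the documented cleanup queries"],
)
print(missing_sections(draft))  # ['prerequisites', 'verification', 'escalation']
```

A check like this could run in code review or continuous integration so that an incomplete runbook is caught before anyone has to rely on it.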
Write runbooks after an incident, when the steps are fresh. Review and update them regularly; a runbook for a system that has changed is actively dangerous. During incident retrospectives, ask: “Did we have a runbook? Was it accurate? What should we add or change?”
When working with AI agents, well-structured runbooks become even more powerful. You can paste a runbook into a conversation with an agent and ask it to help execute the diagnostic steps, interpret log output, or suggest which branch to follow. The runbook provides the structure; the agent provides speed and pattern recognition.
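As a rough illustration of handing a runbook to an agent, the sketch below assembles a runbook and some fresh diagnostic output into a single prompt. The function name `build_agent_prompt`, the instruction wording, and the sample output line are assumptions made for this example; they do not come from any particular agent tool.

```python
def build_agent_prompt(runbook_text: str, observed_output: str) -> str:
    """Combine a runbook and fresh diagnostic output into one prompt string."""
    return "\n\n".join([
        "You are helping an on-call engineer work through the runbook below.",
        "Suggest which step or branch to follow next and explain why.",
        "Do not run any command yourself; propose it and wait for confirmation.",
        "RUNBOOK:\n" + runbook_text.strip(),
        "OBSERVED OUTPUT:\n" + observed_output.strip(),
    ])


prompt = build_agent_prompt(
    "1. Identify the largest tables.\n2. Run the documented cleanup queries.",
    "disk usage: 97%",  # hypothetical diagnostic output
)
print(prompt)
```

Keeping the runbook text verbatim in the prompt means the agent works from the same steps the human operator sees, and the explicit confirmation instruction leaves the decision to execute with the engineer.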
How It Plays Out
A startup’s primary database runs out of disk space on a Saturday night. The on-call engineer has been at the company for two months. She opens the runbook titled “Database Disk Space Emergency,” follows the steps to identify the largest tables, runs the documented cleanup queries, and verifies that disk usage has dropped to safe levels. The incident is resolved in 20 minutes. Without the runbook, she would have been guessing at 2 a.m.
A team adds a runbook for their deployment rollback procedure. It includes the exact commands to run, the dashboards to check, and the Slack channels to notify. During the next rollback, the on-call engineer follows the runbook and completes the rollback in three minutes. Afterward, they update the runbook to include a step they discovered was missing: checking for in-flight background jobs.
The best time to write a runbook is immediately after resolving an incident. The steps are fresh, the pain is motivating, and you know exactly what you wished you had documented. Make runbook creation part of your incident retrospective process.
“Write a runbook for handling database disk space emergencies. Include the exact commands to identify the largest tables, the cleanup queries to run, the verification steps, and the Slack channels to notify.”
Consequences
Runbooks democratize operational knowledge. Any competent engineer can handle an incident, not just the one person who has seen it before. Response times drop because the on-call engineer does not have to figure out the procedure from scratch. Incident stress decreases because there is a clear path to follow.
The cost is creation and maintenance. Writing runbooks takes time. Keeping them current as the system evolves takes discipline. An outdated runbook can lead an engineer down the wrong path during an incident, making things worse. Treat runbooks as living documents: review them during retrospectives, test them periodically, and update them whenever the system changes.
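Periodic review is easier to sustain when something flags runbooks that have gone unreviewed. The sketch below assumes each runbook records a last-reviewed date and uses a 90-day threshold; the function name, the runbook names, and the threshold are illustrative assumptions, not a recommendation from the pattern itself.

```python
from __future__ import annotations

from datetime import date, timedelta


def stale_runbooks(
    last_reviewed: dict[str, date],
    today: date,
    max_age_days: int = 90,  # assumed review interval
) -> list[str]:
    """Return the names of runbooks last reviewed more than max_age_days ago."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(name for name, reviewed in last_reviewed.items() if reviewed < cutoff)


# Hypothetical review log keyed by runbook name.
reviews = {
    "database-disk-space": date(2024, 1, 10),
    "deployment-rollback": date(2024, 5, 2),
}
print(stale_runbooks(reviews, today=date(2024, 6, 1)))  # ['database-disk-space']
```

The output of a check like this could feed the agenda of a retrospective or a recurring maintenance task, so that review happens before an incident exposes the gap.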
Related Patterns
- Uses: Version Control – runbooks should be version-controlled alongside the systems they describe.
- Supports: Rollback – rollback procedures are a common and critical runbook topic.
- Supports: Deployment – deployment procedures are often documented as runbooks.
- Complements: Environment – runbooks often include environment-specific steps and commands.
- Depends on: Configuration – runbooks reference configuration values and how to change them.