---
slug: observability
type: concept
summary: "The degree to which a running system's internal state can be inferred from the signals it emits, naming whether the software is legible or opaque."
created: 2026-04-04
updated: 2026-05-23
related:
  agent-registry:
    relation: used-by
    note: "Registered agents emit signals routed into the same observability stack as services and jobs, so the population becomes legible in the same dashboards."
  agent-sprawl:
    relation: depended-on-by
    note: "Without shared observability the fleet is invisible, and sprawl is the default state of any invisible system."
  agentops:
    relation: generalized-by
    note: "AgentOps is observability specialized for systems that reason and choose, adding trajectory and decision-quality signals on top of the classic stack."
  deprecation:
    relation: used-by
    note: "Visibility into who still calls the deprecated thing is the signal that tells you when removal is safe."
  domain-oriented-observability:
    relation: refined-by
    note: "A specialization that instruments business-meaningful events rather than infrastructure signals."
  failure-mode:
    relation: enables
    note: "Detection: you can't detect failure modes you can't observe."
  feedback-loop:
    relation: used-by
    note: "You can only close a loop around an observable system."
  feedback-sensor:
    relation: related
    note: "Observability provides the runtime signals that feedback sensors capture at development time."
  fixture:
    relation: contrasts-with
    note: "Fixtures control inputs for testing, while observability captures outputs in production."
  logging:
    relation: enabled-by
    note: "Logging is the primary mechanism for achieving runtime observability."
  metric:
    relation: used-by
    note: "Metrics come from observable systems; you can't measure what you can't see."
  performance-envelope:
    relation: informs
    note: "Metrics data defines and monitors the envelope."
  premature-optimization:
    relation: related
    note: "Measurement must precede optimization."
  printf-debugging:
    relation: complements
    note: "Printf debugging is ad-hoc observability; systematic observability reduces the need for it."
  runtime-governance:
    relation: complements
    note: "Every runtime decision (allow, throttle, sandbox, escalate, block) is an observability event that needs to land in the same telemetry as everything else."
  service-level-objective:
    relation: used-by
    note: "SLIs are computed from the signals observability makes visible."
  shadow-agent:
    relation: related
    note: "Shadow agents evade all observability systems."
  silent-failure:
    relation: enables
    note: "Detection: observability is the primary defense against silent failures."
  technical-debt:
    relation: related
    note: "Debt hides in unmonitored code."
  test:
    relation: complements
    note: "Tests verify behavior before deployment; observability verifies behavior after."
  test-oracle:
    relation: contrasts-with
    note: "Oracles verify correctness in test; observability reveals behavior in production."
---

# Observability

*Observability is the degree to which a running system's internal state can be inferred from the signals it emits: the vocabulary that lets a team talk about whether the software they're operating is legible or opaque.*

> **Concept**
>
> Vocabulary that names a phenomenon.

> **📝 Where the name comes from**
>
> The word is borrowed from control theory. In 1960 Rudolf Kalman defined a dynamic system as *observable* if its complete internal state could be deduced from its external outputs over a finite period. Software engineers picked up the term in the mid-2010s and kept the structural meaning intact (a system you can see into is observable, one you can't is opaque) while shifting it from a yes-or-no mathematical property to a graded engineering one. When a team says a service has "good observability," they mean its emitted signals are rich enough that an operator (or an agent) can reconstruct what it was doing without attaching a debugger to a live process.
## What It Is

Observability is a property of a running system, not a piece of infrastructure you install. A system has it to the degree that someone outside the system can answer questions about what happened inside it, using only the signals the system emits.

Those signals fall into three established categories, which together form the running vocabulary practitioners use:

- **Logs** are timestamped records of discrete events. "Order 789 was placed by user 42 at 14:32:07." A log tells you *what happened*, one event at a time. Structured logs (key-value pairs or JSON) are vastly more useful than free-form text because they can be filtered and aggregated by machines, including agents.
- **Metrics** are numerical measurements over time. "p99 request latency is 230ms." "Error rate is 0.3%." "Queue depth is 47." A metric tells you *how the system is performing* at a given moment and across time. Metrics are cheap to collect and store relative to logs, and they are the right surface for alerts that fire on thresholds.
- **Traces** are records of one request's path through a distributed system, showing which services it touched, how long each step took, and where time was spent. A trace tells you *where time goes*, and it is the only signal that meaningfully diagnoses performance problems whose root cause is split across multiple services.

A fourth category has been gaining ground in the past few years: **events**, sometimes called *wide events* or *canonical log lines*, which collapse a request's full context (user, route, status, latency, feature flags, downstream calls) into a single high-cardinality record per request. Wide events sit between the three classical pillars and reduce the number of stitching joins an operator has to perform when investigating.
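
To make the wide-event idea concrete, here is a minimal sketch in Python using only the standard library. The `WideEvent` class, its method names, and the field values are hypothetical, chosen to mirror the fields listed above; real systems will differ in naming and transport. The point it illustrates is that context accumulates during the request and is emitted once, as a single JSON record.

```python
import json
import time
from typing import Any


class WideEvent:
    """Hypothetical helper: collects one request's context into a single record."""

    def __init__(self, route: str, user_id: int) -> None:
        self._start = time.monotonic()
        self.fields: dict[str, Any] = {
            "route": route,
            "user_id": user_id,
            "downstream_calls": [],
        }

    def set(self, **fields: Any) -> None:
        # Attach any extra context discovered while handling the request.
        self.fields.update(fields)

    def record_call(self, service: str, duration_ms: float) -> None:
        # Keep downstream calls inside the same record instead of separate log lines.
        self.fields["downstream_calls"].append(
            {"service": service, "duration_ms": duration_ms}
        )

    def finish(self, status: int) -> str:
        # Emit exactly one JSON line per request, at the end of the request.
        self.fields["status"] = status
        self.fields["latency_ms"] = round((time.monotonic() - self._start) * 1000, 3)
        return json.dumps(self.fields, sort_keys=True)


# Illustrative usage with made-up values.
event = WideEvent(route="/checkout", user_id=42)
event.set(feature_flags=["new_cart"])
event.record_call("payment", duration_ms=180.0)
line = event.finish(status=200)
print(line)
```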
The vocabulary hasn't fully settled, and different practitioner communities use the categories differently, but the underlying property they all describe is the same: enough is emitted that an outside observer can reconstruct what the system did.

Observability is distinct from monitoring. Monitoring is the practice of watching predetermined signals against predetermined thresholds; it answers known questions ("is the error rate above 1%?"). Observability is the property that makes it possible to answer questions you didn't know to ask in advance ("why did *this particular* request fail in *this particular* way?"). Monitoring tells you a known thing is wrong; observability lets you investigate an unknown thing. Charity Majors's framing of this distinction, that observability is about handling "unknown unknowns" rather than known ones, is the move that took the borrowed control-theory term and gave it operational meaning.

For agentic systems, observability picks up a second meaning that overlaps but isn't identical. The classical pillars still apply, but the agent itself becomes a system whose internal state matters: which tools it called, in what order, with what arguments, what intermediate reasoning it surfaced, where it backtracked, where it gave up. *Agent observability* (or *AgentOps*) is the same concept specialized to systems that reason and choose. It extends the surface area by adding trajectory signals (the sequence of tool calls and their outcomes), decision-quality signals (did the agent's chosen path match what a reasonable practitioner would have chosen), and prompt-response provenance (which input produced which output). The pillars stay; the things you instrument expand.

## Why It Matters

Software in production behaves differently than software in testing. Real data is messier, real load is higher, and real users find paths nobody anticipated. When something goes wrong, or just behaves unexpectedly, you need to understand *why*, not just *that*.
A system without observability gives you only the binary: it worked, or it didn't. A system with observability gives you the explanation. The gap between those two regimes is the difference between debugging in minutes and debugging in days.

The discipline matters because instrumentation can't be retrofitted cheaply. By the time an outage is in progress, the signals you wish you had are the ones you didn't add before deployment, and adding them now means a deploy in the middle of an incident, with whatever side effects that introduces. Observability is a design property: every significant operation should emit enough information that someone investigating a problem six months from now can reconstruct what happened, without needing to redeploy the system to capture that information. Teams that learn this the hard way usually learn it once and then refuse to ship a service without baseline instrumentation; teams that haven't learned it yet rediscover the cost every quarter.

There's a second-order effect that compounds. A team that can see its system can also reason about it as a population, not just as individual incidents. Patterns become visible: a particular endpoint that's always slow on Mondays, a deployment that mysteriously increases p99 by 30ms, a class of errors that's quietly trending upward without crossing any alert threshold. These signals exist in any sufficiently complex system; the question is only whether anyone can see them. The same property that lets you debug a specific incident also lets you anticipate the next one.

For agentic systems the stakes are higher because the population is larger and noisier. A fleet of agents executing many tasks across many users produces a volume of activity that no human can spot-check. The team's only handle on whether the agents are doing useful work is the telemetry coming back from them.
Without trajectory signals, an agent that confidently does the wrong thing is indistinguishable from one that does the right thing; both return a success code and an explanation. With trajectory signals, the team can sample the population, find the agents that took unusual paths, and surface the cases that need review. The agent's autonomy is bounded by the team's ability to see what it did, which is to say, by observability.

There's a final framing that some teams find clarifying. Reliability is not the absence of failure; it's the deliberate handling of failure modes the team has thought about. Without observability the team can't know which failure modes are firing or how often, so its reliability work is guesswork. Observability is what makes the [Failure Mode](failure-mode.md) catalog populated by evidence rather than imagination, and it's what lets a team argue from data about which mode deserves the next round of investment.

## How to Recognize It

You're looking at an observable system when an operator can answer a question about its recent behavior in minutes, using only the signals the system emits, without attaching a debugger or redeploying. You're looking at an opaque system when the answer to any specific question requires guessing, reproducing, or rebuilding. The diagnostic is the time-to-explanation, not the volume of logs.

Concrete signs that a team is operating in an observable system:

- **Structured logs everywhere.** Log lines are JSON or key-value pairs with consistent field names across services, not free-form English sentences. A grep across the fleet for `user_id=42` returns a coherent narrative; a grep for `Something went wrong with the order` returns silence.
- **Metrics by signal, not by service.** The dashboard shows latency percentiles, error rates, and saturation per endpoint: the signals that map onto failure modes. It doesn't show "service X is up" as the primary view, because liveness is the wrong question.
- **Traces that cross service boundaries.** Clicking on a slow request shows its path through every service it touched, with timing per hop. The trace IDs propagate through queues, retries, and async work, so the picture stays whole even when the request fans out.
- **Wide events on the hot paths.** The high-volume requests emit one canonical record each with all the context that matters: user, route, status, latency, feature flags, downstream calls, agent ID if applicable. The team investigates from those records first and falls back to component logs only when the wide event leaves something unexplained.
- **Sampling is principled.** High-cardinality signals are sampled deliberately (head-based for cost, tail-based for unusual cases) rather than emitted at 100% or dropped at random. The team knows their sampling strategy and can defend it.
- **Alerts have runbooks.** Each alert links to a page that names the failure mode the alert is meant to catch and the investigation steps that follow. Alerts without runbooks are alerts that no one understands.

Signs that a system is opaque:

- **Debugging by deploy.** The only way to investigate a production behavior is to add logging, push to production, wait for the behavior to recur, then remove the logging. The cycle time for one question is measured in hours.
- **The dashboards lie.** "Everything is green" on the dashboards while customers are reporting outages, because the dashboards measure surface symptoms (server is up, endpoint returns 200) rather than the failure modes that are actually firing (the 200 contains garbage).
- **Postmortems read as fiction.** The incident timeline is a reconstruction from human memory and Slack scrollback because the relevant signals weren't captured. Causality is inferred rather than evidenced.
- **The team can't sample.** When asked "show me ten typical requests from the last hour," the team can produce only the requests that errored or the requests they happened to log.
The normal traffic is invisible.

For agent observability specifically, additional signs matter:

- **Trajectories are reconstructible.** Given an agent run, the team can pull up the sequence of tool calls, their arguments, their returns, and any intermediate reasoning the agent surfaced. The trail isn't perfect (some intermediate state is genuinely inside the model), but the actions and their inputs and outputs are all captured.
- **Prompt-response provenance.** For every output the agent produced, the team can locate the prompt that produced it, including system prompt, conversation history, and any retrieved context. Without this, the team can't reproduce or audit agent decisions after the fact.
- **Sampling by quality, not just by error.** The team reviews successful agent runs as well as failures, sampling for unusual trajectories or decisions that look out of distribution. Errors are the easy cases; the dangerous cases are the confidently wrong ones, and they only surface through deliberate sampling.

> **💡 Tip**
>
> Structure your logs as key-value pairs (or JSON), not free-form sentences. Structured logs are searchable by machines, including AI agents, while "Something went wrong with the order" is useful to nobody.

## How It Plays Out

An e-commerce site experiences intermittent slow checkouts. The team opens the tracing dashboard, finds a slow checkout request, and sees that the payment service call took 8 seconds instead of the usual 200 milliseconds. They check the payment service metrics and see a spike in database connection wait time. The root cause, connection pool exhaustion under a particular traffic pattern, is identified in minutes, not days. The same investigation in an opaque system would have started with "is the database up?", proceeded through hours of guessing, and ended with a redeploy that added logging to find the answer that observability was supposed to surface immediately.

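
The trace-reading step in the checkout story can be sketched in a few lines of Python. Everything here is illustrative: the `Span` fields, the service and operation names, and the durations are hypothetical, and the self-time calculation assumes a span's direct children run one after another rather than in parallel. The sketch only shows how per-hop timing points at the hop where the time went; checking the payment service's own metrics, the next step in the story, is outside it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    service: str
    operation: str
    duration_ms: float


# Hypothetical spans for one slow checkout request; all values are made up.
spans = [
    Span("t1", "a", None, "checkout", "POST /checkout", 8250.0),
    Span("t1", "b", "a", "cart", "load_cart", 40.0),
    Span("t1", "c", "a", "payment", "charge", 8000.0),
    Span("t1", "d", "a", "inventory", "reserve", 60.0),
]


def self_time_ms(span: Span, all_spans: list[Span]) -> float:
    """Time spent in the span itself, excluding its direct children.

    Assumes children run sequentially; overlapping children would need
    interval arithmetic instead of a simple sum.
    """
    children = sum(s.duration_ms for s in all_spans if s.parent_id == span.span_id)
    return span.duration_ms - children


def dominant_span(all_spans: list[Span]) -> Span:
    """Return the span with the largest self time, i.e. where the time went."""
    return max(all_spans, key=lambda s: self_time_ms(s, all_spans))


culprit = dominant_span(spans)
print(culprit.service, culprit.operation, self_time_ms(culprit, spans))
# prints: payment charge 8000.0
```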

In an agentic workflow, observability becomes the mechanism that lets agents monitor and maintain deployed systems. An agent reads metrics, detects an anomaly, and investigates by querying logs and traces, programmatically, in the same way a human operator would, just faster and without breaks. "Alert: error rate exceeded 1%." The agent pulls the recent error logs, identifies the most common error pattern, traces it to a deployment that landed an hour earlier, and posts a finding with the suspect commit linked. This kind of automated triage is only possible when the system is observable, and the agent's behavior is itself only observable to the supervising team because of trajectory instrumentation that captures which signals the agent looked at and what conclusions it drew.

A platform team running a fleet of customer-deployed agents notices through wide-event sampling that one particular tool, a file-write helper, is being called with unusually large payloads by agents handling a specific customer's workflow. The wide events show the trajectory cleanly: the agent reads a large file, "edits" it by re-emitting the entire content with small changes, then writes the result back. The pattern is invisible in classical metrics (the calls succeed, the latency is bounded), but the agent is burning context tokens at a rate that will become a cost problem at scale. The fix is upstream of the agent: route file edits through a diff-based tool that takes only the changes. The signal that surfaced the issue was high-cardinality event data plus the discipline of sampling normal traffic, not just errors.

> **💡 Example Prompt**
>
> "Add structured JSON logging to the checkout flow. Each log entry should include a request_id, the step name, the duration in milliseconds, and any error details. Replace the existing print statements. Make sure trace context propagates through any async work the checkout dispatches."

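
One plausible shape for what that prompt asks for, sketched with Python's standard `logging`, `contextvars`, and `asyncio` modules. The `JsonFormatter` class, the step names, and the `req-123` identifier are hypothetical, and a bare `request_id` stands in for full trace context here. The design choice worth noting is that the identifier lives in a context variable, which asyncio copies into each new task, so work the checkout dispatches logs under the same `request_id` without passing it explicitly.

```python
import asyncio
import contextvars
import io
import json
import logging
import time

# Request-scoped context; asyncio tasks copy the current context when created.
request_id_var = contextvars.ContextVar("request_id", default="-")


class JsonFormatter(logging.Formatter):
    """Hypothetical formatter: one JSON object per record, fixed field names."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "level": record.levelname,
            "request_id": request_id_var.get(),
            "step": getattr(record, "step", None),
            "duration_ms": getattr(record, "duration_ms", None),
            "error": getattr(record, "error", None),
            "message": record.getMessage(),
        }
        return json.dumps(entry)


stream = io.StringIO()  # stands in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False


async def run_step(step: str) -> None:
    start = time.monotonic()
    await asyncio.sleep(0)  # placeholder for the real work of the step
    duration_ms = round((time.monotonic() - start) * 1000, 3)
    logger.info("step finished", extra={"step": step, "duration_ms": duration_ms})


async def checkout(request_id: str) -> None:
    request_id_var.set(request_id)
    await run_step("reserve_inventory")
    # The new task inherits a copy of the context, so request_id follows it.
    await asyncio.create_task(run_step("send_receipt"))


asyncio.run(checkout("req-123"))
print(stream.getvalue(), end="")
```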

A senior engineer reviewing an agent-generated module for a new microservice notices the module has zero instrumentation. The agent added the business logic the prompt requested and stopped there. The reviewer doesn't ship the change without it; they brief the agent to add structured logs for each significant decision in the flow, expose a small set of metrics (request count, latency histogram, error count by category), and propagate trace context through every external call. The pattern recurs often enough that the team adds "observability instrumented" as an explicit acceptance criterion in the briefing template they hand the agent. The agent isn't lazy; it just optimizes for what the brief says. The brief now says.

## Consequences

Treating observability as a designed property of a system, rather than a bag of tools to install, changes how the system is structured and how it's operated.

**Benefits.** Observable systems are easier to debug, easier to evolve, and easier to reason about as they grow. Time-to-explanation for an incident drops from hours to minutes. New engineers can come up to speed on a system by reading its dashboards and traces rather than its source. Reliability work becomes evidence-based: the team can argue from data about which failure mode deserves the next investment. For agentic systems, observability is also what makes safe autonomy possible. The team can grant the agent more decision-making latitude precisely because the team retains the ability to audit what the agent decided. And observability data feeds back into design: persistent slow paths become refactoring targets, hot endpoints become caching targets, and failure modes that show up in the wild become tests that prevent regressions.

**Liabilities.** The discipline has real cost. Telemetry consumes storage, network, and engineering time; high-cardinality signals are particularly expensive at scale.
Sensitive data leaks through logs and traces faster than through any other surface, so PII handling becomes a first-class concern (and a recurring source of incidents in its own right). Dashboards proliferate and become harder to read than the code they were supposed to summarize. Sampling decisions are subtle, and getting them wrong silently degrades the team's investigative capability without anyone noticing. And there's a softer cost: a team that becomes fluent in observability can develop the bad habit of treating *visible* problems as the only problems worth solving, while invisible ones (intent drift, design erosion, technical debt that hasn't yet manifested as a failure mode) accumulate unaddressed.

The discipline mirrors the one for [Failure Mode](failure-mode.md) in shape: name the signals you're tracking, justify the ones you're not, and revisit both as the system evolves. Observability that grows without pruning becomes noise; observability that shrinks without thought becomes blindness. The goal is enough visibility, not all visibility, and the team's judgment about *enough* is what separates a system that's expensively over-instrumented from one that's cheaply opaque.

## Sources

- Rudolf Kalman introduced observability as a formal property of dynamic systems in his 1960 paper "[On the General Theory of Control Systems](https://doi.org/10.1016/S1474-6670(17)70094-8)," where it meant the ability to infer a system's internal state from its external outputs. Software engineers borrowed the term decades later, but the core idea is unchanged.
- Twitter's Observability Engineering team published one of the first uses of "observability" in a software context in the 2013 post "[Observability at Twitter](https://blog.x.com/engineering/en_us/a/2013/observability-at-twitter)," followed by a detailed two-part technical overview in 2016 describing their metrics, tracing, and log aggregation infrastructure at scale ([part I](https://blog.x.com/engineering/en_us/a/2016/observability-at-twitter-technical-overview-part-i.html), [part II](https://blog.x.com/2016/observability-at-twitter-technical-overview-part-ii)).
- Charity Majors, co-founder of Honeycomb, adopted the control-theory term for software systems in 2016 and became its most visible advocate, framing observability as the property that lets a team investigate unknown unknowns rather than just monitor known ones. She, Liz Fong-Jones, and George Miranda codified the practice in *[Observability Engineering](https://www.oreilly.com/library/view/observability-engineering/9781492076438/)* (O'Reilly, 2022).
- Cindy Sridharan's *[Distributed Systems Observability](https://www.oreilly.com/library/view/distributed-systems-observability/9781492033431/)* (O'Reilly, 2018) organized the "three pillars" framework of logs, metrics, and traces that the article follows, giving practitioners a shared vocabulary for what observable systems produce.
- Benjamin Sigelman and colleagues at Google described Dapper, their production distributed tracing system, in the 2010 technical report "[Dapper, a Large-Scale Distributed Systems Tracing Infrastructure](https://research.google/pubs/dapper-a-large-scale-distributed-systems-tracing-infrastructure/)." Dapper's span-and-trace model became the foundation for open-source tracers like Zipkin and Jaeger and established distributed tracing as a pillar of observability.

---

- [Next: Domain-Oriented Observability](domain-oriented-observability.md)
- [Previous: Consumer-Driven Contract Testing](consumer-driven-contract-testing.md)