AgentOps
AgentOps is the practice of operating, monitoring, and governing AI agents in production, applying DevOps discipline to systems that reason, choose tools, and act on behalf of users.
“You cannot manage what you cannot measure.” — Peter Drucker
Also known as: Agent Observability, LLMOps for Agents, Production Agent Monitoring
Understand This First
- Observability – AgentOps is the agent-specific specialization of observability.
- Feedback Sensor – production monitoring is a feedback sensor that runs in the real world.
- Eval – evals score agents offline; AgentOps watches them live.
Context
You have shipped an agent. It is not a demo or a benchmark run; it is making decisions for real users, calling tools, spending tokens, and producing outputs you will be held responsible for. This is an operational concern, the step after construction and before the next iteration.
Traditional monitoring was built for services that answered requests the same way every time. Agents don’t. Two calls with the same input can take different paths, invoke different tools, and return different answers. A green health check tells you the process is alive; it tells you nothing about whether the agent is still doing what it’s meant to do.
Problem
How do you know whether an AI agent in production is behaving correctly, efficiently, and within its authority, when each run is a multi-step reasoning process with no guaranteed shape?
Traditional dashboards show you latency, error rates, and throughput. None of those catch an agent that quietly regressed on tool selection last Tuesday, burned a week of budget on a retry loop, or started answering off-policy questions because a prompt template drifted. By the time the classical signals light up, the damage has already shipped to users.
Forces
- Agent behavior is emergent. The same prompt and tools can yield different paths every run. You can’t monitor a path that doesn’t exist yet.
- Cost is a first-class signal. Tokens and tool calls translate directly to dollars. An agent that works correctly but spends triple what it should is still a production incident.
- Quality is not binary. “Did it succeed?” rarely has a yes-or-no answer. Partial success, hedged answers, and plausible-but-wrong outputs are all common.
- Privacy and compliance apply at every step. Reasoning traces and tool inputs often contain sensitive data that must be redacted before it reaches logs and must not be retained indefinitely.
- Debugging needs replay. When an agent does something strange, you need to reconstruct the run: which context it saw, which tools it picked, what each one returned.
Solution
Instrument every agent run end to end, then monitor the dimensions that traditional observability misses: reasoning steps, tool calls, token cost, quality signals, and autonomy boundaries. Treat AgentOps as a superset of service observability, not a replacement.
At the technical layer, capture the same logs, metrics, and traces you would capture for any service. At the agent layer, capture four additional streams:
- Trajectory. The ordered sequence of thoughts, tool calls, tool results, and intermediate outputs that made up a single run. This is the agent-level analog of a distributed trace, and it is the first thing you will want when something goes wrong.
- Cost. Tokens in, tokens out, cached tokens, tool invocations, and the model version used for each step. Aggregate by user, feature, and route so you can see where the money is going.
- Quality. Periodic sampled evaluation of live runs using the same rubrics you use offline. A drop in first-pass acceptance rate or a rise in retries is an early warning.
- Autonomy compliance. Did the agent stay inside its approval policy and bounded autonomy tier? Every step outside the sandbox needs a record.
Feed these streams into alerting. Classical alerts fire on latency and errors; AgentOps alerts fire on cost per run, retry rate, tool-selection drift, eval-score drop, and policy violations. The goal is to notice a regression in behavior before users do, not after the support tickets arrive.
Tooling is no longer the bottleneck. Production SDKs and platforms (AgentOps.ai, Langfuse, Arize Phoenix, LangSmith, Maxim, and the native tracing surfaces in major agent frameworks) cover most of the capture and storage work. The engineering effort is in deciding what to measure, how to slice it, and which signals earn an alert.
Before shipping a new agent, write the three AgentOps alerts you would want if it started misbehaving at 3 a.m. “Cost per successful run is 2x the rolling median.” “Retry rate above 20% for ten minutes.” “Any tool call outside the allowlist.” If you can’t articulate the alerts, you’re not ready for production.
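The three example alerts reduce to plain predicates over recent run data. A minimal sketch, assuming runs arrive as simple records; the allowlist contents and field names are illustrative:

```python
from statistics import median

# Illustrative allowlist; in practice this comes from the approval policy.
ALLOWED_TOOLS = {"read_file", "run_tests", "post_comment"}


def cost_alert(recent_costs: list[float], latest_cost: float) -> bool:
    """Fire when cost per successful run is 2x the rolling median."""
    return bool(recent_costs) and latest_cost > 2 * median(recent_costs)


def retry_alert(window_runs: list[dict]) -> bool:
    """Fire when the retry rate over the window (e.g. ten minutes) exceeds 20%."""
    if not window_runs:
        return False
    retries = sum(1 for r in window_runs if r["retried"])
    return retries / len(window_runs) > 0.20


def allowlist_alert(tool_calls: list[str]) -> bool:
    """Fire on any tool call outside the allowlist."""
    return any(t not in ALLOWED_TOOLS for t in tool_calls)
```

In a real deployment these predicates would run inside whatever alerting system already pages the team; the point is that each of the three alerts is expressible in a few lines once the cost, retry, and tool-call streams are captured.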
How It Plays Out
A team operates a coding agent that reviews pull requests. A week after shipping, cost per review doubles overnight. The classical dashboards are green: latency is fine, error rate is zero. The AgentOps dashboard shows the cause in one chart: the average number of tool calls per review jumped from four to eleven. A trajectory replay reveals that a recent prompt change removed an explicit “stop when you have enough context” instruction, so the agent now fetches every file in the diff’s directory before commenting. The fix is a three-line prompt edit; the alert would have caught it in hours instead of days if it had been wired up.
At a SaaS company running a support-automation agent, the on-call engineer wakes up to no pages: latency is fine, error rate is zero, uptime is green. The one red signal is on the AgentOps dashboard: an eval-score drop on a sampled slice of live runs, scored against a rubric that includes “answers the user’s actual question.” Tracing back, the team finds that a routing rule was updated and the agent now receives truncated context that omits the billing-policy section, so it has started telling users it cannot answer billing questions. No exception was thrown. No test failed. Only the quality signal exposed the regression, and the team shipped a fix the same day.
An autonomous data-migration agent runs under a tight approval policy: it may read any table, but may only write to a staging schema. The AgentOps layer records every tool call and flags any attempt to write outside staging as a policy violation. One morning the violation counter increments. Investigation shows the agent never actually wrote to production; a newly added tool had a misleading description that led the agent to try to call it against the production schema. The sandbox held, and the alert prompted the team to rewrite the tool description before the same mistake could recur somewhere without a sandbox to catch it.
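The staging-only write policy in the scenario above can be enforced and recorded by a small guard around the write tool. A sketch under assumed names; the schema convention and violation log are illustrative:

```python
# Violation records would feed the AgentOps policy-violation counter.
violations: list[dict] = []

ALLOWED_WRITE_SCHEMAS = {"staging"}


def guarded_write(table: str, rows: list[dict]) -> bool:
    """Permit writes only inside the staging schema; record everything else.

    Returns True if the write was allowed, False if the sandbox blocked it.
    """
    schema = table.split(".", 1)[0]  # e.g. "staging.users" -> "staging"
    if schema not in ALLOWED_WRITE_SCHEMAS:
        violations.append({"tool": "write", "table": table})
        return False  # the sandbox holds: the write never happens
    # ... perform the actual write here ...
    return True
```

The guard does double duty: it is both the enforcement point (Bounded Autonomy) and the sensor that produces the autonomy-compliance stream, which is why policy violations arrive as a first-class AgentOps signal rather than an exception in a log somewhere.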
Consequences
Benefits. You see what your agents are actually doing in production, not what you hoped they would do. Cost becomes a managed variable instead of a monthly surprise. Regressions in quality and tool selection surface as alerts instead of customer complaints. Trajectory replay makes debugging tractable, including for failures that only happen at real-world scale. Auditors, compliance teams, and skeptical executives get a real answer to “what did the agent do, and under what authority?”
Liabilities. Instrumentation costs engineering time and storage. Trajectories are verbose, and storing them in full for every run gets expensive fast, so you will need sampling and retention policies. Sensitive data in traces needs redaction before it hits long-term storage. A poor alerting strategy will flood the team with noise and train them to ignore the dashboards; alert quality matters more than alert quantity. AgentOps doesn’t replace evals or feedback sensors inside the agent’s control loop. It runs alongside them, covering the outer loop where the code meets real users and real money.
Related Patterns
- Specializes: Observability – AgentOps is observability built for systems that reason and choose, adding trajectory, cost, quality, and autonomy dimensions.
- Complements: Eval – evals score agents against a fixed suite; AgentOps watches behavior in the wild, and the two share rubrics.
- Uses: Feedback Sensor – production monitoring is the sensor that runs in the real environment.
- Feeds: Steering Loop – the signals AgentOps captures are what the outer steering loop acts on between releases.
- Enforces: Approval Policy – policy violations are a first-class AgentOps signal.
- Enforces: Bounded Autonomy – the autonomy compliance stream records whether the agent stayed inside its tier.
- Related: Deployment – AgentOps is the operational discipline that begins the moment an agent is deployed.
- Related: Metric – AgentOps dashboards are built out of agent-specific metrics.
- Contrasts with: DevOps – DevOps asks whether the service is up; AgentOps asks whether the agent is still doing the right thing.
Sources
- IBM’s 2026 treatment of AgentOps gave the discipline its current name and framing, positioning it as the agent-era successor to DevOps and MLOps.
- The four-dimension model used here (trajectory, cost, quality, autonomy) draws on production experience documented by several commercial agent-monitoring platforms that emerged in 2025 and 2026. No single source owns the taxonomy; it has converged across the industry.
- The broader observability lineage comes from the classical “three pillars” (logs, metrics, traces) as popularized by Charity Majors and the Honeycomb team, with the agent-level additions treated together as a fourth pillar rather than a replacement.
- The guides-and-sensors framework from Birgitta Boeckeler and Martin Fowler’s “Harness engineering for coding agent users” supplies the conceptual boundary between inside-the-loop sensing (Feedback Sensor) and outside-the-loop monitoring (AgentOps).
Further Reading
- Martin Fowler and Birgitta Boeckeler, “Harness engineering for coding agent users” – situates production monitoring inside the larger harness picture.
- Charity Majors, Liz Fong-Jones, George Miranda, Observability Engineering – the modern reference on observability, whose principles translate directly to the agent case.