
Agent Trace

An agent trace is the structured record of one agent run, captured as a tree of spans where each span represents a step the agent took: a model call, a tool invocation, a sub-agent dispatch, or a retrieval.

Concept

A foundational idea to recognize and understand.

Also known as: Agent Trajectory, Reasoning Trace, Run Trace

Understand This First

  • Observability — the general practice agent traces serve.
  • Logging — the lower-level mechanism a trace can fall back on.
  • Tool — most spans inside an agent trace describe tool calls.
  • Subagent — sub-agents create the nested branches that make traces tree-shaped rather than flat.

What It Is

Take the OpenTelemetry trace model, the one originally invented to follow a single web request through a fleet of microservices, and point it inwards at one agent. The web request becomes the agent’s task. The microservices become the model calls, tool invocations, retrieval steps, and sub-agent dispatches the agent makes along the way. The result is an agent trace: a tree of spans rooted at the user’s request, branching every time the agent calls something, each leaf carrying its own inputs, outputs, latency, token counts, and errors.

A span is the unit. Each one has a name (tool_call:read_file, model:claude-opus-4, subagent:researcher), a start and end time, structured attributes (the arguments, the result, the model temperature, the token usage), and a parent span ID that hangs it onto the tree. A trace is the complete set of spans that share a single root. Run the agent twice on the same task and you get two traces, usually with different shapes: a different number of tool calls, a different sequence, different token totals. That variability is what makes agent debugging different from web-service debugging.

The tree shape matters. A linear log of “the agent did this, then this, then this” hides which step caused which side effect. A tree exposes the dependencies: the file read was a follow-up to a planner request, the failed search ran inside a sub-agent the orchestrator dispatched, the second model call was a retry forced by an argument-validation error on the first. The structure is the explanation.

The OpenTelemetry GenAI semantic conventions (2024-2025) standardized the attribute names for this domain (gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.tool.name), so traces emitted by one tool can be read by another. Before the conventions, every platform invented its own field names; afterwards, a trace from a custom orchestrator can land in any backend that speaks the standard.
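
A model-call span following those conventions might carry attributes like this. The keys are the convention names listed above; the values are invented for illustration.

```python
# Attribute keys follow the OpenTelemetry GenAI semantic conventions;
# the values are invented for illustration.
model_span_attrs = {
    "gen_ai.system": "anthropic",
    "gen_ai.request.model": "claude-opus-4",
    "gen_ai.usage.input_tokens": 1842,
    "gen_ai.usage.output_tokens": 312,
}
tool_span_attrs = {
    "gen_ai.tool.name": "search_codebase",
}

# Any backend that speaks the standard can read these keys
# without a platform-specific mapping step.
assert all(key.startswith("gen_ai.") for key in model_span_attrs)
```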

Why It Matters

Without a trace, an agent run is opaque. You see the prompt that went in and the answer that came out, and you have to imagine everything in between. When the answer is wrong (and with non-deterministic models it sometimes will be), you can’t ask “where did this go off the rails?” because you have no rails to inspect. The whole middle of the run is a black box.

A trace turns the black box into a glass one. The reviewer sees that the agent called search_codebase("permission") first, got back fifteen results, picked the wrong one, then asked the model to summarize that file, then wrote a fix based on the summary. The bug in the fix is now traceable to a specific span: the search ranking, not the model. Debugging an agent without a trace is like debugging a distributed system without a tracer: possible, but you spend most of your time guessing.

The same record serves several other jobs once an agent ships:

  • Token and cost attribution. Each span carries its own token count. Sum across all model spans in a trace to get per-run cost, group across traces to get per-feature cost, roll up across users to get per-customer cost. Without per-span accounting, the bill arrives as one undifferentiated number you can’t diagnose.
  • Multi-agent correlation. When a coordinator agent dispatches three workers in parallel, you need a single trace ID that ties their spans back to the parent. The tree structure handles this naturally: the workers’ root spans become children of the coordinator’s dispatch span, and the whole branch lives under the original user request.
  • Replay and post-hoc evaluation. Because every span captures inputs, outputs, and the model version, a trace is enough state to re-run the agent’s decisions offline. Pull a thousand production traces, swap in a new model, and you can see whether quality goes up or down before shipping the upgrade.
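
The per-run roll-up in the first bullet reduces to a fold over a trace's model spans. A sketch, assuming flat span records with the GenAI-convention attribute keys (the records themselves are invented):

```python
from collections import defaultdict

# Hypothetical flat span records; only model spans carry token counts.
spans = [
    {"trace_id": "t1", "name": "model:claude-opus-4",
     "attributes": {"gen_ai.usage.input_tokens": 1200,
                    "gen_ai.usage.output_tokens": 300}},
    {"trace_id": "t1", "name": "tool_call:search", "attributes": {}},
    {"trace_id": "t2", "name": "model:claude-opus-4",
     "attributes": {"gen_ai.usage.input_tokens": 800,
                    "gen_ai.usage.output_tokens": 150}},
]

def tokens_per_trace(spans):
    """Sum token usage across all spans, grouped by trace ID (per-run cost)."""
    totals = defaultdict(lambda: {"input": 0, "output": 0})
    for span in spans:
        attrs = span["attributes"]
        totals[span["trace_id"]]["input"] += attrs.get("gen_ai.usage.input_tokens", 0)
        totals[span["trace_id"]]["output"] += attrs.get("gen_ai.usage.output_tokens", 0)
    return dict(totals)

assert tokens_per_trace(spans)["t1"] == {"input": 1200, "output": 300}
```

The same grouping key swapped from trace ID to feature or customer gives the per-feature and per-customer roll-ups.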

This capability has stopped being optional. LangSmith, Langfuse, Arize Phoenix, and the native tracing surfaces in the major agent frameworks all emit OpenTelemetry-compatible traces by default. The interesting question is no longer whether to capture them; it’s what to put on each span and how long to keep them.

How to Recognize It

Real agent traces share a few properties. They are tree-shaped, not flat: nested spans, parent IDs, branches under sub-agent dispatches. They are complete, in the sense that every model call, every tool call, and every retrieval step shows up as a span, not just the ones the engineer remembered to instrument. And they survive the run, persisted to durable storage with a stable ID you can paste into a debugger, share with a teammate, or attach to a bug ticket.

The absence of traces shows up in the symptoms. Engineers explain agent failures by saying “I think it called the wrong tool” and can’t point at the span. Token bills arrive as a single line item with no per-feature breakdown. A bug reproduces in production but not in development, because there is no captured input to replay against. Multi-agent runs come back as three independent log streams that have to be stitched together by hand.

Keep the line between an agent trace and a progress log clear. A progress log is a human-readable narrative the agent writes for the next session’s reader: “I tried approach A, it failed because of X, so I switched to approach B.” A trace is a machine-readable structure the framework emits whether the agent intends it or not. Both record what happened. Only the trace lets you query, aggregate, replay, and evaluate.

How It Plays Out

A team has shipped an agentic customer-support assistant that resolves about half of incoming tickets without escalation. After a model upgrade, the resolution rate quietly drops to thirty percent. The dashboards stay green: latency is fine, error rate is fine, no exceptions are firing. With agent traces in the system, an engineer pulls a hundred recent traces, groups by outcome, and notices that under the new model the agent is calling search_knowledge_base four times more often, often with the same query phrased four different ways. The model has become more diligent about searching and less decisive about acting. The fix lands in the system prompt, not the model, and the team would never have located it without the per-span tool-call counts. The whole investigation takes an afternoon instead of the week it would have cost from the dashboards alone.
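
The grouping step in that investigation amounts to counting tool-call spans per trace and comparing across model versions. A sketch with invented span records:

```python
from collections import Counter

def tool_call_counts(trace_spans):
    """Count tool invocations in one trace, keyed by span name."""
    return Counter(s["name"] for s in trace_spans
                   if s["name"].startswith("tool_call:"))

# Two hypothetical traces of the same task under the old and new models.
old_model_trace = [{"name": "tool_call:search_knowledge_base"},
                   {"name": "model:v1"}]
new_model_trace = [{"name": "tool_call:search_knowledge_base"}] * 4 + \
                  [{"name": "model:v2"}]

assert tool_call_counts(old_model_trace)["tool_call:search_knowledge_base"] == 1
assert tool_call_counts(new_model_trace)["tool_call:search_knowledge_base"] == 4
```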

In a multi-agent research workflow, an orchestrator dispatches three researcher sub-agents in parallel: one to search papers, one to scan the web, one to summarize a local document. One of them returns nonsense. Without trace correlation, the engineer has three independent log streams and has to guess which sub-agent produced which output. With a single trace tree rooted at the orchestrator, the misbehaving sub-agent’s full branch is visible: the prompt it received, the four tool calls it made, the model output that drove the bad summary. The bug, a stale prompt template that the orchestrator was passing to that one role, is found in minutes.

Tip

Pick a trace ID format that is paste-friendly and human-recognizable. A 32-character hex blob is correct and unreadable; a hyphenated short prefix plus the timestamp is just as unique in practice and survives a screenshot in a Slack thread. The trace is only useful if engineers actually open it.
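
One way to build such an ID, as a sketch; the format is a suggestion, not a standard:

```python
import secrets
import time

def friendly_trace_id(prefix: str = "tr") -> str:
    """Short prefix + UTC timestamp + random suffix: unique enough in
    practice, and recognizable in a screenshot."""
    stamp = time.strftime("%Y%m%d-%H%M%S", time.gmtime())
    return f"{prefix}-{stamp}-{secrets.token_hex(3)}"

# e.g. "tr-20250114-093012-a1b2c3"
```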

Consequences

Benefits. Debugging gets faster, often dramatically: every step the agent took is inspectable, and a failed run can be opened, read, and explained instead of guessed about. Cost shows up where it actually came from, because token usage is broken down per span and rolled up per trace. Multi-agent correlation works without scaffolding — the tree shape preserves the parent-child structure across delegations. Because every span carries inputs and model version, runs become replayable: a thousand captured traces can be re-fed to a new model offline before anyone has to commit to the upgrade. And the organization can build evals that score real production traces, not just synthetic test cases.

Liabilities. Traces are verbose. A long agentic run can produce thousands of spans, each with a payload of inputs and outputs, and storing every trace in full quickly gets expensive. Sampling and retention policies are unavoidable: keep all traces for failed runs and a percentage of successful ones, and tier the storage so old traces age into cheaper backends. Trace data is also sensitive. Model inputs and tool arguments often contain personally identifiable information, API keys, or internal documents, so the same handling rules that apply to logs apply with more force to traces. A trace pipeline that leaks customer data into long-term storage is now a privacy incident, not just an observability lapse.
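
The retention policy described above reduces to a per-trace decision like this; the sample rate is an illustrative choice:

```python
import random

def keep_trace(succeeded: bool, sample_rate: float = 0.05) -> bool:
    """Sampling policy sketch: keep every failed run in full,
    and a fixed percentage of successful ones."""
    if not succeeded:
        return True
    return random.random() < sample_rate

assert keep_trace(succeeded=False)  # failures are always retained
```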

The hardest trap is trace drift. A team instruments tool calls, ships, and then a new tool gets added without a span. Six weeks later, the new tool is the third most expensive call in the system and nobody can see it. Treat agent traces as a contract on the agent’s instrumentation, the same way a typed interface is a contract on a function. New tools, new sub-agent roles, and new retrieval sources need their span shape defined when they are added, not after the fact. Frameworks that emit spans automatically on tool registration close most of the gap, but the discipline still belongs to the team.

A second trap is using a trace as a substitute for evaluation. A trace tells you what the agent did. It doesn’t tell you whether what the agent did was correct. Two traces with identical shapes can have wildly different quality, and only an Eval or a downstream business metric will tell you which is which. Pair the trace with a quality signal; a trace alone is not a verdict.

Sources

Benjamin Sigelman and colleagues at Google described the span-and-trace model in Dapper, a Large-Scale Distributed Systems Tracing Infrastructure (Google Technical Report, 2010). Every modern tracing system, including the agent-focused ones, inherits its data model from this paper.

The OpenTelemetry project published the GenAI Semantic Conventions (2024-2025), standardizing the attribute names for model calls, tool calls, and token usage that most agent tracing platforms now emit.

Cindy Sridharan’s Distributed Systems Observability (O’Reilly, 2018) framed the three-pillars model and gave practitioners the vocabulary that the agent-tracing community extended.

Charity Majors, Liz Fong-Jones, and George Miranda’s Observability Engineering (O’Reilly, 2022) made the case for wide events with high cardinality as the unit of observability, the property that lets a trace span carry the structured payload an agent run requires.

The trace-tree shape entered the agent literature through the practitioner community around 2024-2025, as platforms such as LangSmith, Langfuse, and Arize Phoenix converged on OpenTelemetry-compatible trace models for multi-step LLM applications. The convergence is community-driven rather than the work of a single author.

Further Reading

  • The OpenTelemetry GenAI working group publishes the active semantic conventions and discusses open issues; it is the closest thing to a standards body for agent tracing.
  • The Honeycomb blog’s series on wide events and high-cardinality observability remains the best practitioner-level introduction to the data-shape choices that determine whether a trace pipeline scales to agent workloads.