
Agent Observability for AI Coding: How to Trace What Your Agents Actually Did

Apr 19, 2026
Ani Galstian

AI agent observability for coding requires structured tracing across four dimensions: execution traces, output evaluations, token cost attribution, and per-agent identity tracking, because traditional APM tools cannot detect the non-deterministic, multi-step failure modes that define agent workflows.

TL;DR

Attribution is the pillar most observability stacks get wrong. Traces, evaluations, and cost tracking have mature tooling; answering "which agent produced this broken code, under which model version, at what cost" usually does not. Two paths solve the attribution gap. SDK-based instrumentation (LangSmith, Braintrust, Phoenix) captures every span in exchange for setup work. Structural isolation (Intent's per-agent worktrees and MCP connections) makes attribution a property of the workspace itself.

Why Multi-Agent Coding Sessions Are Invisible to Standard Tooling

A four-hour multi-agent coding session can produce 180+ tool calls, tens of dollars in token spend, and a silent authentication regression that passes every health check. When the bug surfaces the next morning, engineering teams need to answer three questions at once: which agent wrote the broken code, what context did it have, and how much did the session cost to produce it?

Traditional APM tools like Datadog, New Relic, and Prometheus were built around a deterministic request-response contract. A request arrives, code executes a predictable path, and a response returns. AI coding agents violate every assumption embedded in that model, so the resulting failure modes are invisible:

  • No stable baseline. Honeycomb describes LLM-based systems as nondeterministic, capable of producing different outputs given the same input depending on shifts in content, data, or prompt phrasing.
  • No exception on wrong answers. An agent can return 200 OK while generating code that compiles, passes partial tests, and ships a security bug.
  • No per-session state. Pre-aggregated metrics collapse the high-cardinality context (session ID, agent ID, prompt version) that makes agent debugging possible.

Engineering teams need purpose-built observability that captures whether agents executed, whether the output was correct, which agent produced it, and what it cost. Intent addresses the attribution dimension directly by isolating each agent in its own git worktree, so per-agent observability becomes a property of the workspace.

See how Intent's living specs keep parallel agents aligned across cross-service refactors.

Build with Intent

Free tier available · VS Code extension · Takes 2 minutes

Where Traditional Monitoring Breaks Down

Traditional APM fails for AI coding agents across six distinct failure modes, each rooted in a broken architectural assumption.

| Failure Mode | Traditional APM Signal | Why Detection Fails |
|---|---|---|
| Non-deterministic outputs | None | No stable baseline to compare against |
| Multi-step reasoning failure | Partial: spans visible, not correctness | Wrong decision at step 3 invisible until step 12 |
| Tool-calling loops | None: all spans succeed | Pre-aggregation cannot represent per-session state |
| Output quality / hallucination | None: agent returns 200 OK | APM measures execution, not correctness |
| High-cardinality context loss | Architecturally penalized | Pre-aggregated metrics discard session-specific data |
| Silent drift without crash | No alert condition met | Absence of exception is not evidence of correctness |

The tool-call loop is the canonical example. An agent rewrites a function, runs tests, observes a failure, attempts a fix, and loops. Each iteration is a valid LLM call returning within normal latency bounds. Traditional APM reports every span as successful while the agent burns tokens in a loop that no individual span can reveal. Honeycomb documents this pattern in production agent deployments and emphasizes observability to detect when queries fail to converge.
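This loop pattern can be detected from trace data alone by counting repeated identical tool invocations per session. A minimal sketch — the span shape and the repeat threshold are illustrative assumptions, not any vendor's schema:

```python
from collections import Counter

def detect_tool_loops(spans, repeat_threshold=5):
    """Flag sessions where the same (tool, arguments) pair repeats.

    `spans` is assumed to be a list of dicts with "session_id",
    "tool_name", and "arguments" keys -- a simplified span shape
    for illustration only.
    """
    per_session = Counter(
        (s["session_id"], s["tool_name"], s["arguments"])
        for s in spans
        if s.get("tool_name")
    )
    return {
        session_id
        for (session_id, tool, args), count in per_session.items()
        if count >= repeat_threshold
    }
```

A session where the agent runs the same test command with identical arguments five times gets flagged, even though every individual span reported success.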

Runtime decisions in agentic systems happen inside the model, which makes traces the primary artifact for debugging. Traditional software engineers read source code to understand behavior; agent engineers read traces.

The Four Pillars of Agent Observability

Agent observability for multi-agent coding systems organizes around four distinct pillars, each capturing data that the others cannot provide.

Pillar 1: Traces and Spans

A trace represents the complete lifecycle of an agent task, structured as a parent-child tree of typed observations. Typed span kinds matter because a generic "span" collapses the distinctions an agent debugger needs. LangFuse's taxonomy now covers an expanded set including Generation (LLM calls with token usage), Tool (function calls), Retriever (retrieval steps), Agent (agent-level operations), Chain, Evaluator, Embedding, and Guardrail, among others.

The key tradeoff at this pillar is span granularity versus storage cost. Capturing every tool argument and every retrieval result produces rich debugging data along with rapidly expanding storage bills. Most production stacks use tail-based sampling, keeping all traces that contain errors or exceed a cost threshold while sampling the remainder at 5-10%. One failure class makes the granularity worth the cost: in Plan-and-Execute architectures, an executor can successfully retrieve the correct answer while the replanner repeatedly rejects the intermediate result, causing timeouts. Without typed spans, that pattern is invisible in aggregate metrics.

Pillar 2: Evaluations

Evaluations answer whether what happened was correct, something traces alone cannot determine. LangFuse distinguishes two evaluation environments: online evaluations running against live production traces for real-time monitoring, and offline evaluations running against dataset benchmarks for CI and regression testing.

For a team starting from zero, a minimum viable eval suite for coding agents is:

  1. Output compiles. Cheap, deterministic, catches the most common hallucination.
  2. Tests pass. Slower but catches regressions traces cannot detect.
  3. Diff stays within bounds. Flag PRs that modify more than N files or exceed a line threshold, which catches runaway refactors.
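The three checks above can be sketched as plain functions. A minimal illustration for Python output — the diff bounds stand in for the article's N placeholders and are arbitrary defaults, and the test command is an assumption about the project:

```python
import subprocess

def output_compiles(source: str) -> bool:
    """Check 1: cheap, deterministic syntax check on generated code."""
    try:
        compile(source, "<agent-output>", "exec")
        return True
    except SyntaxError:
        return False

def tests_pass(test_command=("pytest", "-q")) -> bool:
    """Check 2: run the project's test suite (slower, catches regressions)."""
    return subprocess.run(test_command, capture_output=True).returncode == 0

def diff_within_bounds(files_changed: int, lines_changed: int,
                       max_files: int = 10, max_lines: int = 500) -> bool:
    """Check 3: flag runaway refactors that touch too much code."""
    return files_changed <= max_files and lines_changed <= max_lines
```

Wiring these into CI as gates on agent-authored PRs gives a working eval suite before any LLM-as-judge infrastructure exists.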

Both evaluation approaches involve tradeoffs. LLM-as-judge is cheap and scales, but it introduces correlated errors when the judge shares biases with the agent under test. Human annotation is the gold standard but cannot keep pace with agent throughput. Braintrust identifies annotation as a distinct sub-pillar because human feedback is what calibrates the LLM judges themselves and keeps evaluator quality in check.

Pillar 3: Cost and Token Tracking

Multi-agent coding systems autonomously chain LLM and API calls, so real-time cost tracking is essential. LangFuse tracks usage types including prompt_tokens, completion_tokens, total_tokens, cached_tokens, audio_tokens, and image_tokens, with aggregated cost propagated up the trace tree and color-coded to identify outliers.

The tradeoff at this pillar is where the measurement happens. Proxy-based tracking (Helicone's approach) captures cost with zero code changes but adds latency to every LLM call. SDK-based tracking (Langfuse, LangSmith) adds no request-path latency but requires instrumentation in every service. For teams running latency-sensitive agents against interactive users, the proxy overhead can push p95 past acceptable thresholds. For teams doing batch code generation, the proxy is usually fine.
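Propagating cost up the trace tree, as described above, amounts to summing each span's own cost with that of its descendants. A minimal sketch over a simplified span shape — the field names are illustrative, not Langfuse's actual schema:

```python
from collections import defaultdict

def aggregate_costs(spans):
    """Return total USD cost per span, including all descendant spans.

    `spans`: list of dicts with "span_id", "parent_span_id" (None for
    the root), and "cost_usd" -- a simplified shape for illustration.
    """
    children = defaultdict(list)
    by_id = {s["span_id"]: s for s in spans}
    for s in spans:
        if s["parent_span_id"] is not None:
            children[s["parent_span_id"]].append(s["span_id"])

    def total(span_id):
        # A span's rolled-up cost is its own cost plus its subtree's.
        return by_id[span_id]["cost_usd"] + sum(total(c) for c in children[span_id])

    return {sid: round(total(sid), 6) for sid in by_id}
```

The root span's rolled-up total is what shows up as session cost; per-subtree totals are what make outlier agents visible.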

Pillar 4: Attribution

Attribution answers: which agent, running which model version, made which tool call that produced which output? OpenTelemetry defines an agent identity schema including gen_ai.agent.id, gen_ai.agent.name, gen_ai.agent.description, and gen_ai.agent.version. All four attributes are specified as conditionally required (populated when the application provides the value).

Non-determinism compounds across agents. The MAESTRO evaluation suite documents this empirically: multi-agent systems remain structurally stable yet temporally variable across repeated runs, with architecture itself emerging as the dominant driver of reproducibility and cost-latency-accuracy tradeoffs. Without attribution fields on every span, failures traced to Agent C cannot be traced back to Agent A's bad output, and post-mortems stall at "the pipeline produced a bug somewhere."
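With the gen_ai.agent.* fields present on every span, tracing a failure back through the pipeline reduces to a group-by on agent identity. A minimal sketch — the span shape and the boolean "failed" flag are illustrative assumptions:

```python
from collections import defaultdict

def failures_by_agent(spans):
    """Group failed spans by agent identity so a post-mortem can name
    an agent and model version instead of "somewhere in the pipeline".

    Each span is assumed to carry "gen_ai.agent.id",
    "gen_ai.agent.version", and a "failed" flag (illustrative shape).
    """
    grouped = defaultdict(list)
    for span in spans:
        if span.get("failed"):
            key = (span.get("gen_ai.agent.id"), span.get("gen_ai.agent.version"))
            grouped[key].append(span["span_id"])
    return dict(grouped)
```

The same grouping key extended with model version and session ID answers the full attribution question from the start of this section.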

Intent handles this pillar through isolation instead of instrumentation. Each implementor agent runs in a dedicated git worktree with its own MCP connections, so cost, latency, and output quality are naturally scoped per agent without SDK integration.

| Pillar | What It Captures | Key Schema Fields |
|---|---|---|
| Traces and Spans | Full execution tree: LLM calls, tool calls, retrieval, control flow | trace_id, span_id, parent_span_id, type, latency_ms |
| Evaluations | Quality scores per span; online and offline; LLM judge, code checks, human annotation | scores[].name, scores[].value, scores[].source |
| Cost and Token Tracking | Token counts per type, USD cost per type, aggregated up trace tree | prompt_tokens, cached_tokens, cost_details.total |
| Attribution | Agent identity, model version, user/session context on every span | agent_type, agent_version, model, user_id, session_id |

See how Intent isolates agent work in separate environments backed by git worktrees.

Build with Intent

Free tier available · VS Code extension · Takes 2 minutes


OpenTelemetry for AI Agent Workflows

OpenTelemetry's GenAI semantic conventions are the emerging vendor-neutral standard for instrumenting LLM calls, with agent-level instrumentation still evolving. The entire GenAI namespace remains in Development status, which has a direct engineering consequence: instrumentation libraries depending on these conventions cannot ship stable releases. OpenTelemetry's stability proposal acknowledges this directly, noting that many instrumentation libraries are stuck on pre-release versions because they depend on experimental semantic conventions.

Recommendation: Teams already running OTel for application telemetry should adopt GenAI conventions now and accept the breaking-change risk, since the alternative is a parallel observability stack. Teams starting fresh should evaluate OpenInference, which offers richer agent-specific span kinds (AGENT, TOOL, RETRIEVER, RERANKER) and is stable enough to build on today.

Instrumentation Example

A compliant LLM inference span requires gen_ai.operation.name and gen_ai.provider.name, and includes gen_ai.request.model when available. Token tracking attributes carry a Recommended requirement level:

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

# `call_openai` and `messages` stand in for your own client call and payload.
with tracer.start_as_current_span("chat gpt-4o") as span:
    # Required attributes for a compliant LLM inference span
    span.set_attribute("gen_ai.provider.name", "openai")
    span.set_attribute("gen_ai.operation.name", "chat")
    # Included when available
    span.set_attribute("gen_ai.request.model", "gpt-4o")
    span.set_attribute("gen_ai.request.temperature", 0.7)

    response = call_openai(messages)

    # Recommended token-tracking attributes
    span.set_attribute("gen_ai.response.model", response.model)
    span.set_attribute("gen_ai.usage.input_tokens", response.usage.prompt_tokens)
    span.set_attribute("gen_ai.usage.output_tokens", response.usage.completion_tokens)
```

For teams that prefer not to write span attributes by hand, OpenLLMetry auto-instruments OpenAI, Anthropic, AWS Bedrock, LangChain, LlamaIndex, and others.

Key Limitations

Before committing to OTel GenAI conventions, account for these documented gaps:

  • Missing agentic primitives. Standardized attributes for tasks, actions, teams, artifacts, and memory are not yet defined. Issue #2664 identifies these gaps and issue #2665 proposes conventions for tasks.
  • Sensitive content capture. Prompt and completion content may be captured, but because it can contain secrets and PII, implementations should minimize collection and apply filtering or redaction at the span processor level.
  • Opt-in tool arguments. Tool call arguments (gen_ai.tool.call.arguments) are not captured by default, which limits debugging depth unless explicitly enabled.
  • Complex attribute types. Attributes typed as any are not efficiently queryable across all backends, which affects dashboard design.

Tool Comparison: LangSmith, Braintrust, Arize Phoenix, Helicone

Choosing an observability platform depends on framework coupling, evaluation depth, hosting requirements, and cost structure.

| Dimension | LangSmith | Braintrust | Arize Phoenix (OSS) | Helicone |
|---|---|---|---|---|
| Primary integration | SDK-based, lowest friction for LangChain | SDK (framework-agnostic) | SDK (OTel/OpenInference) | Proxy-first; SDK available |
| Trace depth | Full tree with nested spans | Tree- or DAG-structured spans with typed taxonomy | Full hierarchy with span replay | Sessions-based grouping |
| Evaluation | Offline datasets, online monitoring, LLM-as-judge, human annotation | Five-stage eval lifecycle with CI/CD integration | LLM-based evals, Ragas integration, hand-annotated datasets | Reporting and analysis via integrations |
| Cost tracking | Per-step token and cost; multi-modal support | Token and cost per LLM span | Via OTel span attributes | Per-request cost, cache savings |
| Self-hosting | Enterprise only | Enterprise only | Free, no external dependencies | All tiers |
| Open source | No | No | Yes | Yes |
| Caching/rate limiting | No | No | No | Yes |
| Base pricing | Free (5K traces/mo, 1 seat); $39/seat/mo Plus | Free (1M spans); $249/mo Pro | Free (self-hosted); AX Pro $50/mo | Free (10K req/mo, 7-day retention); $79/mo Pro |

Selection Guidance with Tradeoffs

Each platform has a failure mode worth knowing before adoption:

  • LangSmith delivers the fastest path to useful traces for LangChain/LangGraph teams, validated in production at Acxiom and ServiceNow. The tradeoff is framework lock-in plus per-seat pricing: at $39/seat/month on Plus, a team of five pays $195/month before any trace overage, and teams that later migrate off LangChain face a second observability migration.
  • Braintrust is framework-agnostic and its five-stage eval lifecycle is the strongest in the comparison, but the jump from free to $249/month Pro is steep for teams with modest trace volume. Usage-based overages push costs higher for eval-heavy workflows.
  • Arize Phoenix is the only free self-hostable option with serious evaluation depth, but teams absorb the operational cost: running it on a ClickHouse or Postgres backend at production scale requires dedicated ownership.
  • Helicone has the lowest integration friction through its proxy model and adds caching and rate limiting, but the proxy sits on the request path and adds latency. The free Hobby tier also caps data retention at 7 days; Pro extends it to 1 month. Evaluation capabilities are shallower than the other three.

Each tool addresses traces, cost, and some degree of evaluation. Attribution remains the gap, requiring either manual instrumentation through these tools or an architectural approach that provides attribution structurally.

How Intent Provides Built-In Observability

Intent, the Augment Code desktop workspace for agent orchestration, takes a different approach to multi-agent observability: structural isolation instead of instrumentation-based tracing. Intent builds the attribution boundary into the workspace itself, so developers do not need to propagate trace IDs across agents.

Architecture-Driven Attribution

Intent's workspace model creates clean observability boundaries. Each implementor agent operates in an isolated git worktree with its own MCP (Model Context Protocol) connections. This architecture produces clear attribution points for cost, latency, and output quality per agent and per task.

Because agents are physically separated by worktree and MCP boundaries, attribution falls out of the architecture without SDK integration or span propagation code. Teams get per-agent granularity without building custom trace propagation. Living specs act as a second observability layer: every agent reads from and writes to the spec, producing a persistent record of what each agent was tasked with and what it delivered.

| Isolation Mechanism | Observability Effect |
|---|---|
| Worktree-level isolation | Each workspace backed by its own git worktree; parallel agent work without code conflicts |
| MCP connection isolation | Tool access and external service connections scoped per agent; cost and latency attributable at the connection boundary |
| Living spec coordination | Agents read from and update a shared spec, creating a persistent record of what each agent was tasked with and what it delivered |

What Intent Does Not Replace

Structural isolation narrows the observability surface by design. Teams running the following workflows still need a dedicated observability platform alongside Intent:

  • Prompt and completion capture. Intent does not store the raw prompts sent to each agent or the full completion content, which LangSmith and Phoenix do.
  • Prompt versioning and A/B testing. Iterative prompt engineering against historical traces is a core Braintrust feature with no direct equivalent in Intent.
  • Per-tool-call latency breakdowns. Intent attributes cost and latency per agent; teams needing per-LLM-call timing data to debug slow tool invocations need an SDK-based stack.
  • Offline regression suites. CI-integrated eval runs against curated datasets live in Braintrust or LangSmith, not in Intent's workspace.

The two approaches cover different ground. Intent handles attribution cleanly without setup cost. External SDKs capture the prompt-level and eval-level detail Intent does not store.

Structural vs. Instrumentation-Based Approaches

| Dimension | Intent (Structural) | LangSmith / Braintrust (Instrumentation) |
|---|---|---|
| Attribution mechanism | Isolated worktree and MCP boundaries | SDK trace propagation, span IDs |
| Setup required | Built into workspace architecture | SDK integration, proxy configuration |
| Attribution granularity | Per agent, per task | Per LLM call, per chain step, per tool invocation |
| Framework coupling | Coordinator-based orchestration with BYOA support for Claude Code, Codex, and OpenCode | LangSmith strongest in LangChain; Braintrust framework-agnostic |
| Evaluation/experimentation | Handled through the Verifier agent against the living spec | Core feature; supports datasets, A/B tests, prompt versioning |

Teams coordinating multi-agent coding sessions in Intent work with isolated execution contexts through git worktrees, and the spec doubles as a coordination record that survives the session.


Setting Up an Observability Pipeline for Multi-Agent Sprints

A practical pipeline for multi-agent coding sessions flows through five layers: agent code, OTel SDK with span processors, OTel Collector, storage backends, and alerting.

Step 1: Instrument Agent Code

Pick a semantic convention first. OTel GenAI conventions are the emerging vendor-neutral standard but remain in Development; OpenInference conventions are more mature for agent-specific span kinds (AGENT, TOOL, RETRIEVER, RERANKER). Teams feeding traces into Phoenix or Arize should start with OpenInference; teams aligning with broader platform telemetry should accept OTel GenAI's stability risk.

Set gen_ai.agent.id, gen_ai.agent.name, and gen_ai.agent.version on agent root spans. For the OpenAI Agents SDK:

```python
from openinference.instrumentation.openai_agents import OpenAIAgentsInstrumentor

# tracer_provider is your already-configured OTel TracerProvider
OpenAIAgentsInstrumentor().instrument(tracer_provider=tracer_provider)
```

Step 2: Add PII Redaction at the Span Processor Level

SDK-level redaction is the safest default because sensitive data never leaves the process. Collector-level redaction centralizes rules but requires the Collector to sit inside the trust boundary.

```python
from opentelemetry.sdk.trace import SpanProcessor, ReadableSpan

class SensitiveDataRedactor(SpanProcessor):
    SENSITIVE_ATTRS = ["llm.prompts", "llm.completions", "user.email"]

    def on_end(self, span: ReadableSpan) -> None:
        # on_end receives a ReadableSpan, which the public API treats as
        # immutable, so this sketch reaches into the underlying attribute
        # mapping to overwrite values before the span is exported.
        for attr in self.SENSITIVE_ATTRS:
            if span.attributes and attr in span.attributes:
                span._attributes[attr] = "[REDACTED]"
```

Step 3: Configure Alerting for Agent Failure Modes

Tool-call loops and context overflow are good paging candidates because they correlate with runaway cost; latency degradation is usually dashboard-only unless agents serve interactive users. Thresholds below draw on OpenObserve's AI agent monitoring reference:

| Metric | Alert Threshold | Failure Mode Detected |
|---|---|---|
| tool_call.count per session | Greater than 20 | Infinite tool-call loop |
| agent.steps | Greater than configured max | Stuck reasoning chain |
| error.rate | Greater than 1% over 5 minutes | Systematic failures |
| llm.usage.prompt_tokens | Greater than 80% of context window | Context overflow risk |
| duration (end-to-end) | p95 greater than 10s for interactive agents | Latency degradation |
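These thresholds can be expressed as one check over per-session metrics. A sketch assuming a flat metrics dict — the field names mirror the table, not any vendor's API, and max_steps and context_window are deployment-specific assumptions:

```python
def check_session_alerts(metrics, max_steps=50, context_window=128_000):
    """Return the names of alerts triggered by one session's metrics.

    `metrics` is an illustrative flat dict, e.g. from a metrics pipeline.
    """
    alerts = []
    if metrics.get("tool_call_count", 0) > 20:
        alerts.append("tool-call loop")
    if metrics.get("agent_steps", 0) > max_steps:
        alerts.append("stuck reasoning chain")
    if metrics.get("error_rate", 0.0) > 0.01:
        alerts.append("systematic failures")
    if metrics.get("prompt_tokens", 0) > 0.8 * context_window:
        alerts.append("context overflow risk")
    if metrics.get("p95_duration_s", 0.0) > 10 and metrics.get("interactive"):
        alerts.append("latency degradation")
    return alerts
```

The first two alerts are the paging candidates; the latency check only fires for interactive agents, matching the dashboard-only guidance above.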

Step 4: Choose a Storage and Visualization Stack

Buy-versus-build is the core decision, driven by trace volume, data residency, and engineering capacity:

  • Starter (1-3 engineers): Langfuse Cloud with OpenInference instrumentors and Slack webhooks. Hands-off; accept vendor data retention.
  • Mid-scale (5-15 engineers): OTel SDK plus Collector with fan-out, self-hosted Langfuse or Arize Phoenix, Prometheus and Grafana for metrics. Dedicated ownership pays off once volume exceeds cloud tier limits.
  • Enterprise (data residency, compliance): OTel Collector as sidecar with tenant routing, self-hosted Langfuse on ClickHouse. Langfuse's Cresta case study validates this pattern, keeping trace data inside the security boundary under full retention control.

One decision cuts across all four steps: sampling. Full-fidelity capture is affordable in development but breaks storage budgets at scale. A common production pattern is tail-based sampling — retain 100% of traces containing errors, 100% of traces exceeding a cost threshold (e.g., $0.50), and 5-10% of the remainder.
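That sampling policy fits in a few lines. A sketch that hashes the trace ID so the keep/drop decision is deterministic per trace — the thresholds are the ones quoted above:

```python
import hashlib

def keep_trace(trace_id: str, has_error: bool, cost_usd: float,
               cost_threshold: float = 0.50, sample_rate: float = 0.05) -> bool:
    """Tail-based sampling: keep all error traces, all expensive traces,
    and a deterministic fraction of the remainder."""
    if has_error or cost_usd >= cost_threshold:
        return True
    # Hash the trace ID into [0, 1) so the same trace always gets the
    # same decision, with roughly sample_rate of traces retained.
    digest = hashlib.sha256(trace_id.encode()).digest()
    fraction = int.from_bytes(digest[:8], "big") / 2**64
    return fraction < sample_rate
```

Hashing rather than calling a random generator means every collector replica makes the same decision for the same trace, which matters once spans for one trace arrive at different hosts.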

Start with Trace Attribution Before the Next Multi-Agent Sprint

The core tradeoff in agent observability is scope versus setup. Instrumentation-based stacks give teams deeper prompt, span, and evaluation visibility in exchange for schema decisions, propagation code, and storage design. Structural approaches narrow the data surface in exchange for near-zero setup cost.

A practical 30-day plan for teams starting from zero:

  • Week 1: Establish a trace boundary for every agent run. Even a single root span with agent.id, session.id, and start/end timestamps beats no attribution.
  • Week 2: Instrument one agent end-to-end with OpenInference span kinds. Pick the highest-cost agent; the cost data pays for the work.
  • Week 3: Add cost attribution at the LLM call level. Tag every call with agent.id and aggregate by agent in Langfuse or Phoenix.
  • Week 4: Wire alerts for tool-call loops and context overflow. These two detect the most expensive silent failures.
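Week 1's trace boundary needs nothing more than a context manager that emits one structured record per agent run. A minimal sketch — the JSON-lines sink is an assumption; any log pipeline works:

```python
import json
import time
import uuid
from contextlib import contextmanager

@contextmanager
def agent_run(agent_id: str, session_id: str, sink):
    """Emit a single root-span record with agent.id, session.id, and
    start/end timestamps -- the week-1 minimum for attribution."""
    record = {
        "trace_id": uuid.uuid4().hex,
        "agent.id": agent_id,
        "session.id": session_id,
        "start_ts": time.time(),
    }
    try:
        yield record
        record["status"] = "ok"
    except Exception:
        record["status"] = "error"
        raise
    finally:
        record["end_ts"] = time.time()
        sink.write(json.dumps(record) + "\n")
```

Usage: `with agent_run("implementor-1", "session-42", log_file): run_agent()` — every run, including crashed ones, leaves an attributable record behind.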

For teams coordinating parallel coding agents, Intent's isolated worktrees, dedicated MCP connections, and living specs provide week-1 attribution out of the box, which shortens the runway to weeks 2-4.

See how Intent's workspace isolation gives every agent a traceable execution boundary without custom instrumentation.

Build with Intent

Free tier available · VS Code extension · Takes 2 minutes


Written by

Ani Galstian

Ani writes about enterprise-scale AI coding tool evaluation, agentic development security, and the operational patterns that make AI agents reliable in production. His guides cover topics like AGENTS.md context files, spec-as-source-of-truth workflows, and how engineering teams should assess AI coding tools across dimensions like auditability and security compliance.
