AI agent observability for coding requires structured tracing across four dimensions: execution traces, output evaluations, token cost attribution, and per-agent identity tracking, because traditional APM tools cannot detect the non-deterministic, multi-step failure modes that define agent workflows.
TL;DR
Attribution is the pillar most observability stacks get wrong. Traces, evaluations, and cost tracking have mature tooling; answering "which agent produced this broken code, under which model version, at what cost" usually does not. Two paths solve the attribution gap. SDK-based instrumentation (LangSmith, Braintrust, Phoenix) captures every span in exchange for setup work. Structural isolation (Intent's per-agent worktrees and MCP connections) makes attribution a property of the workspace itself.
Why Multi-Agent Coding Sessions Are Invisible to Standard Tooling
A four-hour multi-agent coding session can produce 180+ tool calls, tens of dollars in token spend, and a silent authentication regression that passes every health check. When the bug surfaces the next morning, engineering teams need to answer three questions at once: which agent wrote the broken code, what context did it have, and how much did the session cost to produce it?
Traditional APM tools like Datadog, New Relic, and Prometheus were built around a deterministic request-response contract. A request arrives, code executes a predictable path, and a response returns. AI coding agents violate every assumption embedded in that model, so the resulting failure modes are invisible:
- No stable baseline. Honeycomb describes LLM-based systems as nondeterministic, capable of producing different outputs given the same input depending on shifts in content, data, or prompt phrasing.
- No exception on wrong answers. An agent can return 200 OK while generating code that compiles, passes partial tests, and ships a security bug.
- No per-session state. Pre-aggregated metrics collapse the high-cardinality context (session ID, agent ID, prompt version) that makes agent debugging possible.
Engineering teams need purpose-built observability that captures whether agents executed, whether the output was correct, which agent produced it, and what it cost. Intent addresses the attribution dimension directly by isolating each agent in its own git worktree, so per-agent observability becomes a property of the workspace.
See how Intent's living specs keep parallel agents aligned across cross-service refactors.
Free tier available · VS Code extension · Takes 2 minutes
Where Traditional Monitoring Breaks Down
Traditional APM fails for AI coding agents across six distinct failure modes, each rooted in a broken architectural assumption.
| Failure Mode | Traditional APM Signal | Why Detection Fails |
|---|---|---|
| Non-deterministic outputs | None | No stable baseline to compare against |
| Multi-step reasoning failure | Partial: spans visible, not correctness | Wrong decision at step 3 invisible until step 12 |
| Tool-calling loops | None: all spans succeed | Pre-aggregation cannot represent per-session state |
| Output quality / hallucination | None: agent returns 200 OK | APM measures execution, not correctness |
| High-cardinality context loss | Architecturally penalized | Pre-aggregated metrics discard session-specific data |
| Silent drift without crash | No alert condition met | Absence of exception is not evidence of correctness |
The tool-call loop is the canonical example. An agent rewrites a function, runs tests, observes a failure, attempts a fix, and loops. Each iteration is a valid LLM call returning within normal latency bounds. Traditional APM reports every span as successful while the agent burns tokens in a loop that no individual span can reveal. Honeycomb documents this pattern in production agent deployments and emphasizes observability to detect when queries fail to converge.
Runtime decisions in agentic systems happen inside the model, which makes traces the primary artifact for debugging. Traditional software engineers read source code to understand behavior; agent engineers read traces.
The Four Pillars of Agent Observability
Agent observability for multi-agent coding systems organizes around four distinct pillars, each capturing data that the others cannot provide.
Pillar 1: Traces and Spans
A trace represents the complete lifecycle of an agent task, structured as a parent-child tree of typed observations. Typed span kinds matter because a generic "span" collapses the distinctions an agent debugger needs. Langfuse's taxonomy now covers an expanded set including Generation (LLM calls with token usage), Tool (function calls), Retriever (retrieval steps), Agent (agent-level operations), Chain, Evaluator, Embedding, and Guardrail, among others.
The key tradeoff at this pillar is span granularity versus storage cost. Capturing every tool argument and every retrieval result produces rich debugging data along with rapidly expanding storage bills. Most production stacks use tail-based sampling, keeping all traces that contain errors or exceed a cost threshold while sampling the remainder at 5-10%. One failure class makes the granularity worth the cost: in Plan-and-Execute architectures, an executor can successfully retrieve the correct answer while the replanner repeatedly rejects the intermediate result, causing timeouts. Without typed spans, that pattern is invisible in aggregate metrics.
Pillar 2: Evaluations
Evaluations answer whether what happened was correct, something traces alone cannot determine. Langfuse distinguishes two evaluation environments: online evaluations running against live production traces for real-time monitoring, and offline evaluations running against dataset benchmarks for CI and regression testing.
For a team starting from zero, a minimum viable eval suite for coding agents is:
- Output compiles. Cheap, deterministic, catches the most common hallucination.
- Tests pass. Slower but catches regressions traces cannot detect.
- Diff stays within bounds. Flag PRs that modify more than N files or exceed a line threshold, which catches runaway refactors.
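The diff-bounds check in particular is cheap to implement. A minimal sketch, with illustrative thresholds and field names:

```python
from dataclasses import dataclass

@dataclass
class DiffStats:
    files_changed: int
    lines_changed: int

def diff_within_bounds(diff: DiffStats,
                       max_files: int = 10,
                       max_lines: int = 500) -> bool:
    """Flag runaway refactors: a change touching too many files or lines
    should be held for review rather than merged automatically."""
    return diff.files_changed <= max_files and diff.lines_changed <= max_lines

# A focused fix passes; a sweeping rewrite is flagged.
print(diff_within_bounds(DiffStats(files_changed=3, lines_changed=120)))    # True
print(diff_within_bounds(DiffStats(files_changed=42, lines_changed=3100)))  # False
```

The thresholds are a policy decision, not a constant: a monorepo migration sprint warrants looser bounds than routine bug-fix work.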
Both evaluation approaches involve tradeoffs. LLM-as-judge is cheap and scales, but it introduces correlated errors when the judge shares biases with the agent under test. Human annotation is the gold standard but cannot keep pace with agent throughput. Braintrust identifies annotation as a distinct sub-pillar because human feedback is what calibrates the LLM judges themselves and keeps evaluator quality in check.
Pillar 3: Cost and Token Tracking
Coding agents in multi-agent systems autonomously chain LLM and API calls, so real-time cost tracking is essential. Langfuse tracks usage types including prompt_tokens, completion_tokens, total_tokens, cached_tokens, audio_tokens, and image_tokens, with aggregated cost propagated up the trace tree and color-coded to identify outliers.
The tradeoff at this pillar is where the measurement happens. Proxy-based tracking (Helicone's approach) captures cost with zero code changes but adds latency to every LLM call. SDK-based tracking (Langfuse, LangSmith) adds no request-path latency but requires instrumentation in every service. For teams running latency-sensitive agents against interactive users, the proxy overhead can push p95 past acceptable thresholds. For teams doing batch code generation, the proxy is usually fine.
Pillar 4: Attribution
Attribution answers: which agent, running which model version, made which tool call that produced which output? OpenTelemetry defines an agent identity schema including gen_ai.agent.id, gen_ai.agent.name, gen_ai.agent.description, and gen_ai.agent.version. All four attributes are specified as conditionally required (populated when the application provides the value).
Non-determinism compounds across agents. The MAESTRO evaluation suite documents this empirically: multi-agent systems remain structurally stable yet temporally variable across repeated runs, with architecture itself emerging as the dominant driver of reproducibility and cost-latency-accuracy tradeoffs. Without attribution fields on every span, failures traced to Agent C cannot be traced back to Agent A's bad output, and post-mortems stall at "the pipeline produced a bug somewhere."
Intent handles this pillar through isolation instead of instrumentation. Each implementor agent runs in a dedicated git worktree with its own MCP connections, so cost, latency, and output quality are naturally scoped per agent without SDK integration.
| Pillar | What It Captures | Key Schema Fields |
|---|---|---|
| Traces and Spans | Full execution tree: LLM calls, tool calls, retrieval, control flow | trace_id, span_id, parent_span_id, type, latency_ms |
| Evaluations | Quality scores per span; online and offline; LLM judge, code checks, human annotation | scores[].name, scores[].value, scores[].source |
| Cost and Token Tracking | Token counts per type, USD cost per type, aggregated up trace tree | prompt_tokens, cached_tokens, cost_details.total |
| Attribution | Agent identity, model version, user/session context on every span | agent_type, agent_version, model, user_id, session_id |
See how Intent isolates agent work in separate environments backed by git worktrees.
Free tier available · VS Code extension · Takes 2 minutes
OpenTelemetry for AI Agent Workflows
OpenTelemetry's GenAI semantic conventions provide an emerging vendor-neutral standard for instrumenting LLM calls, with agent-specific instrumentation still evolving. The entire GenAI namespace remains in Development status, which has a direct engineering consequence: instrumentation libraries depending on these conventions cannot ship stable releases. OpenTelemetry's stability proposal acknowledges this directly, noting that many instrumentation libraries are stuck on pre-release versions because they depend on experimental semantic conventions.
Recommendation: Teams already running OTel for application telemetry should adopt GenAI conventions now and accept the breaking-change risk, since the alternative is a parallel observability stack. Teams starting fresh should evaluate OpenInference, which offers richer agent-specific span kinds (AGENT, TOOL, RETRIEVER, RERANKER) and is stable enough to build on today.
Instrumentation Example
A compliant LLM inference span requires gen_ai.operation.name and gen_ai.provider.name, and includes gen_ai.request.model when available. Token tracking attributes carry a Recommended requirement level.
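As a concrete sketch, the attribute set for such a span, shown as a plain map (names follow the OTel GenAI conventions; values are illustrative, and with the OTel SDK each entry would be set via span.set_attribute):

```python
# Attribute names follow the OTel GenAI semantic conventions.
# Values are illustrative, not tied to any specific provider response.
llm_span_attributes = {
    # Required on every LLM inference span
    "gen_ai.operation.name": "chat",
    "gen_ai.provider.name": "anthropic",
    # Conditionally required: populated when the model name is known
    "gen_ai.request.model": "claude-sonnet-4",
    # Recommended: token usage, the basis for cost attribution
    "gen_ai.usage.input_tokens": 1843,
    "gen_ai.usage.output_tokens": 512,
}
```

Keeping token usage on the span itself is what lets a backend aggregate cost up the trace tree without joining against provider billing data.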
For teams that prefer not to write span attributes by hand, OpenLLMetry auto-instruments OpenAI, Anthropic, AWS Bedrock, LangChain, LlamaIndex, and others.
Key Limitations
Before committing to OTel GenAI conventions, account for these documented gaps:
- Missing agentic primitives. Standardized attributes for tasks, actions, teams, artifacts, and memory are not yet defined. Issue #2664 identifies these gaps and issue #2665 proposes conventions for tasks.
- Sensitive content capture. Prompt and completion content may be captured, but because it can contain secrets and PII, implementations should minimize collection and apply filtering or redaction at the span processor level.
- Opt-in tool arguments. Tool call arguments (gen_ai.tool.call.arguments) are not captured by default, which limits debugging depth unless explicitly enabled.
- Complex attribute types. Any-typed attributes are not efficiently queryable across all backends, which affects dashboard design.
Tool Comparison: LangSmith, Braintrust, Arize Phoenix, Helicone
Choosing an observability platform depends on framework coupling, evaluation depth, hosting requirements, and cost structure.
| Dimension | LangSmith | Braintrust | Arize Phoenix (OSS) | Helicone |
|---|---|---|---|---|
| Primary integration | SDK-based, lowest friction for LangChain | SDK (framework-agnostic) | SDK (OTel/OpenInference) | Proxy-first; SDK available |
| Trace depth | Full tree with nested spans | Tree- or DAG-structured spans with typed taxonomy | Full hierarchy with span replay | Sessions-based grouping |
| Evaluation | Offline datasets, online monitoring, LLM-as-judge, human annotation | Five-stage eval lifecycle with CI/CD integration | LLM-based evals, Ragas integration, hand-annotated datasets | Reporting and analysis via integrations |
| Cost tracking | Per-step token and cost; multi-modal support | Token and cost per llm span | Via OTel span attributes | Per-request cost, cache savings |
| Self-hosting | Enterprise only | Enterprise only | Free, no external dependencies | All tiers |
| Open source | No | No | Yes | Yes |
| Caching/rate limiting | No | No | No | Yes |
| Base pricing | Free (5K traces/mo, 1 seat); $39/seat/mo Plus | Free (1M spans); $249/mo Pro | Free (self-hosted); AX Pro $50/mo | Free (10K req/mo, 7-day retention); $79/mo Pro |
Selection Guidance with Tradeoffs
Each platform has a failure mode worth knowing before adoption:
- LangSmith delivers the fastest path to useful traces for LangChain/LangGraph teams, validated in production at Acxiom and ServiceNow. The tradeoff is framework lock-in plus per-seat pricing: at $39/seat/month on Plus, a team of five pays $195/month before any trace overage, and teams that later migrate off LangChain face a second observability migration.
- Braintrust is framework-agnostic and its five-stage eval lifecycle is the strongest in the comparison, but the jump from free to $249/month Pro is steep for teams with modest trace volume. Usage-based overages push costs higher for eval-heavy workflows.
- Arize Phoenix is the only free self-hostable option with serious evaluation depth, but teams absorb the operational cost: running it on a ClickHouse or Postgres backend at production scale requires dedicated ownership.
- Helicone has the lowest integration friction through its proxy model and adds caching and rate limiting, but the proxy sits on the request path and adds latency. The free Hobby tier also caps data retention at 7 days; Pro extends it to 1 month. Evaluation capabilities are shallower than the other three.
Each tool addresses traces, cost, and some degree of evaluation. Attribution remains the gap, requiring either manual instrumentation through these tools or an architectural approach that provides attribution structurally.
How Intent Provides Built-In Observability
Intent, the Augment Code desktop workspace for agent orchestration, takes a different approach to multi-agent observability: structural isolation instead of instrumentation-based tracing. Intent builds the attribution boundary into the workspace itself, so developers do not need to propagate trace IDs across agents.
Architecture-Driven Attribution
Intent's workspace model creates clean observability boundaries. Each implementor agent operates in an isolated git worktree with its own MCP (Model Context Protocol) connections. This architecture produces clear attribution points for cost, latency, and output quality per agent and per task.
Because agents are physically separated by worktree and MCP boundaries, attribution falls out of the architecture without SDK integration or span propagation code. Teams get per-agent granularity without building custom trace propagation. Living specs act as a second observability layer: every agent reads from and writes to the spec, producing a persistent record of what each agent was tasked with and what it delivered.
| Isolation Mechanism | Observability Effect |
|---|---|
| Worktree-level isolation | Each workspace backed by its own git worktree; parallel agent work without code conflicts |
| MCP connection isolation | Tool access and external service connections scoped per agent; cost and latency attributable at the connection boundary |
| Living spec coordination | Agents read from and update a shared spec, creating a persistent record of what each agent was tasked with and what it delivered |
What Intent Does Not Replace
Structural isolation narrows the observability surface by design. Teams running the following workflows still need a dedicated observability platform alongside Intent:
- Prompt and completion capture. Intent does not store the raw prompts sent to each agent or the full completion content, which LangSmith and Phoenix do.
- Prompt versioning and A/B testing. Iterative prompt engineering against historical traces is a core Braintrust feature with no direct equivalent in Intent.
- Per-tool-call latency breakdowns. Intent attributes cost and latency per agent; teams needing per-LLM-call timing data to debug slow tool invocations need an SDK-based stack.
- Offline regression suites. CI-integrated eval runs against curated datasets live in Braintrust or LangSmith, not in Intent's workspace.
The two approaches cover different ground. Intent handles attribution cleanly without setup cost. External SDKs capture the prompt-level and eval-level detail Intent does not store.
Structural vs. Instrumentation-Based Approaches
| Dimension | Intent (Structural) | LangSmith / Braintrust (Instrumentation) |
|---|---|---|
| Attribution mechanism | Isolated worktree and MCP boundaries | SDK trace propagation, span IDs |
| Setup required | Built into workspace architecture | SDK integration, proxy configuration |
| Attribution granularity | Per agent, per task | Per LLM call, per chain step, per tool invocation |
| Framework coupling | Coordinator-based orchestration with BYOA support for Claude Code, Codex, and OpenCode | LangSmith strongest in LangChain; Braintrust framework-agnostic |
| Evaluation/experimentation | Handled through the Verifier agent against the living spec | Core feature; supports datasets, A/B tests, prompt versioning |
Teams coordinating multi-agent coding sessions in Intent work with isolated execution contexts through git worktrees, and the spec doubles as a coordination record that survives the session.
Setting Up an Observability Pipeline for Multi-Agent Sprints
A practical pipeline for multi-agent coding sessions flows through five layers: agent code, OTel SDK with span processors, OTel Collector, storage backends, and alerting.
Step 1: Instrument Agent Code
Pick a semantic convention first. OTel GenAI conventions are the emerging vendor-neutral standard but remain in Development; OpenInference conventions are more mature for agent-specific span kinds (AGENT, TOOL, RETRIEVER, RERANKER). Teams feeding traces into Phoenix or Arize should start with OpenInference; teams aligning with broader platform telemetry should accept OTel GenAI's stability risk.
Set gen_ai.agent.id, gen_ai.agent.name, and gen_ai.agent.version on agent root spans, whether spans are emitted directly through the OTel SDK or through a framework integration such as the OpenAI Agents SDK.
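A framework-neutral sketch of building this attribute set for a root span (the helper and values are illustrative; how the attributes get attached depends on the SDK in use):

```python
def agent_identity_attributes(agent_id: str, name: str, version: str) -> dict:
    """Build the OTel GenAI agent-identity attributes for an agent's root
    span. All are conditionally required: populate them whenever the
    orchestrator knows which agent is running."""
    return {
        "gen_ai.agent.id": agent_id,
        "gen_ai.agent.name": name,
        "gen_ai.agent.version": version,
    }

# Child spans inherit attribution by living under this root span,
# so stamping identity once per agent run is enough.
root_attrs = agent_identity_attributes("impl-7", "implementor", "1.4.0")
```

With identity on the root span, any backend that groups by gen_ai.agent.id can answer "which agent produced this output" without custom propagation code.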
Step 2: Add PII Redaction at the Span Processor Level
SDK-level redaction is the safest default because sensitive data never leaves the process. Collector-level redaction centralizes rules but requires the Collector to sit inside the trust boundary.
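The redaction rule itself is independent of where it runs. A minimal sketch of the value scrubber an SDK-level span processor would call before attributes leave the process (the patterns are illustrative and need tuning per codebase):

```python
import re

# Illustrative PII patterns; a real deployment would tune and extend these.
_REDACTIONS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "<redacted:email>"),
    (re.compile(r"\bsk-[A-Za-z0-9]{16,}\b"), "<redacted:api-key>"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "<redacted:ssn>"),
]

def scrub(value: str) -> str:
    """Redact PII-looking substrings from a span attribute value.
    Called from a span processor so raw content never reaches the exporter."""
    for pattern, replacement in _REDACTIONS:
        value = pattern.sub(replacement, value)
    return value

print(scrub("contact dev@example.com with key sk-abcdef1234567890XY"))
# -> contact <redacted:email> with key <redacted:api-key>
```

Running this in-process is what makes SDK-level redaction the safe default: a Collector-level rule applies the same logic, but only after the raw value has already crossed the network.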
Step 3: Configure Alerting for Agent Failure Modes
Tool-call loops and context overflow are good paging candidates because they correlate with runaway cost; latency degradation is usually dashboard-only unless agents serve interactive users. Thresholds below draw on OpenObserve's AI agent monitoring reference:
| Metric | Alert Threshold | Failure Mode Detected |
|---|---|---|
| tool_call.count per session | Greater than 20 | Infinite tool-call loop |
| agent.steps | Greater than configured max | Stuck reasoning chain |
| error.rate | Greater than 1% over 5 minutes | Systematic failures |
| llm.usage.prompt_tokens | Greater than 80% of context window | Context overflow risk |
| duration (end-to-end) | p95 greater than 10s for interactive agents | Latency degradation |
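The per-session rows of this table translate into a simple check (field names, the context window size, and limits are illustrative):

```python
def session_alerts(metrics: dict, context_window: int = 200_000) -> list:
    """Return the alert names a session's metrics trip, mirroring the
    per-session thresholds: loop detection, stuck chains, context overflow."""
    alerts = []
    if metrics.get("tool_call_count", 0) > 20:
        alerts.append("tool-call-loop")
    if metrics.get("agent_steps", 0) > metrics.get("max_steps", 50):
        alerts.append("stuck-reasoning-chain")
    if metrics.get("prompt_tokens", 0) > 0.8 * context_window:
        alerts.append("context-overflow-risk")
    return alerts

print(session_alerts({"tool_call_count": 37, "prompt_tokens": 175_000}))
# -> ['tool-call-loop', 'context-overflow-risk']
```

Error rate and p95 duration are aggregate signals and live in the metrics backend's alert rules rather than in a per-session check like this one.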
Step 4: Choose a Storage and Visualization Stack
Buy-versus-build is the core decision, driven by trace volume, data residency, and engineering capacity:
- Starter (1-3 engineers): Langfuse Cloud with OpenInference instrumentors and Slack webhooks. Hands-off; accept vendor data retention.
- Mid-scale (5-15 engineers): OTel SDK plus Collector with fan-out, self-hosted Langfuse or Arize Phoenix, Prometheus and Grafana for metrics. Dedicated ownership pays off once volume exceeds cloud tier limits.
- Enterprise (data residency, compliance): OTel Collector as sidecar with tenant routing, self-hosted Langfuse on ClickHouse. Langfuse's Cresta case study validates this pattern, keeping trace data inside the security boundary under full retention control.
One decision cuts across all four steps: sampling. Full-fidelity capture is affordable in development but breaks storage budgets at scale. A common production pattern is tail-based sampling — retain 100% of traces containing errors, 100% of traces exceeding a cost threshold (e.g., $0.50), and 5-10% of the remainder.
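The tail-sampling rule from this pattern fits in a few lines (the thresholds and the 10% baseline rate are illustrative):

```python
import random

def keep_trace(has_error: bool, cost_usd: float,
               baseline_rate: float = 0.10,
               cost_threshold: float = 0.50,
               rng=random) -> bool:
    """Tail-based sampling decision: always retain traces with errors or
    high cost; sample the healthy remainder at the baseline rate."""
    if has_error or cost_usd >= cost_threshold:
        return True
    return rng.random() < baseline_rate
```

The decision runs after the trace completes, which is the defining property of tail-based sampling: the error flag and final cost are only known once every span has closed.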
Start with Trace Attribution Before the Next Multi-Agent Sprint
The core tradeoff in agent observability is scope versus setup. Instrumentation-based stacks give teams deeper prompt, span, and evaluation visibility in exchange for schema decisions, propagation code, and storage design. Structural approaches narrow the data surface in exchange for near-zero setup cost.
A practical 30-day plan for teams starting from zero:
- Week 1: Establish a trace boundary for every agent run. Even a single root span with agent.id, session.id, and start/end timestamps beats no attribution.
- Week 2: Instrument one agent end-to-end with OpenInference span kinds. Pick the highest-cost agent; the cost data pays for the work.
- Week 3: Add cost attribution at the LLM call level. Tag every call with agent.id and aggregate by agent in Langfuse or Phoenix.
- Week 4: Wire alerts for tool-call loops and context overflow. These two detect the most expensive silent failures.
For teams coordinating parallel coding agents, Intent's isolated worktrees, dedicated MCP connections, and living specs provide week-1 attribution out of the box, which shortens the runway to weeks 2-4.
See how Intent's workspace isolation gives every agent a traceable execution boundary without custom instrumentation.
Free tier available · VS Code extension · Takes 2 minutes
Written by

Ani Galstian
Ani writes about enterprise-scale AI coding tool evaluation, agentic development security, and the operational patterns that make AI agents reliable in production. His guides cover topics like AGENTS.md context files, spec-as-source-of-truth workflows, and how engineering teams should assess AI coding tools across dimensions like auditability and security compliance.