AI agent observability for coding requires structured tracing across four dimensions: execution traces, output evaluations, token cost attribution, and per-agent identity tracking, because traditional APM tools cannot detect the non-deterministic, multi-step failure modes that define agent workflows.
TL;DR
Attribution is the pillar most observability stacks get wrong. Traces, evaluations, and cost tracking have mature tooling; answering "which agent produced this broken code, under which model version, at what cost" usually does not. Two paths solve the attribution gap. SDK-based instrumentation (LangSmith, Braintrust, Phoenix) captures every span in exchange for setup work. Structural isolation (Intent's per-agent worktrees and MCP connections) makes attribution a property of the workspace itself.
Why Multi-Agent Coding Sessions Are Invisible to Standard Tooling
A four-hour multi-agent coding session can produce 180+ tool calls, tens of dollars in token spend, and a silent authentication regression that passes every health check. When the bug surfaces the next morning, engineering teams need to answer three questions at once: which agent wrote the broken code, what context did it have, and how much did the session cost to produce it?
Traditional APM tools like Datadog, New Relic, and Prometheus were built around a deterministic request-response contract. A request arrives, code executes a predictable path, and a response returns. AI coding agents violate every assumption embedded in that model, so the resulting failure modes are invisible:
- No stable baseline. Honeycomb describes LLM-based systems as nondeterministic, capable of producing different outputs given the same input depending on shifts in content, data, or prompt phrasing.
- No exception on wrong answers. An agent can return 200 OK while generating code that compiles, passes partial tests, and ships a security bug.
- No per-session state. Pre-aggregated metrics collapse the high-cardinality context (session ID, agent ID, prompt version) that makes agent debugging possible.
Engineering teams need purpose-built observability that captures whether agents executed, whether the output was correct, which agent produced it, and what it cost. Intent addresses the attribution dimension directly by isolating each agent in its own git worktree, so per-agent observability becomes a property of the workspace.
See how Intent's living specs keep parallel agents aligned across cross-service refactors.
Free tier available · VS Code extension · Takes 2 minutes
Where Traditional Monitoring Breaks Down
Traditional APM fails for AI coding agents across six distinct failure modes, each rooted in a broken architectural assumption.
| Failure Mode | Traditional APM Signal | Why Detection Fails |
|---|---|---|
| Non-deterministic outputs | None | No stable baseline to compare against |
| Multi-step reasoning failure | Partial: spans visible, not correctness | Wrong decision at step 3 invisible until step 12 |
| Tool-calling loops | None: all spans succeed | Pre-aggregation cannot represent per-session state |
| Output quality / hallucination | None: agent returns 200 OK | APM measures execution, not correctness |
| High-cardinality context loss | Architecturally penalized | Pre-aggregated metrics discard session-specific data |
| Silent drift without crash | No alert condition met | Absence of exception is not evidence of correctness |
The tool-call loop is the canonical example. An agent rewrites a function, runs tests, observes a failure, attempts a fix, and loops. Each iteration is a valid LLM call returning within normal latency bounds. Traditional APM reports every span as successful while the agent burns tokens in a loop that no individual span can reveal. Honeycomb documents this pattern in production agent deployments and emphasizes observability to detect when queries fail to converge.
Runtime decisions in agentic systems happen inside the model, which makes traces the primary artifact for debugging. Traditional software engineers read source code to understand behavior; agent engineers read traces.
The Four Pillars of Agent Observability
Agent observability for multi-agent coding systems organizes around four distinct pillars, each capturing data that the others cannot provide.
Pillar 1: Traces and Spans
A trace represents the complete lifecycle of an agent task, structured as a parent-child tree of typed observations. Typed span kinds matter because a generic "span" collapses the distinctions an agent debugger needs. Langfuse's taxonomy now covers an expanded set including Generation (LLM calls with token usage), Tool (function calls), Retriever (retrieval steps), Agent (agent-level operations), Chain, Evaluator, Embedding, and Guardrail, among others.
The key tradeoff at this pillar is span granularity versus storage cost. Capturing every tool argument and every retrieval result produces rich debugging data along with rapidly expanding storage bills. Most production stacks use tail-based sampling, keeping all traces that contain errors or exceed a cost threshold while sampling the remainder at 5-10%. One failure class makes the granularity worth the cost: in Plan-and-Execute architectures, an executor can successfully retrieve the correct answer while the replanner repeatedly rejects the intermediate result, causing timeouts. Without typed spans, that pattern is invisible in aggregate metrics.
Pillar 2: Evaluations
Evaluations answer whether what happened was correct, something traces alone cannot determine. Langfuse distinguishes two evaluation environments: online evaluations running against live production traces for real-time monitoring, and offline evaluations running against dataset benchmarks for CI and regression testing.
For a team starting from zero, a minimum viable eval suite for coding agents is:
- Output compiles. Cheap, deterministic, catches the most common hallucination.
- Tests pass. Slower but catches regressions traces cannot detect.
- Diff stays within bounds. Flag PRs that modify more than N files or exceed a line threshold, which catches runaway refactors.
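The diff-bounds check in particular is cheap to implement. A minimal sketch, with illustrative thresholds and field names:

```python
from dataclasses import dataclass

@dataclass
class DiffStats:
    files_changed: int
    lines_changed: int

def diff_within_bounds(diff: DiffStats,
                       max_files: int = 10,
                       max_lines: int = 500) -> bool:
    """Flag runaway refactors: a change touching too many files or lines
    should be held for review rather than merged automatically."""
    return diff.files_changed <= max_files and diff.lines_changed <= max_lines

# A focused fix passes; a sweeping rewrite is flagged.
print(diff_within_bounds(DiffStats(files_changed=3, lines_changed=120)))    # True
print(diff_within_bounds(DiffStats(files_changed=42, lines_changed=3100)))  # False
```

The thresholds are a policy decision, not a constant: a monorepo migration sprint warrants looser bounds than routine bug-fix work.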
Both evaluation approaches involve tradeoffs. LLM-as-judge is cheap and scales, but it introduces correlated errors when the judge shares biases with the agent under test. Human annotation is the gold standard but cannot keep pace with agent throughput. Braintrust identifies annotation as a distinct sub-pillar because human feedback is what calibrates the LLM judges themselves and keeps evaluator quality in check.
Pillar 3: Cost and Token Tracking
Coding agents in multi-agent systems autonomously chain LLM and API calls, so real-time cost tracking is essential. Langfuse tracks usage types including prompt_tokens, completion_tokens, total_tokens, cached_tokens, audio_tokens, and image_tokens, with aggregated cost propagated up the trace tree and color-coded to identify outliers.
The tradeoff at this pillar is where the measurement happens. Proxy-based tracking (Helicone's approach) captures cost with zero code changes but adds latency to every LLM call. SDK-based tracking (Langfuse, LangSmith) adds no request-path latency but requires instrumentation in every service. For teams running latency-sensitive agents against interactive users, the proxy overhead can push p95 past acceptable thresholds. For teams doing batch code generation, the proxy is usually fine.
Pillar 4: Attribution
Attribution answers: which agent, running which model version, made which tool call that produced which output? OpenTelemetry defines an agent identity schema including gen_ai.agent.id, gen_ai.agent.name, gen_ai.agent.description, and gen_ai.agent.version. All four attributes are specified as conditionally required (populated when the application provides the value).
Non-determinism compounds across agents. The MAESTRO evaluation suite documents this empirically: multi-agent systems remain structurally stable yet temporally variable across repeated runs, with architecture itself emerging as the dominant driver of reproducibility and cost-latency-accuracy tradeoffs. Without attribution fields on every span, failures traced to Agent C cannot be traced back to Agent A's bad output, and post-mortems stall at "the pipeline produced a bug somewhere."
Intent handles this pillar through isolation instead of instrumentation. Each implementor agent runs in a dedicated git worktree with its own MCP connections, so cost, latency, and output quality are naturally scoped per agent without SDK integration.
| Pillar | What It Captures | Key Schema Fields |
|---|---|---|
| Traces and Spans | Full execution tree: LLM calls, tool calls, retrieval, control flow | trace_id, span_id, parent_span_id, type, latency_ms |
| Evaluations | Quality scores per span; online and offline; LLM judge, code checks, human annotation | scores[].name, scores[].value, scores[].source |
| Cost and Token Tracking | Token counts per type, USD cost per type, aggregated up trace tree | prompt_tokens, cached_tokens, cost_details.total |
| Attribution | Agent identity, model version, user/session context on every span | agent_type, agent_version, model, user_id, session_id |
See how Intent isolates agent work in separate environments backed by git worktrees.
Free tier available · VS Code extension · Takes 2 minutes
OpenTelemetry for AI Agent Workflows
OpenTelemetry's GenAI semantic conventions provide an emerging vendor-neutral standard for instrumenting LLM calls, with agent-specific instrumentation still evolving. The entire GenAI namespace remains in Development status, which has a direct engineering consequence: instrumentation libraries depending on these conventions cannot ship stable releases. OpenTelemetry's stability proposal acknowledges this directly, noting that many instrumentation libraries are stuck on pre-release versions because they depend on experimental semantic conventions.
Recommendation: Teams already running OTel for application telemetry should adopt GenAI conventions now and accept the breaking-change risk, since the alternative is a parallel observability stack. Teams starting fresh should evaluate OpenInference, which offers richer agent-specific span kinds (AGENT, TOOL, RETRIEVER, RERANKER) and is stable enough to build on today.
Instrumentation Example
A compliant LLM inference span requires gen_ai.operation.name and gen_ai.provider.name, and includes gen_ai.request.model when available. Token tracking attributes carry a Recommended requirement level.
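As a concrete sketch, the attribute set for such a span, shown as a plain map (names follow the OTel GenAI conventions; values are illustrative, and with the OTel SDK each entry would be set via span.set_attribute):

```python
# Attribute names follow the OTel GenAI semantic conventions.
# Values are illustrative, not tied to any specific provider response.
llm_span_attributes = {
    # Required on every LLM inference span
    "gen_ai.operation.name": "chat",
    "gen_ai.provider.name": "anthropic",
    # Conditionally required: populated when the model name is known
    "gen_ai.request.model": "claude-sonnet-4",
    # Recommended: token usage, the basis for cost attribution
    "gen_ai.usage.input_tokens": 1843,
    "gen_ai.usage.output_tokens": 512,
}
```

Keeping token usage on the span itself is what lets a backend aggregate cost up the trace tree without joining against provider billing data.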
For teams that prefer not to write span attributes by hand, OpenLLMetry auto-instruments OpenAI, Anthropic, AWS Bedrock, LangChain, LlamaIndex, and others.
Key Limitations
Before committing to OTel GenAI conventions, account for these documented gaps:
- Missing agentic primitives. Standardized attributes for tasks, actions, teams, artifacts, and memory are not yet defined. Issue #2664 identifies these gaps and issue #2665 proposes conventions for tasks.
- Sensitive content capture. Prompt and completion content may be captured, but because it can contain secrets and PII, implementations should minimize collection and apply filtering or redaction at the span processor level.
- Opt-in tool arguments. Tool call arguments (gen_ai.tool.call.arguments) are not captured by default, which limits debugging depth unless explicitly enabled.
- Complex attribute types. Any-typed attributes are not efficiently queryable across all backends, which affects dashboard design.
Tool Comparison: LangSmith, Braintrust, Arize Phoenix, Helicone
Choosing an observability platform depends on framework coupling, evaluation depth, hosting requirements, and cost structure.
| Dimension | LangSmith | Braintrust | Arize Phoenix (OSS) | Helicone |
|---|---|---|---|---|
| Primary integration | SDK-based, lowest friction for LangChain | SDK (framework-agnostic) | SDK (OTel/OpenInference) | Proxy-first; SDK available |
| Trace depth | Full tree with nested spans | Tree- or DAG-structured spans with typed taxonomy | Full hierarchy with span replay | Sessions-based grouping |
| Evaluation | Offline datasets, online monitoring, LLM-as-judge, human annotation | Five-stage eval lifecycle with CI/CD integration | LLM-based evals, Ragas integration, hand-annotated datasets | Reporting and analysis via integrations |
| Cost tracking | Per-step token and cost; multi-modal support | Token and cost per llm span | Via OTel span attributes | Per-request cost, cache savings |
| Self-hosting | Enterprise only | Enterprise only | Free, no external dependencies | All tiers |
| Open source | No | No | Yes | Yes |
| Caching/rate limiting | No | No | No | Yes |
| Base pricing | Free (5K traces/mo, 1 seat); $39/seat/mo Plus | Free (1M spans); $249/mo Pro | Free (self-hosted); AX Pro $50/mo | Free (10K req/mo, 7-day retention); $79/mo Pro |
Selection Guidance with Tradeoffs
Each platform has a failure mode worth knowing before adoption:
- LangSmith delivers the fastest path to useful traces for LangChain/LangGraph teams, validated in production at Acxiom and ServiceNow. The tradeoff is framework lock-in plus per-seat pricing: at $39/seat/month on Plus, a team of five pays $195/month before any trace overage, and teams that later migrate off LangChain face a second observability migration.
- Braintrust is framework-agnostic and its five-stage eval lifecycle is the strongest in the comparison, but the jump from free to $249/month Pro is steep for teams with modest trace volume. Usage-based overages push costs higher for eval-heavy workflows.
- Arize Phoenix is the only free self-hostable option with serious evaluation depth, but teams absorb the operational cost: running it on a ClickHouse or Postgres backend at production scale requires dedicated ownership.
- Helicone has the lowest integration friction through its proxy model and adds caching and rate limiting, but the proxy sits on the request path and adds latency. The free Hobby tier also caps data retention at 7 days; Pro extends it to 1 month. Evaluation capabilities are shallower than the other three.
Each tool addresses traces, cost, and some degree of evaluation. Attribution remains the gap, requiring either manual instrumentation through these tools or an architectural approach that provides attribution structurally.
How Intent Provides Built-In Observability
Intent, the Augment Code desktop workspace for agent orchestration, takes a different approach to multi-agent observability: structural isolation instead of instrumentation-based tracing. Intent builds the attribution boundary into the workspace itself, so developers do not need to propagate trace IDs across agents.
Architecture-Driven Attribution
Intent's workspace model creates clean observability boundaries. Each implementor agent operates in an isolated git worktree with its own MCP (Model Context Protocol) connections. This architecture produces clear attribution points for cost, latency, and output quality per agent and per task.
Because agents are physically separated by worktree and MCP boundaries, attribution falls out of the architecture without SDK integration or span propagation code. Teams get per-agent granularity without building custom trace propagation. Living specs act as a second observability layer: every agent reads from and writes to the spec, producing a persistent record of what each agent was tasked with and what it delivered.
| Isolation Mechanism | Observability Effect |
|---|---|
| Worktree-level isolation | Each workspace backed by its own git worktree; parallel agent work without code conflicts |
| MCP connection isolation | Tool access and external service connections scoped per agent; cost and latency attributable at the connection boundary |
| Living spec coordination | Agents read from and update a shared spec, creating a persistent record of what each agent was tasked with and what it delivered |
What Intent Does Not Replace
Structural isolation narrows the observability surface by design. Teams running the following workflows still need a dedicated observability platform alongside Intent:
- Prompt and completion capture. Intent does not store the raw prompts sent to each agent or the full completion content, which LangSmith and Phoenix do.
- Prompt versioning and A/B testing. Iterative prompt engineering against historical traces is a core Braintrust feature with no direct equivalent in Intent.
- Per-tool-call latency breakdowns. Intent attributes cost and latency per agent; teams needing per-LLM-call timing data to debug slow tool invocations need an SDK-based stack.
- Offline regression suites. CI-integrated eval runs against curated datasets live in Braintrust or LangSmith, not in Intent's workspace.
The two approaches cover different ground. Intent handles attribution cleanly without setup cost. External SDKs capture the prompt-level and eval-level detail Intent does not store.
Structural vs. Instrumentation-Based Approaches
| Dimension | Intent (Structural) | LangSmith / Braintrust (Instrumentation) |
|---|---|---|
| Attribution mechanism | Isolated worktree and MCP boundaries | SDK trace propagation, span IDs |
| Setup required | Built into workspace architecture | SDK integration, proxy configuration |
| Attribution granularity | Per agent, per task | Per LLM call, per chain step, per tool invocation |
| Framework coupling | Coordinator-based orchestration with BYOA support for Claude Code, Codex, and OpenCode | LangSmith strongest in LangChain; Braintrust framework-agnostic |
| Evaluation/experimentation | Handled through the Verifier agent against the living spec | Core feature; supports datasets, A/B tests, prompt versioning |
Teams coordinating multi-agent coding sessions in Intent work with isolated execution contexts through git worktrees, and the spec doubles as a coordination record that survives the session.
Setting Up an Observability Pipeline for Multi-Agent Sprints
A practical pipeline for multi-agent coding sessions flows through five layers: agent code, OTel SDK with span processors, OTel Collector, storage backends, and alerting.
Step 1: Instrument Agent Code
Pick a semantic convention first. OTel GenAI conventions are the emerging vendor-neutral standard but remain in Development; OpenInference conventions are more mature for agent-specific span kinds (AGENT, TOOL, RETRIEVER, RERANKER). Teams feeding traces into Phoenix or Arize should start with OpenInference; teams aligning with broader platform telemetry should accept OTel GenAI's stability risk.
Set gen_ai.agent.id, gen_ai.agent.name, and gen_ai.agent.version on agent root spans, whether spans are emitted directly through the OTel SDK or through a framework integration such as the OpenAI Agents SDK.
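A framework-neutral sketch of building this attribute set for a root span (the helper and values are illustrative; how the attributes get attached depends on the SDK in use):

```python
def agent_identity_attributes(agent_id: str, name: str, version: str) -> dict:
    """Build the OTel GenAI agent-identity attributes for an agent's root
    span. All are conditionally required: populate them whenever the
    orchestrator knows which agent is running."""
    return {
        "gen_ai.agent.id": agent_id,
        "gen_ai.agent.name": name,
        "gen_ai.agent.version": version,
    }

# Child spans inherit attribution by living under this root span,
# so stamping identity once per agent run is enough.
root_attrs = agent_identity_attributes("impl-7", "implementor", "1.4.0")
```

With identity on the root span, any backend that groups by gen_ai.agent.id can answer "which agent produced this output" without custom propagation code.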
Step 2: Add PII Redaction at the Span Processor Level
SDK-level redaction is the safest default because sensitive data never leaves the process. Collector-level redaction centralizes rules but requires the Collector to sit inside the trust boundary.
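The redaction rule itself is independent of where it runs. A minimal sketch of the value scrubber an SDK-level span processor would call before attributes leave the process (the patterns are illustrative and need tuning per codebase):

```python
import re

# Illustrative PII patterns; a real deployment would tune and extend these.
_REDACTIONS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "<redacted:email>"),
    (re.compile(r"\bsk-[A-Za-z0-9]{16,}\b"), "<redacted:api-key>"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "<redacted:ssn>"),
]

def scrub(value: str) -> str:
    """Redact PII-looking substrings from a span attribute value.
    Called from a span processor so raw content never reaches the exporter."""
    for pattern, replacement in _REDACTIONS:
        value = pattern.sub(replacement, value)
    return value

print(scrub("contact dev@example.com with key sk-abcdef1234567890XY"))
# -> contact <redacted:email> with key <redacted:api-key>
```

Running this in-process is what makes SDK-level redaction the safe default: a Collector-level rule applies the same logic, but only after the raw value has already crossed the network.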
Step 3: Configure Alerting for Agent Failure Modes
Tool-call loops and context overflow are good paging candidates because they correlate with runaway cost; latency degradation is usually dashboard-only unless agents serve interactive users. Thresholds below draw on OpenObserve's AI agent monitoring reference:
| Metric | Alert Threshold | Failure Mode Detected |
|---|---|---|
| tool_call.count per session | Greater than 20 | Infinite tool-call loop |
| agent.steps | Greater than configured max | Stuck reasoning chain |
| error.rate | Greater than 1% over 5 minutes | Systematic failures |
| llm.usage.prompt_tokens | Greater than 80% of context window | Context overflow risk |
| duration (end-to-end) | p95 greater than 10s for interactive agents | Latency degradation |
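The per-session rows of this table translate into a simple check (field names, the context window size, and limits are illustrative):

```python
def session_alerts(metrics: dict, context_window: int = 200_000) -> list:
    """Return the alert names a session's metrics trip, mirroring the
    per-session thresholds: loop detection, stuck chains, context overflow."""
    alerts = []
    if metrics.get("tool_call_count", 0) > 20:
        alerts.append("tool-call-loop")
    if metrics.get("agent_steps", 0) > metrics.get("max_steps", 50):
        alerts.append("stuck-reasoning-chain")
    if metrics.get("prompt_tokens", 0) > 0.8 * context_window:
        alerts.append("context-overflow-risk")
    return alerts

print(session_alerts({"tool_call_count": 37, "prompt_tokens": 175_000}))
# -> ['tool-call-loop', 'context-overflow-risk']
```

Error rate and p95 duration are aggregate signals and live in the metrics backend's alert rules rather than in a per-session check like this one.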
Step 4: Choose a Storage and Visualization Stack
Buy-versus-build is the core decision, driven by trace volume, data residency, and engineering capacity:
- Starter (1-3 engineers): Langfuse Cloud with OpenInference instrumentors and Slack webhooks. Hands-off; accept vendor data retention.
- Mid-scale (5-15 engineers): OTel SDK plus Collector with fan-out, self-hosted Langfuse or Arize Phoenix, Prometheus and Grafana for metrics. Dedicated ownership pays off once volume exceeds cloud tier limits.
- Enterprise (data residency, compliance): OTel Collector as sidecar with tenant routing, self-hosted Langfuse on ClickHouse. Langfuse's Cresta case study validates this pattern, keeping trace data inside the security boundary under full retention control.
One decision cuts across all four steps: sampling. Full-fidelity capture is affordable in development but breaks storage budgets at scale. A common production pattern is tail-based sampling — retain 100% of traces containing errors, 100% of traces exceeding a cost threshold (e.g., $0.50), and 5-10% of the remainder.
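The tail-sampling rule from this pattern fits in a few lines (the thresholds and the 10% baseline rate are illustrative):

```python
import random

def keep_trace(has_error: bool, cost_usd: float,
               baseline_rate: float = 0.10,
               cost_threshold: float = 0.50,
               rng=random) -> bool:
    """Tail-based sampling decision: always retain traces with errors or
    high cost; sample the healthy remainder at the baseline rate."""
    if has_error or cost_usd >= cost_threshold:
        return True
    return rng.random() < baseline_rate
```

The decision runs after the trace completes, which is the defining property of tail-based sampling: the error flag and final cost are only known once every span has closed.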
Start with Trace Attribution Before the Next Multi-Agent Sprint
The core tradeoff in agent observability is scope versus setup. Instrumentation-based stacks give teams deeper prompt, span, and evaluation visibility in exchange for schema decisions, propagation code, and storage design. Structural approaches narrow the data surface in exchange for near-zero setup cost.
A practical 30-day plan for teams starting from zero:
- Week 1: Establish a trace boundary for every agent run. Even a single root span with agent.id, session.id, and start/end timestamps beats no attribution.
- Week 2: Instrument one agent end-to-end with OpenInference span kinds. Pick the highest-cost agent; the cost data pays for the work.
- Week 3: Add cost attribution at the LLM call level. Tag every call with agent.id and aggregate by agent in Langfuse or Phoenix.
- Week 4: Wire alerts for tool-call loops and context overflow. These two detect the most expensive silent failures.
For teams coordinating parallel coding agents, Intent's isolated worktrees, dedicated MCP connections, and living specs provide week-1 attribution out of the box, which shortens the runway to weeks 2-4.
See how Intent's workspace isolation gives every agent a traceable execution boundary without custom instrumentation.
Free tier available · VS Code extension · Takes 2 minutes
Written by

Ani Galstian
Ani writes about enterprise-scale AI coding tool evaluation, agentic development security, and the operational patterns that make AI agents reliable in production. His guides cover topics like AGENTS.md context files, spec-as-source-of-truth workflows, and how engineering teams should assess AI coding tools across dimensions like auditability and security compliance.