
AI Agent Loop Token Costs: How to Constrain Context

Apr 6, 2026
Paula Hingel

Naive AI agent loops compound token costs at O(N²) because LLM APIs bill for the entire conversation history on every call. A 10-step agent run therefore costs far more than an architecture that constrains each step's context to only what that step needs.

TL;DR

Naive agent loops rebill prior context on every call, so input token cost grows quadratically as tool outputs and reasoning traces accumulate. A 20-step loop can consume over 10x the tokens a simple per-step estimate suggests. Scope limiting, state resets, and coordinator-specialist designs reduce that waste by keeping each step's context narrow.

Why Agent Loops Quietly Drain Budgets

The root cause of unexpectedly high agent bills is how agent loops handle context: conversation history accumulates across iterations, and the LLM API re-bills the entire history on every call.

Most chat/completions LLM APIs are stateless: they receive, and bill for, the entire conversation history on every single call. (Newer APIs increasingly offer server-side conversation state, but billed input still scales with the context processed per call.) Naive agent loops layer statefulness on top of that interface: they append tool outputs, observations, and reasoning traces to the message history after each iteration. Research on agent memory strategies characterizes this directly: "Most agent frameworks today adopt a naïve memory accumulation strategy, where each new LLM response is appended to the next input prompt. While simple to implement, this leads to monotonic prompt growth."

A 10-step coding agent therefore does not cost 10x a single call. The cost follows a triangular number series in which each step re-bills every previous step's content. Separate repository context research adds another dimension: files like AGENTS.md can increase inference costs by over 20% per session while offering minimal improvement, and sometimes reduce task success on complex tasks. Context that causes agents to explore more broadly costs more, even when the exploration is counterproductive.

This guide breaks down the exact math, presents five production-tested constraint patterns, and shows how coordinator-specialist architectures address the root cause.

See how Intent's isolated workspaces reduce redundant context handoffs in multi-agent runs.

Build with Intent

Free tier available · VS Code extension · Takes 2 minutes

The Math: How a 10-Step Loop Costs 5-10x Its Per-Step Estimate

Context accumulation in naive agent loops follows a quadratic cost curve because the entire history is re-serialized and re-injected into the LLM's context window at every step. While the message history grows linearly with each iteration, total billed input tokens grow quadratically because each call re-sends prior context.

The total input tokens for a naive N-step loop follow this formula:

```text
Total_naive = N×S + u×N(N+1)/2 + r×N(N-1)/2

where:
  S = system prompt tokens (fixed)
  u = new input tokens per iteration (user message + tool result)
  r = output tokens per iteration
  N = total iterations
```

The N(N+1)/2 triangular number term is the cost trap. A 20-step loop where each step generates 1,000 tokens produces 210,000 cumulative input tokens rather than the 20,000 tokens a per-step estimate would suggest.

One common first response to this problem is prompt caching, but caching only addresses the fixed system prompt portion of each call. On Claude Sonnet 4.6, cache reads cost $0.30/1M versus $3.00/1M for uncached input, a 90% reduction on the cached prefix. However, the growing conversation history is the dominant cost driver in multi-step loops, and each new tool output or reasoning trace is unique per iteration, so it cannot be cached. Caching reduces the N×S term in the formula above but leaves the quadratic N(N+1)/2 term untouched.
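To see why, the formula above can be split into its cacheable and uncacheable parts. The sketch below uses the worked-example parameters and the pricing quoted in this article; `naive_input_tokens` is an illustrative helper, not a provider API.

```python
def naive_input_tokens(N, S, u, r):
    """Total billed input tokens for a naive full-history loop."""
    return N * S + u * N * (N + 1) // 2 + r * N * (N - 1) // 2

N, S, u, r = 10, 1000, 8000, 500
P_in, P_cached = 3.00 / 1e6, 0.30 / 1e6  # uncached vs. cache-read input rates

total = naive_input_tokens(N, S, u, r)    # 472,500 tokens
system_part = N * S                       # cacheable fixed prefix: 10,000
history_part = total - system_part        # unique per iteration: 462,500

uncached_cost = total * P_in
cached_cost = system_part * P_cached + history_part * P_in
print(f"Uncached: ${uncached_cost:.4f}, with caching: ${cached_cost:.4f}")
# Uncached: $1.4175, with caching: $1.3905 — caching trims only ~2% here
```

Because the growing history dominates the fixed prefix, the 90% discount on the cached portion barely moves the total.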

Worked Example: File-Reading Agent on Claude Sonnet 4.6

This example uses Anthropic's published pricing ($3.00/$15.00 per 1M input/output tokens) with the following parameters: S = 1,000 system prompt tokens, u = 8,000 tokens per iteration (file contents), r = 500 output tokens, N = 10 iterations.

| Approach | Total Input Tokens | Total Cost | vs. Single-Pass |
|---|---|---|---|
| Naive 10-step loop | 472,500 | $1.49 | 43.3x |
| Constrained (W=2 window) | 260,000 | $0.86 | ~25x |
| Single-pass | 9,000 | $0.03 | 1.0x |

The constrained approach keeps only two prior iterations in context, cutting total cost by roughly 43% versus the naive full-context run in this example. The savings are even larger on Gemini's pricing tiers, where crossing the 200K token threshold activates a 2x input rate.

The Turn-by-Turn Reality

A directional example from a context engineering benchmark shows the accumulation pattern in a real agent run:

```text
Iteration 1:    888 tokens (system + user message)
Iteration 2:  3,400 tokens (+ list_directory result)
Iteration 3:  8,900 tokens (+ read_file: context.py)
Iteration 4: 14,200 tokens (+ read_file: compaction.py)
Iteration 5: 18,900 tokens (+ grep + read_file: _agent.py)
```

The critical insight from SWE-bench measurements is that most of this growth is removable waste: in one analysis, 30,400 of 48,400 total tokens came from tool results alone, and 39.9-59.7% of those tokens were removable with no performance loss. The tool outputs retained in message history, not the reasoning or planning tokens, are where constraint patterns deliver the largest savings.

Cost Calculation in Python

```python
# Python 3.11+: verified against the worked examples above

def naive_loop_cost(N, S, u, r, P_in, P_out):
    """Total cost of a naive (full-context) agent loop."""
    total_input = N*S + u*N*(N+1)//2 + r*N*(N-1)//2
    total_output = N * r
    return total_input * P_in + total_output * P_out

def constrained_loop_cost(N, S, u, r, W, P_in, P_out):
    """Total cost with a sliding window of W prior iterations kept."""
    context_per_call = S + W*(u + r) + u
    total_input = N * context_per_call
    total_output = N * r
    return total_input * P_in + total_output * P_out

# Claude Sonnet 4.6 pricing (anthropic.com/pricing)
P_in = 3.00 / 1_000_000
P_out = 15.00 / 1_000_000

# File-reading agent example
naive = naive_loop_cost(N=10, S=1000, u=8000, r=500, P_in=P_in, P_out=P_out)
const = constrained_loop_cost(N=10, S=1000, u=8000, r=500, W=2, P_in=P_in, P_out=P_out)
print(f"Naive: ${naive:.4f}, Constrained: ${const:.4f}")
# Output: Naive: $1.4925, Constrained: $0.8550
```

Five Context Constraint Patterns That Cut Agent Costs

Reducing AI agent cost requires breaking the accumulation cycle at the architectural level. Five patterns have emerged across production systems, each targeting a different aspect of the problem. Teams building multi-agent systems can apply these individually or layer them for compounding savings.

Pattern 1: Scope Limiting via Subagent Isolation

Scope limiting distributes work across stateless subagents that each receive only context relevant to their specific subtask. No subagent's internal trace is visible to any other; the parent orchestrator synthesizes outputs.

Multi-agent architecture benchmarks measured the token impact directly: isolated subagents used approximately 9K total tokens for a multi-domain query, compared to 15K tokens for a skills-based pattern that accumulates context.

Tradeoff: Subagents provide zero amortization benefit on repeated similar queries because each call resets. Stateful patterns can improve efficiency on repeat requests by maintaining context. Teams running agents that frequently revisit the same files or services should weigh this cost against the isolation savings.
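As a rough sketch of the mechanism: the orchestrator below dispatches stateless subagents that each see only a shared brief plus their own subtask. `call_llm` is a hypothetical stand-in for a real LLM client, not a named API.

```python
def call_llm(system, user):
    # Hypothetical stand-in for a real LLM API call.
    return f"result({user})"

def run_isolated_subagents(subtasks, shared_brief):
    results = []
    for task in subtasks:
        # Each call's context = shared brief + one subtask. No sibling
        # traces are included, so billed input stays flat per subagent
        # instead of accumulating across the run.
        results.append(call_llm(system=shared_brief, user=task))
    # The parent synthesizes compressed results, never raw subagent traces.
    return call_llm(system=shared_brief, user=" | ".join(results))
```

The key property is that no subagent's billed input grows with the number of siblings.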

Pattern 2: State Resets with External Persistence

State reset architecture periodically resets the agent's context window to a clean state, maintaining continuity through external durable storage, such as the filesystem, a database, or git, rather than through the context window itself.

Research on agent harness anatomy names the failure mode this addresses: "Context Rot," where models become worse at reasoning as their context window fills up. The Ralph Loop pattern restarts the agent in a fresh context window on each iteration while carrying forward state through files or other persistent artifacts.

Context engineering research implements a similar approach: agents summarize completed work phases and store essential information in external memory, then spawn fresh subagents with clean contexts and retrieve stored context from memory.

Tradeoff: The handoff between iterations is lossy. The summarization step that compresses completed work into external state inevitably discards nuance: variable naming rationale, rejected approaches, and implicit constraints the model inferred from prior tool outputs. Teams implementing state resets need to define explicitly what gets persisted (task status, file paths modified, test results) versus what gets dropped (reasoning traces, intermediate hypotheses). Over-aggressive resets force the agent to re-derive context it already had, potentially increasing total iterations even as per-iteration cost drops.
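A minimal sketch of the pattern, assuming state is persisted as a JSON file (the `agent_state.json` path and `call_llm` callable are illustrative stand-ins, not a specific framework's API):

```python
import json
from pathlib import Path

STATE = Path("agent_state.json")  # illustrative external-state location

def load_state():
    if STATE.exists():
        return json.loads(STATE.read_text())
    return {"done": [], "next": "start"}

def save_state(state):
    STATE.write_text(json.dumps(state))

def run_with_resets(steps, call_llm):
    for _ in range(steps):
        state = load_state()  # continuity from the file, not chat history
        prompt = f"Completed: {state['done']}. Next task: {state['next']}"
        result = call_llm(prompt)  # fresh context window every iteration
        # Persist only essentials (status, next action), not reasoning traces.
        state["done"].append(state["next"])
        state["next"] = result
        save_state(state)
```

Each iteration's billed input is bounded by the size of the persisted state rather than the full run history.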

Pattern 3: Reasoning-Execution Separation

Reasoning-execution separation decouples the token-heavy planning phase from the lighter execution phase. A larger model generates a complete plan before tool calls. A smaller, cheaper model executes individual steps.

Research on plan-then-execute patterns demonstrates generating a symbolic plan referencing placeholders for tool outputs, then executing tool calls to fill those slots, followed by a lightweight synthesis step. The plan becomes an auditable artifact, and executors operate with minimal context containing only their assigned step and inputs.

A complementary model-escalation approach initiates with the most cost-effective LLM first. If the conversation concludes without resolving the query, the system restarts with the next more expensive model, invoking premium models only after cheaper models fail.

Tradeoff: Plan staleness is the primary risk, and recovery from a stale plan is expensive. If step three of a 10-step plan reveals the approach is flawed (e.g., an API doesn't behave as expected, or a file has a different structure than assumed), the system must either re-plan from scratch, paying the full planner cost again, or attempt to patch the plan mid-execution, which smaller executor models handle poorly. Teams using this pattern should build plan validation checkpoints at key steps and set a re-planning budget (e.g., allow one full re-plan per task before escalating to a human).
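A minimal plan-then-execute sketch, with hypothetical `plan_llm`, `exec_llm`, and `tools` stand-ins (the (tool, argument) plan shape is an assumption for illustration):

```python
def plan_then_execute(task, plan_llm, exec_llm, tools):
    # One expensive planner call emits the full plan up front as
    # (tool, argument) steps whose outputs fill numbered slots.
    plan = plan_llm(task)
    slots = {}
    for i, (tool, arg) in enumerate(plan):
        # Each execution step sees only its own tool call and inputs,
        # never the accumulated history of prior steps.
        slots[i] = tools[tool](arg)
    # Lightweight synthesis by the cheaper model over the filled slots.
    return exec_llm(f"Task: {task}. Step results: {slots}")
```

The plan itself doubles as the auditable artifact the research describes: it can be logged and validated before any tool runs.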

Pattern 4: Context Trimming and Selective Injection

Context trimming ensures only tokens required for the immediate next step enter the context window. Research on explicit compression scheduling measured the impact on SWE-bench: compression instructions scheduled every 10-15 tool calls achieved 22.7% token savings, from 14.9M to 11.5M tokens, while matching baseline accuracy.

A critical finding from this research is that passive prompting yielded only 6% savings and caused accuracy degradation. Compression must be explicitly instructed and scheduled to be effective without quality loss.
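One way to implement the scheduled variant is to replace all but the most recent tool results with summaries at each compression point. This is an illustrative sketch, not the benchmarked implementation; the 80-character truncation stands in for a real summarization call.

```python
def trim_history(messages, keep_last=2, summarize=lambda text: text[:80]):
    """Replace all but the last `keep_last` tool results with summaries."""
    tool_idx = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    to_trim = set(tool_idx[:-keep_last]) if len(tool_idx) > keep_last else set()
    return [
        {"role": "tool", "content": "[summary] " + summarize(m["content"])}
        if i in to_trim else m
        for i, m in enumerate(messages)
    ]
```

A caller would invoke this on the explicit schedule the research found effective, every 10-15 tool calls, rather than relying on passive prompting.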

Pattern 5: Conversation Summarization

Conversation summarization compresses full conversation history into a summary when context approaches the window limit. Claude Code has been documented as triggering auto-compaction when the context window nears capacity, often around 95% in earlier references, though more recent documentation reports a lower threshold of about 83.5%.

Critical caveat: research on AI summarization documents failure modes when compression is aggressive or lossy. In one measurement, per-step tokens dropped from 8,500 to 2,100, but average turns to solve increased from 4.0 to 14.0. Total token consumption dropped only from 34K to 29.4K, a modest 14% savings, while introducing context drift. Summarization works best at clear phase boundaries rather than as an aggressive ongoing compression mechanism.
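A phase-boundary compaction trigger might look like the following sketch, using a rough 4-characters-per-token estimate instead of a provider tokenizer (the threshold default reflects the ~83-95% range cited above):

```python
def maybe_compact(messages, max_context, threshold=0.85, summarize=None):
    # Rough estimate (~4 chars/token); a real system would use a tokenizer.
    est_tokens = sum(len(m["content"]) // 4 for m in messages)
    if est_tokens / max_context < threshold:
        return messages  # below threshold: leave history untouched
    summary_fn = summarize or (lambda ms: "[compacted history]")
    # Keep one summary message plus the most recent exchange only.
    return [{"role": "system", "content": summary_fn(messages)}] + messages[-2:]
```

Gating compaction on a threshold, rather than compressing every turn, is what keeps this a safety net instead of a turn-multiplying failure mode.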

Pattern Comparison

The sections above cite sources for each measured finding. The following table summarizes the token mechanism, measured findings, and primary risk for each pattern.

| Pattern | Token Mechanism | Key Measured Finding | Primary Risk |
|---|---|---|---|
| Subagent isolation | Scope restriction per call | ~40% fewer tokens in benchmarks | No savings on repeat requests |
| State resets | Context rot prevention | Enables multi-hour tasks | Handoff quality loss |
| Reasoning-execution split | Plan-once execute-many | Structural token reduction | Plan staleness |
| Context trimming | Per-turn output filtering | 22.7% savings with scheduled compression | Accuracy loss if unscheduled |
| Summarization | Window compression | Token reduction (net savings vary) | Turn multiplication |

Choosing a Pattern: Decision Framework

The five patterns address different cost profiles. Selecting the right starting point depends on where the quadratic growth concentrates in a given agent workflow:

  • Loops under five steps with large tool outputs (file reads, API responses): Start with Pattern 4 (context trimming). The per-turn filtering targets the biggest contributor to context growth without requiring architectural changes.
  • Loops exceeding 10 steps with moderate tool outputs: Apply Pattern 2 (state resets) to break the quadratic curve at defined phase boundaries. The cost of lossy handoffs is lower than the cost of unbounded accumulation at this step count.
  • Multi-domain tasks with independent subtasks: Pattern 1 (subagent isolation) delivers the largest savings when subtasks share minimal context. Combine with Pattern 4 within each subagent for compounding reductions.
  • Tasks requiring complex upfront planning followed by repetitive execution: Pattern 3 (reasoning-execution split) amortizes the expensive planning call across many cheap execution steps.
  • Long-running sessions approaching context limits: Pattern 5 (summarization) works as a safety net at clear phase boundaries, but should be paired with another pattern rather than used as the primary cost control.

Explore how Intent's living specs keep parallel specialists aligned while each agent works in a scoped workspace.



Architecture: Coordinator with Scoped Specialists

The coordinator-specialist architecture prevents any single agent from accumulating the full workflow history. A lean coordinator holds task decomposition logic, routing decisions, and compressed summaries. Specialist agents each receive scoped, task-specific context, process their subtask, and return a compressed result rather than raw data.

Why Context Scoping Is the Core Mechanism

The most direct measurement of the coordinator-specialist benefit comes from research on shared context windows: token consumption dropped by 53.7% on average, including 58.4% on HotpotQA and 56.2% on GSM8K, when a single shared context window replaced redundant re-encoding across agent calls.

That reduction matters because the baseline costs are substantial. Multi-agent research measuring token baselines reports that agents typically use approximately 4x more tokens than chat interactions, and multi-agent systems approximately 15x more. Without proper context isolation, one unoptimized multi-agent system consumed 850K tokens versus 100K for a single agent, an 8.5x multiplier.

Empirical measurements of self-organizing agent systems show that the planning and orchestration layer should consume approximately 9.8% of total tokens, with worker agents at 70.6%. If a coordinator is consuming significantly more, the coordinator is likely accumulating context that should be delegated.

How Production Systems Implement This Pattern

Several coding assistants and agent frameworks have adopted isolated sub-agent context windows, and the specific isolation mechanism each chooses has direct cost implications. GitHub's parallel agent architecture dispatches sub-agents where "each sub-agent gets its own context window but shares the same filesystem." The shared filesystem is the key cost design: coordination happens through file I/O rather than through the context window, so the orchestrator never re-serializes specialist outputs into its own growing history. Cognition's managed agent model applies the same principle by giving each agent "a clean slate, a narrow focus, its own shell, and its own test runner."


Intent uses this pattern through a coordinator-specialist architecture. The Coordinator Agent analyzes the codebase, breaks the spec into a structured task list, and delegates to Implementor Agents that execute tasks in parallel waves. Each specialist receives a scoped context window rather than the full project history. A Verifier Agent checks results against the living spec, flagging inconsistencies while using implementation context maintained by the Context Engine across 400,000+ files. Applied to the file-reading example from the math section, this architecture would split a 10-step task across three to four parallel specialists, each running two to three steps with isolated context. Each specialist's cost follows a short linear curve instead of all 10 steps accumulating in a single quadratic window.

When Single-Agent Execution Costs Less

The coordinator-specialist architecture introduces a fixed overhead: the coordinator's own token consumption for task decomposition, routing, and result synthesis. For short tasks that a single agent completes in two to three steps, this overhead exceeds the savings from context isolation. The crossover point depends on per-step token growth. For the file-reading example (8,000 tokens per step), a single agent becomes more expensive than a coordinator with two specialists around step five or six, where the quadratic accumulation in the single agent exceeds the fixed coordinator overhead plus two shorter parallel curves. Tasks with smaller per-step token growth (e.g., code generation without large file reads) push this crossover point higher, so the single-agent approach remains cheaper for longer chains.
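The crossover arithmetic can be sketched with the input-token formula from the math section. The 50K-token coordinator overhead and two-worker split are illustrative assumptions chosen to match the example's parameters, and the sketch ignores output-token re-billing for simplicity.

```python
def single_agent_input(N, S, u):
    # Quadratic accumulation: every step re-sends all prior tool inputs.
    return N * S + u * N * (N + 1) // 2

def coordinator_input(N, S, u, workers=2, overhead=50_000):
    # Split the chain across parallel specialists, each a short linear
    # curve, plus a fixed (assumed) overhead for decomposition/synthesis.
    per_worker = (N + workers - 1) // workers  # ceiling division
    return overhead + workers * single_agent_input(per_worker, S, u)

S, u = 1000, 8000
for N in range(2, 11):
    if single_agent_input(N, S, u) > coordinator_input(N, S, u):
        print(f"Coordinator wins from step {N}")  # Coordinator wins from step 6
        break
```

With these assumed numbers the crossover lands at step six, consistent with the "around step five or six" estimate above; a larger overhead or smaller per-step growth pushes it later.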

The Reliability Tax on Token Costs

Agent loop optimization must account for reliability compounding. A sequential chain with 95% per-step reliability achieves only 60% end-to-end reliability over 10 steps (0.95^10 = 0.598). Each failure requires retry calls that re-run some or all of the chain. Those retries multiply token costs: a 10-step sequential agent with 95% per-step reliability spends approximately 40% more tokens on retries than a system with perfect reliability at the same step count.

Parallel specialist execution shortens the sequential chain. Splitting a 10-step task into three parallel tracks of three to four steps each raises end-to-end reliability from 60% to approximately 81-86% (0.95^3 to 0.95^4 per track, with independent failure). Fewer retries mean fewer wasted tokens. Teams that understand why multi-agent systems fail can design retry strategies that minimize wasted tokens.
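The reliability arithmetic above can be checked directly, assuming the simplest retry model in which a failed run restarts from scratch (a coarser assumption than the 40% figure, which allows partial retries):

```python
def chain_reliability(p_step, steps):
    # All steps must succeed for the run to succeed.
    return p_step ** steps

def expected_runs(p_success):
    # Geometric distribution: expected attempts until one full success.
    return 1 / p_success

seq = chain_reliability(0.95, 10)    # ~0.599 end-to-end
track = chain_reliability(0.95, 4)   # ~0.815 per 4-step parallel track
print(f"10-step chain: {seq:.3f}, retry multiplier {expected_runs(seq):.2f}x")
print(f"4-step track:  {track:.3f}, retry multiplier {expected_runs(track):.2f}x")
# 10-step chain: 0.599, retry multiplier 1.67x
# 4-step track:  0.815, retry multiplier 1.23x
```

Shorter tracks cut the expected retry multiplier, which is where the token savings from parallelization come from under this model.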

Monitoring: Token-Per-Task Metrics That Catch Cost Spirals

An API format change once caused 200x the baseline token rate in a production agent system, costing approximately $50 over 40 minutes, as described in an anecdotal incident postmortem. CPU and memory stayed flat because LLM calls are I/O-bound, so only per-cycle token tracking revealed the anomaly. Traditional infrastructure monitoring misses agent cost spirals entirely.

Core Metrics to Track

The following metrics provide the minimum observability needed to detect cost spirals before they compound. These alert thresholds draw from the token baselines discussed in the architecture section.

| Metric | Definition | Alert Threshold |
|---|---|---|
| Token-per-task | Total input + output tokens per complete task | 2x established baseline (catches quadratic growth before it compounds beyond one billing cycle) |
| Cost-per-completion | USD cost per completed task across all steps | Daily spend exceeds historical baseline |
| Loop iterations per task | Number of LLM calls before completion | Average exceeds 2x baseline (signals retry loops or plan staleness) |
| Context utilization ratio | Tokens used / max context window per call | Greater than 85% of window (approaching summarization triggers and pricing tier thresholds) |
| Per-subagent cost share | Cost broken down by individual subagent | Orchestrator exceeding 10-15% of total (the 9.8% orchestration benchmark from self-organizing systems research suggests orchestrators consuming more are accumulating specialist context) |

OpenTelemetry Instrumentation

The following OTel counters and histograms map directly to the metrics above, providing real-time alerting on agent loop health.

```python
# Python 3.11+: OTel metrics for agent loop monitoring
# Source: oneuptime.com agent monitoring guide
from opentelemetry import metrics

meter = metrics.get_meter("agent.monitoring")

tool_calls = meter.create_counter("agent.tool.calls")
tool_failures = meter.create_counter("agent.tool.failures")
# Alert: tool_failures / tool_calls > 0.05 for any tool

llm_duration = meter.create_histogram(
    "agent.llm.duration",
    unit="s",
    description="LLM call duration in seconds",
)
loop_iterations = meter.create_histogram(
    "agent.loop.iterations",
    description="Number of LLM calls per agent run",
)
# Alert: avg iterations > 2x established baseline
```

Observability Platform Comparison

Four platforms currently offer the per-step cost attribution needed for agent loop monitoring. Each has different strengths depending on deployment requirements.

The comparison covers per-step cost in the trace tree, custom/tool cost attribution, multi-provider support, OTel integration, alert webhooks, self-hostability, and cache token tracking; the platform-specific differences are summarized below.

LangSmith provides unified cost tracking across LLM calls, tool calls, and retrieval steps, with PagerDuty integration for alerting. For teams already using LangChain, LangSmith offers the tightest integration with existing agent traces. Helicone operates as an AI Gateway proxy requiring no code changes, with Custom Properties for per-user and per-feature cost attribution. Teams that need cost visibility without modifying agent code can deploy Helicone fastest. Langfuse offers self-hostable hierarchical tracing with automatic token breakdown by type, the strongest option for regulated environments requiring on-premises observability. Portkey provides the broadest alerting options with webhook support and multi-provider coverage, suited for teams running agents across multiple LLM providers.

Intent's workspace model gives each agent run an isolated context within a tracked workspace, providing a natural boundary for per-task cost measurement.

Constrain Context Before Scaling Agent Count

The highest-leverage optimization for AI agent cost is constraining what enters each agent's context window. Naive history accumulation causes context length to grow over time, which in turn drives quadratic attention compute and memory costs. Start by measuring token-per-task on existing agent workflows to identify which loops are accumulating context unnecessarily. Then apply scope limiting and state resets to the highest-cost loops before investing in full architectural changes.

See how Intent's coordinator, implementors, and verifier keep multi-agent development aligned through structured task decomposition, isolated workspaces, and spec-based verification.


