
Agentic Infrastructure: What Actually Goes in the Stack

May 3, 2026
Ani Galstian

Agentic infrastructure is the set of runtime systems, orchestration layers, state management services, tool-integration protocols, memory stores, security controls, and observability tooling required to deploy and operate autonomous AI agents reliably in production. It represents a distinct budget category because agents need stateful, multi-step execution with tool-calling capabilities that existing MLOps and LLM serving infrastructure do not cover.

TL;DR

Agentic AI infrastructure differs from LLM serving because agents maintain state across multi-step sessions, call external tools, and run for minutes to hours. The stack has five layers: compute, orchestration, context, observability, and security. At scale, platform and infrastructure costs become a major part of total cost of ownership alongside token spend. Workspaces like Intent, Augment Code's spec-driven multi-agent environment, address several of these layers in a single product surface, particularly orchestration, context, and human-in-the-loop control.

Why Agentic Infrastructure Needs Its Own Budget Line

Traditional LLM serving handles stateless, single request-response inference, and agents break that model. An agent perceives its environment, reasons through context, plans, and takes actions toward a goal. Execution requires persistent state per session across multiple inference calls, tool invocations, and potentially hours of runtime.

The lifecycle unit shift makes the distinction concrete. MLOps tracks model versions. LLM serving tracks requests. Agentic infrastructure tracks sessions and tasks as the atomic unit of operation, which is why workspaces like Intent organize work around isolated git worktrees where each agent session runs as a first-class artifact rather than a transient API call.

Three architectural breaks force the category separation:

  • State persistence: LLM serving achieves economic efficiency through stateless shared infrastructure, while agents require per-session state across multiple inference calls and tool executions. Google Cloud's documentation describes Cloud Run as stateless and without built-in persistent storage, so multi-step agent workflows typically rely on external storage or state services.
  • Duration mismatch: AWS Bedrock AgentCore supports agentic workloads up to 8 hours, well beyond typical request limits in conventional inference infrastructure.
  • Tool-use security: The OWASP Top 10 for Agentic Applications identifies agent-specific risks including tool misuse and exploitation, identity and privilege abuse, and insecure inter-agent communication. Mitigations require infrastructure-layer controls that LLM serving does not provide.
| Dimension | LLM Serving Infra | Agentic Infrastructure |
| --- | --- | --- |
| Execution model | Single request/response | Iterative loops with tool calls |
| State | Stateless | Stateful sessions, persistent memory |
| Lifecycle unit | Request | Agent session / task |
| Task duration | Milliseconds to seconds | Minutes to hours |

Cloud providers have begun separating the category through dedicated product lines. AWS launched Bedrock AgentCore as a separate service from SageMaker (MLOps) and Bedrock inference. Google Cloud published dedicated agentic architecture guidance. Azure published Agent Factory patterns covering agent-specific infrastructure challenges.

Gartner's January 2026 forecast puts total worldwide AI spending at about $2.5 trillion in 2026, with AI infrastructure as the largest sub-segment at roughly $1.37 trillion. Gartner has also published a dedicated Hype Cycle for Agentic AI, and Forrester discusses agentic AI and adaptive process orchestration as emerging enterprise capabilities.

Teams planning 2026-2027 agent rollouts need to budget for five distinct infrastructure layers. The sections below cover what each layer contains, what to evaluate, and where the money goes.

See how Intent's coordinated multi-agent workspace replaces ad-hoc orchestration scripts with living specs that stay aligned across services.

Build with Intent

Free tier available · VS Code extension · Takes 2 minutes

The Five Layers of the Agentic Infrastructure Stack

Agentic infrastructure organizes into five layers, each with its own responsibilities, evaluation criteria, and build-vs-buy tradeoffs. The layers interact: orchestration governs context retrieval, observability spans all layers, and security design belongs in the architecture from day one rather than bolted on after deployment.

Layer 1: Compute

The compute layer handles GPU provisioning, inference hosting, and autoscaling for the LLM calls that power agent reasoning. For most teams, this layer means managed API providers rather than self-hosted GPU infrastructure.

What to evaluate:

  • Token pricing tiers: Gemini 2.5 Flash at $0.30/M input tokens compared with GPT-5.4 at $2.50/M input tokens shows a wide price spread, though these models occupy different capability tiers.
  • Batch API availability: OpenAI and Anthropic both offer roughly 50% discounts on batched, non-interactive requests.
  • Autoscaling for bursty workloads: User-triggered agents can fan out into bursts of concurrent sessions, so queue depth and batch size track inference load better than GPU utilization does (a minimal scaling sketch follows this list).
  • Session-hour pricing: Anthropic charges $0.08 per session hour for managed agents, a pricing model with no equivalent in standard inference.
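
To make the queue-depth signal concrete, here is a minimal scaling heuristic, assuming a worker pool sized from backlog rather than GPU utilization; the function name, target batch size, and replica cap are illustrative, not any provider's API.

```python
# Sketch: derive replica count from queue depth and target batch size.
# Thresholds and names are illustrative assumptions.
import math

def desired_replicas(queue_depth: int, target_batch: int = 8,
                     max_replicas: int = 20) -> int:
    """Size the worker pool so each replica handles ~one batch of sessions."""
    if queue_depth == 0:
        return 1  # keep one warm replica to avoid cold-start latency
    return min(max_replicas, math.ceil(queue_depth / target_batch))

# A burst of 37 queued agent sessions scales to 5 replicas; 400 caps at 20.
assert desired_replicas(37) == 5
assert desired_replicas(400) == 20
```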

Build vs. buy: API providers are the correct default. Self-hosting makes economic sense only when compliance mandates prohibit third-party data processing, or when sustained high-concurrency open-source model workloads create a large cost differential against proprietary APIs.

Multi-provider routing through open-source gateways like LiteLLM or Portkey reduces model-level lock-in from day one. The a16z CIO survey found 37% of enterprises use five or more models to avoid vendor dependency. Intent supports this pattern directly through its model picker, letting teams use Opus for complex architecture, Sonnet for rapid iteration, and GPT-5.2 for deep code analysis within a single multi-agent run.
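
As a sketch of what gateway-level routing looks like in practice, the snippet below uses LiteLLM's unified completion call to fall through an ordered provider list. The model IDs are illustrative, and API keys are assumed to be set in the environment.

```python
# Multi-provider fallback through LiteLLM's unified interface. Model IDs
# are illustrative; keys (ANTHROPIC_API_KEY, OPENAI_API_KEY, ...) are
# read from the environment.
from litellm import completion

FALLBACK_ORDER = ["claude-sonnet-4-20250514", "gpt-4o",
                  "gemini/gemini-2.5-flash"]

def ask(prompt: str) -> str:
    last_err: Exception | None = None
    for model in FALLBACK_ORDER:  # try providers in order of preference
        try:
            resp = completion(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            )
            return resp.choices[0].message.content
        except Exception as err:
            last_err = err  # provider failed; fall through to the next
    raise RuntimeError("all providers failed") from last_err
```

LiteLLM also ships a Router with built-in fallback and routing strategies; the manual loop above just makes the mechanism visible.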

Layer 2: Orchestration

The orchestration layer manages agent execution loops: LLM reasoning, tool calling, state persistence, multi-agent coordination, retry logic, and human-in-the-loop controls. Graph-based control flow has become central in production framework design.

The criteria below describe what teams should look for when evaluating an orchestration layer:

| Criterion | What to Look For |
| --- | --- |
| State persistence | Checkpoint backends (PostgreSQL, MongoDB); recovery at failing node, not full restart |
| Human-in-the-loop | Native interrupt primitives before irreversible operations |
| Multi-agent coordination | Supervisor patterns, handoffs, subgraph composition |
| Error recovery | Checkpoint-based resumption versus conversation restart |
| Vendor lock-in | Model-agnostic design; switching cost for state management and prompt dependencies |

Framework selection: LangGraph offers built-in state persistence through graph-based checkpointing. CrewAI delivers faster time-to-value for role-based workflows, though teams should weigh long-term production constraints carefully. The OpenAI Agents SDK optimizes for minimal configuration, but teams pursuing multi-model strategies may weigh potential lock-in trade-offs against more provider-neutral frameworks. Teams comparing production framework options can find a deeper breakdown of six patterns for multi-agent code development, covering spec-driven decomposition, worktree isolation, and the coordinator/specialist/verifier role split.

Build vs. buy: Open source is the right default for orchestration. LangGraph is production-grade for complex stateful workflows, and no managed equivalent matches its capability on branching and checkpoint-based error recovery. Microsoft Agent Framework targets Azure-native AI agent development and workflow orchestration. Teams that want orchestration packaged as a workspace rather than a library can use Intent, which runs a Coordinator/Implementor/Verifier pattern out of the box and keeps every agent reading from and writing to the same living spec. For teams whose orchestration needs extend beyond a single workspace, with agents triggered by SDLC events like Linear tickets or incident alerts, Augment Cosmos is a platform-level environment that adds shared filesystem, organization-wide memory, and human-in-the-loop policies across the broader development lifecycle. Cosmos is currently in research preview for MAX plan users.
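
To ground the checkpointing and HITL criteria from the table above, here is a minimal LangGraph sketch, assuming a local Postgres instance and an illustrative two-node workflow; the connection string, state shape, and node bodies are placeholders, not a production design.

```python
# Durable state plus a human-in-the-loop gate in LangGraph. Connection
# string, state fields, and node logic are illustrative assumptions.
from typing import TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.postgres import PostgresSaver

class State(TypedDict):
    plan: str
    applied: bool

def draft_plan(state: State) -> dict:
    return {"plan": "rename column, backfill, drop old column"}

def apply_changes(state: State) -> dict:
    return {"applied": True}  # the irreversible step

builder = StateGraph(State)
builder.add_node("draft_plan", draft_plan)
builder.add_node("apply_changes", apply_changes)
builder.add_edge(START, "draft_plan")
builder.add_edge("draft_plan", "apply_changes")
builder.add_edge("apply_changes", END)

with PostgresSaver.from_conn_string("postgresql://app@localhost/agents") as cp:
    cp.setup()  # create checkpoint tables on first run
    graph = builder.compile(checkpointer=cp,
                            interrupt_before=["apply_changes"])
    config = {"configurable": {"thread_id": "session-42"}}
    graph.invoke({"plan": "", "applied": False}, config)  # pauses at the gate
    graph.invoke(None, config)  # resume from checkpoint after human approval
```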

Critical gap: Many major agent frameworks do not natively provide enterprise-grade resilience patterns such as circuit breakers, payload validation, or deterministic fallback routing. The retry and error handling these frameworks ship typically stops at the individual API call, so teams must build higher-level resilience above the framework layer; one such pattern is sketched below.
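
A minimal circuit breaker, as one example of the resilience layer teams end up adding themselves; the thresholds and naming are illustrative.

```python
# Circuit breaker for flaky tool or LLM endpoints: fail fast while the
# circuit is open, probe again after a cooldown. Thresholds are illustrative.
import time

class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.failures >= self.max_failures:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.failures = self.max_failures - 1  # half-open: one probe call
        try:
            result = fn(*args, **kwargs)
            self.failures = 0  # success closes the circuit
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
```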

Layer 3: Context

The context layer manages what information agents can access at each reasoning step: retrieval pipelines, vector databases, agent memory, and knowledge graphs. All LLMs are bounded by finite context windows, which makes context engineering a discipline that production systems must plan for explicitly rather than treat as an afterthought.

What to evaluate:

  • Memory architecture: Short-term (in-context window management) and long-term (external stores for episodic, semantic, and procedural memory). Production frameworks typically combine short-term context with external or long-term storage rather than relying on a single standardized model.
  • Context rot: Chroma's research shows degradation as input token count increases. Teams must actively manage context through compaction, deletion, and scratchpads.
  • Vector database selection: Pinecone (fully managed, $50/month minimum on Standard plans), Weaviate (open-source, native multi-tenancy for per-agent memory isolation), and Qdrant (Apache 2.0, hybrid search filtering). For selection criteria across the full vendor field, see this breakdown of context engineering patterns for agentic swarm coding.
  • RAG pipeline governance: The orchestration layer should govern retrieval. The LLM should never have direct access to the vector database. Retrieval becomes a predictable routing mechanism rather than ad hoc model improvisation.
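
A minimal sketch of that governance pattern follows, with a stubbed `vector_store` standing in for a Qdrant, Weaviate, or Pinecone client; the collection name, filter fields, and result cap are illustrative.

```python
# The LLM never queries the vector DB directly; it only sees the output of
# `retrieve`, which enforces tenant scoping and a result cap. `VectorStore`
# is a hypothetical stand-in, not a specific vendor SDK.
from dataclasses import dataclass

@dataclass
class Hit:
    text: str

class VectorStore:
    def search(self, collection: str, query: str,
               filters: dict, limit: int) -> list[Hit]:
        return [Hit(text=f"[doc matching {query!r}]")]  # stubbed result

vector_store = VectorStore()
MAX_SNIPPETS = 5

def retrieve(query: str, agent_id: str) -> list[str]:
    """The only retrieval path the agent can invoke."""
    hits = vector_store.search(
        collection="docs",
        query=query,
        filters={"tenant": agent_id},  # per-agent memory isolation
        limit=MAX_SNIPPETS,            # hard cap on injected context
    )
    return [hit.text for hit in hits]

def build_prompt(task: str, agent_id: str) -> str:
    context = "\n".join(retrieve(task, agent_id))
    return f"Context:\n{context}\n\nTask: {task}"
```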

Build vs. buy: Memory and context management remains the least mature layer for off-the-shelf solutions. Managed options like Mem0 and Zep cover standard use cases, but no plug-and-play memory layer has yet emerged as a de facto standard. Production teams typically build custom memory architectures on top of vector databases combined with relational stores. Intent's Context Engine processes 400,000+ files through semantic dependency analysis, giving every agent in a workspace shared architectural understanding without each agent recomputing context independently.

Layer 4: Observability

The observability layer provides tracing, evaluation, guardrails, and cost monitoring for agent execution. Traditional APM tools fall short here because agents handle memory, context, and decisions across many turns. LLM monitoring surfaces that an agent failed, while agent observability surfaces which step in the reasoning loop caused the failure.

The capabilities below define what production observability should cover:

| Capability | Why It Matters |
| --- | --- |
| Hierarchical execution tracing | Capture every LLM call, tool invocation, retrieval step, and agent handoff in a single trace |
| Session-level evaluation | Evaluate goal completion across multi-turn conversations, beyond individual outputs |
| Agent trajectory evaluation | Assess the decision path an agent takes, not only the final output |
| Cost attribution | Per-call, per-trace, per-session, and per-user/team cost breakdowns |
| OpenTelemetry alignment | OTel GenAI conventions define agent span attributes and reduce vendor lock-in |
| Production-trace-to-eval pipeline | Capture production failures and promote them to evaluation datasets |

For teams standardizing on OpenTelemetry, Arize's writeup on agent trajectory evaluation and the OTel GenAI agent span specification describe how to implement these capabilities consistently across vendors. The same Arize team has published a longer review of production-to-eval pipelines.
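
For orientation, here is a hierarchical-tracing sketch using the OpenTelemetry Python API. The attribute names follow the incubating OTel GenAI semantic conventions; the span structure and values are illustrative, and an SDK exporter must be configured separately.

```python
# Hierarchical agent tracing: one parent span per agent invocation, child
# spans per LLM call and tool call. Attribute names follow the incubating
# GenAI semantic conventions; values are illustrative.
from opentelemetry import trace

tracer = trace.get_tracer("agent-service")

def run_agent(task: str) -> str:
    with tracer.start_as_current_span("invoke_agent") as agent_span:
        agent_span.set_attribute("gen_ai.operation.name", "invoke_agent")
        agent_span.set_attribute("gen_ai.agent.name", "implementor")

        # child span: one LLM call inside the reasoning loop
        with tracer.start_as_current_span("chat") as llm_span:
            llm_span.set_attribute("gen_ai.request.model", "claude-sonnet")
            llm_span.set_attribute("gen_ai.usage.input_tokens", 1842)

        # child span: one tool invocation, nested under the same trace
        with tracer.start_as_current_span("execute_tool") as tool_span:
            tool_span.set_attribute("gen_ai.tool.name", "run_tests")

        return "done"
```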

Platform options: Langfuse (MIT licensed, self-hostable, framework-agnostic, starting at $29/month for the managed cloud tier) suits compliance-sensitive teams. Arize Phoenix (open-source, OTel-native, 2.5M+ monthly downloads) provides agent trajectory evaluation. LangSmith offers deep LangGraph integration, but per-seat pricing scales poorly for large teams. A broader review of monitoring options is available in this comparison of observability platforms for AI coding assistants.

Build vs. buy: Strong open-source options make this one of the clearest cases for OSS-first. Self-hosted Langfuse eliminates per-event costs entirely. At 50M spans, LangSmith overage charges alone exceed $4,900.

Layer 5: Security

The security layer addresses threat models unique to agentic systems: prompt injection, tool-use authorization, sandboxed execution, agent identity management, and compliance. The threat surface differs fundamentally from LLM applications because agents can query databases, call APIs, execute code, and chain tool calls without human review of each step.

What to evaluate:

  • Prompt injection defense: Anthropic's pilot found a 23.6% attack success rate in autonomous mode, reduced to 11.2% after adding classifiers and mandatory confirmations. Policy enforcement must operate at a layer the agent cannot reach or influence.
  • Tool authorization: A gateway should enforce policy before any tool call executes; a minimal gate is sketched after this list. AWS Bedrock AgentCore separates four concerns: gateway, identity, runtime, and policy.
  • MCP server integrity: The postmark-mcp incident, disclosed in late September 2025, is the first confirmed case of a malicious MCP server in the wild. Version 1.0.16 of the impersonating npm package introduced a single line of code that BCC'd every outgoing email to an attacker-controlled domain, exfiltrating messages from agent-driven workflows before npm removed the package. Merkle root signing and registry provenance tracking are required.
  • Compliance timelines: EU AI Act high-risk AI system obligations are enforceable August 2, 2026.
  • Standards baseline: OWASP LLM Top 10 and OWASP Agentic Top 10 are widely used frameworks that identify the most critical risks and offer practical guidance. A broader analysis of how security controls integrate with agent orchestration is available in this guide to AI security and data exfiltration.
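
The tool-authorization gate mentioned above, as a minimal sketch: policy lives outside the agent's reach and runs before anything executes. The policy table, tool stubs, and approval field are illustrative assumptions.

```python
# Policy enforced outside the agent's reach, before any tool executes.
# Tools here are stubs; the policy table is an illustrative assumption.
TOOL_REGISTRY = {
    "read_file": lambda path: f"<contents of {path}>",
    "send_email": lambda to, body: f"sent to {to}",
}

POLICY = {
    "read_file": {"requires_approval": False},
    "send_email": {"requires_approval": True},  # irreversible: gate it
}

def call_tool(agent_id: str, tool: str, args: dict,
              approved_by: str | None = None):
    policy = POLICY.get(tool)
    if policy is None:
        raise PermissionError(f"{agent_id}: tool '{tool}' is not allowlisted")
    if policy["requires_approval"] and approved_by is None:
        raise PermissionError(f"{agent_id}: '{tool}' needs human approval")
    return TOOL_REGISTRY[tool](**args)  # the agent cannot skip the gate

# call_tool("implementor-1", "send_email", {...}) raises without approval;
# passing approved_by="oncall@example.com" lets it through.
```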

Build vs. buy: Security must be designed into the architecture from the start. In multi-agent systems, communication topology, role decomposition, and memory sharing directly shape which threats materialize. NVIDIA NeMo Guardrails (Apache 2.0) provides open-source guardrail infrastructure, and retrofitting security onto an existing multi-agent architecture costs substantially more than designing it in from day one. Intent supports this design-in pattern by treating human-in-the-loop confirmations as a first-class part of every workspace, with Verifier checks and explicit approval gates on irreversible operations.

See how Intent's living specs keep parallel agents aligned as plans evolve across services, with Coordinator-driven orchestration and Verifier-led checks built in.

Build with Intent

Free tier available · VS Code extension · Takes 2 minutes


Cost Modeling: What Teams Spend on Agent Infrastructure in 2026

Token costs are one component of total cost of ownership at scale, and several other layers carry comparable weight. The TCO breakdown below, drawn from Scaled Agile Institute research, shows the relative weight of each cost layer:

| Cost Layer | % of TCO | Key Drivers |
| --- | --- | --- |
| Compute consumption | 15-30% | Token unit cost, GPU compute hours, model API fees |
| Platform and infrastructure | 25-40% | Security, MLOps/LLMOps, guardrails, observability, integration |
| Data foundation | 15-25% | Data cleaning, integration, quality, governance |
| People and change | 20-30% | Training, process redesign, team ramp-up |

Platform and infrastructure becomes a large share of TCO as workloads grow. Average enterprise LLM spend is projected to reach approximately $11.6M on current trajectories.

Where Costs Spike Non-Linearly

Four cost inflection points deserve specific budget attention:

  1. Context window thresholds: Gemini 2.5 Pro output pricing increases 50% above 200K tokens. Active context management in the orchestration layer prevents this; a budget-guard sketch follows this list. Workspaces that share context across agents, like Intent, avoid the redundant context payloads that push individual prompts past these thresholds.
  2. Web search tool calls: $10.00/1K calls (OpenAI, Anthropic); Google pricing varies by product, with some Google Search grounding services listed at $35.00/1K above free usage tiers. Web search costs compound quickly and add to token costs.
  3. Vector DB read units at high QPS: Pinecone Standard charges about $16/M read units, with storage priced separately.
  4. Observability overage: LangSmith charges $4,900+ at 50M spans. Extending retention from 14 to 400 days increases per-trace cost by about 10x.
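
The budget guard mentioned in point 1, as a minimal sketch: compact old history before the prompt crosses a pricing tier. The 200K threshold mirrors the Gemini 2.5 Pro tier; `summarize` is a stand-in for a cheap-model call and the token estimator is a rough heuristic, not a real tokenizer.

```python
# Keep prompts under a pricing-tier threshold by compacting old history.
# `summarize` is a stub for a cheap-model summarization call.
THRESHOLD_TOKENS = 200_000
HEADROOM = 0.9  # start compacting before the hard boundary

def estimate_tokens(text: str) -> int:
    return len(text) // 4  # ~4 chars/token heuristic; use a real tokenizer

def summarize(text: str) -> str:
    return text[:500] + " ...[compacted]"  # stand-in for a cheap-model call

def fit_context(history: list[str]) -> list[str]:
    while (sum(estimate_tokens(m) for m in history)
           > THRESHOLD_TOKENS * HEADROOM and len(history) > 10):
        # fold the ten oldest messages into one summary message
        history = [summarize("\n".join(history[:10]))] + history[10:]
    return history
```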

Published Cost Optimization Levers

Several published optimizations meaningfully reduce cost when applied at the orchestration layer. The table below summarizes savings cited by official provider documentation:

| Lever | Published Savings | Notes |
| --- | --- | --- |
| Batch API (OpenAI, Anthropic) | 50% on input and output | Documented on each provider's pricing page |
| Claude prompt caching (reads vs. standard) | 90% on repeated reads | Anthropic pricing |
| Model tier routing (Haiku vs. Opus) | ~80% on tasks routed from Opus to Haiku | Anthropic pricing |

Anthropic publishes the per-model rates that produce these numbers on its pricing page. Model routing is among the highest-impact cost optimization levers in the agentic infrastructure stack because it matches task complexity to model capability tier automatically. Teams running structured multi-agent workflows in Intent can route different roles in a Coordinator/Implementor/Verifier setup to different models per task, collapsing the Haiku-vs-Opus decision into a single workspace setting.
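
As an illustration of what that workspace setting automates, a hand-rolled tier router might look like the sketch below; the keyword heuristic and model names are placeholders for a real complexity classifier.

```python
# Toy complexity-based tier router; the heuristic and model IDs are
# illustrative, not a production classifier.
CHEAP, MID, FRONTIER = "claude-haiku", "claude-sonnet", "claude-opus"

HIGH_STAKES = ("architecture", "migration", "security", "design")

def pick_model(task: str) -> str:
    if any(keyword in task.lower() for keyword in HIGH_STAKES):
        return FRONTIER   # high-stakes reasoning earns the frontier model
    if len(task.split()) > 200:
        return MID        # long but routine work
    return CHEAP          # short, mechanical tasks

assert pick_model("rename this variable") == CHEAP
assert pick_model("plan the service migration") == FRONTIER
```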

Gartner projects inference costs on 1T parameter LLMs will drop over 90% by 2030. As token costs decline, infrastructure, security, and engineering costs become an increasing share of TCO.

A Starter Stack for Teams Deploying First Production Agents

Three guiding principles from production practitioners anchor the recommendations below: existing infrastructure wins over new categories, developer velocity beats architectural purity, and teams that track costs from day one make better decisions across the board.

The starter stack below pairs each layer with a default tool, license terms, starting cost, and the signal that should trigger a re-evaluation:

| Layer | Starter Tool | License | Starting Cost | Signal to Evolve |
| --- | --- | --- | --- | --- |
| Compute / LLM API | Claude Sonnet 4.6 or Gemini 2.5 Flash | Commercial | Pay-per-token (Claude Sonnet 4.6 ≈$315/month, Gemini 2.5 Flash ≈$46.50/month at 1M input + 500K output tokens/day) | Compliance mandate or 10M+ tokens/day sustained |
| Orchestration | LangGraph OSS, or Intent for spec-driven workspaces | MIT (LangGraph); Augment credits (Intent) | Free (LangGraph); Augment credits (Intent) | LangGraph Platform when ops burden exceeds team capacity; Intent when multi-agent coordination becomes the bottleneck |
| Context / Vector DB | Qdrant Cloud or Weaviate Cloud | Apache 2.0 / BSD-3 | Free tier, then $25-45/month | Self-host once workloads reach roughly 60M-100M queries/month |
| Observability | Langfuse Cloud | MIT | Free tier (50K events), then $29/month | Self-host when trace volume or compliance requires it |
| Security | NVIDIA NeMo Guardrails | Apache 2.0 | Free (OSS) | Add dedicated IAM when multi-cloud agent identity needed |
| LLM Gateway | LiteLLM | MIT | Free (OSS) | Cloudflare AI Gateway for enterprise caching at scale |
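
The starting-cost figures in the table follow from simple per-token arithmetic. A sketch under the table's assumptions (1M input + 500K output tokens/day over a 30-day month), using list prices of $3/$15 per million tokens for Claude Sonnet and $0.30/$2.50 for Gemini 2.5 Flash, which may change:

```python
# Reproduce the table's monthly starting-cost estimates. Prices are list
# prices per million tokens at the time of writing and may change.
def monthly_cost(in_per_m: float, out_per_m: float,
                 in_tokens_day: float = 1_000_000,
                 out_tokens_day: float = 500_000, days: int = 30) -> float:
    million = 1_000_000
    return ((in_tokens_day * days / million) * in_per_m
            + (out_tokens_day * days / million) * out_per_m)

print(monthly_cost(3.00, 15.00))  # Claude Sonnet: 315.0
print(monthly_cost(0.30, 2.50))   # Gemini 2.5 Flash: 46.5
```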

Seven Mistakes That Derail First Production Deployments

The mistakes below recur often enough across production deployments to budget for as known risks rather than edge cases:

  1. Using InMemorySaver in production: State lives in process memory and is lost on restart. PostgresSaver provides persistent state and fault tolerance with relatively modest configuration.
  2. Skipping HITL gates: Irreversible actions (data writes, external API calls) without human confirmation are a production incident waiting to happen.
  3. Treating token costs as the only cost: Platform and infrastructure represent 25-40% of TCO at scale. Budget observability and security alongside API spend.
  4. Choosing a framework without planning for switch cost: Migrating after sustained use means reworking state management, prompt dependencies, and error handling simultaneously. Evaluate the ceiling of the chosen framework before committing.
  5. Ignoring context rot: Accumulating conversation history without active management degrades performance measurably.
  6. Skipping the eval loop before production: The production trace to evaluation dataset pipeline is how teams measure and improve agent quality. Retrofitting the pipeline is much harder than building it from the start.
  7. Deploying MCP servers without integrity validation: The malicious postmark-mcp package shows that connecting to untrusted MCP servers is risky and underscores the value of integrity checks and provenance tracking.
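
A minimal integrity check along the lines of mistake 7, assuming the team pins a reviewed sha256 digest per MCP server package; the filename and digest are illustrative, and real deployments would lean on lockfiles and registry provenance attestations.

```python
# Refuse to launch an MCP server whose package digest does not match the
# value pinned at review time. Filename and digest are illustrative.
import hashlib
from pathlib import Path

PINNED_DIGESTS = {
    # tarball name -> sha256 recorded when the package was reviewed
    "postmark-mcp-1.0.15.tgz": "0000...example-digest-pinned-at-review",
}

def verify_mcp_package(tarball: Path) -> None:
    digest = hashlib.sha256(tarball.read_bytes()).hexdigest()
    expected = PINNED_DIGESTS.get(tarball.name)
    if expected is None or digest != expected:
        raise RuntimeError(
            f"refusing to launch unverified MCP server: {tarball.name}")
```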

Map the Five Layers Before You Scale Agent Spend

The agentic infrastructure stack is five interdependent layers, each with distinct build-vs-buy tradeoffs and cost dynamics. Token spend represents one cost layer, while platform and infrastructure can account for a large share of TCO. Teams that budget only for API calls discover this gap in production, when observability overages, security retrofits, and context management rework compound at the same time.

The concrete next step is to map the five layers against existing infrastructure, identify which components already exist (Kubernetes, PostgreSQL, monitoring), and budget for the gaps. Existing infrastructure wins over new categories, and tracking costs from day one produces better decisions at every subsequent scaling inflection point. For teams that want orchestration, context, and human-in-the-loop control packaged into a single workspace, Intent collapses several of these layers into a coordinated multi-agent environment with a living spec at the center.

See how Intent's living specs keep multi-agent work aligned as plans evolve across services.

Build with Intent

Free tier available · VS Code extension · Takes 2 minutes


Written by

Ani Galstian

Ani writes about enterprise-scale AI coding tool evaluation, agentic development security, and the operational patterns that make AI agents reliable in production. His guides cover topics like AGENTS.md context files, spec-as-source-of-truth workflows, and how engineering teams should assess AI coding tools across dimensions like auditability and security compliance.
