
Agentic Infrastructure: What Actually Goes in the Stack

May 3, 2026
Ani Galstian

Agentic infrastructure is the set of runtime systems, orchestration layers, state management services, tool-integration protocols, memory stores, security controls, and observability tooling required to deploy and operate autonomous AI agents reliably in production. It represents a distinct budget category because agents need stateful, multi-step execution with tool-calling capabilities that existing MLOps and LLM serving infrastructure do not cover.

TL;DR

Agentic AI infrastructure differs from LLM serving because agents maintain state across multi-step sessions, call external tools, and run for minutes to hours. The stack has five layers: compute, orchestration, context, observability, and security. At scale, platform and infrastructure costs become a major part of total cost of ownership alongside token spend. Workspaces like Intent, Augment Code's spec-driven multi-agent environment, address several of these layers in a single product surface, particularly orchestration, context, and human-in-the-loop control.

Why Agentic Infrastructure Needs Its Own Budget Line

Traditional LLM serving handles stateless, single request-response inference, and agents break that model. An agent perceives its environment, reasons through context, plans, and takes actions toward a goal. Execution requires persistent state per session across multiple inference calls, tool invocations, and potentially hours of runtime.

The lifecycle unit shift makes the distinction concrete. MLOps tracks model versions. LLM serving tracks requests. Agentic infrastructure tracks sessions and tasks as the atomic unit of operation, which is why workspaces like Intent organize work around isolated git worktrees where each agent session runs as a first-class artifact rather than a transient API call.

Three architectural breaks force the category separation:

  • State persistence: LLM serving achieves economic efficiency through stateless shared infrastructure, while agents require per-session state across multiple inference calls and tool executions. Google Cloud's documentation describes Cloud Run as stateless and without built-in persistent storage, so multi-step agent workflows typically rely on external storage or state services.
  • Duration mismatch: AWS Bedrock AgentCore supports agentic workloads up to 8 hours, well beyond typical request limits in conventional inference infrastructure.
  • Tool-use security: The OWASP Top 10 for Agentic Applications identifies agent-specific risks including tool misuse and exploitation, identity and privilege abuse, and insecure inter-agent communication. Mitigations require infrastructure-layer controls that LLM serving does not provide.
| Dimension | LLM Serving Infra | Agentic Infrastructure |
| --- | --- | --- |
| Execution model | Single request/response | Iterative loops with tool calls |
| State | Stateless | Stateful sessions, persistent memory |
| Lifecycle unit | Request | Agent session / task |
| Task duration | Milliseconds to seconds | Minutes to hours |

Cloud providers have begun separating the category through dedicated product lines. AWS launched Bedrock AgentCore as a separate service from SageMaker (MLOps) and Bedrock inference. Google Cloud published dedicated agentic architecture guidance. Azure published Agent Factory patterns covering agent-specific infrastructure challenges.

Gartner's January 2026 forecast puts total worldwide AI spending at about $2.5 trillion in 2026, with AI infrastructure as the largest sub-segment at roughly $1.37 trillion. Gartner has also published a dedicated Hype Cycle for Agentic AI, and Forrester discusses agentic AI and adaptive process orchestration as emerging enterprise capabilities.

Teams planning 2026-2027 agent rollouts need to budget for five distinct infrastructure layers. The sections below cover what each layer contains, what to evaluate, and where the money goes.

See how Intent's coordinated multi-agent workspace replaces ad-hoc orchestration scripts with living specs that stay aligned across services.

Build with Intent

Free tier available · VS Code extension · Takes 2 minutes

The Five Layers of the Agentic Infrastructure Stack

Agentic infrastructure organizes into five layers, each with its own responsibilities, evaluation criteria, and build-vs-buy tradeoffs. The layers interact: orchestration governs context retrieval, observability spans all layers, and security design belongs in the architecture from day one rather than bolted on after deployment.

Layer 1: Compute

The compute layer handles GPU provisioning, inference hosting, and autoscaling for the LLM calls that power agent reasoning. For most teams, this layer means managed API providers rather than self-hosted GPU infrastructure.

What to evaluate:

  • Token pricing tiers: Gemini 2.5 Flash at $0.30/M input tokens compared with GPT-5.4 at $2.50/M input tokens shows a wide price spread, though these models occupy different capability tiers.
  • Batch API availability: OpenAI and Anthropic both offer roughly 50% discounts on batched, non-interactive requests.
  • Autoscaling for bursty workloads: User-triggered agents can fan out into bursts of concurrent sessions, so queue depth and batch size track inference load better than GPU utilization does (a minimal scaling sketch follows this list).
  • Session-hour pricing: Anthropic charges $0.08 per session hour for managed agents, a pricing model with no equivalent in standard inference.
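
To make the queue-depth signal concrete, here is a minimal scaling heuristic, assuming a worker pool sized from backlog rather than GPU utilization; the function name, target batch size, and replica cap are illustrative, not any provider's API.

```python
# Sketch: derive replica count from queue depth and target batch size.
# Thresholds and names are illustrative assumptions.
import math

def desired_replicas(queue_depth: int, target_batch: int = 8,
                     max_replicas: int = 20) -> int:
    """Size the worker pool so each replica handles ~one batch of sessions."""
    if queue_depth == 0:
        return 1  # keep one warm replica to avoid cold-start latency
    return min(max_replicas, math.ceil(queue_depth / target_batch))

# A burst of 37 queued agent sessions scales to 5 replicas; 400 caps at 20.
assert desired_replicas(37) == 5
assert desired_replicas(400) == 20
```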

Build vs. buy: API providers are the correct default. Self-hosting makes economic sense only when compliance mandates prohibit third-party data processing, or when sustained high-concurrency open-source model workloads create a large cost differential against proprietary APIs.

Multi-provider routing through open-source gateways like LiteLLM or Portkey reduces model-level lock-in from day one. The a16z CIO survey found 37% of enterprises use five or more models to avoid vendor dependency. Intent supports this pattern directly through its model picker, letting teams use Opus for complex architecture, Sonnet for rapid iteration, and GPT-5.2 for deep code analysis within a single multi-agent run.
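
As a sketch of what gateway-level routing looks like in practice, the snippet below uses LiteLLM's unified completion call to fall through an ordered provider list. The model IDs are illustrative, and API keys are assumed to be set in the environment.

```python
# Multi-provider fallback through LiteLLM's unified interface. Model IDs
# are illustrative; keys (ANTHROPIC_API_KEY, OPENAI_API_KEY, ...) are
# read from the environment.
from litellm import completion

FALLBACK_ORDER = ["claude-sonnet-4-20250514", "gpt-4o",
                  "gemini/gemini-2.5-flash"]

def ask(prompt: str) -> str:
    last_err: Exception | None = None
    for model in FALLBACK_ORDER:  # try providers in order of preference
        try:
            resp = completion(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            )
            return resp.choices[0].message.content
        except Exception as err:
            last_err = err  # provider failed; fall through to the next
    raise RuntimeError("all providers failed") from last_err
```

LiteLLM also ships a Router with built-in fallback and routing strategies; the manual loop above just makes the mechanism visible.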

Layer 2: Orchestration

The orchestration layer manages agent execution loops: LLM reasoning, tool calling, state persistence, multi-agent coordination, retry logic, and human-in-the-loop controls. Graph-based control flow has become central in production framework design.

The criteria below describe what teams should look for when evaluating an orchestration layer:

| Criterion | What to Look For |
| --- | --- |
| State persistence | Checkpoint backends (PostgreSQL, MongoDB); recovery at failing node, not full restart |
| Human-in-the-loop | Native interrupt primitives before irreversible operations |
| Multi-agent coordination | Supervisor patterns, handoffs, subgraph composition |
| Error recovery | Checkpoint-based resumption versus conversation restart |
| Vendor lock-in | Model-agnostic design; switching cost for state management and prompt dependencies |

Framework selection: LangGraph offers built-in state persistence through graph-based checkpointing. CrewAI delivers faster time-to-value for role-based workflows, though teams should weigh long-term production constraints carefully. The OpenAI Agents SDK optimizes for minimal configuration, but teams pursuing multi-model strategies may weigh potential lock-in trade-offs against more provider-neutral frameworks. Teams comparing production framework options can find a deeper breakdown of six patterns for multi-agent code development, covering spec-driven decomposition, worktree isolation, and the coordinator/specialist/verifier role split.

Build vs. buy: Open source is the right default for orchestration. LangGraph is production-grade for complex stateful workflows, and no managed equivalent matches its capability on branching and checkpoint-based error recovery. Microsoft Agent Framework targets Azure-native AI agent development and workflow orchestration. Teams that want orchestration packaged as a workspace rather than a library can use Intent, which runs a Coordinator/Implementor/Verifier pattern out of the box and keeps every agent reading from and writing to the same living spec. For teams whose orchestration needs extend beyond a single workspace, with agents triggered by SDLC events like Linear tickets or incident alerts, Augment Cosmos is a platform-level environment that adds shared filesystem, organization-wide memory, and human-in-the-loop policies across the broader development lifecycle. Cosmos is currently in research preview for MAX plan users.
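
To ground the checkpointing and HITL criteria from the table above, here is a minimal LangGraph sketch, assuming a local Postgres instance and an illustrative two-node workflow; the connection string, state shape, and node bodies are placeholders, not a production design.

```python
# Durable state plus a human-in-the-loop gate in LangGraph. Connection
# string, state fields, and node logic are illustrative assumptions.
from typing import TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.postgres import PostgresSaver

class State(TypedDict):
    plan: str
    applied: bool

def draft_plan(state: State) -> dict:
    return {"plan": "rename column, backfill, drop old column"}

def apply_changes(state: State) -> dict:
    return {"applied": True}  # the irreversible step

builder = StateGraph(State)
builder.add_node("draft_plan", draft_plan)
builder.add_node("apply_changes", apply_changes)
builder.add_edge(START, "draft_plan")
builder.add_edge("draft_plan", "apply_changes")
builder.add_edge("apply_changes", END)

with PostgresSaver.from_conn_string("postgresql://app@localhost/agents") as cp:
    cp.setup()  # create checkpoint tables on first run
    graph = builder.compile(checkpointer=cp,
                            interrupt_before=["apply_changes"])
    config = {"configurable": {"thread_id": "session-42"}}
    graph.invoke({"plan": "", "applied": False}, config)  # pauses at the gate
    graph.invoke(None, config)  # resume from checkpoint after human approval
```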

Critical gap: Many major agent frameworks do not natively provide enterprise-grade resilience patterns such as circuit breakers, payload validation, or deterministic fallback routing. The retry and error handling these frameworks ship typically stops at the individual API call, so teams must build higher-level resilience above the framework layer; one such pattern is sketched below.
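
A minimal circuit breaker, as one example of the resilience layer teams end up adding themselves; the thresholds and naming are illustrative.

```python
# Circuit breaker for flaky tool or LLM endpoints: fail fast while the
# circuit is open, probe again after a cooldown. Thresholds are illustrative.
import time

class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.failures >= self.max_failures:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.failures = self.max_failures - 1  # half-open: one probe call
        try:
            result = fn(*args, **kwargs)
            self.failures = 0  # success closes the circuit
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
```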

Layer 3: Context

The context layer manages what information agents can access at each reasoning step: retrieval pipelines, vector databases, agent memory, and knowledge graphs. All LLMs are bounded by finite context windows, which makes context engineering a discipline that production systems must plan for explicitly rather than treat as an afterthought.

What to evaluate:

  • Memory architecture: Short-term (in-context window management) and long-term (external stores for episodic, semantic, and procedural memory). Production frameworks typically combine short-term context with external or long-term storage rather than relying on a single standardized model.
  • Context rot: Chroma's research shows degradation as input token count increases. Teams must actively manage context through compaction, deletion, and scratchpads.
  • Vector database selection: Pinecone (fully managed, $50/month minimum on Standard plans), Weaviate (open-source, native multi-tenancy for per-agent memory isolation), and Qdrant (Apache 2.0, hybrid search filtering). For selection criteria across the full vendor field, see this breakdown of context engineering patterns for agentic swarm coding.
  • RAG pipeline governance: The orchestration layer should govern retrieval. The LLM should never have direct access to the vector database. Retrieval becomes a predictable routing mechanism rather than ad hoc model improvisation.
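
A minimal sketch of that governance pattern follows, with a stubbed `vector_store` standing in for a Qdrant, Weaviate, or Pinecone client; the collection name, filter fields, and result cap are illustrative.

```python
# The LLM never queries the vector DB directly; it only sees the output of
# `retrieve`, which enforces tenant scoping and a result cap. `VectorStore`
# is a hypothetical stand-in, not a specific vendor SDK.
from dataclasses import dataclass

@dataclass
class Hit:
    text: str

class VectorStore:
    def search(self, collection: str, query: str,
               filters: dict, limit: int) -> list[Hit]:
        return [Hit(text=f"[doc matching {query!r}]")]  # stubbed result

vector_store = VectorStore()
MAX_SNIPPETS = 5

def retrieve(query: str, agent_id: str) -> list[str]:
    """The only retrieval path the agent can invoke."""
    hits = vector_store.search(
        collection="docs",
        query=query,
        filters={"tenant": agent_id},  # per-agent memory isolation
        limit=MAX_SNIPPETS,            # hard cap on injected context
    )
    return [hit.text for hit in hits]

def build_prompt(task: str, agent_id: str) -> str:
    context = "\n".join(retrieve(task, agent_id))
    return f"Context:\n{context}\n\nTask: {task}"
```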

Build vs. buy: Memory and context management remains the least mature layer for off-the-shelf solutions. Managed options like Mem0 and Zep cover standard use cases, but no plug-and-play memory layer has yet emerged as a de facto standard. Production teams typically build custom memory architectures on top of vector databases combined with relational stores. Intent's Context Engine processes 400,000+ files through semantic dependency analysis, giving every agent in a workspace shared architectural understanding without each agent recomputing context independently.

Layer 4: Observability

The observability layer provides tracing, evaluation, guardrails, and cost monitoring for agent execution. Traditional APM tools fall short here because agents handle memory, context, and decisions across many turns. LLM monitoring surfaces that an agent failed, while agent observability surfaces which step in the reasoning loop caused the failure.

The capabilities below define what production observability should cover:

| Capability | Why It Matters |
| --- | --- |
| Hierarchical execution tracing | Capture every LLM call, tool invocation, retrieval step, and agent handoff in a single trace |
| Session-level evaluation | Evaluate goal completion across multi-turn conversations, beyond individual outputs |
| Agent trajectory evaluation | Assess the decision path an agent takes, not only the final output |
| Cost attribution | Per-call, per-trace, per-session, and per-user/team cost breakdowns |
| OpenTelemetry alignment | OTel GenAI conventions define agent span attributes and reduce vendor lock-in |
| Production-trace-to-eval pipeline | Capture production failures and promote them to evaluation datasets |

For teams standardizing on OpenTelemetry, Arize's writeup on agent trajectory evaluation and the OTel GenAI agent span specification describe how to implement these capabilities consistently across vendors. The same Arize team has published a longer review of production-to-eval pipelines.
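
For orientation, here is a hierarchical-tracing sketch using the OpenTelemetry Python API. The attribute names follow the incubating OTel GenAI semantic conventions; the span structure and values are illustrative, and an SDK exporter must be configured separately.

```python
# Hierarchical agent tracing: one parent span per agent invocation, child
# spans per LLM call and tool call. Attribute names follow the incubating
# GenAI semantic conventions; values are illustrative.
from opentelemetry import trace

tracer = trace.get_tracer("agent-service")

def run_agent(task: str) -> str:
    with tracer.start_as_current_span("invoke_agent") as agent_span:
        agent_span.set_attribute("gen_ai.operation.name", "invoke_agent")
        agent_span.set_attribute("gen_ai.agent.name", "implementor")

        # child span: one LLM call inside the reasoning loop
        with tracer.start_as_current_span("chat") as llm_span:
            llm_span.set_attribute("gen_ai.request.model", "claude-sonnet")
            llm_span.set_attribute("gen_ai.usage.input_tokens", 1842)

        # child span: one tool invocation, nested under the same trace
        with tracer.start_as_current_span("execute_tool") as tool_span:
            tool_span.set_attribute("gen_ai.tool.name", "run_tests")

        return "done"
```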

Platform options: Langfuse (MIT licensed, self-hostable, framework-agnostic, starting at $29/month for the managed cloud tier) suits compliance-sensitive teams. Arize Phoenix (open-source, OTel-native, 2.5M+ monthly downloads) provides agent trajectory evaluation. LangSmith offers deep LangGraph integration, but per-seat pricing scales poorly for large teams. A broader review of monitoring options is available in this comparison of observability platforms for AI coding assistants.

Build vs. buy: Strong open-source options make this one of the clearest cases for OSS-first. Self-hosted Langfuse eliminates per-event costs entirely. At 50M spans, LangSmith overage charges alone exceed $4,900.

Layer 5: Security

The security layer addresses threat models unique to agentic systems: prompt injection, tool-use authorization, sandboxed execution, agent identity management, and compliance. The threat surface differs fundamentally from LLM applications because agents can query databases, call APIs, execute code, and chain tool calls without human review of each step.

What to evaluate:

  • Prompt injection defense: Anthropic's pilot found a 23.6% attack success rate in autonomous mode, reduced to 11.2% after adding classifiers and mandatory confirmations. Policy enforcement must operate at a layer the agent cannot reach or influence.
  • Tool authorization: A gateway should enforce policy before any tool call executes; a minimal gate is sketched after this list. AWS Bedrock AgentCore separates four concerns: gateway, identity, runtime, and policy.
  • MCP server integrity: The postmark-mcp incident, disclosed in late September 2025, is the first confirmed case of a malicious MCP server in the wild. Version 1.0.16 of the impersonating npm package introduced a single line of code that BCC'd every outgoing email to an attacker-controlled domain, exfiltrating messages from agent-driven workflows before npm removed the package. Merkle root signing and registry provenance tracking are required.
  • Compliance timelines: EU AI Act high-risk AI system obligations are enforceable August 2, 2026.
  • Standards baseline: OWASP LLM Top 10 and OWASP Agentic Top 10 are widely used frameworks that identify the most critical risks and offer practical guidance. A broader analysis of how security controls integrate with agent orchestration is available in this guide to AI security and data exfiltration.
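
The tool-authorization gate mentioned above, as a minimal sketch: policy lives outside the agent's reach and runs before anything executes. The policy table, tool stubs, and approval field are illustrative assumptions.

```python
# Policy enforced outside the agent's reach, before any tool executes.
# Tools here are stubs; the policy table is an illustrative assumption.
TOOL_REGISTRY = {
    "read_file": lambda path: f"<contents of {path}>",
    "send_email": lambda to, body: f"sent to {to}",
}

POLICY = {
    "read_file": {"requires_approval": False},
    "send_email": {"requires_approval": True},  # irreversible: gate it
}

def call_tool(agent_id: str, tool: str, args: dict,
              approved_by: str | None = None):
    policy = POLICY.get(tool)
    if policy is None:
        raise PermissionError(f"{agent_id}: tool '{tool}' is not allowlisted")
    if policy["requires_approval"] and approved_by is None:
        raise PermissionError(f"{agent_id}: '{tool}' needs human approval")
    return TOOL_REGISTRY[tool](**args)  # the agent cannot skip the gate

# call_tool("implementor-1", "send_email", {...}) raises without approval;
# passing approved_by="oncall@example.com" lets it through.
```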

Build vs. buy: Security must be designed into the architecture from the start. In multi-agent systems, communication topology, role decomposition, and memory sharing directly shape which threats materialize. NVIDIA NeMo Guardrails (Apache 2.0) provides open-source guardrail infrastructure, and retrofitting security onto an existing multi-agent architecture costs substantially more than designing it in from day one. Intent supports this design-in pattern by treating human-in-the-loop confirmations as a first-class part of every workspace, with Verifier checks and explicit approval gates on irreversible operations.

See how Intent's living specs keep parallel agents aligned as plans evolve across services, with Coordinator-driven orchestration and Verifier-led checks built in.

Build with Intent

Free tier available · VS Code extension · Takes 2 minutes


Cost Modeling: What Teams Spend on Agent Infrastructure in 2026

Token costs are one component of total cost of ownership at scale, and several other layers carry comparable weight. The TCO breakdown below, drawn from Scaled Agile Institute research, shows the relative weight of each cost layer:

| Cost Layer | % of TCO | Key Drivers |
| --- | --- | --- |
| Compute consumption | 15-30% | Token unit cost, GPU compute hours, model API fees |
| Platform and infrastructure | 25-40% | Security, MLOps/LLMOps, guardrails, observability, integration |
| Data foundation | 15-25% | Data cleaning, integration, quality, governance |
| People and change | 20-30% | Training, process redesign, team ramp-up |

Platform and infrastructure becomes a large share of TCO as workloads grow. Average enterprise LLM spend is projected to reach approximately $11.6M on current trajectories.

Where Costs Spike Non-Linearly

Four cost inflection points deserve specific budget attention:

  1. Context window thresholds: Gemini 2.5 Pro output pricing increases 50% above 200K tokens. Active context management in the orchestration layer prevents this; a budget-guard sketch follows this list. Workspaces that share context across agents, like Intent, avoid the redundant context payloads that push individual prompts past these thresholds.
  2. Web search tool calls: $10.00/1K calls (OpenAI, Anthropic); Google pricing varies by product, with some Google Search grounding services listed at $35.00/1K above free usage tiers. Web search costs compound quickly and add to token costs.
  3. Vector DB read units at high QPS: Pinecone Standard charges about $16/M read units, with storage priced separately.
  4. Observability overage: LangSmith charges $4,900+ at 50M spans. Extending retention from 14 to 400 days increases per-trace cost by about 10x.
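
The budget guard mentioned in point 1, as a minimal sketch: compact old history before the prompt crosses a pricing tier. The 200K threshold mirrors the Gemini 2.5 Pro tier; `summarize` is a stand-in for a cheap-model call and the token estimator is a rough heuristic, not a real tokenizer.

```python
# Keep prompts under a pricing-tier threshold by compacting old history.
# `summarize` is a stub for a cheap-model summarization call.
THRESHOLD_TOKENS = 200_000
HEADROOM = 0.9  # start compacting before the hard boundary

def estimate_tokens(text: str) -> int:
    return len(text) // 4  # ~4 chars/token heuristic; use a real tokenizer

def summarize(text: str) -> str:
    return text[:500] + " ...[compacted]"  # stand-in for a cheap-model call

def fit_context(history: list[str]) -> list[str]:
    while (sum(estimate_tokens(m) for m in history)
           > THRESHOLD_TOKENS * HEADROOM and len(history) > 10):
        # fold the ten oldest messages into one summary message
        history = [summarize("\n".join(history[:10]))] + history[10:]
    return history
```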

Published Cost Optimization Levers

Several published optimizations meaningfully reduce cost when applied at the orchestration layer. The table below summarizes savings cited by official provider documentation:

| Lever | Published Savings | Notes |
| --- | --- | --- |
| Batch API (OpenAI, Anthropic) | 50% on input and output | Documented on each provider's pricing page |
| Claude prompt caching (reads vs. standard) | 90% on repeated reads | Anthropic pricing |
| Model tier routing (Haiku vs. Opus) | ~80% on tasks routed from Opus to Haiku | Anthropic pricing |

Anthropic publishes the per-model rates that produce these numbers on its pricing page. Model routing is among the highest-impact cost optimization levers in the agentic infrastructure stack because it matches task complexity to model capability tier automatically. Teams running structured multi-agent workflows in Intent can route different roles in a Coordinator/Implementor/Verifier setup to different models per task, collapsing the Haiku-vs-Opus decision into a single workspace setting.
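
As an illustration of what that workspace setting automates, a hand-rolled tier router might look like the sketch below; the keyword heuristic and model names are placeholders for a real complexity classifier.

```python
# Toy complexity-based tier router; the heuristic and model IDs are
# illustrative, not a production classifier.
CHEAP, MID, FRONTIER = "claude-haiku", "claude-sonnet", "claude-opus"

HIGH_STAKES = ("architecture", "migration", "security", "design")

def pick_model(task: str) -> str:
    if any(keyword in task.lower() for keyword in HIGH_STAKES):
        return FRONTIER   # high-stakes reasoning earns the frontier model
    if len(task.split()) > 200:
        return MID        # long but routine work
    return CHEAP          # short, mechanical tasks

assert pick_model("rename this variable") == CHEAP
assert pick_model("plan the service migration") == FRONTIER
```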

Gartner projects inference costs on 1T parameter LLMs will drop over 90% by 2030. As token costs decline, infrastructure, security, and engineering costs become an increasing share of TCO.

A Starter Stack for Teams Deploying First Production Agents

Three guiding principles from production practitioners anchor the recommendations below: existing infrastructure wins over new categories, developer velocity beats architectural purity, and teams that track costs from day one make better decisions across the board.

The starter stack below pairs each layer with a default tool, license terms, starting cost, and the signal that should trigger a re-evaluation:

| Layer | Starter Tool | License | Starting Cost | Signal to Evolve |
| --- | --- | --- | --- | --- |
| Compute / LLM API | Claude Sonnet 4.6 or Gemini 2.5 Flash | Commercial | Pay-per-token (Claude Sonnet 4.6 ≈$315/month, Gemini 2.5 Flash ≈$46.50/month at 1M input + 500K output tokens/day) | Compliance mandate or 10M+ tokens/day sustained |
| Orchestration | LangGraph OSS, or Intent for spec-driven workspaces | MIT (LangGraph); Augment credits (Intent) | Free (LangGraph); Augment credits (Intent) | LangGraph Platform when ops burden exceeds team capacity; Intent when multi-agent coordination becomes the bottleneck |
| Context / Vector DB | Qdrant Cloud or Weaviate Cloud | Apache 2.0 / BSD-3 | Free tier, then $25-45/month | Self-host once workloads reach roughly 60M-100M queries/month |
| Observability | Langfuse Cloud | MIT | Free tier (50K events), then $29/month | Self-host when trace volume or compliance requires it |
| Security | NVIDIA NeMo Guardrails | Apache 2.0 | Free (OSS) | Add dedicated IAM when multi-cloud agent identity needed |
| LLM Gateway | LiteLLM | MIT | Free (OSS) | Cloudflare AI Gateway for enterprise caching at scale |
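
The starting-cost figures in the table follow from simple per-token arithmetic. A sketch under the table's assumptions (1M input + 500K output tokens/day over a 30-day month), using list prices of $3/$15 per million tokens for Claude Sonnet and $0.30/$2.50 for Gemini 2.5 Flash, which may change:

```python
# Reproduce the table's monthly starting-cost estimates. Prices are list
# prices per million tokens at the time of writing and may change.
def monthly_cost(in_per_m: float, out_per_m: float,
                 in_tokens_day: float = 1_000_000,
                 out_tokens_day: float = 500_000, days: int = 30) -> float:
    million = 1_000_000
    return ((in_tokens_day * days / million) * in_per_m
            + (out_tokens_day * days / million) * out_per_m)

print(monthly_cost(3.00, 15.00))  # Claude Sonnet: 315.0
print(monthly_cost(0.30, 2.50))   # Gemini 2.5 Flash: 46.5
```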

Seven Mistakes That Derail First Production Deployments

The mistakes below recur often enough across production deployments to budget for as known risks rather than edge cases:

  1. Using InMemorySaver in production: State lives in process memory and is lost on restart. PostgresSaver provides persistent state and fault tolerance with relatively modest configuration.
  2. Skipping HITL gates: Irreversible actions (data writes, external API calls) without human confirmation are a production incident waiting to happen.
  3. Treating token costs as the only cost: Platform and infrastructure represent 25-40% of TCO at scale. Budget observability and security alongside API spend.
  4. Choosing a framework without planning for switch cost: Migrating after sustained use means reworking state management, prompt dependencies, and error handling simultaneously. Evaluate the ceiling of the chosen framework before committing.
  5. Ignoring context rot: Accumulating conversation history without active management degrades performance measurably.
  6. Skipping the eval loop before production: The production trace to evaluation dataset pipeline is how teams measure and improve agent quality. Retrofitting the pipeline is much harder than building it from the start.
  7. Deploying MCP servers without integrity validation: The malicious postmark-mcp package shows that connecting to untrusted MCP servers is risky and underscores the value of integrity checks and provenance tracking.
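
A minimal integrity check along the lines of mistake 7, assuming the team pins a reviewed sha256 digest per MCP server package; the filename and digest are illustrative, and real deployments would lean on lockfiles and registry provenance attestations.

```python
# Refuse to launch an MCP server whose package digest does not match the
# value pinned at review time. Filename and digest are illustrative.
import hashlib
from pathlib import Path

PINNED_DIGESTS = {
    # tarball name -> sha256 recorded when the package was reviewed
    "postmark-mcp-1.0.15.tgz": "0000...example-digest-pinned-at-review",
}

def verify_mcp_package(tarball: Path) -> None:
    digest = hashlib.sha256(tarball.read_bytes()).hexdigest()
    expected = PINNED_DIGESTS.get(tarball.name)
    if expected is None or digest != expected:
        raise RuntimeError(
            f"refusing to launch unverified MCP server: {tarball.name}")
```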

Map the Five Layers Before You Scale Agent Spend

The agentic infrastructure stack is five interdependent layers, each with distinct build-vs-buy tradeoffs and cost dynamics. Token spend represents one cost layer, while platform and infrastructure can account for a large share of TCO. Teams that budget only for API calls discover this gap in production, when observability overages, security retrofits, and context management rework compound at the same time.

The concrete next step is to map the five layers against existing infrastructure, identify which components already exist (Kubernetes, PostgreSQL, monitoring), and budget for the gaps. Existing infrastructure wins over new categories, and tracking costs from day one produces better decisions at every subsequent scaling inflection point. For teams that want orchestration, context, and human-in-the-loop control packaged into a single workspace, Intent collapses several of these layers into a coordinated multi-agent environment with a living spec at the center.

See how Intent's living specs keep multi-agent work aligned as plans evolve across services.

Build with Intent

Free tier available · VS Code extension · Takes 2 minutes


Written by

Ani Galstian

Ani writes about enterprise-scale AI coding tool evaluation, agentic development security, and the operational patterns that make AI agents reliable in production. His guides cover topics like AGENTS.md context files, spec-as-source-of-truth workflows, and how engineering teams should assess AI coding tools across dimensions like auditability and security compliance.
