Teams building custom agent infrastructure should reassess the approach after the first production rebuild, especially given the ongoing costs of maintenance, monitoring, integration, and compliance.
TL;DR
For teams assigning 5-10 engineers for 12-18 months, internal platform labor and maintenance remain the main 24-month cost variables. A community analysis of Claude Code's leaked codebase found that 98.4% of production agent code is deterministic infrastructure, with only 1.6% handling AI decision logic. Rebuild pressure usually comes from runtime, memory, observability, and governance layers discovered after the prototype succeeds.
Why Agent Infrastructure Scope Expands After the Prototype
The trap usually starts with a working prototype. A senior engineer connects a framework to a few tools, gets an agent demo running in two weeks, and leadership sees a fast path to production. After the demo, the organization owns far more infrastructure than the prototype suggested. Permission scoping, checkpoint persistence, model migration, context isolation, observability, and governance rarely block the prototype, but they all become production responsibilities the moment it goes live.
Many teams discover too late that they budgeted for the agent logic when the larger cost sits in the runtime, memory, context, observability, and governance layers underneath it. The examples below show how that gap surfaces in production as rebuilds, tool reduction, architecture simplification, and migration pressure. The sections that follow map that hidden scope, compare DIY against Cosmos across six dimensions, and outline when to keep building and when to stop.
Cosmos is Augment Code's Unified Cloud Agents Platform, currently in public preview for MAX plan users. It runs agents in the cloud with shared context and memory that compound across the team and the software development lifecycle, treating runtime, context, governance, and human-in-the-loop as shared primitives the platform provides directly.
See how Cosmos consolidates runtime, context, and governance into one platform surface.
Free tier available · VS Code extension · Takes 2 minutes
The Build Instinct Stays Rational Only Up to a Point
A familiar pattern emerges in engineering organizations adopting agent frameworks.
- A senior engineer spins up a LangGraph pipeline, connects it to a few tools, gets a demo working in two weeks, and the CTO greenlights a full build.
- Six months later, the team has spent more time on permission scoping, checkpoint persistence, and model migration than on the agent logic itself.
- The instinct to build is rational: your codebase is unique, your compliance requirements are specific, and your team has the talent.
- The problem is scope: teams budget for building an agent when the actual scope covers the platform the agent runs on.
- Choosing to build means assembling agentic frameworks, orchestration layers, custom governance, and the underlying infrastructure to run it all.
- Much of the implementation burden sits outside model logic, especially in governance and workflow integration.
In production, the organization owns a deeper set of responsibilities than the prototype suggested. The community analysis of Claude Code's leaked codebase examined its size, tool structure, and architecture, and the deterministic layer it documents includes the runtime and governance work teams usually discover only after the prototype succeeds. That is the harness engineering burden in practice: a deterministic substrate that has to be built, maintained, and governed for every model the team plugs in.
Why Most Teams Build Agent Infrastructure Twice
Engineering teams rebuild agent infrastructure because the failure modes that trigger a rebuild stay invisible at prototype stage. They do not appear in demos, synthetic test data does not trigger them, and standard APM observability does not surface them.
Three documented production cases illustrate the pattern:
Vercel removed 80% of their agent's tools and watched success rates climb from 80% to 100%. The team reported the result itself and did not publish a quoted conclusion that simplification was what improved success rates.
Manus rebuilt their agent framework multiple times. Each rebuild followed the discovery that context management was the actual bottleneck, a pattern the team documents in their context engineering lessons: "We affectionately refer to this manual process of architecture searching, prompt fiddling, and empirical guesswork as 'Stochastic Graduate Descent.'"
EGO Digital discovered a high failure rate in their document processing agent. After decomposing it into three specialized agents with individual model tiers, timeout budgets, and output schemas, reliability improved materially and cost per document fell.
Across these cases, the recurring signals are consistent:
- repeated rebuilds
- tool reduction improving outcomes
- behavioral issues discovered late
- architecture simplification outperforming added complexity
Philipp Schmid, a Staff Engineer at Google DeepMind, captured the pattern precisely in his context engineering work: if your harness keeps getting more complex while models improve, that is a sign of over-engineering.
For teams evaluating managed options, Cosmos is the operating system for agentic software development, combining agent runtime, shared context, and governance-related capabilities in one platform.
What You're Actually Building: Runtime, Context, Memory, Observability, Governance
When mapped as a production-grade agent platform, the scope expands across runtime, context, memory, observability, and governance.
Runtime
Official LangChain materials highlight production-agent capabilities such as streaming, task queues, checkpointing, human-in-the-loop support, and tracing. Available sources do not confirm a standardized list of exactly six features (parallelization, streaming, task queue, checkpointing, human-in-the-loop, and tracing), so treat these as commonly cited capabilities and not as a formal checklist.
Scheduling is supported via LangGraph's and LangSmith's built-in cron jobs, so teams do not need to build their own cron and task queue infrastructure from scratch.
Retry semantics are often mismatched to agent failures. When failures are persistent semantic errors and not transient transport errors, retry-with-backoff is the wrong pattern. The right pattern is raise-and-replan.
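The distinction between the two patterns can be sketched as follows. This is a minimal illustration, not any framework's API: `TransientError`, `SemanticError`, `run_step`, and the `execute` and `replan` callables are all hypothetical names introduced here.

```python
import time


class TransientError(Exception):
    """Transport-level failure (timeout, dropped connection) that may succeed on retry."""


class SemanticError(Exception):
    """Persistent failure (wrong tool, invalid arguments) that retrying cannot fix."""


def run_step(step, execute, replan, max_retries=3, max_replans=2, base_delay=0.1):
    """Retry transient failures with backoff; send semantic failures back to the planner."""
    retries = 0
    replans = 0
    while True:
        try:
            return execute(step)
        except TransientError:
            retries += 1
            if retries > max_retries:
                raise
            # Exponential backoff only makes sense when the same call might succeed later.
            time.sleep(base_delay * 2 ** (retries - 1))
        except SemanticError as err:
            replans += 1
            if replans > max_replans:
                raise
            # Re-running the identical step would fail the same way, so ask for a new one.
            step = replan(step, err)
            retries = 0
```

The replan budget (`max_replans`) is an assumption added so that a planner which keeps producing invalid steps fails loudly instead of looping.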
Context Management
Anthropic's official engineering materials discuss context engineering and related strategies such as memory, compaction, and tool clearing. They do not present a formal five-part taxonomy of context types labeled input context, runtime context, context compression, context isolation, and long-term memory. An arXiv paper on agent token consumption reports that token cost can vary substantially across agentic runs. Manus reported that, with Claude Sonnet, cached input tokens cost about 10% of the base input token price, and changes in prompt prefixes including timestamps or non-deterministic ordering can cause cache misses that wipe out the savings.
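The prefix-stability point can be illustrated with a small sketch. It assumes a generic prompt layout and uses a SHA-256 hash only to show whether the prefix bytes changed; it does not model any provider's actual caching mechanism, and `build_prompt` and `prefix_key` are hypothetical helpers.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_prompt(system_prompt, tools, user_message, now=None):
    """Return (prefix, suffix); only the prefix is intended to be cache-eligible."""
    # Sort tools by name and serialize with sorted keys so the prefix bytes
    # do not depend on list or dict ordering.
    stable_tools = json.dumps(sorted(tools, key=lambda t: t["name"]), sort_keys=True)
    prefix = f"{system_prompt}\n{stable_tools}\n"
    # Volatile values such as the timestamp go after the stable prefix.
    now = now or datetime.now(timezone.utc)
    suffix = f"Current time: {now.isoformat()}\n{user_message}"
    return prefix, suffix


def prefix_key(prefix):
    """Fingerprint of the prefix bytes; a changed key stands in for a cache miss."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()
```

Putting the timestamp at the top of the system prompt, or serializing tools in whatever order they were registered, would change `prefix_key` on every call even though nothing meaningful changed.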
Memory and State
Four memory tiers require distinct persistence strategies.
- In-context working memory
- Short-term session state
- Long-term cross-session memory
- Episodic memory with temporal retrieval
This taxonomy does not appear in the AutoGen memory docs, the LangGraph runtime work, or the Mem0 research paper, which each describe different memory models. A simple example shows the staleness problem: memory about a user's employer is accurate until the user changes jobs.
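One way to make the staleness problem concrete is to store when a fact was observed and how long it is trusted. The sketch below is an assumption-laden illustration: `MemoryRecord`, `max_age`, and `recall` are hypothetical names, and a per-fact freshness budget is only one of several possible policies.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class MemoryRecord:
    key: str
    value: str
    observed_at: datetime
    max_age: timedelta  # hypothetical freshness budget for this kind of fact

    def is_stale(self, now):
        return now - self.observed_at > self.max_age


def recall(records, key, now):
    """Return the newest non-stale value for key, or None if it should be re-confirmed."""
    fresh = [r for r in records if r.key == key and not r.is_stale(now)]
    if not fresh:
        return None
    return max(fresh, key=lambda r: r.observed_at).value
```

Returning `None` for a stale employer record forces the agent to re-confirm the fact instead of acting on an outdated one.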
Observability
Standard APM is built to track traditional services. Agent systems add observability needs beyond that, as the examples here show. Oracle's engineering content emphasizes agent memory and related system components for maintaining continuity across sessions. Mesa built custom LLM observability directly in Postgres because no existing tool provided trace granularity at the phase, agent session, and tool call level, joinable with business data.
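To show what "joinable with business data" can look like, here is a minimal sketch. It uses Python's built-in `sqlite3` as a stand-in for Postgres, and the schema (`agent_traces`, `orders`, and every column name) is hypothetical, not Mesa's actual design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id TEXT PRIMARY KEY, customer TEXT);
CREATE TABLE agent_traces (
    trace_id   INTEGER PRIMARY KEY,
    session_id TEXT NOT NULL,     -- one agent session
    phase      TEXT NOT NULL,     -- e.g. plan, act, verify
    tool_name  TEXT NOT NULL,     -- one row per tool call
    latency_ms INTEGER NOT NULL,
    ok         INTEGER NOT NULL,  -- 1 = success, 0 = failure
    order_id   TEXT REFERENCES orders(order_id)
);
""")
conn.execute("INSERT INTO orders VALUES ('o-1', 'acme')")
conn.executemany(
    "INSERT INTO agent_traces (session_id, phase, tool_name, latency_ms, ok, order_id) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    [("s-1", "plan", "search_docs", 120, 1, "o-1"),
     ("s-1", "act", "update_ticket", 340, 0, "o-1")],
)


def failed_calls_by_customer(conn):
    """Join tool-call traces to business rows to see which customers hit failures."""
    rows = conn.execute("""
        SELECT o.customer, t.phase, t.tool_name, COUNT(*)
        FROM agent_traces t JOIN orders o ON o.order_id = t.order_id
        WHERE t.ok = 0
        GROUP BY o.customer, t.phase, t.tool_name
    """)
    return rows.fetchall()
```

Keeping traces in the same database as business tables is the design choice that makes this join a single query; a separate APM backend would need an export step first.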
Governance
Governance requires hooks into runtime, memory, context management, and observability.
- Prompt injection defenses
- Sandboxing
- SIEM and DLP integration
- Red-team testing
Agents touching code and infrastructure must meet these obligations. The runtime architecture needs integration points for observability, governance, and context management designed in from day one, before the first deployment. In the examples above, late discovery of these layers is what creates rebuild pressure.
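A governance integration point in the runtime can be as small as a gate that every tool call passes through. The sketch below is illustrative only: `ToolGate`, `PermissionDenied`, and the role-to-tools mapping are hypothetical, and a production gate would also cover sandboxing and export to SIEM/DLP systems.

```python
from dataclasses import dataclass, field


class PermissionDenied(Exception):
    """Raised when a role tries to call a tool outside its allowlist."""


@dataclass
class ToolGate:
    allowed: dict  # role name -> set of permitted tool names
    audit_log: list = field(default_factory=list)

    def call(self, role, tool_name, tool_fn, *args, **kwargs):
        permitted = tool_name in self.allowed.get(role, set())
        # Record every attempt, including denied ones, before the tool runs.
        self.audit_log.append({"role": role, "tool": tool_name, "permitted": permitted})
        if not permitted:
            raise PermissionDenied(f"{role} may not call {tool_name}")
        return tool_fn(*args, **kwargs)
```

Because the gate sits between the agent and its tools, adding it after launch means rewiring every tool call site, which is one reason these hooks are cheaper to design in from the start.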
The Hidden Cost of Harness Engineering
The recurring pattern in the cited cases is straightforward: teams begin with foundation model integration and then discover additional platform layers one by one in production.
Those layers typically include:
- memory systems
- authentication and RBAC
- multi-surface interface support
- governance controls
Six distinct cost categories accumulate over time. The harness engineering guide covers the first category in more depth, and the AI evaluation paper cited in the regression testing row explains why AI QA tooling differs architecturally from existing systems.
| Cost Category | Nature | Key Evidence |
|---|---|---|
| Harness engineering | One-time build plus compounding maintenance | Claude Code's architecture has been described as dominated by surrounding harness and infrastructure rather than AI-specific decision logic |
| Integration tax | Recurring operational tax, not one-time CAPEX | Prompts, tool schemas, model versions, and business rules change |
| Model migration | Behavioral drift problem, not deployment problem | Production migration often requires a shadow pipeline mirroring production |
| Regression testing | New probabilistic infrastructure category | AI QA requires tooling architecturally different from existing systems |
| Security and compliance | Ongoing obligation with no off switch | Prompt injection defenses, sandboxing, SIEM/DLP integration |
| Opportunity cost | Senior engineer capacity consumed by infra | "Every week spent setting up infrastructure is a week not spent improving models or delivering product value" |
Cosmos centers on runtime, shared context, and memory primitives, with the Context Engine processing 400,000+ files for codebase understanding. Runtime-layer permission scoping applied before prompts reach the model is not yet documented in public Cosmos materials, and teams with strict gating requirements should confirm posture directly.
Cosmos combines runtime, context, and governance in one platform surface and removes the custom assembly project.
At a Glance: DIY vs Cosmos Across 6 Dimensions
The comparison below summarizes how each path handles the six platform layers teams typically discover after the first prototype: runtime, context management, memory, observability, governance, and integration. Each row reflects what is documented in publicly available materials, with caveats noted where official sources are limited.
| Dimension | DIY Build | Cosmos |
|---|---|---|
| Runtime | Teams assemble from LangGraph, CrewAI, or custom code. Retry semantics and tool concurrency flags require custom engineering; scheduling needs custom work unless the team relies on the framework's built-in cron jobs. | An agent runtime, the Context Engine, an event bus tied to the SDLC, and an organization-wide knowledge layer act as shared primitives across the stack, with human-in-the-loop as a first-class primitive. |
| Context management | Several context concerns to engineer, including memory, compaction, and tool clearing. Token costs grow with additional agent steps. Cache hit optimization requires prompt prefix stability discipline. | The Context Engine processes entire codebases across 400,000+ files, providing codebase understanding for AI coding agents and software-development workflows. |
| Memory and state | Four memory tiers (working, session, cross-session, episodic) with no industry-standard persistence pattern. Multiple frameworks and vector stores remain in use. | A shared filesystem with tenant and private memory, plus an Org Knowledge Layer for persistent scratchpad behavior, so corrections and patterns compound across sessions. |
| Observability | Requires purpose-built agent episode tracing distinct from standard APM. Mesa built its observability stack in-house. | Canonical positioning for staff engineer and platform lead audiences describes Cosmos as "governed, observable, and reproducible." No dedicated tracing UI or logging pipeline is documented publicly today. |
| Governance | Cross-cutting with hooks into every other layer. Cannot be bolted on after the fact. | Human-in-the-Loop policies and runtime-layer policy enforcement around tool calls and permissions are first-class, with SOC 2 Type 2, ISO 42001, and GDPR referenced in product materials. |
| Integration and deployment | Each tool, each service, each event source requires individual wiring. Integration functions as recurring operational tax. | Integrations span GitHub, Jira, and Slack, with the platform designed to plug into existing CI/CD and collaboration workflows. |
One caveat: Cosmos is in public preview for MAX plan users, and dedicated tracing and logging interfaces are underspecified in available documentation. Engineering leaders with strict observability requirements should verify this dimension directly.
Break-Even Math
This model uses the scenario that recurs in the cited sources: a build that assigns 5-10 engineers for 12-18 months. In that scenario, engineering labor is the largest cost variable this article discusses. The next three subsections break that down into salary baseline, maintenance multiplier, and a 24-month TCO view.
Salary baseline
In the bounded scenario used throughout this article, engineering labor matters most because the work depends on senior AI/ML infrastructure engineering time, not generic software engineering time.
Maintenance multiplier
Maintenance runs continuously through the life of the system. Model updates, orchestration changes, prompt changes, tool schema drift, and governance requirements all create recurring work in AI agent infrastructure.
24-month model
The table below contrasts the build and buy paths across the cost elements that compound over a 24-month horizon: engineering labor, infrastructure and license spend, and ongoing maintenance overhead. Both columns assume the same scenario of 5-10 engineers and 12-18 months of initial work.
| Cost Element | Build (Conservative) | Buy (Conservative) |
|---|---|---|
| Year 1: Engineering labor | 5-10 engineers building and integrating core platform layers | Internal work shifts toward platform adoption and integration, not full platform construction |
| Year 1: Infrastructure/license | Compute, storage, and tooling stack assembled internally | Platform license plus integration work |
| Year 1 total | Internal teams carry runtime, context, memory, and governance engineering during the first 12-18 months | Managed platform adoption reduces part of that engineering scope during the same period |
| Year 2: Engineering labor | Internal teams continue carrying maintenance, model migration, and rebuild pressure after initial deployment | Managed platform adoption reduces part of runtime, memory, and governance maintenance |
| Year 2: Maintenance overhead | Additional maintenance remains with the internal team as prompts, tools, and governance requirements change | Maintenance is split between internal integration work and the licensed platform |
| Year 2: Infrastructure/license | Continued internal platform costs | Continued platform costs |
| 24-month TCO | Build economics depend on carrying full build plus maintenance responsibility across 24 months | Buy economics depend on platform pricing relative to the cost of 5-10 engineers and the associated maintenance burden across the same period |
LLM API costs sit at the center of most build-vs-buy analyses, where they are compared explicitly against self-hosting or GPU costs, so they should not be zeroed out as identical across both paths. Break-even is reached in one of two ways: the platform vendor's usage-based pricing rises until it approaches build costs, or the custom build generates enough measurable, differentiated value to offset the internal engineering investment.
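The labor side of the comparison can be sketched as simple arithmetic. Everything here is an assumption for illustration: the function names are hypothetical, the model covers labor only (infrastructure, license tiers, and LLM API costs are left out), and the example inputs are placeholders, not figures from the cited sources.

```python
def build_tco(engineers, annual_cost, build_months, maintenance_fraction, horizon_months=24):
    """Labor-only cost of building, then maintaining, over the horizon."""
    build_labor = engineers * annual_cost * build_months / 12
    maintenance_months = max(horizon_months - build_months, 0)
    # maintenance_fraction: share of the build team kept on maintenance afterwards.
    maintenance_labor = engineers * maintenance_fraction * annual_cost * maintenance_months / 12
    return build_labor + maintenance_labor


def break_even_monthly_license(build_total, integration_engineers, annual_cost, horizon_months=24):
    """Monthly platform price at which buying costs the same as building (labor only)."""
    integration_labor = integration_engineers * annual_cost * horizon_months / 12
    return (build_total - integration_labor) / horizon_months


# Placeholder inputs: 5 engineers and 12 months are the low end of the article's
# scenario; the 200,000 loaded annual cost and 0.5 maintenance share are made up.
example_build = build_tco(engineers=5, annual_cost=200_000, build_months=12, maintenance_fraction=0.5)
example_license = break_even_monthly_license(example_build, integration_engineers=1, annual_cost=200_000)
```

Substituting a team's own loaded costs and a vendor's actual quote turns this into a first-pass sanity check; it is not a replacement for a full TCO model.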
Any one of three conditions can justify continuing to build: AI infrastructure is the core product being sold; the organization can commit 5-10 engineers for 12-18 months without impacting the core product; or the team already has sufficient in-house AI expertise to execute.
The Migration Path from DIY to Cosmos
For teams that have already built custom infrastructure, the strangler fig pattern documented by Microsoft's Azure Architecture Center provides the migration architecture.
- A routing façade intercepts requests and directs them to either the legacy system or the new platform.
- Teams reduce migration risk by shifting traffic in small increments, keeping each step within rollback range.
- The façade supports dual-system traffic during migration.
- The same approach preserves a rollback path while traffic is still moving.
This is the migration structure used throughout the phased plan below, and it lowers risk while preserving rollback during transition.
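The façade behavior described above can be sketched in a few lines. This is a simplified illustration, not Microsoft's reference implementation: `RoutingFacade` and its methods are hypothetical, and the handlers are plain callables standing in for real services.

```python
import hashlib


class RoutingFacade:
    """Routes each request to the legacy system or the new platform by percentage."""

    def __init__(self, legacy, platform, platform_percent=0):
        self.legacy = legacy
        self.platform = platform
        self.platform_percent = platform_percent

    def _bucket(self, request_id):
        # Hash the request id so the same request always lands in the same bucket.
        digest = hashlib.sha256(request_id.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % 100

    def handle(self, request_id, payload):
        if self._bucket(request_id) < self.platform_percent:
            return self.platform(payload)
        return self.legacy(payload)

    def shift(self, step=5):
        """Move a small increment of traffic to the platform."""
        self.platform_percent = min(100, self.platform_percent + step)

    def rollback(self):
        """Send 100% of traffic back to the legacy system."""
        self.platform_percent = 0
```

Deterministic bucketing is a design choice made here so that a given request id does not flip between systems as long as the percentage is unchanged.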
Phase 1: Audit and parallel setup (weeks 1-6)
The first phase maps dependencies and defines the first safe migration target.
- Map all custom agent workflows, their state stores, tool integrations, and inter-service dependencies.
- Identify the first migration candidate: the workflow with the most clearly defined inputs and outputs, lowest blast radius, and highest current maintenance burden.
- Deploy the managed platform in non-production capacity.
- Deploy the routing façade.
- Define quantitative success criteria before any traffic moves.
One caution on tooling: a Ponsse migration study of an Azure DevOps to GitHub Actions migration found that automated tooling could only partially convert pipelines, and some tasks or constructs still required manual migration. Migration effort therefore depends heavily on the ratio of custom-to-standard components.
Phase 2: Shadow testing (weeks 6-16)
The second phase validates behavior while keeping production risk low.
- Send real production requests to both systems simultaneously.
- Serve only legacy system responses.
- Log both outputs.
- Compare against behavioral equivalence thresholds defined before testing begins.
- The façade must always be able to route 100% of traffic back to the legacy system.
- Test rollback capability explicitly before any canary release begins.
Rollback readiness belongs in the test plan and gets validated before any canary release.
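The shadow-testing steps above can be sketched as a single wrapper. The names (`shadow_call`, `match_rate`, `equivalent`) are hypothetical, and for simplicity the sketch calls the two systems one after the other, whereas a production setup would typically run the candidate asynchronously.

```python
def shadow_call(request, legacy, candidate, equivalent, log):
    """Serve the legacy response; run the candidate in shadow and log the comparison."""
    legacy_out = legacy(request)
    try:
        candidate_out = candidate(request)
        match = bool(equivalent(legacy_out, candidate_out))
        error = None
    except Exception as exc:  # a shadow failure must never affect the served response
        candidate_out, match, error = None, False, repr(exc)
    log.append({
        "request": request,
        "legacy": legacy_out,
        "candidate": candidate_out,
        "match": match,
        "error": error,
    })
    return legacy_out


def match_rate(log):
    """Share of shadowed requests whose outputs met the equivalence check."""
    return sum(1 for entry in log if entry["match"]) / len(log) if log else 0.0
```

The `equivalent` callable is where the behavioral equivalence threshold lives; for agent outputs it is usually looser than strict equality, and it should be fixed before testing begins.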
Phase 3: Incremental workflow migration (months 4-9)
Migrate lowest-risk workflows first, highest-maintenance workflows second, and highest-complexity stateful workflows last. Starting fresh on state is usually the simpler default. Historical state migration makes the most sense when regulatory requirements mandate continuity, agent utility depends on historical context, or the state schema is documented well enough for reliable mapping.
Phase 4: Legacy decommission (months 9-12+)
Three criteria must all be satisfied before decommissioning: 100% of traffic routing to the managed platform, the platform stable within defined thresholds for a minimum of 30 continuous days, and formal stakeholder sign-off on rollback path retirement.
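Because all three criteria must hold at once, it can help to encode them as an explicit gate. The sketch below is a hypothetical helper; `DecommissionStatus` and `decommission_blockers` are names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class DecommissionStatus:
    platform_traffic_percent: float
    stable_days: int  # continuous days within the defined thresholds
    stakeholder_signoff: bool


def decommission_blockers(status, min_stable_days=30):
    """Return the unmet criteria; an empty list means decommissioning may proceed."""
    blockers = []
    if status.platform_traffic_percent < 100:
        blockers.append("traffic is not 100% on the managed platform")
    if status.stable_days < min_stable_days:
        blockers.append(f"fewer than {min_stable_days} continuous stable days")
    if not status.stakeholder_signoff:
        blockers.append("no stakeholder sign-off on rollback path retirement")
    return blockers
```

Returning the list of blockers, and not just a boolean, makes it clear which criterion is holding up the decommission.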
One observability cost risk to model is telemetry volume. Model cost curves at 10x and 100x your current agent invocation volume before committing. Negotiate volume pricing upfront.
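A rough projection of telemetry cost at 1x, 10x, and 100x volume can be sketched as below. The pricing model (a flat price per GB ingested) and every parameter name are assumptions; real vendor pricing often has tiers, retention charges, or per-span fees that this sketch ignores.

```python
def telemetry_cost(invocations_per_month, spans_per_invocation, bytes_per_span, price_per_gb):
    """Monthly telemetry cost under a flat per-GB ingestion price (an assumption)."""
    gigabytes = invocations_per_month * spans_per_invocation * bytes_per_span / 1e9
    return gigabytes * price_per_gb


def cost_curve(base_invocations, spans_per_invocation, bytes_per_span, price_per_gb,
               multipliers=(1, 10, 100)):
    """Project monthly cost at multiples of the current invocation volume."""
    return {
        m: telemetry_cost(base_invocations * m, spans_per_invocation, bytes_per_span, price_per_gb)
        for m in multipliers
    }
```

Under flat pricing the curve is linear, so the value of the exercise is seeing the absolute numbers at 100x and comparing them against the volume discounts a vendor is willing to commit to.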
Reclaim Engineering Time Before the Second Rebuild
The real decision is whether checkpoint persistence, cache hit optimization, governance retrofits, and migration plumbing deserve 12-18 months of senior engineering time. The production cases in this article show that many teams discover the full platform burden only after the first version is already live, which is why rebuild pressure appears so consistently. A practical next step is to audit the layers your team already owns: runtime, context, memory, observability, and governance. If the roadmap still assumes you are only building agent logic, the scope is already understated. Before announcing a migration, define which product work will absorb the recovered engineering time, and confirm whether the current stack still justifies carrying full build plus maintenance responsibility across 24 months.
See how Cosmos combines shared context, memory, and runtime primitives for agents across the software development lifecycle.
Written by

Paula Hingel
Technical Writer
Paula writes about the patterns that make AI coding agents actually work — spec-driven development, multi-agent orchestration, and the context engineering layer most teams skip. Her guides draw on real build examples and focus on what changes when you move from a single AI assistant to a full agentic codebase.