Living specs generally produce more reliable agent output than static specs on multi-step tasks with changing requirements or dependencies. The reason is bidirectional synchronization: it keeps the specification aligned with the code as implementation evolves.
TL;DR
Static specs give agents a strong start, but they drift as code and dependencies change. A study of 600 rejected AI-generated pull requests found that alignment loss during execution caused more failures than incorrect task descriptions. Living specs are most useful on longer, higher-volatility tasks because they reduce execution-time misalignment by keeping spec and code synchronized.
Why Spec Lifecycle Matters for Agent Output
An IEEE review on detecting and managing documentation drift confirms that outdated or inaccurate documentation hinders effective development and that robust synchronization between code and documentation remains an unsolved challenge. That challenge becomes sharper when AI coding agents, not humans, are making many implementation decisions in sequence: each step can compound a small misalignment into a large one.
The industry has responded with two lifecycle models. Static specs, exemplified by Amazon's Kiro, generate requirements, design, and task documents upfront, with documented checkpoints or review steps between phases before implementation proceeds. Living specs keep the specification current as work progresses, so later agent steps inherit updated intent rather than an obsolete plan. Intent operationalizes that persistent context for multi-session agent work by treating the spec as a living artifact that updates as agents complete tasks.
This guide examines where static specs work, where they fail, and why living specs become more valuable as task duration, ambiguity, and dependency volatility increase; it also covers where living specs introduce overhead that may not be worth paying. Thoughtworks' Technology Radar places spec-driven development in the Assess ring as of Volume 33 (November 2025), noting that the workflows remain elaborate and opinionated, a caution this guide returns to after examining the evidence.
Intent keeps specs aligned with code as agents work, using living specs and multi-agent orchestration to reduce drift across multi-file tasks.
Free tier available · VS Code extension · Takes 2 minutes
What Static Specs Are: Kiro's Upfront Blueprint Model
Static specifications are formal documents created before implementation and not automatically updated when code diverges from the plan. Amazon's Kiro is a current example: it turns a natural language prompt into structured requirements (using EARS syntax), a design document, and a task list. User review gates sit between each phase. AWS describes the value in a Connect blog post as clear traceability from requirements through implementation, but the traceability is primarily one-way: spec to code, not code back to spec.
Birgitta Böckeler's tool analysis classifies Kiro as Level 1: Spec-First, where a spec is written upfront for the task at hand but may not persist as the long-term source of truth.
The model works well when its assumptions hold: requirements are understood upfront, dependencies are pinned, and the task finishes before anything changes. The first thing that breaks is usually dependency alignment. The spec says "use library X," but by the time the agent reaches step 4, a new version has shipped, a downstream service has changed its contract, or a constraint surfaced during implementation invalidates part of the design. At that point, the spec is wrong but still authoritative, so the agent keeps following it. Kiro allows developers to manually request spec updates after code changes, but does not automatically reconcile specs during execution: the developer must recognize when drift has occurred and trigger the refresh.
What Living Specs Are: Bidirectional Sync During Execution
Living specifications are structured specs that update as implementation changes, so later agent steps operate from the current state rather than the original plan. In Böckeler's framework, this aligns with Level 2: Spec-Anchored and, in some systems, may approach a spec-as-source model.
The core mechanism is bidirectional synchronization: the spec informs code, and code changes flow back to update the spec. When an agent discovers an API has changed, a database field is missing, or a constraint conflicts with an upstream service, the system reconciles the spec against what was actually built before the next step begins. Without that reverse flow, a static spec stays frozen unless a human manually updates it.
Research on trustworthy AI-augmented engineering discusses end-to-end frameworks and design principles for maintaining alignment between agent output and developer intent. Martin Fowler's team identifies the gap in work on context anchoring: code shows what was built, but not the rejected options, accepted tradeoffs, or unresolved constraints. Once that reasoning disappears, later agent steps have to infer intent from artifacts instead of reading it directly. Spec by Example proved that executable checks catch forward drift; what it could not do was update the spec when implementation forced a design change. Updating the spec automatically is what bidirectional sync adds.
In a Spec Kit discussion, developers noted that agents forced to rediscover intent from production code burn context on reverse engineering. A structured, current spec carries the same intent at higher signal density.
Intent is a production example of this approach. Its workflow maintains a living spec as agents execute tasks, routing context through a coordinator agent that plans and reconciles work, specialist agents that execute in parallel, and a verifier that checks results against spec constraints. From the developer's perspective, reconciliation surfaces as spec updates visible in the agent workspace: the developer can see what changed, review the updated intent, and intervene before the next agent step proceeds. Review happens at reconciliation points rather than only at the end. Augment Code's Context Engine processes codebases across 400,000+ files, which matters when agents must reconcile intent against multi-file implementation reality rather than a single prompt. That architectural context allows agents to reuse current intent rather than re-read stale artifacts. It also means reconciliation considers cross-service dependencies rather than only the files a single agent happened to touch.
The Spec Drift Problem: Why Static Specs Degrade Over Time
Specification drift is the growing gap between what the spec says and what the code actually does. Research and practitioner analysis point to three recurring causes.
Structural causes of drift
Training data lag creates version mismatch. A Testkube analysis illustrates the broader pattern: agents often make incorrect assumptions about libraries, infrastructure, or APIs because their training data predates the version actually running in production.
Non-determinism creates divergence even when the spec stays constant. Research on AI-augmented engineering confirms that identical prompts can yield different code. This undermines requirement-to-code consistency even when the spec is technically correct.
Path of least resistance makes manual spec maintenance unreliable. InfoQ's treatment of spec-driven development emphasizes that specification authoring is part of implementation and should be treated with the same rigor as source code. Without an enforcement mechanism, drift is the default outcome.
Why AI agents amplify drift
AI systems compound the problem because change no longer comes from one developer working linearly. In spec-driven workflows, changes flow through specifications from features, bug fixes, and refactoring simultaneously. That multiplicity turns divergence from an exception into a normal operating condition.
The strongest quantitative evidence comes from an empirical study of 33,000 agent-authored PRs across GitHub. The researchers qualitatively analyzed 600 rejected PRs (562 accessible after excluding deleted or archived cases) and built a four-level taxonomy of rejection patterns:
| Rejection category | PR count | Share | What it means |
|---|---|---|---|
| Reviewer-level abandonment | 228 | 38% | PRs closed without meaningful human engagement |
| Pull request-level issues | 188 | 31% | Duplicates, unwanted features, wrong branch |
| Code-level failures | 133 | 22% | CI/test failures (99), incorrect implementation (19), incomplete implementation (15) |
| Agentic failures | 13 | 2% | Licensing violations, misalignment with reviewer instructions |
Code-level failures are the most relevant category for spec design. The largest sub-category is CI/test failure (99 PRs), where automated builds or tests broke because of the submitted changes. Incorrect implementations (19 PRs) and incomplete implementations (15 PRs) round out the category: cases where the agent understood the task but produced wrong or partial code. Agentic failures (13 PRs) represent a different problem, where the agent's behavior diverged from reviewer expectations or project norms. Both code-level and agentic failures represent alignment loss during execution, not bad task descriptions.
This study does not prove that every living-spec workflow outperforms every static-spec workflow. It does establish a boundary condition: when an agent receives a correct task description and still produces code that fails CI, breaks tests, or implements the wrong thing, alignment loss during execution is the primary failure mode. At 22% of all rejections, code-level failures are not a marginal problem. Living specs target exactly this failure mode.
Thoughtworks' Technology Radar adds an important caution here: even within spec-driven development, teams risk reverting to traditional antipatterns like heavy upfront specification and big-bang releases. If spec reconciliation adds a 30-second wait between agent steps on a task that should take 10 minutes, the overhead is undermining the workflow it is supposed to support. The same applies if every minor code change triggers a spec review that interrupts the developer's flow. The drift problem is real, but the solution carries its own overhead.
Two Drift Scenarios That Break Agent Output
The following scenarios illustrate how spec drift manifests in practice and how each spec model responds differently.
Scenario 1: API change mid-implementation
A Testkube blog post describes AI-generated Terraform targeting AWS provider v4.x while the platform runs provider v5.x with breaking changes. Here is how that plays out step by step:
- Spec created: The requirements specify provisioning an S3 bucket with server-side encryption using the project's Terraform configuration.
- Agent starts implementation: The agent generates a `resource "aws_s3_bucket"` block with `server_side_encryption_configuration` inline, following the v4.x pattern its training data reflects.
- Divergence appears: In AWS provider v5.x, `server_side_encryption_configuration` is a separate resource (`aws_s3_bucket_server_side_encryption_configuration`), not an inline block. The generated code is syntactically valid Terraform but architecturally wrong for the provider version the project uses.
- Static spec response: The spec still says "provision S3 with encryption." Nothing in the spec reflects the provider version constraint. Syntax validation passes. The error surfaces at `terraform plan` or deployment, after the agent has already built dependent resources on top of the wrong structure.
- Living spec response: When the agent reads the project's `.terraform.lock.hcl` and provider configuration, it detects the v5.x provider. The system updates the spec to note the provider version constraint before the next step. Subsequent resources reference the correct separate-resource pattern instead of compounding the v4.x assumption across the configuration.
The benefit is bounded: it is strongest when dependency checks happen during implementation rather than only at CI or deploy time. On a multi-resource Terraform configuration, catching the mismatch at step 3 instead of step 10 avoids cascading rework.
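In Terraform terms, the two patterns look like this (a minimal sketch shown side by side; bucket names are placeholders, and only one form is valid in a given configuration):

```hcl
# Inline pattern the agent's training data reflects
# (removed in AWS provider v5.x):
resource "aws_s3_bucket" "artifacts" {
  bucket = "example-artifacts" # placeholder name

  server_side_encryption_configuration {
    rule {
      apply_server_side_encryption_by_default {
        sse_algorithm = "aws:kms"
      }
    }
  }
}

# Separate-resource pattern the v5.x provider requires:
resource "aws_s3_bucket" "artifacts" {
  bucket = "example-artifacts"
}

resource "aws_s3_bucket_server_side_encryption_configuration" "artifacts" {
  bucket = aws_s3_bucket.artifacts.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}
```

Dependent resources that reference the bucket compound whichever pattern the agent chose first, which is why catching the mismatch early matters.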
Scenario 2: Wrong implementation despite a correct task
The empirical study of failed agentic PRs documents cases where the task description is accurate, yet the implementation still solves the wrong problem. This pattern looks like:
- Task assigned: "Add rate limiting to the /api/orders endpoint."
- Agent implements: The agent adds rate limiting at the API gateway level, applying it globally to all routes.
- Divergence: The project's architecture applies rate limiting per-service, not at the gateway. The existing rate-limiting pattern lives in middleware, and the gateway is intentionally thin. The agent's implementation is functionally correct (it rate-limits orders) but architecturally wrong (it changes the gateway contract for every service).
- Static spec response: The spec says "add rate limiting to /api/orders." That is exactly what the agent did. The architectural mismatch only surfaces during code review or when another service's traffic patterns change unexpectedly.
- Living spec response: If the system reconciles the spec against existing middleware patterns and gateway configuration, it updates the spec to specify per-service middleware rate limiting before the agent writes code. The resolved architectural decision carries forward to subsequent tasks involving other endpoints.
Intent's workflow addresses this directly: the spec persists as shared state across tasks, so resolved decisions carry forward instead of being lost between sessions. For teams comparing this to vibe coding, the key difference is whether intent survives the next handoff.
| Drift scenario | Root cause | Static spec: when is it caught? | Living spec: when is it caught? |
|---|---|---|---|
| API version mismatch | Dependency reality changed | At terraform plan, deploy, or code review (after dependent resources are built) | During implementation, before dependent steps compound the error |
| Correct task, wrong architecture | Alignment lost during execution | At code review or production incident | During spec reconciliation, before the agent writes code |
| Non-deterministic agent behavior | Same prompt, different code | Not caught until manual comparison across runs | Persistent spec constrains later runs, reducing variance |
See how Intent's persistent spec context and coordinated agents handle multi-file tasks.
Free tier available · VS Code extension · Takes 2 minutes
Where Living Specs Add Cost or Risk
Living specs are not free. Continuous reconciliation introduces overhead and failure modes that teams should evaluate before adopting them.
Reconciliation can introduce errors. When an agent updates the spec based on implementation changes, it can silently drop a requirement, weaken a constraint, or misinterpret a design tradeoff. The spec mutation itself becomes a source of bugs. These errors are most likely when reconciliation crosses domain boundaries (the agent understands the code change but misreads its business implications) and least likely on mechanical changes like dependency version bumps or schema additions where the constraint is explicit. Teams using living-spec workflows need review checkpoints on spec changes, not just code changes. Intent's verifier agent checks results against spec constraints, but no automated system catches every semantic drift in the spec itself.
Overhead scales with task frequency. Each reconciliation cycle costs compute, latency, and token usage. For a three-step task on a small repo, that overhead may exceed the cost of occasional manual re-prompting. The benefit only exceeds the cost when tasks are long enough or complex enough for drift to cause rework.
Review burden shifts rather than disappearing. Static specs front-load review at gates between phases. Living specs distribute review across reconciliation points throughout execution. The total review burden may not decrease; it redistributes from a few concentrated checkpoints to ongoing monitoring of spec changes.
Small tasks get over-engineered. A feature that touches one file, uses stable dependencies, and finishes in a single session does not benefit from continuous reconciliation. Adding spec-sync overhead to throwaway prototypes or single-function changes slows iteration without reducing risk.
Vendor architecture coupling. Intent's living-spec model is tightly integrated with Augment Code's Context Engine and multi-agent orchestration. That coupling is the cost of deep codebase-aware reconciliation: the Context Engine's cross-service dependency analysis is what makes reconciliation work at scale across 400,000+ files. Teams that prioritize vendor portability over deep integration should evaluate open frameworks like GitHub Spec Kit and cc-sdd that retrofit structure onto existing agents, accepting narrower reconciliation scope in exchange for modularity.
The Tradeoff Spectrum: Three Levels of Spec Commitment
Böckeler's framework offers the most useful lens for evaluating spec-driven systems: the main tradeoff is how long the spec remains authoritative after initial creation.
| Level | Name | Tool example | Ongoing cost | Alignment strength | Best for |
|---|---|---|---|---|---|
| 1 | Spec-First | Kiro | Low: no sync overhead | Weakens as task grows | Single-session, stable-dependency tasks |
| 2 | Spec-Anchored | Intent | Medium: reconciliation cycles | Maintains through execution | Multi-session, multi-file, volatile tasks |
| 3 | Spec-as-Source | Tessl | High: spec is the primary edit surface | Strongest, but non-determinism risk grows | Long-lived features with strict compliance needs |
Most current tools still operate at Level 1 or have no native spec model at all. External frameworks such as GitHub Spec Kit and cc-sdd exist to retrofit structure onto agents that otherwise begin from prompts alone. Intent is built around the persistent spec model at Level 2 rather than treating it as a separate framework bolted on afterward. That model also extends to multi-agent workflows, where different agents must share the same updated intent instead of forking their own interpretations.
When to Use Each Model
The most practical approach for spec-driven development is not choosing one model permanently but segmenting by task type. A team might use Kiro for stable one-session features and Intent for multi-service migrations within the same quarter. The question is where the boundary falls for a given task. Four variables determine that threshold:
| Variable | Static spec likely sufficient | Living spec likely worth the overhead |
|---|---|---|
| Task duration | Single session, under 2 hours | Multi-session, spanning days or handoffs |
| Files and services touched | 1-3 files in one service | 4+ files across 2+ services |
| Dependency stability | Stable APIs, pinned versions, no active migrations | Active API changes, library upgrades, or infra shifts |
| Agent handoff count | One agent, one run | Multiple agents or resumed sessions |
A concrete example: a team is upgrading a payment service from Stripe API v2 to v3 across 8 files in 2 services. The Stripe API contract is actively changing, the work will span at least 3 sessions, and two agents will divide the frontend and backend migration. All four indicators point to the right column: multi-session, multi-service, unstable dependencies, multiple handoffs. A static spec written on day one will be wrong by day three. A living spec that updates as each migration step resolves Stripe's new field names and webhook contracts keeps the second agent from repeating the first agent's discoveries.
If a task hits two or more indicators in the right column, the risk of drift-induced rework likely exceeds the overhead of continuous reconciliation. Teams should test this by running one volatile task through a living-spec workflow and comparing the rework rate against their last equivalent task using static specs or prompt-only workflows.
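The two-or-more-indicators heuristic can be written down directly. Thresholds mirror the table above; the scoring function itself is an illustrative sketch, not a published formula:

```python
def needs_living_spec(sessions: int, files: int, services: int,
                      deps_volatile: bool, handoffs: int) -> bool:
    """True when 2+ right-column indicators from the table fire."""
    indicators = [
        sessions > 1,                   # multi-session task
        files >= 4 and services >= 2,   # 4+ files across 2+ services
        deps_volatile,                  # active API/library/infra changes
        handoffs > 1,                   # multiple agents or resumed sessions
    ]
    return sum(indicators) >= 2

# Stripe v2 -> v3 migration: 3 sessions, 8 files, 2 services,
# volatile dependency, 2 agents -> all four indicators fire.
print(needs_living_spec(3, 8, 2, True, 2))   # True

# One-file fix, single session, pinned deps -> static spec suffices.
print(needs_living_spec(1, 1, 1, False, 1))  # False
```

Teams can tune the thresholds against their own rework data, but the structure of the decision stays the same: count volatility signals, not gut feel.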
For organizations operating in that higher-volatility range, Intent provides persistent specifications, reconciliation against real code via the Context Engine, and coordinated context across specialist agents.
Get started with Intent and bring living specs and coordinated agents into your development workflow.
Free tier available · VS Code extension · Takes 2 minutes
Written by

Molisha Shah
GTM and Customer Champion