
The Spec as Source of Truth: Why Codebases Should Be Rebuildable from Documentation

Apr 9, 2026
Ani Galstian

Treating the spec as the single source of truth for AI coding means the specification, not the codebase, is the primary artifact engineers maintain, and code becomes a derived, regenerable output that AI agents produce from that spec on demand.

TL;DR

Most teams cannot rebuild their codebase from documentation alone, and the rebuild test reveals exactly why: delete src/, open a fresh agent session, point it at the spec, and regenerate. The divergences that surface are almost never due to missing endpoints or incorrect data types. They are about implicit decisions: why a particular error code was chosen, why a specific caching strategy was used, and why one library was selected over another. Those decisions live in developer memory, not in documentation.

Every engineering team has experienced the moment: a critical service breaks, the developer who built it left six months ago, and the documentation describes a system that no longer exists. The codebase is the only record of what the software actually does.

This problem intensifies with AI development. AI coding agents are stateless between sessions. They cannot remember decisions made in prior work or infer why a particular authentication pattern was chosen. Without a persistent, complete specification, each new session starts from zero, and the resulting code drifts.

Research on spec-as-source development frames this directly: "Spec-driven development inverts the traditional workflow by treating specifications as the source of truth and code as a generated or verified secondary artifact." The question for engineering teams is not whether specifications matter; it is how complete, how machine-readable, and how rigorously enforced those specifications need to be.

This guide covers the rebuild test as a completeness benchmark, what "source of truth" means when AI agents are the primary code producers, and a practical spectrum of three rigor levels teams can adopt incrementally.

The Rebuild Test: A Spec Completeness Benchmark

The rebuild test is a concrete, binary assessment of spec quality: delete the entire src/ directory, open a clean AI agent session with no prior context, point it at the specification files, and attempt to regenerate the codebase. If the regenerated output passes the existing test suite and matches production behavior, the spec may be adequate for the behaviors covered. If it diverges, the spec has gaps.

This test is not theoretical. A bootstrapping study demonstrated the mechanism directly: "Starting from a 926-word specification and a first implementation produced by an existing agent (Claude Code), a newly generated agent re-implements the same specification correctly from scratch." The paper emphasizes the importance of specifications in guiding agent behavior and development.

What the Rebuild Test Reveals

The rebuild test surfaces a specific category of missing information: implicit decisions. Production codebases accumulate decisions that never get written down. Why does the cancellation endpoint return 200 on duplicate requests instead of 409? Why does the authentication middleware check for a specific header format? Why does the rate limiter use a sliding window instead of a fixed window?

When an AI agent encounters these ambiguities during regeneration, it makes its own decisions. Those decisions will differ from the original ones. The test suite catches some divergences, but many implicit decisions affect behavior in ways that standard tests do not cover.

| Rebuild Test Outcome | What It Means | Required Action |
| --- | --- | --- |
| All contract tests pass | Spec covers behavioral contracts | Expand to edge case coverage |
| Unit tests fail on business logic | Spec omits decision rationale | Add business rules with enforcement levels |
| Integration tests fail | Spec omits cross-service contracts | Add dependency specs and API contracts |
| Tests pass, but behavior differs | Tests are insufficient; spec may also be incomplete | Strengthen both spec and test coverage |
| The agent cannot start the generation | Spec lacks structural information (stack, dependencies, folder structure) | Add architectural context to spec |

When the rebuild test produces multiple failure rows simultaneously, fix them in this order: structural gaps first (the agent cannot start generation without stack and folder information), then integration test failures (cross-service contract gaps cause cascading failures that obscure unit-level diagnosis), and finally business logic gaps. The row "tests pass, but behavior differs" requires stronger test coverage before spec changes.
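The triage order can be encoded directly. This is a minimal sketch; the failure-category labels are hypothetical names for the table's rows, not part of any existing tool.

```python
# Sketch of the triage order above. Category names are illustrative labels,
# not output of any real test runner.

# Fix order: structural gaps first, then integration failures, then business
# logic; behavioral divergence calls for stronger tests before spec edits.
FIX_ORDER = [
    "structural",           # agent cannot start generation
    "integration",          # cross-service contract gaps
    "business_logic",       # missing decision rationale
    "behavior_divergence",  # tests pass, behavior differs
]

def triage(failures: set[str]) -> list[str]:
    """Return the observed rebuild-test failures in the order to fix them."""
    return [category for category in FIX_ORDER if category in failures]

# A rebuild run that surfaces business-logic and structural gaps at once:
print(triage({"business_logic", "structural"}))
```

Running the sorted list through a team's issue tracker, one fix per sprint, keeps the rebuild test from turning into an undifferentiated pile of failures.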

The Community Signal

Some practitioners have been testing this pattern in production for extended periods, treating code as explicitly disposable: "If the implementation is bad, delete it. Write tests first if you need to, then delete and regenerate." The spec becomes a checkpoint before execution begins.

Intent's living specs keep agents aligned with evolving requirements across sessions.

Build with Intent

Free tier available · VS Code extension · Takes 2 minutes


What "Source of Truth" Means for AI Agents

Spec documentation for AI coding agents encodes decisions, not just requirements. The distinction matters because AI agents are stateless. They do not retain memory between completions. Every agent session starts fresh. The specification is the only mechanism for persisting context across sessions.

How Agents Actually Consume Specs

Every major AI coding tool uses the same fundamental mechanic: context files are injected at the start of the model's context window on each session. Cursor's rules documentation states this explicitly: "Large language models don't retain memory between completions. Rules provide persistent, reusable context at the prompt level."

Anthropic's context engineering guide uses a hybrid strategy: "CLAUDE.md files are naively dropped into context up front, while primitives like glob and grep allow it to navigate its environment and retrieve files just-in-time." The spec file is an important source of project context for the agent.

Project-root context files are now a common mechanism used by major AI coding tools to provide repository-level instructions.

| Tool | Primary Context File | Additional Files |
| --- | --- | --- |
| Cursor | .cursor/rules/*.mdc | .cursorrules (legacy fallback) |
| Claude Code | CLAUDE.md | @imported files |
| GitHub Copilot | .github/copilot-instructions.md | .github/instructions/*.instructions.md, AGENTS.md |
| Aider | CONVENTIONS.md | Multiple via .aider.conf.yml |

GitHub Copilot's agent documentation explicitly lists AGENTS.md, CLAUDE.md, and GEMINI.md as supported instruction files, which means a single AGENTS.md file can serve as a cross-tool spec artifact.
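The injection mechanic reduces to a read-and-prepend step at session start. This is a minimal sketch, assuming a preference order over the file names from the table above; the function name and ordering are illustrative, not any tool's actual implementation.

```python
from pathlib import Path

# Candidate repo-root context files, in a hypothetical preference order.
CONTEXT_FILES = [
    "AGENTS.md",
    "CLAUDE.md",
    ".github/copilot-instructions.md",
    "CONVENTIONS.md",
]

def build_prompt(repo_root: str, task: str) -> str:
    """Prepend the first context file found to the task prompt.

    Mimics the shared mechanic: context is injected at the start of the
    model's context window on every session, because agents are stateless.
    """
    for name in CONTEXT_FILES:
        path = Path(repo_root) / name
        if path.is_file():
            return path.read_text() + "\n\n" + task
    return task  # no spec found: the session starts from zero
```

The fallthrough case is the failure mode the article describes: with no context file, every session begins with an empty decision history.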

Decisions vs. Requirements

A requirement says "all API endpoints require authentication." A decision says "authorization failures return 403, not 404, to avoid enumeration attacks on order IDs (see CWE-200), per security review Q3 2024." The requirement specifies what the agent should build. The decision tells the agent why a specific implementation was chosen over alternatives.

Specs that encode only requirements produce code that works, but makes different implementation choices on every regeneration. Specs that encode decisions produce code that converges toward the same implementation regardless of which agent or model generates it.

The spec-as-source research captures this: the planning phase "encodes constraints that the implementation must respect, for example, 'use PostgreSQL for persistence' or 'all API endpoints require authentication.' When using AI coding assistants, the plan provides crucial context: the AI learns not just what to build but how the system is structured and what conventions it should follow."

The Session Fidelity Problem

Static specs create a specific failure mode when agents execute across multiple sessions. An analysis of 600 rejected pull requests identifies alignment loss during execution, not poor task descriptions, as the primary driver of agentic workflow failures. The agent receives an accurate spec but still produces code that fails CI because the spec does not update to reflect intermediate decisions made during implementation.

Intent addresses this with living specs that update as agents complete work. When an agent finishes a task, the spec reflects what was actually built. When requirements change, updates propagate to all active agents. Intent's coordinator documentation describes a Coordinator Agent that analyzes the codebase, drafts a spec as a living document, and then generates tasks for specialist agents to execute. Users can stop the Coordinator at any time to manually edit the spec.

| Pattern | Flow | Session Continuity |
| --- | --- | --- |
| Static spec | Requirements → Agent → Code (spec unchanged) | Lost between sessions |
| Living spec | Requirements → Agent → Code → Spec updates → Next agent inherits current state | Preserved across sessions |

This architectural distinction highlights a structural problem: teams routinely version the source code generated by agents but neglect to version the specs that produced it, inverting the dependency relationship that matters most.
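Mechanically, the living-spec flow is a read-modify-write loop over the spec file. The sketch below illustrates the loop under assumed names; it is not Intent's actual API, and a real implementation would version the spec alongside the code it produced.

```python
import json
from pathlib import Path

def record_decision(spec_path: str, task: str, decision: str) -> None:
    """Append a decision made during implementation to a living spec file.

    The next agent session reads the updated spec, so the intermediate
    decision survives the session boundary instead of living only in the
    finished agent's discarded context window.
    """
    path = Path(spec_path)
    spec = json.loads(path.read_text()) if path.exists() else {"decisions": []}
    spec["decisions"].append({"task": task, "decision": decision})
    path.write_text(json.dumps(spec, indent=2))
```

Committing the spec file in the same change set as the generated code is what keeps the dependency relationship right-side up.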

The Practical Spectrum: Three Rigor Levels for Spec-Driven Development

Engineering teams do not adopt spec-driven development in a single step. The three levels below represent a progression from lightweight spec discipline to full spec-as-source, each with distinct enforcement mechanisms, organizational tradeoffs, and named production examples. Teams can enter at any level, but each builds on the practices of the one before it.

Level 1: Spec-First

Spec-first development means the specification is written before implementation begins. The spec is an input artifact that enables parallel development, but drift between spec and implementation accumulates over time without structural prevention.

How it works: Teams write an OpenAPI YAML, Protocol Buffer definition, or structured Markdown spec before writing code. Consumer teams develop against mock servers generated from the spec while the provider team builds the real implementation.

Named example: Stripe maintains an OpenAPI repository (stripe/openapi) with versioned spec files in JSON and YAML. Per Stripe docs: "developers of third-party SDKs and custom Stripe clients powered by the OpenAPI specifications in stripe/openapi can use the unified files to drive their code generation."

Core tradeoff: Spec-first provides high parallelism (consumers develop before providers ship) but carries high drift risk because there is no automated enforcement of spec-code parity. Enforcement is social and manual.

Level 2: Spec-Anchored

Spec-anchored development places the spec in a shared, centrally governed repository. Automated contract testing at build and CI time enforces that implementations conform to the spec.

How it works: A central contract repository, owned by neither provider nor consumer teams exclusively, holds the specifications. CI gates fail builds when implementation diverges from the shared spec. Contract testing tools like Pact automatically validate conformance.
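At its core, the CI gate is a diff between what the spec declares and what the implementation does. This stdlib-only sketch stands in for a real contract-testing tool such as Pact or oasdiff; the data shapes are assumptions for illustration.

```python
def contract_violations(spec: dict, implemented: dict) -> list[str]:
    """Compare spec-declared response codes per endpoint against the codes
    the implementation actually returns. Any divergence, in either
    direction, should fail the build."""
    problems = []
    for endpoint, declared in spec.items():
        actual = implemented.get(endpoint, set())
        for code in sorted(actual - set(declared)):
            problems.append(f"{endpoint}: returns {code}, not declared in spec")
        for code in sorted(set(declared) - actual):
            problems.append(f"{endpoint}: spec declares {code}, never returned")
    return problems

# An implementation that drifted: it returns 404 where the spec says 409.
spec = {"POST /orders/{id}/cancel": {200, 409}}
implemented = {"POST /orders/{id}/cancel": {200, 404}}
for problem in contract_violations(spec, implemented):
    print(problem)
```

A real contract test also checks payload schemas and consumer expectations, but the status-code diff alone already catches the enumeration-style divergences discussed later in this guide.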

Named example: Netflix uses federated GraphQL schemas as the shared contract. Each Domain Graph Service is a standalone spec-compliant GraphQL service.

Named example: Shopify uses Sorbet to enforce component contracts during monolith decomposition. Sorbet expresses input and output contracts on component boundaries, making them machine-checkable at the type level.

Core tradeoff: Spec-anchored development provides low drift risk through automated CI gates, but introduces medium organizational friction because provider teams must relinquish sole ownership of the spec.

Level 3: Spec-as-Source

Spec-as-source development treats the spec as the primary artifact to be maintained. Engineers edit specs; machines produce code. Any change to behavior means changing the spec and regenerating, not editing code directly.

How it works: The specification becomes the thing engineers maintain. Code generation pipelines produce implementation artifacts from the spec. Changes to behavior require changing the spec and regenerating. martinfowler.com documents this pattern with Tessl, where generated code is marked with // GENERATED FROM SPEC - DO NOT EDIT.
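The "do not edit" marker only works if something enforces it. A pre-commit hook built on a check like the one below rejects manual edits to generated files, making "change the spec and regenerate" structural rather than social. The function name is hypothetical; the marker text comes from the Tessl example above.

```python
# Marker emitted at the top of every generated file (per the Tessl pattern).
GENERATED_MARKER = "GENERATED FROM SPEC - DO NOT EDIT"

def editable(file_text: str) -> bool:
    """Return False for files that must be changed via the spec, not by hand.

    A pre-commit hook would call this on each staged file and abort the
    commit if any generated file was hand-modified.
    """
    lines = file_text.splitlines()
    first_line = lines[0] if lines else ""
    return GENERATED_MARKER not in first_line
```

The inverse check is equally useful: a CI job can verify that every file under the generated output directory still carries the marker, catching files that drifted out of the generation pipeline.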

Named example: Uber's design system uses spec-as-source across seven implementation stacks: UIKit, SwiftUI, Android XML, Android Compose, Web React, Go, and SDUI. A Figma link serves as the spec input; the resulting design is translated into detailed technical specifications tailored to each of the seven platform stacks.

Named example: Netflix's Unified Data Architecture (UDA) treats the domain model as the spec; GraphQL schemas, Iceberg tables, and Data Mesh sources are generated projections.

Core tradeoff: Spec-as-source provides near-zero drift risk by construction, but introduces high organizational friction and high debugging complexity because generated code is harder to trace.

Consolidated Comparison

| Dimension | Spec-First | Spec-Anchored | Spec-as-Source |
| --- | --- | --- | --- |
| Primary function | Alignment and parallelism | Enforced shared contract | Primary maintained artifact |
| Spec ownership | Provider team | Shared/central repo | Spec team or tooling |
| Enforcement | Social/manual | Automated CI gates | Generation or continuous validation |
| Drift risk | High | Low | Near-zero |
| Organizational friction | Low | Medium | High |
| Debugging complexity | Low | Medium | High |
| Tooling maturity (2025–2026) | High | High | Low to medium |
| Named examples | Stripe | Netflix (GraphQL Federation) | Netflix (UDA), Uber (design system) |

The rigor levels are not alternatives; they are a progression. Teams that attempt Level 3 without first establishing Level 2's CI gates and contract tests find that their specs become aspirational documents rather than enforced contracts.


The Adoption Sequence

A practical adoption path:

  1. Pick one critical service (authentication, payments, core API)
  2. Write a machine-readable spec covering its behavioral contracts
  3. Add a CI lint gate using Spectral or oasdiff in report-only mode
  4. Promote to the blocking gate after observing violations for one to two sprints. The signal that the team is ready to promote: the violations the lint gate catches are consistently real gaps, not false positives from teams legitimately iterating on the spec.
  5. Add contract tests to validate consumer-provider compatibility
  6. Test the rebuild: delete generated code, regenerate from spec, confirm tests pass
  7. Scale to other services once the workflow proves reliable
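Step 3's report-only gate might look like the following in GitHub Actions. The workflow name and spec path are illustrative; `continue-on-error` is what keeps the gate non-blocking until the team promotes it in step 4.

```yaml
# .github/workflows/spec-lint.yml (illustrative)
name: spec-lint
on: [pull_request]
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Lint the spec (report-only for now)
        continue-on-error: true  # set to false when promoting to a blocking gate
        run: npx @stoplight/spectral-cli lint openapi.yaml
```

Flipping the single `continue-on-error` flag is the entire promotion mechanism, which keeps the report-only period cheap to run and cheap to end.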

Intent's living specs update automatically as agents work, keeping decision context current across regeneration cycles.


What Generation-Grade Specs Actually Contain

A spec that passes the rebuild test contains more than endpoints and request schemas. Generation-grade specs encode business rules with enforcement levels, architectural decisions with rationale, security constraints with CWE mappings, and generation provenance metadata.

The spec-as-source research captures the distinction: the planning phase "encodes constraints that the implementation must respect... Without this context, even a perfect functional spec may yield code that contradicts organizational standards or architectural decisions."

Documentation-Grade vs. Generation-Grade

| Dimension | Documentation-Grade | Generation-Grade |
| --- | --- | --- |
| Business rules | In prose comments or an external wiki | Inline, machine-readable, with enforcement levels |
| Rationale | Absent or in a separate ADR document | Co-located with the rule, linked by ID |
| Constraints | Described in the description fields | Encoded as executable expressions (CEL, regex, enum) |
| Generation metadata | None | Model version, timestamp, rebuild trigger |
| Error semantics | HTTP status codes only | Reason codes, retry semantics, business-rule references |
| Security constraints | Implicit in implementation | Explicit, with CWE mappings and enforcement levels |

Example: OpenAPI with Generation-Grade Extensions

The OpenAPI spec supports specification extensions via the x- prefix. Generation-grade specs use these extensions to carry business logic, decisions, and provenance:

```yaml
paths:
  /orders/{id}/cancel:
    post:
      summary: Cancel an order
      x-business-rules:
        - id: BR-042
          rule: "Orders may only be cancelled within 30 minutes of placement"
          rationale: "Fulfillment SLA requires warehouse pick within 45 minutes"
          enforcement: MUST
        - id: BR-043
          rule: "Orders in SHIPPED status cannot be cancelled"
          rationale: "Carrier handoff is irreversible; use /returns"
          enforcement: MUST
      x-architectural-decisions:
        - adr-ref: "ADR-017"
          decision: "Cancellation is idempotent; repeated calls return 200"
          rationale: "Retry safety for mobile clients on unreliable networks"
      parameters:
        - name: id
          in: path
          required: true
          schema:
            type: string
            format: uuid
          x-validation-note: "Must be caller's own order; returns 403 not 404"
      responses:
        '409':
          description: Order cannot be cancelled
          x-business-rule-ref: "BR-042, BR-043"
```

The x-business-rules entries carry both the rule and the rationale inline. The x-architectural-decisions entry links to a specific ADR with its reasoning. The x-validation-note on the path parameter encodes a security constraint (returning 403 rather than 404 to prevent enumeration) that documentation-grade specs leave entirely implicit.

The Constitution Pattern

Constitutional spec-driven development research introduces a separate constraint layer: a versioned document that encodes non-negotiable security requirements and includes CWE vulnerability mappings. Feature specs are governed by the constitution's constraints and feature-specific logic.

```yaml
# constitution.yaml
version: "2.0.1"
principles:
  - id: SEC-001
    enforcement: MUST
    cwe-mapping: CWE-285
    rule: "All API endpoints require authentication and appropriate authorization checks"
    rationale: "No anonymous access to business data"
  - id: DATA-001
    enforcement: MUST
    rule: "All state mutations emit an audit event"
    rationale: "Supports logging, monitoring, and auditability objectives relevant to SOC 2 Type II"
```

This separation means security and compliance constraints are enforced structurally, rather than relying on each feature spec to remember to include them.
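Structural enforcement can be sketched as a check that every MUST principle in the constitution is acknowledged by each feature spec. The field names mirror the YAML above, but the checker itself, including the assumed `satisfies` field on feature specs, is hypothetical.

```python
def missing_must_principles(constitution: dict, feature_spec: dict) -> list[str]:
    """Return ids of MUST principles the feature spec does not acknowledge.

    A non-empty result should fail the build, so no feature spec can
    silently skip a security or compliance constraint.
    """
    required = {p["id"] for p in constitution["principles"]
                if p["enforcement"] == "MUST"}
    covered = set(feature_spec.get("satisfies", []))
    return sorted(required - covered)

# Parsed from constitution.yaml above (SHOULD-level entry added for contrast).
constitution = {
    "principles": [
        {"id": "SEC-001", "enforcement": "MUST"},
        {"id": "DATA-001", "enforcement": "MUST"},
        {"id": "STYLE-001", "enforcement": "SHOULD"},
    ]
}
feature = {"satisfies": ["SEC-001"]}
print(missing_must_principles(constitution, feature))
```

Because the check runs against the constitution rather than against each feature spec's own text, adding a new MUST principle immediately gates every feature, which is what "enforced structurally" means in practice.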

Run the Rebuild Test on Your Most Critical Service This Week

The gap between "we have documentation" and "we have a rebuildable spec" is the gap between intent and execution. Every implicit decision not captured in the spec is one that the next AI agent session will make differently.

The right starting point is the one service where an implicit decision has already caused a production incident or confusing agent output. The first spec element to write is not the happy-path API contract, but the decision rationale for the behavior that is hardest to explain to a fresh agent. That single entry, "cancellation returns 200 on duplicate requests because retry safety matters more than strict idempotency semantics", is the difference between a spec that describes what the code does and one that explains why. Run the rebuild test. The failures are the roadmap.

Intent's living specs keep every agent session working from the current project context.



Written by

Ani Galstian
