Spec-driven development combined with test-driven development produces shippable AI-generated code because the spec defines the behavioral contract before generation begins, and failing tests verify each unit of AI output against that contract through enforced Red-Green-Refactor cycles.
TL;DR
AI agents produce code that drifts from requirements once a project spans multiple files. A spec and test suite catch that drift, but only if the discipline holds consistently. This guide covers the five-phase workflow, the four failure modes practitioners hit most often, and when the approach is worth the overhead.
Why AI-Generated Code Needs Both a Spec and a Test Suite
The problem with AI-generated code is that it looks right. An agent produces syntactically valid, well-structured code that quietly misses the behavioral contract you actually intended. Without explicit constraints at the prompt level, agents will also undermine the test-writing discipline designed to catch that drift. Kent Beck, a pioneer and leading proponent of TDD, observed this directly when working with AI agents. From his interview with The Pragmatic Engineer:
"The genie doesn't want to do TDD. It wants to write the code and then write tests that pass."
Beck encountered AI agents that would delete failing tests rather than fix the underlying implementation, as documented in his June 2025 interview. The agent made the test suite "pass" by changing the specification while leaving the underlying code incorrect.
Some studies suggest this pattern appears at scale. A USENIX study found package hallucination rates of about 5.2% for commercial models and 21.7% for open-source models, with JavaScript more susceptible than Python. GitClear research reported that code cloning (12.3%) exceeded refactored or moved code (9.5%) for the first time in its dataset; cloning rose from 8.3% to 12.3% between 2020 and 2024, a 48% relative increase that the report links to AI assistant adoption.
The spec provides the "what." TDD provides the "proof it works." Neither alone is sufficient, a point that writing on spec-driven development covers in depth.
| Approach | Strength | Gap |
|---|---|---|
| Spec only | Consistency; easy regeneration | No runtime verification; AI can silently drift from the contract |
| TDD only | Catches regressions; builds confidence | No shared contract for multi-agent or multi-file generation |
| Spec + TDD | Behavioral contract + automated verification | Requires discipline in both spec evolution and test scope |
Beck's practical solution was to enforce TDD at the prompt level. From his system prompt:
"Always follow the TDD cycle: Red -> Green -> Refactor. Write the simplest failing test first. Implement the minimum code needed to make tests pass. Refactor only after tests are passing."
This constraint made each unit of AI work consist of a single failing test followed by the minimum code needed to pass it. The developer stays in the decision loop at every step. Augment Cosmos, Augment Code's unified cloud agents platform, is built to enforce that same discipline at team scale: a single place to run AI agents across the codebase and the wider software development lifecycle, with shared context and memory that compound as work proceeds. Cosmos keeps the behavioral contract as a reviewable, first-class artifact. It routes the spec through a human review checkpoint before agents write, test, and review code, and the conventions and corrections its agents accumulate carry forward to keep parallel work aligned as a codebase evolves.
The Five-Phase Workflow: Spec to Shippable Code
The Spec + TDD workflow follows five concrete phases. Each phase has a gate condition that must be satisfied before advancing to the next.
Phase 1: Write the Spec Stub
Define a minimal schema that captures the business logic and omits the implementation. Birgitta Böckeler's analysis on martinfowler.com describes three implementation levels: spec-first, where the spec is written before coding; spec-anchored, where the spec remains a maintained artifact after completion; and spec-as-source, where the spec is the main source file and generated code is treated as a build artifact. As development proceeds, these contracts become living specs that update when implementation decisions flow back into them.
A spec for an AI content moderation endpoint:
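As a minimal sketch of what such a stub could contain, the contract is written here as a Python dict in the shape of an OpenAPI 3.1 document. The `/v1/moderate` path, the `text` and `threshold` fields, and the 200 and 422 responses are illustrative assumptions, not details fixed by the workflow.

```python
import json

# Hypothetical spec stub: every path, field name, and status code below is
# an assumption made for illustration.
MODERATION_SPEC = {
    "openapi": "3.1.0",
    "info": {"title": "Content Moderation API", "version": "0.1.0"},
    "paths": {
        "/v1/moderate": {
            "post": {
                "operationId": "moderateContent",
                "requestBody": {
                    "required": True,
                    "content": {
                        "application/json": {
                            "schema": {
                                "type": "object",
                                "required": ["text"],
                                "properties": {
                                    "text": {"type": "string", "minLength": 1},
                                    "threshold": {
                                        "type": "number",
                                        "minimum": 0,
                                        "maximum": 1,
                                        "default": 0.5,
                                    },
                                },
                            }
                        }
                    },
                },
                "responses": {
                    "200": {
                        "description": "Moderation verdict",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "type": "object",
                                    "required": ["flagged", "score"],
                                    "properties": {
                                        "flagged": {"type": "boolean"},
                                        "score": {"type": "number"},
                                    },
                                }
                            }
                        },
                    },
                    "422": {"description": "Request failed validation"},
                },
            }
        }
    },
}


def response_codes(spec, path, method):
    """List the response codes the spec declares for one operation."""
    return sorted(spec["paths"][path][method]["responses"])


if __name__ == "__main__":
    # Serialize the stub so it can be saved as a JSON spec file.
    print(json.dumps(MODERATION_SPEC, indent=2))
```

Note what is absent: nothing in the stub says how a score is computed, only what goes in, what comes out, and which failure is possible.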
This spec is the interface contract between generated and hand-written code. It defines inputs, outputs, and error conditions without specifying how moderation scoring works internally.
Phase 2: Decompose into Testable Units via Gherkin Scenarios
OpenAPI-to-Gherkin workflows commonly map one feature to one resource and one scenario to each response path. In this example, the OpenAPI spec above yields:
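The feature file can be sketched as follows, with the Gherkin text held in a Python string and two small helpers that check the one-scenario-per-response-path mapping. The scenario wording, the `/v1/moderate` path, and the 200 and 422 response codes are assumptions for illustration.

```python
# Hypothetical feature file for a moderation endpoint. The wording and the
# status codes are illustrative assumptions.
FEATURE = """\
Feature: Content moderation

  Scenario: Text above the threshold is flagged
    Given a moderation request with text "I hate you" and threshold 0.5
    When the client posts it to /v1/moderate
    Then the response status is 200
    And the response field "flagged" is true

  Scenario: Benign text is not flagged
    Given a moderation request with text "Have a nice day" and threshold 0.5
    When the client posts it to /v1/moderate
    Then the response status is 200
    And the response field "flagged" is false

  Scenario: Empty text is rejected
    Given a moderation request with empty text
    When the client posts it to /v1/moderate
    Then the response status is 422
"""


def scenario_titles(feature_text):
    """Return the title of each Scenario in a Gherkin feature string."""
    prefix = "Scenario:"
    return [
        line.strip()[len(prefix):].strip()
        for line in feature_text.splitlines()
        if line.strip().startswith(prefix)
    ]


def uncovered_responses(feature_text, response_codes):
    """Return spec response codes that no scenario asserts on."""
    prefix = "Then the response status is "
    asserted = {
        line.strip()[len(prefix):]
        for line in feature_text.splitlines()
        if line.strip().startswith(prefix)
    }
    return sorted(set(response_codes) - asserted)
```

Calling `uncovered_responses(FEATURE, ["200", "422"])` returns an empty list, which is the gate condition for moving on: every response path in the contract has at least one scenario.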
Each scenario becomes a failing test. The Gherkin layer is the stable contract; implementations can change without modifying the feature file. As Clearpoint Digital explains: "We abstract the imperative implementation to the step definition layer, so if that implementation changes, we only need to change the step definitions, not both the steps and feature files."
Phase 3: Write the First Failing Test (Red)
Tests assert concrete business behavior. They verify observable outcomes and stay independent of implementation details. Using the slash commands pattern, the Red phase produces this illustrative pytest example:
Running pytest confirms the test fails: moderate_content does not exist yet. This is the Red state.
Phase 4: Agent Implements Minimum Code (Green)
The AI agent receives the failing tests and the spec as context, then writes the minimum implementation:
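A minimal sketch of such an implementation follows. It uses a stdlib dataclass as a stand-in for a Pydantic model so the example runs without third-party packages; the keyword list, the `flagged` and `score` fields, and the dict-based signature are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical term list standing in for a real scoring model.
_FLAGGED_TERMS = {"hate", "kill"}


@dataclass(frozen=True)
class ModerationRequest:
    """Input contract; a stdlib stand-in for a Pydantic model."""

    text: str
    threshold: float = 0.5

    def __post_init__(self):
        if not isinstance(self.text, str) or not self.text.strip():
            raise ValueError("text must be a non-empty string")
        if not 0.0 <= self.threshold <= 1.0:
            raise ValueError("threshold must be between 0 and 1")


def moderate_content(payload: dict) -> dict:
    """Return a verdict for `payload`.

    Raises ValueError when text is empty or threshold is out of range.
    """
    request = ModerationRequest(**payload)
    words = [w.strip(".,!?").lower() for w in request.text.split()]
    # Deliberately naive: any listed term scores 1.0, everything else 0.0.
    score = 1.0 if any(w in _FLAGGED_TERMS for w in words) else 0.0
    return {"flagged": score >= request.threshold, "score": score}
```

The scoring is intentionally crude. It is the least code that satisfies three behaviors (flag, pass, reject), and anything smarter would wait for a failing test that demands it.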
To confirm the Green state, run the implementation and the tests together locally. In a Pydantic-based implementation, the model serves as a dual-purpose contract: it validates inputs at runtime and generates JSON Schema compliant with Draft 2020-12 and the OpenAPI Specification v3.1.0.
Phase 5: Refactor with Spec as Safety Net
With passing tests, the developer restructures code for readability or reuse. Together, the spec and test suite prevent behavioral regression. If tests stay green after refactoring, behavior stays in parity with the spec.
This cycle repeats for each new behavior added to the implementation. Edge cases enter the implementation only when a failing test specifies them.
The VSDD Pipeline: Adversarial Verification for Critical Systems
The VSDD pipeline is a practitioner-described extension of the Spec + TDD workflow that adds adversarial review for systems where correctness is non-negotiable. In the cited description, VSDD fuses three paradigms into sequential gates:
- Spec-Driven Development: The contract is defined before implementation
- Test-Driven Development: Red -> Green -> Refactor is enforced at each step
- Verification-Driven Development: All surviving code is subjected to adversarial refinement by a different model family
VSDD-related materials describe roles such as Architect, Builder, Tracker, and Adversary, but the exact set of roles and responsibilities varies by source. The rationale for using multiple models aligns with broader patterns in multi-agent systems.
The Builder must be explicitly instructed: "You are operating under strict TDD. Write tests FIRST. Do NOT write implementation code until I confirm all tests fail. When implementing, write the MINIMUM code to pass each test." Without this constraint, AI models will naturally try to write both the implementation and the tests simultaneously, collapsing the feedback loop.
Augment Cosmos follows a similar structure. Cosmos analyzes the codebase through its Context Engine and holds the spec for review before agents execute, then launches parallel agents to write the code while a Deep Code Review expert checks results against the spec before changes reach the branch. Its Context Engine performs semantic dependency graph analysis across 400,000+ files. That architectural awareness keeps agents from generating code that contradicts established patterns.
Where the Workflow Breaks and How to Recover
Four failure modes recur across practitioners using Spec + TDD with AI agents.
Spec drift is the most structurally dangerous. Thoughtworks identifies that "code generation from spec to LLM is not deterministic, which creates challenges for upgrades and maintenance." AI agents that make autonomous multi-file changes in a single session can propagate spec drift across an entire codebase in a single pass. The fix: treat spec files as version-controlled artifacts and diff them after each regeneration.
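One way to make that diff routine is sketched below with Python's standard `difflib`. The file label and the spec snippets in the usage note are made up for illustration; in practice the two versions would come from version control.

```python
import difflib


def spec_diff(before: str, after: str, name: str = "openapi.yaml") -> list[str]:
    """Return unified-diff lines between two versions of a spec file."""
    return list(
        difflib.unified_diff(
            before.splitlines(),
            after.splitlines(),
            fromfile=f"a/{name}",
            tofile=f"b/{name}",
            lineterm="",
        )
    )


def changed_lines(diff: list[str]) -> list[str]:
    """Keep only added or removed content lines.

    The first two entries of a unified diff are the file headers, so they
    are skipped before filtering on the leading "+" or "-".
    """
    return [line for line in diff[2:] if line.startswith(("+", "-"))]
```

For example, if regeneration changes `maximum: 1` to `maximum: 100` under a `threshold` field, `changed_lines(spec_diff(old, new))` returns just those two lines, which is a small enough unit for a human to approve or reject.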
Test inversion occurs when AI generates both code and tests. The result is tautological tests that confirm whatever the implementation does while leaving the system's real requirements unverified. The earlier Beck example is a community signal rather than a primary technical source, so treat it as illustrative. The countermeasure is structural. Write the tests before the code, so the AI cannot grade its own work.
Semantic drift in refactoring is the quietest failure mode. AI-generated refactoring can change a function's behavior without touching its interface, escaping type checkers and integration tests entirely. The pattern shows up often in database access: the agent replaces a batched query with individual lookups. The signature is unchanged, unit tests pass, and the problem surfaces only under production load. Catching this requires property-based or performance tests at the contract boundary, not just behavioral assertions.
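A contract-boundary check for the batched-query case could be sketched as follows. `FakeUserStore`, `load_users_batched`, and `load_users_one_by_one` are hypothetical names; the point is that both loaders return identical data, so only an assertion on the number of round trips tells them apart.

```python
class FakeUserStore:
    """In-memory stand-in for a database that counts round trips."""

    def __init__(self, rows):
        self._rows = dict(rows)
        self.query_count = 0

    def fetch_many(self, ids):
        self.query_count += 1
        return [self._rows[i] for i in ids]

    def fetch_one(self, user_id):
        self.query_count += 1
        return self._rows[user_id]


def load_users_batched(store, ids):
    """Original behavior: one round trip for the whole batch."""
    return store.fetch_many(ids)


def load_users_one_by_one(store, ids):
    """Refactored behavior: same signature and result, one round trip per id."""
    return [store.fetch_one(i) for i in ids]


def assert_query_budget(loader, ids, rows, max_queries=1):
    """Fail if `loader` returns wrong data or exceeds the round-trip budget."""
    store = FakeUserStore(rows)
    result = loader(store, ids)
    assert result == [rows[i] for i in ids], "behavioral contract broken"
    assert store.query_count <= max_queries, (
        f"{loader.__name__} used {store.query_count} queries, "
        f"budget is {max_queries}"
    )
```

With three ids, `load_users_batched` passes the default budget of one query, while `load_users_one_by_one` passes the data assertion and then fails the budget assertion.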
Architectural drift compounds in large codebases where AI agents operate with limited context. The symptom is familiar: the agent creates a new HTTP client when a centralized one exists, or uses raw SQL when a repository pattern is in place.
On Augment Cosmos, a Deep Code Review expert checks spec compliance before AI-generated code reaches your branch. The check catches drift at review time, well ahead of merge.
Decision Framework: When to Write, When to Generate, When to Stop
Three decision points determine whether the Spec + TDD workflow produces value or overhead.
When to handwrite vs. generate: Write by hand when the logic is domain-specific, security-critical, or has no obvious analog in public training data. In those cases the agent has no reliable model to draw on, and the spec alone won't prevent drift. Generate when you're dealing with boilerplate, data mapping, or serialization against a known interface; these are where AI output is most predictable, and spec constraints are tight enough to catch deviations. The QCon team that pushes to main multiple times daily explicitly turns off Copilot autocomplete during pair programming because "it interrupts more than it creates value," but uses Copilot chat for third-party library questions where the interface contract is already defined externally.
When to revise the spec: If tests repeatedly fail due to spec ambiguity, fix the contract itself. Vague contracts, such as "it should classify text," yield inconsistent test results. Jason Gorman argues that TDD works well for AI-assisted programming because its small-step discipline keeps each AI interaction within the model's reliable working context.
When to stop iterating: Once regressions no longer reveal new edge cases and integration tests pass, freeze the spec. Over-polishing through regeneration often hurts stability. Unlike traditional technical debt, AI-related debt can compound quickly as accumulated complexity and drift amplify issues over time.
Augment Cosmos addresses spec evolution through review checkpoints and shared memory. The spec comes back for human review before agents execute, and as agents work over a shared filesystem, the patterns, conventions, and corrections they accumulate carry forward to later agents through tenant memory. Cosmos analyzes the codebase with its Context Engine before that work begins.
Automating Validation in CI/CD Pipelines
The Spec + TDD workflow extends into CI/CD via spec-conformance gates. An arXiv paper on spec-driven development shows how executable specifications and contract testing tools like Specmatic keep an implementation aligned with its spec, so a conformance gate can fail the build when code drifts from the contract. These patterns fit within broader AI workflows that teams are adopting for continuous validation.
Path-based triggers fire validation when specs change:
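In a hosted CI system this is usually a path filter in the pipeline configuration. The same decision is sketched here as a small Python function so the logic is explicit; the glob patterns and directory layout are assumptions for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical layout: specs under specs/, feature files under features/.
# In fnmatch patterns "*" also matches "/", so nested directories are covered.
SPEC_PATTERNS = ("specs/*.yaml", "specs/*.json", "features/*.feature")


def should_run_conformance(changed_files, patterns=SPEC_PATTERNS):
    """Return True when any changed file matches a spec pattern."""
    return any(
        fnmatchcase(path, pattern)
        for path in changed_files
        for pattern in patterns
    )
```

A change set such as `["specs/moderation.yaml", "README.md"]` triggers the conformance job, while `["src/app.py"]` alone does not.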
The Specmatic loop provides one example of a self-correcting loop: it validates AI-generated code against API contract tests, and test failures feed back into the generation process rather than stopping immediately for human review. Contract tests confirm that an implementation conforms to the specification.
Microsoft's Azure SDK pattern provides a production-scale example: generated code lives in Generated/ folders, customizations in the Customizations/ folder, and AI agents are explicitly prohibited from removing or disabling existing tests unless instructed. Code can be regenerated by running the generator when needed.
CI/CD-based regression validation can also catch behavioral changes when you modify prompts or swap models. Note, however, that Evidently AI's materials do not specifically document behavioral drift occurring without code changes, nor do they recommend versioning prompt files alongside code.
Augment Cosmos applies spec-driven orchestration that analyzes the codebase through its Context Engine ahead of generation, to catch failures before code reaches the branch. Separately, a 2026 arXiv study found that AI coding agents frequently break previously passing tests, and that graph-based pre-change impact analysis cut that regression rate from 6.08% to 1.82% (a 70% reduction) across 100 SWE-bench Verified instances. Cosmos's architectural awareness and 400,000+ file scale are platform capabilities, distinct from the independently benchmarked figures in the cited studies.
Ship Spec-Verified Code This Sprint
Generation speed is no longer the bottleneck; verification discipline is. Code that ships without a spec and test suite will look fine until the third sprint, when behavioral drift compounds and refactoring becomes archaeology. The spec is the rein. TDD is the mechanism that makes it hold.
Start with one module and one behavior. Write the OpenAPI or JSON Schema contract, decompose it into Gherkin scenarios, write the first failing test, and let the agent implement the minimum code required to make it pass. If the workflow requires multi-agent coordination, run it on a platform that keeps the spec under review and parallel agents aligned as work branches and converges, which is the problem Augment Cosmos is built to solve.
Written by

Ani Galstian
Ani writes about enterprise-scale AI coding tool evaluation, agentic development security, and the operational patterns that make AI agents reliable in production. His guides cover topics like AGENTS.md context files, spec-as-source-of-truth workflows, and how engineering teams should assess AI coding tools across dimensions like auditability and security compliance.