Frequently Asked Questions About Spec + TDD

Can Spec + TDD work with any AI coding assistant, or does it require specific tools?

Spec + TDD works with any AI assistant that accepts structured prompts. Practitioners and official guides have documented TDD workflows with GitHub Copilot, including slash commands and custom agent or instruction-based setups. The critical requirement is structural enforcement through system prompts, role constraints, or phase-separated commands: without explicit constraints, AI agents default to writing code and tests simultaneously.

How does the VSDD pipeline differ from standard Spec + TDD?

VSDD emphasizes iterative, interactive, verifiable loops built around writing a spec, planning implementation, and verifying each step through tests and checks. A different model reviews all artifacts with fresh context; this cognitive diversity catches blind spots shared between the developer and the implementing agent. VSDD is high-ceremony by design and warranted when correctness is non-negotiable, such as in financial systems or infrastructure code.

What happens when the spec itself is wrong?

When tests repeatedly fail due to spec ambiguity rather than implementation bugs, the spec needs revision. An arXiv study of specification-driven code generation with LLMs proposes a three-stage workflow (define requirements, generate and refine tests, then produce code that satisfies them), which places requirements specification upstream of both testing and implementation. Treat test failures as feedback to refine the contract: adjust the spec, regenerate, and verify that the updated tests pass.

Does this workflow slow down development compared to direct AI code generation?

Reduced debugging and rework can offset the upfront investment in spec and test writing. The QCon team cited in this guide, which pushes to main multiple times daily, relies on unit and integration tests rather than end-to-end tests. The team raised concerns about the maintainability of AI-generated code and separately noted that TDD-supported practices helped sustain its high deployment frequency.

How do schema-based contracts like Pydantic models fit into this workflow?

Pydantic models serve as dual-purpose contracts: a single model definition can generate the JSON Schema that guides how the LLM returns data and validate the resulting data at runtime. This eliminates the gap between spec and runtime validation. The contract becomes enforceable at both generation time and execution time.

Spec + TDD: The Combination That Actually Produces Shippable AI Code

Spec-driven development combined with test-driven development produces shippable AI-generated code because the spec defines the behavioral contract before generation begins, and failing tests verify each unit of AI output against that contract through enforced Red-Green-Refactor cycles.

TL;DR

AI agents produce code that drifts from requirements once a project spans multiple files. A spec and test suite catch that drift, but only if the discipline holds consistently. This guide covers the five-phase workflow, the four failure modes practitioners hit most often, and when the approach is worth the overhead.

Why AI-Generated Code Needs Both a Spec and a Test Suite

The problem with AI-generated code is that it looks right. An agent produces syntactically valid, well-structured code that quietly misses the behavioral contract you actually intended. Without explicit constraints at the prompt level, agents will also undermine the test-writing discipline designed to catch that drift. Kent Beck, a pioneer and leading proponent of TDD, observed this directly when working with AI agents. From his interview with The Pragmatic Engineer:

"The genie doesn't want to do TDD. It wants to write the code and then write tests that pass."

Beck encountered AI agents that would delete failing tests rather than fix the underlying implementation, as documented in his June 2025 interview. The agent made the test suite "pass" by changing the specification while leaving the underlying code incorrect.

Some studies suggest this pattern appears at scale. A USENIX study found package hallucination rates of about 5.2% for commercial models and 21.7% for open-source models, with JavaScript code more susceptible than Python. GitClear research reported code cloning (12.3%) exceeding refactored/moved code (9.5%) for the first time in their dataset; cloning rose from 8.3% to 12.3% between 2020 and 2024, a 48% relative increase that the report links to AI assistant adoption.

The spec provides the "what." TDD provides the "proof it works." Neither alone is sufficient, a point that writing on spec-driven development covers in depth.

Approach	Strength	Gap
Spec only	Consistency; easy regeneration	No runtime verification; AI can silently drift from the contract
TDD only	Catches regressions; builds confidence	No shared contract for multi-agent or multi-file generation
Spec + TDD	Behavioral contract + automated verification	Requires discipline in both spec evolution and test scope

Beck's practical solution was to enforce TDD at the prompt level. From his system prompt:

"Always follow the TDD cycle: Red -> Green -> Refactor. Write the simplest failing test first. Implement the minimum code needed to make tests pass. Refactor only after tests are passing."

This constraint made each unit of AI work consist of a single failing test followed by the minimum code needed to pass it. The developer stays in the decision loop at every step. Augment Cosmos, Augment Code's unified cloud agents platform, is built to enforce that same discipline at team scale: a single place to run AI agents across the codebase and the wider software development lifecycle, with shared context and memory that compound as work proceeds. Cosmos keeps the behavioral contract as a reviewable, first-class artifact. It routes the spec through a human review checkpoint before agents write, test, and review code, and the conventions and corrections its agents accumulate carry forward to keep parallel work aligned as a codebase evolves.

The Five-Phase Workflow: Spec to Shippable Code

The Spec + TDD workflow follows five concrete phases. Each phase has a gate condition that must be satisfied before advancing to the next.
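One way to picture the gate conditions is as an ordered list of phases, each paired with a check that must hold before work advances. The sketch below is illustrative only: the phase names follow this guide, while `Phase`, `GATES`, `advance`, and the keys of the `state` dict are hypothetical names rather than part of any tool.

```python
from enum import IntEnum


class Phase(IntEnum):
    SPEC = 1
    SCENARIOS = 2
    RED = 3
    GREEN = 4
    REFACTOR = 5


# Hypothetical gate conditions; each reads a plain dict describing project state.
GATES = {
    Phase.SPEC: lambda s: s.get("spec_approved", False),
    Phase.SCENARIOS: lambda s: s.get("scenario_count", 0) > 0,
    Phase.RED: lambda s: s.get("failing_tests", 0) > 0,
    Phase.GREEN: lambda s: s.get("failing_tests", 1) == 0,
    Phase.REFACTOR: lambda s: s.get("failing_tests", 1) == 0,
}


def advance(current: Phase, state: dict) -> Phase:
    """Return the next phase, or raise if the current phase's gate is not met."""
    if not GATES[current](state):
        raise RuntimeError(f"Gate for {current.name} not satisfied")
    if current is Phase.REFACTOR:
        return Phase.RED  # the cycle repeats for the next behavior
    return Phase(current + 1)
```

Raising on an unmet gate, instead of returning a flag, makes it hard for an automated caller to skip ahead silently.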

Phase 1: Write the Spec Stub

Define a minimal schema that captures the business logic and omits the implementation. Birgitta Böckeler's analysis on martinfowler.com describes three implementation levels: spec-first, where the spec is written before coding; spec-anchored, where the spec remains a maintained artifact after completion; and spec-as-source, where the spec is the main source file and generated code is treated as a build artifact. As development proceeds, these contracts become living specs that update when implementation decisions flow back into them.

A spec for an AI content moderation endpoint:

yaml

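# Minimal OpenAPI 3.1 stub; the title and version values are placeholders.
openapi: 3.1.0
info:
  title: Moderation API
  version: 0.1.0
paths:
  /moderate:
    post:
      requestBody:
        required: true
        content:
          application/json: {}
      responses:
        '200':
          description: Moderation result
        '422':
          description: Invalid input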

This spec is the interface contract between generated and hand-written code. It defines inputs, outputs, and error conditions without specifying how moderation scoring works internally.

Phase 2: Decompose into Testable Units via Gherkin Scenarios

OpenAPI-to-Gherkin workflows commonly map one feature to one resource and one scenario to each response path. In this example, the OpenAPI spec above yields:

gherkin

Feature: POST /moderate

Scenario: Content below threshold returns unflagged
  Given valid text "This is a normal product review"
  When POST to /moderate
  Then response status is 200

Scenario: Empty input returns validation error
  Given empty text ""
  When POST to /moderate
  Then response status is 422

Each scenario becomes a failing test. The Gherkin layer is the stable contract; implementations can change without modifying the feature file. As Clearpoint Digital explains: "We abstract the imperative implementation to the step definition layer, so if that implementation changes, we only need to change the step definitions, not both the steps and feature files."
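As a rough illustration of "each scenario becomes a failing test," the sketch below derives one pytest-style function name per scenario from the feature text above. It is not a Gherkin parser, and `scenario_test_names` is a hypothetical helper; real projects typically rely on a BDD framework for this mapping.

```python
import re

FEATURE = """\
Feature: POST /moderate

Scenario: Content below threshold returns unflagged
  Given valid text "This is a normal product review"
  When POST to /moderate
  Then response status is 200

Scenario: Empty input returns validation error
  Given empty text ""
  When POST to /moderate
  Then response status is 422
"""


def scenario_test_names(feature_text: str) -> list[str]:
    """Turn each 'Scenario:' title into a pytest-style function name."""
    names = []
    for line in feature_text.splitlines():
        match = re.match(r"\s*Scenario:\s*(.+)", line)
        if match:
            # Lowercase the title and collapse non-alphanumerics into underscores.
            slug = re.sub(r"[^a-z0-9]+", "_", match.group(1).lower()).strip("_")
            names.append(f"test_{slug}")
    return names
```

Because the names come from the scenario titles alone, the step wording underneath can change without renaming the tests.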

Phase 3: Write the First Failing Test (Red)

Tests assert concrete business behavior. They verify observable outcomes and stay independent of implementation details. Using the slash commands pattern, the Red phase produces this illustrative pytest example:

python

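# Red phase: the moderation module does not exist yet, so this import fails.
from moderation import moderate_content


def test_score_below_threshold_returns_unflagged():
    # Benign text should come back unflagged, with a score under the threshold.
    result = moderate_content(text="This is a normal product review", threshold=0.5)
    assert result["flagged"] is False
    assert result["score"] < 0.5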

Running pytest confirms the test fails: moderate_content does not exist yet. This is the Red state.
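An automated workflow can confirm the Red state in code before the agent is allowed to implement anything. The sketch below is a minimal version under stated assumptions: `moderate_content` is stubbed to raise, the check repeats the assertions from the pytest example (without the `test_` prefix, so a test runner does not collect this copy), and `is_red` is a hypothetical helper.

```python
from typing import Callable


def moderate_content(text: str, threshold: float = 0.5) -> dict:
    """Stand-in for the not-yet-written implementation (hypothetical stub)."""
    raise NotImplementedError("moderate_content is not implemented yet")


def score_below_threshold_returns_unflagged() -> None:
    # Same assertions as the pytest example above.
    result = moderate_content(text="This is a normal product review", threshold=0.5)
    assert result["flagged"] is False
    assert result["score"] < 0.5


def is_red(checks: list[Callable[[], None]]) -> bool:
    """Return True only when every check fails or errors."""
    if not checks:
        return False  # an empty suite proves nothing
    failed = 0
    for check in checks:
        try:
            check()
        except Exception:
            failed += 1
    return failed == len(checks)
```

Treating an empty suite as "not red" guards against an agent satisfying the gate by supplying no tests at all.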

Phase 4: Agent Implements Minimum Code (Green)

The AI agent receives the failing tests and the spec as context, then writes the minimum implementation:

python

from pydantic import BaseModel, Field


class ModerationRequest(BaseModel):
    text: str = Field(..., max_length=10000)
    threshold: float = Field(default=0.5)


def moderate_content(text: str, threshold: float = 0.5) -> dict:
    # Validate inputs against the contract; raises pydantic.ValidationError if invalid.
    ModerationRequest(text=text, threshold=threshold)
    # Minimum code to turn the test green: a constant, unflagged result.
    return {"flagged": False, "score": 0.0, "category": ""}

Running pytest against this implementation turns the Phase 3 test green: the function returns flagged as False and a score of 0.0, which is below the 0.5 threshold. The Pydantic model serves as a dual-purpose contract: it validates inputs at runtime and generates JSON Schema compliant with Draft 2020-12 and the OpenAPI Specification v3.1.0.
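The dual-purpose role can be sketched with the same model. This assumes Pydantic v2, where `model_json_schema()` and `model_validate()` are available; `is_valid_request` is a hypothetical helper added for illustration.

```python
from pydantic import BaseModel, Field, ValidationError


class ModerationRequest(BaseModel):
    text: str = Field(..., max_length=10000)
    threshold: float = Field(default=0.5)


# Generation time: export the schema that can be handed to a model or a spec.
schema = ModerationRequest.model_json_schema()


def is_valid_request(payload: dict) -> bool:
    """Runtime: validate a payload against the same model definition."""
    try:
        ModerationRequest.model_validate(payload)
        return True
    except ValidationError:
        return False
```

Both the exported schema and the runtime check come from one class, so a change to `max_length` shows up in both places at once.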

Phase 5: Refactor with Spec as Safety Net

With passing tests, the developer restructures code for readability or reuse. Together, the spec and test suite prevent behavioral regression. If tests stay green after refactoring, behavior stays in parity with the spec.

This cycle repeats for each new behavior added to the implementation. Edge cases enter the implementation only when a failing test specifies them.

The VSDD Pipeline: Adversarial Verification for Critical Systems

The VSDD pipeline is a practitioner-described extension of the Spec + TDD workflow that adds adversarial review for systems where correctness is non-negotiable. In the cited description, VSDD fuses three paradigms into sequential gates:

Spec-Driven Development: The contract is defined before implementation
Test-Driven Development: Red -> Green -> Refactor is enforced at each step
Verification-Driven Development: All surviving code is subjected to adversarial refinement by a different model family

text

SPEC CRYSTALLIZATION  ->  TDD IMPLEMENTATION  ->  ADVERSARIAL ROAST
        |                         |                        |
Human approves spec         All tests pass       Different AI model
                                                  reviews everything
        |                         |                        |
 FEEDBACK LOOP       ->   FORMAL HARDENING    ->   CONVERGENCE
  (fix findings)          (mutation testing         (adversary forced
                           >=95% score)              to hallucinate)

VSDD-related materials describe roles such as Architect, Builder, Tracker, and Adversary, but the exact set of roles and responsibilities varies by source. The rationale for using multiple models aligns with broader patterns in multi-agent systems.
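The mutation-testing threshold in the diagram can be made concrete with a toy example. The sketch below uses hand-written mutants of a hypothetical `is_flagged` function; real mutation-testing tools generate mutants automatically, and none of these names come from the VSDD materials.

```python
from typing import Callable

Classifier = Callable[[float, float], bool]


def is_flagged(score: float, threshold: float) -> bool:
    """Hypothetical unit under test: flag content at or above the threshold."""
    return score >= threshold


# Hand-written mutants of is_flagged; real tools generate these automatically.
MUTANTS: list[Classifier] = [
    lambda score, threshold: score > threshold,   # boundary changed
    lambda score, threshold: score <= threshold,  # comparison inverted
    lambda score, threshold: True,                # logic removed
]


def weak_suite(fn: Classifier) -> bool:
    return fn(0.9, 0.5) is True and fn(0.1, 0.5) is False


def strong_suite(fn: Classifier) -> bool:
    # Adds the boundary case that the weak suite misses.
    return weak_suite(fn) and fn(0.5, 0.5) is True


def mutation_score(suite: Callable[[Classifier], bool]) -> float:
    """Fraction of mutants killed; a mutant is killed when the suite fails on it."""
    killed = sum(1 for mutant in MUTANTS if not suite(mutant))
    return killed / len(MUTANTS)


def passes_hardening_gate(suite: Callable[[Classifier], bool], threshold: float = 0.95) -> bool:
    return mutation_score(suite) >= threshold
```

Both suites pass against the real `is_flagged`, but only the suite with the boundary case kills every mutant, which is the gap a mutation score exposes.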

The Builder must be explicitly instructed: "You are operating under strict TDD. Write tests FIRST. Do NOT write implementation code until I confirm all tests fail. When implementing, write the MINIMUM code to pass each test." Without this constraint, AI models will naturally try to write both the implementation and the tests simultaneously, collapsing the feedback loop.

Augment Cosmos follows a similar structure. Cosmos analyzes the codebase through its Context Engine and holds the spec for review before agents execute, then launches parallel agents to write the code while a Deep Code Review expert checks results against the spec before changes reach the branch. Its Context Engine performs semantic dependency graph analysis across 400,000+ files. That architectural awareness keeps agents from generating code that contradicts established patterns.

Where the Workflow Breaks and How to Recover

Four failure modes recur across practitioners using Spec + TDD with AI agents.

Spec drift is the most structurally dangerous. Thoughtworks notes that "code generation from spec to LLM is not deterministic, which creates challenges for upgrades and maintenance." An AI agent that makes autonomous multi-file changes can propagate spec drift across an entire codebase in a single session. The fix: treat spec files as version-controlled artifacts and diff them after each regeneration.
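In practice the diff usually comes from version control, but the idea can be sketched with the standard library. `spec_diff` and `has_drifted` below are hypothetical helpers, and the default file name is borrowed from the CI example in this guide.

```python
import difflib


def spec_diff(before: str, after: str, name: str = "moderation-spec.yaml") -> list[str]:
    """Return unified-diff lines between two versions of a spec file."""
    return list(
        difflib.unified_diff(
            before.splitlines(),
            after.splitlines(),
            fromfile=f"a/{name}",
            tofile=f"b/{name}",
            lineterm="",
        )
    )


def has_drifted(before: str, after: str) -> bool:
    """True when regeneration changed the spec text at all."""
    return bool(spec_diff(before, after))
```

A non-empty diff does not mean the change is wrong; it means a human should look at it before the regenerated code is accepted.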

Test inversion occurs when AI generates both code and tests. The result is tautological tests that confirm whatever the implementation does while leaving the system's real requirements unverified. The earlier Beck example is a community signal rather than a primary technical source, so treat it as illustrative. The countermeasure is structural. Write the tests before the code, so the AI cannot grade its own work.

Semantic drift in refactoring is the quietest failure mode. AI-generated refactoring can change a function's behavior without touching its interface, escaping type checkers and integration tests entirely. The pattern shows up often in database access: the agent replaces a batched query with individual lookups. The signature is unchanged, unit tests pass, and the problem surfaces only under production load. Catching this requires property-based or performance tests at the contract boundary, not just behavioral assertions.
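A performance-oriented contract test for the batched-query case might count round trips instead of only checking results. The sketch below uses an in-memory stand-in for a database; `CountingStore` and both loader functions are hypothetical.

```python
from typing import Optional


class CountingStore:
    """In-memory stand-in for a database that counts round trips (hypothetical)."""

    def __init__(self, rows: dict[int, str]):
        self.rows = rows
        self.queries = 0

    def fetch_many(self, ids: list[int]) -> dict[int, str]:
        self.queries += 1
        return {i: self.rows[i] for i in ids if i in self.rows}

    def fetch_one(self, id_: int) -> Optional[str]:
        self.queries += 1
        return self.rows.get(id_)


def load_names_batched(store: CountingStore, ids: list[int]) -> dict[int, str]:
    return store.fetch_many(ids)


def load_names_drifted(store: CountingStore, ids: list[int]) -> dict[int, str]:
    # Same signature and same result, but one round trip per id.
    found = {}
    for i in ids:
        name = store.fetch_one(i)
        if name is not None:
            found[i] = name
    return found


def query_count(loader, ids: list[int]) -> int:
    """Run a loader against a fresh store and report how many queries it issued."""
    store = CountingStore({1: "a", 2: "b", 3: "c"})
    loader(store, ids)
    return store.queries
```

A behavioral assertion sees identical output from both loaders; an assertion such as `query_count(loader, ids) == 1` is what separates them.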

Architectural drift compounds in large codebases where AI agents operate with limited context. The symptom is familiar: the agent creates a new HTTP client when a centralized one exists, or uses raw SQL when a repository pattern is in place.
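One lightweight guard against the duplicate-HTTP-client symptom is a lint-style check on imports. The sketch below assumes a hypothetical policy in which only `app/http_client.py` may import an HTTP library directly; the path and the restricted package list are illustrative.

```python
import ast

# Hypothetical policy: only this module may import an HTTP library directly.
ALLOWED_MODULE = "app/http_client.py"
RESTRICTED = {"requests", "httpx", "urllib3"}


def restricted_imports(source: str, path: str) -> list[str]:
    """Return restricted top-level packages imported by a file outside the allowed module."""
    if path == ALLOWED_MODULE:
        return []
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            found.add(node.module.split(".")[0])
    return sorted(found & RESTRICTED)
```

The check parses source text without importing it, so it can run over agent-generated files before they are executed.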

On Augment Cosmos, a Deep Code Review expert checks spec compliance before AI-generated code reaches your branch. The check catches drift at review time, well ahead of merge.

Decision Framework: When to Write, When to Generate, When to Stop

Three decision points determine whether the Spec + TDD workflow produces value or overhead.

When to handwrite vs. generate: Write by hand when the logic is domain-specific, security-critical, or has no obvious analog in public training data. In those cases the agent has no reliable model to draw on, and the spec alone won't prevent drift. Generate when you're dealing with boilerplate, data mapping, or serialization against a known interface; these are where AI output is most predictable, and spec constraints are tight enough to catch deviations. The QCon team that pushes to main multiple times daily explicitly turns off Copilot autocomplete during pair programming because "it interrupts more than it creates value," but uses Copilot chat for third-party library questions where the interface contract is already defined externally.

When to revise the spec: If tests repeatedly fail due to spec ambiguity, fix the contract itself. Vague contracts, such as "it should classify text," yield inconsistent test results. Jason Gorman argues that TDD works well for AI-assisted programming because its small-step discipline keeps each AI interaction within the model's reliable working context.

When to stop iterating: Once regressions no longer reveal new edge cases and integration tests pass, freeze the spec. Over-polishing through regeneration often hurts stability. Unlike traditional technical debt, AI-related debt can compound quickly as accumulated complexity and drift amplify issues over time.

Augment Cosmos addresses spec evolution through review checkpoints and shared memory. The spec comes back for human review before agents execute, and as agents work over a shared filesystem, the patterns, conventions, and corrections they accumulate carry forward to later agents through tenant memory. Cosmos analyzes the codebase with its Context Engine before that work begins.

Automating Validation in CI/CD Pipelines

The Spec + TDD workflow extends into CI/CD via spec-conformance gates. An arXiv paper on spec-driven development shows how executable specifications and contract testing tools like Specmatic keep an implementation aligned with its spec, so a conformance gate can fail the build when code drifts from the contract. These patterns fit within broader AI workflows that teams are adopting for continuous validation.
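The shape of a conformance gate can be sketched without any particular tool. The example below is not how Specmatic works internally; it only checks observed status codes against the responses declared in the Phase 1 spec, and `SPEC_RESPONSES`, `undeclared_responses`, and `conformance_gate` are hypothetical names.

```python
# Declared responses from the Phase 1 spec, keyed by (method, path).
SPEC_RESPONSES = {("POST", "/moderate"): {200, 422}}


def undeclared_responses(observed: list[tuple[str, str, int]]) -> list[tuple[str, str, int]]:
    """Return observed (method, path, status) triples that the spec does not declare."""
    return [
        (method, path, status)
        for method, path, status in observed
        if status not in SPEC_RESPONSES.get((method, path), set())
    ]


def conformance_gate(observed: list[tuple[str, str, int]]) -> int:
    """Exit-code style result: 0 when everything conforms, 1 otherwise."""
    return 1 if undeclared_responses(observed) else 0
```

Returning an exit-code style integer keeps the function easy to wire into a CI step that fails the build on a non-zero result.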

Path-based triggers fire validation when specs change:

yaml

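name: Spec validation
on:
  pull_request:
    paths:
      - 'specs/**/*.yaml'
      - 'specs/**/*.json'

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Validate OpenAPI spec
        run: npx @redocly/cli lint specs/moderation-spec.yaml
      - name: Install test dependencies  # assumed package list
        run: pip install pytest pydantic
      - name: Run contract tests
        run: pytest tests/ -m "contract"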

The Specmatic loop is one example of a self-correcting workflow: it validates AI-generated code against API contract tests, and test failures feed back into the generation process instead of stopping immediately for human review. Contract tests confirm that an implementation conforms to the specification.
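The control flow of such a loop can be sketched in a few lines. This is a generic illustration, not Specmatic's implementation: `generate` stands in for an agent call and `run_contract_tests` for a test runner, and both are hypothetical callables supplied by the caller.

```python
from typing import Callable, Optional


def generate_until_green(
    generate: Callable[[Optional[str]], str],
    run_contract_tests: Callable[[str], Optional[str]],
    max_attempts: int = 3,
) -> tuple[Optional[str], int]:
    """Regenerate code, feeding each failure message back, until tests pass.

    run_contract_tests returns None on success or a failure message.
    Returns (code, attempts); code is None when the loop gives up and the
    work should go to a human reviewer.
    """
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        code = generate(feedback)
        feedback = run_contract_tests(code)
        if feedback is None:
            return code, attempt
    return None, max_attempts
```

The attempt cap is the important design choice: without it, a spec that cannot be satisfied would keep the loop regenerating indefinitely.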

Microsoft's Azure SDK pattern provides a production-scale example: generated code lives in Generated/ folders, customizations in the Customizations/ folder, and AI agents are explicitly prohibited from removing or disabling existing tests unless instructed. Code can be regenerated by running the generator when needed.

For prompt and model updates, a CI/CD regression suite can catch behavioral changes introduced when you modify a prompt or swap a model. Evidently AI's materials do not specifically document behavioral drift occurring without code changes, nor do they recommend versioning prompt files alongside code.

Augment Cosmos applies spec-driven orchestration that analyzes the codebase through its Context Engine ahead of generation, to catch failures before code reaches the branch. Separately, a 2026 arXiv study found that AI coding agents frequently break previously passing tests, and that graph-based pre-change impact analysis cut that regression rate from 6.08% to 1.82% (a 70% reduction) across 100 SWE-bench Verified instances. Cosmos's architectural awareness and 400,000+ file scale are platform capabilities, distinct from the independently benchmarked figures in the cited studies.
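The general idea behind graph-based pre-change impact analysis can be sketched as a reverse traversal of a dependency graph. This is neither the cited study's method nor Cosmos's implementation; the graph, file names, and function names below are hypothetical.

```python
from collections import deque

# Hypothetical import graph: each module maps to the modules it depends on.
DEPENDS_ON = {
    "tests/test_api.py": ["api.py"],
    "tests/test_scoring.py": ["scoring.py"],
    "api.py": ["moderation.py"],
    "moderation.py": ["scoring.py"],
    "scoring.py": [],
}


def impacted_by(changed: str, depends_on: dict[str, list[str]]) -> set[str]:
    """Return every module that transitively depends on the changed module."""
    # Invert the graph so each module points at its direct dependents.
    dependents: dict[str, list[str]] = {}
    for module, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(module)
    seen: set[str] = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def impacted_tests(changed: str, depends_on: dict[str, list[str]]) -> list[str]:
    """Tests to run before accepting a change to the given module."""
    return sorted(m for m in impacted_by(changed, depends_on) if m.startswith("tests/"))
```

Running the impacted tests before the agent's change is accepted is what makes the analysis "pre-change": regressions surface while the edit is still cheap to reject.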

Ship Spec-Verified Code This Sprint

Generation speed is no longer the bottleneck; verification discipline is. Code that ships without a spec and test suite will look fine until the third sprint, when behavioral drift compounds and refactoring becomes archaeology. The spec is the rein. TDD is the mechanism that makes it hold.

Start with one module and one behavior. Write the OpenAPI or JSON Schema contract, decompose it into Gherkin scenarios, write the first failing test, and let the agent implement the minimum code required to make it pass. If the workflow requires multi-agent coordination, run it on a platform that keeps the spec under review and parallel agents aligned as work branches and converges, which is the problem Augment Cosmos is built to solve.

Written by

Ani Galstian
