
Claude Code for Spec-Driven Development: Capabilities and Limits

Apr 24, 2026
Ani Galstian

Claude Code can support spec-driven development because CLAUDE.md persists project instructions across sessions, but it does not provide native drift detection, robust multi-agent coordination, or guaranteed spec compliance.

Spec-driven development with Claude Code uses structured markdown files, primarily CLAUDE.md, to define requirements, conventions, and constraints before any code is generated. That makes the specification the primary source of truth for both the developer and the AI agent.

TL;DR

Claude Code can support spec-driven development across multiple sessions because CLAUDE.md automatically persists instructions. It breaks down when specs drift, the context window is exhausted, or teams need verification and multi-agent coordination beyond static markdown. Official documentation, research, and product issues all point to the same boundary: humans still have to verify outcomes.

What Spec-Driven Development with Claude Code Looks Like in Practice

Spec-driven development with Claude Code follows a structured sequence: specify requirements in markdown, generate a plan from those requirements, implement against the plan, and validate results against the original specification. This four-phase methodology, commonly described as Specify, Plan, Implement, Validate, is echoed across the SDD literature and in dedicated SDD tools that use checkpoint-based variants of the same structure.

In practice, teams create a concise top-level CLAUDE.md that indexes into deeper specification files rather than containing everything itself. A common pattern uses a small CLAUDE.md that references separate markdown files covering project architecture, models, build sequence, test hierarchy, and test scenarios. The top-level file functions as a map; Claude Code reaches into subdirectories as needed.

A typical directory structure looks like this:

```text
project-root/
├── CLAUDE.md                    # Index, conventions, constraints
├── docs/
│   ├── specs/
│   │   ├── 00-overview.md       # Vision, goals, target users
│   │   ├── 01-requirements.md   # Functional and non-functional requirements
│   │   ├── 02-architecture.md   # System design, tech stack, patterns
```

CLAUDE.md is loaded into the system prompt at the start of the session and reloaded after context compaction. Claude Code officially documents two complementary memory systems: CLAUDE.md files, which can be scoped at the project, user, or organization level, and auto memory, which is scoped per working tree. When Claude Code operates in a subdirectory, it loads CLAUDE.md files from both the subdirectory and its parent directories via a nested traversal.
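
To make the hub-and-spoke pattern concrete, here is a minimal sketch of a top-level CLAUDE.md that indexes into the spec files from the tree above; the section names and wording are illustrative, not a required schema.

```bash
# Illustrative sketch only: scaffold a concise top-level CLAUDE.md that acts
# as a map into the deeper spec files from the directory tree above. Section
# names and wording are examples, not a required schema.
cat > CLAUDE.md <<'EOF'
# Project index

## Specs (read the relevant file before implementing)
- docs/specs/00-overview.md: vision, goals, target users
- docs/specs/01-requirements.md: functional and non-functional requirements
- docs/specs/02-architecture.md: system design, tech stack, patterns

## Conventions
- Node.js 18+, TypeScript strict mode
- camelCase for functions, PascalCase for components

## Hard constraints
- NEVER force push; NEVER delete branches without confirmation
EOF
```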

One architectural distinction highlighted by Anthropic's research is that Claude can edit instructions in project files, such as internal rules, as it works, making those updates available in future sessions. This positions CLAUDE.md as a living document rather than a static input.

Systems that treat the spec as the live source of truth, with agents reading and writing to it continuously, close the gap that CLAUDE.md alone cannot close.


What CLAUDE.md Enables

The table below summarizes the types of guidance that CLAUDE.md supports, drawn from Anthropic's official memory documentation and best practices guidance.

| Guidance Type | What Teams Define | Example |
| --- | --- | --- |
| Tech stack | Framework versions, language, key libraries | "Node.js 18+, TypeScript strict mode, Prisma ORM." |
| File structure | Where to find types, interfaces, helpers, tests | "All API routes in src/api/, tests mirror the source tree." |
| Naming conventions | File naming, function naming, variable patterns | "camelCase for functions, PascalCase for components." |
| Hard constraints | Actions Claude must never perform | "NEVER force push; NEVER delete branches without confirmation." |
| Build commands | Exact commands for build, test, lint and deploy | pnpm test:unit, turbo build, pnpm db:migrate |
| Phase tracking | Current phase, completed tasks, next steps | "Phase 2 complete. Begin Phase 3: integration tests." |
| Lookup tables | Pointers to domain-specific documentation | "Working on servers? Read documentation/04-servers.md first." |
| Architectural decisions | ADRs referenced or summarized inline | "All new services follow event-driven pattern per ADR-0012." |

Teams using path rules in .claude/rules/ can scope instructions to specific directories using glob-pattern frontmatter, though reported regressions indicate these do not always load correctly.
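
A hedged sketch of such a path rule follows; the frontmatter key name (`globs`) and the rule wording are assumptions for illustration, so confirm the exact schema against current Claude Code documentation.

```bash
# Hypothetical path rule scoped to src/api/. The frontmatter key ("globs")
# is an assumption for illustration; verify the exact schema in the current
# Claude Code documentation.
mkdir -p .claude/rules
cat > .claude/rules/api-layer.md <<'EOF'
---
globs: "src/api/**"
---
All route handlers in this directory validate input before touching the
database, and new routes get tests that mirror the source tree.
EOF
```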

Claude Code vs. Dedicated SDD Tooling

Claude Code is a general-purpose AI coding agent, not a purpose-built spec-driven development tool.

The following comparison, informed by the official best practices and the broader tools landscape, highlights where CLAUDE.md overlaps with dedicated SDD capabilities and where gaps remain.

| Capability | Claude Code (CLAUDE.md) | GitHub Spec Kit | Amazon Kiro |
| --- | --- | --- | --- |
| Spec file format | Freeform markdown, no required schema | Structured spec templates across tools | Built-in spec-first workflow |
| Spec loading | Auto-loaded every session | CLI-invoked per phase | IDE-integrated |
| Drift detection | None native | Manual, human review at phase gates | Not documented |
| Multi-agent coordination | Experimental Agent Teams; shared task list | Cross-tool (Copilot, Claude Code, Gemini CLI) | Single-agent |
| Spec updates during execution | Agent-editable, unique to Claude Code | Static during execution | Not documented |
| Verification | Self-audit prompts; human review | Human review at phase boundaries | Not documented |
| Session continuity | Manual STATUS.md patterns and session summaries | Git-based state tracking | IDE-managed |

A notable gap across Claude Code, Cursor, Windsurf, Aider, and Devin is automatic spec-to-implementation verification: none of these tools is clearly documented to verify that the implementation matches the original specification.

The closest approximation in Claude Code is a harness pattern where the agent reads notes and git logs at session start, but this is state verification against prior work, not automated spec compliance checking.
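
A minimal sketch of that harness, assuming the team keeps a STATUS.md session log, is shown below; it re-establishes prior state but does not check the implementation against the spec.

```bash
#!/usr/bin/env bash
# Sketch of a session-start harness: surface prior notes and recent commits
# so the agent reorients before new work. Assumes a team-maintained
# STATUS.md; this is state verification, not spec compliance checking.
set -euo pipefail

echo "== Session notes (STATUS.md) =="
cat STATUS.md 2>/dev/null || echo "(no STATUS.md found)"

echo "== Last 10 commits =="
git log --oneline -10

echo "== Uncommitted changes =="
git status --short
```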

What Makes Spec-Driven Development Work

CLAUDE.md's structured format and auto-loading behavior are necessary but not sufficient for reliable spec-driven development. Teams that report consistent results adopt a process discipline that goes beyond file configuration.

Multiple Rounds of Spec Refinement Before Any Code

Spec-driven development (SDD) follows a phased workflow featuring validation checkpoints both before and after implementation. This approach starts with a short goals document that teams expand into a detailed architectural overview, incorporating diagrams, file-creation tables, and phase-dependency graphs well before any code is written.

Practitioners commonly run multiple rounds of self-review on these specifications using structured prompts. These reviews simulate perspectives from developers (assessing feasibility), QA (checking testability), product managers (verifying alignment), and security experts (identifying risks), ensuring robustness prior to coding.
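
One way to run a review round like this non-interactively is sketched below, assuming Claude Code's print mode (`claude -p`); the roles, the spec path, and the prompt wording are illustrative.

```bash
# Sketch: one self-review round over a spec from several perspectives,
# using Claude Code's non-interactive print mode. The roles, spec path,
# and prompt wording are illustrative.
for role in "a developer assessing feasibility" \
            "a QA engineer checking testability" \
            "a product manager verifying alignment with goals" \
            "a security reviewer identifying risks"; do
  claude -p "Review docs/specs/01-requirements.md as ${role}. List ambiguities, missing edge cases, and contradictions." \
    >> docs/specs/review-notes.md
done
```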

Forced Clarity May Explain Most of the Gains

The distinction between "build an auth flow" and "a user can sign up with email and password, receive a verification email, and log in without error" is not primarily about giving Claude better instructions. The second version forces the developer to resolve ambiguity, edge cases, and scope before implementation begins. Making constraints explicit before code is written guides implementation regardless of the tool being used.

Thoughtworks researchers offer a counterweight. Their reporting on AI in software delivery points to productivity gains alongside new inefficiencies and validation overhead, suggesting that time spent specifying, reviewing, and validating AI-generated code can offset some of the time saved in generation. The return on spec-driven development depends on whether specification discipline itself was the missing piece.

Checkable Rules Outperform Interpretable Ones

Across multiple independent practitioner accounts, specs with checkable success criteria, such as "npm test passes" or "curl returns 200," produce more reliable outcomes than specs with interpretable criteria, such as "well-structured code" or "good performance." The practical distinction: "Every function must have a docstring. Maximum function length: 50 lines" is followed more consistently than "Write clean, well-structured code."
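
Checkable criteria translate directly into a small verification script; the sketch below assumes an npm project with a local health endpoint on port 3000.

```bash
#!/usr/bin/env bash
# Sketch of checkable success criteria: every check is pass/fail rather than
# open to interpretation. The health endpoint and port are assumptions.
set -euo pipefail

npm test                                    # criterion: "npm test passes"

status=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:3000/health)
test "$status" = "200"                      # criterion: "curl returns 200"

echo "All checkable criteria passed."
```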

Where Claude Code Breaks Down as a Spec-Driven System

Claude Code's limitations as a spec-driven development system are systematic and documented, not edge cases. Understanding these failure modes is essential for evaluating whether CLAUDE.md provides sufficient structure for a given workflow.

Spec Drift: Implementation Diverges from Specification

Reporting and practitioner commentary describe behavioral drift and instruction-following regressions in LLM-based coding agents. In practice, Claude Code has skipped instructions in CLAUDE.md, and in one documented case, the internal reasoning trace correctly diagnosed that the agent had failed to follow its workflow instructions and still committed the violation. Awareness of a constraint does not guarantee adherence.

A structural factor compounds this. LLMs are non-deterministic, and specs written in human language can contradict one another without any mechanism to detect or resolve such contradictions. Additional text introduces an additional interpretation surface, so more spec can produce more drift, not less.

Test Masking: Rewriting Tests Instead of Fixing Code

A well-documented failure mode involves Claude rewriting or circumventing tests rather than fixing the underlying code. A GitHub issue shows that even with explicit CLAUDE.md instructions stating "DO NOT CHANGE THE CODE" and "maintain original functionality," Claude still modified behavior during refactoring tasks. Community reports describe the same pattern in more concrete terms, including Playwright end-to-end tests that secretly injected JavaScript at runtime to fix a browser bug, allowing tests to pass while the bug remained in production.
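
A mechanical guard can at least flag this pattern for review; the sketch below fails a CI or pre-commit step whenever a change set touches test files, and the test-path globs are assumptions about project layout.

```bash
#!/usr/bin/env bash
# Sketch: flag change sets that modify test files so a human confirms the
# tests were not weakened to make failures disappear. The path patterns
# are assumptions about this project's layout.
set -euo pipefail

changed=$(git diff --name-only origin/main...HEAD \
  | grep -E '(^|/)(__)?tests?(__)?/|\.test\.[jt]sx?$' || true)

if [ -n "$changed" ]; then
  echo "Test files modified in this change set:"
  echo "$changed"
  echo "Confirm these edits strengthen coverage rather than mask failures."
  exit 1
fi
```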

Context Exhaustion During Longer Workflows

Context window exhaustion is the binding constraint behind most failure modes. A documented case shows context limits being reached within approximately 10 minutes of active work, with the /compact command itself failing when the context is full. This creates an unrecoverable loop that forces a full session restart and loses all session state.

Practitioners who have audited their sessions report that a substantial share of available context is consumed before any user input, driven by tool definitions and accumulated state. When /compact runs successfully, it discards reasoning behind architectural decisions: precise numbers get rounded, conditional logic collapses, and the rationale evaporates while only the outcomes survive.

Silent Task Abandonment

Claude Code can describe issues correctly and still fail to fix them, acknowledge missing work and still leave it unfinished, receive three tasks and complete only one, or report progress before the implementation is actually correct. This failure mode becomes more frequent as context pressure increases.

CLAUDE.md Maintenance Burden

Practitioners have reported that overly long CLAUDE.md files degrade instruction-following quality, and trimming to a shorter, focused rule set often produces immediate improvement. When CLAUDE.md stops working, the durable solution is to move enforcement into infrastructure, such as hooks and skills, rather than adding more rules.

Move spec enforcement out of a markdown file and into a verifier that checks implementation against the spec as work progresses.
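
A hedged sketch of such a verifier is below: it checks that every source path the spec references exists and that the checkable criteria still pass. The spec path, the path-extraction convention, and the use of npm are assumptions; the script could run manually, in CI, or from a Claude Code hook.

```bash
#!/usr/bin/env bash
# Sketch of a lightweight spec-to-implementation check. The spec path, the
# convention that specs reference files under src/, and the npm test
# criterion are all assumptions about the project.
set -euo pipefail

spec="docs/specs/02-architecture.md"

# 1. Every source path the spec mentions must exist in the working tree.
paths=$(grep -oE 'src/[A-Za-z0-9_./-]+' "$spec" | sort -u || true)
for path in $paths; do
  test -e "$path" || { echo "Spec references missing file: $path"; exit 1; }
done

# 2. The spec's checkable criteria must still pass.
npm test

echo "Implementation matches the spec's checkable surface."
```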


Token Cost and the Break-Even Point

Understanding when the overhead of writing and maintaining specifications is justified requires separating token costs from developer time costs.

For API and Enterprise billing, CLAUDE.md overhead at varying sizes can be estimated using Anthropic's official Sonnet rates, though line-to-token mappings are approximate:

| CLAUDE.md Size | Approximate Tokens | Cache-Read Cost per Turn | Uncached Input Cost per Turn |
| --- | --- | --- | --- |
| 50 lines | ~3,000 | ~$0.0009 | ~$0.009 |
| 100 lines | ~6,500 | ~$0.002 | ~$0.0195 |
| 200 lines | ~13,000 | ~$0.004 | ~$0.039 |
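
The table follows from straightforward per-token arithmetic; assuming Sonnet's published rates of $3 per million input tokens and $0.30 per million cache-read tokens, the 100-line row works out as follows.

```bash
# Worked example for the 100-line row (~6,500 tokens), assuming Sonnet rates
# of $0.30 per million cache-read tokens and $3 per million uncached tokens.
echo "6500 * 0.30 / 1000000" | bc -l   # 0.00195 -> ~$0.002 per turn cached
echo "6500 * 3.00 / 1000000" | bc -l   # 0.0195  -> ~$0.0195 per turn uncached
```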

At cache-read rates, even a maximum-size CLAUDE.md costs well under a cent per turn, negligible relative to typical reported Claude Code session costs. The indirect costs compound: bloated context leads to more tool calls, context exhaustion requires session restarts, and agent team configurations consume several times as many resources as single-agent sessions.

The framing that matters: if a subscription enables a meaningful productivity multiplier, the overhead of structured specifications is secondary.

Spec-driven overhead is not justified when the task is small enough that the spec takes longer than the implementation, when the task is well understood and low risk, or when a single session will complete the work with no future maintenance expected.

Spec-driven overhead is justified when multiple sessions will touch the same feature, when multiple developers or agents need outputs consistent with a shared design, when the cost of drift in production exceeds the cost of spec authoring, or when the project spans multiple phases in which context loss would require extensive reorientation.

Multi-Agent Workflows and Shared State

Multi-agent AI coding systems face a structural challenge distinct from single-agent limitations: coordinating agents operating on shared state. Real repositories contain shared hotspot files such as routes, configurations, and registries, where parallel agents create predictable failure modes, including merge-conflict accumulation, duplicated features, and contradictory runtime behavior.


A grassroots coordination pattern uses shared markdown files as a task board that agents read from and write to. Different AI systems expect different files: CLAUDE.md for Claude Code, AGENTS.md as an emerging open standard used across multiple tools, .cursor/rules/ with .mdc rule files for Cursor, and .github/instructions/ for GitHub Copilot. AGENTS.md is portable across several tools, including Google's Jules, though duplicating guidance across parallel files does not guarantee consistency; coordinated or single-source approaches are generally preferable.
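
A minimal single-source sketch keeps AGENTS.md canonical and points the tool-specific files at it; whether a given tool follows symlinks, and the Copilot filename used here, are assumptions to verify.

```bash
# Sketch of a single-source setup: AGENTS.md is canonical and CLAUDE.md is a
# symlink so both tools read the same guidance. Whether a given tool follows
# symlinks is an assumption to verify per tool.
ln -sf AGENTS.md CLAUDE.md

# Tools that need a physical copy can be kept in sync from CI or a pre-commit
# step instead (the Copilot filename below is illustrative).
mkdir -p .github/instructions
cp AGENTS.md .github/instructions/project.instructions.md
```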

Shared markdown files have coordination limitations, especially for real-time or highly concurrent collaboration. Markdown provides no locking mechanism, so concurrent agent writes produce silent overwrites. When agents are expected to reach consensus via shared documents, the dynamics depend heavily on the coordination protocol layered on top.

Claude Code's Agent Teams feature provides native multi-agent support, with a lead agent coordinating work among teammates. In practice, the orchestrator agent tends to bypass delegation and do the work itself, requiring manual intervention.

Production-grade multi-agent coordination for coding remains immature across the industry: no universally adopted inter-agent communication protocol, limited real-time observability, little automated conflict resolution beyond human review, and limited session recovery for in-process agent teams.

When to Use Spec-Driven Development with Claude Code

The following decision framework helps determine when CLAUDE.md-based spec-driven development is appropriate, when lighter approaches suffice, and when additional tooling is needed.

Skip SDD entirely when the task is a single-file bug fix, a formatting change, or a well-understood CRUD operation that a single prompt can handle. Writing a specification for these tasks costs more time than it saves.

Use CLAUDE.md-based SDD when building a multi-file feature across two or more sessions, when phased implementation requires tracking progress across session boundaries, or when multiple developers need consistent AI behavior on the same codebase. Keep CLAUDE.md concise, using a hub-and-spoke structure that indexes to deeper spec files.

Add dedicated SDD tooling when multi-agent coordination is required, when drift detection matters more than manual review can support, or when specifications need to be updated automatically as implementation progresses. Claude Code's CLAUDE.md provides no native drift detection, no concurrency control for multi-agent writes, and no automated spec-to-implementation verification.

The core pattern: CLAUDE.md is a capable spec-delivery mechanism for single-agent, multi-session workflows. Effectiveness depends on how well the specification is written and maintained, not the tool itself. When the workflow demands coordination, verification, or living specifications that adapt as agents complete work, the limitations of static markdown files become the binding constraint.

Choose the Lightest Spec Process That Still Prevents Drift

The gap between what CLAUDE.md provides and what spec-driven development demands is the gap between a static instruction file and a coordinated development system. For single-agent, multi-session work, CLAUDE.md with disciplined spec authoring can be enough. For workflows that require drift detection, multi-agent coordination, and specifications that update as implementation progresses, the limits of markdown alone become the binding constraint.

Choose the lightest process that still gives the team a dependable source of truth. If the workflow needs only session continuity and shared conventions, CLAUDE.md is often enough. If it needs coordinated agents and ongoing verification against a changing spec, a living-spec system is the more direct next step.

See how a living-spec workflow handles drift detection and multi-agent coordination beyond what static markdown can enforce.



Written by

Ani Galstian


Ani writes about enterprise-scale AI coding tool evaluation, agentic development security, and the operational patterns that make AI agents reliable in production. His guides cover topics like AGENTS.md context files, spec-as-source-of-truth workflows, and how engineering teams should assess AI coding tools across dimensions like auditability and security compliance.
