Spec-driven development in brownfield enterprise codebases is most effective when teams write change-level specifications rather than full-system specs, because undocumented dependencies and scale make comprehensive upfront specifications impractical.
TL;DR
Brownfield codebases break greenfield SDD assumptions because legacy behavior, dependencies, and contracts are rarely documented. The practical approach is to write specs only for the change being made, incrementally grow coverage, and verify against existing tests and production-observed behavior.
Martin Fowler's article on spec-driven development tools evaluates Kiro, spec-kit, and Tessl but does not quantify the effort to introduce them into existing codebases. This matters for teams maintaining repositories with hundreds of thousands of files, 10-15 years of technical debt, and little surviving architectural documentation.
The problem is structural. SDD demos usually assume blank-slate requirements analysis, but brownfield systems already exist, their contracts are often implicit, and their dependencies are rarely fully documented. Teams need a workflow that begins with understanding the current system and writes narrow specs only for the intended change. Intent's Context Engine gives teams working in large codebases the architectural understanding to map dependencies across 400,000+ files through semantic dependency graph analysis before drafting a change-level spec.
Intent keeps your specs in sync with the live codebase.
Free tier available · VS Code extension · Takes 2 minutes
What Teams Need Before Starting Brownfield SDD
Brownfield SDD starts from an existing codebase rather than a blank page, so its entry requirements differ from greenfield approaches. Three conditions make adoption practical rather than aspirational.
The first condition is a semantic analysis tool capable of building dependency maps across the existing repository. Manual code reading does not scale beyond a few hundred files, and AI-assisted discovery without a dedicated context layer produces incomplete maps that lead to downstream specification gaps.
The second condition is at least one engineer who understands the existing architecture well enough to review and correct the dependency maps the tool produces. Tribal knowledge cannot be fully automated away, but it can be structured and captured. The third condition is a baseline measurement for two DORA metrics: current lead time for changes and change failure rate. Without a pre-adoption baseline, there is no way to assess whether the SDD workflow is delivering the expected improvement in merge speed and regression reduction.
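Both baselines can be computed directly from deployment records. A minimal sketch in Python, assuming a hypothetical list of change records with commit time, deploy time, and a failure flag (the record shape is illustrative, not any tool's actual export format):

```python
from datetime import datetime, timedelta

# Hypothetical change records: commit time, deploy time, and whether the
# deployment caused a failure requiring remediation.
changes = [
    {"committed": datetime(2025, 1, 6, 9),  "deployed": datetime(2025, 1, 7, 9),   "failed": False},
    {"committed": datetime(2025, 1, 8, 10), "deployed": datetime(2025, 1, 10, 10), "failed": True},
    {"committed": datetime(2025, 1, 9, 14), "deployed": datetime(2025, 1, 10, 14), "failed": False},
    {"committed": datetime(2025, 1, 12, 8), "deployed": datetime(2025, 1, 13, 8),  "failed": False},
]

def lead_time_for_changes(changes):
    """Median commit-to-deploy interval (DORA Lead Time for Changes)."""
    deltas = sorted(c["deployed"] - c["committed"] for c in changes)
    mid = len(deltas) // 2
    if len(deltas) % 2:
        return deltas[mid]
    return (deltas[mid - 1] + deltas[mid]) / 2

def change_failure_rate(changes):
    """Share of deployments that required remediation (DORA CFR)."""
    return sum(c["failed"] for c in changes) / len(changes)

print(lead_time_for_changes(changes))          # median lead time as a timedelta
print(f"{change_failure_rate(changes):.0%}")   # change failure rate as a percentage
```

Run once before adoption to record the baseline, then re-run quarterly against the same record shape to track the trend.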
Why Spec-Driven Development Breaks in Brownfield Codebases
Spec-driven development breaks in brownfield environments because five interconnected failure modes invalidate the assumptions built into greenfield SDD workflows. Understanding each failure mode is a prerequisite for adapting the approach.
Comprehensive Specification Is Impractical at Scale
Enterprise codebases with 100,000+ files cannot be comprehensively specified without exceeding human review capacity and practical retrieval limits. InfoQ's analysis states directly that when existing applications are large, it becomes impractical for LLMs to create specifications without exceeding context limits, and even if specifications could be generated, they would be too large for effective human review. The resolution is scope reduction: granular specs must stay closest to the area of change.
Tribal Knowledge Silos Block Specification Authoring
Thoughtworks documents the knowledge loss problem in a whitepaper: with no architecture decision records and little to no test coverage, there is no safety net for incremental feature development. The engineers who understood architectural intent have departed. Modern engineers are not trained on legacy technologies. Writing specifications requires understanding what the system does, and that understanding has evaporated.
Undocumented Dependencies Create Specification Blind Spots
Legacy systems accumulate what GitHub Engineering describes as technical debt and massive, intimidating codebases. Even dedicated boundary-enforcement tooling struggles: a Packwerk review reveals that the technical debt introduced by privacy checks is still a long way from being paid off, despite explicit tooling designed to enforce dependency boundaries.
Intent's Context Engine maps cross-module relationships across 400,000+ files using semantic dependency graph analysis, narrowing the blind spots that leave boundary specs incomplete for teams tracing legacy modernization workflows.
Implicit Behavioral Contracts Resist Formalization
Brownfield systems contain behavioral expectations between components that were never documented: shared timing assumptions, ordering dependencies, and undocumented error-handling behaviors. These implicit contracts must be discovered before they can be encoded in specs. InfoQ warns that gaps between specification and actual system behavior compound over time, resurfacing in different forms whenever code is regenerated based on an incomplete spec.
AI Performance Degrades in Unhealthy Code
Fowler cites Adam Tornhill's research showing LLMs produce a 30% higher defect risk in less-healthy code, and notes that the study's less-healthy code was nowhere near as unhealthy as much legacy code. Kent Beck sharpens the critique on Fowler's blog: writing whole specifications before implementation encodes the assumption that teams will learn nothing during implementation that would change the specification. In a brownfield codebase, every implementation reveals hidden coupling that can invalidate upfront specs.
| Failure Mode | Greenfield Impact | Brownfield Impact |
|---|---|---|
| Scale of specification | Manageable: new system, defined scope | Impractical: 100K+ files exceed review capacity |
| Knowledge availability | The developer defines intent directly | Tribal knowledge lost; original architects departed |
| Dependency visibility | Defined at design time | Undocumented; accumulated over 10-15 years |
| Behavioral contracts | Specified before implementation | Implicit; must be reverse-engineered from production |
| AI code quality | Clean code, lower defect risk | Higher defect risk in unhealthy codebases |
Five Steps Teams Can Use to Apply a Spec-Driven Workflow to Brownfield Codebases
The brownfield SDD workflow differs from greenfield in a fundamental way: specification follows understanding, not the reverse. Instead of writing a comprehensive spec and generating code from it, brownfield SDD requires building an architectural understanding of existing code first, then writing narrow specifications scoped to the intended change.
Step 1: Build Semantic Understanding Across the Existing Codebase
Brownfield SDD begins with understanding the codebase, not with specification authoring. The AI system must understand what exists before anyone can specify what should change. For repositories spanning hundreds of thousands of files, this requires semantic dependency analysis that maps relationships between components, identifies architectural patterns, and surfaces implicit contracts.
Intent's Context Engine processes entire codebases spanning 400,000+ files through semantic dependency graph analysis, building the architectural understanding that enables change-level specification. Without this foundation, specifications are written against an incomplete model of the system, leading to the integration failures Fowler warned about: just because the context windows are larger does not mean AI will properly capture everything inside them.
The RPI Loop formalizes this step: an agent scans the codebase, produces a compact markdown summary of only the relevant state, and does not write code. Research and implementation run in separate phases to prevent context contamination.
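For a Python repository, the research phase can be approximated with the standard-library `ast` module. This is a minimal read-only sketch; the function names and the markdown shape are illustrative, not the output of Intent or any specific RPI tool:

```python
import ast
from pathlib import Path

def scan_imports(root: str) -> dict[str, set[str]]:
    """Research phase: map each module to the modules it imports.

    Read-only by design -- this phase produces understanding, never code.
    """
    graph = {}
    for path in Path(root).rglob("*.py"):
        tree = ast.parse(path.read_text(), filename=str(path))
        deps = set()
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                deps.update(alias.name for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module:
                deps.add(node.module)
        graph[str(path.relative_to(root))] = deps
    return graph

def to_markdown(graph: dict[str, set[str]], focus: str) -> str:
    """Emit a compact markdown summary of only the state relevant to `focus`."""
    lines = [f"## Dependency summary for `{focus}`"]
    for module, deps in sorted(graph.items()):
        if focus in deps or module == focus:
            lines.append(f"- `{module}` imports: {', '.join(sorted(deps)) or 'nothing'}")
    return "\n".join(lines)
```

A production context engine resolves far more than import statements, but even this sketch illustrates the phase separation: the summary feeds the implementation phase, keeping raw codebase content out of that context.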
Step 2: Write Change-Level Specs, Not Full-Codebase Specs
The second step is the paradigm shift that makes brownfield SDD viable: specifications scoped to the delta of the intended change, not the entire system. Rather than attempting to retroactively document an entire legacy codebase, teams write narrow specs covering only what the current change touches. Specification coverage grows organically with each modification, concentrated where it provides the most value: modules under active development.
A change-level spec defines four elements:
- Current behavior: what the system does today, if discoverable from tests or production traffic
- Target behavior: the precise delta from the current state
- Invariants: what must not change in adjacent systems
- Scope boundary: what is explicitly excluded from this change
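The four elements can be captured as a small machine-checkable structure rather than prose alone. A sketch, assuming a hypothetical `ChangeSpec` type and example content rather than any particular tool's format:

```python
from dataclasses import dataclass, field

@dataclass
class ChangeSpec:
    """A change-level spec: the delta of one modification, not the whole system."""
    current_behavior: str   # what the system does today (from tests or prod traffic)
    target_behavior: str    # the precise delta from the current state
    invariants: list[str] = field(default_factory=list)    # must not change nearby
    out_of_scope: list[str] = field(default_factory=list)  # explicitly excluded

    def validate(self) -> None:
        # A spec without a target or without invariants is not reviewable.
        if not self.target_behavior.strip():
            raise ValueError("target behavior is required")
        if not self.invariants:
            raise ValueError("name at least one invariant in adjacent systems")

spec = ChangeSpec(
    current_behavior="POST /orders accepts a missing `currency` and defaults to USD",
    target_behavior="reject a missing `currency` with HTTP 422",
    invariants=["GET /orders response shape unchanged", "existing USD orders untouched"],
    out_of_scope=["multi-currency pricing"],
)
spec.validate()
```

Making the spec a typed object rather than free text lets CI refuse changes whose specs omit invariants or a scope boundary.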
Intent's living specifications give teams drafting these narrow specs the ability to anchor them to dependency evidence from the live codebase, rather than relying on tribal memory or documentation that may be years out of date.
Step 3: Decompose Against Existing Architecture
The change-level spec must be decomposed into implementation tasks that respect the existing architectural structure. Decomposition in brownfield differs from greenfield because the architecture already exists, constraints are real, and deviations from established patterns create a maintenance burden.
Stripe's Minions system validates this at enterprise scale: guidance is applied at a scoped or subdirectory level primarily to avoid a massive global rules file that would exceed the model's context. Decomposition must account for the reality that different parts of the same codebase follow different conventions.
Step 4: Execute in Isolated Worktrees
Implementation tasks execute in parallel using Git worktrees, which provide complete filesystem isolation between concurrent workstreams while sharing the underlying repository data. Each specialist agent receives its own directory, branch, and filesystem state.
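The mechanics are plain git. A sketch of the per-task setup, using a throwaway scratch repo for illustration (directory and branch names are hypothetical; in practice you run the `worktree` commands inside your real checkout):

```shell
set -e
# Scratch repo so the example is self-contained.
repo=$(mktemp -d)/repo
git init -q "$repo" && cd "$repo"
git -c user.name=demo -c user.email=demo@example.com commit -q --allow-empty -m "init"

# Give each concurrent task its own working directory and branch.
git worktree add ../wt-auth-fix -b task/auth-fix
git worktree add ../wt-api-migration -b task/api-migration

git worktree list   # all worktrees share one object store

# Clean up once a task's branch merges.
git worktree remove ../wt-auth-fix
git branch -d task/auth-fix
```

Because worktrees share the repository's object database, the per-task cost is a checkout, not a full clone.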
Boris Cherny, creator and head of Claude Code at Anthropic, describes running many parallel Claude Code sessions and using separate git checkouts for each local session when working on large batch changes, such as codebase-wide migrations. Google patterns further formalize this by assigning specific roles to individual agents, creating systems that are more modular, testable, and reliable.
Intent's multi-agent orchestration executes decomposed brownfield tasks in parallel worktrees, maintaining architectural context across concurrent work streams throughout the execution phase. Resource considerations are real: a 2GB codebase can consume nearly 10GB of disk space in a 20-minute multi-worktree session, requiring per-worktree database instances and worktree-indexed Docker volume names at enterprise scale.
Step 5: Verify Against Spec and Existing Tests
Verification in brownfield serves two purposes: confirming the change matches the spec and confirming the change does not break existing behavior. The second purpose distinguishes brownfield verification from greenfield verification, where no existing behavior needs to be protected. Verification should compare the implementation against both the change-level spec and the existing test suite. That dual check catches the technical debt injection that Fowler's team identified: AI-generated code that includes unrequested features and must integrate with existing systems the AI does not fully understand.
A machine-checkable contract validation step further strengthens verification. The pattern below uses an OpenAPI contract as an example of turning a prose spec into a CI-enforceable artifact:
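A minimal contract of that shape might look like the following; the service name, path, and schema are illustrative, not taken from any real system:

```yaml
openapi: "3.0.3"
info:
  title: Orders boundary contract   # illustrative service name
  version: "1.0.0"
paths:
  /orders/{id}:
    get:
      summary: Fetch one order from the legacy order store
      parameters:
        - name: id
          in: path
          required: true
          schema:
            type: string
      responses:   # deleting this block makes validation fail
        "200":
          description: Order found
          content:
            application/json:
              schema:
                type: object
                required: [id, status]
                properties:
                  id: { type: string }
                  status: { type: string }
        "404":
          description: Unknown order id
```

Validated with `npx swagger-cli@4.0.4 validate contract.yaml`, which exits non-zero when required OpenAPI fields are missing, so CI can gate merges on it.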
Runnable validation: swagger-cli 4.0.4 on Node.js 20 validates this contract and fails fast if required OpenAPI fields are missing.
Common failure mode: if responses are omitted, validation fails with a schema validation error identifying the offending path. In brownfield teams, this turns a spec from informal prose into a machine-checkable contract that can run in CI before implementation merges.
Intent's multi-agent orchestration connects verification back to the same architectural map used during discovery, helping confirm that the implemented delta remained within the intended boundary.
Intent runs parallel agents across your brownfield codebase without losing architectural context.
Free tier available · VS Code extension · Takes 2 minutes
Three Specification Patterns for Brownfield Codebases
Brownfield specification patterns differ from greenfield patterns because they acknowledge the reality IEEE documented decades ago: legacy code itself is often the only reliable documentation. Three patterns address different layers of brownfield complexity.
Pattern 1: Change Specs (Delta Specifications)
Change specs capture only the behavioral delta of the intended modification. Every bug fix, feature addition, and refactoring becomes an opportunity to add specifications for the code being touched. The discipline requirement from InfoQ: every AI-assisted change must update the spec, not just the code. Direct code modifications without spec updates widen the specification gap, which resurfaces as non-deterministic AI generation failures in subsequent sessions.
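That discipline can be enforced mechanically in CI. A sketch in Python, assuming hypothetical path conventions (`src/` for code, `specs/` for change specs) that you would adapt to your repository layout:

```python
def spec_gate(changed_files: list[str]) -> bool:
    """Pass only when a code change is accompanied by a spec change.

    Path conventions (src/, specs/) are hypothetical; adapt to your repo.
    """
    code_changed = any(f.startswith("src/") for f in changed_files)
    spec_changed = any(f.startswith("specs/") for f in changed_files)
    return spec_changed or not code_changed

# In CI this list would come from `git diff --name-only origin/main...HEAD`.
print(spec_gate(["src/billing/invoice.py"]))                      # False: spec gap widens
print(spec_gate(["src/billing/invoice.py", "specs/billing.md"]))  # True
print(spec_gate(["docs/README.md"]))                              # True: no code touched
```

A failing gate does not prove the spec update is meaningful, but it makes silently widening the specification gap impossible.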
Pattern 2: Dependency Boundary Specs (Service Contract Specifications)
Dependency boundary specs formalize implicit contracts at integration points between legacy and modern systems. Required components include machine-readable artifacts such as OpenAPI for REST and Avro or Protobuf for events, plus non-functional concerns such as failure modes, SLOs, and versioning, all tied to a shared vocabulary across teams.
Anti-corruption layers serve as boundary spec implementations. Fowler describes a Backend for Frontend (BFF) as an Anti-Corruption Layer (ACL), creating an isolating layer that maintains the same domain model as the frontend while translating between legacy interfaces and modern systems. GitHub's SAML hardening demonstrates brownfield boundary specification in practice: bootstrap initial schemas from real-world production traffic, A/B test using the Scientist framework, and converge on a minimal schema validated against millions of production requests.
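In code, an anti-corruption layer is often just a translation function at the boundary. A minimal sketch, where the legacy field names, status codes, and `Order` model are all hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Order:
    """The modern domain model the rest of the system works with."""
    order_id: str
    status: str        # "open" | "shipped" | "cancelled"
    total_cents: int

# Hypothetical legacy status codes; a real ACL would bootstrap this table
# from observed production traffic, not from stale documentation.
_LEGACY_STATUS = {"O": "open", "S": "shipped", "X": "cancelled"}

def from_legacy(record: dict) -> Order:
    """Anti-corruption layer: translate a legacy record at the boundary so
    legacy naming and encodings never leak into the modern model."""
    return Order(
        order_id=str(record["ORD_NO"]),
        status=_LEGACY_STATUS[record["STAT_CD"]],
        total_cents=int(round(float(record["TOT_AMT"]) * 100)),
    )

print(from_legacy({"ORD_NO": 10442, "STAT_CD": "S", "TOT_AMT": "19.99"}))
```

The boundary spec is then the contract for `from_legacy`: the record shapes it must accept and the invariants the resulting `Order` must satisfy.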
Pattern 3: Migration Specs (Incremental Modernization Specifications)
Migration specs define the target state and the incremental steps to reach it from the current state, and are designed to be executed without stopping feature delivery. Three components are required per CircleCI analysis:
- Target state vision: specific enough to validate intermediate steps
- Incremental steps: each is individually deployable and independently valuable
- Integration layer design: the facade that mediates between old and new during transition
Shopify's implementation of the Strangler Fig pattern validates this approach: build a facade, identify independent, extractable modules, migrate incrementally, and continuously monitor. Peer-reviewed research confirms that direct rewrites are rarely feasible in enterprise environments due to risks of functional regression and loss of institutional domain knowledge.
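The integration-layer facade reduces to a routing decision per module. A sketch, with hypothetical module names and handlers standing in for the legacy and extracted systems:

```python
# Strangler fig facade: route to new code once a module is migrated,
# fall back to legacy otherwise. Module names here are hypothetical.
MIGRATED = {"inventory"}   # grows as modules are extracted, one step at a time

def legacy_handler(module: str, request: str) -> str:
    return f"legacy:{module}:{request}"

def modern_handler(module: str, request: str) -> str:
    return f"modern:{module}:{request}"

def facade(module: str, request: str) -> str:
    """Single entry point that mediates between old and new during transition."""
    handler = modern_handler if module in MIGRATED else legacy_handler
    return handler(module, request)

print(facade("inventory", "list"))   # modern:inventory:list
print(facade("billing", "invoice"))  # legacy:billing:invoice
```

Each migration-spec step then amounts to moving one name into `MIGRATED` behind monitoring, which keeps every step independently deployable and reversible.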
| Pattern | Scope | When to Use | Key Discipline |
|---|---|---|---|
| Change spec | Single modification delta | Bug fixes, feature additions, refactoring | Update spec with every AI-assisted change |
| Dependency boundary spec | Integration point contract | Service extraction, monolith decomposition | Validate against production traffic, not docs |
| Migration spec | Multi-phase architectural change | System modernization, database migration | Each step must be independently deployable |
What Does Not Work: Full-Pipeline SDD for Brownfield Changes
AWS Kiro's mandatory three-phase pipeline creates structural friction for brownfield codebases. Kiro's own product team acknowledged this: not everyone starts from requirements, especially when working on existing brownfield apps where the technical architecture is already mapped out.
Three specific limitations make full-pipeline SDD impractical for routine brownfield changes. First, spec generation and full agent hook execution add significant per-task overhead. For a single-line bug fix in a legacy system, triggering the full pipeline is overhead without value. Second, the agent generally starts from scratch and needs to understand the system in each session, meaning every brownfield session begins with the agent relearning the codebase context. Third, AWS's own case study noted that the approach was demonstrated on a small codebase.
Intent addresses these limitations directly. Living specifications persist architectural context across sessions, eliminating the cold-start problem by maintaining a continuously updated codebase model rather than regenerating understanding from scratch on each task. For teams working on incremental legacy changes, this is more practical than forcing every task through a fresh requirements-first pipeline.
Full-pipeline SDD remains valuable for large greenfield features inside brownfield codebases: building a new service, designing a new API surface, or creating a new subsystem. The distinction is between specifying new work, where upfront specification adds value, and modifying existing work, where change-level specs are more appropriate.
How Teams Should Measure Brownfield SDD Effectiveness
DORA has acknowledged the topic but has not published SDD-specific findings. No standardized, empirically validated metrics exist in peer-reviewed literature. Teams adopting brownfield SDD in 2026 are establishing baselines rather than following mature industry standards. Four metrics adapted from adjacent research provide a starting framework.
Drift Rate
InfoQ identifies drift as the natural state that must be continuously governed. Specification drift measures divergence between specs and actual system behavior over time. Operational proxies teams can instrument immediately include schema validation failures per sprint, contract test failures indicating implementation divergence, and spec revision frequency.
Regression Rate
Defect density in specced versus unspecced areas of the codebase provides the clearest signal. The industry standard baseline is 1 defect per 1,000 lines. AI-assisted changes with formal specifications should trend below this baseline. SDD's hypothesis is that formal specifications reduce quality degradation, but rigorous brownfield before-and-after data has not been published.
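The comparison is a simple ratio against the 1-defect-per-1,000-lines baseline. A sketch, with hypothetical defect and line counts standing in for issue-tracker labels and line-count output:

```python
def defects_per_kloc(defects: int, lines: int) -> float:
    """Defect density, compared against the baseline of 1 defect / 1,000 lines."""
    return defects / (lines / 1000)

# Hypothetical counts; in practice, tag defects by whether the touched
# module had a change-level spec, and count lines with a tool like cloc.
specced   = defects_per_kloc(defects=4,  lines=12_000)   # ~0.33 per KLOC
unspecced = defects_per_kloc(defects=19, lines=15_000)   # ~1.27 per KLOC
print(f"specced: {specced:.2f}/KLOC, unspecced: {unspecced:.2f}/KLOC")
```

A persistent gap between the two densities, not either number alone, is the signal that specs are paying for themselves.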
Time-to-Merge vs. Baseline
DORA Lead Time for Changes is the closest proxy. Establish a baseline before SDD adoption, then track quarterly.
| Performance Tier | Lead Time for Changes | Change Failure Rate |
|---|---|---|
| Elite (Top 10%) | < 1 hour | < 2% |
| High (Top 25%) | < 1 day | < 4% |
| Medium (Median) | 1 day to 1 week | 8-16% |
| Low (Bottom 25%) | > 1 month | > 32% |
Source: 2024 DORA benchmarks
Specification Coverage Growth
Measure specifications added per sprint rather than total coverage percentage. Brownfield coverage starts near zero and grows slowly. Meaningful denominators include the percentage of critical-path components with formal specs, the percentage of API endpoints with machine-readable contracts, and the percentage of active-development modules with change-level specs.
Adopt Change-Level Specs Before Your Next Legacy Refactor
The tension at the core of brownfield SDD is scope: comprehensive specifications are impractical at enterprise scale, but unspecified AI-assisted changes introduce compounding drift. Change-level specs resolve this tension by scoping specification effort to the delta of each modification and building coverage organically where it matters most.
The next concrete step is simple: on the next brownfield change, write a change-level spec that defines the current behavior, target behavior, invariants, and scope boundaries before generating code. Measure whether the resulting change merges faster and introduces fewer regressions than the team's baseline.
Intent maps your codebase, writes living specs, and coordinates agents across the full change cycle.
Free tier available · VS Code extension · Takes 2 minutes
Written by

Molisha Shah
GTM and Customer Champion