
What Happens When AI Technical Debt Compounds (And How Spec-Driven Dev Prevents It)

Mar 31, 2026
Paula Hingel

AI-generated technical debt compounds faster than human-authored debt because LLMs embed unstated assumptions at every decision point. Those assumptions are invisible to code review and standard testing and are compounded by the 3-4x increase in code volume AI tools produce. Spec-driven development (SDD) prevents this compounding by making assumptions explicit before code generation begins, turning implicit decisions into enforceable contracts that block drift at the source.

TL;DR

The compounding problem isn't that AI writes bad code. It's that AI writes code that embeds decisions you never saw, at a volume that overwhelms the review process designed to catch them. The real question is whether your current workflow surfaces those decisions before code reaches production, or discovers them six months later during a debugging session that touches fifteen files.

The Velocity Trap That Turns into a Maintenance Crisis

The velocity gains from AI coding tools are real. What's less discussed is the structural reason they don't last: every feature that ships without an explicit contract embeds decisions the AI made silently, and those decisions compound with each subsequent change. The empirical record bears this out. He et al. analyzed 806 open-source repositories that adopted Cursor AI using a difference-in-differences quasi-experimental design (accepted at MSR 2026). The results are stark: roughly a 41% increase in code complexity, a 30% increase in static analysis warnings, and velocity gains that prove transient while the debt persists.

Across teams using AI coding tools, the pattern is familiar. The first month feels electric: features ship in hours instead of days. By month three, every change breaks something unexpected. By month six, the team is slower than before adopting AI tooling, and nobody can explain why.

The 2025 Stack Overflow Developer Survey (n=49,000+) quantifies the frustration: 66% of developers report spending more time fixing "almost-right" AI code, while 45% say debugging AI-generated code is more time-consuming. The code looks correct. Tests pass. The debt is invisible until it is not.

This article breaks down the specific mechanism by which AI debt compounds, why code review structurally fails to catch it, and how spec-driven development with living specs prevents accumulation at the source.

How AI Code Embeds 10-30 Unstated Assumptions Per Feature

Traditional technical debt arises from deliberate shortcuts: a human engineer makes a conscious trade-off and, at a minimum, knows which shortcut was taken. AI-generated debt has a fundamentally different causal structure. Researchers studying self-admitted technical debt in AI-assisted development have proposed the category of GIST debt: debt that arises not from deliberate shortcuts but from uncertainty about the behavior or suitability of AI-generated code.

Every AI-generated feature carries unstated assumptions across multiple categories:

| Assumption Category | Examples | Visibility in Code Review |
| --- | --- | --- |
| Dependency assumptions | Asserts 3 packages needed; execution loads 52 | Not visible in diff |
| Schema assumptions | Expects JSON keys in specific casing/nesting | Visible only if the reviewer knows the contract |
| Error handling assumptions | Silently swallows exceptions or returns defaults | Appears "correct" in isolation |
| Concurrency assumptions | Assumes single-threaded execution | Invisible without architectural context |
| Integration assumptions | Hardcodes response format from upstream service | Visible only if the reviewer knows the API |
| Security assumptions | Omits null pointer checks, input validation | Functionally correct but insecure |

A 300-project study found that 31.7% required manual intervention after execution failures. Only 10.5% of failures were due to missing dependency declarations; the majority (52.6%) stemmed from fundamental code-generation errors: malformed syntax, incorrect file paths, uninitialized variables, and structural issues. The study's concrete examples show the dependency gap vividly: a Python project claiming 3 dependencies typically loaded 37 packages at runtime, and one representative case expanded from 3 claimed dependencies to 52 packages loaded.
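The declared-versus-loaded gap is straightforward to audit mechanically. The sketch below, with wholly illustrative package names, diffs a project's claimed dependencies against what an inspection of loaded modules might reveal after execution:

```python
# Hypothetical dependency audit: diff what a project declares against what
# actually loads at runtime. All package names here are illustrative.
declared = {"requests", "numpy", "pandas"}  # the 3 claimed dependencies
loaded = {  # what an inspection of loaded modules might reveal after a run
    "requests", "numpy", "pandas", "urllib3", "certifi", "idna",
    "charset_normalizer", "dateutil", "pytz", "six",
}

phantom = sorted(loaded - declared)  # runtime dependencies no manifest mentions
print(f"{len(declared)} claimed -> {len(loaded)} loaded")
print(phantom)
```

Every name in `phantom` is an assumption the generated code made about its environment that no reviewer saw in the diff.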

In a study of ambiguous prompts, state-of-the-art code LLMs generated code in over 63% of ambiguous scenarios without seeking clarification. At Google's deployment scale, where public statements indicate that AI now generates a significant share of the code ultimately reviewed and accepted by engineers, a substantial fraction of production code may reflect model-supplied assumptions that require careful human scrutiny.

Each assumption is a potential point of divergence. In one recommender-service test, a single AI-generated module contained assumptions about the response format of an upstream API, the caller's threading model, the orchestrator's error-handling strategy, and the serialization format expected by the data pipeline. None of those assumptions was documented. All of them were wrong in at least one deployment environment.

Why Code Review Structurally Fails to Catch AI Debt

Code review was designed for human failure modes: logic errors, inconsistent style, and missed edge cases. AI-generated code fails along different axes, ones that are either invisible in a diff or demand contextual depth that reviewer bandwidth cannot sustain. This is not a marginal performance gap; it is a categorical mismatch between the tool and the failure type.

AI Code Looks Correct by Design

The He et al. paper reports a 41% increase in code complexity after Cursor adoption, an increase that persists even when controlling for project velocity dynamics. One plausible interpretation, though not one the paper itself advances, is that LLMs produce structurally valid but semantically opaque code because training objectives reward passing tests over human readability.

A quantitative study of defects in LLM-generated Java code found a systematic pattern: approximately 90-93% code smells, 5-8% bugs, and about 2% security vulnerabilities. The researchers conclude that "LLM-generated code, even when it passes functional performance tests, is not immediately suitable for production environments."

Review Capacity Cannot Match Generation Velocity

Uber engineering documented this problem directly: "Reviewers are overloaded with the increasing volume of code from AI-assisted code development, and have limited time to identify subtle bugs, security issues, or consistently enforce best practices." Their evaluation of third-party AI code review tools found they suffered from "many false positives, low-value true positives, and being unable to interact with internal systems."

The velocity mismatch is measurable. Practitioner reporting at QCon describes a code reviewer inspecting approximately 500 lines of code per hour, while an agentic service can produce 1,500 lines of code every 10 minutes: an 18x generation-to-review ratio. This is useful as practitioner evidence, but it should be read as operational reporting rather than a primary benchmark.
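The cited ratio is simple arithmetic; normalizing both rates to an hour makes the mismatch explicit:

```python
# Back-of-envelope check of the generation-to-review ratio cited above
review_rate = 500            # LOC a reviewer inspects per hour
generation_rate = 1_500 * 6  # 1,500 LOC every 10 minutes, scaled to an hour
ratio = generation_rate / review_rate
print(ratio)  # 18.0
```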

You Cannot Audit Assumptions You Never Saw

The core problem is structural, not procedural. A reviewer examining a diff sees the generated code. What the reviewer does not see is the set of decisions the AI made silently: why this error handling strategy and not another, why this serialization format, why this threading model. An AI PR study examined 567 agent-assisted pull requests and found that 45.1% required human revisions to align with project-specific standards, reflecting unstated design decisions the AI made without access to architectural patterns or codebase history.

Vendor analysis from OX Security's "Army of Juniors" report (October 2025) identified what they call "phantom bugs" in 20-30% of AI-generated codebases: over-engineered logic for improbable edge cases that degrades performance and wastes resources. While this is vendor research rather than peer-reviewed evidence, it aligns with the broader pattern of hidden-assumption failures described in the academic studies above.

Intent's living specs surface assumptions before code generation as explicit, reviewable contracts.

Build with Intent

Free tier available · VS Code extension · Takes 2 minutes


How Spec-Driven Development Makes Assumptions Explicit

Spec-driven development inverts the documentation relationship: where traditional documentation describes existing code, SDD makes the specification the source of truth, with code as the derived artifact. A Thoughtworks guide characterizes SDD as going beyond vibe coding by separating the design and implementation phases, with requirements formalized into structured Markdown files that require human-in-the-loop validation.

The Four-Layer Prevention Architecture

Several practices work together to help prevent AI debt accumulation:

  • Layer 1: Specification. Explicit, structured, versioned specs define intent before code generation. Written in structured natural language, living in version control, with assumptions surfaced explicitly. Frameworks and approaches, ranging from GitHub's Spec Kit to Red Hat's spec-driven development, implement this pattern.
  • Layer 2: Contract enforcement. Schemas and formal contracts make the spec machine-enforceable. Pydantic and Zod provide runtime enforcement, while OpenAPI provides schema definitions that tools can use for runtime validation. The VibeContract paper demonstrates a three-stage pipeline: decompose prompts into discrete tasks, generate and validate contracts for each task, and guide code generation and testing against those contracts.
  • Layer 3: Traceability. Living specs maintain bidirectional linkage between intent and implementation. When a spec changes, every affected module is identifiable. When code drifts from its spec, the divergence triggers a build failure rather than silently accumulating.
  • Layer 4: Validation. CI/CD pipelines verify spec compliance beyond compilation, ensuring the AI honors the code's intent at every deployment.
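As a minimal illustration of Layer 4, here is a sketch of a CI gate that compares an observed payload against a declared contract. The spec format and field names are hypothetical; a real pipeline would typically use schema tooling such as Pydantic, Zod, or an OpenAPI validator instead of hand-rolled checks:

```python
# Hypothetical Layer 4 gate: a CI step that fails when an implementation's
# observed output diverges from the contract its spec declares.
SPEC = {"fields": {"title": str, "confidence": float, "source": str}}

def check_compliance(payload: dict, spec: dict) -> list[str]:
    """Return a list of drift findings; an empty list means spec-compliant."""
    findings = []
    for name, expected_type in spec["fields"].items():
        if name not in payload:
            findings.append(f"missing field: {name}")
        elif not isinstance(payload[name], expected_type):
            findings.append(f"wrong type for {name}: {type(payload[name]).__name__}")
    return findings

# A drifted payload: confidence came back as a string, and source vanished
findings = check_compliance({"title": "Widget", "confidence": "0.9"}, SPEC)
print(findings)
```

In CI, a non-empty findings list would fail the build, converting silent drift into a visible, attributable error.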

Living Specs vs. Static Documentation

| Dimension | Static Documentation | Living Specification |
| --- | --- | --- |
| Relationship to code | Describes existing code (post hoc) | Constrains and generates code (ante hoc) |
| Update mechanism | Manual, typically deferred | Updated by agents as implementation proceeds |
| Enforcement | Advisory only | Build failures when code drifts from the spec |
| Version control | Often external to the codebase | Lives in version control alongside code |
| AI interaction | Passive reference | Active input to code generation agents |
| Failure mode capture | Not addressed | LessonsLearned.md pattern |

Intent implements this living spec architecture as a coordinated system. A Coordinator Agent analyzes the codebase and drafts the spec. Implementor Agents execute tasks against that spec in parallel. A Verifier Agent checks results against the spec and flags inconsistencies. When requirements change, updates propagate to all active agents. The spec remains accurate because agents read from and write to it, keeping all humans and agents aligned.

This is the critical distinction from post-hoc documentation. In spec-driven development, living specs catch implementation drift during refactors that standard code review may miss: mismatches in response formats, error-handling behavior, and security assumptions surface as spec violations and are corrected before the code is committed.

Schema Enforcement in Practice

Living specs need machine enforcement to prevent drift. Here is how schema validation converts implicit assumptions into enforceable contracts:

```python
# WITHOUT schema enforcement: silent failures
response = openai.chat.completions.create(model="gpt-4o", ...)
data = json.loads(response.choices[0].message.content)  # Breaks on markdown wrapping
score = data["confidence"]  # KeyError if field absent; may be string "0.85", not float

# WITH Pydantic enforcement: assumptions become explicit contracts
from pydantic import BaseModel, Field

class Recommendation(BaseModel):
    title: str = Field(description="Product title")
    confidence: float = Field(ge=0.0, le=1.0, description="Confidence score")
    source: str = Field(description="Recommendation source identifier")

# A validation failure raises a structured error with field location and reason;
# Instructor / Pydantic AI can auto-retry with that error as context
```

The schema makes every assumption visible: confidence is a float between 0 and 1, not a string; source is required, not optional; the response must contain exactly these fields. Each constraint that would have been an unstated assumption becomes an enforceable contract.
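To make that concrete, here is an illustrative use of such a contract with Pydantic v2 (the model is redeclared so the snippet is self-contained; field values are hypothetical). Coercion makes the string-versus-float assumption explicit, and out-of-range values fail loudly instead of propagating:

```python
from pydantic import BaseModel, Field, ValidationError

class Recommendation(BaseModel):
    title: str
    confidence: float = Field(ge=0.0, le=1.0)
    source: str

# A lenient payload: confidence arrives as the string "0.85" and is coerced
rec = Recommendation.model_validate(
    {"title": "Blue Widget", "confidence": "0.85", "source": "ranker-v2"}
)
print(type(rec.confidence).__name__)  # float, not str

# An out-of-range payload is rejected with the field location and reason
try:
    Recommendation.model_validate(
        {"title": "Blue Widget", "confidence": 1.7, "source": "ranker-v2"}
    )
except ValidationError as err:
    print(err.errors()[0]["loc"])  # ('confidence',)
```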

Why AI Debt Compounds Faster Than Human Debt

AI technical debt compounds faster than human-authored debt for a specific, measurable reason: volume. The strongest support in this article comes from He et al.'s study, which links increased code volume, higher complexity, and persistent debt accumulation. Supplementary vendor research from Apiiro points in the same direction: AI-assisted developers produced 3-4x more commits than unassisted peers while generating more security issues. Because this is vendor research, it should be treated as directional evidence rather than a neutral baseline.


The Compounding Loop

The He et al. paper identifies a self-reinforcing feedback loop using panel GMM models: accumulated technical debt subsequently reduces future development velocity, creating a compounding cycle. Here is how that cycle plays out in practice:

Phase 1 (Prototype, no spec): Developers use AI to hack together prompts, outputs, and quick wrappers. Testing is manual, often in notebooks. Assumptions are invisible but harmless because the scope is small.

Phase 2 (Expansion): As functionality broadens, assumptions diverge. One module expects structured JSON; another tolerates free text. Each patch adapts locally, but global coherence erodes. GitClear's longitudinal dataset shows that copy-paste growth rose from 8.3% to 12.3% of all changed lines between 2020 and 2024, while refactoring fell from 25% to less than 10%.

Phase 3 (Integration): Layers of overrides, version mismatches, and silent changes in model outputs create high coupling. Team velocity drops because every change breaks something else. The SlopCodeBench study demonstrates this directly: under repeated editing, agent-generated code deteriorates as each multi-turn edit preserves and extends anti-patterns from prior turns. Pass rates remain stable, while the underlying code becomes increasingly difficult to extend.

Phase 4 (Maintenance): Debugging costs rise as more of the system depends on undocumented assumptions. Practitioner reporting in LeadDev, citing Harness's State of Software Delivery 2025 report, states that 67% of developers spend more time debugging AI-generated code; as summarized practitioner reporting, this is better read as a field signal than as a definitive technical benchmark. OX Security similarly reported that AI coding agents often avoid refactoring and default to tightly coupled, monolithic architectures, another result best treated as supplementary evidence.

Phase 5 (Erosion): The system reaches a "no-change zone" where developers hesitate to make updates because no specs or reliable tests exist. Rewrite cost exceeds maintenance cost. Martin Fowler's design stamina hypothesis predicts this crossover: the cumulative feature delivery rate of a high-quality codebase overtakes that of a low-quality codebase, and some authors argue that AI tools could accelerate the crossover.

Duplication as a Compounding Multiplier

Each duplicated code block becomes an independent divergence point. When one copy requires a schema change, bug fix, or security patch, N copies must be updated rather than one. N grows with every AI-generated commit that introduces duplication rather than abstraction. GitClear's 2025 report found increased code duplication in 2024, describing a trend toward more code cloning and copy-pasted code as a negative side effect of AI-assisted development.


Intent's spec-to-code traceability addresses this coordination problem across parallel agents working on shared codebases.


The Recommender System: A Concrete Compounding Example

Here is how debt compounds in a real system, drawn from a pattern repeated across multiple teams:

  1. Early on, developers prototype a generateRecommendations() AI module returning an array of title strings. Quick, functional, ships in a day.
  2. Months later, prompts evolve. The AI starts returning objects with titles and confidence scores. The front end still assumes an array of strings.
  3. Quick fix: the front-end filters out objects and maps to strings. Works locally but breaks analytics and logging because the downstream data no longer aligns with the assumptions about the payload shape.
  4. Different environments run on "patched" branches, each encoding slightly different expectations about the response format. Schema drift multiplies silently.
  5. Debugging logs and data pipelines collapse. Nobody can trace which assumptions are correct because no canonical contract ever existed.
  6. Transition to spec-driven development: define a canonical schema (a Pydantic model specifying exactly which fields are required, their types, and valid ranges), regenerate adapters under that contract, and re-align end-to-end tests. Debt stabilizes because each module validates against the spec and produces predictable results regardless of prompt changes.
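Step 6 can be sketched as a single canonical contract plus a legacy adapter, so both payload generations normalize to one shape. The field names and the neutral 0.5 confidence assigned to legacy strings are illustrative choices, not the original system's values:

```python
# Hypothetical canonical contract for the recommender example above,
# with an adapter that normalizes both historical payload shapes.
from pydantic import BaseModel, Field

class Recommendation(BaseModel):
    title: str
    confidence: float = Field(ge=0.0, le=1.0)

def normalize(payload: list) -> list[Recommendation]:
    if payload and isinstance(payload[0], str):
        # Legacy shape (step 1): bare title strings; assign a neutral score
        return [Recommendation(title=t, confidence=0.5) for t in payload]
    # Current shape (step 2): objects validated against the contract
    return [Recommendation.model_validate(item) for item in payload]

old = normalize(["Blue Widget", "Red Widget"])
new = normalize([{"title": "Blue Widget", "confidence": 0.92}])
print(old[0].model_dump(), new[0].model_dump())
```

Once every consumer reads `Recommendation` rather than raw payloads, a prompt change that alters the output shape fails at the adapter boundary instead of deep in analytics.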

The difference is not detection. By the time teams detect this kind of drift, the damage is distributed across services, environments, and deployment branches. The difference is prevention: a living spec that makes the expected response format explicit before the first line of AI-generated code is written.

Decision Points: When to Formalize, Refactor, or Retire

Not every AI-generated module needs a full spec from day one. The following decision criteria separate productive prototyping from debt accumulation:

Formalize a spec when: AI-generated code is promoted to production, where multiple engineers will modify it; the system operates in domains with correctness, security, or regulatory requirements; handoff between teams or engineers is anticipated. Domain experts should review the generated code to ensure it is accurate and aligns with existing project conventions and architectural guidelines before merging.

Refactor when the same assumption failure appears in more than one integration point. A single schema mismatch is a local bug. The same mismatch surfacing across three different services signals that no canonical contract exists, and that each local patch encodes a slightly different version of the fix. That divergence is how compounding starts: not from one bad decision, but from three patches that each make slightly incompatible assumptions about the same contract.

Retire legacy AI code when: skew between model versions exceeds test coverage. Once prompt regressions or SDK changes have accumulated enough patches that no engineer can confidently predict the system's behavior, maintaining the code is riskier than rebuilding it under a spec-first workflow.

Contain debt when: the code is a pure throwaway prototype with an explicit and enforced discard commitment, or a single-engineer time-boxed experiment where the output is learning rather than a maintained artifact.

| Approach | Benefits | Risks |
| --- | --- | --- |
| Ad-hoc AI iteration | Fast prototyping, low upfront overhead | Explosive debt growth, inconsistent outputs |
| Strict spec-driven development | Predictable evolution, safer refactoring | Slower experimentation, more upfront work |
| Hybrid (formalize on observed instability) | Flexible evolution with guardrails | Requires schema validation tooling and discipline |

Common Failure Points Specs Prevent

| Failure Point | Mechanism Without Specs | How Specs Prevent It |
| --- | --- | --- |
| Schema drift | LLM output format changes silently; downstream breaks | Versioned schema contracts reject non-conforming outputs |
| Prompt regression | Temperature or instruction changes break pipelines | Spec constraints bound the generation scope explicitly |
| DTO/reactive code mismatch | Glue code silently adapts to output changes | Validation layer rejects unexpected shapes before propagation |
| Test blindness | Sample-based evaluation misses subtle breaking changes | Contract tests derived from spec catch payload-level drift |
| AI regen sprawl | Multiple AIs generate overlapping, unreconciled modules | The canonical spec serves as a single source of truth for all agents |
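The "test blindness" row can be made concrete with a spec-derived contract test, run in CI so payload drift fails the build instead of shipping. The spec model and payloads here are hypothetical:

```python
# Sketch of spec-derived contract tests; names and payloads are illustrative.
from pydantic import BaseModel, Field, ValidationError

class RecommendationSpec(BaseModel):
    title: str
    confidence: float = Field(ge=0.0, le=1.0)
    source: str

def test_conforming_payload_passes():
    # A payload matching the spec validates cleanly
    RecommendationSpec.model_validate(
        {"title": "Blue Widget", "confidence": 0.91, "source": "ranker-v2"}
    )

def test_drifted_payload_fails():
    # A drifted payload (confidence became a label, not a number) must be caught
    try:
        RecommendationSpec.model_validate(
            {"title": "Blue Widget", "confidence": "high", "source": "ranker-v2"}
        )
    except ValidationError:
        return  # drift was caught, as the contract requires
    raise AssertionError("drifted payload slipped past the contract")

test_conforming_payload_passes()
test_drifted_payload_fails()
print("contract tests passed")
```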

Note: Intent is Augment Code's spec-driven development and agent orchestration platform. Its Coordinator Agent generates tasks from a living spec, Implementor Agents execute in isolated git worktrees, and the Verifier Agent checks implementations against the spec before merge. In one internal test of a service with three overlapping AI-generated summarizers, the Verifier caught interface inconsistencies that would have taken days to debug in a post hoc review.

Make Assumptions Explicit Before Your Next AI-Generated Feature Ships

AI tools increase output velocity and implicit decision-making in equal measure. Every feature that ships without a contract is a bet that nobody will need to modify, extend, or debug it under time pressure. That bet pays off in the first month and fails by the sixth. Teams that define intent up front bound the drift before it spreads.

Pick one AI-generated workflow that crosses module boundaries, write the contract it depends on, and enforce that contract in CI.

Intent applies that pattern through living specs and agent verification before code reaches the branch.


