The AI agent 80% problem is the structural gap between the functional code AI agents reliably produce (roughly 80% of a working solution) and the remaining production-grade 20% (error handling, security, observability, compliance), a gap that compounds into unmaintainable technical debt when left unaddressed. Without sufficient oversight, agents also introduce structural problems such as generating more code than necessary and leaving outdated implementations in place.
TL;DR
AI coding agents ship the visible 80% quickly: CRUD operations, standard patterns, and passing tests. The invisible 20% (non-functional requirements, failure modes, and architectural consistency) is systematically omitted because agents lack persistent context and verification. Retrofitting that 20% costs more than building it correctly because comprehension debt, duplication, and cross-cutting inconsistencies prevent single-point fixes.
Engineering teams that have adopted AI coding agents report a consistent pattern. The first PR looks miraculous: a complete feature, generated in minutes, tests passing. By the fifth, something breaks in production that no test caught. A missing retry, an unhandled null, an authentication check that exists on three endpoints but not the fourth. The agent wrote code that works. The agent did not write code that survives. This gap helps explain why teams shipping faster than ever can also accumulate technical debt faster than ever.
Intent's living specs carry non-functional requirements across every agent session.
Free tier available · VS Code extension · Takes 2 minutes
What the 80% Problem Actually Is
Addy Osmani named the problem in his January 2026 analysis of the 80% problem, building on Andrej Karpathy's observation that he had shifted to "80% agent coding and 20% edits+touchups." Osmani's insight is that the remaining 20% is not a minor cleanup task. It represents a distinct category of engineering failure: rate limiting, observability hooks, retry logic with backoff, circuit breakers, audit logging, PII handling, and input sanitization. Those non-functional requirements determine whether code survives contact with production traffic, compliance audits, and real-world failure conditions.
Osmani characterizes the iterative failure mode directly: "You're feeling good after using prompts to generate an MVP, then try throwing two or three more prompts at it. This typically leads to a point where small changes, say, fixing a bug, somehow make things worse."
What agents reliably produce:
- CRUD operations, standard API patterns, and type definitions
- Basic validation and happy-path test coverage
- UI component rendering, state management, and database queries
What agents systematically omit:
- Error handling for failure modes beyond the happy path
- Security applied cross-cuttingly rather than per endpoint
- Observability: structured logging, metrics, distributed tracing
- Edge cases, compliance requirements, and architectural consistency across files
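The first omission above, error handling beyond the happy path, is concrete enough to sketch. The wrapper below shows the retry-with-backoff logic agents rarely add unprompted; all names (`withRetry`, `backoffDelayMs`, `RetryOptions`) are illustrative, not a real library API.

```typescript
// Hypothetical sketch of the retry logic agents tend to omit from generated code.

interface RetryOptions {
  maxAttempts: number;                  // total tries, including the first
  baseMs: number;                       // delay before the first retry
  isRetryable: (err: unknown) => boolean;
}

// Pure helper: exponential backoff. Attempt 1 waits baseMs, attempt 2 waits 2x, etc.
function backoffDelayMs(attempt: number, baseMs: number): number {
  return baseMs * 2 ** (attempt - 1);
}

async function withRetry<T>(
  fn: () => Promise<T>,
  opts: RetryOptions,
  sleep: (ms: number) => Promise<void> = (ms) => new Promise((r) => setTimeout(r, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= opts.maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Give up immediately on permanent failures (validation errors, 4xx responses).
      if (!opts.isRetryable(err) || attempt === opts.maxAttempts) throw err;
      await sleep(backoffDelayMs(attempt, opts.baseMs));
    }
  }
  throw lastError;
}
```

The `sleep` parameter is injectable so the backoff schedule can be tested without real timers, which is exactly the kind of testability concern that lands in the invisible 20%.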
Where the 80% Problem Hits Hardest
The gap between human-written and AI-generated code follows a consistent pattern across quality dimensions:
| Dimension | Human-Written Code | AI-Generated Code |
|---|---|---|
| Error handling | Shaped by production experience; handles known failure modes | Happy-path focused; omits retry logic, circuit breakers, graceful degradation |
| Security posture | Cross-cutting; applied at the architectural level | Per-endpoint; inconsistent across generated files |
| Observability | Structured logging, metrics, and traces integrated during development | Minimal or absent; no correlation IDs, no alerting hooks |
| Code reuse | Abstractions refactored over time | Duplicated patterns across files; GitClear 2025 code quality research shows refactored code fell from ~22% to ~10% |
| Architectural coherence | Informed by ADRs, team conventions, and system context | Context-window-limited; no awareness of decisions in other files |
Those gaps surface differently depending on what the agent is building. The following scenarios show where each dimension tends to fail in practice:
| Scenario | What the Agent Ships (80%) | What's Missing (20%) | Production Consequence |
|---|---|---|---|
| API endpoint generation | Working CRUD routes, request/response types, and basic validation | Rate limiting, auth middleware, input sanitization, audit logging | Endpoint passes tests but fails security review |
| Database migration | Schema changes, basic index creation, migration scripts | Rollback strategy, data backfill, zero-downtime migration | Migration locks the production table; rollback requires manual intervention |
| Authentication flow | Login/logout, token generation, session management | Token refresh edge cases, brute-force protection, audit trail | Auth works in the happy path but fails the pen test |
| Frontend scaffolding | Component rendering, state management, API integration | Accessibility, error boundaries, loading/empty/error states, responsive edge cases | Passes visual QA but fails accessibility audit |
| Data pipeline orchestration | ETL job setup, basic scheduling, data transformation | Backpressure handling, dead letter queues, data validation gates, idempotent processing | Pipeline silently drops records under load |
What Agent-Generated Code Actually Looks Like
A typical agent-generated payment endpoint passes basic tests but omits everything that determines whether it survives production traffic:
What's missing:
- Idempotency key to prevent duplicate charges on retry
- Rate limiting per customer
- Input validation (negative amounts? unsupported currencies? missing fields?)
- Error handling that distinguishes retryable Stripe errors from permanent failures
- Audit logging, structured logging, and authentication middleware
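The first missing piece, the idempotency guard, can be sketched directly. This is a minimal illustration using an in-memory `Map`; in production the store would be Redis or a database table with a unique constraint, and the names (`chargeOnce`, `ChargeResult`) are hypothetical, not a Stripe API.

```typescript
// Hypothetical sketch of the idempotency guard missing from the generated endpoint.

interface ChargeResult {
  chargeId: string;
  amountCents: number;
}

// In-memory stand-in for a durable idempotency store.
const processedCharges = new Map<string, ChargeResult>();

function chargeOnce(
  idempotencyKey: string,
  amountCents: number,
  executeCharge: (amountCents: number) => ChargeResult,
): ChargeResult {
  // A retried request with the same key returns the original result
  // instead of charging the customer a second time.
  const previous = processedCharges.get(idempotencyKey);
  if (previous !== undefined) return previous;

  // Input validation the generated code also skipped.
  if (!Number.isInteger(amountCents) || amountCents <= 0) {
    throw new Error(`invalid amount: ${amountCents}`);
  }

  const result = executeCharge(amountCents);
  processedCharges.set(idempotencyKey, result);
  return result;
}
```

With this guard in place, the week-3 duplicate-charge scenario below becomes a cache hit rather than a second charge.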
Debt cost:
- Week 1: passes all tests.
- Week 3: customer reports duplicate charge after network timeout.
- Week 6: security audit flags missing rate limiting and audit trail across all payment endpoints.
Agent-generated database migrations add columns, create indexes, and backfill data in a single script: functional but dangerous at scale.
What's missing:
- Rollback script to reverse the migration
- Batched UPDATE to avoid locking a large table
- Zero-downtime strategy (adding column as nullable first, backfilling, then adding constraint)
- Data validation after backfill
- Monitoring for lock wait timeouts during migration
Debt cost:
- Week 1: migration runs in 2 seconds on the dev database.
- Week 4: migration locks production table with 50M rows for 8 minutes during deploy. Rollback requires manual intervention because no rollback script exists.
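The batched-UPDATE strategy from the list above can be sketched as follows. This is an illustration, not a real migration framework: `batchRanges` and `runBackfill` are hypothetical names, and `executeBatch` stands in for whatever runs the SQL.

```typescript
// Hypothetical sketch: split a large backfill into bounded id ranges so no
// single UPDATE holds a long lock on the production table.

interface IdRange {
  fromId: number; // inclusive
  toId: number;   // inclusive
}

function batchRanges(minId: number, maxId: number, batchSize: number): IdRange[] {
  const ranges: IdRange[] = [];
  for (let from = minId; from <= maxId; from += batchSize) {
    ranges.push({ fromId: from, toId: Math.min(from + batchSize - 1, maxId) });
  }
  return ranges;
}

// executeBatch would run something like
//   UPDATE orders SET status = 'migrated' WHERE id BETWEEN $1 AND $2
// per range, ideally pausing briefly between batches to let other queries through.
async function runBackfill(
  minId: number,
  maxId: number,
  batchSize: number,
  executeBatch: (range: IdRange) => Promise<void>,
): Promise<number> {
  const ranges = batchRanges(minId, maxId, batchSize);
  for (const range of ranges) {
    await executeBatch(range);
  }
  return ranges.length; // batches executed, useful for progress logging
}
```

On a 50M-row table, a batch size in the tens of thousands keeps each statement's lock window short, which is the difference between the 2-second dev run and the 8-minute production lock described above.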
A typical agent-generated dashboard component fetches metrics on mount and renders a grid: functional but brittle. Missing: error state handling, loading skeleton, data refresh logic, accessibility attributes, ARIA labels, and an authentication check before fetching sensitive metrics.
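The missing loading/empty/error states can be modeled explicitly before any component code is written. The sketch below uses a discriminated union and a small reducer; the names (`FetchState`, `reduceFetch`) are illustrative and framework-agnostic, though the pattern maps directly onto React state.

```typescript
// Hypothetical sketch: model the UI states the generated dashboard skipped,
// so "no data yet", "no data at all", and "fetch failed" all render distinctly.

type FetchState<T> =
  | { kind: "loading" }
  | { kind: "error"; message: string }
  | { kind: "empty" }
  | { kind: "loaded"; data: T };

type FetchEvent<T> =
  | { type: "started" }
  | { type: "failed"; message: string }
  | { type: "succeeded"; data: T[] };

function reduceFetch<T>(_state: FetchState<T[]>, event: FetchEvent<T>): FetchState<T[]> {
  switch (event.type) {
    case "started":
      return { kind: "loading" };
    case "failed":
      // An error state the component can render, instead of a blank grid.
      return { kind: "error", message: event.message };
    case "succeeded":
      // Distinguish "no rows" from "rows", so the UI can show an empty state.
      return event.data.length === 0
        ? { kind: "empty" }
        : { kind: "loaded", data: event.data };
  }
}
```

Because the union is exhaustive, the type checker forces every rendering branch to handle all four states, turning a review-time omission into a compile-time error.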
Environments Where the Gap Is Most Dangerous
| Situation | Why It's Worse | Example |
|---|---|---|
| Regulated industries | Missing compliance controls require architectural remediation, not patches | Healthcare app omits HIPAA audit logging |
| High-scale systems | Race conditions and missing backpressure are invisible until production load | Payment processor duplicates charges |
| Legacy system integration | Agents miss undocumented conventions and implicit contracts | Generated code breaks the unstated rate limit |
| Security-critical flows | Per-endpoint security gaps defeat defense-in-depth | Auth middleware on 3 of 4 endpoints |
| Multi-agent workflows | Each agent's 80%-correct output compounds into lower combined correctness (e.g., three agents at 80% compound to roughly 51%) | Agents disagree on the API contract |
Why the Last 20% Is Worse Than Writing From Scratch
Retrofitting the missing 20% costs more than building it correctly from the start. Four structural reasons explain why the debt compounds rather than resolves:
- Comprehension debt precedes every fix. Before adding error handling, an engineer must reconstruct the intent of code generated without knowledge of the surrounding architecture. The 2025 Stack Overflow developer survey found that roughly 45% of developers report that debugging AI-generated code is more time-consuming than expected.
- Code duplication eliminates single-point remediation. GitClear's 211-million-line longitudinal study found that copy/pasted code rose from 8.3% to 12.3% while refactored code fell from ~22% to ~10%. Every duplicated copy requires individual modification.
- Security gaps require architectural remediation. The Accenture State of Cybersecurity report documents security investment trailing AI initiative spending, leaving teams retrofitting controls after implementation.
- Non-functional requirements are systematically deferred. AI agents optimize for functionality and test passage; verification of security, observability, and compliance requires explicit gates that agents do not impose on themselves.
Each of those failure modes carries a specific retrofitting cost:
| Debt Type | Why Retrofitting Costs More |
|---|---|
| Missing observability | Cannot trace production incidents to the commit that introduced them |
| Absent security controls | Cross-cutting across every layer; modification requires understanding all affected code paths |
| Inconsistent error handling | Subtle differences across duplicated implementations defeat automated refactoring |
| Deferred NFRs | Code requires reconstruction of the original intent before each fix |
How the 20% Gap Compounds Into Named Debt Patterns
| Pattern | What Happens | Why It's Hidden | Timeline |
|---|---|---|---|
| Silent Schema Drift | Agent adds columns without a migration rollback strategy; the schema diverges from the ORM models | ORM models still compile; tests pass against dev data | Weeks 1–2: works fine. Week 3: data corruption on rollback attempt |
| Missing Observability Hook | No structured logging, no correlation IDs, no metrics emission | No errors thrown; absence of data is invisible until an incident | First incident: diagnosis takes far longer than in a traced system |
| Compliance Gap | PII handling, audit logging, and data retention were omitted from the generated code | Functional tests pass; compliance is not tested | Discovered during audit; requires architectural remediation across all affected services |
| Race Condition Time Bomb | No idempotency keys, no distributed locks, no optimistic concurrency | Single-threaded tests never trigger the race | Works in dev/staging; fails under production concurrency |
| Dependency Trap | Agent pins to the latest versions without compatibility verification | Build succeeds until transitive dependency updates | Build breaks when transitive dependency updates; resolution requires understanding the entire dependency graph |
Root Cause 1: Missing Architectural Context
AI coding agents make framework, database, authentication, and deployment choices within a single interaction, at speeds outpacing any review process, with decisions buried in generated code rather than recorded in any architectural artifact.
The CMU Software and Societal Systems study found that 54% of participants indicated code generation tools often fail to meet specified requirements. Context files (CLAUDE.md, .cursorrules, AGENTS.md) attempt to address this gap, but ETH context file research found that LLM-generated context files reduce task success rates by 3% compared to no context file while increasing inference costs by over 20%. Even human-written context files provide only an average 4% improvement.
When using Augment Code's Intent Context Engine, teams working across large codebases maintain architectural awareness across 400,000+ files because the engine builds dependency graphs that update within seconds of code changes, providing structural context that static files cannot deliver.
Root Cause 2: No Verification Loop (Ephemeral Plans)
AI agents generate implementation plans at the start of a session, use those plans to guide code generation, and then never persist or revalidate them as generation proceeds. The plan exists only in the context window. As the window fills and context rot in long agent sessions progresses, the plan degrades or disappears.
Multi-agent planning architecture research describes an architecture in which planning is delegated to a read-only Planner subagent, enforcing separation between planning and execution at the schema level. At session boundaries, context is lost unless the state is explicitly externalized or persisted by the system.
At mabl's 75-repository agent deployment, context drift accounted for approximately 40% of task failures before remediation, dropping to under 5% after implementing per-repository operating manuals that provided ecosystem context.
Root Cause 3: Agent Hallucinations Compound Across Files
Code hallucination drives incorrect functionality through three documented patterns that cascade across files and sessions:
- Hallucinated API contracts propagate to dependent files. Agents generate invalid function signatures, causing downstream code to break. The API Knowledge Conflict study found these constitute 20.41% of observed hallucinations.
- Dependency version hallucinations cascade to build failures. LLMs invent nonexistent packages, blocking the entire dependency graph (LLM version hallucination study).
- Modification tasks amplify the rate. Reasoning across old/new code states with partial context doubles hallucination risk (code modification study).
These compound predictably: ephemeral plans degrade, hallucinated contracts spread, and subsequent sessions normalize earlier errors as design decisions. Verification harnesses outside the agent catch what prompt instructions alone miss.
Catching the Invisible 20% Before Merge
Five techniques address the 20% gap at different points in the development workflow, from specification through CI.
- The 20% Pre-Merge Checklist. Before merging any AI-generated PR, verify: error handling for all failure modes, input validation beyond type checking, observability hooks, security controls, and edge cases. Augment Code's pre-merge verification guide details how to implement these gates in CI.
- Spec-Driven Generation. Adopt a spec-driven development approach that includes NFRs as explicit constraints. A structured specification workflow breaks requirements into smaller pieces before generation. Specs with explicit NFRs produce higher-quality output than plain-text requirements.
- Debt-Aware Prompting. Explicitly instruct agents to include error handling, logging, and security. Explicit prompting helps, but does not solve the problem. A ScienceDirect study on prompt instability demonstrates "the instability of optimizing NFQCs through ad-hoc prompts in practical software engineering settings."
- Automated Debt Detection. Run static analysis, security scanning, and complexity checks on every AI-generated PR. Automated tools can catch surface-level issues, though file-level tools cannot catch cross-service breaking changes.
- The 80/20 Review Ritual. For every AI-generated file, spend 20% of review time on functional code and 80% on what's missing. Weave automated tests into the process and run them after each task so that failures can inform the next prompt.
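The pre-merge checklist in the first technique can be encoded as data so CI can enforce it mechanically. The sketch below is one possible shape, not a real Augment Code API; `ChecklistItem` and `unverifiedItems` are hypothetical names, and the item ids mirror the list above.

```typescript
// Hypothetical sketch: encode the 20% pre-merge checklist as data so a CI step
// can fail the build while any item remains unverified.

interface ChecklistItem {
  id: string;
  verified: boolean;
}

const preMergeChecklist: ChecklistItem[] = [
  { id: "error-handling-all-failure-modes", verified: false },
  { id: "input-validation-beyond-types", verified: false },
  { id: "observability-hooks", verified: false },
  { id: "security-controls", verified: false },
  { id: "edge-cases", verified: false },
];

// Returns the ids that still block the merge; CI fails when this is non-empty.
function unverifiedItems(items: ChecklistItem[]): string[] {
  return items.filter((item) => !item.verified).map((item) => item.id);
}
```

In practice each item's `verified` flag would be set by a scanner, a test suite, or an explicit reviewer sign-off rather than by hand.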
Core Tradeoffs When Mitigating the 80% Problem
| Tradeoff | Tension | Practical Guidance |
|---|---|---|
| Speed vs. completeness | Comprehensive specs slow initial generation | Use spec-driven generation for production services; skip for throwaway prototypes |
| Explicit prompts vs. implicit assumptions | Listing every NFR in prompts hits context window limits | Encode NFRs in persistent specs, not per-prompt instructions |
| Human review depth vs. AI generation volume | Agents generate code faster than humans can review | Focus 80% of review time on the missing 20%, not functional code |
| Automation vs. architectural judgment | Static analysis catches syntax-level issues, not design flaws | Pair automated scanning with architectural review gates |
What Does NOT Work
| Approach | Why It Fails |
|---|---|
| "Just review more carefully." | Review fatigue scales with AI-generated volume |
| Longer prompts with all NFRs | Context window limits drop later requirements; research confirms instability |
| Post-hoc security scanning only | Finds symptoms, not root causes; cross-cutting concerns need architectural remediation |
| Trusting green tests | Agents optimize for test passage; tests may not cover the missing 20% |
Adoption Path
- Audit one recent AI-generated PR that caused a production incident; identify what the agent omitted.
- Build a pre-merge checklist from that audit covering error handling, security, observability, and edge cases.
- Require specs with explicit NFRs before any agent generates code for production services.
- Run automated debt detection, static analysis, and security scanning on every AI-generated PR in CI.
- Measure production incident rates before and after; iterate the checklist quarterly.
Intent's living specs enforce spec-before-code discipline across parallel agents.
Distinguishing from the METR 19% Finding
METR's experienced developer study found that experienced developers took 19% longer (CI: +2% to +39%) with AI tools, despite predicting a 24% speedup. A follow-up review found that 0% of the AI-generated PRs were mergeable as-is.
| Dimension | METR Study | Osmani's 80% Problem |
|---|---|---|
| Measures | Task completion time | Production-readiness of output |
| Methodology | Randomized controlled trial | Practitioner observational analysis |
| Core finding | 19% slowdown (CI: +2% to +39%) | Last 20% requires human expertise |
| Does NOT measure | Code quality, technical debt, mergeability | Task completion speed |
A Cursor longitudinal velocity study found that Cursor adoption produces 3–5x velocity gains in the first month, which dissipate after two months, accompanied by persistent increases in static analysis warnings (30%) and code complexity (41%). Speed and quality are not interchangeable metrics.
How Intent Addresses Each Root Cause
Intent's multi-agent architecture maps directly to the three root causes identified above.
- For missing architectural context: Intent's Coordinator agent analyzes the codebase through Augment Code's Context Engine, which semantically indexes and maps code, understanding relationships across repos and hundreds of thousands of files.
- For the absent verification loop: Intent's Verifier agent checks implementations against a living spec approach that auto-updates as agents complete work and persists across sessions rather than degrading within a context window.
- For hallucination compounding: The Coordinator assigns discrete, non-overlapping tasks to Implementor agents. Specifications are reviewed before implementation begins. The Verifier validates outputs against spec constraints, surfacing issues at verification time rather than in production.
Specify the Invisible 20% Before Your Agents Write Code
The 80% problem persists because teams optimize for generation speed while leaving the invisible 20% (non-functional requirements, failure modes, architectural consistency) unspecified. The evidence is consistent across Osmani's practitioner analysis, METR's mergeability findings, and GitClear's longitudinal code quality data. The practical next step is to make the missing 20% explicit before generation begins, then verify that the implementation still matches it at merge time.
Intent's living specs keep the missing 20% visible across sessions and agent handoffs.
Written by

Ani Galstian
Ani writes about enterprise-scale AI coding tool evaluation, agentic development security, and the operational patterns that make AI agents reliable in production. His guides cover topics like AGENTS.md context files, spec-as-source-of-truth workflows, and how engineering teams should assess AI coding tools across dimensions like auditability and security compliance.
