The 8 levels of AI-assisted development, as defined by Steve Yegge, map a spectrum from zero AI usage through autocomplete and chat assistants to full agent orchestration, with each level representing a distinct shift in developer trust, tooling, and daily workflow.
TL;DR
Most engineering teams operate at Levels 1-3, where AI shows up as autocomplete, chat, or inline edits. Agentic IDEs push some teams to Levels 4-5. Levels 6-8 require a structural shift from single-agent coding to parallel agent orchestration with spec-driven delegation, along with stronger verification, intent articulation, and coordination skills.
Mapping the Spectrum from Autocomplete to Agent Fleets
Steve Yegge's 8-level framework ranges from no AI use to orchestrating multiple agents at once. In a recent conversation with Gergely Orosz, he put AI coding approaches on a spectrum, and in his essay on coding agents he tracks a progression where trust in the agent gradually increases from zero to the point where it takes over the IDE, spills into the CLI, and then multiplies from there.
The framework works best as a diagnostic. A linear 1-to-8 progression implies every team should keep climbing, but climbing has real costs: verification burden, token spend, and the orchestration skills required to manage parallel agents safely. Teams in small, well-factored codebases often get most of the available ROI at Level 3 and pay a tax if they push further. The framework earns its keep for teams that have plateaued and want to understand what structural change is required to move.
Intent, the workspace built for spec-driven agent orchestration, is designed for teams ready to move beyond the single-agent ceiling that caps Levels 1-5.
Explore how Intent coordinates parallel agents through living specs so teams can progress past the single-agent ceiling.
Free tier available · VS Code extension · Takes 2 minutes
Yegge's 8 Levels: The Complete Spectrum
The definitions below draw on Yegge's framework as discussed by The Pragmatic Engineer and covered by O'Reilly.
| Level | Name | Trust State | Where the Developer Works |
|---|---|---|---|
| 1 | No AI | None | IDE, writing all code |
| 2 | IDE Agent, Permissions On | Low | IDE, carefully reviewing |
| 3 | IDE Agent, YOLO Mode | Growing | IDE, less friction |
| 4 | Watching Agent, Not Diffs | Moderate | IDE, conversation-focused |
| 5 | CLI-First, IDE Abandoned | High | Terminal/CLI |
| 6 | Several Agents in Parallel | High | Multiplexing agents |
| 7 | 10+ Agents by Hand | High (frustrated) | Juggling contexts |
| 8 | Custom Orchestrator | Full | Directing agent infrastructure |
The critical architectural break happens at Level 5. The distinction maps directly to Addy Osmani's conductor-to-orchestrator transition: the conductor model gives you one agent working synchronously against your context window, while the orchestrator model gives you multiple agents with their own context windows, working asynchronously while you plan and check in. Crossing that line requires restructuring how work is decomposed, delegated, and verified.
Levels 1-3: Autocomplete, Chat, and Inline Edits
Levels 1 through 3 differ on three axes: who initiates the interaction, where AI output lands, and whether the developer has to manually bridge the gap between suggestion and code. Each level has its own stuck point.
Level 1: Autocomplete
The developer types; the tool watches editing context and surfaces "ghost text" suggestions accepted with Tab or dismissed with Esc. GitHub Copilot offers ghost text completions and next-edit predictions; Tabnine and Amazon Q offer similar inline experiences.
Stuck point: Accept-fatigue. Suggestions arrive on every keystroke, and developers start tab-completing reflexively. Quality regressions go unnoticed until code review, because the feedback loop sits inside the typing rhythm.
Level 2: Chat Assistants
The developer writes a prompt in a side panel, receives a response, and manually copies or applies the output. GitHub Copilot's Ask Mode operates at this level within the IDE; ChatGPT and Claude web chat function at Level 2 without IDE integration or persistent project context.
Stuck point: Context loss at the copy-paste boundary. Every translation from chat window back to editor drops information the model had and introduces transcription errors.
Level 3: Inline Edits
The AI writes directly to the file. The developer selects code, issues a natural language instruction, and the AI modifies the code in place. GitHub Copilot's Edit Mode and Tabnine's inline actions both operate here.
Stuck point: Scope. Inline edits work well for single-function changes and poorly for cross-file refactors where the model cannot see the dependent call sites. Level 3 is often the right ceiling for small codebases; larger codebases hit this limit fast and either move to Level 4 or regress to Level 2.
| Capability | Level 1 | Level 2 | Level 3 |
|---|---|---|---|
| Developer initiates? | No (passive) | Yes (prompt) | Yes (select + instruct) |
| AI writes to file? | No (ghost text) | No (side panel) | Yes (in-file diff) |
| Manual bridging? | Accept/reject only | Copy/paste required | Review diff, accept/reject |
| Typical tools | Copilot, Tabnine, Amazon Q | ChatGPT, Claude chat, Copilot Ask Mode | Copilot Edit Mode, Tabnine Inline Actions |
Surveys from The Pragmatic Engineer place the majority of engineers at Levels 1-2.
Levels 4-5: Agent Mode and Multi-File Changes
The developer stops authoring code character by character and starts directing an agent that can read, write, and run code across multiple files. Trust increases, diff review decreases, and the conversation itself becomes the primary interface.
Level 4: Watching the Agent, Not the Diffs
Developers stop inspecting every diff and start watching what the agent is doing, letting more code through while focusing on the conversation. Attention shifts from asking whether the code is correct toward asking whether the agent is headed in the right direction.
Cursor 3 moved Cursor toward a unified workspace built around agents that can autonomously explore codebases, edit multiple files, run commands, and fix errors. The tradeoff: agent mode degrades on large monorepos where the index cannot fit relevant context, leading to confident edits based on incomplete understanding.
Windsurf Cascade uses Flow Awareness to track developer actions, including edits, commands, and clipboard contents, to infer intent without requiring the developer to restate context. The tradeoff is surveillance surface: teams in regulated industries often disable Flow Awareness features because the same signals that help the agent also expose sensitive data.
GitHub Copilot Agent Mode operates in VS Code as an autonomous peer programmer that responds to compile and lint errors, monitors test output, and auto-corrects in a loop. The tradeoff: the auto-correct loop can burn substantial tokens on wrong-path tasks before a human intervenes, and the cost stays invisible until the bill arrives.
Level 5: CLI-First, IDE Abandoned
The developer has moved out of the IDE as the primary workspace. Yegge's characterization is direct: developers just want the agent and will look at the code in the IDE later.
GitHub Copilot's coding agent exemplifies this level. A developer assigns an issue, Copilot opens a draft pull request, works asynchronously in a GitHub Actions environment, and requests review when complete. CLI tools like Aider operate with git-native atomicity, where every AI edit is automatically committed. The tradeoff with atomic commits is history hygiene: a day of agent work produces dozens of micro-commits that must be squashed before merge.
Some Level 5 workflows extend the loop further into CI/CD. Reports on agents in CI pipelines describe AI agents operating in sandbox environments for pull requests, navigating codebases, running CLI commands, and analyzing syntax trees before supporting human review.
Levels 6-8: Orchestration, Parallel Agents, and Spec-Driven Delegation
Levels 6-8 represent a categorically different mode of development. Andrej Karpathy's Verifiability essay argues that in the new programming paradigm, the tasks most amenable to automation are those where outputs can be verified, which pushes the developer's job toward specifying objectives and checking results rather than writing every line directly.
Level 6: Several Agents in Parallel
Yegge frames the trend around running multiple AI agents in parallel and orchestrating them. Reports from teams inside OpenAI working with Codex describe engineers running several agents simultaneously, typically in the single-digit range per developer. The workflow shifts from sequential task completion to what Osmani calls the factory model: spin up many agents in parallel, where one handles a backend refactor, another implements a feature, and another writes integration tests.
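The factory model can be sketched in a few lines. This is a minimal illustration, not any vendor's SDK: `run_agent` is a hypothetical stand-in for a real coding-agent API call, and the task names are invented.

```python
import asyncio

# Hypothetical stand-in for a real coding-agent API call; the name and
# signature are illustrative, not from any specific vendor SDK.
async def run_agent(task: str) -> str:
    await asyncio.sleep(0.01)  # placeholder for the agent's actual work
    return f"done: {task}"

async def factory(tasks: list[str]) -> list[str]:
    # Level 6 in miniature: each task gets its own agent, all run
    # concurrently, and the developer checks in on the gathered results
    # rather than reviewing each diff as it lands.
    return await asyncio.gather(*(run_agent(t) for t in tasks))

results = asyncio.run(factory([
    "refactor billing backend",
    "implement CSV export",
    "write integration tests",
]))
print(results)
```

The point of the sketch is the shape of the loop: the developer's synchronous work ends at task decomposition, and attention returns only when results are gathered.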
Intent approaches this shift through a structured agent model. A Coordinator Agent analyzes the codebase and delegates to Implementor Agents executing tasks in parallel waves, while a Verifier Agent checks results against the living spec. That role separation addresses the coordination failures documented across multi-agent failure taxonomies, where parallel agents break down without clear role boundaries.
See how Intent's agent model keeps parallel execution coordinated through structured roles.
Free tier available · VS Code extension · Takes 2 minutes
Level 7: 10+ Agents, Managed by Hand
Coordination quickly becomes confusing and error-prone at this scale. Manual management produces a consistent set of failure modes:
- Spec drift across agents. Without a shared living spec, each agent works from the prompt it was given, and the specs diverge silently. Two agents end up implementing incompatible versions of the same interface.
- Duplicated work. Agents assigned adjacent tasks often reimplement utilities their neighbors already wrote, because no shared index tracks what has been completed.
- Merge conflict storms. Ten agents writing to overlapping files in the same branch produce conflicts that take longer to resolve than the original work would have taken to write by hand.
- Review collapse. Human reviewers cannot keep up with ten parallel PR streams. Review becomes rubber-stamping, and defects that would have been caught at Level 5 ship at Level 7.
Microsoft's internal Project Societas produced 110,000+ lines of code, 98% of it AI-generated, with human work shifting from authoring to directing. That scale is unreachable without coordination infrastructure.
Level 8: Build Your Own Orchestrator
Yegge describes this as the point where developers build their own orchestrator. His Gas Town project, a Go-based orchestrator for Claude Code that can manage 20-30 agents in parallel, adds three capabilities on top of raw agent calls: a shared task queue that prevents duplicated work, a coordinator process that assigns tasks based on agent availability, and checkpointing so that a crashed agent can be resumed without losing state. Those are the minimum primitives any Level 8 system needs.
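The three primitives can be made concrete in a toy orchestrator. This is a minimal sketch under our own assumptions, with agent execution stubbed out; it is not Gas Town's design, only an illustration of a shared task queue, a coordinator, and checkpoint-based resumption.

```python
import queue

class Orchestrator:
    """Toy illustration of the three Level 8 primitives.
    A real system would call an agent API instead of stubbing work."""

    def __init__(self, tasks: list[str]):
        self.task_queue = queue.Queue()  # shared queue: no duplicated work
        for t in tasks:
            self.task_queue.put(t)
        self.checkpoints = {}            # task -> last known state

    def assign(self, agent_id: str):
        """Coordinator: hand the next unclaimed task to an available agent."""
        if self.task_queue.empty():
            return None
        task = self.task_queue.get()
        self.checkpoints[task] = {"agent": agent_id, "state": "in_progress"}
        return task

    def checkpoint(self, task: str, state: str):
        self.checkpoints[task]["state"] = state

    def resume(self, task: str):
        """Re-queue a crashed task without losing its recorded state."""
        if self.checkpoints.get(task, {}).get("state") != "done":
            self.task_queue.put(task)

orch = Orchestrator(["refactor-auth", "add-export"])
claimed = orch.assign("agent-1")     # agent-1 claims "refactor-auth"
orch.checkpoint(claimed, "crashed")  # the agent dies mid-task
orch.resume(claimed)                 # crashed work goes back on the queue
```

Everything above the agent call is bookkeeping, which is exactly why Yegge's point holds: the primitives are simple to name but tedious to build and operate reliably.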
Intent provides those primitives as a product and adds resumable workspace sessions so that a team can pause a multi-agent project and pick it back up the next day without re-seeding context. Living specs sit at the center as the single source of truth, auto-updating as agents complete work, which removes the spec drift that derails Level 7.
| Capability | Level 6 | Level 7 | Level 8 |
|---|---|---|---|
| Agent count | 2-5 parallel | 10+ | Fleet-scale |
| Coordination | Ad-hoc multiplexing | Manual (error-prone) | Systematic orchestration |
| Spec management | Informal | Fragmented across agents | Living specs, single source of truth |
| Verification | Per-agent review | Overwhelmed | Automated verification pipeline |
| Tools | Claude Code swarms, Codex parallel | Manual terminal management | Intent, Factory.ai, custom orchestrators |
Osmani's six-step production line defines the orchestration-level workflow: Plan, Spawn, Monitor, Verify, Integrate, and Retro. Verification has become the bottleneck, taking over from generation, and that is the gap Intent's Verifier Agent is designed to close.
Self-Assessment: Where Is Your Team Today?
Score each statement from 0 (never) to 2 (consistently). The score works as a directional indicator: a team with a 12 is somewhere around Level 5, rather than precisely at it.
| # | Statement | Score (0-2) |
|---|---|---|
| 1 | Team members use AI autocomplete or chat daily | |
| 2 | Developers accept AI suggestions without reviewing every line | |
| 3 | AI edits files directly; developers review diffs rather than writing code | |
| 4 | Developers describe goals to agents rather than specifying implementation steps | |
| 5 | At least some developers work primarily in terminal/CLI with agents rather than IDEs | |
| 6 | Developers run 2+ agents simultaneously on different tasks | |
| 7 | The team has built specs, AGENTS.md files, or orchestration tooling to coordinate agents | |
| 8 | Verification infrastructure, including automated tests and trust constraints, governs what agents can commit | |
| 9 | Parallel agent work merges cleanly without frequent conflicts or rework | |
| 10 | The team measures success by decision velocity and system reliability rather than lines of code | |
Score interpretation:
- 0-4: Levels 1-2. Focus on increasing trust through inline edits and edit mode workflows.
- 5-8: Levels 3-4. Experiment with CLI-first agentic tools.
- 9-13: Level 5. Ready for parallel agent workflows.
- 14-17: Levels 6-7. Invest in spec-driven orchestration and verification infrastructure.
- 18-20: Level 8. Focus on governance, observability, and scaling.
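The interpretation table reduces to a small lookup. A sketch, mirroring the bands above; remember the result is directional, not a precise placement:

```python
def interpret_score(total: int) -> str:
    """Map a 0-20 self-assessment total to its directional band."""
    if not 0 <= total <= 20:
        raise ValueError("score must be between 0 and 20")
    bands = [
        (4,  "Levels 1-2: build trust via inline edits and edit mode"),
        (8,  "Levels 3-4: experiment with CLI-first agentic tools"),
        (13, "Level 5: ready for parallel agent workflows"),
        (17, "Levels 6-7: invest in spec-driven orchestration and verification"),
        (20, "Level 8: focus on governance, observability, and scaling"),
    ]
    for upper, label in bands:
        if total <= upper:
            return label
```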
The Skill Shift at Each Level: From Typing to Reviewing to Orchestrating
The verification burden grows at each level, and the skills needed to handle it change with it:
| Dimension | Levels 1-3 | Levels 4-5 | Levels 6-8 |
|---|---|---|---|
| Primary activity | Writing code, reviewing suggestions | Reviewing agent output, approving commands | Decomposing tasks, designing verification systems |
| Core skill | Syntax mastery, prompt engineering | Verification judgment, task framing | Intent articulation, orchestration design |
| Code review task | Symmetric peer review | Asymmetric AI output evaluation | Trust constraint system design |
| Performance metric | Lines of code, PR volume | PR quality, rework rate | Decision velocity, system reliability |
| Time allocation | Majority in construction | 50%+ in evaluation | Primarily async oversight and exception handling |
BCG's analysis of AI workforce impact reinforces the direction: the writing and maintenance of code will be deprioritized, while higher-order systems thinking and proficiency with AI tools grow in importance. Intent's BYOA model supports that shift in practice, letting teams route a planning task to Claude Opus, an implementation task to Codex, and a verification task to a cheaper Haiku-class model inside a single workspace.
How to Progress from Level 5 to Level 6+
ThoughtWorks places "team of coding agents" at Assess stage on its Technology Radar, worth exploring but not yet broadly recommended for production. Gartner predicts 40% of agentic AI projects will be canceled by the end of 2027. The transition requires deliberate preparation.
Phase 1 (Months 1-3): Spec-First Foundation
ThoughtWorks describes spec-driven development as an emerging workflow for AI-assisted and multi-step agentic coding. Spec-writing is the gating skill: decomposing projects into precisely specified, independently verifiable subtasks.
A working spec for agent execution typically includes:
- A goal statement in one sentence describing the user-visible outcome
- A scope boundary listing which files, services, or modules are in and out of scope
- Interface contracts for any function signatures or API shapes the agent must match
- Acceptance tests the agent should be able to run locally before declaring the task complete
- A rollback plan describing how to revert the change if verification fails
A bad spec, by contrast, is a paragraph of prose with no scope boundary and no tests. Agents will accept it, produce plausible code, and fail silently.
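The contrast between an executable spec and a prose-only one can be made mechanical. A minimal sketch, with field names that are illustrative rather than any standard schema:

```python
from dataclasses import dataclass

@dataclass
class AgentSpec:
    """One way to encode the spec fields above; names are illustrative."""
    goal: str                       # one-sentence user-visible outcome
    in_scope: list[str]             # files/modules the agent may touch
    out_of_scope: list[str]         # explicitly off-limits areas
    interface_contracts: list[str]  # signatures/API shapes to match
    acceptance_tests: list[str]     # commands to run before declaring done
    rollback_plan: str              # how to revert if verification fails

    def is_executable(self) -> bool:
        # The gate a prose-only spec fails: no scope, no tests, no rollback.
        return bool(self.in_scope and self.acceptance_tests
                    and self.rollback_plan)

good = AgentSpec(
    goal="Users can export invoices as CSV from the billing page",
    in_scope=["src/export/"],
    out_of_scope=["src/billing/core/"],
    interface_contracts=["export_csv(invoices) -> bytes"],
    acceptance_tests=["pytest tests/test_export.py"],
    rollback_plan="revert the squashed merge commit",
)
bad = AgentSpec(goal="Make exports better", in_scope=[], out_of_scope=[],
                interface_contracts=[], acceptance_tests=[], rollback_plan="")
```

A check like `is_executable` is the cheapest possible lint: it cannot tell a good spec from a mediocre one, but it reliably rejects the paragraph-of-prose failure mode before an agent accepts it.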
From there:
- Establish AGENTS.md, CLAUDE.md, or .cursorrules as team-maintained standards
- Train engineers on the spec structure above before authorizing agent work
- Select a human oversight model explicitly, choosing among human-in-the-loop, human-on-the-loop, or other oversight patterns
- Implement atomic, per-agent git commit discipline
Intent's living specs provide this foundation as a product capability, auto-updating as agents complete work so spec rot never sets in. The Coordinator Agent drafts the spec and generates tasks; developers can stop it at any point to manually edit the spec before agents proceed.
Phase 2 (Months 3-6): Controlled Parallelism
Begin running 2-3 agents in parallel on isolated, well-scoped tasks. A good candidate task meets four tests: it touches a bounded set of files, it has existing test coverage, it does not require coordination with other in-flight work, and it can be reverted with a single git command. Backend refactors, test generation, and documentation updates usually pass all four. Cross-cutting changes like auth refactors or database migrations almost never do.
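The four tests make a simple pre-flight checklist. A sketch with illustrative keys; the file-count threshold is our own assumption, not from the framework:

```python
def is_good_parallel_candidate(task: dict) -> bool:
    """Apply the four candidacy tests above to a task record."""
    return (
        task["files_touched"] <= 10          # bounded file set (threshold is an assumption)
        and task["has_test_coverage"]        # existing tests to verify against
        and not task["needs_coordination"]   # independent of in-flight work
        and task["single_command_revert"]    # one git command undoes it
    )

doc_update = {"files_touched": 3, "has_test_coverage": True,
              "needs_coordination": False, "single_command_revert": True}
auth_refactor = {"files_touched": 40, "has_test_coverage": True,
                 "needs_coordination": True, "single_command_revert": False}
```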
Establish cost monitoring and token budget governance before authorizing parallel agents. O'Reilly's coverage of agentic coding frames the conductor-to-orchestrator shift as a progression: less experienced developers build confidence driving a single agent before taking on parallel coordination, while senior engineers lead the early parallel operations. The pull-back signal is consistent: when more than 20% of parallel agent output requires manual rework, the tasks are poorly scoped and the team should return to sequential execution.
Intent's isolated git worktrees give each agent its own workspace, preventing the merge conflicts and cross-contamination that plague ad hoc parallel agent setups.
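Worktree isolation itself is plain git. A sketch of the command an orchestration layer might issue per agent; the branch and path naming convention here is our own, and in a real setup the returned command would be passed to `subprocess.run` inside the repository:

```python
def worktree_cmd(agent_id: str, base_branch: str = "main") -> list[str]:
    """Build the git command that gives one agent an isolated working
    copy on its own branch (git worktree add -b <branch> <path> <base>)."""
    return ["git", "worktree", "add", "-b", f"agent/{agent_id}",
            f"../worktrees/{agent_id}", base_branch]

cmd = worktree_cmd("implementor-1")
```

Each agent then commits on its own branch in its own directory, so two agents editing adjacent files never race on the same working tree.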
Phase 3 (Months 6+): Orchestration Architecture
Anthropic's engineering team has documented the pattern at scale: a lead agent coordinates the process while delegating to specialized subagents that operate in parallel. Teams at this phase need at least three observability metrics in place before scaling agent count further:
- Spec adherence rate: percentage of agent-produced PRs that match their originating spec without manual correction
- Verification latency: time from agent task completion to human or automated sign-off
- Cost per merged PR: total token spend divided by PRs that actually reach main
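The three metrics fall out of per-PR records most teams already have. A sketch with illustrative field names; adapt the schema to your own tracker:

```python
def orchestration_metrics(prs: list[dict]) -> dict:
    """Compute the three observability metrics from PR records."""
    merged = [p for p in prs if p["merged"]]
    clean = [p for p in prs if not p["needed_manual_fix"]]
    return {
        # share of agent PRs matching their spec without manual correction
        "spec_adherence_rate": len(clean) / len(prs) if prs else 0.0,
        # hours from agent completion to sign-off, averaged over merged PRs
        "avg_verification_latency_h": (
            sum(p["signoff_h"] - p["done_h"] for p in merged) / len(merged)
            if merged else 0.0
        ),
        # total token spend divided by PRs that actually reached main
        "cost_per_merged_pr": (
            sum(p["token_cost"] for p in prs) / len(merged)
            if merged else float("inf")
        ),
    }

sample = [
    {"merged": True,  "needed_manual_fix": False, "token_cost": 120_000,
     "done_h": 10.0, "signoff_h": 12.0},
    {"merged": True,  "needed_manual_fix": True,  "token_cost": 300_000,
     "done_h": 20.0, "signoff_h": 26.0},
    {"merged": False, "needed_manual_fix": True,  "token_cost": 80_000,
     "done_h": 0.0,  "signoff_h": 0.0},
]
metrics = orchestration_metrics(sample)
```

Note that cost per merged PR charges abandoned work to the PRs that shipped, which is deliberate: wasted token spend is part of the real price of each merge.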
Teams should also establish pre-merge verification review processes to address the architectural drift that surfaces as AI adoption scales.
Pitfalls Worth Naming Separately
Two failure modes cut across all three phases. The first is context standardization fragmentation: teams using multiple agent tools maintain parallel context files per tool (CLAUDE.md, .cursorrules, AGENTS.md) without a current interoperability standard, and those files drift apart over time. The second is cost explosion: token usage scales dramatically with parallel agents, and budget governance must precede parallel scaling, never follow it.
Map Your Team's Next Level, Then Build the Prerequisites
The gap between where most teams operate, Levels 1-3, and where the industry is heading, Levels 6-8, is primarily a coordination problem. Tooling is the easier half. Stack Overflow's AI sentiment data shows positive sentiment toward AI tools has declined to 60%, down from above 70% in prior years, which fits a pattern where verification burden grows faster than the skills needed to manage it.
Teams scoring 9-13 on the self-assessment face the most consequential decision: continue optimizing single-agent workflows or begin the structural transition to orchestrated, parallel development. Progression depends on the ability to decompose work into precisely specified, independently verifiable tasks that agents can execute in parallel, which is what a spec-driven workspace is built to support.
With Intent, teams can reach Level 6+ without building orchestration infrastructure from scratch.
Free tier available · VS Code extension · Takes 2 minutes
Written by

Paula Hingel
Paula writes about the patterns that make AI coding agents actually work — spec-driven development, multi-agent orchestration, and the context engineering layer most teams skip. Her guides draw on real build examples and focus on what changes when you move from a single AI assistant to a full agentic codebase.