The 8 levels of AI-assisted development, as defined by Steve Yegge, map a spectrum from zero AI usage through autocomplete and chat assistants to full agent orchestration, with each level representing a distinct shift in developer trust, tooling, and daily workflow.
TL;DR
Most engineering teams operate at Levels 1-3, where AI shows up as autocomplete, chat, or inline edits. Agentic IDEs push some teams to Levels 4-5. Levels 6-8 require a structural shift from single-agent coding to parallel agent orchestration with spec-driven delegation, along with stronger verification, intent articulation, and coordination skills.
Mapping the Spectrum from Autocomplete to Agent Fleets
Steve Yegge's 8-level framework ranges from no AI use to orchestrating multiple agents at once. In a recent conversation with Gergely Orosz, he put AI coding approaches on a spectrum, and in his essay on coding agents he tracks a progression where trust in the agent gradually increases from zero to the point where it takes over the IDE, spills into the CLI, and then multiplies from there.
The framework works best as a diagnostic. A linear 1-to-8 progression implies every team should keep climbing, but climbing has real costs: verification burden, token spend, and the orchestration skills required to manage parallel agents safely. Teams in small, well-factored codebases often get most of the available ROI at Level 3 and pay a tax if they push further. The framework earns its keep for teams that have plateaued and want to understand what structural change is required to move.
Intent, the workspace built for spec-driven agent orchestration, is designed for teams ready to move beyond the single-agent ceiling that caps Levels 1-5.
Explore how Intent coordinates parallel agents through living specs so teams can progress past the single-agent ceiling.
Free tier available · VS Code extension · Takes 2 minutes
Yegge's 8 Levels: The Complete Spectrum
The definitions below draw on Yegge's framework as discussed by The Pragmatic Engineer and covered by O'Reilly.
| Level | Name | Trust State | Where the Developer Works |
|---|---|---|---|
| 1 | No AI | None | IDE, writing all code |
| 2 | IDE Agent, Permissions On | Low | IDE, carefully reviewing |
| 3 | IDE Agent, YOLO Mode | Growing | IDE, less friction |
| 4 | Watching Agent, Not Diffs | Moderate | IDE, conversation-focused |
| 5 | CLI-First, IDE Abandoned | High | Terminal/CLI |
| 6 | Several Agents in Parallel | High | Multiplexing agents |
| 7 | 10+ Agents by Hand | High (frustrated) | Juggling contexts |
| 8 | Custom Orchestrator | Full | Directing agent infrastructure |
The critical architectural break happens at Level 5. The distinction maps directly to Addy Osmani's conductor-to-orchestrator transition: the conductor model gives you one agent working synchronously against your context window, while the orchestrator model gives you multiple agents with their own context windows, working asynchronously while you plan and check in. Crossing that line requires restructuring how work is decomposed, delegated, and verified.
Levels 1-3: Autocomplete, Chat, and Inline Edits
Levels 1 through 3 differ on three axes: who initiates the interaction, where AI output lands, and whether the developer has to manually bridge the gap between suggestion and code. Each level has its own stuck point.
Level 1: Autocomplete
The developer types; the tool watches editing context and surfaces "ghost text" suggestions accepted with Tab or dismissed with Esc. GitHub Copilot offers ghost text completions and next-edit predictions; Tabnine and Amazon Q offer similar inline experiences.
Stuck point: Accept-fatigue. Suggestions arrive on every keystroke, and developers start tab-completing reflexively. Quality regressions go unnoticed until code review, because the feedback loop sits inside the typing rhythm.
Level 2: Chat Assistants
The developer writes a prompt in a side panel, receives a response, and manually copies or applies the output. GitHub Copilot's Ask Mode operates at this level within the IDE; ChatGPT and Claude web chat function at Level 2 without IDE integration or persistent project context.
Stuck point: Context loss at the copy-paste boundary. Every translation from chat window back to editor drops information the model had and introduces transcription errors.
Level 3: Inline Edits
The AI writes directly to the file. The developer selects code, issues a natural language instruction, and the AI modifies the code in place. GitHub Copilot's Edit Mode and Tabnine's inline actions both operate here.
Stuck point: Scope. Inline edits work well for single-function changes and poorly for cross-file refactors where the model cannot see the dependent call sites. Level 3 is often the right ceiling for small codebases; larger codebases hit this limit fast and either move to Level 4 or regress to Level 2.
| Capability | Level 1 | Level 2 | Level 3 |
|---|---|---|---|
| Developer initiates? | No (passive) | Yes (prompt) | Yes (select + instruct) |
| AI writes to file? | No (ghost text) | No (side panel) | Yes (in-file diff) |
| Manual bridging? | Accept/reject only | Copy/paste required | Review diff, accept/reject |
| Typical tools | Copilot, Tabnine, Amazon Q | ChatGPT, Claude chat, Copilot Ask Mode | Copilot Edit Mode, Tabnine Inline Actions |
Surveys from The Pragmatic Engineer place the majority of engineers at Levels 1-2.
Levels 4-5: Agent Mode and Multi-File Changes
The developer stops authoring code character by character and starts directing an agent that can read, write, and run code across multiple files. Trust increases, diff review decreases, and the conversation itself becomes the primary interface.
Level 4: Watching the Agent, Not the Diffs
Developers stop inspecting every diff and start watching what the agent is doing, letting more code through while focusing on the conversation. Attention shifts from asking whether the code is correct toward asking whether the agent is headed in the right direction.
Cursor 3 moved Cursor toward a unified workspace built around agents that can autonomously explore codebases, edit multiple files, run commands, and fix errors. The tradeoff: agent mode degrades on large monorepos where the index cannot fit relevant context, leading to confident edits based on incomplete understanding.
Windsurf Cascade uses Flow Awareness to track developer actions, including edits, commands, and clipboard contents, to infer intent without requiring the developer to restate context. The tradeoff is surveillance surface: teams in regulated industries often disable Flow Awareness features because the same signals that help the agent also expose sensitive data.
GitHub Copilot Agent Mode operates in VS Code as an autonomous peer programmer that responds to compile and lint errors, monitors test output, and auto-corrects in a loop. The tradeoff: the auto-correct loop can burn substantial tokens on wrong-path tasks before a human intervenes, and the cost stays invisible until the bill arrives.
Level 5: CLI-First, IDE Abandoned
The developer has moved out of the IDE as the primary workspace. Yegge's characterization is direct: developers just want the agent and will look at the code in the IDE later.
GitHub Copilot's coding agent exemplifies this level. A developer assigns an issue, Copilot opens a draft pull request, works asynchronously in a GitHub Actions environment, and requests review when complete. CLI tools like Aider operate with git-native atomicity, where every AI edit is automatically committed. The tradeoff with atomic commits is history hygiene: a day of agent work produces dozens of micro-commits that must be squashed before merge.
Some Level 5 workflows extend the loop further into CI/CD. Reports on agents in CI pipelines describe AI agents operating in sandbox environments for pull requests, navigating codebases, running CLI commands, and analyzing syntax trees before supporting human review.
Levels 6-8: Orchestration, Parallel Agents, and Spec-Driven Delegation
Levels 6-8 represent a categorically different mode of development. Andrej Karpathy's Verifiability essay argues that in the new programming paradigm, the tasks most amenable to automation are those where outputs can be verified, which pushes the developer's job toward specifying objectives and checking results rather than writing every line directly.
Level 6: Several Agents in Parallel
Yegge frames the trend around running multiple AI agents in parallel and orchestrating them. Reports from teams inside OpenAI working with Codex describe engineers running several agents simultaneously, typically in the single-digit range per developer. The workflow shifts from sequential task completion to what Osmani calls the factory model: spin up many agents in parallel, where one handles a backend refactor, another implements a feature, and another writes integration tests.
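The factory model can be sketched in a few lines. This is a minimal illustration, not any vendor's SDK: `run_agent` is a hypothetical stand-in for a real coding-agent API call, and the task names are invented.

```python
import asyncio

# Hypothetical stand-in for a real coding-agent API call; the name and
# signature are illustrative, not from any specific vendor SDK.
async def run_agent(task: str) -> str:
    await asyncio.sleep(0.01)  # placeholder for the agent's actual work
    return f"done: {task}"

async def factory(tasks: list[str]) -> list[str]:
    # Level 6 in miniature: each task gets its own agent, all run
    # concurrently, and the developer checks in on the gathered results
    # rather than reviewing each diff as it lands.
    return await asyncio.gather(*(run_agent(t) for t in tasks))

results = asyncio.run(factory([
    "refactor billing backend",
    "implement CSV export",
    "write integration tests",
]))
print(results)
```

The point of the sketch is the shape of the loop: the developer's synchronous work ends at task decomposition, and attention returns only when results are gathered.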
Intent approaches this shift through a structured agent model. A Coordinator Agent analyzes the codebase and delegates to Implementor Agents executing tasks in parallel waves, while a Verifier Agent checks results against the living spec. That role separation addresses the coordination failures documented across multi-agent failure taxonomies, where parallel agents break down without clear role boundaries.
See how Intent's agent model keeps parallel execution coordinated through structured roles.
Free tier available · VS Code extension · Takes 2 minutes
Level 7: 10+ Agents, Managed by Hand
Coordination quickly becomes confusing and error-prone at this scale. Manual management produces a consistent set of failure modes:
- Spec drift across agents. Without a shared living spec, each agent works from the prompt it was given, and the specs diverge silently. Two agents end up implementing incompatible versions of the same interface.
- Duplicated work. Agents assigned adjacent tasks often reimplement utilities their neighbors already wrote, because no shared index tracks what has been completed.
- Merge conflict storms. Ten agents writing to overlapping files in the same branch produce conflicts that take longer to resolve than the original work would have taken to write by hand.
- Review collapse. Human reviewers cannot keep up with ten parallel PR streams. Review becomes rubber-stamping, and defects that would have been caught at Level 5 ship at Level 7.
Microsoft's internal Project Societas produced 110,000+ lines of code, 98% of it AI-generated, with human work shifting from authoring to directing. That scale is unreachable without coordination infrastructure.
Level 8: Build Your Own Orchestrator
Yegge describes this as the point where developers build their own orchestrator. His Gas Town project, a Go-based orchestrator for Claude Code that can manage 20-30 agents in parallel, adds three capabilities on top of raw agent calls: a shared task queue that prevents duplicated work, a coordinator process that assigns tasks based on agent availability, and checkpointing so that a crashed agent can be resumed without losing state. Those are the minimum primitives any Level 8 system needs.
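The three primitives can be made concrete in a toy orchestrator. This is a minimal sketch under our own assumptions, with agent execution stubbed out; it is not Gas Town's design, only an illustration of a shared task queue, a coordinator, and checkpoint-based resumption.

```python
import queue

class Orchestrator:
    """Toy illustration of the three Level 8 primitives.
    A real system would call an agent API instead of stubbing work."""

    def __init__(self, tasks: list[str]):
        self.task_queue = queue.Queue()  # shared queue: no duplicated work
        for t in tasks:
            self.task_queue.put(t)
        self.checkpoints = {}            # task -> last known state

    def assign(self, agent_id: str):
        """Coordinator: hand the next unclaimed task to an available agent."""
        if self.task_queue.empty():
            return None
        task = self.task_queue.get()
        self.checkpoints[task] = {"agent": agent_id, "state": "in_progress"}
        return task

    def checkpoint(self, task: str, state: str):
        self.checkpoints[task]["state"] = state

    def resume(self, task: str):
        """Re-queue a crashed task without losing its recorded state."""
        if self.checkpoints.get(task, {}).get("state") != "done":
            self.task_queue.put(task)

orch = Orchestrator(["refactor-auth", "add-export"])
claimed = orch.assign("agent-1")     # agent-1 claims "refactor-auth"
orch.checkpoint(claimed, "crashed")  # the agent dies mid-task
orch.resume(claimed)                 # crashed work goes back on the queue
```

Everything above the agent call is bookkeeping, which is exactly why Yegge's point holds: the primitives are simple to name but tedious to build and operate reliably.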
Intent provides those primitives as a product and adds resumable workspace sessions so that a team can pause a multi-agent project and pick it back up the next day without re-seeding context. Living specs sit at the center as the single source of truth, auto-updating as agents complete work, which removes the spec drift that derails Level 7.
| Capability | Level 6 | Level 7 | Level 8 |
|---|---|---|---|
| Agent count | 2-5 parallel | 10+ | Fleet-scale |
| Coordination | Ad-hoc multiplexing | Manual (error-prone) | Systematic orchestration |
| Spec management | Informal | Fragmented across agents | Living specs, single source of truth |
| Verification | Per-agent review | Overwhelmed | Automated verification pipeline |
| Tools | Claude Code swarms, Codex parallel | Manual terminal management | Intent, Factory.ai, custom orchestrators |
Osmani's six-step production line defines the orchestration-level workflow: Plan, Spawn, Monitor, Verify, Integrate, and Retro. Verification has become the bottleneck, taking over from generation, and that is the gap Intent's Verifier Agent is designed to close.
Self-Assessment: Where Is Your Team Today?
Score each statement from 0 (never) to 2 (consistently). The score works as a directional indicator: a team with a 12 is somewhere around Level 5, rather than precisely at it.
| # | Statement | Score (0-2) |
|---|---|---|
| 1 | Team members use AI autocomplete or chat daily | |
| 2 | Developers accept AI suggestions without reviewing every line | |
| 3 | AI edits files directly; developers review diffs rather than writing code | |
| 4 | Developers describe goals to agents rather than specifying implementation steps | |
| 5 | At least some developers work primarily in terminal/CLI with agents rather than IDEs | |
| 6 | Developers run 2+ agents simultaneously on different tasks | |
| 7 | The team has built specs, AGENTS.md files, or orchestration tooling to coordinate agents | |
| 8 | Verification infrastructure, including automated tests and trust constraints, governs what agents can commit | |
| 9 | Parallel agent work merges cleanly without frequent conflicts or rework | |
| 10 | The team measures success by decision velocity and system reliability rather than lines of code | |
Score interpretation:
- 0-4: Levels 1-2. Focus on increasing trust through inline edits and edit mode workflows.
- 5-8: Levels 3-4. Experiment with CLI-first agentic tools.
- 9-13: Level 5. Ready for parallel agent workflows.
- 14-17: Levels 6-7. Invest in spec-driven orchestration and verification infrastructure.
- 18-20: Level 8. Focus on governance, observability, and scaling.
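The interpretation table reduces to a small lookup. A sketch, mirroring the bands above; remember the result is directional, not a precise placement:

```python
def interpret_score(total: int) -> str:
    """Map a 0-20 self-assessment total to its directional band."""
    if not 0 <= total <= 20:
        raise ValueError("score must be between 0 and 20")
    bands = [
        (4,  "Levels 1-2: build trust via inline edits and edit mode"),
        (8,  "Levels 3-4: experiment with CLI-first agentic tools"),
        (13, "Level 5: ready for parallel agent workflows"),
        (17, "Levels 6-7: invest in spec-driven orchestration and verification"),
        (20, "Level 8: focus on governance, observability, and scaling"),
    ]
    for upper, label in bands:
        if total <= upper:
            return label
```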
The Skill Shift at Each Level: From Typing to Reviewing to Orchestrating
The verification burden grows at each level, and the skills needed to handle it change with it:
| Dimension | Levels 1-3 | Levels 4-5 | Levels 6-8 |
|---|---|---|---|
| Primary activity | Writing code, reviewing suggestions | Reviewing agent output, approving commands | Decomposing tasks, designing verification systems |
| Core skill | Syntax mastery, prompt engineering | Verification judgment, task framing | Intent articulation, orchestration design |
| Code review task | Symmetric peer review | Asymmetric AI output evaluation | Trust constraint system design |
| Performance metric | Lines of code, PR volume | PR quality, rework rate | Decision velocity, system reliability |
| Time allocation | Majority in construction | 50%+ in evaluation | Primarily async oversight and exception handling |
BCG's analysis of AI workforce impact reinforces the direction: the writing and maintenance of code will be deprioritized, while higher-order systems thinking and proficiency with AI tools grow in importance. Intent's BYOA model supports that shift in practice, letting teams route a planning task to Claude Opus, an implementation task to Codex, and a verification task to a cheaper Haiku-class model inside a single workspace.
How to Progress from Level 5 to Level 6+
ThoughtWorks places "team of coding agents" at Assess stage on its Technology Radar, worth exploring but not yet broadly recommended for production. Gartner predicts 40% of agentic AI projects will be canceled by the end of 2027. The transition requires deliberate preparation.
Phase 1 (Months 1-3): Spec-First Foundation
ThoughtWorks describes spec-driven development as an emerging workflow for AI-assisted and multi-step agentic coding. Spec-writing is the gating skill: decomposing projects into precisely specified, independently verifiable subtasks.
A working spec for agent execution typically includes:
- A goal statement in one sentence describing the user-visible outcome
- A scope boundary listing which files, services, or modules are in and out of scope
- Interface contracts for any function signatures or API shapes the agent must match
- Acceptance tests the agent should be able to run locally before declaring the task complete
- A rollback plan describing how to revert the change if verification fails
A bad spec, by contrast, is a paragraph of prose with no scope boundary and no tests. Agents will accept it, produce plausible code, and fail silently.
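The contrast between an executable spec and a prose-only one can be made mechanical. A minimal sketch, with field names that are illustrative rather than any standard schema:

```python
from dataclasses import dataclass

@dataclass
class AgentSpec:
    """One way to encode the spec fields above; names are illustrative."""
    goal: str                       # one-sentence user-visible outcome
    in_scope: list[str]             # files/modules the agent may touch
    out_of_scope: list[str]         # explicitly off-limits areas
    interface_contracts: list[str]  # signatures/API shapes to match
    acceptance_tests: list[str]     # commands to run before declaring done
    rollback_plan: str              # how to revert if verification fails

    def is_executable(self) -> bool:
        # The gate a prose-only spec fails: no scope, no tests, no rollback.
        return bool(self.in_scope and self.acceptance_tests
                    and self.rollback_plan)

good = AgentSpec(
    goal="Users can export invoices as CSV from the billing page",
    in_scope=["src/export/"],
    out_of_scope=["src/billing/core/"],
    interface_contracts=["export_csv(invoices) -> bytes"],
    acceptance_tests=["pytest tests/test_export.py"],
    rollback_plan="revert the squashed merge commit",
)
bad = AgentSpec(goal="Make exports better", in_scope=[], out_of_scope=[],
                interface_contracts=[], acceptance_tests=[], rollback_plan="")
```

A check like `is_executable` is the cheapest possible lint: it cannot tell a good spec from a mediocre one, but it reliably rejects the paragraph-of-prose failure mode before an agent accepts it.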
From there:
- Establish AGENTS.md, CLAUDE.md, or .cursorrules as team-maintained standards
- Train engineers on the spec structure above before authorizing agent work
- Select a human oversight model explicitly, choosing among human-in-the-loop, human-on-the-loop, or other oversight patterns
- Implement atomic, per-agent git commit discipline
Intent's living specs provide this foundation as a product capability, auto-updating as agents complete work so spec rot never sets in. The Coordinator Agent drafts the spec and generates tasks; developers can stop it at any point to manually edit the spec before agents proceed.
Phase 2 (Months 3-6): Controlled Parallelism
Begin running 2-3 agents in parallel on isolated, well-scoped tasks. A good candidate task meets four tests: it touches a bounded set of files, it has existing test coverage, it does not require coordination with other in-flight work, and it can be reverted with a single git command. Backend refactors, test generation, and documentation updates usually pass all four. Cross-cutting changes like auth refactors or database migrations almost never do.
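The four tests make a simple pre-flight checklist. A sketch with illustrative keys; the file-count threshold is our own assumption, not from the framework:

```python
def is_good_parallel_candidate(task: dict) -> bool:
    """Apply the four candidacy tests above to a task record."""
    return (
        task["files_touched"] <= 10          # bounded file set (threshold is an assumption)
        and task["has_test_coverage"]        # existing tests to verify against
        and not task["needs_coordination"]   # independent of in-flight work
        and task["single_command_revert"]    # one git command undoes it
    )

doc_update = {"files_touched": 3, "has_test_coverage": True,
              "needs_coordination": False, "single_command_revert": True}
auth_refactor = {"files_touched": 40, "has_test_coverage": True,
                 "needs_coordination": True, "single_command_revert": False}
```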
Establish cost monitoring and token budget governance before authorizing parallel agents. O'Reilly's coverage of agentic coding frames the conductor-to-orchestrator shift as a progression: less experienced developers build confidence driving a single agent before taking on parallel coordination, while senior engineers lead the early parallel operations. The pull-back signal is consistent: when more than 20% of parallel agent output requires manual rework, the tasks are poorly scoped and the team should return to sequential execution.
Intent's isolated git worktrees give each agent its own workspace, preventing the merge conflicts and cross-contamination that plague ad hoc parallel agent setups.
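Worktree isolation itself is plain git. A sketch of the command an orchestration layer might issue per agent; the branch and path naming convention here is our own, and in a real setup the returned command would be passed to `subprocess.run` inside the repository:

```python
def worktree_cmd(agent_id: str, base_branch: str = "main") -> list[str]:
    """Build the git command that gives one agent an isolated working
    copy on its own branch (git worktree add -b <branch> <path> <base>)."""
    return ["git", "worktree", "add", "-b", f"agent/{agent_id}",
            f"../worktrees/{agent_id}", base_branch]

cmd = worktree_cmd("implementor-1")
```

Each agent then commits on its own branch in its own directory, so two agents editing adjacent files never race on the same working tree.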
Phase 3 (Months 6+): Orchestration Architecture
Anthropic's engineering team has documented the pattern at scale: a lead agent coordinates the process while delegating to specialized subagents that operate in parallel. Teams at this phase need at least three observability metrics in place before scaling agent count further:
- Spec adherence rate: percentage of agent-produced PRs that match their originating spec without manual correction
- Verification latency: time from agent task completion to human or automated sign-off
- Cost per merged PR: total token spend divided by PRs that actually reach main
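The three metrics fall out of per-PR records most teams already have. A sketch with illustrative field names; adapt the schema to your own tracker:

```python
def orchestration_metrics(prs: list[dict]) -> dict:
    """Compute the three observability metrics from PR records."""
    merged = [p for p in prs if p["merged"]]
    clean = [p for p in prs if not p["needed_manual_fix"]]
    return {
        # share of agent PRs matching their spec without manual correction
        "spec_adherence_rate": len(clean) / len(prs) if prs else 0.0,
        # hours from agent completion to sign-off, averaged over merged PRs
        "avg_verification_latency_h": (
            sum(p["signoff_h"] - p["done_h"] for p in merged) / len(merged)
            if merged else 0.0
        ),
        # total token spend divided by PRs that actually reached main
        "cost_per_merged_pr": (
            sum(p["token_cost"] for p in prs) / len(merged)
            if merged else float("inf")
        ),
    }

sample = [
    {"merged": True,  "needed_manual_fix": False, "token_cost": 120_000,
     "done_h": 10.0, "signoff_h": 12.0},
    {"merged": True,  "needed_manual_fix": True,  "token_cost": 300_000,
     "done_h": 20.0, "signoff_h": 26.0},
    {"merged": False, "needed_manual_fix": True,  "token_cost": 80_000,
     "done_h": 0.0,  "signoff_h": 0.0},
]
metrics = orchestration_metrics(sample)
```

Note that cost per merged PR charges abandoned work to the PRs that shipped, which is deliberate: wasted token spend is part of the real price of each merge.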
Teams should also establish pre-merge verification review processes to address the architectural drift that surfaces as AI adoption scales.
Pitfalls Worth Naming Separately
Two failure modes cut across all three phases. The first is context standardization fragmentation: teams using multiple agent tools maintain parallel context files per tool (CLAUDE.md, .cursorrules, AGENTS.md) without a current interoperability standard, and those files drift apart over time. The second is cost explosion: token usage scales dramatically with parallel agents, and budget governance must precede parallel scaling, never follow it.
Map Your Team's Next Level, Then Build the Prerequisites
The gap between where most teams operate, Levels 1-3, and where the industry is heading, Levels 6-8, is primarily a coordination problem. Tooling is the easier half. Stack Overflow's AI sentiment data shows positive sentiment toward AI tools has declined to 60%, down from above 70% in prior years, which fits a pattern where verification burden grows faster than the skills needed to manage it.
Teams scoring 9-13 on the self-assessment face the most consequential decision: continue optimizing single-agent workflows or begin the structural transition to orchestrated, parallel development. Progression depends on the ability to decompose work into precisely specified, independently verifiable tasks that agents can execute in parallel, which is what a spec-driven workspace is built to support.
With Intent, teams can reach Level 6+ without building orchestration infrastructure from scratch.
Free tier available · VS Code extension · Takes 2 minutes
Written by

Paula Hingel
Paula writes about the patterns that make AI coding agents actually work — spec-driven development, multi-agent orchestration, and the context engineering layer most teams skip. Her guides draw on real build examples and focus on what changes when you move from a single AI assistant to a full agentic codebase.