Agentic engineering maturity is a five-level framework for assessing how systematically engineering organizations adopt AI agents.
TL;DR
Engineering teams often confuse individual AI tool use with organizational agent maturity. Conventional adoption metrics miss the instability that comes with faster shipping. MIT CISR and CMU SEI have published different maturity frameworks, and MIT CISR reports that the greatest financial impact in its enterprise AI maturity framework comes from moving from stage 2 to stage 3.
Why CTOs Need a Maturity Baseline Before Scaling Agents
Engineering leaders face a practical frustration: AI use is spreading faster than organizations can measure, govern, or operationalize it. Developers report productivity gains, but leadership still needs a reliable way to tell whether teams share prompts, version agent instructions, capture audit trails, assign owners, and measure delivery outcomes. DORA 2025 research shows AI adoption can increase both throughput and instability.
The words assistant and agent often get used interchangeably, which makes maturity hard to judge by labels alone. A structured maturity model instead maps agent adoption to observable practices. The five levels below synthesize concepts from MIT CISR, CMU SEI, Microsoft, Gartner, and DORA. Together, they form a self-assessment framework for engineering organizations adopting agentic workflows. Teams evaluating supporting infrastructure often compare categories such as AI coding tools, CI tools, and observability platforms before connecting agents to their pipelines and review process.
The product examples below use Augment Cosmos, a unified cloud agents platform now in public preview that gives agents shared context and memory across the software development lifecycle.
See how Cosmos turns isolated agent runs into governed, replayable team workflows.
Free tier available · VS Code extension · Takes 2 minutes
Level 1: Ad-Hoc (Individual Agents, No Shared Patterns)
Level 1 describes organizations where agent use remains personal instead of institutional. Engineers experiment in isolation, so the organization gets weak reproducibility, weak governance, and little shared learning.
Level 1 operating patterns keep agent use isolated to individual experiments
At this stage, agent activity rarely leaves an auditable path that another engineer or team can reuse. Teams usually show this pattern through experiments, ownership, and review:
- AI experiments run in notebooks with no versioning, no pipeline automation, and no CI/CD integration
- No centralized model registry or feature store exists
- AI initiatives depend on individual champions rather than team-level practices
- No documented evaluation process determines which workflows benefit from AI
- No human-in-the-loop review process governs AI outputs
At Level 1, the first practical need is capture. Augment Cosmos Sessions capture each run as an audit trail that teams can replay, so the organization can reuse workflow knowledge instead of leaving expertise trapped in individual configs.
Level 1 underperforms financially because isolated experimentation does not compound
MIT CISR's 2022 survey of 721 companies found that enterprises in the first two stages of AI maturity had financial performance below industry average, while stages 3 and 4 performed well above. In that research, 28% of enterprises were in the first stage.
| Dimension | Level 1 Signal |
|---|---|
| Agent configuration | Per-engineer, no version control |
| Knowledge sharing | Trapped in individual prompts and configs |
| Governance | No AI-specific policies |
| Measurement | No tracking of AI's effect on delivery metrics |
| Organizational design | No shared understanding of what AI can and cannot do |
Level 2: Standardized (AGENTS.md, Shared Rules)
Level 2 introduces shared defaults that make agent behavior more consistent across a team. Version-controlled guidance gives teams repeatability before they add orchestration and shared learning.
AGENTS.md standardizes agent behavior through persistent repository guidance
A repo-level AGENTS.md file gives AI coding agents persistent, project-specific operational guidance before any work begins. OpenAI's Codex documentation describes the core behavior: Codex reads AGENTS.md files before doing any work. By layering global guidance with project-specific overrides, each task starts with consistent expectations regardless of which repository is opened.
The file's hierarchical discovery system enables layered standardization across global, project, and subdirectory scopes:
| Scope | Location | Audience |
|---|---|---|
| Personal global defaults | ~/.codex/AGENTS.md | Individual engineer, applies across all repos |
| Team standards | AGENTS.md at repo root | Version-controlled, shared across all contributors |
| Module-specific overrides | Subdirectory AGENTS.md files | Service or module teams |
| Directory-level overrides | AGENTS.override.md | Optional higher-priority file; when present in a directory, takes precedence over AGENTS.md for that directory |
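The layering in the table can be sketched in Python. This is an illustration under stated assumptions, not Codex's implementation: the function names `pick_guidance_file` and `resolve_guidance` are hypothetical, and the broadest-to-narrowest ordering simply mirrors the scopes listed above.

```python
from pathlib import Path
from typing import List, Optional

# Within one directory, the override file is checked before the regular file.
CANDIDATE_NAMES = ("AGENTS.override.md", "AGENTS.md")


def pick_guidance_file(directory: Path) -> Optional[Path]:
    """Return at most one guidance file for a directory, preferring the override."""
    for name in CANDIDATE_NAMES:
        candidate = directory / name
        if candidate.is_file():
            return candidate
    return None


def resolve_guidance(global_dir: Path, repo_root: Path, working_dir: Path) -> List[Path]:
    """List guidance files from broadest scope (global) to narrowest (working dir).

    Assumption: later entries are meant to refine or override earlier ones.
    """
    repo_root = repo_root.resolve()
    working_dir = working_dir.resolve()
    # Raises ValueError if working_dir is not inside repo_root.
    relative = working_dir.relative_to(repo_root)

    directories = [global_dir, repo_root]
    current = repo_root
    for part in relative.parts:
        current = current / part
        directories.append(current)

    found = (pick_guidance_file(d) for d in directories)
    return [path for path in found if path is not None]
```

Called with a global directory such as `~/.codex`, a repository root, and a module subdirectory, the function returns the personal defaults first, then the team file, then any module-level file, with `AGENTS.override.md` replacing `AGENTS.md` in a directory where both exist.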
Before AGENTS.md, major coding tools each relied on their own instruction files: GitHub Copilot uses .github/copilot-instructions.md, while Cursor moved from a legacy .cursorrules file to .cursor/rules/ with per-rule .mdc files. This fragmentation forced organizations to maintain the same engineering standards in multiple locations. AGENTS.md addresses that fragmentation by giving agents one shared, version-controlled place to read project guidance.
Advancing beyond Level 2 requires process change that creates coordinated workflows
The next step is to turn shared standards into coordinated workflows and organizational learning.
MIT CISR identifies the Stage 2 to Stage 3 transition as the highest-value transition in enterprise AI maturity. Financial performance crosses from below-average to above-average at this boundary. MIT Sloan research specifies that organizations making this transition intentionally change processes, broadly and deeply, to facilitate organizational learning with AI.
| Dimension | Level 2: Standardized | Level 3: Orchestrated |
|---|---|---|
| Primary mechanism | Version-controlled guidance | Orchestration across identities, triggers, and pipelines |
| Core operating pattern | Shared defaults through AGENTS.md | Coordinated execution across agents, systems, and delivery workflows |
| Coordination model | Consistent repository guidance | Multi-agent coordination through intent |
| Delivery integration | Repeatability that can later support orchestration | CI/CD-integrated execution with cross-system identity and ownership |
| Main organizational outcome | More consistent agent behavior across a team | Teams can see which agent ran, what triggered it, what systems it touched, and who owns the result |
Level 3: Orchestrated (Multi-Agent, Spec-Driven, CI/CD)
Level 3 shifts from shared rules to coordinated execution across agents, systems, and delivery workflows. At this level, teams need traceable agent identities, event triggers, pipeline integration, and ownership paths for non-human execution.
Level 3 orchestration coordinates agents through parallelism, visibility, and intent
Traditional workflow orchestration follows deterministic paths. Agent orchestration involves non-deterministic routing decisions made by the agents themselves, which makes failures harder to trace and test than failures in conventional CI pipelines. Teams comparing implementation paths often evaluate workflow orchestration platforms and autonomous agents when they move beyond isolated experimentation.
Level 3 CI/CD integration becomes the bottleneck across identities, triggers, and controls
CI/CD integration becomes the practical bottleneck at this level because identity, ownership, and triggering all become cross-system concerns. The organization needs non-human execution paths that fit existing build, test, review, and deployment controls.
A specific operational problem at Level 3 is pipeline authentication through individual user accounts. Token lifecycle and ownership become unmanageable at scale: without centralized identity management, running 10 agents across 20 tools can require up to 200 separate OAuth flows (10 × 20), each with its own token to rotate and owner to track.
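The arithmetic behind that figure is worth making explicit. The sketch below uses a hypothetical helper, `oauth_flow_count`, and makes one simplifying assumption that is not from any vendor documentation: with a shared service account, each tool is authorized once rather than once per agent.

```python
def oauth_flow_count(agents: int, tools: int, shared_service_account: bool = False) -> int:
    """Count agent-to-tool authorizations that someone has to own and rotate."""
    if agents < 0 or tools < 0:
        raise ValueError("agents and tools must be non-negative")
    if shared_service_account:
        # Assumption: one non-human identity authorizes each tool a single time.
        return tools if agents > 0 else 0
    # Every agent authenticates to every tool through an individual account.
    return agents * tools


print(oauth_flow_count(10, 20))                               # 200
print(oauth_flow_count(10, 20, shared_service_account=True))  # 20
```

Under that assumption, the number of flows grows with the number of tools only, instead of with agents times tools.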
For teams using Augment Cosmos, Service Accounts address that identity and ownership problem: service-account execution gives non-human runs one ownership model. Connected systems supply triggers from tickets, incident alerts, and PR submissions, and teams do not rewire new agents into the stack because the platform already connects to the build, test, review, and deployment pipeline.
Level 3 governance detects failure patterns before faster delivery increases incidents
Level 3 governance stays close to delivery work.
CMU SEI's AI Adoption Maturity Model, released with Accenture in 2026, defines capability areas that organizations build as they mature, including experimentation, responsible AI, and AI architecting.
DORA's research identifies the delivery risk introduced earlier: AI adoption can raise throughput while increasing instability. DORA tracks the deployment rework rate metric as one of its instability measures. Organizations that track throughput and stability separately can identify emerging delivery risks before they surface as incidents.
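A minimal sketch of tracking throughput and stability as separate numbers follows. The `Deployment` record and its `is_rework` flag are hypothetical; how a team labels a deployment as rework is an assumption left to its own tooling, and the sample data is invented purely for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Deployment:
    # Hypothetical flag: an unplanned deployment made to fix an earlier change.
    is_rework: bool


def throughput_per_day(deployments: List[Deployment], window_days: int) -> float:
    """Throughput signal: deployments per day over the window."""
    if window_days <= 0:
        raise ValueError("window_days must be positive")
    return len(deployments) / window_days


def rework_rate(deployments: List[Deployment]) -> float:
    """Instability signal: share of deployments flagged as rework."""
    if not deployments:
        return 0.0
    return sum(d.is_rework for d in deployments) / len(deployments)


# Illustrative data only: two 10-day windows, before and after wider agent use.
before = [Deployment(is_rework=(i < 1)) for i in range(10)]
after = [Deployment(is_rework=(i < 6)) for i in range(20)]

for label, window in (("before", before), ("after", after)):
    print(label, throughput_per_day(window, 10), rework_rate(window))
```

In this invented example, throughput doubles from 1.0 to 2.0 deployments per day while the rework rate rises from 10% to 30%. A dashboard that reported only the first number would read as a clear win.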
With Augment Cosmos, teams use shared context and reusable records across workflows, so agents draw on previous runs instead of starting from scratch.
See how Cosmos centralizes agent identity, event triggers, and audit trails for orchestrated CI/CD execution.
Level 4: Systematic (Shared Memory, Experts, Learning Flywheel)
Level 4 turns repeated agent activity into accumulated team capability. Corrections, context, and specialization persist across sessions, so each run can draw on prior work.
Shared memory turns repeated agent sessions into accumulated organizational context
Without project memory, every agent session starts cold. Teams must reconstruct institutional knowledge from scratch, including architectural decisions, prior debugging approaches, and team conventions. The CoALA (Cognitive Architectures for Language Agents) paper maps four memory types to agent implementations:
| Memory Type | What Is Stored | Agent Example |
|---|---|---|
| Working | Current goals, intermediate reasoning | Active context window |
| Procedural | Rules determining behavior | AGENTS.md, system prompts |
| Semantic | Facts about the world | Facts about a user or codebase |
| Episodic | Sequences of past behaviors | Past agent actions, prior debugging approaches |
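The four memory types can be modeled as a small data structure. The sketch below is an assumption-laden illustration, not the CoALA paper's design or any product's API: `AgentMemory` and its methods are hypothetical names, and the rule that only working memory is cleared between sessions is a simplification.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class MemoryType(Enum):
    WORKING = "working"        # current goals, intermediate reasoning
    PROCEDURAL = "procedural"  # rules determining behavior
    SEMANTIC = "semantic"      # facts about a user or codebase
    EPISODIC = "episodic"      # sequences of past behaviors


def _empty_entries() -> Dict[MemoryType, List[str]]:
    return {kind: [] for kind in MemoryType}


@dataclass
class AgentMemory:
    entries: Dict[MemoryType, List[str]] = field(default_factory=_empty_entries)

    def remember(self, kind: MemoryType, item: str) -> None:
        self.entries[kind].append(item)

    def end_session(self) -> None:
        # Assumption: working memory does not survive the session.
        self.entries[MemoryType.WORKING].clear()

    def context_for_next_session(self) -> List[str]:
        """Everything a new session can start from instead of starting cold."""
        persistent = (MemoryType.PROCEDURAL, MemoryType.SEMANTIC, MemoryType.EPISODIC)
        return [item for kind in persistent for item in self.entries[kind]]
```

A session that ends with nothing in the procedural, semantic, or episodic lists hands the next session an empty context, which is the cold start described above.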
Augment Cosmos's organizational knowledge layer addresses this memory problem at team scope. Organization-level shared memory persists context across sessions and across the team rather than scoping it per engineer or per repository. That reduces repeated context reconstruction in shared cross-session agent workflows.
Specialized agents create durable domain competence through narrow scope and persistent corrections
Specialization at Level 4 separates general agent access from durable domain competence. Teams narrow scope, preserve corrections, and reuse learned behavior across the team.
The Agentic Software Engineering: Foundational Pillars paper presents the Structured Agentic Software Engineering (SASE) vision and outlines several foundational pillars for the future of software engineering.
Augment Cosmos Experts fit this narrow-scope pattern. Each expert combines narrow task scope, shared memory that persists across runs, and coaching-based feedback that distills important information into stored knowledge.
The learning flywheel compounds agent performance through stored corrections across sessions
The learning flywheel converts one corrected run into a better future run. The outcome is compounding improvement across future sessions.
The learning flywheel follows a four-stage architecture:
- Execute
- Coach
- Distill
- Improve
The sequence turns corrected runs into stored improvements that carry into future sessions.
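The four stages can be sketched as a small loop. This is an illustrative toy under stated assumptions, not how any particular platform implements the flywheel: `LearningFlywheel`, `toy_agent`, and the idea of distilling a correction into a whitespace-normalized string are all hypothetical.

```python
from typing import Callable, List, Optional

# Hypothetical signatures: an agent takes a task plus stored lessons;
# a coach reviews the output and optionally returns a correction.
Agent = Callable[[str, List[str]], str]
Coach = Callable[[str], Optional[str]]


class LearningFlywheel:
    def __init__(self, agent: Agent) -> None:
        self.agent = agent
        self.lessons: List[str] = []  # stored corrections that outlive a run

    def run(self, task: str, coach: Coach) -> str:
        # Execute (and Improve): the agent sees every lesson stored so far.
        output = self.agent(task, list(self.lessons))
        # Coach: a reviewer may return a correction, or None if satisfied.
        correction = coach(output)
        if correction:
            # Distill: normalize whitespace and keep the lesson only once.
            lesson = " ".join(correction.split())
            if lesson and lesson not in self.lessons:
                self.lessons.append(lesson)
        return output


def toy_agent(task: str, lessons: List[str]) -> str:
    """Stand-in for a real agent; it only reports how many lessons it received."""
    return f"{task} (lessons applied: {len(lessons)})"
```

Running the same task twice, with a correction supplied after the first run, yields an output produced with zero lessons and then an output produced with one. The correction persisted between executions instead of disappearing.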
Augment Cosmos's learning flywheel applies that sequence to coaching-based agent improvement. The flywheel distills each corrected session into stored knowledge, so those corrections carry into future runs instead of disappearing between executions.
Level 5: Autonomous (Agents Execute, Humans Steer, Knowledge Compounds)
Level 5 keeps humans responsible for oversight while agents execute more work inside governed boundaries. Human-on-the-loop oversight becomes the operating model for increasingly capable workflows, and governance risk concentrates as agent scope expands.
Level 5 operating models expand agent execution while keeping humans in control
The governance shift at Level 5 moves from human-in-the-loop, where a person approves each action, to human-on-the-loop, where people supervise execution and intervene by exception. Forrester's AEGIS Framework, a security model that defines 39 controls across six domains for securing enterprise AI agents, treats human oversight as a core requirement at this stage.
Augment Cosmos human-in-the-loop controls enforce policy-based approval boundaries at the handoffs where teams require human judgment. Teams set the policies for where human judgment is required, and Cosmos enforces them.
Level 5 knowledge compounding reinforces human and machine learning together
Knowledge compounding means corrected agent behavior, stored context, and human feedback keep accumulating across workflows. At this stage, the execute, coach, distill, improve pattern becomes an operating model in which human and machine learning reinforce each other.
MIT Sloan describes a centralized AI structure in which a global data science and analytics team builds enterprise AI capabilities and works with business units and centers of excellence to scale and operate AI solutions.
Level 5 autonomous execution concentrates governance risk when institutional understanding lags
Level 5 risk concentration comes from autonomous execution outpacing institutional understanding. The failure mode is twofold: bad output, and the loss of the knowledge needed to diagnose and correct it.
Self-Assessment Matrix
The self-assessment matrix turns the five maturity levels into observable operating signals. Use it to compare current agent practices across configuration, knowledge sharing, governance, measurement, and organizational design rather than relying on perception alone.
| Dimension | Level 1 | Level 2 | Level 3 | Level 4 | Level 5 |
|---|---|---|---|---|---|
| Agent config | Per-engineer | Repo-root AGENTS.md | CI/CD-integrated, service accounts | Expert-scoped, memory-backed | Self-improving, org-wide |
| Knowledge sharing | Individual prompts | Shared rules files | Spec-driven coordination | Cross-agent shared memory | Compounding knowledge base |
| Governance | None | Informal policies | Documented AI-specific policies | Named agent owner with accountability | Tiered auto-approval, human pull-in for judgment |
| Measurement | No AI metrics | Ad-hoc tracking | DORA metrics with throughput/stability split | Agent behavior audit trails | Strategic dashboards, risk-aware merge policies |
| Organizational design | Individual champions | Small specialist group | AI director with cross-functional authority | Cross-functional transformation squads | Bifurcated: factory team + operations team |
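One way to apply the matrix is to score each dimension and report the weakest one. The sketch below assumes a "weakest dimension" rule, which is an illustrative choice and not something the cited frameworks prescribe; the `assess` function is hypothetical.

```python
from typing import Dict, List, Tuple

DIMENSIONS = (
    "Agent config",
    "Knowledge sharing",
    "Governance",
    "Measurement",
    "Organizational design",
)


def assess(scores: Dict[str, int]) -> Tuple[int, List[str]]:
    """Return (overall level, dimensions sitting at that level).

    Assumption: overall maturity is capped by the lowest-scoring dimension.
    """
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    for dimension in DIMENSIONS:
        if not 1 <= scores[dimension] <= 5:
            raise ValueError(f"{dimension} must be scored between 1 and 5")
    overall = min(scores[d] for d in DIMENSIONS)
    lagging = [d for d in DIMENSIONS if scores[d] == overall]
    return overall, lagging


example = {
    "Agent config": 3,
    "Knowledge sharing": 2,
    "Governance": 2,
    "Measurement": 1,
    "Organizational design": 2,
}
print(assess(example))  # (1, ['Measurement'])
```

The conservative rule is deliberate: a team with CI/CD-integrated agents but no AI metrics would otherwise report Level 3 while lacking the measurement needed to notice instability.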
Organizations overestimate agentic maturity through definitional confusion and weak measurement
Organizations can overestimate their maturity when definitions blur, perceptions outrun evidence, and teams track shipping speed without tracking instability. Three specific error patterns cause this:
- Agentwashing: Organizations often conflate AI assistants with AI agents, which can lead teams to overestimate their maturity level.
- Perception gap: Self-reported productivity gains can outrun the evidence. Objective delivery metrics, particularly DORA's five metrics including the deployment rework rate, provide a more accurate signal than perception alone.
- Throughput-only tracking: Organizations that measure only shipping speed miss the instability signal described in the DORA research.
Maturity advancement consumes organizational effort in process, governance, data, and integration
Mature agentic engineering practices require more than model selection and prompt engineering. Teams often quantify that operational burden with ROI frameworks before expanding deployment.
DORA 2025 frames the takeaway directly: AI returns depend far more on the strength of a team's delivery system than on the tools it adopts.
How Coordination, Memory, and Governance Map to Levels 3-5
Levels 3-5 are where coordination, memory, and governed execution become infrastructure requirements, culminating in self-improving tooling in which agents extend the platform itself.
| Level | Organizational Requirement | Capability |
|---|---|---|
| 3: Orchestrated | Multi-agent coordination with CI/CD integration | Event bus triggers agents from software development lifecycle events; Service Accounts provide non-human identities; AGENTS.md support standardizes behavior |
| 3: Orchestrated | Spec-driven agent execution | Agents respond to structured specifications rather than step-by-step prompts |
| 4: Systematic | Persistent organizational memory | Organizational knowledge layer shares context across sessions and team members |
| 4: Systematic | Specialized domain agents | Expert Registry with coaching-based feedback; corrections persist across runs |
| 4: Systematic | Continuous improvement loops | Learning flywheel: Execute, Coach, Distill, Improve |
| 5: Autonomous | Tiered approval governance | Auto-approval for low-risk changes; line-by-line correctness analysis for medium risk; human pull-in for judgment calls |
| 5: Autonomous | Self-improving system | Agents work on the platform itself, building automations, specifying experts, and debugging existing workflows |
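The tiered approval row can be read as a routing rule. The sketch below is a hypothetical illustration of that rule, not a product API: `RiskTier`, `Route`, and `route_change` are invented names, how a change gets its risk tier is out of scope, and sending unclassified changes to a human is an assumed fail-safe default.

```python
from enum import Enum
from typing import Optional


class RiskTier(Enum):
    LOW = "low"
    MEDIUM = "medium"
    JUDGMENT = "judgment"  # changes that need a human judgment call


class Route(Enum):
    AUTO_APPROVE = "auto_approve"
    CORRECTNESS_ANALYSIS = "line_by_line_correctness_analysis"
    HUMAN_REVIEW = "human_review"


_ROUTES = {
    RiskTier.LOW: Route.AUTO_APPROVE,
    RiskTier.MEDIUM: Route.CORRECTNESS_ANALYSIS,
    RiskTier.JUDGMENT: Route.HUMAN_REVIEW,
}


def route_change(tier: Optional[RiskTier]) -> Route:
    """Map a risk tier to an approval path.

    Assumption: anything unclassified (None) goes to a human.
    """
    return _ROUTES.get(tier, Route.HUMAN_REVIEW)
```

Keeping the mapping in one table makes the policy reviewable in version control, in the same way the guidance files are at Level 2.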
Benchmark Your Organization, Then Build the System
Leaders need to balance speed with control. Individual developers can move quickly with AI tools, but organizations get durable gains only when they can measure instability, standardize behavior, and turn individual wins into versioned instructions, audit trails, named owners, and measured delivery outcomes.
For engineering organizations at Level 2 or above, the concrete next step is to baseline current workflows against the maturity matrix, then split measurement into throughput and stability. That exposes where AI adoption is creating downstream disorder, which teams need shared standards, and where orchestration or memory infrastructure becomes necessary. Only after that baseline should leaders expand agent scope or automate more of delivery.
See how Cosmos enforces policy-based approvals so agents execute inside the boundaries your team sets.
Written by

Ani Galstian
Ani writes about enterprise-scale AI coding tool evaluation, agentic development security, and the operational patterns that make AI agents reliable in production. His guides cover topics like AGENTS.md context files, spec-as-source-of-truth workflows, and how engineering teams should assess AI coding tools across dimensions like auditability and security compliance.