Frequently Asked Questions

How does this model differ from TMMi?

The agent-native QA maturity model refits TMMi's five-level maturity architecture around agent capability adoption, while TMMi organizes maturity around process areas and goals. It keeps TMMi's lowest-rating assessment logic but replaces maturity-level process areas with compilation validation, live DOM selector checks, outcome metrics, and coordination policy. TMMi assumes deterministic human-authored tests; this model assumes agent-generated tests against AI-rewritten applications.

Why is Stage 5 defined by governance instead of automation percentage?

Stage 5 is defined by governance because multi-agent performance does not scale with agent count. Research shows that agent systems plateau as teams add agents, while unmanaged coordination amplifies errors. Governance keeps agent count from becoming a proxy for maturity.

Can a team skip stages to reach multi-agent orchestration faster?

A team should not assume it can shortcut its way to multi-agent orchestration, because TMMi maturity levels require teams to attain the goals of the relevant process areas. The lowest rating among supporting goals constrains the overall rating. A team running multi-agent orchestration without compilation gates or selector validation operates with bolted-on capability that exposes agent output to unchecked error propagation. Teams must fully achieve each verification gate before they govern the higher stage's capability in production.

What is the single highest-priority verification control?

Compilation and build validation is the highest-priority control because it gates the earliest and most common failure mode. Empirical studies show unresolved symbol errors account for 30.68% of invalid LLM-generated tests, and without a filter, valid pass rates for generated tests can be as low as 24%. These assurance gates make generative test authoring more practical for review.

How do existing AI governance frameworks fit this model?

AI governance frameworks establish organizational risk management, while this maturity model adds testing-specific capability stages. The model turns governance into QA controls: compilation validation, DOM selector checks, outcome metrics, behavioral drift detection, and coordination policy. Teams can apply broader governance frameworks above these gates, while the QA maturity score comes from the lowest unmet verification control.

QA Maturity Model for Agent-Native Teams: 5 Stages

The QA maturity model for agent-native teams is a five-stage framework that maps agent capability adoption to verification controls. Agent-native QA requires controls that check whether AI output is as correct as it appears confident. Classical models assume humans write tests against a stable application. Agent-native QA inverts both assumptions: agents generate the tests, and the application itself shifts under AI-driven changes.

TL;DR

Agent-native QA can face broken automated tests when AI tools rewrite applications while agents generate tests. TMMi assumes human-authored tests against stable code, so process maturity misses verification risk. This five-stage model teaches how to gate agent capability with compilation, DOM, drift, and governance controls.

A QA organization can have strong process maturity while its agent-generated tests remain unproven. The model focuses on that proof: generated tests, generated selectors, and generated root-cause analysis must still match the live system.

TMMi is a testing-specific framework with 16 process areas across a five-level staged architecture. In the TMMi survey, 42% of users had reached Level 3. Developers feel the gap when agent-generated changes outpace proof that failing tests reflect real defects and not stale automation.

The Augment Code Context Engine cuts hallucinations by 40% compared with limited-context tools through semantic dependency analysis, semantic filtering, and dynamic model routing. It analyzes entire codebases across 400,000+ files, giving teams architectural-level understanding before they accept AI output. For the higher stages of the model, Augment Cosmos, the Unified Cloud Agents Platform, runs specialized agents with shared context and memory across the software development lifecycle and gives teams the policy surface needed to govern multi-agent orchestration.

Score readiness with this model through compilation gates, live DOM checks, outcome metrics, drift detection, and coordination policy.

Why Classical TMMi Breaks for Agent-Native Teams

Classical TMMi breaks for agent-native teams because its staged architecture assumes stable human test artifacts measured against planned-release applications. Agent-native engineering produces tests against code changed by AI coding tools.

TMMi advances organizations through Initial, Managed, Defined, Measured, and Optimization levels, and the TMMi reference model requires teams to satisfy all goals at one level before progressing to the next. Its agent-era limitation is the gate it does not define. Every process area assumes a deterministic relationship between test design and test outcome.

That assumption collapses when generated artifacts fail before execution. An empirical study on invalid LLM-generated unit tests traced compile-time causes, with unresolved symbol errors accounting for 30.68% of invalid tests and parameter mismatch errors for 17.25%. Traditional automation is deterministic by design. Agents add verification boundaries teams must manage.

The table below contrasts the assumptions classical TMMi makes against the realities agent-native teams operate under.

Assumption	Classical TMMi	Agent-Native Reality
Test authorship	Human-written, version-controlled	Agent-generated, non-deterministic
Application stability	Changes through planned releases	Rewritten continuously by AI coding tools
Output reliability	Same test, same result	Variable test counts and coverage per run
Primary risk	Insufficient process discipline	Confident-but-wrong AI output
Top-level driver	Continuous process improvement	Verification controls at scale

A maturity model centered on staged process goals does not score how well humans oversee defects in AI-generated code.

The Verification Gap: The Central Design Principle

The verification gap is the distance between how confident AI output appears and how correct it actually is. It organizes agent-native QA maturity because every failure mode in agent-generated code traces back to it.

A functional clustering study on LLM code generation found that without verification, "the error rate of returned answers [is] roughly 65%," which a verifier reduces to 2%.

Verification must catch three failure categories. Compilation failures stop tests before execution. Robustness gaps appear when generated code misses checks. A study on LLM code robustness found that 35.2% of LLM-generated code is less robust than human-written code, with "90% of which are related to missing conditional checks." Package hallucination turns dependency suggestions into a security vector.

Each failure category maps to a specific verification need and a concrete risk if left ungated. The table also lists selector drift and behavioral drift, which Stages 3 and 4 gate.

Failure Category	Verification Need	Risk If Ungated
Compilation failures	Build and compilation validation	Invalid generated tests block execution
Robustness gaps	Conditional-check review	Less robust code reaches release paths
Package hallucination	Dependency verification	Hallucinated imports become security vectors
Selector drift	Live DOM validation	UI tests trigger hallucinated interactions
Behavioral drift	Drift detection	Agent behavior degrades silently

A USENIX Security analysis of package hallucinations across coding models found rates averaging 19.7% across the board, with commercial models at 5.2% and open-source models at 21.7%. A UTSA study on hallucinated packages generated 2.23 million code samples and found 440,445 referenced hallucinated packages. The attack is direct: an adversary creates a malicious package matching a commonly hallucinated name, and any agent that imports it without verification pulls in the exploit.
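The dependency verification described above can be sketched as an import check that runs before generated code is executed or merged. This is a minimal illustration, not a complete defense: the `APPROVED` allowlist and the `fastjsonx` package name are hypothetical, and the check compares import names, which do not always match the package names published on a registry.

```python
import ast


def imported_top_level_modules(source: str) -> set[str]:
    """Return the top-level module names imported by a piece of Python source."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                modules.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # Relative imports (level > 0) refer to local code, so skip them.
            if node.level == 0 and node.module:
                modules.add(node.module.split(".")[0])
    return modules


def unverified_imports(source: str, allowlist: set[str]) -> set[str]:
    """Return imported modules that are not on the reviewed allowlist."""
    return imported_top_level_modules(source) - allowlist


# Hypothetical allowlist of dependencies the team has already reviewed.
APPROVED = {"json", "pathlib", "requests"}

# Hypothetical agent output; "fastjsonx" stands in for a hallucinated package.
generated = """
import json
from pathlib import Path
import fastjsonx.parser
"""

blocked = unverified_imports(generated, APPROVED)
print(blocked)
```

A gate built this way fails closed: any import the team has not reviewed blocks the change until someone confirms the package exists and is the one intended.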

This maps onto TMMi's structure. PA 4.3 (Advanced Reviews) addresses verification at Level 4 in the classical model. The agent-native model promotes an equivalent control to every level, because the verification gap exists from the first agent-generated test forward.

The Five Stages of Agent-Native QA Maturity

The five stages of agent-native QA maturity map TMMi's staged architecture to agent capability adoption. Teams progress from manual testing through governed multi-agent orchestration, and each stage has a verification gate tied to the agent capability in use.

Each stage retains the TMMi assessment principle that maturity equals "the lowest rating of its supporting process areas." The table applies that dependency to agent-native QA.

Stage	Name	Agent Capability	Verification Gate
1	Manual	Ad hoc, no agent adoption	None (no AI output to verify)
2	Assisted	Generative test authoring	Compilation and build validation
3	Self-Healing	Locator adaptation, failure classification	Selector validation against live DOM
4	Orchestrated	Autonomous lifecycle, root cause analysis	Outcome metrics and drift detection
5	Governed	Multi-agent orchestration under policy	Coordination policy and autonomy calibration

Stage 1: Manual Testing Without Agent Adoption

Stage 1 describes teams executing tests manually with no agent involvement. It corresponds to TMMi Level 1 (Initial), where testing is ad hoc and unmanaged. Since no AI output reaches QA, the verification gap does not yet apply. The defining limitation is boundary mismatch: manual QA remains valid only while no agent-generated code reaches the QA boundary.

Once engineers adopt coding agents that handle the AI-native engineering lifecycle, spanning "scoping and prototyping to implementation, testing, review, and even operational triage," manual QA teams face agent-generated change volume before they have agent-native QA capability.

Stage 2: AI-Assisted Test Authoring With Compilation Gates

Stage 2 introduces generative test authoring, where agents convert user stories, requirements, or legacy scripts into executable tests. Teams use a mandatory compilation and build validation gate to control that output. This maps to TMMi Level 2 (Managed), where a fundamental test approach is established and controlled.

Stage 2 capability is measurable at the build boundary. GitHub Copilot Testing for .NET generates tests from project configuration and the chosen test framework. The compilation gate is non-negotiable because Meta's TestGen-LLM deployment found that 75% of test cases built correctly, 57% passed reliably, and 25% increased coverage.

A Stage 2 gate works only when generated tests clear the same build boundary every time. Each element below verifies a different part of that boundary.

Stage 2 Gate Element	What It Verifies	Failure Boundary
Project configuration	Test generation matches the project setup	Generated tests target the wrong structure
Chosen test framework	Agent output uses executable test conventions	Tests cannot run under the framework
Compilation validation	Generated code builds before review	Invalid tests fail before execution
Pass-reliability filtering	Generated tests behave reliably	Unstable recommendations reach production review
Production recommendation review	Teams review accepted tests before use	Generated coverage is trusted too early

Meta's filter-based assurance architecture is a Stage 2 pattern. During Meta's Instagram and Facebook test-a-thons, 73% of TestGen-LLM test improvements were accepted by developers. Compilation and pass-reliability filters make generated tests reviewable before production use. Teams comparing generation approaches can evaluate AI coding assistants against enterprise testing benchmarks before expanding agent-authored coverage.
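The filter sequence can be sketched as two checks applied in order: a build check, then a repeated-run check. The sketch below is an illustration under stated assumptions, not Meta's implementation. The `Candidate` type, the function names, and the five-run default are all hypothetical. Python's built-in `compile` catches only syntax errors, so an unresolved symbol surfaces at the run step here, whereas a compiled language would reject it at the build step.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Candidate:
    name: str    # hypothetical label for a generated test file
    source: str  # generated test code


def builds(candidate: Candidate) -> bool:
    """Build gate: reject source that does not compile to bytecode."""
    try:
        compile(candidate.source, candidate.name, "exec")
        return True
    except (SyntaxError, ValueError):
        return False


def run_in_process(candidate: Candidate) -> bool:
    """Run every test_* function once; any exception counts as a failure.

    Assumption: in-process exec keeps the sketch short. Generated code is
    untrusted, so a real gate would run it in an isolated process.
    """
    namespace: dict = {}
    try:
        exec(compile(candidate.source, candidate.name, "exec"), namespace)
        for name, obj in list(namespace.items()):
            if name.startswith("test_") and callable(obj):
                obj()
        return True
    except Exception:
        return False


def reviewable(
    candidates: list[Candidate],
    run_test: Callable[[Candidate], bool] = run_in_process,
    runs: int = 5,
) -> list[Candidate]:
    """Keep candidates that build and then pass on every one of `runs` runs."""
    kept = []
    for candidate in candidates:
        if builds(candidate) and all(run_test(candidate) for _ in range(runs)):
            kept.append(candidate)
    return kept


candidates = [
    Candidate("good", "def test_add():\n    assert 1 + 1 == 2\n"),
    Candidate("syntax_error", "def test_bad(:\n    pass\n"),
    Candidate("unresolved", "def test_missing():\n    assert undefined_helper() == 1\n"),
]
accepted = [c.name for c in reviewable(candidates)]
print(accepted)
```

Only candidates that survive both filters reach human review, which keeps reviewers from spending time on tests that could never have run.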

Stage 3: Self-Healing Tests With Live Validation

Stage 3 adds self-healing capability, where agents classify failures and adapt selectors. Teams gate that capability by validating selectors against the live DOM. This corresponds to TMMi Level 3 (Defined), where testing integrates into the development lifecycle and organizational standards apply across projects.

Self-healing addresses a specific instability: automated tests can stop matching the live system when application code changes. Stage 3 requires every selector to be checked against the running interface before execution.

Stage 3 validation separates useful locator adaptation from hallucinated UI interaction. The controls below define how each agent behavior is gated.

Stage 3 Control	Agent Behavior	Required Validation
Failure classification	Agent identifies why a test failed	Teams match the failure type to the live system
Locator adaptation	Agent proposes a changed selector	Teams check the selector against the running interface
Batch selector verification	Agent validates selectors before execution	Teams keep live DOM verification in the execution path
Timeout prevention	Agent avoids unreachable UI paths	Teams block hallucinated interactions
Review gate	Team reviews UI-related changes	Teams review selector changes before wider rollout

Stage 3 validation has a measured failure boundary. A multi-agent selector validation study found that teams must validate LLM-generated selectors against the live DOM before execution, because unverified selector inference leads to hallucinated UI interactions and cascading timeouts. The study attributed 108 selector-timeout failures to the Coder agent, which bypassed batch selector verification tools.
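A pre-execution selector check can be sketched as follows. This is a simplified illustration, not the tooling used in the study: it assumes the harness has already captured the running page's HTML as a string, and it handles only `#id` and `[data-testid="..."]` selectors. The function and class names are hypothetical. A production gate would query the browser's live DOM directly.

```python
from html.parser import HTMLParser


class AttributeIndex(HTMLParser):
    """Collects the id and data-testid values present in a DOM snapshot."""

    def __init__(self) -> None:
        super().__init__()
        self.ids: set[str] = set()
        self.test_ids: set[str] = set()

    def handle_starttag(self, tag, attrs):
        for key, value in attrs:
            if value is None:
                continue
            if key == "id":
                self.ids.add(value)
            elif key == "data-testid":
                self.test_ids.add(value)


def missing_selectors(dom_html: str, selectors: list[str]) -> list[str]:
    """Return the selectors that do not match anything in the snapshot."""
    index = AttributeIndex()
    index.feed(dom_html)
    prefix, suffix = '[data-testid="', '"]'
    missing = []
    for selector in selectors:
        if selector.startswith("#"):
            found = selector[1:] in index.ids
        elif selector.startswith(prefix) and selector.endswith(suffix):
            found = selector[len(prefix):-len(suffix)] in index.test_ids
        else:
            # Unsupported selector forms are treated as unverified.
            found = False
        if not found:
            missing.append(selector)
    return missing


# Hypothetical snapshot and agent-proposed selectors.
snapshot = '<form><input id="email"><button data-testid="submit-order">Pay</button></form>'
proposed = ["#email", '[data-testid="submit-order"]', "#checkout-btn"]

unmatched = missing_selectors(snapshot, proposed)
print(unmatched)
```

Running the whole batch before execution means a hallucinated selector such as `#checkout-btn` is rejected up front and does not wait out a timeout inside the test run.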

Teams can apply this gate during selector-change review. Augment's code review provides automated pull request analysis that scored a 59% F-score on a published code review benchmark, compared with 49% for the nearest competitor. Teams setting up CI review gates with GitHub Actions can use selector-change review as a practical Stage 3 gate.

Stage 4: Orchestrated Lifecycle With Outcome Metrics

Stage 4 introduces autonomous test execution, root cause analysis, and risk-based prioritization. Outcome metrics and behavioral drift detection gate this stage. This maps to TMMi Level 4 (Measured), where measurement applies thoroughly across projects and advanced reviews are in use.

At Stage 4, teams measure outcomes and not activity counts. Teams calculate defect escape rate as bugs found after release over total bugs found. This shifts measurement toward released quality and away from test volume. Teams defining engineering velocity metrics for AI-enhanced workflows around this shift can separate agent activity from released defects, resolved failures, and drift signals.
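The defect escape rate calculation can be written as a small function. The function name, the zero-defect convention, and the example counts are assumptions for illustration.

```python
def defect_escape_rate(found_after_release: int, found_before_release: int) -> float:
    """Bugs found after release divided by total bugs found."""
    total = found_after_release + found_before_release
    if total == 0:
        # Assumption: with no recorded bugs, report 0.0 and avoid a division error.
        return 0.0
    return found_after_release / total


# Hypothetical release: 6 bugs escaped to production, 54 were caught earlier.
rate = defect_escape_rate(found_after_release=6, found_before_release=54)
print(f"{rate:.0%}")
```

Because the denominator counts bugs and not tests, generating more tests does not move this number unless those tests catch defects before release.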

Stage 4 metrics must distinguish agent workload from release-quality outcomes. The mapping below shows the activity metric each measurement area should drop and the outcome control that replaces it.

Measurement Area	Activity Metric to Avoid	Outcome Control
Test execution	Test volume	Released quality
Root cause analysis	Raw triage count	Resolution workflow connection
Risk prioritization	Agent activity	Released defects and resolved risks
Drift monitoring	One-time pass rate	Behavioral drift detection
Advanced review	Review presence	Review outcome across projects

Salesforce's TF Triage Agent provides an example of AI-powered test-failure triage in a large-scale testing environment. The example fits Stage 4 because it connects failure classification to resolution workflows, which makes triage outcomes more important than raw agent activity. The drift detection gate matters because agent behavior degrades silently, which the CSA Agentic NIST AI RMF Profile addresses through its AG-MG.2 control for "Behavioral Drift Detection and Remediation."
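One simple way to approximate a drift gate is to compare a recent window of an agent outcome metric against an accepted baseline. The sketch below is an assumption-laden illustration and is not the AG-MG.2 control itself. The metric (the share of agent triage classifications that humans later confirmed), the 0.05 tolerance, and the sample values are all hypothetical.

```python
from statistics import mean


def drift_detected(baseline: list[float], recent: list[float], tolerance: float = 0.05) -> bool:
    """Flag drift when the recent average moves beyond tolerance from the baseline."""
    if not baseline or not recent:
        raise ValueError("both windows need at least one observation")
    return abs(mean(recent) - mean(baseline)) > tolerance


# Hypothetical weekly share of agent triage classifications confirmed by humans.
baseline_weeks = [0.92, 0.94, 0.93]
stable_weeks = [0.93, 0.91, 0.92]
degraded_weeks = [0.85, 0.83, 0.84]

print(drift_detected(baseline_weeks, stable_weeks))
print(drift_detected(baseline_weeks, degraded_weeks))
```

A fixed tolerance on window averages is the crudest possible detector. Teams with noisier metrics would likely need a statistical test, but the gate stays the same: agent behavior is compared against a baseline continuously, not once at rollout.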

Stage 5: Governed Multi-Agent Orchestration

Stage 5 describes teams that run multi-agent orchestration under explicit policy. Coordination policy and autonomy calibration gate this stage. This corresponds to TMMi Level 5 (Optimization), focused on continuous improvement and proactive defect prevention. Stage 5 maturity depends on explicit policy coverage over delegation, escalation, and human review.

Stage 5 governance controls agent count, escalation, and review. Automation volume alone does not establish maturity, so each governance element below maps to a specific risk it contains.

Governance Element	Maturity Signal	Risk Controlled
Autonomy calibration	Agents operate within defined boundaries	Automation percentage becomes a false maturity proxy
Delegation chain monitoring	Agent handoffs are visible	Errors propagate unchecked
Human-in-the-loop policy	Human judgment is required at defined points	Agents make unsupported decisions
Coordination policy	Multi-agent work is centrally controlled	Independent agents amplify errors
Policy-controlled escalation	Agents know when to reach out	Exceptions bypass review

This inverts the intuitive scaling assumption. Research on multi-agent scaling shows that system "performance does not scale linearly with agent count but exhibits a pattern of diminishing returns." A separate agent scaling study reports that homogeneous agent systems plateau at roughly 4 agents and heterogeneous systems at roughly 8, beyond which "adding more agents results in diminishing returns and wasted compute resources."

Unmanaged coordination degrades quality through measured error propagation. A benchmark coordination study found that multi-agent improvement varies widely across task types, ranging from large gains to net regressions, and that "independent agents amplify errors 17.2× through unchecked propagation, while centralized coordination contains this to 4.4×." Governance defines the boundary between unchecked handoffs and centrally controlled escalation.

Stage 5 organizations define maturity through autonomy calibration, delegation chain monitoring, and human-in-the-loop policy enforcement. Agents on a governed platform "know when to reach out," with teams setting "the policies for where human judgment is required."
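A coordination policy of this kind can be expressed as data plus a single decision function. The sketch below is hypothetical and does not describe any platform's API. The `CoordinationPolicy` type, the action names, and the three outcomes are assumptions, and the cap of 4 agents only echoes the plateau figure cited above for homogeneous systems; it is not a recommendation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoordinationPolicy:
    max_agents: int                   # autonomy calibration: cap on concurrent agents
    review_required: frozenset[str]   # actions that need human judgment first


def check_step(policy: CoordinationPolicy, active_agents: int,
               action: str, human_approved: bool) -> str:
    """Return 'allow', 'escalate', or 'deny' for one orchestration step."""
    if active_agents > policy.max_agents:
        return "deny"
    if action in policy.review_required and not human_approved:
        return "escalate"
    return "allow"


# Hypothetical policy: at most 4 agents, humans approve merges and deploys.
policy = CoordinationPolicy(max_agents=4, review_required=frozenset({"merge", "deploy"}))

print(check_step(policy, 3, "run_tests", human_approved=False))
print(check_step(policy, 3, "merge", human_approved=False))
print(check_step(policy, 3, "merge", human_approved=True))
print(check_step(policy, 6, "run_tests", human_approved=False))
```

Keeping the policy in one central function means every agent handoff passes through the same check, which is the difference between centrally coordinated escalation and independent agents propagating errors.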

Augment Cosmos is the Unified Cloud Agents Platform built for this stage. It runs specialized agents with shared context, shared memory, and policy-controlled escalation across the entire software development lifecycle, so platform teams compose Environments (where agents run and what they can touch), Experts (how agents behave and what events they subscribe to), and Sessions (auditable, replayable workflows) without stitching together separate services. Teams applying human-in-the-loop policy enforcement with Cosmos reduce eight software-development interruptions to three checkpoints: prioritization, spec and intent review, and contextual understanding through deep code review. Reference Experts shipped with the platform, including PR Author, Deep Code Review, E2E Testing, and Incident Response, give Stage 5 teams a governance baseline before they add custom agents on top.

Assessing Your Position: An Agent-Native QA Maturity Assessment

An agent-native QA maturity assessment determines your stage by identifying the lowest unmet verification gate. Teams tend to rate themselves by their most sophisticated agent capability, while this model rates them by their weakest verification control. An organization running multi-agent orchestration without selector validation against the live DOM sits at Stage 2 with a Stage 5 capability bolted on.

The assessment questions below map each unmet gate to the maximum stage a team can claim.

Assessment Question	If No, Maximum Stage
Do generated tests pass a compilation and build gate before merge?	Stage 1
Are agent-generated selectors validated against the live DOM before execution?	Stage 2
Do you measure defect escape rate as an outcome, not test count as activity?	Stage 3
Do you detect behavioral drift in agent test output over time?	Stage 3
Is multi-agent coordination governed by explicit human-in-the-loop policy?	Stage 4

Apply the gates in order, starting from the lowest-level control, rather than from the most advanced deployed capability.

1. Confirm that generated tests pass compilation and build validation before merge.
2. Verify that agent-generated selectors are checked against the live DOM before execution.
3. Replace test-count reporting with defect escape rate and other outcome measurements.
4. Detect behavioral drift in agent test output over time.
5. Review whether multi-agent coordination follows explicit human-in-the-loop policy.
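The lowest-unmet-gate rule can be sketched as a lookup over the assessment questions in order. The gate keys and function name below are hypothetical labels; the stage caps come from the assessment table.

```python
# Each entry: (gate key, maximum stage a team can claim if the gate is unmet).
GATES = [
    ("compilation_gate", 1),
    ("live_dom_validation", 2),
    ("outcome_metrics", 3),
    ("drift_detection", 3),
    ("coordination_policy", 4),
]


def assessed_stage(answers: dict[str, bool]) -> int:
    """Return the stage set by the first unmet gate, or 5 if every gate is met."""
    for gate, max_stage_if_unmet in GATES:
        # A missing answer is treated as an unmet gate.
        if not answers.get(gate, False):
            return max_stage_if_unmet
    return 5


# A team running multi-agent orchestration without live DOM selector validation.
team = {
    "compilation_gate": True,
    "live_dom_validation": False,
    "outcome_metrics": True,
    "drift_detection": True,
    "coordination_policy": True,
}
print(assessed_stage(team))
```

The example team is capped at Stage 2 even though every later gate is met, because the function stops at the first unmet control and ignores the most advanced capability deployed.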

This mirrors the TMMi assessment method, where assessors examine "the approach used, how well processes are documented, how well they are put into practice, and how fruitful their application was." Applied to agent-native QA, the assessment asks whether each verification gate is fully achieved, largely achieved, or only partially achieved.

Agent-native lag appears when teams add agents before they add build gates, DOM checks, drift detection, or human-in-the-loop policy. The TMMi survey found 42% of users at Level 3 but only 11% reaching Level 5.

How QA Roles Evolve Across the Maturity Stages

QA roles evolve from test execution toward verification system design and agent orchestration as teams progress through the maturity stages. Each stage shifts human leverage from doing the testing to governing the agents that do it.

The individual contributor role shifts as agent-generated output increases. An arXiv analysis of the evolving QA role maps the QA engineer's emphasis from "writing and executing tests" toward "designing verification systems and oracles for high-throughput agent output." As agents generate test volume, humans design the verification systems that catch confident-but-wrong output.

Role evolution follows the same gate sequence as the maturity model.

Manual execution remains the focus before agent output reaches the QA boundary.
Compilation-gate design becomes necessary when agents generate tests.
Selector-validation review becomes necessary when agents self-heal locators.
Outcome-metric ownership becomes necessary when agents triage and prioritize work.
Coordination-policy ownership becomes necessary when multiple agents work together.

The strategic measure changes alongside the role. Governance-driven stages move QA from pass-rate reporting toward concrete risk checks: escaped defects, selector drift, behavioral drift, and unmanaged agent handoffs. Testing success depends on whether verification gates reduce those risks. Teams running these workflows on Cosmos benefit from sessions that are shared by default and an expert registry that lets a verification pattern built by one engineer compound for the whole organization.

Map Your Verification Gates Before Scaling Agent Count

The agent-native QA maturity model resolves the tension between agent capability and verification controls. Run the assessment table against the current pipeline, identify the lowest unmet gate, and close that gate before adding agent capability above it. Map the gates already in place before scaling agent count: build validation, live DOM checks, outcome metrics, drift detection, and human-in-the-loop policy.

Written by

Molisha Shah