QA Maturity Model for Agent-Native Teams: 5 Stages

Jun 30, 2026
Molisha Shah

The QA maturity model for agent-native teams is a five-stage framework for mapping agent capability adoption to verification controls. Agent-native QA requires controls that compare how confident AI output appears with how correct it is. Classical models assume humans write tests against a stable application. Agent-native QA inverts both assumptions: agents generate the tests, and the application under test keeps shifting as AI coding tools rewrite it.

TL;DR

Agent-generated tests can break when AI coding tools rewrite the application those tests target. TMMi assumes human-authored tests against stable code, so a process-maturity score misses verification risk. This five-stage model shows how to gate agent capability with compilation, DOM, drift, and governance controls.

A QA organization can have strong process maturity while its agent-generated tests remain unproven. The model focuses on that proof: generated tests, generated selectors, and generated root-cause analysis must still match the live system.

TMMi is a testing-specific framework with 16 process areas across a five-level staged architecture. 42% of surveyed organizations have reached Level 3. Developers feel the gap when agent-generated changes outpace proof that failing tests reflect real defects and not stale automation.

The Augment Code Context Engine cuts hallucinations by 40% compared with limited-context tools through semantic dependency analysis, semantic filtering, and dynamic model routing. It analyzes entire codebases across 400,000+ files, giving teams architectural-level understanding before they accept AI output. For the higher stages of the model, Augment Cosmos, the Unified Cloud Agents Platform, runs specialized agents with shared context and memory across the software development lifecycle and gives teams the policy surface needed to govern multi-agent orchestration.

Score readiness with this model through compilation gates, live DOM checks, outcome metrics, drift detection, and coordination policy.

Why Classical TMMi Breaks for Agent-Native Teams

Classical TMMi breaks for agent-native teams because its staged architecture assumes stable human test artifacts measured against planned-release applications. Agent-native engineering produces tests against code changed by AI coding tools.

TMMi advances organizations through Initial, Managed, Defined, Measured, and Optimization levels, and the TMMi reference model requires teams to satisfy all goals at one level before progressing to the next. Its agent-era limitation is the gate it does not define. Every process area assumes a deterministic relationship between test design and test outcome.

That assumption collapses when generated artifacts fail before execution. An empirical study on invalid LLM-generated unit tests traced compile-time causes, with unresolved symbol errors accounting for 30.68% of invalid tests and parameter mismatch errors for 17.25%. Traditional automation is deterministic by design. Agents add verification boundaries teams must manage.

The table below contrasts the assumptions classical TMMi makes against the realities agent-native teams operate under.

| Assumption | Classical TMMi | Agent-Native Reality |
| --- | --- | --- |
| Test authorship | Human-written, version-controlled | Agent-generated, non-deterministic |
| Application stability | Changes through planned releases | Rewritten continuously by AI coding tools |
| Output reliability | Same test, same result | Variable test counts and coverage per run |
| Primary risk | Insufficient process discipline | Confident-but-wrong AI output |
| Top-level driver | Continuous process improvement | Verification controls at scale |

A maturity model centered on staged process goals does not score human oversight of AI-generated-code defects.

The Verification Gap: The Central Design Principle

The verification gap is the distance between how confident AI output appears and how correct it actually is. It organizes agent-native QA maturity because every failure mode in agent-generated code traces back to it.

A functional clustering study on LLM code generation found that without verification, "the error rate of returned answers [is] roughly 65%," which a verifier reduces to 2%.

Verification must catch three failure categories. Compilation failures stop tests before execution. Robustness gaps appear when generated code misses checks. A study on LLM code robustness found that 35.2% of LLM-generated code is less robust than human-written code, with "90% of which are related to missing conditional checks." Package hallucination turns dependency suggestions into a security vector.

Each failure category maps to a specific verification need and a concrete risk if left ungated.

| Failure Category | Verification Need | Risk If Ungated |
| --- | --- | --- |
| Compilation failures | Build and compilation validation | Invalid generated tests block execution |
| Robustness gaps | Conditional-check review | Less robust code reaches release paths |
| Package hallucination | Dependency verification | Hallucinated imports become security vectors |
| Selector drift | Live DOM validation | UI tests trigger hallucinated interactions |
| Behavioral drift | Drift detection | Agent behavior degrades silently |

A USENIX Security analysis of package hallucinations across coding models found rates averaging 19.7% across the board, with commercial models at 5.2% and open-source models at 21.7%. A UTSA study on hallucinated packages generated 2.23 million code samples and found 440,445 referenced hallucinated packages. The attack is direct: an adversary creates a malicious package matching a commonly hallucinated name, and any agent that imports it without verification pulls in the exploit.

This maps onto TMMi's structure. PA 4.3 (Advanced Reviews) addresses verification at Level 4 in the classical model. The agent-native model promotes an equivalent control to every level, because the verification gap exists from the first agent-generated test forward.

The Five Stages of Agent-Native QA Maturity

The five stages of agent-native QA maturity map TMMi's staged architecture to agent capability adoption. Teams progress from manual testing through governed multi-agent orchestration, and each stage has a verification gate tied to the agent capability in use.

Each stage retains the TMMi assessment principle that maturity equals "the lowest rating of its supporting process areas." The table applies that dependency to agent-native QA.

| Stage | Name | Agent Capability | Verification Gate |
| --- | --- | --- | --- |
| 1 | Manual | Ad hoc, no agent adoption | None (no AI output to verify) |
| 2 | Assisted | Generative test authoring | Compilation and build validation |
| 3 | Self-Healing | Locator adaptation, failure classification | Selector validation against live DOM |
| 4 | Orchestrated | Autonomous lifecycle, root cause analysis | Outcome metrics and drift detection |
| 5 | Governed | Multi-agent orchestration under policy | Coordination policy and autonomy calibration |

Stage 1: Manual Testing Without Agent Adoption

Stage 1 describes teams executing tests manually with no agent involvement. It corresponds to TMMi Level 1 (Initial), where testing is ad hoc and unmanaged. Since no AI output reaches QA, the verification gap does not yet apply. The defining limitation is boundary mismatch: manual QA remains valid only while no agent-generated code reaches the QA boundary.

Once engineers adopt coding agents that handle the AI-native engineering lifecycle, spanning "scoping and prototyping to implementation, testing, review, and even operational triage," manual QA teams face agent-generated change volume before they have agent-native QA capability.

Stage 2: AI-Assisted Test Authoring With Compilation Gates

Stage 2 introduces generative test authoring, where agents convert user stories, requirements, or legacy scripts into executable tests. Teams use a mandatory compilation and build validation gate to control that output. This maps to TMMi Level 2 (Managed), where a fundamental test approach is established and controlled.

Stage 2 capability is measurable at the build boundary. GitHub Copilot Testing for .NET generates tests from project configuration and the chosen test framework. The compilation gate is non-negotiable: Meta's TestGen-LLM deployment found that 75% of generated test cases built correctly, 57% passed reliably, and 25% increased coverage, so roughly one in four generated tests failed at the build step.

A Stage 2 gate works only when generated tests clear the same build boundary every time. Each element below verifies a different part of that boundary.

| Stage 2 Gate Element | What It Verifies | Failure Boundary |
| --- | --- | --- |
| Project configuration | Test generation matches the project setup | Generated tests target the wrong structure |
| Chosen test framework | Agent output uses executable test conventions | Tests cannot run under the framework |
| Compilation validation | Generated code builds before review | Invalid tests fail before execution |
| Pass-reliability filtering | Generated tests behave reliably | Unstable recommendations reach production review |
| Production recommendation review | Teams review accepted tests before use | Generated coverage is trusted too early |

Meta's filter-based assurance architecture is a Stage 2 pattern. During Meta's Instagram and Facebook test-a-thons, 73% of TestGen-LLM test improvements were accepted by developers. Compilation and pass-reliability filters make generated tests reviewable before production use. Teams comparing generation approaches can evaluate AI coding assistants against enterprise testing benchmarks before expanding agent-authored coverage.
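The first two filters in that pattern can be sketched as follows. This is an illustrative approximation, not Meta's implementation: candidate tests are plain Python source strings, "builds" means the source compiles, and "passes reliably" means the named test function passes on every one of several runs. The candidate names and the `runs` default are assumptions, and a real gate would execute generated code in a sandbox rather than with `exec`. Note that Python's compile step only catches syntax errors, so an unresolved symbol surfaces at run time here, whereas a compiled language would reject it at the build step.

```python
def builds(source: str):
    """Filter 1: return a code object if the source compiles, else None."""
    try:
        return compile(source, "<generated-test>", "exec")
    except (SyntaxError, ValueError):
        return None


def passes_reliably(code, test_name: str, runs: int = 5) -> bool:
    """Filter 2: the named test must pass on every one of `runs` executions."""
    for _ in range(runs):
        namespace: dict = {}
        try:
            exec(code, namespace)    # define the test function
            namespace[test_name]()   # run it; any exception counts as a failure
        except Exception:
            return False
    return True


def filter_candidates(candidates: dict[str, str], runs: int = 5) -> list[str]:
    """Keep only candidates that build and then pass reliably."""
    kept = []
    for name, source in candidates.items():
        code = builds(source)
        if code is not None and passes_reliably(code, name, runs):
            kept.append(name)
    return kept


# Hypothetical agent-generated candidates.
candidates = {
    "test_add": "def test_add():\n    assert 1 + 1 == 2\n",
    "test_broken": "def test_broken(:\n    assert True\n",                 # syntax error
    "test_unresolved": "def test_unresolved():\n    assert helper() == 1\n",  # unknown symbol
    "test_wrong": "def test_wrong():\n    assert 1 + 1 == 3\n",             # failing assertion
}

print(filter_candidates(candidates))  # ['test_add']
```

Only the survivors go forward to human review, which is the point of the pattern: reviewers spend their time on tests that already build and pass. A coverage-improvement filter would follow as a third stage and is omitted here.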

Stage 3: Self-Healing Tests With Live Validation

Stage 3 adds self-healing capability, where agents classify failures and adapt selectors. Teams gate that capability by validating selectors against the live DOM. This corresponds to TMMi Level 3 (Defined), where testing integrates into the development lifecycle and organizational standards apply across projects.

Self-healing addresses a specific instability: automated tests can stop matching the live system when application code changes. Stage 3 requires every selector to be checked against the running interface before execution.

Stage 3 validation separates useful locator adaptation from hallucinated UI interaction. The controls below define how each agent behavior is gated.

| Stage 3 Control | Agent Behavior | Required Validation |
| --- | --- | --- |
| Failure classification | Agent identifies why a test failed | Teams match the failure type to the live system |
| Locator adaptation | Agent proposes a changed selector | Teams check the selector against the running interface |
| Batch selector verification | Agent validates selectors before execution | Teams keep live DOM verification in the execution path |
| Timeout prevention | Agent avoids unreachable UI paths | Teams block hallucinated interactions |
| Review gate | Team reviews UI-related changes | Teams review selector changes before wider rollout |

Stage 3 validation has a measured failure boundary. A multi-agent selector validation study found that teams must validate LLM-generated selectors against the live DOM before execution, because unverified selector inference leads to hallucinated UI interactions and cascading timeouts. The study attributed 108 selector-timeout failures to the Coder agent, which bypassed batch selector verification tools.
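A batch selector check of this kind can be sketched with the standard library alone. The example below is a simplified stand-in: it validates proposed selectors against an HTML snapshot string instead of a running browser, and it understands only two selector forms, `#id` and `[data-testid="..."]`. The function name, the snapshot, and the selectors are all hypothetical. A production gate would query the live page through the team's browser automation driver.

```python
from html.parser import HTMLParser


class _AttributeIndex(HTMLParser):
    """Collect every id and data-testid value present in an HTML document."""

    def __init__(self):
        super().__init__()
        self.ids: set[str] = set()
        self.test_ids: set[str] = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id" and value:
                self.ids.add(value)
            elif name == "data-testid" and value:
                self.test_ids.add(value)


def missing_selectors(dom_html: str, selectors: list[str]) -> list[str]:
    """Return the selectors that match nothing in the DOM snapshot."""
    index = _AttributeIndex()
    index.feed(dom_html)
    prefix, suffix = '[data-testid="', '"]'
    missing = []
    for selector in selectors:
        if selector.startswith("#"):
            found = selector[1:] in index.ids
        elif selector.startswith(prefix) and selector.endswith(suffix):
            found = selector[len(prefix):-len(suffix)] in index.test_ids
        else:
            found = False  # unsupported selector forms are treated as unverified
        if not found:
            missing.append(selector)
    return missing


# Hypothetical DOM snapshot and agent-proposed selectors.
dom = '<form><input id="email"><button data-testid="submit-order">Buy</button></form>'
proposed = ["#email", '[data-testid="submit-order"]', "#checkout-button"]

print(missing_selectors(dom, proposed))  # ['#checkout-button']
```

Running the check on the whole batch before any test step executes turns a hallucinated selector into an immediate, named failure instead of a timeout discovered mid-run.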

Teams can apply this gate during selector-change review. Augment's code review provides automated pull request analysis that scored a 59% F-score on a published code review benchmark, compared with 49% for the nearest competitor. Teams setting up CI review gates with GitHub Actions can use selector-change review as a practical Stage 3 gate.

Stage 4: Orchestrated Lifecycle With Outcome Metrics

Stage 4 introduces autonomous test execution, root cause analysis, and risk-based prioritization. Outcome metrics and behavioral drift detection gate this stage. This maps to TMMi Level 4 (Measured), where measurement applies thoroughly across projects and advanced reviews are in use.

At Stage 4, teams measure outcomes and not activity counts. Teams calculate defect escape rate as bugs found after release over total bugs found. This shifts measurement toward released quality and away from test volume. Teams defining engineering velocity metrics for AI-enhanced workflows around this shift can separate agent activity from released defects, resolved failures, and drift signals.
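The defect escape rate described above is simple enough to compute directly. The sketch below assumes a team already counts defects by when they were found; the function name and the sample counts are illustrative.

```python
def defect_escape_rate(found_after_release: int, found_before_release: int) -> float:
    """Bugs found after release divided by total bugs found."""
    total = found_after_release + found_before_release
    if total == 0:
        return 0.0  # no defects recorded, so nothing escaped
    return found_after_release / total


# Hypothetical quarter: 54 defects caught before release, 6 found in production.
rate = defect_escape_rate(found_after_release=6, found_before_release=54)
print(f"{rate:.1%}")  # 10.0%
```

Tracking this number per release, alongside test volume but not replaced by it, shows whether more agent-generated tests are actually keeping defects out of production.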

Stage 4 metrics must distinguish agent workload from release-quality outcomes. The mapping below shows the activity metric each measurement area should drop and the outcome control that replaces it.

| Measurement Area | Activity Metric to Avoid | Outcome Control |
| --- | --- | --- |
| Test execution | Test volume | Released quality |
| Root cause analysis | Raw triage count | Resolution workflow connection |
| Risk prioritization | Agent activity | Released defects and resolved risks |
| Drift monitoring | One-time pass rate | Behavioral drift detection |
| Advanced review | Review presence | Review outcome across projects |

Salesforce's TF Triage Agent provides an example of AI-powered test-failure triage in a large-scale testing environment. The example fits Stage 4 because it connects failure classification to resolution workflows, which makes triage outcomes more important than raw agent activity. The drift detection gate matters because agent behavior degrades silently, which the CSA Agentic NIST AI RMF Profile addresses through its AG-MG.2 control for "Behavioral Drift Detection and Remediation."
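One simple way to approximate a drift gate is to compare a recent window of an agent-quality signal against a baseline window. The sketch below is an assumption-laden illustration and not the method defined by any framework named above: the signal is the weekly fraction of agent triage labels that a human later confirmed, and the `max_drop` threshold of 0.05 is arbitrary.

```python
from statistics import mean


def drift_detected(baseline: list[float], recent: list[float], max_drop: float = 0.05) -> bool:
    """Flag drift when the recent average falls more than `max_drop` below baseline."""
    if not baseline or not recent:
        raise ValueError("both windows need at least one observation")
    return mean(baseline) - mean(recent) > max_drop


# Hypothetical weekly confirmation rates for agent triage labels.
baseline_weeks = [0.92, 0.94, 0.93, 0.91]
stable_weeks = [0.93, 0.92, 0.91, 0.92]
degraded_weeks = [0.88, 0.85, 0.84, 0.83]

print(drift_detected(baseline_weeks, stable_weeks))    # False
print(drift_detected(baseline_weeks, degraded_weeks))  # True
```

A one-time pass rate cannot raise this alarm, because it has no baseline to compare against. Teams with enough data may prefer a statistical test over a fixed threshold.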

Stage 5: Governed Multi-Agent Orchestration

Stage 5 defines mature AI workflow orchestration platforms operating under explicit policy. Coordination policy and autonomy calibration gate this stage. This corresponds to TMMi Level 5 (Optimization), focused on continuous improvement and proactive defect prevention. Stage 5 maturity depends on explicit policy coverage over delegation, escalation, and human review.

Stage 5 governance controls agent count, escalation, and review. Automation volume alone does not establish maturity, so each governance element below maps to a specific risk it contains.

| Governance Element | Maturity Signal | Risk Controlled |
| --- | --- | --- |
| Autonomy calibration | Agents operate within defined boundaries | Automation percentage becomes a false maturity proxy |
| Delegation chain monitoring | Agent handoffs are visible | Errors propagate unchecked |
| Human-in-the-loop policy | Human judgment is required at defined points | Agents make unsupported decisions |
| Coordination policy | Multi-agent work is centrally controlled | Independent agents amplify errors |
| Policy-controlled escalation | Agents know when to reach out | Exceptions bypass review |

This inverts the intuitive scaling assumption. Research on multi-agent scaling shows that system "performance does not scale linearly with agent count but exhibits a pattern of diminishing returns." A separate agent scaling study reports that homogeneous agent systems plateau at roughly 4 agents and heterogeneous systems at roughly 8, beyond which "adding more agents results in diminishing returns and wasted compute resources."

Unmanaged coordination degrades quality through measured error propagation. A benchmark coordination study found that multi-agent improvement varies widely across task types, ranging from large gains to net regressions, and that "independent agents amplify errors 17.2× through unchecked propagation, while centralized coordination contains this to 4.4×." Governance defines the boundary between unchecked handoffs and centrally controlled escalation.

Stage 5 organizations define maturity through autonomy calibration, delegation chain monitoring, and human-in-the-loop policy enforcement. Agents on a governed platform "know when to reach out," with teams setting "the policies for where human judgment is required."
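The idea of policy-controlled escalation can be illustrated with a small, platform-neutral sketch. Everything here is hypothetical: the `AgentAction` and `EscalationPolicy` types, the action kinds, and the agent names are invented for illustration and do not represent any product's API. The policy escalates to a human when an action kind is on a review list or when the delegation chain behind it is deeper than the policy allows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentAction:
    agent: str
    kind: str                           # e.g. "rerun_test", "merge_pull_request"
    delegated_by: tuple[str, ...] = ()  # agents that handed this work down


@dataclass(frozen=True)
class EscalationPolicy:
    human_review_kinds: frozenset[str]  # action kinds that always need a human
    max_delegation_depth: int           # longest handoff chain allowed unreviewed


def requires_human_review(action: AgentAction, policy: EscalationPolicy) -> bool:
    """Escalate on sensitive action kinds or on overly deep delegation chains."""
    if action.kind in policy.human_review_kinds:
        return True
    return len(action.delegated_by) > policy.max_delegation_depth


policy = EscalationPolicy(
    human_review_kinds=frozenset({"merge_pull_request", "disable_test"}),
    max_delegation_depth=2,
)

routine = AgentAction(agent="test-runner", kind="rerun_test", delegated_by=("planner",))
risky = AgentAction(agent="merger", kind="merge_pull_request")
deep = AgentAction(
    agent="fixer", kind="rerun_test", delegated_by=("planner", "triager", "test-runner")
)

print(requires_human_review(routine, policy))  # False
print(requires_human_review(risky, policy))    # True
print(requires_human_review(deep, policy))     # True
```

Recording `delegated_by` on every action is what makes delegation chain monitoring possible: the handoff path is data the policy can inspect, not something reconstructed after an incident.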

Augment Cosmos is the Unified Cloud Agents Platform built for this stage. It runs specialized agents with shared context, shared memory, and policy-controlled escalation across the entire software development lifecycle, so platform teams compose Environments (where agents run and what they can touch), Experts (how agents behave and what events they subscribe to), and Sessions (auditable, replayable workflows) without stitching together separate services. Teams applying human-in-the-loop policy enforcement with Cosmos reduce eight software-development interruptions to three checkpoints: prioritization, spec and intent review, and contextual understanding through deep code review. Reference Experts shipped with the platform, including PR Author, Deep Code Review, E2E Testing, and Incident Response, give Stage 5 teams a governance baseline before they add custom agents on top.

Assessing Your Position: An Agent-Native QA Maturity Assessment

An agent-native QA maturity assessment determines your stage by identifying the lowest unmet verification gate. Teams tend to rate themselves by their most sophisticated agent capability, while this model rates them by their weakest verification control. An organization running multi-agent orchestration without selector validation against the live DOM sits at Stage 2 with a Stage 5 capability bolted on.

The assessment questions below map each unmet gate to the maximum stage a team can claim.

| Assessment Question | If No, Maximum Stage |
| --- | --- |
| Do generated tests pass a compilation and build gate before merge? | Stage 1 |
| Are agent-generated selectors validated against the live DOM before execution? | Stage 2 |
| Do you measure defect escape rate as an outcome, not test count as activity? | Stage 3 |
| Do you detect behavioral drift in agent test output over time? | Stage 3 |
| Is multi-agent coordination governed by explicit human-in-the-loop policy? | Stage 4 |

Apply the gates in order, starting from the lowest-level control, rather than from the most advanced deployed capability.

  1. Confirm that generated tests pass compilation and build validation before merge.
  2. Verify that agent-generated selectors are checked against the live DOM before execution.
  3. Replace test-count reporting with defect escape rate and other outcome measurements.
  4. Detect behavioral drift in agent test output over time.
  5. Review whether multi-agent coordination follows explicit human-in-the-loop policy.
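The ordered gate check above can be expressed as a short scoring function. This is a minimal sketch of the assessment logic as this guide describes it; the gate keys are hypothetical labels for the five questions, and an unanswered question is treated as "no".

```python
# Each gate is paired with the maximum stage a team can claim if the gate is unmet.
GATES = [
    ("compilation_gate", 1),
    ("live_dom_selector_validation", 2),
    ("defect_escape_rate_tracked", 3),
    ("behavioral_drift_detection", 3),
    ("human_in_the_loop_policy", 4),
]


def maximum_stage(answers: dict[str, bool]) -> int:
    """Walk the gates in order and stop at the first one that is not met."""
    for gate, cap in GATES:
        if not answers.get(gate, False):  # a missing answer counts as unmet
            return cap
    return 5


# A team running multi-agent orchestration under policy, but with no live DOM checks.
team = {
    "compilation_gate": True,
    "live_dom_selector_validation": False,
    "defect_escape_rate_tracked": True,
    "behavioral_drift_detection": True,
    "human_in_the_loop_policy": True,
}

print(maximum_stage(team))  # 2
```

The function returns on the first unmet gate, so later capabilities never raise the score. That is the "weakest control" rule in code form.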

This mirrors the TMMi assessment method, where assessors examine "the approach used, how well processes are documented, how well they are put into practice, and how fruitful their application was." Applied to agent-native QA, the assessment asks whether each verification gate is fully achieved, largely achieved, or only partially achieved.

Agent-native lag appears when teams add agents before they add build gates, DOM checks, drift detection, or human-in-the-loop policy. The TMMi survey found 42% of users at Level 3 but only 11% reaching Level 5.

How QA Roles Evolve Across the Maturity Stages

QA roles evolve from test execution toward verification system design and agent orchestration as teams progress through the maturity stages. Each stage shifts human leverage from doing the testing to governing the agents that do it.

The individual contributor role shifts as agent-generated output increases. An arXiv analysis of the evolving QA role maps the QA engineer's emphasis from "writing and executing tests" toward "designing verification systems and oracles for high-throughput agent output." As agents generate test volume, humans design the verification systems that catch confident-but-wrong output.

Role evolution follows the same gate sequence as the maturity model.

  1. Manual execution remains the focus before agent output reaches the QA boundary.
  2. Compilation-gate design becomes necessary when agents generate tests.
  3. Selector-validation review becomes necessary when agents self-heal locators.
  4. Outcome-metric ownership becomes necessary when agents triage and prioritize work.
  5. Coordination policy becomes necessary when multiple agents operate under policy.

The strategic measure changes alongside the role. Governance-driven stages move QA from pass-rate reporting toward concrete risk checks: escaped defects, selector drift, behavioral drift, and unmanaged agent handoffs. Testing success depends on whether verification gates reduce those risks. Teams running these workflows on Cosmos benefit from sessions that are shared by default and an expert registry that lets a verification pattern built by one engineer compound for the whole organization.

Map Your Verification Gates Before Scaling Agent Count

The agent-native QA maturity model resolves the tension between agent capability and verification controls. Run the assessment table against the current pipeline, identify the lowest unmet gate, and close that gate before adding agent capability above it. Map the gates already in place before scaling agent count: build validation, live DOM checks, outcome metrics, drift detection, and human-in-the-loop policy.

Written by

Molisha Shah

Molisha is an early GTM and Customer Champion at Augment Code, where she focuses on helping developers understand and adopt modern AI coding practices. She writes about clean code principles, agentic development environments, and how teams are restructuring their workflows around AI agents. She holds a degree in Business and Cognitive Science from UC Berkeley.