
Why AI Agent Metrics Lie: What CTOs Should Track

May 9, 2026
Molisha Shah

CTOs should track delivery outcomes instead of agent activity, because review queues, instability, and spec misalignment determine whether AI investments translate into shipped features. Adoption percentage, PR volume, and lines of generated code can all rise sharply while feature completion rate, change failure rate, and lead time stay flat or get worse. The metrics below explain which signals mislead engineering leaders, why they break at the system level, and what to measure instead.

TL;DR

AI adoption rates, PR volume, and lines of generated code can climb sharply while feature delivery, stability, and ROI stay flat. DORA 2024 reports that a 25% increase in AI adoption correlates with a 1.5% decrease in delivery throughput and a 7.2% decrease in delivery stability. The DORA 2025 follow-up shows the throughput relationship has flipped positive, while AI adoption continues to correlate with higher software delivery instability. CTOs measuring ROI should track delivery velocity, change failure rate, review queue health, and R&D allocation rather than activity counters that look successful on dashboards while masking ongoing stability problems.

The Productivity Gap 12 Months In

Engineering leaders keep running into the same frustrating pattern. Twelve months into an AI coding agent rollout, dashboards show higher adoption, more merged pull requests, and more generated code. The outcomes that matter (feature delivery, stability, and return on engineering spend) do not reliably improve. Boards ask why a multi-million-dollar tooling investment has not moved the roadmap, and the available metrics do not answer the question.

The research base in this article points to the same conclusion. DORA 2024, DORA 2025, and the METR randomized controlled trial all show that local coding speed and perceived productivity can diverge sharply from organizational delivery outcomes, especially on the stability side. The METR study, conducted in early 2025 on pre-agentic tooling, found developers using AI tools took 19% longer to complete real tasks while believing they had been 20% faster, a 39-percentage-point gap between perception and measurement. Faros AI's 2025 dataset of enterprise teams shows pull request volume rising 98% per developer with no measurable improvement in DORA metrics over the same period; Faros AI's 2026 follow-up reports continued volume growth alongside widening quality gaps.

This guide explains which agent metrics mislead CTOs, why they break at the system level, and which metrics better reflect real delivery performance. Activity rises while delivery stability lags, review queues absorb local coding gains, and perceived productivity diverges from measured outcomes. Each section pairs a misleading metric with the system-level signal that exposes what the dashboard hides, then shows how to baseline delivery throughput, review queue health, spec alignment, and human steering before scaling agent adoption further. Cosmos, Augment Code's operating system for agentic software development (currently in public preview), is built around these system-level signals rather than the activity counters that mislead AI ROI reporting.

See how Cosmos surfaces review bottlenecks across the SDLC before AI-generated volume turns into delivery drag.

Try Cosmos

Free tier available · VS Code extension · Takes 2 minutes

The Dashboard That Tells You Nothing

Measuring AI ROI requires tracking review throughput, stability, and delivery outcomes rather than relying on code output alone. Twelve months into an AI coding agent rollout, the quarterly dashboard can show higher adoption, more pull requests merged per developer, and thousands of new lines of code generated daily while leaving the investment case unresolved.

The problem is structural. The metrics most commonly surfaced on AI adoption dashboards measure inputs and activity rather than organizational delivery outcomes. The gap between those two categories is where engineering budgets disappear.

The distinction becomes clearer when the dashboard metrics and the delivery metrics are compared directly.

| Metric Type | Example | What It Measures | What It Misses |
| --- | --- | --- | --- |
| Activity metric | Adoption percentage | Tool usage rate | Review capacity, batch size, stability constraints |
| Activity metric | PR volume | Visible code output | Review delay, PR size, downstream bottlenecks |
| Activity metric | Lines of code generated | Code volume | Perceived productivity, code quality, delivery outcomes |
| Delivery metric | Delivery throughput | Shipped organizational output | Does not rely on code production alone |
| Delivery metric | Delivery stability | System reliability after changes | Detects quality deterioration hidden by output growth |

These categories look similar on a dashboard, but they answer different questions.

The Three Metrics That Make Agent Adoption Look Successful

Three dashboard metrics make agent adoption look successful because they are easy to collect at the code-production layer, while review, integration, and stability constraints determine whether that activity becomes shipped software. Each metric below records visible output while missing the mechanism that governs organizational delivery.

Adoption percentage

Adoption percentage fails as a productivity metric because the same usage rate can produce different outcomes once review capacity, batch size, and stability constraints are included. The DORA 2024 research program found that a 25% increase in AI adoption correlates with a 1.5% decrease in delivery throughput and a 7.2% decrease in delivery stability. AI lets developers generate code faster, which produces larger batch sizes that are slower to review and more prone to system instability. Two teams with identical adoption rates will have radically different delivery outcomes depending on their underlying engineering practices, so adoption percentage implies an equivalence that does not exist.

The DORA 2025 report updates this picture. The throughput relationship reversed and is now positive, but AI adoption continues to correlate with higher software delivery instability. The report's framing is direct: AI acts as an amplifier, magnifying existing organizational strengths and weaknesses, with stability being the dimension where weak foundations show up first.

PR volume

PR volume fails as a delivery metric because more merged pull requests can coincide with slower review, larger changes, and flat organizational outcomes. The Faros AI 2025 Productivity-Reliability Paradox report quantifies the disconnect in a dataset of 10,000 developers across 1,255 teams:

| Metric | Change After AI Adoption (Faros AI 2025 dataset) |
| --- | --- |
| Pull requests merged per developer | +98% |
| PR review time (median) | +91% |
| Average PR size | +154% |
| Bug counts per developer | +9% |
| Organizational DORA metrics | No measurable improvement |

Faros AI's 2026 follow-up dataset, drawn from 22,000 developers across 4,000 teams, reports a smaller PR-size increase of 51.3% but the same directional pattern: volume up, review and quality lagging, and organization-level gains harder to detect than team-level activity.

PR volume rose sharply across both datasets while organization-level delivery throughput stayed flat: the dynamic that inflates the dashboard number is the same one that masks the delivery problem underneath.

Within Cosmos, deep code review evaluates PRs against codebase context, architectural patterns, and team standards as part of a coordinated SDLC rather than a final-stage gate, so review attention concentrates on the changes most likely to introduce instability. The same review approach achieves a 59% F-score on an AI code review benchmark that includes both precision and recall.

Lines of code generated

Lines of code generated fail as a productivity metric because code volume has little relationship to perceived productivity and can reward patterns associated with lower code quality. The Copilot productivity study reports that net lines of code had a Spearman correlation of ρ ≈ 0.09 with perceived productivity: essentially zero. The Stanford Software Engineering Productivity group, studying enterprise engineering organizations since 2022, has stated that traditional metrics including lines of code, story points, and commit counts do not accurately measure engineering productivity.

Optimizing for lines generated rewards the behavior most likely to degrade long-term system health.

Cosmos is built on top of the Context Engine, which processes entire codebases across 400,000+ files through semantic dependency graph analysis and reduces hallucinations by 40% in environments where context quality becomes the limiting factor for large codebase analysis. Volume of generated code is treated as an input to delivery rather than a measure of it.

Why These Metrics Hide Flat Organizational Productivity

Activity metrics hide flat organizational productivity because code velocity improves one stage of the system while review delay, waiting time, and rework determine end-to-end delivery. Three mechanisms explain why optimizing code velocity fails to improve, and often harms, delivery stability and end-to-end velocity.

  1. Coding is approximately 21% of lead time. If AI reduces active development time by 50%, the maximum theoretical improvement to overall lead time is approximately 10.5%, before accounting for any review overhead the increased PR volume introduces (a worked sketch follows the table below).
  2. The review pipeline inverts the gains. AI tensions research states directly: "Velocity gains for an individual author frequently translate into a significantly increased cognitive load for the reviewer."
  3. The perception gap is systematic. The METR randomized controlled trial, conducted in early 2025 with 16 experienced developers across 246 real tasks on large open-source repositories (22,000+ stars, 1M+ lines of code), found developers using AI tools took 19% longer to complete tasks. Before starting, developers predicted they would be approximately 24% faster. After finishing, they believed they had been approximately 20% faster, leaving a perception-reality gap of approximately 39 percentage points. METR has since announced a follow-up study with newer agentic tooling, so the absolute number may shift, but the directional gap between perception and measurement is the relevant signal for ROI reporting.

| Failure point | Mechanism | Delivery effect |
| --- | --- | --- |
| Coding share of lead time | Faster authoring changes only part of total elapsed time | Overall lead time improvement is structurally capped |
| Review pipeline | Reviewer cognitive load rises as authoring speed rises | Throughput gains are absorbed before production |
| Perception gap | Developers feel faster even when measured task time worsens | Self-reported productivity creates false positives |
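
The cap described in the first row is simple arithmetic. A minimal sketch, using the roughly 21% coding share and 50% authoring speedup cited above, plus a purely hypothetical review-overhead figure:

```python
# Back-of-envelope check on the lead-time cap. Inputs are illustrative:
# the ~21% coding share comes from the figure cited above, the 5% review
# overhead is a hypothetical placeholder.
coding_share = 0.21          # fraction of total lead time spent actively coding
authoring_speedup = 0.50     # fractional reduction in coding time from AI

max_lead_time_improvement = coding_share * authoring_speedup
print(f"Maximum lead-time improvement: {max_lead_time_improvement:.1%}")   # 10.5%

review_overhead = 0.05       # hypothetical extra review time as a fraction of lead time
net_improvement = max_lead_time_improvement - review_overhead
print(f"Net improvement after review overhead: {net_improvement:.1%}")     # 5.5%
```

Even before the review queue pushes back, the organizational ceiling on lead-time improvement sits around 10%, which is why code-velocity dashboards overstate the effect.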

All three of the most common AI ROI measurement approaches (adoption rate, developer satisfaction, and self-reported productivity) would have produced false positives in the exact population where the RCT showed objectively negative results.

Measuring Delivery Velocity Across the Organization

Delivery velocity is the metric category that matters because customer-visible outcomes depend on shipped, stable software produced by the organization rather than local code output produced by individual developers. The measurements below track whether working software reaches users faster and more reliably at the team or organizational level.

The DX Core 4 framework, co-authored by Abi Noda and Laura Tacho (CTO of DX) in collaboration with Nicole Forsgren (founder of DORA), Margaret-Anne Storey (co-author of SPACE), and Michaela Greiler (co-author of DevEx), organizes measurement across four intentionally oppositional pillars: Speed, Effectiveness, Quality, and Impact.

The oppositional architecture resists AI-inflated output metric distortions: speed is valuable, but speed paired with declining effectiveness produces no net gain.

The primary metric: epic or feature completion rate per engineer-month. CTO-level metrics should represent business impact and engineering organization-level effectiveness, treating individual output as a secondary signal. Feature completion rate is the lowest-granularity metric that still reflects organizational delivery.

Pair throughput with stability. Throughput improvements without a simultaneous stability metric can obscure deterioration until it shows up in customer-facing failures, which is exactly the pattern DORA 2025 surfaces.

| Metric Category | What to Track | Why It Matters |
| --- | --- | --- |
| Delivery throughput | Epics/features completed per engineer-month | Captures organizational output rather than activity |
| Delivery stability | Change failure rate, MTTR | Detects quality degradation from AI-inflated volume |
| Flow efficiency | Ratio of active work time to total elapsed time | Reveals where pipeline constraints absorb AI speed gains |
| R&D allocation | % engineering time on new capabilities vs. maintenance | Shows whether AI-freed time flows to value creation or expanded maintenance |

The R&D allocation ratio captures whether time freed from AI-assisted coding is redirected to higher-value work or absorbed by expanded maintenance burdens. Teams comparing measurement systems can review enterprise AI coding tools for implementation patterns beyond raw output counts.
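
A minimal sketch of how the four categories might be computed from a quarterly export; every field name and number below is a hypothetical placeholder rather than a prescribed schema:

```python
# Hypothetical quarterly rollup; swap in real exports from the team's trackers.
completed_features = 18          # epics/features shipped this quarter
engineer_months = 14 * 3         # 14 engineers over a 3-month quarter

deployments = 120
failed_changes = 9               # deployments causing incidents or rollbacks

active_work_days = 260           # hands-on days across all shipped items
elapsed_days = 640               # open-to-done elapsed days across the same items

new_capability_days = 1900       # engineer-days on new capabilities
maintenance_days = 1400          # engineer-days on maintenance, upgrades, toil

metrics = {
    "feature completion per engineer-month": completed_features / engineer_months,
    "change failure rate": failed_changes / deployments,
    "flow efficiency": active_work_days / elapsed_days,
    "R&D allocation (new work share)": new_capability_days / (new_capability_days + maintenance_days),
}

for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Tracking the four numbers side by side is the point: a quarter where completion per engineer-month rises while change failure rate also rises is the DORA 2025 pattern, not a win.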

Explore how Cosmos coordinates agents and review across the SDLC so delivery throughput and stability move together instead of against each other.

Try Cosmos

Free tier available · VS Code extension · Takes 2 minutes


Review Bottleneck Reduction: Is the Human Queue Actually Shrinking?

Review bottleneck reduction measures whether AI deployment decreases the time code spends waiting for human review, because faster authoring only improves delivery when the queue also shrinks. The evidence in this section indicates that many organizations are shifting the constraint downstream. If the queue grows, the organization has shifted the bottleneck rather than removed it.

A three-tier measurement stack captures review queue health:

| Measurement layer | Metric | Signal |
| --- | --- | --- |
| Queue health | Time-to-first-review (p50 and p90) | Growing p90 even with stable p50 signals a tail of neglected PRs |
| Queue health | PR cycle time (open to merge) | Any sustained increase indicates bottleneck formation |
| Queue health | Review queue depth (open PRs awaiting first review, weekly snapshot) | Growing trend week-over-week means PR creation outpaces review capacity |
| Queue health | PRs opened vs. merged ratio | Widening gap confirms the review queue is the system constraint |
| Senior engineer load | PRs reviewed per person per week, segmented by seniority | Shows whether review burden is concentrating by role |
| Senior engineer load | Percentage of AI-generated PRs assigned to senior or staff engineers vs. distributed across the team | Shows whether AI review work is being pushed to the most experienced engineers |
| Senior engineer load | Coding-to-reviewing time ratio by role | Shows whether senior capacity is being reallocated from building to reviewing |
| System-level validation (DORA Four Keys) | Lead time for changes | Validates whether review improvements reach delivery |
| System-level validation (DORA Four Keys) | Deployment frequency | Validates whether review improvements reach production cadence |
| System-level validation (DORA Four Keys) | Change failure rate | Validates whether higher throughput is degrading stability |
| System-level validation (DORA Four Keys) | Mean time to recovery | Validates whether downstream reliability is holding after faster authoring |

When PR merge volume increases 98% but these four metrics remain unchanged, review queues and downstream pipeline constraints are absorbing the throughput gain before it reaches production.
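
A minimal sketch of the queue-health tier, assuming PR records exported from whichever source-control API the team uses; the record shape, helper names, and values are hypothetical:

```python
from datetime import datetime

# Hypothetical PR export: opened, first review (None if still waiting), merged (None if open).
prs = [
    {"opened": datetime(2026, 4, 1, 9),  "first_review": datetime(2026, 4, 1, 15), "merged": datetime(2026, 4, 2, 11)},
    {"opened": datetime(2026, 4, 2, 10), "first_review": datetime(2026, 4, 5, 16), "merged": datetime(2026, 4, 7, 9)},
    {"opened": datetime(2026, 4, 3, 14), "first_review": None, "merged": None},
]

def hours(delta):
    return delta.total_seconds() / 3600

def percentile(values, pct):
    """Nearest-rank percentile; adequate for a weekly dashboard signal."""
    ranked = sorted(values)
    k = min(len(ranked) - 1, int(round(pct / 100 * (len(ranked) - 1))))
    return ranked[k]

waits = [hours(p["first_review"] - p["opened"]) for p in prs if p["first_review"]]
cycles = [hours(p["merged"] - p["opened"]) for p in prs if p["merged"]]

print("time-to-first-review p50:", percentile(waits, 50), "h")
print("time-to-first-review p90:", percentile(waits, 90), "h")       # watch the tail
print("PR cycle time p50:", percentile(cycles, 50), "h")
print("queue depth (awaiting first review):", sum(1 for p in prs if p["first_review"] is None))
print("opened:merged ratio:", len(prs) / max(1, len(cycles)))        # widening gap = constraint
```

The tiering matters: if p90 time-to-first-review and queue depth grow while the DORA four keys stay flat, the review queue, not authoring speed, is the binding constraint.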

Cosmos, currently in public preview, is Augment Code's operating system for agentic software development: a platform where developers, agents, codebases, tools, and memory coordinate across the SDLC rather than firing one agent at a time into an unchanged review pipeline. Within Cosmos, deep code review and human-in-the-loop policies shift defect detection earlier in the loop, so review attention concentrates at three checkpoints (prioritization, spec and intent review, and contextual code evolution) instead of spreading thinly across every PR. That structural change is what allows authoring volume gains to reach production instead of stalling in the review queue.

Spec Alignment Rate: Does Agent Output Match What Was Planned?

Spec alignment rate measures the degree to which AI-generated code fulfills its stated requirements, because executable code can still conflict with the intended work and create hidden rework. Output can compile, pass tests, and still miss the requirement, so this section focuses on requirement fit beyond functional correctness alone. LLM hallucination research examines cases where generated code conflicts with stated requirements.

Compilation and test passage do not indicate specification compliance. A coding agent benchmark evaluated execution correctness and specification compliance as complementary, separately assessed metrics, suggesting that functional correctness alone does not capture full task adherence.

Constitutional constraints research discusses using constitutional constraints and traceability from principles to code. One reported pattern is that limiting spec principles per AI request to 3-5 relevant items achieves 96% compliance, compared to 78% when the full specification is included, though compliance rates by specification strategy are not yet firmly established.

Engineering teams can apply three measurement approaches:

  1. Semantic alignment scoring: Compute cosine similarity between requirement text embeddings and corresponding generated code embeddings, per traceability research. Track as Trace Coverage: the percentage of AI-generated code blocks with valid traceability identifiers. A minimal sketch follows this list.
  2. Spec-derived test pass rate: Run BDD or acceptance tests derived from the specification against the implementation, per the spec-driven paper. Track whether violations trigger code fixes or spec revisions, since each reveals a different root cause.
  3. Rework rate on accepted code: A longitudinal analysis cites research finding that of AI-suggested code initially accepted, 18.16% is later deleted and 6.62% is heavily rewritten. These figures are directional and remain independently unverified, but the pattern is consistent: initial acceptance does not mean specification compliance.
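
A minimal sketch of the first approach, semantic alignment scoring plus Trace Coverage. The embed() stub below is a placeholder for whatever embedding model the team already uses, and the requirement text, code snippet, and traceability IDs are illustrative:

```python
import math

def embed(text: str) -> list[float]:
    # Placeholder: in practice this calls the team's embedding model.
    return [float(len(word)) for word in text.split()[:8]] + [0.0] * 8

def cosine(a: list[float], b: list[float]) -> float:
    n = min(len(a), len(b))
    dot = sum(x * y for x, y in zip(a[:n], b[:n]))
    norm = math.sqrt(sum(x * x for x in a[:n])) * math.sqrt(sum(y * y for y in b[:n]))
    return dot / norm if norm else 0.0

requirement = "Retry failed payment webhooks up to three times with exponential backoff"
generated_code = "def retry_webhook(event): ...  # REQ-142: retries with backoff"

alignment = cosine(embed(requirement), embed(generated_code))
print(f"semantic alignment score: {alignment:.2f}")   # track per requirement over time

# Trace Coverage: share of AI-generated code blocks carrying a valid requirement ID.
ai_blocks = [{"id": "REQ-142"}, {"id": None}, {"id": "REQ-098"}]
trace_coverage = sum(1 for b in ai_blocks if b["id"]) / len(ai_blocks)
print(f"trace coverage: {trace_coverage:.0%}")
```

The alignment score is only meaningful as a trend per requirement over time, not as an absolute threshold.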

The Steering Ratio: Human Time on Judgment vs. Human Time on Execution

The steering ratio measures the distribution of human engineering time between judgment-heavy work and manual execution, because agent adoption changes where attention is spent before it changes output. Human attention must stay concentrated on judgment-heavy decisions if agent output is to improve delivery outcomes. No validated empirical optimal ratio exists in the peer-reviewed literature, and this section treats steering ratio as an emerging measurement approach rather than a settled benchmark.


A Springer HAI chapter distinguishes human-centric, symbiotic, and AI-centric modes corresponding to different distributions of workload. A CHI 2026 paper identifies that step-by-step human approval across all decisions "can severely limit agent autonomy when desirable and may also disengage the user if their only form of supervision is repetitive." The optimal steering ratio is a dynamic allocation that varies by task type and risk level.

| Steering question | What to measure | What the signal means |
| --- | --- | --- |
| Is judgment staying with engineers? | ADRs and design documents created per sprint relative to AI-generated code accepted per sprint | Declining artifact creation relative to rising agent output signals a shift toward cognitive offloading rather than deliberate steering |
| Is higher throughput still controlled? | Throughput paired with change failure rate | Throughput rising with CFR stable or declining indicates effective human steering; throughput rising with CFR also rising signals that execution volume has outpaced judgment quality |
| Is oversight calibrated to risk? | Proportion of irreversible decisions that received explicit human review | Irreversible decisions should receive human review before implementation, while reversible decisions can be delegated |

Track steering artifacts relative to agent output. Monitor architecture decision records (ADRs) and design documents created per sprint relative to AI-generated code accepted per sprint. Declining artifact creation relative to rising agent output may reflect changing patterns of how users rely on AI systems.

Pair change failure rate with throughput. Throughput rising with CFR stable or declining indicates effective human steering of AI output. Throughput rising with CFR also rising signals that execution volume has outpaced judgment quality, the same pattern DORA 2025 identifies at industry scale.

Calibrate oversight by decision reversibility. Irreversible decisions require human review before implementation. Reversible decisions can be delegated to agents. Measure the proportion of irreversible decisions that received explicit human review.
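
A minimal sketch of how the three signals could be tracked per sprint; the field names and values are hypothetical placeholders pulled from whatever tracking already exists:

```python
# Hypothetical per-sprint counts; replace with real exports from the team's trackers.
sprints = [
    {"adrs_and_designs": 6, "ai_prs_accepted": 40, "throughput": 11, "cfr": 0.06,
     "irreversible_decisions": 4, "irreversible_reviewed": 4},
    {"adrs_and_designs": 3, "ai_prs_accepted": 95, "throughput": 14, "cfr": 0.11,
     "irreversible_decisions": 5, "irreversible_reviewed": 3},
]

for i, s in enumerate(sprints, start=1):
    steering_artifacts_per_ai_pr = s["adrs_and_designs"] / s["ai_prs_accepted"]
    review_coverage = s["irreversible_reviewed"] / s["irreversible_decisions"]
    print(f"sprint {i}: artifacts/AI-PR={steering_artifacts_per_ai_pr:.2f}, "
          f"throughput={s['throughput']}, CFR={s['cfr']:.0%}, "
          f"irreversible review coverage={review_coverage:.0%}")

# Declining artifacts/AI-PR while throughput and CFR both rise is the pattern to
# escalate: execution volume is outpacing judgment quality.
```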

Cosmos enforces where human judgment is required through configurable human-in-the-loop escalation policies, so engineering attention concentrates on irreversible decisions while routine, reversible work is delegated to agents. Teams configure the policies once and Cosmos routes execution through agents while flagging the checkpoints that require explicit review, which is the structural difference between agent activity and steered delivery.

Start Replacing Activity Metrics This Quarter

Replacing activity metrics this quarter is the practical next step because dashboards can improve while review queues, stability, and maintenance burden worsen underneath them.

Establish a baseline for delivery throughput, stability, review queue health, and R&D allocation before expanding agent usage further, then compare those system-level metrics against adoption and PR volume so the team stops treating activity as proof of value. The work breaks down into three concrete moves:

  1. Baseline delivery throughput and stability.
  2. Measure review queue health.
  3. Compare those metrics against adoption and PR volume.

Cosmos shortens the gap between activity metrics and delivery metrics by moving deep code review, human-in-the-loop policies, and shared system services earlier in the SDLC, so defects surface before final-stage review instead of accumulating until the end of the loop.

Because Cosmos coordinates agents, the codebase, and review attention as a single system, the metrics that matter (delivery throughput, change failure rate, and review queue health) move together rather than against each other.

See how Cosmos coordinates agents, review, and human judgment so activity gains turn into shipped, stable features.

Try Cosmos

Free tier available · VS Code extension · Takes 2 minutes


Written by

Molisha Shah

Molisha is an early GTM and Customer Champion at Augment Code, where she focuses on helping developers understand and adopt modern AI coding practices. She writes about clean code principles, agentic development environments, and how teams are restructuring their workflows around AI agents. She holds a degree in Business and Cognitive Science from UC Berkeley.

