
Why AI Agent Metrics Lie: What CTOs Should Track

May 9, 2026
Molisha Shah

CTOs should track delivery outcomes instead of agent activity, because review queues, instability, and spec misalignment determine whether AI investments translate into shipped features. Adoption percentage, PR volume, and lines of generated code can all rise sharply while feature completion rate, change failure rate, and lead time stay flat or get worse. The metrics below explain which signals mislead engineering leaders, why they break at the system level, and what to measure instead.

TL;DR

AI adoption rates, PR volume, and lines of generated code can climb sharply while feature delivery, stability, and ROI stay flat. DORA 2024 reports that a 25% increase in AI adoption correlates with a 1.5% decrease in delivery throughput and a 7.2% decrease in delivery stability. The DORA 2025 follow-up shows the throughput relationship has flipped positive, while AI adoption continues to correlate with higher software delivery instability. CTOs measuring ROI should track delivery velocity, change failure rate, review queue health, and R&D allocation rather than activity counters that look successful on dashboards while masking ongoing stability problems.

The Productivity Gap 12 Months In

Engineering leaders keep running into the same frustrating pattern. Twelve months into an AI coding agent rollout, dashboards show higher adoption, more merged pull requests, and more generated code. The outcomes that matter (feature delivery, stability, and return on engineering spend) do not reliably improve. Boards ask why a multi-million-dollar tooling investment has not moved the roadmap, and the available metrics do not answer the question.

The research base in this article points to the same conclusion. DORA 2024, DORA 2025, and the METR randomized controlled trial all show that local coding speed and perceived productivity can diverge sharply from organizational delivery outcomes, especially on the stability side. The METR study, conducted in early 2025 on pre-agentic tooling, found developers using AI tools took 19% longer to complete real tasks while believing they had been 20% faster, a 39-percentage-point gap between perception and measurement. Faros AI's 2025 dataset of enterprise teams shows pull request volume rising 98% per developer with no measurable improvement in DORA metrics over the same period; Faros AI's 2026 follow-up reports continued volume growth alongside widening quality gaps.

This guide explains which agent metrics mislead CTOs, why they break at the system level, and which metrics better reflect real delivery performance. Activity rises while delivery stability lags, review queues absorb local coding gains, and perceived productivity diverges from measured outcomes. Each section pairs a misleading metric with the system-level signal that exposes what the dashboard hides, then shows how to baseline delivery throughput, review queue health, spec alignment, and human steering before scaling agent adoption further. Cosmos, Augment Code's operating system for agentic software development (currently in public preview), is built around these system-level signals rather than the activity counters that mislead AI ROI reporting.

See how Cosmos surfaces review bottlenecks across the SDLC before AI-generated volume turns into delivery drag.

Try Cosmos

Free tier available · VS Code extension · Takes 2 minutes

The Dashboard That Tells You Nothing

Measuring AI ROI requires tracking review throughput, stability, and delivery outcomes rather than relying on code output alone. Twelve months into an AI coding agent rollout, the quarterly dashboard can show higher adoption, more pull requests merged per developer, and thousands of new lines of code generated daily while leaving the investment case unresolved.

The problem is structural. The metrics most commonly surfaced on AI adoption dashboards measure inputs and activity rather than organizational delivery outcomes. The gap between those two categories is where engineering budgets disappear.

The distinction becomes clearer when the dashboard metrics and the delivery metrics are compared directly.

| Metric Type | Example | What It Measures | What It Misses |
| --- | --- | --- | --- |
| Activity metric | Adoption percentage | Tool usage rate | Review capacity, batch size, stability constraints |
| Activity metric | PR volume | Visible code output | Review delay, PR size, downstream bottlenecks |
| Activity metric | Lines of code generated | Code volume | Perceived productivity, code quality, delivery outcomes |
| Delivery metric | Delivery throughput | Shipped organizational output | Does not rely on code production alone |
| Delivery metric | Delivery stability | System reliability after changes | Detects quality deterioration hidden by output growth |

These categories look similar on a dashboard, but they answer different questions.

The Three Metrics That Make Agent Adoption Look Successful

Three dashboard metrics make agent adoption look successful because they are easy to collect at the code-production layer, while review, integration, and stability constraints determine whether that activity becomes shipped software. Each metric below records visible output while missing the mechanism that governs organizational delivery.

Adoption percentage

Adoption percentage fails as a productivity metric because the same usage rate can produce different outcomes once review capacity, batch size, and stability constraints are included. The DORA 2024 research program found that a 25% increase in AI adoption correlates with a 1.5% decrease in delivery throughput and a 7.2% decrease in delivery stability. AI lets developers generate code faster, which produces larger batch sizes that are slower to review and more prone to system instability. Two teams with identical adoption rates will have radically different delivery outcomes depending on their underlying engineering practices, so adoption percentage implies an equivalence that does not exist.

The DORA 2025 report updates this picture. The throughput relationship reversed and is now positive, but AI adoption continues to correlate with higher software delivery instability. The report's framing is direct: AI acts as an amplifier, magnifying existing organizational strengths and weaknesses, with stability being the dimension where weak foundations show up first.

PR volume

PR volume fails as a delivery metric because more merged pull requests can coincide with slower review, larger changes, and flat organizational outcomes. The Faros AI 2025 Productivity-Reliability Paradox report quantifies the disconnect in a dataset of 10,000 developers across 1,255 teams:

| Metric | Change After AI Adoption (Faros AI 2025 dataset) |
| --- | --- |
| Pull requests merged per developer | +98% |
| PR review time (median) | +91% |
| Average PR size | +154% |
| Bug counts per developer | +9% |
| Organizational DORA metrics | No measurable improvement |

Faros AI's 2026 follow-up dataset, drawn from 22,000 developers across 4,000 teams, reports a smaller PR-size increase of 51.3% but the same directional pattern: volume up, review and quality lagging, and organization-level gains harder to detect than team-level activity.

PR volume rose sharply across both datasets while organization-level delivery throughput stayed flat: the dynamic that inflates the dashboard number is the same one that masks the delivery problem underneath.

Within Cosmos, deep code review evaluates PRs against codebase context, architectural patterns, and team standards as part of a coordinated SDLC rather than a final-stage gate, so review attention concentrates on the changes most likely to introduce instability. The same review approach achieves a 59% F-score on an AI code review benchmark that includes both precision and recall.

Lines of code generated

Lines of code generated fail as a productivity metric because code volume has little relationship to perceived productivity and can reward patterns associated with lower code quality. The Copilot productivity study reports that net lines of code had a Spearman correlation of ρ ≈ 0.09 with perceived productivity: essentially zero. The Stanford Software Engineering Productivity group, studying enterprise engineering organizations since 2022, has stated that traditional metrics including lines of code, story points, and commit counts do not accurately measure engineering productivity.

Optimizing for lines generated rewards the behavior most likely to degrade long-term system health.

Cosmos is built on top of the Context Engine, which processes entire codebases across 400,000+ files through semantic dependency graph analysis and reduces hallucinations by 40% in environments where context quality becomes the limiting factor for large codebase analysis. Volume of generated code is treated as an input to delivery rather than a measure of it.

Why These Metrics Hide Flat Organizational Productivity

Activity metrics hide flat organizational productivity because code velocity improves one stage of the system while review delay, waiting time, and rework determine end-to-end delivery. Three mechanisms explain why optimizing code velocity fails to improve, and often harms, delivery stability and end-to-end velocity.

  1. Coding is approximately 21% of lead time. If AI reduces active development time by 50%, the maximum theoretical improvement to overall lead time is approximately 10.5%, before accounting for any review overhead the increased PR volume introduces (a worked sketch follows the table below).
  2. The review pipeline inverts the gains. AI tensions research states directly: "Velocity gains for an individual author frequently translate into a significantly increased cognitive load for the reviewer."
  3. The perception gap is systematic. The METR randomized controlled trial, conducted in early 2025 with 16 experienced developers across 246 real tasks on large open-source repositories (22,000+ stars, 1M+ lines of code), found developers using AI tools took 19% longer to complete tasks. Before starting, developers predicted they would be approximately 24% faster. After finishing, they believed they had been approximately 20% faster, leaving a perception-reality gap of approximately 39 percentage points. METR has since announced a follow-up study with newer agentic tooling, so the absolute number may shift, but the directional gap between perception and measurement is the relevant signal for ROI reporting.

| Failure point | Mechanism | Delivery effect |
| --- | --- | --- |
| Coding share of lead time | Faster authoring changes only part of total elapsed time | Overall lead time improvement is structurally capped |
| Review pipeline | Reviewer cognitive load rises as authoring speed rises | Throughput gains are absorbed before production |
| Perception gap | Developers feel faster even when measured task time worsens | Self-reported productivity creates false positives |
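
The cap described in the first row is simple arithmetic. A minimal sketch, using the roughly 21% coding share and 50% authoring speedup cited above, plus a purely hypothetical review-overhead figure:

```python
# Back-of-envelope check on the lead-time cap. Inputs are illustrative:
# the ~21% coding share comes from the figure cited above, the 5% review
# overhead is a hypothetical placeholder.
coding_share = 0.21          # fraction of total lead time spent actively coding
authoring_speedup = 0.50     # fractional reduction in coding time from AI

max_lead_time_improvement = coding_share * authoring_speedup
print(f"Maximum lead-time improvement: {max_lead_time_improvement:.1%}")   # 10.5%

review_overhead = 0.05       # hypothetical extra review time as a fraction of lead time
net_improvement = max_lead_time_improvement - review_overhead
print(f"Net improvement after review overhead: {net_improvement:.1%}")     # 5.5%
```

Even before the review queue pushes back, the organizational ceiling on lead-time improvement sits around 10%, which is why code-velocity dashboards overstate the effect.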

All three of the most common AI ROI measurement approaches (adoption rate, developer satisfaction, and self-reported productivity) would have produced false positives in the exact population where the RCT showed objectively negative results.

Measuring Delivery Velocity Across the Organization

Delivery velocity is the metric category that matters because customer-visible outcomes depend on shipped, stable software produced by the organization rather than local code output produced by individual developers. The measurements below track whether working software reaches users faster and more reliably at the team or organizational level.

The DX Core 4 framework, co-authored by Abi Noda and Laura Tacho (CTO of DX) in collaboration with Nicole Forsgren (founder of DORA), Margaret-Anne Storey (co-author of SPACE), and Michaela Greiler (co-author of DevEx), organizes measurement across four intentionally oppositional pillars: Speed, Effectiveness, Quality, and Impact.

The oppositional architecture resists AI-inflated output metric distortions: speed is valuable, but speed paired with declining effectiveness produces no net gain.

The primary metric: epic or feature completion rate per engineer-month. CTO-level metrics should represent business impact and engineering organization-level effectiveness, treating individual output as a secondary signal. Feature completion rate is the lowest-granularity metric that still reflects organizational delivery.

Pair throughput with stability. Throughput improvements without a simultaneous stability metric can obscure deterioration until it shows up in customer-facing failures, which is exactly the pattern DORA 2025 surfaces.

| Metric Category | What to Track | Why It Matters |
| --- | --- | --- |
| Delivery throughput | Epics/features completed per engineer-month | Captures organizational output rather than activity |
| Delivery stability | Change failure rate, MTTR | Detects quality degradation from AI-inflated volume |
| Flow efficiency | Ratio of active work time to total elapsed time | Reveals where pipeline constraints absorb AI speed gains |
| R&D allocation | % engineering time on new capabilities vs. maintenance | Shows whether AI-freed time flows to value creation or expanded maintenance |

The R&D allocation ratio captures whether time freed from AI-assisted coding is redirected to higher-value work or absorbed by expanded maintenance burdens. Teams comparing measurement systems can review enterprise AI coding tools for implementation patterns beyond raw output counts.
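
A minimal sketch of how the four categories might be computed from a quarterly export; every field name and number below is a hypothetical placeholder rather than a prescribed schema:

```python
# Hypothetical quarterly rollup; swap in real exports from the team's trackers.
completed_features = 18          # epics/features shipped this quarter
engineer_months = 14 * 3         # 14 engineers over a 3-month quarter

deployments = 120
failed_changes = 9               # deployments causing incidents or rollbacks

active_work_days = 260           # hands-on days across all shipped items
elapsed_days = 640               # open-to-done elapsed days across the same items

new_capability_days = 1900       # engineer-days on new capabilities
maintenance_days = 1400          # engineer-days on maintenance, upgrades, toil

metrics = {
    "feature completion per engineer-month": completed_features / engineer_months,
    "change failure rate": failed_changes / deployments,
    "flow efficiency": active_work_days / elapsed_days,
    "R&D allocation (new work share)": new_capability_days / (new_capability_days + maintenance_days),
}

for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Tracking the four numbers side by side is the point: a quarter where completion per engineer-month rises while change failure rate also rises is the DORA 2025 pattern, not a win.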

Explore how Cosmos coordinates agents and review across the SDLC so delivery throughput and stability move together instead of against each other.

Try Cosmos

Free tier available · VS Code extension · Takes 2 minutes


Review Bottleneck Reduction: Is the Human Queue Actually Shrinking?

Review bottleneck reduction measures whether AI deployment decreases the time code spends waiting for human review, because faster authoring only improves delivery when the queue also shrinks. The evidence in this section indicates that many organizations are shifting the constraint downstream. If the queue grows, the organization has shifted the bottleneck rather than removed it.

A three-tier measurement stack captures review queue health:

| Measurement layer | Metric | Signal |
| --- | --- | --- |
| Queue health | Time-to-first-review (p50 and p90) | Growing p90 even with stable p50 signals a tail of neglected PRs |
| Queue health | PR cycle time (open to merge) | Any sustained increase indicates bottleneck formation |
| Queue health | Review queue depth (open PRs awaiting first review, weekly snapshot) | Growing trend week-over-week means PR creation outpaces review capacity |
| Queue health | PRs opened vs. merged ratio | Widening gap confirms the review queue is the system constraint |
| Senior engineer load | PRs reviewed per person per week, segmented by seniority | Shows whether review burden is concentrating by role |
| Senior engineer load | Percentage of AI-generated PRs assigned to senior or staff engineers vs. distributed across the team | Shows whether AI review work is being pushed to the most experienced engineers |
| Senior engineer load | Coding-to-reviewing time ratio by role | Shows whether senior capacity is being reallocated from building to reviewing |
| System-level validation (DORA Four Keys) | Lead time for changes | Validates whether review improvements reach delivery |
| System-level validation (DORA Four Keys) | Deployment frequency | Validates whether review improvements reach production cadence |
| System-level validation (DORA Four Keys) | Change failure rate | Validates whether higher throughput is degrading stability |
| System-level validation (DORA Four Keys) | Mean time to recovery | Validates whether downstream reliability is holding after faster authoring |

When PR merge volume increases 98% but these four metrics remain unchanged, review queues and downstream pipeline constraints are absorbing the throughput gain before it reaches production.
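
A minimal sketch of the queue-health tier, assuming PR records exported from whichever source-control API the team uses; the record shape, helper names, and values are hypothetical:

```python
from datetime import datetime

# Hypothetical PR export: opened, first review (None if still waiting), merged (None if open).
prs = [
    {"opened": datetime(2026, 4, 1, 9),  "first_review": datetime(2026, 4, 1, 15), "merged": datetime(2026, 4, 2, 11)},
    {"opened": datetime(2026, 4, 2, 10), "first_review": datetime(2026, 4, 5, 16), "merged": datetime(2026, 4, 7, 9)},
    {"opened": datetime(2026, 4, 3, 14), "first_review": None, "merged": None},
]

def hours(delta):
    return delta.total_seconds() / 3600

def percentile(values, pct):
    """Nearest-rank percentile; adequate for a weekly dashboard signal."""
    ranked = sorted(values)
    k = min(len(ranked) - 1, int(round(pct / 100 * (len(ranked) - 1))))
    return ranked[k]

waits = [hours(p["first_review"] - p["opened"]) for p in prs if p["first_review"]]
cycles = [hours(p["merged"] - p["opened"]) for p in prs if p["merged"]]

print("time-to-first-review p50:", percentile(waits, 50), "h")
print("time-to-first-review p90:", percentile(waits, 90), "h")       # watch the tail
print("PR cycle time p50:", percentile(cycles, 50), "h")
print("queue depth (awaiting first review):", sum(1 for p in prs if p["first_review"] is None))
print("opened:merged ratio:", len(prs) / max(1, len(cycles)))        # widening gap = constraint
```

The tiering matters: if p90 time-to-first-review and queue depth grow while the DORA four keys stay flat, the review queue, not authoring speed, is the binding constraint.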

Cosmos, currently in public preview, is Augment Code's operating system for agentic software development: a platform where developers, agents, codebases, tools, and memory coordinate across the SDLC rather than firing one agent at a time into an unchanged review pipeline. Within Cosmos, deep code review and human-in-the-loop policies shift defect detection earlier in the loop, so review attention concentrates at three checkpoints (prioritization, spec and intent review, and contextual code evolution) instead of spreading thinly across every PR. That structural change is what allows authoring volume gains to reach production instead of stalling in the review queue.

Spec Alignment Rate: Does Agent Output Match What Was Planned?

Spec alignment rate measures the degree to which AI-generated code fulfills its stated requirements, because executable code can still conflict with the intended work and create hidden rework. Output can compile, pass tests, and still miss the requirement, so this section focuses on requirement fit beyond functional correctness alone. LLM hallucination research examines cases where generated code conflicts with stated requirements.

Compilation and test passage do not indicate specification compliance. A coding agent benchmark evaluated execution correctness and specification compliance as complementary, separately assessed metrics, suggesting that functional correctness alone does not capture full task adherence.

Constitutional constraints research discusses using constitutional constraints and traceability from principles to code. One reported pattern is that limiting spec principles per AI request to 3-5 relevant items achieves 96% compliance, compared to 78% when the full specification is included, though compliance rates by specification strategy are not yet firmly established.

Engineering teams can apply three measurement approaches:

  1. Semantic alignment scoring: Compute cosine similarity between requirement text embeddings and corresponding generated code embeddings, per traceability research. Track as Trace Coverage: the percentage of AI-generated code blocks with valid traceability identifiers. A minimal sketch follows this list.
  2. Spec-derived test pass rate: Run BDD or acceptance tests derived from the specification against the implementation, per the spec-driven paper. Track whether violations trigger code fixes or spec revisions, since each reveals a different root cause.
  3. Rework rate on accepted code: A longitudinal analysis cites research finding that of AI-suggested code initially accepted, 18.16% is later deleted and 6.62% is heavily rewritten. These figures are directional and remain independently unverified, but the pattern is consistent: initial acceptance does not mean specification compliance.
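
A minimal sketch of the first approach, semantic alignment scoring plus Trace Coverage. The embed() stub below is a placeholder for whatever embedding model the team already uses, and the requirement text, code snippet, and traceability IDs are illustrative:

```python
import math

def embed(text: str) -> list[float]:
    # Placeholder: in practice this calls the team's embedding model.
    return [float(len(word)) for word in text.split()[:8]] + [0.0] * 8

def cosine(a: list[float], b: list[float]) -> float:
    n = min(len(a), len(b))
    dot = sum(x * y for x, y in zip(a[:n], b[:n]))
    norm = math.sqrt(sum(x * x for x in a[:n])) * math.sqrt(sum(y * y for y in b[:n]))
    return dot / norm if norm else 0.0

requirement = "Retry failed payment webhooks up to three times with exponential backoff"
generated_code = "def retry_webhook(event): ...  # REQ-142: retries with backoff"

alignment = cosine(embed(requirement), embed(generated_code))
print(f"semantic alignment score: {alignment:.2f}")   # track per requirement over time

# Trace Coverage: share of AI-generated code blocks carrying a valid requirement ID.
ai_blocks = [{"id": "REQ-142"}, {"id": None}, {"id": "REQ-098"}]
trace_coverage = sum(1 for b in ai_blocks if b["id"]) / len(ai_blocks)
print(f"trace coverage: {trace_coverage:.0%}")
```

The alignment score is only meaningful as a trend per requirement over time, not as an absolute threshold.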

The Steering Ratio: Human Time on Judgment vs. Human Time on Execution

The steering ratio measures the distribution of human engineering time between judgment-heavy work and manual execution, because agent adoption changes where attention is spent before it changes output. Human attention must stay concentrated on judgment-heavy decisions if agent output is to improve delivery outcomes. No validated empirical optimal ratio exists in the peer-reviewed literature, and this section treats steering ratio as an emerging measurement approach rather than a settled benchmark.


A Springer HAI chapter distinguishes human-centric, symbiotic, and AI-centric modes corresponding to different distributions of workload. A CHI 2026 paper identifies that step-by-step human approval across all decisions "can severely limit agent autonomy when desirable and may also disengage the user if their only form of supervision is repetitive." The optimal steering ratio is a dynamic allocation that varies by task type and risk level.

| Steering question | What to measure | What the signal means |
| --- | --- | --- |
| Is judgment staying with engineers? | ADRs and design documents created per sprint relative to AI-generated code accepted per sprint | Declining artifact creation relative to rising agent output signals a shift toward cognitive offloading rather than deliberate steering |
| Is higher throughput still controlled? | Throughput paired with change failure rate | Throughput rising with CFR stable or declining indicates effective human steering; throughput rising with CFR also rising signals that execution volume has outpaced judgment quality |
| Is oversight calibrated to risk? | Proportion of irreversible decisions that received explicit human review | Irreversible decisions should receive human review before implementation, while reversible decisions can be delegated |

Track steering artifacts relative to agent output. Monitor architecture decision records (ADRs) and design documents created per sprint relative to AI-generated code accepted per sprint. Declining artifact creation relative to rising agent output may reflect changing patterns of how users rely on AI systems.

Pair change failure rate with throughput. Throughput rising with CFR stable or declining indicates effective human steering of AI output. Throughput rising with CFR also rising signals that execution volume has outpaced judgment quality, the same pattern DORA 2025 identifies at industry scale.

Calibrate oversight by decision reversibility. Irreversible decisions require human review before implementation. Reversible decisions can be delegated to agents. Measure the proportion of irreversible decisions that received explicit human review.
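
A minimal sketch of how the three signals could be tracked per sprint; the field names and values are hypothetical placeholders pulled from whatever tracking already exists:

```python
# Hypothetical per-sprint counts; replace with real exports from the team's trackers.
sprints = [
    {"adrs_and_designs": 6, "ai_prs_accepted": 40, "throughput": 11, "cfr": 0.06,
     "irreversible_decisions": 4, "irreversible_reviewed": 4},
    {"adrs_and_designs": 3, "ai_prs_accepted": 95, "throughput": 14, "cfr": 0.11,
     "irreversible_decisions": 5, "irreversible_reviewed": 3},
]

for i, s in enumerate(sprints, start=1):
    steering_artifacts_per_ai_pr = s["adrs_and_designs"] / s["ai_prs_accepted"]
    review_coverage = s["irreversible_reviewed"] / s["irreversible_decisions"]
    print(f"sprint {i}: artifacts/AI-PR={steering_artifacts_per_ai_pr:.2f}, "
          f"throughput={s['throughput']}, CFR={s['cfr']:.0%}, "
          f"irreversible review coverage={review_coverage:.0%}")

# Declining artifacts/AI-PR while throughput and CFR both rise is the pattern to
# escalate: execution volume is outpacing judgment quality.
```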

Cosmos enforces where human judgment is required through configurable human-in-the-loop escalation policies, so engineering attention concentrates on irreversible decisions while routine, reversible work is delegated to agents. Teams configure the policies once and Cosmos routes execution through agents while flagging the checkpoints that require explicit review, which is the structural difference between agent activity and steered delivery.

Start Replacing Activity Metrics This Quarter

Replacing activity metrics this quarter is the practical next step because dashboards can improve while review queues, stability, and maintenance burden worsen underneath them.

Establish a baseline for delivery throughput, stability, review queue health, and R&D allocation before expanding agent usage further, then compare those system-level metrics against adoption and PR volume so the team stops treating activity as proof of value. The work breaks down into three concrete moves:

  1. Baseline delivery throughput and stability.
  2. Measure review queue health.
  3. Compare those metrics against adoption and PR volume.

Cosmos shortens the gap between activity metrics and delivery metrics by moving deep code review, human-in-the-loop policies, and shared system services earlier in the SDLC, so defects surface before final-stage review instead of accumulating until the end of the loop.

Because Cosmos coordinates agents, the codebase, and review attention as a single system, the metrics that matter (delivery throughput, change failure rate, and review queue health) move together rather than against each other.

See how Cosmos coordinates agents, review, and human judgment so activity gains turn into shipped, stable features.

Try Cosmos

Free tier available · VS Code extension · Takes 2 minutes


Written by

Molisha Shah

Molisha is an early GTM and Customer Champion at Augment Code, where she focuses on helping developers understand and adopt modern AI coding practices. She writes about clean code principles, agentic development environments, and how teams are restructuring their workflows around AI agents. She holds a degree in Business and Cognitive Science from UC Berkeley.

