We benchmarked 7 AI code review tools on large open-source projects. Here are the results.

December 11, 2025

TL;DR

We evaluated seven of the most widely used AI code review tools on the only public benchmark for AI-assisted code review. Augment Code Review, powered by GPT-5.2, delivered the strongest performance by a significant margin. Its reviews showed both higher precision and substantially higher recall than those of the other tools, driven by a uniquely powerful Context Engine that consistently retrieved the right files, dependencies, and call sites. While many tools generate noisy comments or miss critical issues due to limited context, Augment stood out as the only system that reliably reasoned across the full codebase and surfaced the kinds of problems human reviewers actually care about. In short: if you want AI reviews that feel like a senior engineer rather than a lint bot, Augment Code Review is the clear leader, and it is generally available to all Augment Code customers today.

The real test: catching bugs without creating noise

AI code review tools are marketed as fast, accurate, and context-aware. But developers know the real test is whether these systems catch meaningful issues without overwhelming PRs with noise. To understand how these tools perform in practice, we evaluated seven leading AI review systems using the only public dataset for AI code review. The results make clear that while many tools struggle with noisy or incomplete analysis, one tool—Augment Code Review—stands significantly above the rest.

How code review quality is measured

A review comment is considered correct if it matches a golden comment: a ground-truth issue that a competent human reviewer would be expected to catch. Golden comments reflect real correctness or architectural problems, not stylistic nits.

Each tool’s comments are labeled as:

  • True positives: match a golden comment
  • False positives: incorrect or irrelevant comments
  • False negatives: golden comments the tool missed

From these classifications we compute:

  • Precision: the fraction of a tool’s comments that are correct (TP / (TP + FP)), i.e. how trustworthy the tool is
  • Recall: the fraction of golden comments the tool catches (TP / (TP + FN)), i.e. how comprehensive it is
  • F-score: the harmonic mean of precision and recall, summarizing overall quality

High precision keeps developers engaged; high recall is what makes a tool genuinely useful. Only systems with strong context retrieval can achieve both.
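To make the arithmetic concrete, here is a minimal sketch of how these metrics fall out of the labeled comments; the counts below are hypothetical and not taken from the benchmark.

def review_metrics(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F-score from labeled review comments."""
    precision = tp / (tp + fp)  # share of the tool's comments that are correct
    recall = tp / (tp + fn)     # share of golden comments the tool caught
    f_score = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f_score

# Hypothetical tool: 20 comments posted, 12 match golden comments, 6 golden comments missed.
precision, recall, f_score = review_metrics(tp=12, fp=8, fn=6)
print(f"precision={precision:.0%} recall={recall:.0%} f-score={f_score:.0%}")
# -> precision=60% recall=67% f-score=63%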

Benchmark results

Here are the results, sorted by F-score—the single best measure of overall review quality:

Tool                  Precision  Recall  F-score
Augment Code Review      65%       55%     59%
Cursor Bugbot            60%       41%     49%
Greptile                 45%       45%     45%
Codex Code Review        68%       29%     41%
CodeRabbit               36%       43%     39%
Claude Code              23%       51%     31%
GitHub Copilot           20%       34%     25%
Figure: scatter plot of precision versus recall for each tool, with F-score indicated by circle size. Augment Code Review sits closest to the high-precision, high-recall corner, while the other tools cluster at lower performance.

Augment Code Review delivers the highest F-score by a meaningful margin, and importantly, it is one of the very few tools that achieves both high precision and high recall. Achieving this balance is extremely difficult: tools that push recall higher often become noisy, while tools tuned for precision usually miss a significant number of real issues. For example, Claude Code reaches roughly 51% recall, close to Augment’s, but its precision is much lower, leading to a high volume of incorrect or low-value comments. This signal-to-noise tradeoff is the core challenge in AI code review.

Developers will not adopt a tool that overwhelms PRs with noise. By maintaining strong precision while also achieving the highest recall in the evaluation, Augment provides materially higher signal and a far more usable review experience in practice.

Why recall is the hardest frontier, and why Augment leads

Precision can be dialed up with filtering and conservative heuristics, but recall requires something fundamentally harder: correct, complete, and intelligent context retrieval.

Most tools fail to retrieve:

  • dependent modules needed to evaluate correctness
  • type definitions influencing nullability or invariants
  • caller/callee chains across files
  • related test files and fixtures
  • historical context from previous changes

These gaps lead to missed bugs—not because the model can’t reason about them, but because the model never sees the relevant code.
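As a hypothetical illustration (the files and names below are invented, not taken from the benchmark), consider a diff that touches only one file while the information needed to judge it lives in another:

# models.py -- NOT part of the diff, but essential context
from dataclasses import dataclass
from typing import Optional

@dataclass
class User:
    email: Optional[str]  # None for invited users who have not registered yet

def find_user(user_id: int) -> Optional[User]:
    """Returns None when no user exists with this id."""
    return None  # stubbed out for illustration

# handlers.py -- the only file shown in the diff under review
def send_email(to: str, subject: str) -> None:
    """Stub mailer so the sketch is self-contained."""

def send_receipt(user_id: int) -> None:
    user = find_user(user_id)
    # Bug: both `user` and `user.email` may be None, but nothing in this file
    # says so. A reviewer that never retrieves models.py sees a plausible call
    # and misses the potential AttributeError.
    send_email(user.email.lower(), subject="Your receipt")

A reviewer given only the handlers.py hunk has no way to know the call can fail; the bug becomes visible only once the Optional return type and field definition are in context.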

Augment Code Review succeeds because it consistently surfaces the right context.

Our retrieval engine pulls in the exact set of files and relationships necessary for the model to reason about cross-file logic, API contracts, concurrency behavior, and subtle invariants. This translates directly into higher recall without sacrificing precision.

Why some tools perform better (and why Augment performs best)

Across all seven tools, three factors determined performance—and Augment excelled in each.

1. A superior Context Engine

This is the differentiator. Augment consistently retrieved the correct dependency chains, call sites, type definitions, tests, and related modules—the raw material needed for deep reasoning. No other system demonstrated comparable accuracy or completeness in context assembly.
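For a sense of what even the most naive version of this retrieval looks like, here is a toy sketch under our own simplifying assumptions, not a description of Augment’s Context Engine: grep the repository for every symbol the diff touches and collect the files that mention them.

import subprocess

def gather_candidate_context(repo_root: str, changed_symbols: list[str]) -> set[str]:
    """Toy context retrieval: list tracked files that mention any changed symbol.

    A real context engine goes far beyond this (dependency graphs, type
    resolution, call-site analysis, ranking), but even this crude step shows
    that reviewing a diff requires looking outside the diff itself.
    """
    candidates: set[str] = set()
    for symbol in changed_symbols:
        # `git grep -l -e <symbol>` prints the paths of tracked files containing the symbol.
        result = subprocess.run(
            ["git", "-C", repo_root, "grep", "-l", "-e", symbol],
            capture_output=True,
            text=True,
        )
        candidates.update(path for path in result.stdout.splitlines() if path)
    return candidates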

2. Best combination of model, prompts, and tools

Starting with a strong underlying agent is a key requirement for good code review. A well-designed agent loop, careful context engineering, specialized agent tools, and thorough evaluations go a long way toward building agents that know how to navigate your codebase and external resources such as the web, and collect the information needed for a comprehensive review.

3. Purpose-built code review tuning

Augment applies domain-specific logic to suppress lint-level clutter and focus on correctness issues. This keeps the signal high while avoiding the spammy behavior common in other tools.

Augment Code Review is also tuned over time. We can measure whether each comment posted by Augment is addressed by a human developer, and this data helps us specialize and tune our agent tools, prompts, and context to continually improve our code review service.
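One simple way such a signal could be derived (a hypothetical sketch with invented attribute names, not necessarily how Augment computes it) is to check whether any commit pushed after a comment modifies the lines that comment points at:

def comment_was_addressed(comment, later_commits) -> bool:
    """Hypothetical heuristic: a review comment counts as addressed if a commit
    pushed after it modifies the line range it points at.

    Assumed (invented) shapes: `comment` exposes .path, .start_line, .end_line,
    and .created_at; each commit exposes .timestamp and
    .changed_line_ranges(path), returning a list of (start, end) tuples.
    """
    for commit in later_commits:
        if commit.timestamp <= comment.created_at:
            continue  # only commits pushed after the comment can address it
        for start, end in commit.changed_line_ranges(comment.path):
            # Overlapping line ranges suggest the developer acted on the comment.
            if start <= comment.end_line and end >= comment.start_line:
                return True
    return False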

Together, these factors produce the highest precision, the highest recall, and the strongest overall F-score.

The benchmark dataset

The benchmark spans 50 pull requests across five multi-million-line open-source codebases: Sentry, Grafana, Cal.com, Discourse, and Keycloak. These repositories represent real-world engineering complexity: multi-module architectures, cross-file invariants, deep dependency trees, and nontrivial test suites. Evaluating AI reviewers on this kind of code is the only way to determine whether they behave like senior engineers or like shallow linters.

How we improved the dataset

The original public dataset was invaluable, but incomplete. Many PRs contained multiple meaningful issues that were missing from the golden set, making recall and precision impossible to measure accurately. We expanded and corrected the golden comments by reviewing each PR manually, verifying issues, and validating them against tool outputs. We also adjusted severity so that trivial suggestions do not inflate scores or penalize tools unfairly.

All corrected data and scripts are open-source on GitHub.

Conclusion

AI code review is moving fast, but the gap between tools is wider than marketing pages suggest. Most systems struggle to retrieve the context necessary to catch meaningful issues, leading to low recall and reviews that feel shallow or noisy. The defining challenge in AI code review isn’t generation—it’s context: assembling the right files, dependencies, and invariants so the model can reason like an experienced engineer.

Augment Code Review is the only tool in this evaluation that consistently meets that standard. Our Context Engine enables recall far above the rest of the field, while strong precision keeps the signal high. Augment produces reviews that feel substantive, architectural, and genuinely useful, closer to a senior teammate than an automated assistant. As codebases grow and teams demand deeper automation, the tools that master context will define the next era of software development. By that measure, Augment Code is already well ahead.

Akshay Utture

Akshay Utture builds intelligent agents that make software development faster, safer, and more reliable. At Augment Code, he leads the engineering behind the company’s AI Code Review agent, bringing research-grade program analysis and modern GenAI techniques together to automate one of the most time-consuming parts of the SDLC. Before Augment, Akshay spent several years at Uber advancing automated code review and repair systems, and conducted research across AWS’s Automated Reasoning Group, Google’s Android Static Analysis team, and UCLA. His work sits at the intersection of AI, software engineering, and programming-language theory.
