The only durable moat in AI code review is review quality.
Features, UX, and pricing matter — but none of them matter if developers don’t trust the feedback. If a tool produces noisy comments or misses real bugs, engineers quickly learn to ignore it.
We believe the long-term development workflow will shift toward AI-native review where
- Humans review specifications and architecture, and
- AI reviews implementation details in pull requests.
This architecture reflects a broader shift we’re seeing in AI-native engineering teams. Humans are increasingly responsible for defining intent — specifications, architecture, and constraints — while agents handle the detailed execution work. Code review is one of the clearest examples of this shift.
But this model only works if one condition is met:
AI code review must outperform the average developer reviewer.
Developers need to trust that the agent will consistently catch real issues without producing noisy or incorrect feedback. When that bar is met, AI review naturally becomes the default layer of inspection for pull requests.
Augment Code Review quality
On independent benchmarks, Augment Code Review ranks first or second among 12 popular AI code review tools: #1 on the offline Code Review Bench (results below) and #2 on Qodo's benchmark.
| Metric | Augment | Next best |
|---|---|---|
| F1 score | 53.8% (#1) | BugBot: 44.9% |
| Recall | 62.8% (#1) | Copilot: 53.3% |
| Precision | 47.0% (#2) | Graphite: 75.0%* |
*Graphite achieved higher precision but found only 12 of 137 bugs (8.8% recall), generating 0.3 comments per PR. Augment generates 3.6 comments per PR, making it the highest-precision system among tools that produce meaningful review coverage.
Compared to human reviewers (based on production data):
| Metric | Augment | Human |
|---|---|---|
| Bugs fixed (comment posted and addressed by author) per PR | 1.03 | 0.54 |
| True positive rate (comments addressed) | 45% | 50% |
| Most common comment category | correctness bug | nitpick / meta comment |
Augment prevents more bugs than human reviewers while maintaining a comparable true-positive rate. This raises a natural question:
What does it actually take to build a high-quality AI code review agent?
In our experience, three components are essential:
- Context beyond the pull request
- Careful agent system design
- Rigorous evaluation loops
The rest of this post walks through how these pieces come together.
Context beyond the PR
Most AI code review tools operate almost entirely on the PR diff, and rely on pattern-based grep-search to gather relevant code context outside the diff.
That approach breaks down quickly in large, messy codebases.
Why diff-only review fails
Consider a simple pull request that removes a permission check for Service A:

Determining whether the change is correct requires answering several questions:
- How is authentication handled in this repository?
- Which other services interact with Service A?
- How do those services validate tokens issued by Service A?
- Are there any implicit security assumptions about that permission?
These answers rarely exist in the diff itself and can only be partially answered by using grep-based tools on surrounding tokens.
Code context: Augment’s Context Engine
A high-quality review requires retrieving code such as:
- Token validation logic in other services
- Auth middleware
- Historical patterns of permission checks
- Related APIs and service integrations
Augment’s Context Engine acts as a semantic code search system that can answer questions like these and pull in the relevant snippets with high precision. So instead of guessing, the agent is able to reason about the change with all necessary information available.
Augment agents natively come with the context engine tool. For teams not using Augment directly, the same capability can be accessed through the Augment Context MCP, which exposes the context engine as an MCP service usable by other agents.
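To make the idea concrete, here is a toy sketch of what semantic code search looks like at its core: embed code chunks and questions into a shared vector space, then return the nearest chunks. Augment's Context Engine is far more sophisticated; the `embed` function below is a deliberately crude stand-in, not how the real system works.

```python
import math

def embed(text: str) -> list[float]:
    # Toy bag-of-letters embedding; a real system uses a code-trained model.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def search(question: str, chunks: list[str], k: int = 3) -> list[str]:
    # Rank code chunks by similarity to the question and return the top k.
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]
```

The key property, even in this toy version, is that retrieval is driven by meaning rather than exact token matches, which is what lets the agent answer questions like "how do other services validate tokens issued by Service A?"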
The most important context: what’s not in the code
Many of the highest-impact review comments rely on knowledge that isn’t present in the repository at all. Examples include:
- “Never log sensitive fields of type X.”
- “All API responses must remain backward compatible.”
- “This subsystem cannot make synchronous network calls.”
This type of tribal knowledge often lives in:
- Slack discussions and Google docs
- Incident retrospectives
- Senior engineers’ heads
If it isn’t encoded somewhere structured and agent-accessible, AI cannot reason about it.
The cultural shift
Teams building AI-native workflows must adopt a simple rule:
Any recurring review issue tied to tribal knowledge should be documented in guidelines.
These guidelines become machine-readable constraints for the code review agent.
Augment incorporates this context into the user prompt through:
- Repository-level review guidelines
- Hierarchical directory-scoped guidelines
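For illustration, a directory-scoped guidelines file might look like the snippet below. The path and format are hypothetical, not Augment's actual schema; the point is that each rule is concrete enough for an agent to check.

```markdown
<!-- hypothetical file: services/billing/REVIEW_GUIDELINES.md -->
- Never log fields tagged `@Sensitive`; flag any new log statement that touches them.
- All public API responses must remain backward compatible; removing or renaming a field requires a version bump.
- Code in this directory must not make synchronous network calls from request handlers.
```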
This piece is different from all the others in this article because it is driven more by the quality of users' inputs than by the tool's design.
Agent design: tools, prompts, models, guardrails
Even with strong context retrieval, review quality depends heavily on the design of the agent system. We found that four parts of the system had the biggest impact on review quality:
- The tools the agent uses to gather context and post reviews
- The system prompt that defines the review philosophy
- The model–prompt pairing
- The guardrails around the agent’s behavior
The diagram below shows what the core agent-harness looks like:

Tools
The review agent needs to navigate a repository the way a human reviewer would. To enable that, we designed a set of tools that lets the agent explore code safely and efficiently.
- Tools for semantic code-context retrieval, file browsing, and symbol search.
- Minimal functional overlap between tools to avoid confusing the model about which one to invoke.
- Deterministic injection of large inputs like PR diffs and existing review comments, rather than retrieving them through tools.
- MCP tool integrations to pull in context from systems like Linear, Jira, and Notion.
- Observability around tool usage so we can understand how the agent navigates a repository during review.
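As a sketch, a minimal non-overlapping toolset could be registered like this. The tool names, descriptions, and stub implementations are illustrative, not Augment's actual API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Tool:
    name: str
    description: str  # shown to the model; keep tool scopes disjoint
    run: Callable[[str], str]

# Stub implementations so this sketch is self-contained.
def semantic_search(query: str) -> str:
    return f"snippets relevant to: {query}"

def read_file(path: str) -> str:
    return f"contents of {path}"

def symbol_lookup(symbol: str) -> str:
    return f"definition and references of {symbol}"

TOOLS = {
    t.name: t
    for t in [
        Tool("semantic_search",
             "Retrieve code relevant to a natural-language question.",
             semantic_search),
        Tool("read_file", "Return one file's contents by path.", read_file),
        Tool("symbol_lookup",
             "Find a symbol's definition and references.", symbol_lookup),
    ]
}

# Large inputs (the PR diff, existing review comments) are injected into
# the prompt deterministically instead of being fetched via tool calls.
```

Note that each tool has one clearly distinct job; overlapping scopes are what confuse the model about which tool to invoke.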
Prompts
Our system prompt defines the review philosophy. The most important role of the prompt is tuning the precision vs. recall tradeoff, which determines whether the agent produces a few highly reliable comments or attempts to catch every possible issue.
A system prompt should:
- Pick one side of the tradeoff: either focus on a few high-signal comments, or on a thorough review that catches every issue.
- Specify which comment categories to avoid (e.g., stylistic or nitpick comments).
- Include recommended tool usage patterns.
- Outline the steps in the review workflow.
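An illustrative excerpt of such a system prompt, tuned toward the high-precision end of the tradeoff (a hypothetical example, not Augment's actual prompt):

```text
You are a code review agent. Only comment when you are confident the
issue is real; a missed nitpick is acceptable, a false alarm is not.
Focus on correctness, security, and backward-compatibility issues.
Do not post stylistic or nitpick comments.
Workflow: (1) read the injected PR diff, (2) use semantic search to
gather context outside the diff, (3) verify each candidate issue
against that context, (4) post only the comments that survive step 3.
```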
Models
Models differ in how they:
- Interpret instructions
- Use tools
- Trade off precision vs recall
This means prompts, toolsets, and guardrails must often be tuned for each model. Treating the model as a drop-in component rarely produces high-quality results. High review quality requires continuous benchmarking and careful pairing between models, tools, and prompts. For Augment Code Review, the GPT model series has consistently performed the best so far.
Guardrails
Agents can occasionally do unexpected things, like commenting on the wrong PR or modifying the PR description. To prevent this, we implemented several guardrails:
- Narrow tool operations (e.g., a code review agent's GitHub tool shouldn't be able to push new commits)
- Restricted shell access
- Making as many components deterministic as possible (such as retrieving PRs, constructing API calls, etc.)
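A sketch of the first guardrail: wrap the GitHub client in a narrow tool that only exposes review-safe operations. The operation names here are illustrative, not a real client's method names.

```python
# Operations the review agent is allowed to perform (illustrative names).
ALLOWED_OPS = {"get_pull_request", "list_review_comments", "post_review_comment"}

class NarrowGitHubTool:
    """Expose only review-safe operations of a full-capability client."""

    def __init__(self, client):
        self._client = client  # the full-capability client stays private

    def call(self, op: str, **kwargs):
        if op not in ALLOWED_OPS:
            # e.g. "push_commit" or "update_pr_description" are rejected here
            raise PermissionError(f"operation {op!r} not permitted for review agent")
        return getattr(self._client, op)(**kwargs)
```

The design choice is to enforce the restriction outside the model: even if the agent decides to push a commit, the tool layer cannot express that operation.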
Corner cases
There are a variety of corner cases to deal with, including:
- Large pull requests, affected by context rot and context window limits
- Cross-repo context (can be done via Augment Context MCP)
- Non-code files
- Forked-repo workflows
- Running subsequent rounds of review: doing incremental reviews and dealing with existing review comments
Evals: the non-negotiable foundation
You cannot improve quality without measurement. We rely on two complementary evaluation systems:
- Offline benchmarks
- Online production metrics
Offline evals: fast iteration
Offline evaluations allow rapid iteration on agent design changes such as new prompts, new tools, or new models. They run locally and provide fast feedback before shipping improvements to production.
Code review benchmark
- 10 PRs from 5 open-source repositories (millions of LOC)
- Ground truth ("golden comments"):
  - Bugs found by human reviewers
  - Manually annotated valid issues that were found by agents but missed by human reviewers
We invested a lot of time upfront manually verifying that our set of golden comments is as close to perfect as possible.
The offline evaluation is run by:
- Creating fresh copies of each PR
- Running the code review tool on these
- Comparing generated comments to the golden comments using an LLM-as-judge
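The scoring side of that loop can be sketched as follows; `run_reviewer` and `judge_match` stand in for the real review tool and the LLM-as-judge, and are passed in so the scoring logic itself is testable.

```python
def evaluate(prs, golden, run_reviewer, judge_match):
    """Score generated comments against golden comments.

    prs: PR identifiers (fresh copies of each benchmark PR)
    golden: dict mapping PR id -> list of golden comments
    """
    tp = fp = matched_golden = 0
    for pr in prs:
        found = set()  # indices of golden comments matched on this PR
        for comment in run_reviewer(pr):
            hits = [i for i, g in enumerate(golden[pr])
                    if judge_match(comment, g)]
            if hits:
                tp += 1
                found.update(hits)
            else:
                fp += 1  # comment matches no golden issue: noise
        matched_golden += len(found)
    total = sum(len(g) for g in golden.values())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = matched_golden / total if total else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Tracking matched golden comments separately from true positives keeps recall honest when two generated comments hit the same golden issue.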
Measuring quality in offline benchmarks
Three metrics matter most.
| Metric | Definition | Why it matters |
|---|---|---|
| Precision | True positives / (TP + FP) | Developers ignore tools with poor precision |
| Recall | Percentage of expected bugs caught | Ensures meaningful review coverage |
| F-score | Harmonic mean of precision and recall | Best single metric for optimizing quality |
F-score acts as the primary hill-climbing metric for offline improvements.
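As a sanity check, the benchmark table earlier in this post is internally consistent: Augment's reported F1 is the harmonic mean of its precision and recall.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Augment's figures from the benchmark table in this post.
print(f"F1 = {f1(0.470, 0.628):.1%}")  # 53.8%
```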
Offline benchmarks are fast — but they cannot perfectly represent production reality.
Online evals: measuring real-world performance
Production metrics answer the question:
Is the agent actually helping real teams catch bugs?
We monitor several signals; the three most important are below.
| Metric | Analogous offline metric | Purpose |
|---|---|---|
| Bugs fixed per PR | Recall | Measures real-world bug prevention and review coverage |
| Percentage of comments addressed | Precision | Measures trust and signal-to-noise of review comments |
| Distribution of comments | None | Ensures feedback skews toward high-impact issue types |
Bringing it all together
For decades, code review has been one of the most human parts of the software development lifecycle.
But as AI systems gain access to richer code context and stronger evaluation loops, something surprising is happening:
Review is becoming one of the first outer-loop engineering tasks where agents consistently outperform humans.
That shift changes how teams structure the development workflow. Humans move earlier in the process — defining specifications and architecture — while agents handle the detailed inspection of code changes.
In other words: code review may be the first truly AI-native step in the outer-loop of the SDLC.
Written by

Akshay Utture
Akshay Utture builds intelligent agents that make software development faster, safer, and more reliable. At Augment Code, he leads the engineering behind the company’s AI Code Review agent, bringing research-grade program analysis and modern GenAI techniques together to automate one of the most time-consuming parts of the SDLC. Before Augment, Akshay spent several years at Uber advancing automated code review and repair systems, and conducted research across AWS’s Automated Reasoning Group, Google’s Android Static Analysis team, and UCLA. His work sits at the intersection of AI, software engineering, and programming-language theory.
