Your team merged 30 PRs yesterday. How many did you actually read?
Not skim. Read. Understood the tradeoffs, checked the edge cases, thought about how the change interacts with code three directories away.
If you're honest, the number is small. Maybe two or three. Maybe zero.
That’s not a discipline problem. It’s a math problem.
AI-generated pull requests now account for a huge share of committed code, and teams with high AI adoption are merging dramatically more PRs than they were a year ago. Review time is up, PR volume is up, and AI agents are touching more and more of your stack.
You are producing code faster than you can understand it.

The bottleneck moved, and most teams didn’t notice
For decades, the constraint on shipping software was writing it. Getting the logic right, getting the types right, getting the tests to pass. Our entire workflow — sprints, code review, CI — was designed around that constraint.
That constraint is gone. Writing code is the cheap part now.
But most teams are still organized around the old bottleneck:
- Staring at diffs.
- Doing line-by-line reviews of code an agent wrote in forty seconds.
- Treating the pull request as the primary quality gate.
The data is already telling you this doesn’t work:
- AI-generated PRs wait far longer before anyone even picks them up for review.
- Once review starts, they move faster — but the acceptance rate is dramatically lower than for human-written code.
- Reviewers have learned to deprioritize them. Nearly 70% of AI PRs get rejected.
That’s not a tooling gap. Reviewers don’t trust the code, so they avoid it. The queue grows. The PRs age.

There’s a subtler problem too: AI-generated code can pass its own tests, because the same model wrote the tests to match the implementation rather than the spec. Traditional review assumes tests are an independent check. With agent-generated code, they often aren’t.
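The failure mode is easy to demonstrate. In this hedged sketch (the function and values are hypothetical), a discount function has an off-by-one bug; a test written by reading the implementation encodes the same mistake and passes, while a test derived from the spec catches it:

```python
# Hypothetical example: a discount function with an off-by-one bug.
# Spec: orders of 100 units or more get a 10% discount.
# Prices are in integer cents so the arithmetic stays exact.

def discounted_total_cents(units: int, unit_price_cents: int) -> int:
    total = units * unit_price_cents
    if units > 100:  # Bug: the spec says ">= 100"
        total = total * 9 // 10
    return total

# An "echo" test written by reading the implementation: it encodes the
# same off-by-one mistake, so it passes despite the bug.
assert discounted_total_cents(101, 100) == 9090

# A spec-derived test written from the requirement, not the code:
# 100 units should already qualify for the discount.
spec_holds = discounted_total_cents(100, 100) == 9000
print("spec test:", "pass" if spec_holds else "FAIL")  # → spec test: FAIL
```

The echo test and the implementation share a common ancestor, so agreement between them proves nothing. Only the test anchored to the spec is an independent check.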
Code review has always been a probabilistic filter. It was never perfect. In an AI-heavy world, it’s also no longer the main bottleneck.
What if you reviewed the plan instead?
Think about how you actually catch the important bugs:
- The architectural mistakes.
- The wrong abstraction.
- The feature that technically works but solves the wrong problem.
You catch those by understanding intent:
- What is this change trying to do?
- What are the constraints?
- What does success look like?
Those questions live upstream of the code, not in it.
When an agent writes code from a spec, you have a choice:
- Review the code — every line, every file, every diff — and hope you catch what matters in a wall of generated text.
- Review the spec, approve the plan, define the acceptance criteria, and let deterministic verification handle the rest.
One of these scales. The other doesn’t.
The job is writing contracts, not code
Here’s what a morning looks like when you work this way:
- You open a migration task.
- Instead of reading the diff an agent produced, you read the spec it worked from.
- The spec says:
- All monetary amounts use the `Money` type.
- No API endpoint returns raw error messages to clients.
- Every state transition in the checkout flow has an explicit test.
- The verification report says all three hold.
You’re done.
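In practice, "all three hold" means each clause of the spec maps to a mechanical check. A minimal sketch of such a report, where every check function is a hypothetical stand-in for a real analysis (a type-checker query, a lint rule, a test-inventory query):

```python
# Minimal sketch of a verification report: each spec clause maps to a
# check function. The bodies here are hypothetical stand-ins; in a real
# setup they would wrap a type checker, a custom linter, and a test query.

from typing import Callable

def money_type_used_everywhere() -> bool:
    return True  # stand-in for a type-checker query

def no_raw_errors_returned() -> bool:
    return True  # stand-in for a custom lint rule

def all_checkout_transitions_tested() -> bool:
    return True  # stand-in for a test-inventory query

SPEC: list[tuple[str, Callable[[], bool]]] = [
    ("All monetary amounts use the Money type", money_type_used_everywhere),
    ("No endpoint returns raw error messages", no_raw_errors_returned),
    ("Every checkout state transition has a test", all_checkout_transitions_tested),
]

def verification_report() -> bool:
    """Print one PASS/FAIL line per spec clause; return True only if all hold."""
    ok = True
    for clause, check in SPEC:
        passed = check()
        ok = ok and passed
        print(f"{'PASS' if passed else 'FAIL'}  {clause}")
    return ok

if __name__ == "__main__":
    verification_report()
```

The point isn't this particular scaffolding; it's that each spec clause is a named, executable predicate rather than a sentence a reviewer has to hold in their head.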
Compare that to the old morning:
- You open a 400-line diff.
- You skim the first hundred lines.
- You leave a comment about a variable name.
- You approve it because the tests pass and you have a meeting in ten minutes.

Which version actually caught the architectural mistake?
This isn’t about writing better prompts. Prompts are throwaway. The durable work is:
- Writing specs precise enough to be verified.
- Defining constraints tight enough to prevent drift.
- Encoding acceptance criteria that capture what “correct” means in your domain.
Those rules don’t live in a diff. They live in the contract between intent and implementation.
What this actually looks like in practice
Imagine a service migration with four agents working in parallel.
You spend ten minutes on the spec:
- Which endpoints move.
- Which contracts must not change.
- What breaks if you get it wrong.
- What invariants must always hold.
The agents:
- Generate the code.
- Run the tests.
- Check each other’s work against the spec.
One agent introduces a dependency you don’t allow in that service.
The spec has a constraint: no new runtime dependencies without explicit approval.
Verification catches it. The agent tries a different approach. The constraint holds. You never see the bad version.
If you’d been reviewing diffs, you might have caught that dependency — or you might not have, buried in line 340 of a 500-line PR.
The spec catches it every time, because it doesn’t rely on your attention span at 4pm on a Thursday.
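That dependency constraint is exactly the kind of rule that is cheap to make deterministic. A sketch, assuming the service's declared dependencies are available as a set of package names (the allowlist contents and the stray package are hypothetical):

```python
# Hedged sketch: enforce "no new runtime dependencies without approval"
# by diffing declared dependencies against an approved allowlist.

APPROVED = {"requests", "sqlalchemy", "pydantic"}  # hypothetical allowlist

def unapproved_dependencies(declared: set[str]) -> set[str]:
    """Return every declared dependency that is not on the allowlist."""
    return declared - APPROVED

# An agent's change introduces "leftpad"; verification flags it
# before any human ever sees the diff.
new_deps = {"requests", "pydantic", "leftpad"}
assert unapproved_dependencies(new_deps) == {"leftpad"}
```

A set difference never gets tired at 4pm, which is the whole argument for moving this class of check out of human review.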
Any team can start working this way:
- Write specs before writing code.
- Define acceptance criteria before generating diffs.
- Treat verification as something machines do against contracts humans wrote.
The tooling is catching up fast, but the shift is structural, not tool-dependent.
Trust is layered, not binary
You might be thinking: “I can’t just stop reading code. What if the agent does something wrong?”
It will.
Agents deviate. They hallucinate. They misunderstand edge cases.
The answer isn’t to read every line — you already can’t do that at this volume. The answer is to stack verification layers so no single failure is catastrophic:
- Tests that run on every change.
- Type systems that catch contract violations at compile time.
- Custom linters that enforce your organization’s invariants.
- Scoped permissions so an agent fixing a date parser can’t touch your auth system.
- A separate agent that tries to break what the first agent built.

None of these are perfect. That’s the point.
You stack imperfect filters until the gaps don’t align. The holes in your type checker aren’t the same as the holes in your linter, which aren’t the same as the holes in your adversarial test agent.
That’s how you get confidence without reading every line.
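One of those layers, the custom linter, can be a few lines of code. A hedged sketch using Python's `ast` module, with a deliberately crude naming heuristic standing in for a real organizational rule:

```python
# Hedged sketch of a custom lint rule: flag any money-like variable
# assigned a float literal. The naming heuristic is a placeholder for
# whatever invariant your organization actually enforces.

import ast

MONEY_HINTS = ("price", "amount", "total", "fee")

def money_float_violations(source: str) -> list[int]:
    """Return line numbers where a money-like name is assigned a float literal."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Assign):
            for target in node.targets:
                if (isinstance(target, ast.Name)
                        and any(h in target.id.lower() for h in MONEY_HINTS)
                        and isinstance(node.value, ast.Constant)
                        and isinstance(node.value.value, float)):
                    violations.append(node.lineno)
    return violations

sample = "shipping_fee = 4.99\ncount = 3\n"
assert money_float_violations(sample) == [1]
```

A rule like this misses plenty (floats smuggled through function calls, for one), but that's fine: it only has to cover holes the other layers don't.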
The uncomfortable question
If you’re spending your best engineering hours reading AI-generated diffs, you’re optimizing the wrong thing.
You’re doing work that feels productive — “I reviewed 15 PRs today” — while the decisions that actually determine whether your system works are happening before the code exists, with less scrutiny than they deserve.
Ask yourself:
- When was the last production incident caused by a bug you would have caught in a diff?
- Now ask: when was the last one caused by a wrong assumption, a missing requirement, or a constraint nobody wrote down?
The second category is where the real damage happens. And it’s the category that line-by-line review almost never catches.
Five questions that tell you where you stand
You don’t need to change everything tomorrow. Start by asking:
- What percentage of your review comments are things a linter could catch?
If it’s above half, you’re doing machine work with human hours.
- When was the last time a code review caught an architectural mistake?
Not a typo. Not a missing test. A real design error. If you can’t remember, your reviews aren’t catching the expensive bugs anyway.
- Do your PRs have a written spec, or just a title and a diff?
If there’s no spec, there’s no intent to review. You’re reviewing implementation in a vacuum.
- If an agent submitted a PR to your service right now, what would it verify itself against?
If the answer is “nothing explicit,” your contracts aren’t defined well enough for humans or agents to check.
- How much of your review time goes to code you trust versus code you don’t?
If you’re spending equal time on a utility function and a payments endpoint, your review process doesn’t reflect where risk actually lives.
The teams that figure this out first will ship with more confidence, because their quality gates will be in the right place — where the decisions are, not where the diffs are.
Code isn’t the artifact that matters anymore. The intent is.
Written by

Sylvain Giuliani
Sylvain Giuliani is the head of growth at Augment Code, leveraging more than a decade of experience scaling developer-focused SaaS companies from $1M to $100M+ ARR. Before Augment, he built and led the go-to-market and operations engines at Census and served as CRO at Pusher, translating deep data insights into outsized revenue gains.