
Auggie tops SWE-Bench Pro

Feb 4, 2026
Arash [AJ] Joobandi

We ran Auggie on Scale AI's SWE-bench Pro benchmark. It scored 51.80%, the highest of any agent tested.

We also ran Cursor, Claude Code, and OpenAI's Codex on the same 731 problems. Three of those four agents used the exact same model, Claude Opus 4.5. The results weren't the same.

[Bar chart: SWE-bench Pro results on the public dataset]
  • Auggie (Claude Opus 4.5): 51.80%
  • Cursor (Claude Opus 4.5): 50.21%
  • Claude Code (Claude Opus 4.5): 49.75%
  • Codex (GPT-5.2-codex): 46.47%
  • SWE-Agent (Claude Opus 4.5, Scale baseline): 45.89%
The SWE-Agent baseline comes from Scale AI's leaderboard (Jan 2026); all other evaluations were run by Augment Code with an identical harness.

Auggie CLI takes the top spot on SWE-bench Pro, solving 51.80% of real-world software engineering tasks.

For reference, Scale's own leaderboard shows Claude Opus 4.5 hitting 45.89% when run through SWE-Agent, their standard scaffold. Auggie beats that by nearly 6 points with the same underlying model.

Same model, different results

The spread surprised us more than the absolute score.

Auggie, Cursor, and Claude Code all ran Opus 4.5. Same weights. Same capabilities. Same training data. Yet out of 731 problems, Auggie solved roughly 12 more than Cursor and 15 more than Claude Code.

That gap comes from agent architecture.

The difference comes down to what context the agent sees before it starts writing code. SWE-bench Pro problems require understanding code that lives outside the file being edited. The agent has to find that code, and finding the right code in a large repository is a retrieval problem.

Auggie uses Augment's Context Engine (launching soon as an MCP server), which builds a semantic index of the full codebase. One SWE-bench Pro problem requires fixing BCrypt handling in Ansible. The relevant code spans several layers, from high-level filters down to low-level utility functions. Text-based search tools like grep find the top-level APIs easily, and most agents stopped there. But the actual fix belonged in a low-level utility that a test called directly. Augment's Context Engine found it because it understands semantic relationships, not just keyword matches.
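To make the contrast concrete, here's a minimal sketch of keyword search versus embedding-based semantic retrieval. Everything in it is an illustrative assumption, not Augment's Context Engine: the file paths and snippets are loosely modeled on the Ansible task, and the sentence-transformers model is just a convenient off-the-shelf encoder.

```python
# Minimal sketch: keyword search vs. semantic retrieval over code snippets.
# Illustrative only -- not Augment's Context Engine. Assumes the
# sentence-transformers package (pip install sentence-transformers).
from sentence_transformers import SentenceTransformer, util

# Hypothetical snippets, loosely modeled on the Ansible BCrypt task.
snippets = {
    "filters/crypt.py": "def password_hash(password, algorithm): ...",
    "utils/encrypt.py": "def do_encrypt(secret, algorithm, salt_size=None, rounds=None): ...",
}

query = "fix bcrypt rounds handling"

# Keyword search: only hits snippets that literally contain the term.
keyword_hits = [path for path, code in snippets.items() if "bcrypt" in code.lower()]
print(keyword_hits)  # [] -- neither snippet mentions 'bcrypt' by name

# Semantic retrieval: rank snippets by embedding similarity to the query,
# so conceptually related code can surface without a keyword match.
model = SentenceTransformer("all-MiniLM-L6-v2")
query_emb = model.encode(query, convert_to_tensor=True)
code_embs = model.encode(list(snippets.values()), convert_to_tensor=True)
scores = util.cos_sim(query_emb, code_embs)[0]
for path, score in sorted(zip(snippets, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.2f}  {path}")
```

A real system indexes the whole repository and layers structural relationships on top of embeddings, but the core idea is the same: rank by meaning, not by string match.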

Cursor and Claude Code have their own retrieval systems. They're good. But on these problems, they retrieved less useful context more often, and that compounded across 731 attempts.

What is SWE-bench Pro?

SWE-bench Verified was the standard coding benchmark for about a year. Top agents crossed 70% last spring. The benchmark started to saturate. Scores kept climbing, but the problems weren't getting any harder.

Scale AI released SWE-bench Pro in late 2025 to fix this. The problems are harder in ways that matter:

  • Multi-file edits. The average solution touches 4.1 files and changes 107 lines of code. You can't grep your way through these. The agent needs to understand how the codebase fits together.
  • Multiple languages. SWE-bench Verified was Python-only. Pro includes Go, TypeScript, and JavaScript. Error messages in Go are terse. TypeScript has type errors that don't map cleanly to runtime behavior. Each language has its own failure modes.
  • Real task diversity. Bug fixes, feature requests, security patches, performance optimizations, UI changes. The problems go beyond "fix this failing test."

When SWE-bench Pro launched, the best models dropped from 70%+ to around 23%. That gap has closed somewhat. Opus 4.5 now hits 45.89% on Scale's leaderboard. But it's still a harder benchmark by a wide margin.

Try it yourself

SWE-bench Pro is public. The dataset is on HuggingFace, and Scale published their evaluation harness on GitHub. If you want to verify these numbers or test your own agent, you can.
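If you want to poke at the problems directly, here's a minimal sketch using the HuggingFace datasets library. The dataset id and field names below are assumptions; check Scale's HuggingFace page for the exact spelling before running.

```python
# Minimal sketch: load the SWE-bench Pro public split and inspect one task.
# The dataset id "ScaleAI/SWE-bench_Pro" and the field names are assumptions;
# verify them on HuggingFace before running.
from datasets import load_dataset

ds = load_dataset("ScaleAI/SWE-bench_Pro", split="test")
print(len(ds))  # the public set should contain 731 problems

task = ds[0]
for key in ("repo", "instance_id", "problem_statement"):
    if key in task:
        print(key, "->", str(task[key])[:120])
```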

We'd rather you just try Auggie on your actual codebase. Benchmarks are useful for comparison, but the question that matters is whether it helps you ship faster. That's harder to measure and more important.

Install Auggie and see what it does with your code.

Written by

Arash [AJ] Joobandi

DevRel
