
Opus 4.7 for 33% less: How Auggie beats Claude Code on cost and quality

May 15, 2026
Robbert Kauffman, Mayur Nagarsheth

TL;DR: We benchmarked Auggie against Claude Code on Opus 4.7. Auggie takes a modest lead in quality (67.4% vs 66.3% pass rate) while costing ~33% less, thanks to sharper retrieval and the token efficiency it delivers.

Augment’s Context Engine was built to deliver high-quality results on large, complex codebases. As frontier models have improved, engineering leaders’ questions have shifted from “can it do this?” to “what does it cost at our scale?” Usage is exploding, and token spend is now a board-level line item. Because OpenAI and Anthropic dominate the frontier-model market, neither is motivated to make coding agents cheaper to run. For Augment, token efficiency is a key differentiator and point of pride. Below we show a head-to-head comparison between Augment’s agent Auggie and Claude Code on Opus 4.7. The headline: matched quality at 33% less cost. Combined with optimal model routing via Prism, Augment customers can expect to save up to 50% on state-of-the-art models for the same quality of output.

Same model, 33% discount: Terminal Bench 2.0 on Opus 4.7

We ran Terminal Bench 2.0 with Auggie CLI and Claude Code head to head using Opus 4.7 and default settings, on a GCP n4-highcpu-16 VM (16 vCPU, 32 GB RAM). The benchmark was run via the Harbor framework with five attempts per task and four tasks executing in parallel.
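To make the setup concrete, here is a minimal sketch of how per-attempt results roll up into the headline numbers. It assumes pass rate is simply the fraction of passing attempts across all tasks and attempts; the Attempt record and summarize helper are illustrative, not Harbor's actual API.

```python
from dataclasses import dataclass

@dataclass
class Attempt:
    task_id: str
    passed: bool
    tokens_used: int
    cost_usd: float

def summarize(attempts: list[Attempt]) -> dict:
    """Roll per-attempt records up into the headline benchmark numbers."""
    passed = sum(a.passed for a in attempts)
    return {
        "pass_rate": passed / len(attempts),  # e.g. 0.674 for Auggie's run
        "total_tokens": sum(a.tokens_used for a in attempts),
        "total_cost_usd": round(sum(a.cost_usd for a in attempts), 2),
    }

# Five attempts per task, as in the run described above (two shown here).
attempts = [
    Attempt("build-cli", passed=True, tokens_used=120_000, cost_usd=0.15),
    Attempt("build-cli", passed=False, tokens_used=90_000, cost_usd=0.11),
]
print(summarize(attempts))  # {'pass_rate': 0.5, 'total_tokens': 210000, ...}
```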

Terminal Bench 2.0 on Opus 4.7. Auggie vs Claude Code baseline: pass rate 67.4% vs 66.3%, total cost $463 vs $695 (33% reduction), total tokens 368M vs 543M (32% reduction).

Same model, 32% fewer tokens, 33% lower spend.

The pass-rate gap (1.1 points) sits inside the variance we see across runs of any single benchmark, but the cost gap doesn't. The table below shows where the savings come from: fewer tokens. Cache reads, the volume of historical context replayed each turn, drop by 32%; output tokens by 37%. That's the Context Engine and our harness doing what they were built to do: less wasted exploration, fewer expensive turns.

Token category (Opus 4.7) | Auggie CLI | Claude Code | Delta
Total tokens | 367,587,892 | 543,090,485 | −32%
Output tokens | 7,217,279 | 11,381,425 | −37%
Cache read tokens | 341,980,440 | 506,455,124 | −32%
Cache write tokens | 17,960,193 | 25,219,909 | −29%
Total cost (USD) | $463.04 | $694.50 | −33%
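If you want to sanity-check the cost line yourself, it is just each token category times a per-category rate. A minimal sketch, assuming illustrative per-million-token rates (cache reads are billed far below fresh input, cache writes above it); the rates here are placeholders, not published Anthropic pricing:

```python
# Hypothetical per-million-token rates; actual Opus pricing may differ.
RATES_PER_M = {"output": 25.00, "cache_read": 0.50, "cache_write": 6.25}

def run_cost(tokens: dict[str, int], rates: dict[str, float] = RATES_PER_M) -> float:
    """Total spend: each token category times its per-million rate."""
    return sum(tokens[cat] / 1_000_000 * rate for cat, rate in rates.items())

# Token counts from the table above.
auggie = {"output": 7_217_279, "cache_read": 341_980_440, "cache_write": 17_960_193}
claude = {"output": 11_381_425, "cache_read": 506_455_124, "cache_write": 25_219_909}

saving = 1 - run_cost(auggie) / run_cost(claude)
print(f"cost reduction: {saving:.0%}")  # lands near the ~33% in the table above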

Auggie on SWE-Bench Pro: higher quality, 23% lower cost

The same pattern holds on SWE-Bench Pro, a widely recognized benchmark for coding tasks. We ran it with the same head-to-head setup, three attempts per task and eight batches executing in parallel.

SWE-Bench Pro on Opus 4.7. Auggie vs Claude Code baseline: pass rate 61.8% vs 59.9%, total cost $1,449 vs $1,870 (23% reduction), total tokens 1.65B vs 2.35B (30% reduction).

Harder benchmark, same shape: ahead on quality, 23% cheaper per task.

Auggie edges ahead on quality and is still 23% cheaper per run; the table below shows where the tokens go.

Token category (Opus 4.7) | Auggie CLI | Claude Code | Delta
Total tokens | 1,651,716,301 | 2,349,143,356 | −30%
Cache read tokens | 1,582,841,271 | 2,269,905,161 | −30%
Cache write tokens | 52,849,663 | 63,777,293 | −17%
Total cost (USD) | $1,448.63 | $1,869.97 | −23%

Cache reads down 30%, cache writes down 17%, total tokens down by almost a third, pass rate slightly ahead. The shape is the same as Terminal Bench 2.0: a smaller, sharper context produces less work for the model and a meaningfully smaller bill at the end of the run.

What’s driving the token efficiency

Most coding agents assemble context through grep and keyword search. While this approach has improved in quality over time, it remains inefficient: agents burn turns crawling files, reading large spans of code, and pulling in irrelevant matches just to find the few lines that actually matter. Every miss costs another round trip, and every round trip costs tokens.

Augment's Context Engine and harness are built for token efficiency. The Context Engine maintains a semantic index of your codebase that not only improves quality on large, complex codebases but also makes retrieval far more efficient. The result: fewer turns, fewer tokens, and ultimately lower cost.
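As an illustration of the difference, here is a toy semantic index: chunks are embedded once, and a query returns a small ranked set of snippets instead of every keyword match. The hashing embed function is a stand-in for a real embedding model, and none of this is Augment's actual implementation; it only shows why one targeted retrieval can replace several rounds of grep.

```python
import math
import re

def embed(text: str, dims: int = 512) -> list[float]:
    """Toy bag-of-words hashing embedding; real systems use trained models."""
    vec = [0.0] * dims
    for tok in re.findall(r"\w+", text.lower()):
        vec[hash(tok) % dims] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

class SemanticIndex:
    """Embed code chunks once; answer each query with a few ranked snippets."""
    def __init__(self) -> None:
        self.chunks: list[tuple[str, list[float]]] = []

    def add(self, snippet: str) -> None:
        self.chunks.append((snippet, embed(snippet)))

    def query(self, question: str, k: int = 1) -> list[str]:
        q = embed(question)
        ranked = sorted(self.chunks, key=lambda c: cosine(q, c[1]), reverse=True)
        return [snippet for snippet, _ in ranked[:k]]

index = SemanticIndex()
index.add("def refresh_credentials(session): ...  # renews expired OAuth tokens")
index.add("def render_sidebar(ctx): ...  # HTML template helper")
print(index.query("where do we renew expired oauth credentials?"))
```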

Model agnosticism offers further savings

Auggie isn't bound to one model provider. The Context Engine sits in front of whichever frontier model you pick, which means the same efficiency advantage compounds when you choose a different one. Below are four Auggie model configurations measured on Terminal Bench 2.0, head to head against the Claude Code on Opus 4.7 baseline.

Auggie CLI across four model configurations vs the Claude Code on Opus 4.7 baseline (66.3% pass rate, $695 cost). Pass rates: GPT 5.5 76.0%, Gemini 3.1 67.6%, Opus 4.7 67.4%, GPT 5.4 66.5%. Costs: GPT 5.4 $215 (−69%), Gemini 3.1 $283 (−59%), GPT 5.5 $362 (−48%), Opus 4.7 $463 (−33%). Every Auggie configuration beats the baseline on both axes.

Two stand out: GPT 5.5 leads on quality, GPT 5.4 leads on cost.

Every model is cheaper than the Claude Code baseline; three of four match or beat its pass rate.

Two configurations stand out. Auggie + GPT 5.5 is the quality play: +9.7 points of pass rate over the Claude Code baseline at 48% lower cost. Auggie + GPT 5.4 is the value play: comparable pass rate at 69% lower cost. Auggie + Gemini 3.1 lands in between on both axes. You set the quality-to-cost balance that works for you.
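One crude way to collapse the trade-off into a single number is dollars per pass-rate point, computed below from the Terminal Bench 2.0 figures above. This is only an illustrative lens, not a metric we report formally:

```python
# Figures from the Terminal Bench 2.0 chart above: (pass rate %, total cost $).
configs = {
    "Claude Code + Opus 4.7": (66.3, 695),
    "Auggie + Opus 4.7":      (67.4, 463),
    "Auggie + GPT 5.5":       (76.0, 362),
    "Auggie + GPT 5.4":       (66.5, 215),
    "Auggie + Gemini 3.1":    (67.6, 283),
}

# Dollars spent per pass-rate point; lower is better.
for name, (pass_rate, cost) in sorted(configs.items(),
                                      key=lambda kv: kv[1][1] / kv[1][0]):
    print(f"{name:<24} {cost / pass_rate:5.2f} $/point")
```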

Does this hold up on real codebases?

Public benchmarks are a useful baseline, but the question every engineering leader actually wants answered is: "how does this translate to my codebases?" We ran an internal evaluation suite against private repositories (real customer codebases), and the pattern holds.

Internal evaluation on private customer codebases, Opus 4.7. Auggie vs Claude Code baseline: pass rate 72.6% vs 73.8% (61 vs 62 tasks passed), cost per passing task $3.90 vs $6.49 (40% reduction), total spend $238 vs $402 (41% reduction).

Same pattern on private repos as on the public benchmarks.

Claude Code passed 62 tasks and Auggie CLI passed 61, effectively a tie. But Claude Code spent $6.49 per passing task ($402 total) while Auggie spent $3.90 per passing task ($238 total). Same model, real repos, and the same shape of result we see in the public benchmarks above.

Further optimization with model routing via Prism

Everything above holds the model constant on each side of the comparison. With Prism, our new model router, you don't have to. At each user turn it evaluates the prompt and chooses the model best suited to it: frontier when the work demands it, cheaper alternatives when it doesn't, with cache-aware switching so the savings actually land. On top of Auggie's per-task efficiency, Prism adds another 20–30% cost reduction on the workloads we've measured, with negligible quality impact. Read the Prism deep-dive →
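Prism's actual routing logic isn't something we can reproduce here, but the shape of the idea fits in a few lines. A minimal sketch with a stand-in keyword heuristic and hypothetical blended rates; a real router scores each turn with a classifier and accounts for provider-side prompt caches:

```python
from dataclasses import dataclass

@dataclass
class Model:
    name: str
    cost_per_m_tokens: float  # blended rate, hypothetical

FRONTIER = Model("opus-4.7", 8.00)
BUDGET = Model("gpt-5.4", 2.50)

def route(prompt: str, prev_model: Model | None) -> Model:
    """Pick a model per turn. The keyword check is a stand-in: a real
    router scores the prompt with a classifier, not string matching."""
    hard = any(w in prompt.lower()
               for w in ("refactor", "migrate", "debug", "architecture"))
    choice = FRONTIER if hard else BUDGET
    # Cache-aware switching: if switching would re-warm a large prompt
    # cache elsewhere, the nominally cheaper model may cost more this turn.
    if prev_model and choice is not prev_model and not hard:
        return prev_model
    return choice

print(route("rename this variable", prev_model=None).name)               # gpt-5.4
print(route("debug the flaky migration in CI", prev_model=BUDGET).name)  # opus-4.7
```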

Written by

Robbert Kauffman

Solutions Architect

Robbert Kauffman was a Principal Solutions Architect at MongoDB before joining Augment Code. With over a decade of experience as a Solutions Architect, he focuses on helping organizations automate the SDLC in ways that deliver demonstrable ROI.

Mayur Nagarsheth

Mayur Nagarsheth is Head of Solutions Architecture at Augment Code, leveraging over a decade of experience leading enterprise presales, AI solutions, and GTM strategy. He previously scaled MongoDB's North America West business from $6M to $300M+ ARR, while building and mentoring high-performing teams. As an advisor to startups including Portend AI, MatchbookAI, Bitwage, Avocado Systems, and others, he helps drive GTM excellence, innovation, and developer productivity. Recognized as a Fellow of the British Computer Society, Mayur blends deep technical expertise with strategic leadership to accelerate growth.
