Thirteen months ago, I wrote a post arguing that model pickers were a design failure: the tool should pick the model. Developers should focus on shipping.
This month, we added Gemini 3.1 Pro to Augment Code, our third model, alongside Claude and GPT-5.
The market moved faster than we expected, and it's only accelerating. Vinay wrote about what it feels like to be inside an exponential, and this is one of the places you can see it most clearly: the models we're running today are the worst and most expensive models we'll use all year. Every few months, something new clears the bar. Picking one model and building around it was the wrong strategy. Here's how I got there.
One model worked fine, until it didn't
The argument I made in March 2025 was simple. AI coding tools exist to make developers productive. A dropdown with eight models is complexity that belongs in the system, not in the UI. If you have to choose, the tool isn't doing its job.
That was true when one model was clearly the best. We started with Anthropic's Sonnet. It was the best thing available for real coding work, and it wasn't close. You could build around it, tune your prompts for it, and feel confident you weren't missing anything.
Then GPT-5 showed up and was good enough that people wanted a choice. We shipped the picker.
Then Opus 4.5 pulled ahead of everything else.
Then GPT-5.4 showed up with stronger reasoning and aggressive pricing: roughly 2.6x cheaper per message than what we were running before, at quality we'd ship without hesitation.
The gap between these models is shrinking.

Models are converging on intelligence, speed, and price. Source: artificialanalysis.ai
A model choice is a provider commitment
Most teams underestimate what they're deciding when they standardize on a model. It feels like a technical choice, but in practice it's a provider commitment. Even with multiple model tiers within a single provider, you're inheriting their pricing, their availability, their release cadence, and their tradeoffs. You're also missing what's happening outside of them.
We've already seen how fast the ground shifts.
- OpenAI closed the gap faster than anyone expected.
- Google entered with competitive performance and dramatically different pricing.
- Benchmarks have tightened to the point where the top models are increasingly hard to separate.
Teams are also paying closer attention to headlines around supply chain risk and security incidents across providers, not as disqualifiers, but as signals that optionality matters.
There's an obvious counterargument here: shouldn't we just go all-in on our favorite model provider? Of course you should use the best model available for your workload today. This post is about what happens next. In the past thirteen months, the lead has changed three times. The best model in March 2025 wasn't the best model in August. The best model in August wasn't the best model in November. The question is whether your setup lets you move to the next one when it shows up.
When you're only set up for one provider, every time the market moves, you're scrambling.
Why switching models doesn't have to mean switching everything
Most teams think of their AI tool and their model as the same thing, and when the tool is built around one provider, that's basically true. The prompts are tuned for that model. The context retrieval is optimized for its strengths. The agent behavior is shaped around its quirks. Switching the model means rebuilding all of it. That's why it feels like a migration.
But those are separate layers, and they don't have to be coupled.
Think of it as three things:
- The model is the LLM doing the generation: Claude, GPT-5, Gemini.
- The harness is what feeds the model context from your codebase: retrieval, indexing, prompt construction.
- The orchestration is how agents coordinate across a workflow: planning, execution, validation.
When the harness and orchestration are built to be provider-agnostic, the model becomes the part you swap. Switching from Claude to Gemini for a given task doesn't mean re-tuning your prompts or revalidating your agent pipelines. It means toggling a setting. The rest of the system stays the same.
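To make the decoupling concrete, here's a minimal sketch of what a provider-agnostic layer can look like. This is illustrative only, not Augment's actual implementation; the names (`Model`, `run_task`, `build_context`, the `MODELS` registry) are all hypothetical:

```python
from dataclasses import dataclass
from typing import Protocol


class Model(Protocol):
    """Any LLM provider, behind one interface."""
    def generate(self, prompt: str) -> str: ...


@dataclass
class ClaudeModel:
    def generate(self, prompt: str) -> str:
        return f"[claude] {prompt}"  # placeholder for a real provider API call


@dataclass
class GeminiModel:
    def generate(self, prompt: str) -> str:
        return f"[gemini] {prompt}"  # placeholder for a real provider API call


# Hypothetical registry: switching models is a config change, not a rewrite.
MODELS: dict[str, Model] = {"claude": ClaudeModel(), "gemini": GeminiModel()}


def build_context(task: str) -> str:
    # Stand-in for the harness: retrieval, indexing, prompt construction.
    # None of this depends on which model is underneath.
    return f"relevant files for: {task}"


def run_task(task: str, model_name: str) -> str:
    model = MODELS[model_name]                     # the model is a variable
    context = build_context(task)                  # harness stays the same
    return model.generate(f"{context}\n\n{task}")  # orchestration stays the same
```

The point of the `Protocol` is that the harness and orchestration code never name a provider; swapping Claude for Gemini on a task is one argument, not a migration.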

When the model, harness, and orchestration are fused to one provider, a model switch is a migration. When they're decoupled, it's a toggle.
When all three layers are fused to one provider, pulling out any piece means pulling out everything. When the model is a variable instead of a foundation, you have choice.
For us, that architecture looks like this. The harness is the Context Engine: it indexes at repo scale, stays current with real-time changes, and gives every model the context that makes its outputs worth using. That foundation is a big part of how we topped SWE-Bench Pro. The orchestration layer is Intent: systems of agents coordinating across a workflow to get real work done, regardless of which model is underneath.
This is the direction we’re building toward more broadly: a platform that supports an AI-native SDLC, where models, context, and orchestration come together to automate meaningful parts of the software lifecycle, not just assist within it.
Experimentation is the strategy now
A year ago, I thought multiple models were a distraction. Now we run three, and I expect we'll add more before the year is out. They'll be better and cheaper than what we have today.
The lead changes every few months. Pricing shifts. New players show up. The model that was your best option in January might be your most expensive option by April. The teams that stay ahead are the ones that made it easy to try the next one.
Thirteen months ago I thought simplicity meant fewer models. Now I think simplicity means your developers never have to worry about whether they're on the right one. That's our job. We test the models, we do the integration work, and when something new clears the bar, it shows up in the product.
The era of single-model engineering is over. Three models clear the bar today, and more will follow before the year is out. I'd rather be ready for that than scrambling to catch up.
Written by

Matt Ball
Matt is passionate about empowering developers. He was Postman's first Solutions Architect, where he helped build the go-to-market strategy, and previously led Professional Services Engineering at Qubit.
