Developers are choosing older AI models — and the data explain why

At Augment Code, we run multiple frontier models side by side in production, which gives us a direct view of how different models behave in real coding workflows. Usage patterns suggest developers are no longer just chasing the newest model; they are matching models to specific task profiles.

This post shares data from millions of live interactions and discusses what it may reveal about model adoption, behavioral differences, and system-level trade-offs.

Model Adoption Is Fragmenting

Between September 30 and October 6, 2025, Sonnet 4.5’s share of total requests declined from 66% to 52%, while Sonnet 4.0’s rose from 23% to 37%. GPT-5 usage held steady at roughly 10–12%.

Date	Sonnet 4.5	Sonnet 4.0	GPT-5
2025-09-30	66.18%	23.26%	10.57%
2025-10-01	59.39%	30.28%	10.33%
2025-10-02	55.77%	33.54%	10.69%
2025-10-03	54.16%	35.36%	10.48%
2025-10-04	56.66%	31.70%	11.64%
2025-10-05	56.54%	31.02%	12.44%
2025-10-06	52.29%	37.38%	10.33%

At first glance this could look like short-term churn after a new release. But if developers were simply upgrading, Sonnet 4.5’s share would continue rising while 4.0’s declined. The opposite happened. Both models retained significant usage, suggesting that teams are choosing models based on the kind of task, not on version number. In other words, upgrades are beginning to behave like alternatives rather than successors. That shift marks the early stages of specialization in production environments.
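The shifts quoted above can be recomputed directly from the daily table. The following is a minimal sketch in Python; the `shares` dictionary and the `share_change` helper are illustrative names, not part of any Augment Code tooling, and the figures are copied from the table as published.

```python
# Daily share of total requests (percent), copied from the table above.
shares = {
    "2025-09-30": {"Sonnet 4.5": 66.18, "Sonnet 4.0": 23.26, "GPT-5": 10.57},
    "2025-10-01": {"Sonnet 4.5": 59.39, "Sonnet 4.0": 30.28, "GPT-5": 10.33},
    "2025-10-02": {"Sonnet 4.5": 55.77, "Sonnet 4.0": 33.54, "GPT-5": 10.69},
    "2025-10-03": {"Sonnet 4.5": 54.16, "Sonnet 4.0": 35.36, "GPT-5": 10.48},
    "2025-10-04": {"Sonnet 4.5": 56.66, "Sonnet 4.0": 31.70, "GPT-5": 11.64},
    "2025-10-05": {"Sonnet 4.5": 56.54, "Sonnet 4.0": 31.02, "GPT-5": 12.44},
    "2025-10-06": {"Sonnet 4.5": 52.29, "Sonnet 4.0": 37.38, "GPT-5": 10.33},
}


def share_change(daily_shares, model):
    """Change in share, in percentage points, from the first to the last day."""
    days = sorted(daily_shares)  # ISO dates sort chronologically as strings
    return round(daily_shares[days[-1]][model] - daily_shares[days[0]][model], 2)


models = ("Sonnet 4.5", "Sonnet 4.0", "GPT-5")
changes = {m: share_change(shares, m) for m in models}

# Lowest and highest daily share for GPT-5 over the window.
gpt5_range = (
    min(day["GPT-5"] for day in shares.values()),
    max(day["GPT-5"] for day in shares.values()),
)

for model, delta in changes.items():
    print(f"{model}: {delta:+.2f} percentage points")
print(f"GPT-5 range: {gpt5_range[0]}% to {gpt5_range[1]}%")
```

Over the window, Sonnet 4.5 loses 13.89 percentage points and Sonnet 4.0 gains 14.12, while GPT-5 stays between 10.33% and 12.44%.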

Diverging Behaviors: Reasoning Depth vs. Action Frequency

Despite producing larger total outputs, Sonnet 4.5 makes fewer tool calls per user message than 4.0.

Model	Avg Tool Calls / User Message
Sonnet 4.5	12.33
Sonnet 4.0	15.65
GPT-5	11.58

Higher verbosity combined with fewer actions suggests that Sonnet 4.5 performs more internal reasoning before deciding to act. By contrast, 4.0 issues more frequent tool calls, favoring quick task execution over extended deliberation. GPT-5 falls close to 4.5 in call frequency but tends to favor natural-language reasoning over tool use.

We are monitoring whether this behavioral difference aligns with prompt success rates. If more internal reasoning correlates with higher completion rates, that would support the idea that Sonnet 4.5’s “think more, act less” tendency leads to better outcomes.
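To put the gap in tool-call frequency in relative terms, the averages from the table can be compared against a baseline. This is a small sketch; `relative_to` is a hypothetical helper name, and the numbers are taken from the table above.

```python
# Average tool calls per user message, copied from the table above.
avg_tool_calls = {"Sonnet 4.5": 12.33, "Sonnet 4.0": 15.65, "GPT-5": 11.58}


def relative_to(baseline, values):
    """Percent difference of every entry relative to the baseline entry."""
    base = values[baseline]
    return {name: round((v / base - 1) * 100, 1) for name, v in values.items()}


vs_sonnet_45 = relative_to("Sonnet 4.5", avg_tool_calls)

for model, pct in vs_sonnet_45.items():
    print(f"{model}: {pct:+.1f}% tool calls vs. Sonnet 4.5")
```

On these figures, Sonnet 4.0 issues about 27% more tool calls per user message than Sonnet 4.5, and GPT-5 about 6% fewer.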

Throughput and Token Economy

Sonnet 4.5 generates more text and tool output per message: about 7.5k tokens on average compared with 5.5k for 4.0, a 37% increase in total output per interaction.

Model	Text Output	Tool Output	Total Output
Sonnet 4.5	2,497	5,018	7,517
Sonnet 4.0	1,168	3,948	5,481
GPT-5	3,740	1,729	5,469

Richer reasoning can produce more contextual responses, but it likely adds latency. We do not yet have per-request tokens-per-second data, but qualitative traces suggest throughput is slightly lower, consistent with the extra compute required for deeper reasoning chains.
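The 37% figure and the balance between text and tool output can be checked with a few lines of Python. This sketch uses the reported Total Output column as-is rather than summing the components; `pct_increase` and `text_to_tool` are illustrative names.

```python
# Average output tokens per message, copied from the table above.
avg_tokens = {
    "Sonnet 4.5": {"text": 2497, "tool": 5018, "total": 7517},
    "Sonnet 4.0": {"text": 1168, "tool": 3948, "total": 5481},
    "GPT-5": {"text": 3740, "tool": 1729, "total": 5469},
}


def pct_increase(new, old):
    """Percent increase of `new` over `old`."""
    return round((new / old - 1) * 100, 1)


# Increase in reported total output, Sonnet 4.5 over Sonnet 4.0.
total_increase = pct_increase(
    avg_tokens["Sonnet 4.5"]["total"], avg_tokens["Sonnet 4.0"]["total"]
)

# Text tokens emitted per tool-output token; values above 1 mean the model
# writes more natural-language text than tool output.
text_to_tool = {
    model: round(row["text"] / row["tool"], 2) for model, row in avg_tokens.items()
}

print(f"Sonnet 4.5 vs 4.0 total output: +{total_increase}%")
for model, ratio in text_to_tool.items():
    print(f"{model}: {ratio} text tokens per tool-output token")
```

GPT-5 is the only model here with a ratio above 1 (about 2.16), in line with the observation that it favors natural-language reasoning over tool use.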

Compute Footprint and Cache Utilization

To understand how reasoning depth affects system load, we sampled a small subset of production data covering several billion tokens and corresponding cache operations.

Sonnet 4.5 still accounts for the majority of processed volume, with roughly 78% more cache reads than Sonnet 4.0 (240.0 B vs. 135.0 B). GPT-5 shows a much lighter footprint overall.

Model	Input Tokens	Text Output	Tool Output	Total Output	Cache Reads
Sonnet 4.5	0.25 B	0.75 B	1.55 B	2.30 B	240.0 B
Sonnet 4.0	0.13 B	0.20 B	0.72 B	0.92 B	135.0 B
GPT-5	0.16 B	0.22 B	0.10 B	0.32 B	28.0 B
Grand Total	0.54 B	1.17 B	2.37 B	3.54 B	403.0 B

The higher cache-read volume for Sonnet 4.5 likely comes from heavier use of retrieval-augmented workflows and longer context windows. This suggests a system-level shift: more compute is being spent on managing and reusing context rather than on token generation itself.
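One way to see how cache traffic compares with generation is to relate cache reads to output tokens for each model. The sketch below does this with the sampled figures from the table; the "cache reads per output token" ratio is an illustrative derived metric, not one reported in the data, and the variable names are assumptions.

```python
# Sampled volumes in billions of tokens, copied from the table above.
sample = {
    "Sonnet 4.5": {"total_output": 2.30, "cache_reads": 240.0},
    "Sonnet 4.0": {"total_output": 0.92, "cache_reads": 135.0},
    "GPT-5": {"total_output": 0.32, "cache_reads": 28.0},
}

total_cache_reads = sum(row["cache_reads"] for row in sample.values())

# Each model's share of all cache reads in the sample, in percent.
cache_share = {
    model: round(100 * row["cache_reads"] / total_cache_reads, 1)
    for model, row in sample.items()
}

# Illustrative derived metric: cache-read tokens per generated output token.
reads_per_output_token = {
    model: round(row["cache_reads"] / row["total_output"], 1)
    for model, row in sample.items()
}

# Ratio of Sonnet 4.5 cache reads to Sonnet 4.0 cache reads.
sonnet_ratio = round(
    sample["Sonnet 4.5"]["cache_reads"] / sample["Sonnet 4.0"]["cache_reads"], 2
)

print(f"Total cache reads: {total_cache_reads} B")
print(f"Sonnet 4.5 / Sonnet 4.0 cache reads: {sonnet_ratio}x")
for model in sample:
    print(
        f"{model}: {cache_share[model]}% of cache reads, "
        f"{reads_per_output_token[model]} cache-read tokens per output token"
    )
```

In this sample, cache reads outnumber generated tokens by roughly two orders of magnitude for every model, which fits the point that much of the work lies in managing and reusing context.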

Emergent Specialization: Where Each Model Excels

Even though developers can freely choose models, their behavior reveals clear preferences by task type. Usage data and qualitative feedback show early signs of specialization.

Model	Observed Strengths	Typical Workflows
Sonnet 4.5	Long-context reasoning, multi-file understanding, autonomous planning	Refactoring agents, complex debugging, design synthesis
Sonnet 4.0	Deterministic completions, consistent formatting, tool-friendly outputs	API generation, structured edits, rule-based transforms
GPT-5	Explanatory fluency, general reasoning, hybrid coding + documentation	Code walkthroughs, summarization, developer education

Each model appears to emphasize a different balance between reasoning and execution. Rather than seeking one “best” system, developers are assembling model alloys—ensembles that select the cognitive style best suited to a task.
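As a rough illustration of what task-based selection could look like in code, the sketch below maps task types to models using the Typical Workflows column above. Everything here is hypothetical: the task-type keys, the `ROUTES` table, the `pick_model` function, and the fallback choice are assumptions for illustration, not a description of how Augment Code or its users route requests.

```python
# Hypothetical routing table derived from the "Typical Workflows" column.
# The task-type keys are illustrative labels, not identifiers from any real API.
ROUTES = {
    "refactoring": "Sonnet 4.5",
    "complex_debugging": "Sonnet 4.5",
    "design_synthesis": "Sonnet 4.5",
    "api_generation": "Sonnet 4.0",
    "structured_edit": "Sonnet 4.0",
    "rule_based_transform": "Sonnet 4.0",
    "code_walkthrough": "GPT-5",
    "summarization": "GPT-5",
    "developer_education": "GPT-5",
}

# Assumption: fall back to Sonnet 4.0, which community feedback in this post
# describes as a "safe default".
DEFAULT_MODEL = "Sonnet 4.0"


def pick_model(task_type: str) -> str:
    """Return the model label for a task type, or the default if unknown."""
    return ROUTES.get(task_type, DEFAULT_MODEL)


if __name__ == "__main__":
    for task in ("refactoring", "summarization", "unknown_task"):
        print(f"{task} -> {pick_model(task)}")
```

A static lookup like this is the simplest possible form; a production router would more likely weigh context size, latency budget, and observed success rates.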

Community Sentiment Mirrors Production Behavior

Community discussions of Sonnet 4.5, 4.0, and GPT-5 align closely with the production data:

Sonnet 4.5: Users describe it as thoughtful and reliable for multi-file reasoning but occasionally verbose or slower for simple edits. It handles refactors and architectural planning effectively but can over-explain.
Sonnet 4.0: Praised for tool integration stability and predictable formatting. It is quick and consistent, ideal for automation or rule-based coding tasks. Teams often select it as the “safe default” model.
GPT-5: Recognized for fluency and clarity in explanations. It performs well in hybrid reasoning-plus-writing contexts such as code reviews and documentation but lags in heavy tool execution.

Theme	Sonnet 4.5	Sonnet 4.0	GPT-5
Reasoning Depth	⭐⭐⭐⭐ — Deep planning, sometimes overthinks	⭐⭐ — Direct, task-driven	⭐⭐⭐⭐ — Analytical and expressive
Latency / Responsiveness	Slower	Fast	Moderate
Output Determinism	Medium	High	Medium
Code Generation Quality	Excellent for multi-file	Strong for single-file	Great for hybrid code + docs
Ideal Use Cases	Refactors, architecture	Automation, structured tasks	Walkthroughs, learning, synthesis

Takeaways: The Early Signals of Behavioral Specialization

Three main insights emerge from this dataset:

Adoption is diversifying, not consolidating. Newer models are not always better for every workflow.
Behavioral divergence is measurable. Sonnet 4.5 appears to reason more before acting, while 4.0 calls tools more often (15.65 vs. 12.33 calls per user message).
System costs are shifting. Reasoning intensity and cache utilization are now central performance metrics.

The story here is not about one model surpassing others but about each developing its own niche. As capabilities expand, behaviors diverge. The industry may be entering a stage where functional specialization replaces the race for a single “best” model—much like how databases evolved into SQL, NoSQL, and time-series systems optimized for different workloads. The same dynamic is beginning to appear in AI: success depends less on overall strength and more on the right cognitive style for the job.

As reasoning depth increases, these behavioral distinctions could define the next phase of AI tooling. The key question for builders is no longer “Which model is best?” but “Which model best fits this task?”