October 20, 2025
Developers are choosing older AI models — and the data explain why

At Augment Code, we run multiple frontier models side by side in production. This gives us a unique vantage point into how different models behave in real coding workflows. Usage patterns suggest developers are no longer just chasing the newest model; they are matching models to specific task profiles.
This post shares data from millions of live interactions and discusses what it may reveal about model adoption, behavioral differences, and system-level trade-offs.
Model Adoption Is Fragmenting
Between September 30 and October 6, 2025, Sonnet 4.5’s share of total requests declined from 66% to 52%, while Sonnet 4.0’s rose from 23% to 37%. GPT-5 usage stayed steady at about 10–12%.
Date | Sonnet 4.5 | Sonnet 4.0 | GPT-5 |
---|---|---|---|
2025-09-30 | 66.18% | 23.26% | 10.57% |
2025-10-01 | 59.39% | 30.28% | 10.33% |
2025-10-02 | 55.77% | 33.54% | 10.69% |
2025-10-03 | 54.16% | 35.36% | 10.48% |
2025-10-04 | 56.66% | 31.70% | 11.64% |
2025-10-05 | 56.54% | 31.02% | 12.44% |
2025-10-06 | 52.29% | 37.38% | 10.33% |
At first glance this could look like short-term churn after a new release. But if developers were simply upgrading, Sonnet 4.5’s share would continue rising while 4.0’s declined. The opposite happened. Both models retained significant usage, suggesting that teams are choosing models based on the kind of task, not on version number. In other words, upgrades are beginning to behave like alternatives rather than successors. That shift marks the early stages of specialization in production environments.
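The size of the shift can be read directly off the table above. The following is a minimal sketch that computes the change in percentage points between the first and last day; the `shares` dictionary and the `point_change` helper are illustrative names, not part of any Augment tooling.

```python
# Daily request share (%) copied from the adoption table above.
shares = {
    "2025-09-30": {"Sonnet 4.5": 66.18, "Sonnet 4.0": 23.26, "GPT-5": 10.57},
    "2025-10-01": {"Sonnet 4.5": 59.39, "Sonnet 4.0": 30.28, "GPT-5": 10.33},
    "2025-10-02": {"Sonnet 4.5": 55.77, "Sonnet 4.0": 33.54, "GPT-5": 10.69},
    "2025-10-03": {"Sonnet 4.5": 54.16, "Sonnet 4.0": 35.36, "GPT-5": 10.48},
    "2025-10-04": {"Sonnet 4.5": 56.66, "Sonnet 4.0": 31.70, "GPT-5": 11.64},
    "2025-10-05": {"Sonnet 4.5": 56.54, "Sonnet 4.0": 31.02, "GPT-5": 12.44},
    "2025-10-06": {"Sonnet 4.5": 52.29, "Sonnet 4.0": 37.38, "GPT-5": 10.33},
}


def point_change(shares, model):
    """Change in share, in percentage points, from the first to the last date."""
    dates = sorted(shares)  # ISO dates sort chronologically as strings
    return round(shares[dates[-1]][model] - shares[dates[0]][model], 2)


for model in ("Sonnet 4.5", "Sonnet 4.0", "GPT-5"):
    print(f"{model}: {point_change(shares, model):+.2f} pts")
```

Percentage points are used here rather than relative change so that the three figures stay directly comparable with the shares in the table.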
Diverging Behaviors: Reasoning Depth vs. Action Frequency
Despite producing larger total outputs, Sonnet 4.5 makes fewer tool calls per user message than 4.0.
Model | Avg Tool Calls / User Message |
---|---|
Sonnet 4.5 | 12.33 |
Sonnet 4.0 | 15.65 |
GPT-5 | 11.58 |
Higher verbosity combined with fewer actions suggests that Sonnet 4.5 performs more internal reasoning before deciding to act. By contrast, 4.0 issues more frequent tool calls, favoring quick task execution over extended deliberation. GPT-5 falls close to 4.5 in call frequency but tends to favor natural-language reasoning over tool use.
We are monitoring whether this behavioral difference aligns with prompt success rates. If more internal reasoning correlates with higher completion rates, that would support the view that Sonnet 4.5’s “think more, act less” tendency leads to better outcomes.
Throughput and Token Economy
Sonnet 4.5 generates more text and tool output per message: about 7.5k tokens on average, compared with 5.5k for 4.0. That is a 37% increase in total output per interaction.
Model | Text Output (tokens) | Tool Output (tokens) | Total Output (tokens) |
---|---|---|---|
Sonnet 4.5 | 2,497 | 5,018 | 7,517 |
Sonnet 4.0 | 1,168 | 3,948 | 5,481 |
GPT-5 | 3,740 | 1,729 | 5,469 |
Richer reasoning leads to more contextual responses but introduces additional latency. We do not yet have per-request tokens-per-second data, but qualitative traces suggest that Sonnet 4.5’s throughput is slightly lower than 4.0’s, consistent with the extra compute required for deeper reasoning chains.
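The headline increase and each model's mix of text versus tool output can be reproduced from the per-message averages. This is a rough sketch: the `output` dictionary and both helper functions are illustrative names, and the `total` values are taken from the published Total Output column as-is.

```python
# Average output tokens per message, copied from the table above.
# "total" is the published Total Output column, used unchanged.
output = {
    "Sonnet 4.5": {"text": 2497, "tool": 5018, "total": 7517},
    "Sonnet 4.0": {"text": 1168, "tool": 3948, "total": 5481},
    "GPT-5": {"text": 3740, "tool": 1729, "total": 5469},
}


def pct_increase(new, old):
    """Relative increase of `new` over `old`, in percent."""
    return (new - old) / old * 100


def text_share(row):
    """Text tokens as a percentage of text + tool tokens."""
    return row["text"] / (row["text"] + row["tool"]) * 100


increase = pct_increase(output["Sonnet 4.5"]["total"], output["Sonnet 4.0"]["total"])
print(f"Sonnet 4.5 vs 4.0 total output: +{increase:.0f}%")

for model, row in output.items():
    print(f"{model}: {text_share(row):.0f}% of text + tool output is text")
```

By this measure, GPT-5 is the only model of the three whose output is mostly text rather than tool output.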
Compute Footprint and Cache Utilization
To understand how reasoning depth affects system load, we sampled a small subset of production data covering several billion tokens and corresponding cache operations.
Sonnet 4.5 still accounts for the majority of processed volume, with roughly 78% more cache reads than Sonnet 4.0 (240 B vs. 135 B). GPT-5 shows a much lighter footprint overall.
Model | Input Tokens | Text Output | Tool Output | Total Output | Cache Reads |
---|---|---|---|---|---|
Sonnet 4.5 | 0.25 B | 0.75 B | 1.55 B | 2.30 B | 240.0 B |
Sonnet 4.0 | 0.13 B | 0.20 B | 0.72 B | 0.92 B | 135.0 B |
GPT-5 | 0.16 B | 0.22 B | 0.10 B | 0.32 B | 28.0 B |
Grand Total | 0.54 B | 1.17 B | 2.37 B | 3.54 B | 403.0 B |
The higher cache-read volume for Sonnet 4.5 likely comes from heavier use of retrieval-augmented workflows and longer context windows. This suggests a system-level shift: more compute is being spent on managing and reusing context rather than on token generation itself.
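One way to see how heavily context reuse outweighs generation is to compare cache reads with output tokens in the sample. The sketch below assumes both columns are counted in the same unit (billions of tokens); the `sample` dictionary and helper names are illustrative.

```python
# Sampled volumes in billions of tokens, copied from the table above.
sample = {
    "Sonnet 4.5": {"total_output": 2.30, "cache_reads": 240.0},
    "Sonnet 4.0": {"total_output": 0.92, "cache_reads": 135.0},
    "GPT-5": {"total_output": 0.32, "cache_reads": 28.0},
}

total_reads = sum(row["cache_reads"] for row in sample.values())


def cache_share(model):
    """Model's share of all sampled cache reads, in percent."""
    return sample[model]["cache_reads"] / total_reads * 100


def reads_per_output_token(model):
    """Cache-read tokens per generated output token."""
    row = sample[model]
    return row["cache_reads"] / row["total_output"]


for model in sample:
    print(
        f"{model}: {cache_share(model):.1f}% of cache reads, "
        f"{reads_per_output_token(model):.0f} cache reads per output token"
    )
```

For all three models, cache reads exceed output tokens by roughly two orders of magnitude in this sample.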
Emergent Specialization: Where Each Model Excels
Even though developers can freely choose models, their behavior reveals clear preferences by task type. Usage data and qualitative feedback show early signs of specialization.
Model | Observed Strengths | Typical Workflows |
---|---|---|
Sonnet 4.5 | Long-context reasoning, multi-file understanding, autonomous planning | Refactoring agents, complex debugging, design synthesis |
Sonnet 4.0 | Deterministic completions, consistent formatting, tool-friendly outputs | API generation, structured edits, rule-based transforms |
GPT-5 | Explanatory fluency, general reasoning, hybrid coding + documentation | Code walkthroughs, summarization, developer education |
Each model appears to emphasize a different balance between reasoning and execution. Rather than seeking one “best” system, developers are assembling model alloys—ensembles that select the cognitive style best suited to a task.
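Task-based selection of this kind could be sketched as a simple lookup. Everything below is hypothetical: the `Task` categories, the `ROUTES` table, and the fallback choice are assumptions drawn loosely from the "Typical Workflows" column, and the model names are display labels rather than real API identifiers.

```python
from enum import Enum
from typing import Optional


class Task(Enum):
    REFACTOR = "refactor"
    COMPLEX_DEBUGGING = "complex_debugging"
    API_GENERATION = "api_generation"
    STRUCTURED_EDIT = "structured_edit"
    CODE_WALKTHROUGH = "code_walkthrough"
    SUMMARIZATION = "summarization"


# Hypothetical routing table based on the "Typical Workflows" column above.
ROUTES = {
    Task.REFACTOR: "Sonnet 4.5",
    Task.COMPLEX_DEBUGGING: "Sonnet 4.5",
    Task.API_GENERATION: "Sonnet 4.0",
    Task.STRUCTURED_EDIT: "Sonnet 4.0",
    Task.CODE_WALKTHROUGH: "GPT-5",
    Task.SUMMARIZATION: "GPT-5",
}

# Assumption: fall back to one fixed model when the task type is unknown.
DEFAULT_MODEL = "Sonnet 4.0"


def pick_model(task: Optional[Task]) -> str:
    """Return the model label for a task, or the default if unclassified."""
    if task is None:
        return DEFAULT_MODEL
    return ROUTES.get(task, DEFAULT_MODEL)


print(pick_model(Task.REFACTOR))
print(pick_model(None))
```

A static table keeps the routing decision easy to audit; a production system would more likely classify tasks dynamically and revise the mapping as usage data changes.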
Community Sentiment Mirrors Production Behavior
Community discussions of Sonnet 4.5, 4.0, and GPT-5 align closely with the production data:
- Sonnet 4.5: Users describe it as thoughtful and reliable for multi-file reasoning but occasionally verbose or slower for simple edits. It handles refactors and architectural planning effectively but can over-explain.
- Sonnet 4.0: Praised for tool integration stability and predictable formatting. It is quick and consistent, ideal for automation or rule-based coding tasks. Teams often select it as the “safe default” model.
- GPT-5: Recognized for fluency and clarity in explanations. It performs well in hybrid reasoning-plus-writing contexts such as code reviews and documentation but lags in heavy tool execution.
Theme | Sonnet 4.5 | Sonnet 4.0 | GPT-5 |
---|---|---|---|
Reasoning Depth | ⭐⭐⭐⭐ — Deep planning, sometimes overthinks | ⭐⭐ — Direct, task-driven | ⭐⭐⭐⭐ — Analytical and expressive |
Latency / Responsiveness | Slower | Fast | Moderate |
Output Determinism | Medium | High | Medium |
Code Generation Quality | Excellent for multi-file | Strong for single-file | Great for hybrid code + docs |
Ideal Use Cases | Refactors, architecture | Automation, structured tasks | Walkthroughs, learning, synthesis |
Takeaways: The Early Signals of Behavioral Specialization
Three main insights emerge from this dataset:
- Adoption is diversifying, not consolidating. Newer models are not always better for every workflow.
- Behavioral divergence is measurable. Sonnet 4.5 reasons more deeply, while 4.0 acts more frequently.
- System costs are shifting. Reasoning intensity and cache utilization are now central performance metrics.
The story here is not about one model surpassing others but about each developing its own niche. As capabilities expand, behaviors diverge. The industry may be entering a stage where functional specialization replaces the race for a single “best” model—much like how databases evolved into SQL, NoSQL, and time-series systems optimized for different workloads. The same dynamic is beginning to appear in AI: success depends less on overall strength and more on the right cognitive style for the job.
As reasoning depth increases, these behavioral distinctions could define the next phase of AI tooling. The key question for builders is no longer “Which model is best?” but “Which model best fits this task?”

Molisha Shah
GTM and Customer Champion