The agent learning flywheel is a four-stage continuous improvement architecture (execute → coach → distill → improve) that turns an AI agent's operational experience into better future behavior, compounding performance gains across sessions instead of starting from zero on every run.
TL;DR
Most AI agents reset in production because LLMs have static weights, many systems lack a persistent memory layer and strong retrieval architecture, and adding more history to prompts can increase token costs while degrading output quality. The agent learning flywheel addresses this by turning traces, feedback, and distillation into reusable knowledge, even though benchmarks still show long-term degradation in memory-augmented systems. Cosmos, Augment Code's operating system for agentic software development (currently in public preview), implements this flywheel as shared infrastructure across agents, sessions, and team workflows.
Why Today's AI Agents Show Perpetual Amnesia
Today's AI agents show perpetual amnesia because static model weights and weak cross-session memory architectures prevent experience from carrying forward reliably, causing repeated mistakes and reset behavior across sessions.
Every engineering team that has deployed an AI agent in production has run into the same frustration: the agent that made the same mistake yesterday makes it again today. Corrections vanish between sessions, context evaporates, and the agent that took a week to become useful starts over from day one every Monday morning.
The reset problem sits at two architectural layers, and together these constraints mean every session begins with a blank slate at the parameter level and a fragile retrieval layer at the system level:
- Static model weights: LLM weights are frozen after training, so no interaction updates the model's underlying knowledge without costly fine-tuning.
- Fragile system memory: applications have to compress, retrieve, or discard historical experience at inference time unless that knowledge is stored outside the model in reusable form.
Prompt-injected history creates its own failure mode: relevant information can sit in the prompt yet go unused, with output quality degrading as the context window grows. The phenomenon was popularized by Chroma's 2025 context rot report, which tested 18 frontier models, and it also appears in related memory-layer research. This article walks through the four-stage agent learning flywheel, the architectural properties that make an agent learnable, and why narrow task scoping plus persistent memory outperforms front-loaded context.
Cosmos is Augment Code's new operating system for agentic software development, now in public preview, designed to give engineering teams shared memory, tracing, and feedback infrastructure across every agent and session.
See how Cosmos turns traces, feedback, and memory into reusable agent improvements across sessions.
Free tier available · VS Code extension · Takes 2 minutes
Memory-augmented agent systems still show long-term degradation because external storage alone does not guarantee durable retrieval or reasoning quality across time scales. Benchmark studies measuring memory performance across time scales reveal consistent decay in memory-augmented agents:
| System | Weekly Remembering Score | Quarterly Remembering Score |
|---|---|---|
| MemoBase | 43.6 | 15.18 |
| MemoryOS | 51.84 | 25.05 |
| Mem0 | 40.42 | 19.90 |
Reasoning degradation proved more severe than remembering decay. MemoryOS dropped from 20.66 weekly to 5.50 quarterly on reasoning tasks. Across the full benchmark, 11 of 14 LLMs and agents performed worse at quarterly scale than weekly scale on reasoning tasks. Adding a vector store does not solve the stateless problem.
Other research on tool-execution decay over multi-day windows provides a specific operational metric: tool-execution success rates degrade 14 percentage points over 72-hour operation windows due to compounding failure modes in flat-file memory systems.
| Failure pattern | What the evidence shows |
|---|---|
| Remembering decay | Weekly scores fall sharply at quarterly scale across multiple memory systems. |
| Reasoning decay | MemoryOS drops from 20.66 weekly to 5.50 quarterly on reasoning tasks. |
| Tool-use decay | Tool-execution success rates fall 14 percentage points over 72 hours. |
Most production agent systems therefore lack any mechanism to carry forward what they learn. Corrections disappear, preferences reset, and the agent that understood codebase conventions on Friday forgets them by Monday.
The Learning Flywheel: Execute → Coach → Distill → Improve
The learning flywheel is a staged operational loop where each stage transforms a different artifact, moving from traces to feedback to reusable knowledge so production systems can preserve learning across runs.
Stage 1: Execute and Log Traces
Execution trace capture logs task, trajectory, and outcome so later coaching and distillation can convert raw runs into reusable knowledge.
The atomic unit of agent experience is a triplet: task, trajectory, and outcome. An execution record captures what the agent was asked to do, what happened during execution, and the resulting feedback.
Trace engineering involves a deliberate tradeoff: richer traces carry more usable learning signal but increase parsing overhead, and where a team draws that line determines how much execution detail its production system can preserve for a given workload. Recent work (August 2025) on decoupling agent execution from training pipelines shows how to extract reinforcement-learning signals from dynamic agent runs with minimal code modifications.
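As a minimal sketch, the task-trajectory-outcome triplet can be captured as a single record; the class and field names below are illustrative assumptions, not a standard schema:

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class ExecutionTrace:
    """One task/trajectory/outcome triplet. Field names are illustrative."""
    trace_id: str   # correlation ID, propagated to all sub-calls
    task: str       # what the agent was asked to do
    trajectory: list[dict[str, Any]] = field(default_factory=list)  # ordered steps: reasoning, tool calls, observations
    outcome: dict[str, Any] = field(default_factory=dict)           # result plus any feedback attached later

    def log_step(self, kind: str, **payload: Any) -> None:
        """Append one step, e.g. a tool call and its result, to the trajectory."""
        self.trajectory.append({"kind": kind, **payload})
```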
Stage 2: Coach with Improvement Signals
Coaching signals improve future agent behavior because human, automated, and environmental feedback turn completed runs into explicit signals the system can reuse in later decisions.
Coaching signals come from three source types, each balancing latency, cost, and reliability differently, and the mix determines how quickly the flywheel can improve later decisions:
- Human feedback: Engineers correct agent behavior directly. The key design constraint for agentic settings is that the agent must continue its learning process while awaiting human input, and humans need not provide continuous supervision.
- Automated coaching: Systems like TextGrad treat textual feedback as differentiable training signals. AutoRule converts reasoning traces and preference feedback into explicit rule-based rewards. Self-rewarding systems generate outputs or training examples and assess their own performance to create reward signals.
- Environmental signals: Task outcomes serve as reward signals without human annotation. Outcome-based feedback loops can operate without continuous human labeling when the environment exposes a measurable success signal.
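To make these channels concrete, here is a minimal sketch that normalizes the three signal types into one trace-linked record; the class and field names are assumptions, not a standard schema:

```python
from dataclasses import dataclass
from enum import Enum

class SignalSource(Enum):
    HUMAN = "human"              # explicit correction from an engineer
    AUTOMATED = "automated"      # e.g. rule-derived or self-generated reward
    ENVIRONMENT = "environment"  # measurable task outcome: tests passed, deploy succeeded

@dataclass
class CoachingSignal:
    trace_id: str                # provenance: which execution this feedback refers to
    source: SignalSource
    content: str                 # the correction, rule, or outcome description
    reward: float | None = None  # optional scalar for automated/environmental channels

# Example: an engineer's Slack correction becomes a durable, trace-linked signal.
signal = CoachingSignal(
    trace_id="run-001",
    source=SignalSource.HUMAN,
    content="Use the staging database fixture, not production snapshots, for integration tests.",
)
```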
Stage 3: Distill Experience into Reusable Knowledge
Experience distillation turns raw coaching signals into reusable knowledge formats so later runs can retrieve guidance directly.
Distillation converts raw coaching signals into formats the agent can retrieve and act on in future sessions. It has four production targets, each trading off deployment speed, inspectability, and retraining requirements, and that choice determines how quickly improvements can be reused:
| Distillation Target | Mechanism | Weight Updates? | Production Readiness |
|---|---|---|---|
| Prompt/context updates | Dynamic cheatsheets, cached plans, playbooks | No | High: no retraining infrastructure |
| Structured heuristics | Trigger conditions and recommended actions from trajectory analysis | No | High: inspectable, retrievable |
| Skill libraries | Executable procedures verified and stored for reuse | No | Medium: requires verification gate |
| Fine-tuning | Teacher-student distillation on operational data | Yes | Lower: requires training infrastructure |
Research on experiential learning in agents specifies a concrete heuristic format: after each task, the agent generates an analysis identifying what led to success or failure, then produces a guideline with explicit trigger conditions and recommended actions.
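A hedged sketch of that trigger-plus-action format as a storable record, with a deliberately naive retrieval helper (all names are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class Heuristic:
    """One distilled guideline: when the trigger matches, surface the action."""
    trigger: str   # condition describing when this guidance applies
    action: str    # recommended behavior for future runs
    evidence: list[str] = field(default_factory=list)  # trace IDs this was distilled from

heuristic = Heuristic(
    trigger="agent is writing an integration test that touches the database",
    action="Use the staging fixture and reset state in teardown.",
    evidence=["run-001"],
)

def retrieve(heuristics: list[Heuristic], context: str) -> list[Heuristic]:
    """Naive keyword overlap; a production system would use semantic retrieval."""
    words = set(context.lower().split())
    return [h for h in heuristics if words & set(h.trigger.lower().split())]
```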
Prompt and context distillation works as a pragmatic production starting point because it is inspectable, requires no retraining infrastructure, and supports fast iteration cycles.
Stage 4: Deploy the Improved Agent
Improved agent deployment closes the flywheel because distilled knowledge gets loaded back into later runs, allowing future executions to apply what earlier runs learned.
A Map-Shuffle-Reduce framework for prompt learning shows how parallel agents can process distinct context shards, generate reflections from their interactions, and hierarchically aggregate local updates into a coherent global context. The approach requires no retraining infrastructure.
Other research on training small agentic models with dual data flywheels introduces an additional property: the flywheel should continuously generate increasingly challenging tasks. Static synthetic datasets with fixed linear solution structures quickly saturate the learning signal.
Production deployment needs a quality gate: continuous monitoring and issue detection determine whether a learned update is safe to accept into the next loop iteration.
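One way such a gate could work, assuming a held-out set of evaluation tasks and a caller-supplied run_and_score function (both assumptions for illustration):

```python
from typing import Callable

def passes_quality_gate(
    candidate: list[str],
    baseline: list[str],
    eval_tasks: list[str],
    run_and_score: Callable[[str, list[str]], float],  # runs a task with given context, returns 0..1
    min_improvement: float = 0.0,
    max_context_words: int = 4000,  # crude word-count proxy for a token budget
) -> bool:
    """Accept a distilled update only if it beats the baseline on held-out tasks
    and does not bloat the context budget."""
    if sum(len(h.split()) for h in candidate) > max_context_words:
        return False  # guard against prompt bloat
    base = sum(run_and_score(t, baseline) for t in eval_tasks) / len(eval_tasks)
    cand = sum(run_and_score(t, candidate) for t in eval_tasks) / len(eval_tasks)
    return cand - base > min_improvement
```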
| Flywheel stage | Primary artifact | Main outcome |
|---|---|---|
| Execute | Task, trajectory, outcome | Captured trace for later analysis |
| Coach | Human, automated, or environmental feedback | Reusable improvement signal |
| Distill | Prompt updates, heuristics, skills, or fine-tuning data | Reusable knowledge |
| Improve | Updated context or deployed agent behavior | Better future runs |
What Makes an Agent "Learnable"
Agent learnability depends on whether the system can capture, store, retrieve, and safely reuse experience across runs. The distinction is architectural: a learnable agent writes useful knowledge outside model weights and loads that knowledge back into future decisions.
Agent learnability does not require model weight updates. Foundational systems described in the Reflexion paper and the Voyager paper largely improve through in-context mechanisms or external memory, accumulating structured knowledge outside the model's parameters. The CoALA cognitive architecture framework identifies four action types available to agents: grounding (environment interaction), reasoning (internal computation), retrieval (reading from memory), and learning (writing to memory).
The learning action, writing to long-term memory, is the one most often absent from non-learnable architectures. An agent loop that can only read from memory but never write to memory cannot learn. Cosmos addresses this gap by exposing a shared filesystem with tenant and private memory as a first-class platform service.
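As a sketch, and not CoALA's reference implementation, the four action types and the resulting learnability test reduce to a few lines:

```python
from enum import Enum

class AgentAction(Enum):
    GROUNDING = "grounding"  # act on the external environment (tools, APIs)
    REASONING = "reasoning"  # internal computation over working memory
    RETRIEVAL = "retrieval"  # read from long-term memory
    LEARNING = "learning"    # write to long-term memory: the step stateless agents lack

def is_learnable(supported: set[AgentAction]) -> bool:
    """An agent loop that reads memory but never writes it cannot learn."""
    return AgentAction.LEARNING in supported
```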
Five architectural properties separate learnable agents from stateless ones:
- Separated reasoning engine and knowledge store. The LLM has to be decoupled from the memory system so knowledge updates, corrections, and expansions can happen without retraining. LlamaIndex memory abstractions expose `memory.put_messages()` and `memory.get()` APIs as one example of this separation.
- Multi-type memory architecture. CoALA establishes four distinct memory types: episodic (records of specific experiences), semantic (generalized knowledge distilled from episodes), procedural (executable skills and behaviors), and what the paper calls in-context memory, which functions as working memory holding information for the current decision. Each requires distinct storage, retrieval, and write policies.
- Structured reflection mechanism. Raw experience has to be converted into storable knowledge through a reflection step. Reflexion reaches 91% pass@1 on HumanEval through verbal reflection on feedback signals stored in episodic memory, compared to 80% for the GPT-4 baseline reported in the original 2023 paper. The minimal loop runs feedback signal → reflection → memory write → improved future decisions.
- Multi-signal retrieval over pure semantic search. Generative Agents uses a three-signal memory retrieval score combining recency, importance, and relevance (sketched after this list). ExpeL's documented scaling limitation, concatenating all insights into every prompt regardless of relevance, demonstrates what happens when retrieval architecture is neglected. AutoGuide retrieves context- or state-relevant heuristics at each timestep by matching them to the current decision context.
- Verification gates on procedural writes. CoALA explicitly warns that writing to procedural memory is significantly riskier than episodic or semantic writes because incorrect procedures can introduce bugs or subvert design intentions. Voyager implements verification through self-verification during iterative prompting, checking task success and improving programs before moving on.
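The three-signal score can be sketched as a sum with exponentially decaying recency, in the spirit of Generative Agents; the equal weighting and decay rate below are illustrative, not the paper's exact parameters:

```python
import math
import time

def retrieval_score(
    memory_vec: list[float],
    query_vec: list[float],
    last_accessed: float,           # unix timestamp of the memory's last access
    importance: float,              # 0..1, assigned when the memory was written
    decay_per_hour: float = 0.995,  # illustrative decay rate
) -> float:
    """Recency + importance + relevance. Equal weights here; real systems tune them."""
    recency = decay_per_hour ** ((time.time() - last_accessed) / 3600.0)
    dot = sum(a * b for a, b in zip(memory_vec, query_vec))
    norm = math.hypot(*memory_vec) * math.hypot(*query_vec)
    relevance = dot / norm if norm else 0.0  # cosine similarity as the relevance signal
    return recency + importance + relevance
```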
| Learnable agent property | Why it matters |
|---|---|
| Separate memory writes | An agent that only reads from memory but never writes to memory cannot learn. |
| Multi-type memory | Episodic, semantic, procedural, and in-context (working) memory require different policies. |
| Reflection step | Raw experience has to become storable knowledge before reuse. |
| Retrieval strategy | Retrieval strategies that combine recency, importance, and relevance may outperform pure semantic search in some settings. |
| Verification gate | Procedural writes can introduce bugs or subvert design intentions. |
Narrow task scope is a prerequisite for effective learning. Recent agent frameworks and related research point to the importance of structured feedback, iterative prompting, and scoped skill acquisition. Effective learning agents are specialists that accumulate high-density feedback about a well-scoped domain.
Milo Demonstrates How Narrow Scope and Persistent Memory Improve Agents
Milo demonstrates how narrow scope and persistent memory improve agents because stored corrections let later runs apply validated guidance built from real engineering feedback.
Milo, Augment Code's internal tester agent used for QA and testing tasks within the company, offers a concrete example of the learning flywheel in a real deployed system. Its progression shows how narrowing scope and preserving corrections can outperform attempts to preload every relevant detail into the initial prompt.
What failed first. The team tried to load Milo with all available context about how testing is done at Augment, placing every documented testing practice directly into the initial prompt. The front-loaded approach failed. This mirrors the context rot problem identified in the research, where information present in the prompt is not effectively used by the model.
What worked instead. Two changes produced a functioning system:
- Milo was scoped to be the best testing expert at Augment, a specialist role with a tight domain.
- The agent was tuned for continuous learning and memory, accumulating context over time.
The feedback mechanism. When Milo ran a test and stumbled, an engineer on Slack would jump in to coach the agent. Those corrections were preserved as durable artifacts: Milo distills important information from those interactions and stores it persistently.
The flywheel cycle demonstrated by Milo:
- Agent runs a task and makes an error.
- A human provides a correction via Slack.
- The agent distills that correction into persistent memory.
- Future runs benefit from the stored correction.
- The cycle repeats, compounding improvement over time.
The agentic SDLC guide covers how effective feedback and team learning shape agent performance over time. With Cosmos's shared learning flywheel, Slack feedback gets distilled into persistent memory and survives session boundaries.
Milo's trajectory validates two patterns from the research at the same time: narrow scoping produces higher-quality learning signals, and persistent memory with human feedback outperforms larger initial prompts.
Infrastructure Requirements for Continuous Agent Learning
Continuous agent learning requires infrastructure that captures execution, scores outcomes, links feedback to traces, and safely deploys updates, because each layer turns raw activity into an artifact the next layer can use.
Layer 1: Execution trace capture. Trace engineering research defines three trace surfaces: cognitive (reasoning steps, decision points), operational (tool calls, timing, token counts), and contextual (environment state, user inputs). OpenTelemetry with AI-specific semantic attributes serves as the instrumentation standard. Trace correlation IDs have to propagate across all agent sub-calls, and session-ID threading has to group related spans for multi-turn evaluation.
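A minimal instrumentation sketch using the OpenTelemetry Python API; the gen_ai.* attribute follows the GenAI semantic conventions, while the execute dispatcher and tool.success attribute are hypothetical stand-ins:

```python
from opentelemetry import trace

tracer = trace.get_tracer("agent.instrumentation")

def execute(tool_name: str, args: dict) -> dict:
    """Stand-in for the real tool dispatcher."""
    return {"ok": True}

def run_tool(session_id: str, tool_name: str, args: dict) -> dict:
    # One span per tool call; the active trace context propagates to sub-calls,
    # giving the correlation-ID threading described above.
    with tracer.start_as_current_span(f"tool.{tool_name}") as span:
        span.set_attribute("session.id", session_id)  # groups spans for multi-turn evaluation
        span.set_attribute("gen_ai.operation.name", "execute_tool")  # GenAI semconv attribute
        result = execute(tool_name, args)
        span.set_attribute("tool.success", bool(result.get("ok")))
        return result
```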
Layer 2: Evaluation pipeline. Evaluation pipelines score agent behavior through automated rules, LLM judging, and human review so teams can decide whether a learned change should be accepted into production.
Three scorer types divide evaluation work by cost and quality; deterministic checks, model-based judgment, and human review answer different validation needs, and the mix determines whether production teams can trust learned changes:
- automated code-based rules for deterministic checks
- LLM-as-judge for nuanced quality assessment
- human review for ground truth
Online evaluation runs asynchronously with zero latency impact on production. Offline evaluation batches against curated, versioned datasets in CI/CD pipelines.
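The three tiers could sit behind a single scorer interface; this sketch wires up a deterministic rule and a placeholder judge (all names hypothetical), leaving human review as an offline queue:

```python
from typing import Callable

# Each scorer maps (task, output) -> score in [0, 1], or None to abstain.
Scorer = Callable[[str, str], float | None]

def rule_no_secrets(task: str, output: str) -> float | None:
    """Deterministic check: fail any output that embeds an obvious secret."""
    return 0.0 if "AWS_SECRET" in output else 1.0

def llm_judge(task: str, output: str) -> float | None:
    """Placeholder for a judge-model call with a rubric; abstains here."""
    return None  # wire to a real model in production

SCORERS: list[tuple[str, Scorer]] = [
    ("rules", rule_no_secrets),  # cheap and deterministic: runs on everything
    ("judge", llm_judge),        # costlier: typically run on a sampled subset
    # human review is the ground-truth tier, queued offline rather than called inline
]

def evaluate(task: str, output: str) -> dict[str, float]:
    return {name: s for name, fn in SCORERS if (s := fn(task, output)) is not None}
```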
Layer 3: Feedback collection. Feedback collection links explicit signals (corrections, ratings, comments) and implicit signals (task completion rates, re-queries, abandonment) back to source traces by ID. Without that provenance, tracing learned behaviors back to the trajectories that produced them becomes impossible, and learned changes cannot be audited later.
| Infrastructure layer | Core function |
|---|---|
| Execution trace capture | Record task activity across cognitive, operational, and contextual surfaces. |
| Evaluation pipeline | Score whether changes improve agent behavior. |
| Feedback collection | Link corrections and outcomes back to source traces. |
Layer 4: Knowledge distillation. Knowledge distillation converts evaluated feedback into prompt updates, heuristics, or fine-tuning data that later runs can actually use.
Production-viable distillation starts with prompt optimization and structured heuristics, with no weight updates required. Weight updates become an option later in the maturity curve, once prompt-level distillation can no longer carry the load.
Layer 5: Deployment and versioning. Learned updates need staged rollout, rollback planning, and quality gates before any improved agent version reaches production.
Layer 6: Monitoring and observability. Monitoring and observability close the flywheel because production traces, drift signals, and cost attribution reveal whether deployed learning actually improves future runs.
AI observability typically includes tracing of multi-step workflows and token-level cost attribution; some tools also support search across traces and drift monitoring to assess whether changes improve future runs. Monitoring signals feed back to Layer 1, closing the loop.
How Cosmos Implements the Agent Learning Flywheel
Cosmos implements the agent learning flywheel as shared system infrastructure: persistent memory, tracing, and feedback operate as platform services with consistent semantics across agents, sessions, and team workflows, so improvements compound instead of staying trapped in a single session.
Cosmos positions the learning flywheel as a named system service in a three-layer architecture, sitting between the application layer (expert agents) and the core runtime (Context Engine, event bus, agent runtime).
Three infrastructure components underlie the flywheel:
Shared virtual filesystem. Agents on Cosmos share and persist data across tenants and individual agent sessions. The agentic SDLC guide describes context as moving with work across agents and stages, with governance and review gates helping preserve continuity from planning through implementation and review; tenant-shared and private memory move with the task across handoffs, so feedback stays attached to work across the SDLC.
Two-tier memory model. Tenant memory carries organization-shared context, feedback, and policy across agents and across the development lifecycle: when one agent on a team is corrected, the correction propagates to improve the same role for every team member, as the agent runtime guide explains. Private memory stores user-specific or session-specific context not shared across the team.
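As a generic illustration of the lookup order such a model implies, and not Cosmos's actual API:

```python
class TwoTierMemory:
    """Generic two-tier lookup: tenant-wide knowledge shared across a team,
    private entries scoped to one user or session. Illustrative only."""

    def __init__(self) -> None:
        self.tenant: dict[str, str] = {}              # shared: org-wide corrections and policy
        self.private: dict[str, dict[str, str]] = {}  # keyed by user or session

    def write(self, key: str, value: str, user: str | None = None) -> None:
        if user is None:
            self.tenant[key] = value  # a correction here benefits every agent in the role
        else:
            self.private.setdefault(user, {})[key] = value

    def read(self, key: str, user: str) -> str | None:
        # Private context wins for this user; otherwise fall back to shared knowledge.
        return self.private.get(user, {}).get(key, self.tenant.get(key))
```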
Context Engine foundation. The Context Engine maintains live understanding of the codebase, while memory operates as a separate but integrated persistence layer. The Context Engine processes entire codebases across 400,000+ files, giving agents architectural-level understanding rather than file-by-file snippets.
Cosmos receives flywheel input signals from three categories of SDLC tooling beyond IDE interactions: issue trackers (Linear), team communication (Slack), and CI pipelines. The Expert Registry serves as the organizational scaling mechanism: when someone on a team figures out an effective pattern, it lands in the registry and becomes available to the whole team.
Narrow scoping with continuous learning outperforms front-loaded context because specialist agents accumulate denser feedback and preserve corrections across runs. Shared experts and knowledge become available to the whole team as a first-class artifact of the workflow. Cosmos ships reference experts (Deep Code Review, PR Author, E2E Testing, Incident Response) and provides tooling for teams to create custom experts scoped to their specific workflows.
Build Agent Systems That Preserve Corrections Across Runs
Persistent agent learning is the central architectural decision: systems that preserve corrections, feedback, and successful procedures can apply them in later runs with a fidelity that front-loaded prompts cannot reliably match. Front-loading context may feel simpler, yet the available evidence on context rot points consistently in one direction: broad initial loading degrades performance, while narrower scope helps maintain it over time.
A practical adoption path:
- start with trace capture
- connect feedback to those traces
- write distilled heuristics back into a memory layer separate from the reasoning engine
- add a quality gate before deploying any improvement, checking that the new heuristic actually improves future runs without bloating the prompt
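Pulling those steps together, one flywheel iteration might look like the following sketch, with each stage supplied by the caller (all names illustrative):

```python
from typing import Callable

def flywheel_iteration(
    task: str,
    memory: list[str],
    run_agent: Callable[[str, list[str]], dict],            # Stage 1: execute, return a trace
    collect_feedback: Callable[[dict], list[str]],          # Stage 2: coaching signals for the trace
    distill: Callable[[list[str]], list[str]],              # Stage 3: signals -> heuristics
    quality_gate: Callable[[list[str], list[str]], bool],   # Stage 4: accept or reject
) -> list[str]:
    """One pass of the loop; callers supply each stage function."""
    trace = run_agent(task, memory)
    signals = collect_feedback(trace)
    candidate = memory + distill(signals)
    return candidate if quality_gate(candidate, memory) else memory  # keep known-good knowledge on reject
```

The quality_gate argument could be, for example, a partial application of the passes_quality_gate sketch from Stage 4.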
See how Cosmos carries tenant-shared and private memory between agents and preserves context and feedback across handoffs.
Free tier available · VS Code extension · Takes 2 minutes
Written by

Ani Galstian
Ani writes about enterprise-scale AI coding tool evaluation, agentic development security, and the operational patterns that make AI agents reliable in production. His guides cover topics like AGENTS.md context files, spec-as-source-of-truth workflows, and how engineering teams should assess AI coding tools across dimensions like auditability and security compliance.