
Cosmos Experts: AI Agents That Learn From Feedback

May 9, 2026
Ani Galstian

Cosmos Experts are a specialized agent architecture inside Augment Cosmos where narrow task scope, shared memory, and coaching-based feedback let agents improve over time on domain-specific work. Cosmos itself is Augment Code's operating system for agentic software development, where developers, agents, codebases, tools, and memory coexist and coordinate across the SDLC.

When teams use Cosmos Experts for repeated domain workflows, agents get better with each run because narrow scope, shared memory, and coaching preserve reusable corrections instead of resetting each session. Experts are the architectural pattern inside Cosmos where persistent feedback and shared context turn live engineering work into durable team knowledge.

TL;DR

Cosmos Experts matter most in large software environments because repeated work, shared conventions, and live feedback determine whether agent quality compounds or resets. General-purpose AI agents plateau on domain tasks because they lack persistent memory and carry too broad a scope. Cosmos addresses this through narrow task scoping, shared memory, and coaching delivered through Slack in documented examples.

Where Agent Quality Stops Improving

Engineering teams adopting AI coding agents hit a predictable wall. The agent writes passable code on day one, then repeats the same mistakes on day ninety, forgetting last week's debugging session, dropping naming conventions after a few turns, and treating each task as a cold start. Benchmark results back this up: AI coding agents still post substantial failure rates on structured evaluations, and they tend to do worse on the longer, reasoning-heavy tasks that look more like real engineering work.

The problem usually shows up in three places:

  • The agent repeats mistakes instead of improving over time.
  • Team conventions disappear after a few turns.
  • Each new task starts from reconstructed context instead of shared memory.

Team-wide adoption creates a second failure mode: expertise stays trapped in individual setups instead of becoming shared operating knowledge. Personal configs hold the best prompts, no shared quality signal emerges, and humans only get pulled in at the final PR. The rest of this guide explains why general-purpose agents plateau, how the Cosmos Expert pattern works, what the Milo experiment revealed about frontloaded instruction, and how custom Experts compound knowledge across the organization.

See how Cosmos turns repeated domain workflows into shared, reusable agent knowledge.

Explore Cosmos

Free tier available · VS Code extension · Takes 2 minutes

Why General-Purpose Agents Plateau on Domain Tasks

General-purpose AI coding agents degrade predictably on domain-specific work because they combine broad scope, weak persistence, and unstable adaptation in environments that demand project-specific judgment. Three failure drivers compound on top of each other in everyday work:

  • Training distribution mismatch: Domain-specific tasks require knowledge of internal libraries or operational logic absent from training data, so agents produce syntactically valid but functionally incorrect output.
  • Architectural understanding degradation: Performance drops as the system has to reason across more files, more dependencies, and more interacting conventions.
  • No project memory: Institutional knowledge such as architectural decisions, debugging approaches, and team conventions gets reconstructed from scratch on every invocation.

Training distribution mismatch is the most fundamental driver. LLMs are trained on publicly available, general-purpose code, so when a domain-specific task requires knowledge of internal libraries or operational logic absent from that training data, agents produce syntactically valid but functionally incorrect output. Fine-tuning can help, though maintenance costs scale poorly against continuously updated libraries.

Architectural understanding degradation compounds the problem. Even when relevant codebase information is available, performance drops as the system has to reason across more files, more dependencies, and more interacting conventions. The challenge is preserving usable architectural understanding across a large codebase, beyond simply giving the model access to more text. Cosmos addresses this by running every Expert on top of the Context Engine, which preserves cross-file relationships across repositories of 400,000+ files through semantic dependency graph analysis. Teams evaluating large-repository behavior often weigh this failure mode alongside approaches for preventing context loss in long-running agent sessions.

No project memory means every session is a cold start. The institutional knowledge that experienced engineers carry across sessions, including architectural decisions, prior debugging approaches, and team conventions, gets reconstructed from scratch on every invocation. Single-file context documents like .cursorrules or CLAUDE.md are often insufficient for larger codebases, as several 2025-2026 studies on codified context and instruction-following in code tasks suggest.

Three additional failure modes show up alongside the primary drivers:

  • API hallucination: Static model weights and dynamic tool environments lead to incorrect tool use and broken integrations.
  • Agentic task collapse: Multi-step workflow coordination failures cause quality to drop across long-horizon tasks.
  • Benchmark overfitting: Evaluation on narrow public benchmarks means benchmark results do not transfer cleanly to real teams.

The collective effect is predictable. An agent that scores well on a public benchmark may still perform poorly on a team's proprietary codebase, because that score reflects performance on a distribution the agent was never evaluated against.

The Expert Pattern: Scope, Learn, Compound

The Cosmos Expert pattern improves domain performance through three connected architectural moves: narrow scope reduces interference, persistent memory preserves team knowledge, and shared learning turns isolated corrections into reusable operating knowledge.

The pattern works through three linked moves, sketched in code after this list:

  1. Scope the agent to one task.
  2. Preserve domain-specific memory.
  3. Share learning across later runs and across the team.
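
The article describes this pattern in prose rather than as a public API. As a minimal sketch of the shape the three moves imply, where every name (Expert, MemoryStore, ExpertRegistry) is a hypothetical stand-in rather than a Cosmos interface:

```python
from dataclasses import dataclass, field

@dataclass
class MemoryStore:
    """Hypothetical persistent store for one Expert's domain memory."""
    corrections: list[str] = field(default_factory=list)

    def add(self, lesson: str) -> None:
        self.corrections.append(lesson)

@dataclass
class Expert:
    """One Expert: a single task scope plus its own persistent memory."""
    name: str
    task_scope: str  # exactly one workflow, e.g. "e2e-testing"
    memory: MemoryStore = field(default_factory=MemoryStore)

class ExpertRegistry:
    """Shared registry: one engineer's coaching becomes team-wide context."""

    def __init__(self) -> None:
        self._experts: dict[str, Expert] = {}

    def register(self, expert: Expert) -> None:
        self._experts[expert.task_scope] = expert

    def for_task(self, task_scope: str) -> Expert:
        # Every run retrieves the same Expert, so accumulated memory
        # carries forward instead of resetting to a cold start.
        return self._experts[task_scope]
```

The three moves map directly onto the three types: task_scope constrains, MemoryStore persists, and ExpertRegistry shares.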

Scope: Constrain the Agent to One Task

Single-task agent scoping improves reliability because a narrow role limits interference from unrelated tools, goals, and decision paths.

Agent performance degrades when the toolkit expands and responsibilities blur. Specialized agents operate within clearly defined boundaries where each agent handles a distinct role: retrieval, reasoning, validation, or monitoring. Microsoft Research's AIOpsLab evaluates AIOps agents along these lines and highlights their capabilities and limitations on complex operational tasks in cloud environments.

The scoping decision is architectural rather than cosmetic. A testing Expert does not also handle deployment, and a code review Expert does not also triage incidents. Teams comparing review automation patterns can weigh this distinction against leading AI review tools tested in head-to-head evaluations.

Learn: Accumulate Domain-Specific Memory

Domain-specific memory turns scoped work into compounding performance because the agent retains the exact corrections, conventions, and procedures that matter inside one operating domain.

Scoping an agent narrows what it has to do, but compounding performance also requires memory architecture that preserves corrections and conventions across repeated workflows. Cosmos shared memory carries that context between agents, so each run starts further forward than the last.

Three memory layers matter for this pattern:

  1. Episodic memory captures specific events: actions taken, errors encountered, feedback received. A scoped agent's episodic memory grows denser without dilution from adjacent domains.
  2. Domain-specific memory isolation prevents cross-contamination. A code review agent can accumulate knowledge of recurring defects and team conventions through memory systems.
  3. Procedural memory refines operating procedures based on feedback. Human corrections update an agent's instructions durably, beyond the bounds of a single session.

These three layers explain why repeated work compounds across runs.
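
As an illustration of how the layers differ in what they store and how long it persists, here is a hypothetical sketch; the class and method names are assumptions, not Cosmos APIs:

```python
class LayeredMemory:
    """Illustrative memory for one scoped Expert (hypothetical)."""

    def __init__(self, domain: str) -> None:
        self.domain = domain                  # isolation key: one domain only
        self.episodic: list[dict] = []        # specific events and feedback
        self.procedural: dict[str, str] = {}  # durable operating instructions

    def record_event(self, action: str, outcome: str) -> None:
        # Episodic layer: grows denser because only this domain writes here,
        # with no dilution from adjacent domains.
        self.episodic.append({"action": action, "outcome": outcome})

    def correct_procedure(self, name: str, instruction: str) -> None:
        # Procedural layer: a human correction rewrites the standing
        # instruction, so it survives beyond the session that produced it.
        self.procedural[name] = instruction
```

Domain isolation shows up here as the constructor argument rather than a separate store: each Expert gets its own LayeredMemory, which is what prevents cross-contamination.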

Compound: Knowledge Transfers Across the Team

Shared Expert memory compounds team performance because one engineer's correction becomes reusable knowledge for every later run in the same workflow.

New runs start from accumulated context because corrections persist across the team and survive the session. Every new agent inherits the context the team has already built up. When one engineer coaches an Expert through a tricky edge case, that learning becomes available to the whole team through the shared Expert Registry.

In practice, this pattern combines role decomposition with durable shared learning, which is exactly the orchestration gap visible across most open-source agent orchestrators where coordination work still falls on the human's plate.

Why Frontloading Context Fails: The Milo Story

Frontloaded instruction fails in production agent workflows because larger static instruction sets do not create durable project understanding, while scoped coaching and persistent memory preserve reusable corrections.

Milo, an internal testing agent built at Augment Code, is the anchor example for why Cosmos Experts work the way they do. The team tried the intuitive approach first, loading all the guidance about how testing is done at Augment Code directly into the agent's initial instructions, and the approach broke down quickly under real workloads.

The Milo sequence is straightforward:

  1. The team frontloaded Milo with testing guidance in the initial instructions.
  2. The larger instruction set did not create durable project understanding.
  3. Performance degraded as volume and interaction history grew.
  4. The team shifted to scoped coaching and persistent memory.

Multiple 2025-2026 studies on long and complex inputs help explain why frontloaded instruction fails. Performance degradation can come from input volume itself, even when relevant instructions are present and repeated, and a large instruction set does not guarantee the agent applies all of it consistently. The volume of context becomes its own failure mode.

Multi-turn degradation can make the problem worse over time. Research across many LLMs found meaningful drops in multi-turn settings as the interaction history grew, with models retaining their capabilities but applying them less consistently as more material accumulated around the task. Anthropic's April 2026 postmortem describes how a seemingly small system-prompt change adding strict length limits contributed to a measurable production-quality regression in Claude Code, even though the change had passed multiple internal reviews and tests before deployment. The failure was only detected through post-deployment user feedback.

Scoped Coaching and Persistent Memory Preserve Reusable Corrections

Scoped coaching and persistent memory improve production agent workflows because a narrow operating domain and stored corrections turn live feedback into reusable task knowledge.

Milo improved when the testing workflow was narrowed and the learning loop was made persistent, because coaching during live work produced reusable testing knowledge. The team scoped Milo to specialize in the internal testing workflow and tuned it for continuous learning and memory. The approach shifted from frontloading every detail Milo might need to letting Milo learn through coaching: when Milo stumbled, an engineer on Slack would jump in to help, and Milo distilled the important information from those conversations and stored it.

The Cosmos learning flywheel applies the same pattern at platform level. Live Slack corrections get distilled and stored, so coaching-driven testing workflows accumulate reusable task knowledge over time. Agent performance depends on preserving environment-specific feedback, which is why a shared learning flywheel matters more than a larger initial prompt. Teams assessing testing-specific options often weigh this approach alongside practical patterns for AI test generation.
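
The loop itself is only described in prose; as a hedged sketch of the distill-and-store step, where the function and the naive prefix-stripping stand in for whatever summarization Cosmos actually performs:

```python
def distill_coaching(thread: list[str], stored_lessons: list[str]) -> None:
    """Turn a live coaching thread into stored task knowledge (illustrative)."""
    # 1. Keep only the engineer's corrections, not the agent's own messages.
    corrections = [m for m in thread if m.startswith("engineer:")]
    # 2. Distill: a real system would summarize with an LLM; this sketch
    #    just strips the speaker prefix to stay self-contained and runnable.
    lessons = [m.removeprefix("engineer:").strip() for m in corrections]
    # 3. Persist, so the next run starts from this context instead of zero.
    stored_lessons.extend(lessons)

memory: list[str] = []
distill_coaching(
    [
        "agent: e2e test flaked against staging",
        "engineer: staging resets nightly at 02:00 UTC; rerun after the reset",
    ],
    memory,
)
print(memory)  # ['staging resets nightly at 02:00 UTC; rerun after the reset']
```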

Continuous Learning via Coaching Conversations

Continuous learning in Cosmos works because engineers correct agents during real work, and the system converts those corrections into persistent operating knowledge.

The Milo workflow generalizes to a coaching mechanism that any Cosmos Expert can use. Engineers provide feedback during real work, and the agent distills and stores the important information from those conversations. Cosmos includes a persistent memory feature that retains useful context from conversations, so the agent extracts and persists what matters for future runs.

Cosmos Coaching Uses Two Types of Feedback

Cosmos coaching compounds fastest when feedback changes both the current task and the agent's future reasoning. The platform distinguishes between two kinds of coaching:

  1. Task corrections: Fix the immediate output. An engineer tells the agent the test assertion is wrong, and the agent corrects it.
  2. Mental model corrections: Teach the underlying reasoning. An engineer explains how prioritization should work for a specific kind of feedback going forward, and the agent remembers, so the next day's prioritization comes back better.

Mental model corrections produce compounding returns because one explanation updates the agent's future reasoning durably. Human leverage shifts from doing the prioritization to teaching the priority function, and the team's effort moves from repetitive execution to one-time instruction that persists.
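
The distinction can be made concrete with a small sketch; the enum and handler are illustrative, not Cosmos interfaces:

```python
from enum import Enum, auto

class FeedbackKind(Enum):
    TASK_CORRECTION = auto()          # fixes the immediate output only
    MENTAL_MODEL_CORRECTION = auto()  # updates future reasoning durably

def apply_feedback(
    kind: FeedbackKind,
    detail: str,
    current_output: list[str],
    standing_rules: list[str],
) -> None:
    if kind is FeedbackKind.TASK_CORRECTION:
        # One-off: patch this run's output; nothing carries forward.
        current_output.append(detail)
    else:
        # Compounding: the rule persists and shapes every later run.
        standing_rules.append(detail)
```

In this framing, the compounding return is simply that standing_rules outlives the session while current_output does not.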

Cosmos Uses a Checkpoint Model to Reduce Interruption Overhead

The Cosmos checkpoint model reduces interruption overhead by concentrating human judgment at a few high-leverage review points. Eight human interruptions in a typical SDLC loop become three checkpoints:

  • Prioritization: The agent monitors channels, aggregates signal, and proposes priorities; the human corrects priorities and/or teaches the priority function.
  • Spec and intent review: The agent opens PRs and drafts implementation; the human reviews specs before agents write, test, and review code.
  • Code evolution: The agent performs deep code review for recall and surfaces shifting assumptions; the human maintains codebase understanding and ships with confidence.
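
A minimal sketch of the gating idea, assuming agent steps are routed through a single dispatcher (the step names come from the list above; everything else is hypothetical):

```python
from typing import Callable

CHECKPOINTS = ("prioritization", "spec and intent review", "code evolution")

def run_step(
    step: str,
    agent_result: dict,
    ask_human: Callable[[str, dict], dict],
) -> dict:
    """Route only the three checkpoint steps to a human reviewer."""
    if step in CHECKPOINTS:
        # Human judgment is concentrated at these high-leverage points.
        return ask_human(step, agent_result)
    # All other steps proceed without interruption.
    return agent_result
```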

Continuous learning from real-world feedback supports the coaching approach over static prompts because adaptation turns live corrections into future behavior. Recent benchmark work on multi-step software-engineering tasks shows that systems combining reinforcement learning, iterative refinement, and careful agent design can outperform simpler static-prompt baselines on complex tasks; in one study on MLE-bench, performance varied across the evaluated systems and appeared to improve with exactly those design choices.

See how Cosmos stores coaching feedback as reusable operating knowledge, so agents improve without repeated human restarts.

Try Cosmos

Free tier available · VS Code extension · Takes 2 minutes

```
# ci-pipeline
$ cat build.log | auggie --print --quiet "Summarize the failure"
Build failed due to missing dependency 'lodash'
in src/utils/helpers.ts:42
Fix: npm install lodash @types/lodash
```

Reference Experts: What Ships with Cosmos

Cosmos Reference Experts apply the same scoped-memory architecture to distinct SDLC workflows, so teams can reuse the pattern across review, implementation, testing, and incident response. The four Reference Experts cover the main operating pattern from review through incident handling, and each one shows a different angle on scope and compounding knowledge.

  • Deep Code Review (high-recall review): Scope is defined by the agent optimizing for recall while the human reviews shifts in risk and architecture; knowledge compounds as the review system evaluates changes against architectural context across the codebase.
  • PR Author (implementation to merge-ready PR): The agent handles implementation through the spec and intent review checkpoint; specs return for human review before agents independently write, test, and review code.
  • E2E Testing (testing against real infrastructure): The workflow specializes in environment-specific testing work; each run adds reusable testing knowledge for the team's environment through coaching.
  • Incident Response (live operational incidents): Multiple tightly scoped agents coordinate around one incident; knowledge from every incident compounds for the next one.

Deep Code Review Applies High-Recall Review to Agent Workflows

Deep Code Review changes the objective of code review for agent workflows because an agent can optimize for recall while the human reviews only the shifts in risk and architecture that matter.

Deep Code Review reframes the assumptions of conventional code review tooling. Traditional tools optimize for precision: surface the highest-importance issues and keep noise low, respecting the human reader's time. When the reviewer is an agent, the goal shifts toward recall, because the agent can scan every line and surface anything worth a human's attention.

For the human, the experience surfaces the places where key assumptions are shifting and the decisions worth a second look. Agents perform an initial bug scan and risk triage, low-risk changes get auto-approved, and higher-risk changes get a collaborative review where the agent surfaces key risks and architectural decisions. Deep Code Review evaluates changes against architectural context across the codebase, so high-recall review workflows catch issues that diff-level tools miss.
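
As a sketch of that recall-first split, assuming each change carries a risk score produced by the initial scan (the field name and threshold are assumptions, not the Deep Code Review implementation):

```python
def triage_changes(changes: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split scanned changes into auto-approved and human-reviewed sets."""
    auto_approved = [c for c in changes if c["risk_score"] < 0.3]
    # Everything touching risk or architecture escalates to a collaborative
    # review, where the agent surfaces what it found on every line.
    needs_human = [c for c in changes if c["risk_score"] >= 0.3]
    return auto_approved, needs_human
```

The recall objective lives upstream of this split: the scan keeps every candidate finding, and the threshold only decides who looks at it.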

PR Author Turns Implementation Requests into Merge-Ready Pull Requests

PR Author moves implementation requests through to merge-ready pull requests with human review in the loop. During the spec and intent review checkpoint, parallel agents can open PRs or take a first stab at work, with specs returning for human review before agents independently write, test, and review code.


E2E Testing Accumulates Reusable Testing Knowledge

E2E Testing shows how coaching improves environment-specific reliability because each run adds reusable testing knowledge for the team's own infrastructure. The Milo story, described earlier, is the operational narrative for this Expert pattern. E2E Testing validates against real infrastructure, accumulating test-specific knowledge through coaching that makes each subsequent run more accurate for the team's environment.

Incident Response Coordinates Tightly Scoped Agents Around Live Incidents

Incident Response uses the most granular role decomposition in Cosmos because incident work benefits from multiple tightly scoped agents coordinating around one live operational problem.

Incident Response aims for a granular multi-agent breakdown where specialized agents handle triage, investigation, implementation, and coordination, with humans supervising and stepping in on higher-risk decisions. A coordinated set of agents works the incident together: triager, investigator, PR author, Slack coordinator, and SRE, all orchestrated by an Incident Coordinator. The knowledge from every incident compounds for the next one.

The incident workflow centers on these scoped roles:

  • triager
  • investigator
  • PR author
  • Slack coordinator
  • SRE
  • Incident Coordinator

The product page uses an alerting scenario to illustrate the workflow. The differentiated point is a specialized incident response system built around scoped agent roles and shared learning.
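
The role list above maps naturally to a fan-out structure; here is a hypothetical sketch (the role names come from the article, the dispatch mechanics do not):

```python
from typing import Callable

INCIDENT_ROLES = ("triager", "investigator", "pr_author", "slack_coordinator", "sre")

class IncidentCoordinator:
    """Illustrative coordinator fanning one incident out to scoped roles."""

    def __init__(self, handlers: dict[str, Callable[[dict], str]]) -> None:
        self.handlers = handlers  # one narrowly scoped agent per role

    def run(self, incident: dict) -> dict[str, str]:
        notes: dict[str, str] = {}
        for role in INCIDENT_ROLES:
            # Each agent works only its own slice of the incident; humans
            # supervise and step in on higher-risk decisions.
            notes[role] = self.handlers[role](incident)
        return notes  # retained so the next incident starts from this knowledge
```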

Building Custom Experts: Describe, Build, Register

Custom Experts turn a plain-language workflow description into a reusable team asset because Cosmos can build, register, and share specialized agent patterns across the organization. Custom Expert creation follows three steps that move from natural language description to team-wide availability.

Step 1: Describe it. A developer writes a plain-language description of the desired workflow. Example: "Build me a security scanner for our APIs that runs weekly."

Step 2: An agent builds the Expert and wires up dependencies. Cosmos ships with a knowledge base of the best agent shapes the applied team has built. When a customer creates a new Expert, the agent leans on that knowledge base to figure out how to build it, and the knowledge base improves as more teams build on it.

Step 3: The Expert lands in the registry, available to the whole team. The system supports collaboration among Experts, so when someone on a team figures out a strong pattern, the rest of the team can find and reuse it.

The three steps map cleanly to three organizational outcomes:

  • Describe it: A developer writes a plain-language workflow description, so the workflow becomes explicit enough to build.
  • Build it: An agent builds the Expert and wires up dependencies, so existing knowledge about strong agent shapes gets reused.
  • Register it: The Expert lands in the registry for the whole team, so strong patterns become reusable team assets.
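
The sequence can be summarized as a hypothetical call flow; none of these function names are the Cosmos SDK:

```python
def create_custom_expert(description: str, registry: dict) -> dict:
    """Describe, build, register: an illustrative call sequence."""
    # Step 1: the plain-language description is the entire input.
    spec = {"description": description}
    # Step 2: a builder agent would consult the knowledge base of proven
    # agent shapes here; the sketch just stamps a placeholder shape.
    expert = {"spec": spec, "shape": "scoped-scheduled-scanner"}
    # Step 3: registration makes the pattern reusable by the whole team.
    registry[description] = expert
    return expert

team_registry: dict = {}
create_custom_expert(
    "Build me a security scanner for our APIs that runs weekly",
    team_registry,
)
```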

Plug-in Architecture

The Cosmos plug-in architecture lets new capabilities slot in through plugins without modifying the existing stack. New Experts plug in without being rewired into the stack each time, while the platform already understands the organization's build, testing, code review process, and deployment pipeline. A shared Expert Registry lets patterns built by one team compound across the organization, so strong patterns spread beyond a single engineer's configuration.
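
One way to picture the plug-in boundary is a registration-only contract; this Protocol is a hypothetical illustration, not the Cosmos plugin API:

```python
from typing import Protocol

class ExpertPlugin(Protocol):
    """Hypothetical plugin contract: the platform stack never changes."""
    name: str

    def handle(self, task: dict) -> dict: ...

PLUGINS: dict[str, ExpertPlugin] = {}

def register_plugin(plugin: ExpertPlugin) -> None:
    # New Experts slot in by registration alone; the existing build, test,
    # review, and deployment integrations stay untouched.
    PLUGINS[plugin.name] = plugin
```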

Ship Agents That Remember What Your Team Already Taught Them

The real tradeoff sits between static prompting and systems that keep learning after deployment. Frontloading more rules into an initial instruction set feels simpler, but the evidence in this article shows that larger instruction bundles and longer interaction histories introduce their own failure modes. A better next step is to pick one workflow where engineers repeatedly restate the same guidance, then turn that workflow into a scoped Expert with coaching checkpoints and shared memory. Start where the repetition is obvious: prioritization, review, testing, or incident response. That is where persistent learning starts to matter operationally, because each correction improves the next run and persists across the team.

See how Cosmos turns coaching conversations into persistent team-wide memory for repeated workflows.

Try Cosmos

Free tier available · VS Code extension · Takes 2 minutes


Written by

Ani Galstian

Ani writes about enterprise-scale AI coding tool evaluation, agentic development security, and the operational patterns that make AI agents reliable in production. His guides cover topics like AGENTS.md context files, spec-as-source-of-truth workflows, and how engineering teams should assess AI coding tools across dimensions like auditability and security compliance.
