The best AI agent evaluation tools for production teams are Braintrust, Arize Phoenix, Promptfoo, Galileo, and Cosmos. Each one solves a different slice of the agent quality problem: Braintrust for CI/CD-integrated eval workflows, Arize Phoenix for open-source production observability, Promptfoo for security red-teaming, Galileo for cost-efficient hallucination detection via purpose-built small models, and Augment Code's Cosmos (currently in public preview) for carrying evaluation feedback forward through its learning flywheel.
TL;DR
AI agents fail in ways pass/fail tests miss: multi-step reasoning errors, wrong logic behind correct outputs, and post-deployment drift. I reviewed five tools across behavioral metrics, continuous evaluation, and trace depth. No single platform covered all three, so most teams will need multiple tools plus a way to reuse corrections. Cosmos, Augment Code's agent operating system in public preview, is included because it addresses the post-evaluation reuse layer that the other four leave open.
Where Agent Failures Actually Show Up
I spent the last several weeks testing evaluation tools against production agent workflows, and the gap between what traditional testing catches and what actually breaks in production is wider than many teams expect. A 2025 arXiv preprint on multi-agent failures reports that 17.14% of agent failures are step repetitions and 13.98% are mismatches between reasoning and action. Both failure modes slip past a final-output check. The same dynamic shows up in broader agent observability work, where multi-step traces routinely surface failures that final-output assertions never see.
The deeper problem I ran into was what happens after teams catch failures. One engineer corrects an agent prompt, that knowledge stays in their personal config, and the rest of the team rediscovers the same failure next week. When I tested the post-eval workflow across these tools, I found no clear public documentation comparing correction reuse for the same agent role across a team.
This guide breaks down what each tool does well, where each falls short, and how production teams should think about combining them. Cosmos is included as the fifth option because it focuses on a different layer: turning one-off corrections into team-wide memory.
See how Cosmos carries corrections forward across the same agent role, so teams stop rediscovering the same failures.
Free tier available · VS Code extension · Takes 2 minutes
Why Traditional Testing Fails for Non-Deterministic Agents
Traditional software testing assumes deterministic outputs: given input A passed to function B, output C is always identical. LLM-based agents violate this assumption structurally. The same semantically correct answer can appear in dozens of syntactically distinct forms, and checking response == expected_response fails the moment an agent uses slightly different wording.
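As a concrete illustration (a minimal, tool-agnostic sketch, not any vendor's API), an exact-match assertion rejects a paraphrase a human grader would accept, while even a crude similarity check passes it:

```python
import difflib

def exact_match(response: str, expected: str) -> bool:
    # Traditional assertion: any rewording fails, even when the meaning is identical.
    return response.strip() == expected.strip()

def fuzzy_match(response: str, expected: str, threshold: float = 0.8) -> bool:
    # Crude stand-in for a semantic check; production eval tools use embeddings or an LLM judge instead.
    ratio = difflib.SequenceMatcher(None, response.lower(), expected.lower()).ratio()
    return ratio >= threshold

expected = "The deployment finished successfully at 10:42 UTC."
response = "Deployment finished successfully at 10:42 UTC."

print(exact_match(response, expected))  # False: the string comparison rejects a correct answer
print(fuzzy_match(response, expected))  # True: character overlap here is roughly 0.96
```

Fuzzy string similarity is only a stopgap; it still penalizes answers that are correct but worded very differently, which is why most of the tools below lean on LLM-as-judge or embedding-based scorers.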
Recent work on stochastic evaluation of agents argues this point formally. Standard agentic benchmarks reduce complex stochastic processes to a single leaderboard number from one run. That number reflects one sample from a distribution and gives an incomplete picture of how the system behaves across runs. A 95% confidence interval on 30 samples spans a range too wide to distinguish good agents from mediocre ones reliably.
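To make the sample-size point concrete, here is the arithmetic for a hypothetical agent that passes 24 of 30 eval cases (numbers chosen for illustration), using the normal approximation for a 95% interval:

```python
import math

passes, n = 24, 30                                # hypothetical run: 24 of 30 eval cases pass
p = passes / n                                    # observed pass rate: 0.80
half_width = 1.96 * math.sqrt(p * (1 - p) / n)    # normal-approximation 95% confidence interval

print(f"pass rate: {p:.2f}")
print(f"95% CI: [{p - half_width:.2f}, {p + half_width:.2f}]")  # [0.66, 0.94]
```

A spread that wide (roughly 29 points) means an agent scoring 85% and another scoring 72% are statistically indistinguishable on that test set; narrowing the interval to a few points requires hundreds of cases or repeated runs.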
I mapped out six specific failure modes that traditional testing misses entirely.
Six Failure Modes Beyond Pass/Fail
The table below summarizes the failure modes I observed most often in production agent runs, along with the specific reason each one slips past traditional output-based tests. These are the patterns evaluation tools need to catch directly:
| Failure Mode | What Breaks | Why Traditional Tests Miss It |
|---|---|---|
| Output variability | Same correct answer expressed differently across runs | String-match assertions reject semantically correct responses |
| Error cascades | Single wrong intermediate step corrupts all downstream reasoning | End-state tests only check final output |
| Wrong tool path | Correct answer reached via wrong tool or unsafe sequence | Output validation ignores execution path |
| Behavioral drift | Agent degrades after model provider updates | Static test suites use mocked LLM responses |
| Correct answer, wrong reasoning | Hallucinated intermediate logic produces right final answer | Pass/fail rewards this and generates no signal |
| Statistical invalidity | Small test sets produce unreliable confidence intervals | Engineers validate against 20-30 hand-curated examples |
The multi-step reasoning problem matters because a single intermediate mistake can pass a final-output check while still corrupting the full workflow. A research agent might correctly retrieve competitor information, misattribute a product feature to the wrong company in step 3, build analysis on that misattribution, and produce a final summary that passes a surface-level check for competitor mentions. The factual error propagates through the entire output chain, invisible to any test checking only the final state.
Anthropic's 2026 evals guidance describes a production monitoring requirement that static tests cannot fulfill. Detecting distribution drift and unanticipated real-world failures requires post-launch monitoring, including systematic human review calibrated against LLM graders for subjective outputs.
Evaluation Criteria: Behavioral Metrics, Continuous Eval, Trace Depth
Agent evaluation requires simultaneous assessment across multiple dimensions. Recent AWS guidance on agent evaluations emphasizes response quality, latency, and cost as core dimensions, with responsibility and safety addressed through additional safeguards and policies. The AlphaEval paper finds that production agent evaluation averages 2.8 leaf-node evaluation types per task.
I used three criteria to evaluate each tool.
Criterion 1: Behavioral Metrics
OpenAI's current evaluation best practices outline general guidance for defining evaluation objectives, metrics, and continuous evaluation. For agents, tool efficiency adds another dimension: whether the tool-calling trajectory was the most efficient path available.
Beyond tool use, production evaluation needs:
- Task completion rate: binary, rubric-based, or LLM-as-judge scoring
- Reasoning coherence: whether intermediate steps are logically consistent at both response-level and trajectory-level
- Groundedness: whether outputs are supported by the retrieved context; for calibration, G-Eval reports a 0.514 Spearman correlation with human judgments on summarization tasks
- Error recovery: whether the agent handles tool failures gracefully
- Task abandonment honesty: whether the agent correctly reports inability to complete a task
These metrics work together rather than in isolation, and most production teams end up tracking at least three of them concurrently.
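As a rough sketch of what two of these metrics look like in code (generic scorer functions for illustration, not any platform's built-in scorers):

```python
from typing import Dict, List

def task_completion(final_output: str, required_facts: List[str]) -> float:
    """Rubric-style completion score: fraction of required facts present in the output.
    Production systems often replace the substring check with an LLM-as-judge grader."""
    if not required_facts:
        return 1.0
    hits = sum(1 for fact in required_facts if fact.lower() in final_output.lower())
    return hits / len(required_facts)

def tool_efficiency(tool_calls: List[Dict[str, str]], minimal_plan: List[str]) -> float:
    """Penalize redundant tool calls relative to a known minimal trajectory for the task."""
    used = [call["name"] for call in tool_calls]
    return min(1.0, len(minimal_plan) / max(len(used), 1))

trace = [{"name": "search"}, {"name": "search"}, {"name": "fetch_page"}, {"name": "summarize"}]
print(tool_efficiency(trace, minimal_plan=["search", "fetch_page", "summarize"]))  # 0.75: one redundant search
```

The remaining metrics on the list (reasoning coherence, error recovery, abandonment honesty) resist simple string checks and usually need trajectory-aware LLM judges.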
Criterion 2: Continuous Evaluation
A common sequencing recommendation in the evaluation literature is that offline evaluation should be established first, then test coverage, then metric-outcome alignment, and only then should online or continuous evaluation be added.
In production, that feedback loop runs as a continuous cycle: monitoring surfaces failures, failures become evaluation datasets, experiments validate fixes, and the improved agent redeploys. Making the cycle work requires tight integration between observability, evaluation, and development tooling.
Criterion 3: Trace Depth
Arize's guidance on LLM-as-judge evaluation describes evaluation formats such as binary verdicts and richer scoring scales, and recommends using code evals for deterministic checks and LLM judges for semantic evaluations. Trace depth itself splits into two layers:
- Span-level: targets individual steps to isolate where errors occur
- Trace-level: examines the complete operation chain to judge whether the overall workflow achieved a correct result
A technically successful request where every span completed without error can still produce a faithfulness failure at the output level. Session-level evaluation, which analyzes multi-turn conversations across traces, adds a third layer.
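A simplified way to picture the two layers (a sketch built on a hypothetical Span record, not any vendor's trace schema):

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Span:
    name: str                 # e.g. "retrieve", "reason", "tool_call"
    output: str
    error: Optional[str] = None

def span_level_eval(spans: List[Span], check: Callable[[Span], bool]) -> List[str]:
    """Span-level: run a check on each step to isolate exactly where an error occurred."""
    return [span.name for span in spans if not check(span)]

def trace_level_eval(spans: List[Span], final_check: Callable[[str], bool]) -> bool:
    """Trace-level: judge the whole chain by whether it completed and the final output passes."""
    return all(span.error is None for span in spans) and final_check(spans[-1].output)
```

A trace where `trace_level_eval` returns True can still contain a middle span whose output contradicts the retrieved context, which is exactly the faithfulness failure described above; that is why both layers, plus session-level review for multi-turn work, show up in most production setups.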
Braintrust: Evaluation-First Architecture
Braintrust structures its platform around a five-stage workflow: Instrument, Observe, Annotate, Evaluate, Deploy. The defining architectural property I found most valuable is that production traces and offline evaluations share the same data layer, so production logs convert into evaluation datasets without data export.
What I Tested
I ran evaluations using Braintrust's scoring system, which supports LLM-as-judge, automated code-based scorers (AutoEvals), custom code scorers, and human review scores. Each evaluation creates an experiment record with git metadata, so quality changes trace back to specific commits.
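For orientation, a minimal Braintrust eval in the Python SDK follows the pattern below (a sketch based on the documented quickstart shape; verify scorer names and signatures against the current SDK):

```python
from braintrust import Eval
from autoevals import Factuality  # packaged LLM-as-judge scorer from Braintrust's AutoEvals library

def run_agent(question: str) -> str:
    # Placeholder for the agent under test; swap in a real agent call.
    return "Refunds are available for 30 days after delivery."

Eval(
    "support-agent",  # project name; each run is recorded as an experiment with git metadata
    data=lambda: [
        {"input": "What is the refund window?", "expected": "30 days from delivery"},
    ],
    task=run_agent,
    scores=[Factuality],
)
```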
The span type system captures meaningful detail. Braintrust uses typed span attributes such as llm, tool, task, and function.
| Span Type | What It Captures |
|---|---|
| eval | Root span for an evaluation run |
| task | A single unit of work |
| llm | A single LLM call with model, messages, parameters, token usage, cost |
| function | Named block of logic (retrieval, formatting, routing) |
| tool | Tool call: external API, code execution, database query |
| scores (field, not a span type) | Scorer results (online or offline) stored as a field on a span; Braintrust defines no dedicated score span type |
CI/CD Integration
Braintrust's CI/CD integration had the lowest setup overhead in my testing for pull-request regression review. The braintrustdata/eval-action GitHub Action runs evaluations and posts results as PR comments with improvements (🟢) and regressions (🔴) flagged per scorer, so teams do not need to assemble the reporting layer from custom workflow code. Eval runs collect git metadata by default when allowed by org-level settings, and built-in concurrency management supports a configurable maxConcurrency.
The Notion case study in Braintrust's documentation shows practical scale. After adopting Braintrust, Notion moved from JSONL files to hundreds of datasets testing specific criteria like tool usage and factual accuracy, and from 3 issues triaged per day to 30.
Where Braintrust Falls Short
Braintrust's free Starter tier includes 10K scores and 14-day retention, with capacity limits documented as 1 GB of processed data or roughly 1M trace spans across recent Braintrust materials. Treat these numbers as approximate vendor-stated limits and check the current pricing page for the latest figures, since the same plan is described differently across Braintrust assets. Custom Topics and Environments require the Pro plan, which currently starts at a flat $249/month with higher usage limits than Starter. A self-hosted deployment option is also available, with enablement and licensing handled through organization settings and direct contact with Braintrust.
Best fit: Teams wanting CI/CD-integrated evaluation with automatic PR regression detection, managed SaaS with a generous free tier (unlimited users, projects, datasets, experiments), and prompt playground plus dataset management in a single platform.
Arize Phoenix: Open-Source Observability with Production Monitoring
Arize Phoenix is an open-source AI observability and evaluation platform built on OpenTelemetry. Phoenix accepts traces over OTLP, so any OTLP-compliant source works without proprietary instrumentation. The GitHub repository hosts the full codebase under Elastic License 2.0 (ELv2), which among other restrictions prohibits offering Phoenix itself as a managed service to third parties while allowing internal self-hosting.
What I Tested
Phoenix's auto-instrumentation coverage spans LlamaIndex, LangChain, DSPy, Mastra, Vercel AI SDK, OpenAI, AWS Bedrock, and Anthropic across Python, TypeScript, and Java. Arize's Phoenix documentation for Bedrock agents describes capturing traces that can be viewed in the Phoenix UI, but AWS Bedrock agents documentation itself only shows CloudWatch metrics like invocations, latency, errors, and token counts, and does not show a Phoenix dashboard.
The custom evaluator workflow is iterative and requires hands-on setup: instrument tracing, generate realistic examples from production spans, annotate to create a benchmark dataset, define an evaluation template, run an experiment comparing LLM judge labels to human annotations, and iterate where labels disagree. Phoenix's documentation acknowledges that LLM judges can exhibit biases and unreliable behavior, such as favoring certain writing styles, and recommends identifying systematic biases during evaluation.
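To show roughly where that workflow lands in code, here is a condensed judge run using the `phoenix.evals` helpers (a sketch; the exact parameter names, template variables, and model wrappers should be checked against current Phoenix docs):

```python
import pandas as pd
from phoenix.evals import OpenAIModel, llm_classify

# Spans exported from Phoenix (or any OTLP source), reduced to the columns the template references.
spans = pd.DataFrame([
    {"reference": "The Q3 report shows revenue up 12% year over year.",
     "output": "Revenue grew 12% year over year in Q3."},
])

template = (
    "You are checking groundedness. Reference: {reference}\nAnswer: {output}\n"
    "Reply with exactly one word: grounded or ungrounded."
)

results = llm_classify(
    spans,
    model=OpenAIModel(model="gpt-4o-mini"),
    template=template,
    rails=["grounded", "ungrounded"],  # constrain the judge to a fixed label set
)
print(results["label"].value_counts())
```

The benchmark-then-iterate loop described above is what makes this trustworthy: the judge's labels get compared against human annotations before the template is promoted to regular use.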
OSS vs. Commercial Feature Boundaries
The most important finding: as of early 2026, drift detection, agent graph visualization, composite metrics, online evaluations (5-minute production cadence), real-time alerting, and SOC2/GDPR/HIPAA compliance are all gated to the commercial Arize AX tier. The open-source Phoenix core covers tracing, span/trace evaluation, prompt management, and datasets/experiments.
| Feature | Phoenix OSS | Arize AX |
|---|---|---|
| Tracing and span/trace evaluation | ✓ | ✓ |
| Prompt management and datasets | ✓ | ✓ |
| Agent graph visualization | ✗ | ✓ |
| Online evaluations (5-min cadence) | ✗ | ✓ |
| Drift detection | ✗ | ✓ |
| Real-time alerting (Slack, PagerDuty, OpsGenie) | ✗ | ✓ |
| SOC2/GDPR/HIPAA compliance | ✗ | ✓ |
Where Phoenix Falls Short
CI/CD integration requires custom Python scripts and manual GitHub Actions workflow construction, with no dedicated marketplace action available.
Best fit: Teams needing full self-hosting for data residency without an enterprise agreement, production-first observability with deep span-level debugging, or broad OTel instrumentation across LlamaIndex, OpenAI Agents, or multi-agent frameworks. Budget for Arize AX if you need drift detection and advanced production analytics.
Promptfoo: Open-Source Eval and Red-Teaming Framework
Promptfoo is an open-source CLI and library for evaluating and red-teaming LLM applications, with the project hosted on GitHub. One critical piece of context: as of the March 2026 announcement, Promptfoo has agreed to be acquired by OpenAI, with the closing of the transaction still subject to customary closing conditions. Promptfoo will reportedly remain open source under the current license, though the project is no longer independent. Teams evaluating Promptfoo as a vendor-neutral tool for non-OpenAI models should factor this into their assessment.
What I Tested
Promptfoo's YAML-first configuration is the standout feature for developer ergonomics. Test cases are declared in YAML or JSON without requiring code for basic scenarios, and they version-control alongside application code. Evaluations run locally by default, and Promptfoo only sends data to its own servers when you use hosted or cloud-backed features such as remote generation, grading, sharing, or Cloud sync.
The assertion system covers both deterministic checks (equals, contains, regex, is-json, is-sql) and model-graded evaluations (llm-rubric, factuality, context-faithfulness, context-relevance). Custom evaluators extend via JavaScript or Python. Provider support is extensive and covers OpenAI, Anthropic, Google, AWS Bedrock, Azure, local providers (llama.cpp, Transformers.js), and custom providers.
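Custom Python evaluators are worth a quick illustration, since they are how teams encode agent-specific pass criteria; this is a rough sketch of Promptfoo's documented `get_assert` hook (treat the exact signature and return fields as something to confirm against the current docs):

```python
# custom_assert.py -- referenced from promptfooconfig.yaml as a Python assertion
# (roughly: type: python, value: file://custom_assert.py); wiring may differ by version.

def get_assert(output: str, context) -> dict:
    """Pass only if the agent's answer cites a source and stays under a length budget."""
    cites_source = "http" in output or "[source]" in output.lower()
    within_budget = len(output.split()) <= 200

    return {
        "pass": cites_source and within_budget,
        "score": 0.5 * cites_source + 0.5 * within_budget,
        "reason": f"cites_source={cites_source}, within_budget={within_budget}",
    }
```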
Red-Teaming as Primary Differentiator
Promptfoo ships red-teaming presets aligned to OWASP's LLM Top 10 and the OWASP Top 10 for Agentic Applications, and its documentation describes mappings to major security frameworks used by enterprises. These are testing presets, not formal certifications. Vulnerability categories include PII leaks, prompt injection and extraction, jailbreaking, SQL/shell injection, ASCII smuggling, and agent-specific risks like misuse of connected APIs.
The promptfoo code-scans run command scans code diffs for prompt injection risks, PII exposure, and insecure output handling. Output per finding includes file path, line number, description, suggested fix, and severity.
Where Promptfoo Falls Short
Promptfoo is explicitly a pre-deployment tool. The system does not track real user interactions or alert on production degradation, though tracing support improves real-time debugging.
The CLI/YAML model creates friction for product managers and domain experts. In my testing, it worked best when the team was developer-centric.
Best fit: Teams where security and red-teaming with OWASP-aligned presets is a primary requirement, who want fully local, privacy-by-default evaluation integrated into CI/CD, and whose team is developer-centric and comfortable with YAML configuration. Plan for separate production monitoring tooling.
See how Cosmos turns evaluation feedback into shared team memory, so teams stop losing corrections to one-off fixes.
Free tier available · VS Code extension · Takes 2 minutes
Galileo: Luna-2 Evaluators at Scale
Galileo describes itself as an AI agent reliability platform combining offline evaluations, production monitoring, and runtime guardrails. The core differentiator is Luna-2: purpose-built small language models (3B and 8B parameter variants) designed specifically for evaluation tasks.
What I Tested
The original Luna model is described in a peer-reviewed industry-track paper at COLING 2025, which Galileo cites as the foundation for its hallucination-detection methodology. Luna v1 is a DeBERTa-large encoder (440M parameters) fine-tuned for hallucination detection in RAG settings. It outperforms zero-shot detection using GPT-3.5, the ChainPoll GPT-3.5 ensemble, and RAG evaluation frameworks including RAGAS and Trulens.
Luna-2 converts decoder-only small language models into deterministic evaluation models through single-token generation. LLM-as-judge approaches rely on multi-token generation and produce non-deterministic outputs by comparison. The multi-headed design attaches lightweight adapters to a shared core model, supporting multiple evaluation metrics without multiplying infrastructure costs.
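To illustrate the general single-token idea (a generic sketch of the technique, not Galileo's code or the Luna-2 architecture; the model name is just a small open-weights stand-in), an evaluator can read the probability mass on a fixed verdict token instead of sampling a free-form critique:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # any small causal LM works for the illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = (
    "Context: The invoice was paid on June 2.\n"
    "Claim: The invoice is still unpaid.\n"
    "Is the claim supported by the context? Answer yes or no: "
)

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # next-token logits: one forward pass, no sampling loop

yes_id = tokenizer(" yes", add_special_tokens=False).input_ids[0]
no_id = tokenizer(" no", add_special_tokens=False).input_ids[0]
probs = torch.softmax(logits[[yes_id, no_id]], dim=-1)  # renormalize over the two verdict tokens

print(f"P(supported) = {probs[0].item():.3f}")  # deterministic for fixed weights and input
```

Because the verdict comes from a single forward pass over a small model, latency and cost stay low and the score does not vary between runs, which is the property Galileo's Luna-2 positioning emphasizes.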
Galileo describes agentic evaluation metrics like Tool Selection Quality, Action Advancement, Agent Flow, and Action Completion, and uses Luna fine-tuned models alongside other LLM evaluators to score agent tool calls and responses. Three dedicated trace views (Graph, Trace, Message) give different perspectives on agent execution.
Vendor-Stated Performance (Not Independently Verified)
The figures below come from Galileo's own 2024β2025 published materials rather than third-party benchmarks. They have not yet been independently replicated in public benchmarks. I include them for context on how Galileo positions Luna-2 against generalist judges:
| Metric | Luna-2 (vendor-stated) | GPT-4o (vendor-stated comparison) |
|---|---|---|
| Cost per 1M tokens | ~$0.02 | $2.50 input / $10.00 output |
| Average latency | Sub-200ms | Higher in the vendor's benchmark comparisons |
The methodology is discussed in the COLING 2025 paper. Treat the Luna-2 numbers as directional.
Where Galileo Falls Short
Real-time guardrails (Galileo Protect) are gated exclusively to the Enterprise tier. Free and Pro tiers do not include runtime protection. The free tier limits to 5,000 traces/month at the time of writing. Specific trace limits and pricing are listed on Galileo's current pricing page and may change over time.
| Tier | Price | Traces | Guardrails |
|---|---|---|---|
| Free | $0/month | 5,000/month | ✗ |
| Pro | $100/month | 50,000/month | ✗ |
| Enterprise | Contact sales | Unlimited | ✓ |
Best fit: Teams building compliance-critical systems where Luna's peer-reviewed methodology provides differentiated credibility, who need cost-efficient hallucination detection at scale, or who require runtime guardrails with sub-200ms latency according to vendor-stated figures (Enterprise tier required).
How Cosmos's Learning Flywheel Connects Evaluation to Agent Improvement
Cosmos is Augment Code's agent operating system, launched into public preview in May 2026 on the MAX plan. It is the platform layer where developers, agents, codebases, tools, and memory coexist and coordinate, designed to run agents in local development environments, dev VMs, and Augment's managed cloud across the full software development lifecycle. The four tools above evaluate agents; Cosmos focuses on a different layer, turning individual corrections into reusable team knowledge.
The Problem Evaluation Tools Don't Solve
I have seen the same post-eval pattern on every team I've worked with. An engineer discovers an agent failure, corrects it locally, and moves on. That correction lives in their personal config. Next week, a teammate hits the same failure and spends the same time debugging. Evaluation tools surface the failure, but in my testing they did not carry the fix forward.
Cosmos was built around exactly this gap. Without a unifying system for agent use, four problems compound:
- Setups fragment: every engineer builds their own workflow with no shared patterns
- Expertise gets trapped: the engineer who figured out the effective prompt has nowhere to put it that others will find
- No quality signal: no way to know which agent setups actually work across teams
- The review bottleneck worsens: humans get pulled in only at the final PR, where catching problems costs the most
The Learning Flywheel as a System Service
Cosmos's architecture combines several team-oriented services: shared context and memory, self-improving agent loops, and connections to the tools the team already uses. Two types of human corrections feed the flywheel:
- Correcting outputs: directly adjusting an agent's immediate result
- Correcting the mental model: instructing the agent on how to approach a category of decisions going forward
The second type carries more leverage. Teams teach the priority function behind future decisions, and the correction becomes reusable across later work for the same agent role.
The Milo Case Study
Milo is a tester agent built internally at Augment, and it illustrates how the flywheel works in practice. The first attempt loaded Milo with all known testing context upfront, and that approach failed. The approach that worked scoped Milo narrowly as the best testing expert for the team's specific environment and tuned for continuous learning and memory. When Milo encountered problems, engineers coached via Slack. Milo was designed to distill important information from those conversations and store it for future runs.
The principle generalizes. A narrowly scoped agent that compounds feedback over time outperforms a broadly scoped agent loaded with everything at once.
From Eight Interruptions to Three Checkpoints
Cosmos's framing of the SDLC reduces the typical eight human interruptions in a product improvement loop to three checkpoints:
- Prioritization review: an agent monitors channels, aggregates signal, finds patterns, and proposes priorities. The human reviews and can correct both the priorities and the mental model behind them.
- Spec and intent review: once priorities are confirmed, parallel agents open PRs or take a first stab. Specs come back for human review before agents write, test, and review code. This pattern aligns with broader spec-driven development workflows.
- Deep code review: code review designed to improve bug detection and review coverage, surfacing the places where key assumptions are shifting so humans maintain codebase understanding and ship with confidence.
How Cosmos Complements Evaluation Tools
Cosmos addresses a different layer from the four evaluation tools. Evaluation metrics feed the improvement loop, and the flywheel propagates what the team learns. The Expert Registry makes those patterns reusable across the team, so new agents inherit prior team context and skip the blank-session reset. That changes the failure-recovery loop from one-off correction to shared reuse.
Comparison Table
The table below consolidates the dimensions I evaluated each tool against, including primary purpose, license model, self-hosting support, evaluation methodology, trace depth, production monitoring, CI/CD integration, free tier limits, paid pricing, and peer-reviewed research backing. Use it as a quick reference when matching a tool to your specific bottleneck:
| Dimension | Braintrust | Arize Phoenix | Promptfoo | Galileo | Cosmos |
|---|---|---|---|---|---|
| Primary purpose | Eval-driven development | Observability + tracing | Security/red-teaming + pre-deploy eval | Hallucination detection + guardrails | Organizational agent OS with learning flywheel |
| License | Proprietary SaaS | Open-source (ELv2) + commercial Arize AX | Open-source (agreed to be acquired by OpenAI) | Proprietary SaaS | Proprietary (public preview) |
| Self-hosting | Available after contacting Braintrust | Yes, free, no feature gates | Yes, fully local | Enterprise only | Runs on laptops, Dev-VMs, and cloud |
| Eval methodology | LLM-as-judge + custom scorers | LLM-as-judge + code-based + human | Custom assertions + LLM-as-judge | Luna-2 SLMs as alternative to LLM-as-judge | Corrections compound via shared memory |
| Agent trace depth | Full span traces, tool call logging | OTel-native, session-level tracking | Individual prompts; OTel added June 2025 | Graph, Trace, and Message views | Context Engine across large codebases |
| Production monitoring | Batch + online scoring | Real-time alerting (Arize AX) | Pre-deployment + continuous CI/CD monitoring | Real-time guardrails (Enterprise) | Continuous learning from production corrections |
| CI/CD integration | Native GitHub Action with PR comments | Custom scripts required | Native GitHub Action + 10+ platforms | Documented | Event Bus triggers from Linear, incidents, Slack |
| Free tier | 1 GB processed data or ~1M spans, 10K scores, 14-day retention (approx., per current Braintrust docs) | Unlimited (self-hosted) | Unlimited (local) | 5,000 traces/month | Public preview (MAX plan) |
| Entry paid price | $249/month | Contact sales (Arize AX) | Open-source | $100/month (Pro, yearly) | MAX plan pricing |
| Peer-reviewed research | None | None | None | Vendor-cited COLING 2025 paper (Luna) | None |
Galileo's Luna methodology is the only entry in this comparison that the vendor explicitly ties to a peer-reviewed industry-track paper.
Decision Framework
No single tool covers the full evaluation lifecycle.
Choose Braintrust if CI/CD regression detection with automatic PR comments is your primary workflow and you want managed SaaS with prompt playground and dataset management in one platform.
Choose Arize Phoenix if you need full self-hosting without an enterprise agreement and production-first observability across multi-agent frameworks with OTel instrumentation.
Choose Promptfoo if security red-teaming with OWASP-aligned presets is a core requirement and your team is comfortable with YAML configuration. Weigh the OpenAI acquisition carefully for multi-provider strategies.
Choose Galileo if cost-efficient hallucination detection with peer-reviewed methodology credibility matters for compliance, and you can commit to Enterprise for runtime guardrails.
Add Cosmos when evaluation signals need to compound across teams. In my testing framework, that meant addressing the post-evaluation problem: making sure fixes applied after review could compound across the team and avoid becoming one-off corrections.
Choose the Layer That Makes Evaluation Signals Reusable
The practical next step is to choose your stack based on the bottleneck: CI/CD regression detection, production observability, red-teaming, or hallucination detection. The tradeoff I kept running into was what happens after a failure is found. Evaluation tools gave me the signal, but they did not solve the reuse problem on their own. In my testing, Cosmos was the layer aimed most directly at carrying corrections forward: corrections to one expert agent propagate to the same role for every engineer, new agents inherit prior team context, and expertise carries across sessions.
Explore how Cosmos propagates evaluation-driven corrections across agent roles and team workflows.
Free tier available · VS Code extension · Takes 2 minutes
Written by

Molisha Shah
GTM
Molisha is an early GTM and Customer Champion at Augment Code, where she focuses on helping developers understand and adopt modern AI coding practices. She writes about clean code principles, agentic development environments, and how teams are restructuring their workflows around AI agents. She holds a degree in Business and Cognitive Science from UC Berkeley.