The strongest AI-assisted spec review options today are Intent's Critique persona, Kiro's spec generation system, Traycer's mini-specs, Claude Code with CLAUDE.md, and custom GPT-based spec reviewers. I evaluated these five approaches because they share a common premise: coding agents execute flawed specs with the same fidelity as correct ones, so the highest-leverage review point sits before implementation begins.
TL;DR
AI coding agents build exactly what specs tell them to, even when specs are wrong. Catching the flaw before execution is consistently cheaper than unwinding a confidently wrong implementation spread across many files. Below, I compare five approaches and map each one to the team context where it fits best.
Why Specs Need Review Before Agents Execute Them
Coding agents follow specs with high fidelity whether those specs describe the right thing to build or not, and that creates a distinct failure mode where strong execution amplifies weak direction. Research has identified recurring issues in AI-generated code, including semantic errors and regressions when specs leave requirements underspecified.
The failure modes compound in specific, documented ways:
- Confidently wrong direction: A 30-minute feature balloons into three hours because the agent rewrites half the codebase based on a misread requirement
- Cascading errors: A wrong assumption introduced early doesn't fail immediately; every subsequent step builds on it, and failures in multi-step pipelines often originate earlier than they appear
- Context degradation: In long agentic sessions, the agent's working model of the original spec degrades as context management drops earlier constraints, sometimes to the point where developers report needing to restart the session
- Over-engineering from under-specified scope: When a spec doesn't constrain scope, agents default toward maximally complete implementations, adding abstractions beyond what the task requires
The cost structure is asymmetric. A wrong spec produces confidently wrong output, and the effort to unwind it rises sharply once an agent has touched many files. Spec review earns its place as a distinct category by catching direction-level errors before they translate into code-level problems across the codebase.
See how Intent's Critique persona catches direction-level spec errors before agents touch a single file.
Free tier available · VS Code extension · Takes 2 minutes
Evaluation Criteria: How I Assessed Each Tool
The tools in this category span full IDEs, markdown file conventions, and prompt patterns, so I assessed each against the same framework. A Thoughtworks Radar note on spec-driven development warns that these tools "behave very differently depending on task size and type."
I evaluate each tool across five dimensions, inspired by Thoughtworks-style technology radars but distinct from the dimensions used in Technology Radar Volume 33 and Volume 34:
- Review depth: Does the tool catch edge cases, authorization logic, and error handling, or primarily happy paths? High output volume can still mask shallow coverage
- Automation level: Does review happen automatically, on-demand, or only through manual prompting? Does the tool separate planning from implementation?
- Integration: Native IDE support, CI/CD pipeline hooks, and whether the tool requires switching contexts to a separate interface
- Codebase awareness: Does the tool reason about your existing architecture, or process only the current session's content?
- Team workflow fit: Does the tool match how your team already plans work (issue-driven, PRD-driven, or RFC-driven), or does it impose a new process?
I assess each tool below against these five dimensions using cited product documentation, research, and industry analysis.
1. Intent's Critique Persona: Purpose-Built Spec Review Before Agent Execution
Intent is a spec-driven, multi-agent developer workspace currently in public beta. Critique is the clearest example I found of a persona designed specifically to review specs before implementation agents start executing.
What Critique Does
Intent ships with six specialist personas, two of which directly affect spec quality:
| Persona | Role | When It Runs |
|---|---|---|
| Critique | Review specs for feasibility | Before implementation |
| Verify | Check implementations match specs | After implementation |
The separation matters because Critique catches direction-level errors before any code is written, while Verify catches implementation drift after agents finish. Most other tools in this list combine both functions or handle only one.
The problem this addresses is straightforward: when the spec is weak, implementation review happens too late to prevent wrong-direction work.
How the Workflow Operates
The Coordinator analyzes the codebase through the Context Engine, drafts a living spec, and decomposes it into a dependency-ordered task graph. Critique reviews the spec before any Implementer receives a task.
The living spec approach is the most distinctive piece of Intent's architecture. As agents complete work, the spec updates to reflect what was actually built, and when requirements change, updates propagate to active agents. That addresses the spec-code synchronization problem that plagues static-spec approaches, a tension explored in detail in our analysis of where spec-driven development falls short.
Developers can stop the Coordinator to manually edit the spec before agents continue, and Implementers execute in isolated git worktrees, keeping parallel work from colliding.
Integration and Platform Support
Intent runs as a standalone desktop workspace application on macOS (Apple Silicon), with Windows on a waitlist and Linux support not yet announced. It supports bring-your-own-agent access for Claude Code, Codex, and OpenCode subscriptions.
For CI/CD pipelines, the Auggie CLI reference documents GitHub Actions support.
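As a rough sketch of where that hook sits, the workflow below runs a spec review job whenever spec files change in a pull request. The job skeleton is standard GitHub Actions; the specs/** path, the secret name, and the Auggie invocation itself are placeholders, so substitute the exact command and authentication setup from the Auggie CLI reference.

```yaml
name: spec-review
on:
  pull_request:
    paths:
      - "specs/**"   # assumes specs live in a specs/ directory

jobs:
  review-spec:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run spec review via Auggie
        env:
          AUGMENT_API_TOKEN: ${{ secrets.AUGMENT_API_TOKEN }}   # secret name is an assumption
        run: |
          # Placeholder step: use the documented Auggie CLI command here
          echo "run the spec-review command from the Auggie CLI reference"
```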
Teams can also define custom specialist agents beyond the built-ins, tailoring agent behavior to domain-specific concerns.
Pricing
Pricing follows the standard Augment Code plan structure, with Intent currently using the same credits as the IDE extensions and Auggie CLI during the public beta:
| Plan | Price | Credits/Month |
|---|---|---|
| Free | $0 | Limited |
| Indie | $20/month | 40,000 |
| Standard | $60/seat/month | 130,000/seat (up to 20 users) |
| Max | $200/seat/month | 450,000/seat (up to 20 users) |
| Enterprise | Custom | Custom |
Strengths and Limitations
Strengths: Intent provides a purpose-built separation between Critique (pre-execution review) and Verify (post-execution validation). Living specs update as agents work, the Context Engine extends codebase awareness across 400,000+ files, and custom specialist agents cover domain-specific review needs.
Limitations: Public beta status means some features are still evolving. Intent is macOS only for now, with Windows on a waitlist and Linux support not yet announced, and the specific issue types Critique flags beyond "feasibility" are not yet fully documented.
Best for: Teams who want spec review integrated into a coordinated multi-agent workflow, and teams building features complex enough to justify the planning overhead.
2. Kiro's Spec Generation System: Structured Specs Baked Into the IDE
Kiro is an agentic AI IDE developed by AWS. Its differentiator is a structured spec generation workflow built into the product itself, with the spec authoring step doing most of the review work.
How Spec Generation Works
Kiro converts a natural language prompt into structured specifications through a three-phase workflow.
Phase 1: Natural Language to EARS Notation Requirements. Kiro generates user stories with acceptance criteria written in EARS notation.
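To make the format concrete, here is a hypothetical acceptance criterion written in EARS form; the feature and thresholds are illustrative, not Kiro output:

```text
WHEN a user submits the signup form with an email that is already registered,
THE SYSTEM SHALL reject the submission and display an "account already exists" message.

WHILE an export job is running,
WHEN the user cancels the export,
THE SYSTEM SHALL stop the job within 5 seconds and delete any partial output file.
```

Each criterion pairs an explicit trigger or state with a required system response, which is what makes EARS requirements directly testable.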
Phase 2: Architectural Design Document. Kiro analyzes the existing codebase and produces architecture, system design, and technology stack recommendations.
Phase 3: Implementation Task List. A sequenced plan with discrete coding tasks ordered by dependencies. Developers can trigger tasks one step at a time after reviewing them.
Three spec files are created per feature: requirements.md, design.md, and tasks.md, as documented in this Kiro spec-driven workflow walkthrough. Kiro recommends multiple specs per project, separated by functional domain.
Steering Files and Agent Hooks
Steering files encode persistent project-level standards for all code generation. The same walkthrough notes that a standard Kiro project configuration includes 6 steering files, 4 hooks, 2 MCP configs, and an initial spec.
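As a concrete illustration, a steering file is just a markdown document of persistent conventions. The example below is hypothetical, not a Kiro default:

```markdown
<!-- steering/api-conventions.md (hypothetical example) -->

- New endpoints live under src/api/ and follow the existing router structure.
- Request and response bodies go through the shared validation layer; no ad-hoc parsing.
- Errors use the standard envelope: { "error": { "code": "...", "message": "..." } }.
- Every endpoint change ships with an integration test and an updated OpenAPI entry.
```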
Agent hooks trigger automated actions on file events (create, save, delete) or when invoked manually, covering use cases like generating tests, examining files for security issues, and code review.
Pricing
Kiro now uses a unified credit pool across vibe and spec work, with credits charged fractionally based on prompt complexity. The current published plans are:
| Tier | Monthly Credits | Price |
|---|---|---|
| Free | 50 credits | $0 |
| Pro | 1,000 credits | $20/month |
| Pro+ | 2,000 credits | $40/month |
| Power | 10,000 credits | $200/month |
| Enterprise | Custom | Custom |
According to Kiro's pricing page, overage runs $0.04 per additional credit on paid tiers, and the Kiro FAQ confirms no AWS account is required. New users also get a 500-credit, 30-day welcome bonus.
Strengths and Limitations
Strengths: A highly structured spec generation system. EARS notation produces requirements that are testable by design, and the three-phase workflow separates the question of what to build from the question of how to build it. Steering files persist project conventions across sessions, and the Kiro CLI supports macOS, Linux, and Windows.
Limitations: Kiro's documentation and related coverage note that specifications are mostly static documents and can drift from code changes, which is a meaningful gap compared to living-spec architectures. Kiro subscriptions are individual rather than pooled, so teams cannot share a single credit pool across seats, which is a notable workflow constraint for larger groups. For fast-moving, small features, the structure can also add friction.
Best for: Teams who want spec structure enforced at the IDE level rather than relying on configuration discipline. Medium-to-large projects where upfront planning time pays off in implementation accuracy.
3. Traycer's Mini-Spec System: An Agent-Agnostic Planning Layer
Traycer positions itself as a spec layer for coding agents. Its value is that it sits between human intent and AI code execution without forcing a team to switch its primary coding agent.
The Mini-Spec Concept
Mini-specs are the core of Traycer's Epic Mode, described as "focused mini-specs that each address a specific aspect of your project," organized in a three-layer hierarchy:
- PRD (Intent Capture): What to build and why
- Phases (Decomposition): Work broken into manageable chunks with milestones
- Plans (Tactical Changes): Actionable plans specifying interfaces, exact files and services to touch, and acceptance criteria
The Plans layer is the operative mini-spec.
The Review Workflow
A notable part of Traycer's approach is its active requirements elicitation. Epic Mode probes for constraints, edge cases, and the "invisible rules" behind requirements through targeted questions during intent clarification.
The default Epic Mode workflow runs from requirements through planning and ticket breakdown to implementation handoff, with verification integrated into the workflow. The execution modes described below let you match that process to the scope of the change.
Before execution, Traycer asks clarifying questions during intent capture and offers a Plan Chat interface for editing plans. After the coding agent executes, Traycer runs a verification pass against the spec. Any discrepancies are categorized by severity and fed back into the workflow for correction, and verification can be repeated until everything aligns.
Artifacts are collaborative: multiple team members can be in the same spec simultaneously, leaving comments and edits in real time.
Execution Modes
Traycer offers three execution modes that map to different levels of human oversight:
| Mode | Description |
|---|---|
| Phases Mode | Developer reviews and approves each phase before proceeding |
| YOLO Mode | Fully automated: review, coding, and phase progression without human approval at each step |
| Epic Mode | Full orchestration combining mini-specs, ticket decomposition, and verification |
Pricing
Traycer's pricing has changed multiple times, so verify current numbers on their pricing page before committing. As of writing, Traycer uses credit-based subscriptions with bundle top-ups starting at $10, and a Pro trial is available with no credit card required. Published plan tiers include Lite, Pro, and Ultra, plus an Enterprise option with centralized billing. Traycer pricing is additive to whatever coding agent subscription you already pay for.
Strengths and Limitations
Strengths: Traycer offers an agent-agnostic design that works alongside whatever coding tool your team already uses. Active requirements elicitation surfaces ambiguities before they become implementation errors, collaborative artifacts let teams review specs together, and severity-tiered verification provides actionable post-execution feedback.
Limitations: GitHub PR integration is not yet supported as of writing, and that is an active development area worth re-checking. The additive cost structure means you pay for Traycer on top of your existing agent subscription, and running it as a separate planning layer can introduce context-switching cost.
Best for: Teams who want to add a spec review layer without switching their primary coding agent. The agent-agnostic design makes Traycer a portable option for teams using Cursor, Claude Code, or Windsurf.
Intent's coordinated workspace keeps your primary agent and a living spec aligned across parallel changes, even on cross-service refactors.
Free tier available · VS Code extension · Takes 2 minutes
4. Claude Code + CLAUDE.md: A Configurable Spec Review System
Claude Code with CLAUDE.md is a configurable system that teams adapt for spec review through markdown rules. It works best for teams willing to encode review rules, architectural constraints, and spec compliance checks in those files.
How CLAUDE.md Works for Spec Review
Anthropic's memory documentation describes CLAUDE.md as a special markdown file that Claude Code reads at the start of every conversation, with additional CLAUDE.md files loaded automatically from relevant parent directories and child directories loaded on demand. The same documentation distinguishes it from auto memory: CLAUDE.md contains instructions and rules you write, while auto memory contains learnings Claude discovers. For spec review, CLAUDE.md is the authoritative layer.
Claude Code supports project-level instructions through CLAUDE.md, including files placed in your repository and along the directory hierarchy. Path-scoped rules in .claude/rules/ use frontmatter so they activate only for files that match their declared paths.
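A minimal sketch of what such a rule file could look like follows; the frontmatter field name and the rule content are assumptions, so verify the exact schema against the Claude Code documentation for your version:

```markdown
---
# assumption: a path-matching field in the rule's frontmatter; confirm the
# exact field name in the Claude Code docs before relying on it
paths: ["src/api/**", "specs/api/*.md"]
---

When reviewing specs that touch the API:
- Flag any endpoint that lacks an explicit authorization requirement.
- Flag request/response examples that contradict the shared event schema.
- Treat missing error-handling requirements as blocking, not as nits.
```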
Four Spec Review Patterns
Teams have converged on four practical patterns for using Claude Code as a spec review layer.
Pattern A: Writer/Reviewer Two-Session Pattern. Anthropic's Claude Code best practices recommend using a fresh context for review: "A fresh context improves code review since Claude won't be biased toward code it just wrote." One session implements the change while a separate session, started fresh, reviews it.
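In terminal terms, the pattern is just two separate Claude Code sessions; the prompts and spec path below are illustrative:

```bash
# Session 1: implement against the spec
claude "Implement the feature described in specs/export-job.md"

# Session 2 (a fresh session, so the reviewer has no memory of writing the code):
claude "Review the uncommitted changes against specs/export-job.md. List every \
requirement that is unmet, contradicted, or silently reinterpreted."
```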
Pattern B: Dedicated Security Reviewer Subagent. A security-reviewer subagent can be defined as part of the project setup, giving security-focused spec checks their own context and instructions separate from the implementing session.
Pattern C: Spec Compliance in PR Review. The Claude Code review plugin supports parallel subagents with tiered model assignment: Opus subagents for bugs and logic issues, Sonnet agents for CLAUDE.md convention violations.
Pattern D: Stop Hook for Automated Review Gate. A Stop hook can run a review pass over the files the session modified before control returns to the user.
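A minimal sketch of that gate in .claude/settings.json follows; the review script is hypothetical, and the hook schema should be checked against the current Claude Code hooks documentation:

```json
{
  "hooks": {
    "Stop": [
      {
        "hooks": [
          { "type": "command", "command": "./scripts/review-modified-files.sh" }
        ]
      }
    ]
  }
}
```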
A Dedicated REVIEW.md File
A separate code review configuration file can define severity levels, cap noise, and exclude what CI already catches.
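A hypothetical REVIEW.md along those lines; the severity names, caps, and exclusions are illustrative choices, not a standard:

```markdown
# REVIEW.md (hypothetical example)

## Severity levels
- blocker: unmet spec requirement, data loss, authorization bypass
- major: behavior contradicts the spec or an existing service contract
- minor: naming or structure issues not covered by CI

## Noise cap
- Report at most 10 findings per review; blockers and majors first.

## Out of scope
- Formatting, lint, and type errors: CI already catches these.
```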
Real Developer Workflows
CLAUDE.md can function as a living document updated during code review, which means spec enforcement rules evolve alongside the code they govern.
Strengths and Limitations
Strengths: Claude Code with CLAUDE.md adds zero tooling beyond Claude Code itself and remains highly configurable. Path-scoped rules reduce noise, the two-session pattern and subagent system provide architectural separation between writing and reviewing, and CLAUDE.md updates committed as PR artifacts create a living review standard.
Limitations: Manual maintenance is required, and an outdated CLAUDE.md produces inconsistent behavior. The audit trail is conversational only, with no diff-linked history or approval gating, and there is no spec-code synchronization mechanism. Auto memory is capped at the first 200 lines or 25KB of MEMORY.md, so large rule sets may need to be split into separate files. Instructions may also stop being followed as context fills during long sessions.
Best for: Teams willing to invest in configuration who want maximum control over review rules, and teams already using Claude Code who do not want to adopt a separate tool.
5. Custom GPT Spec Reviewers: DIY Approaches for Existing Toolchains
No single dominant custom GPT spec reviewer product emerged from the research and documentation I reviewed. The real category is a set of patterns that vary in engineering investment, portability, and review depth.
Research on prompt underspecification found that LLMs can sometimes infer unspecified requirements by default, but the behavior is fragile: under-specified prompts are roughly twice as likely to regress across model or prompt changes, with accuracy drops sometimes exceeding 20%. That fragility is why spec review needs to surface missing constraints before agents commit to one interpretation.
Pattern 1: Adversarial Multi-Agent Reviewer
This pattern uses a two-agent configuration where one agent authors the spec and a separate, stronger reviewer agent evaluates it.
Whether the pattern works in practice depends on three things: the reviewer must be a stronger model than the implementer; it needs test-driven constraints so the review has concrete criteria; and it needs an explicit escape hatch, because without one the model may start deleting content instead of flagging contradictions.
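A minimal sketch of the loop using the OpenAI Python SDK appears below; the model names, prompts, and escape-hatch wording are assumptions to adapt, not a reference implementation of any tool discussed here:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

AUTHOR_MODEL = "gpt-4o-mini"   # placeholder: a cheaper model drafts the spec
REVIEWER_MODEL = "gpt-4o"      # placeholder: a stronger model reviews it

REVIEWER_SYSTEM = (
    "You review software specs before implementation. Flag contradictions, "
    "missing constraints, and unstated assumptions. Do NOT rewrite or delete "
    "spec content; if two requirements conflict, report the conflict and stop."  # escape hatch
)

def draft_spec(feature_request: str) -> str:
    # Author agent: turn a feature request into a first-draft spec
    resp = client.chat.completions.create(
        model=AUTHOR_MODEL,
        messages=[
            {"role": "system", "content": "Write a concise implementation spec in markdown."},
            {"role": "user", "content": feature_request},
        ],
    )
    return resp.choices[0].message.content

def review_spec(spec: str, acceptance_tests: str) -> str:
    # Reviewer agent: critique the draft against concrete acceptance criteria
    resp = client.chat.completions.create(
        model=REVIEWER_MODEL,
        messages=[
            {"role": "system", "content": REVIEWER_SYSTEM},
            {"role": "user", "content": f"Spec:\n{spec}\n\nAcceptance tests:\n{acceptance_tests}"},
        ],
    )
    return resp.choices[0].message.content

if __name__ == "__main__":
    spec = draft_spec("Add CSV export to the reports page, including large datasets.")
    findings = review_spec(
        spec,
        "Exporting 1M rows completes without timeout; partial files are never left behind.",
    )
    print(findings)  # a human reads the findings before any agent implements the spec
```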
Pattern 2: Meta-Planning Iterative Spec Critique
This pattern spends meaningful time iterating on the spec document with AI before handing it to an implementation agent.
OpenAI's GPT-5 prompting guide supports this approach, noting that teams using GPT-5 to review their own prompts "uncovered ambiguities and contradictions in their core prompt libraries upon conducting such a review." Updated guides for GPT-5.1 and GPT-5.2 are also available for teams running on newer models.
Pattern 3: Microsoft PromptKit for Version-Controlled Spec Audit
Microsoft PromptKit provides composable, version-controlled prompt components, including personas, protocols, formats, and templates designed for bug investigation, design docs, code review, security audits, and similar engineering tasks. The pieces relevant to spec review are the components that perform adversarial spec analysis against declared invariants instead of running generic "find problems" prompts. This catches structural contradictions that general critique misses, and other components in the kit help with comparing spec versions, auditing cross-component integration points, and authoring interface contracts.
Confirm exact component names against the live repository before adopting any of them in a workflow, as the kit is actively evolving.
Pattern 4: Inline Checklist Verification
The simplest approach is embedding spec review directly in the execution prompt. Appending a requirement check after generation can verify whether listed requirements were satisfied. This verifies only against requirements already listed; it cannot detect contradictions between requirements or identify missing requirements that were never written.
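A hypothetical version of that prompt appendix; the checklist items are illustrative:

```text
After generating the code, re-read the requirements below and mark each one
satisfied, partially satisfied, or not addressed. Quote the code or spec line
that supports each answer.

1. Export respects the user's saved column order.
2. Rows the user cannot access are excluded, not blanked.
3. Failures surface an error to the user; no silent partial exports.
```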
Strengths and Limitations
Strengths: Custom GPT reviewers stay LLM-agnostic when API-based and allow high domain customization by embedding team standards as context. PromptKit provides version control and composability, and no new tool adoption is required.
Limitations: LLM behavior on under-specified prompts is fragile and inconsistent across model changes, per the research on prompt underspecification. Output is also prompt-sensitive in another sense: the gap between what a team wants the reviewer to do and what the prompt actually instructs it to do can itself become a source of inconsistency.
Best for: Teams not ready to adopt a new tool, teams with strong prompt engineering capability, and teams with domain-specific review requirements that purpose-built tools do not address.
Comparison Table: All Five Approaches Side by Side
The table below summarizes how the five approaches compare across the dimensions that drive day-to-day decisions:
| Dimension | Intent Critique | Kiro | Traycer | Claude Code + CLAUDE.md | Custom GPT Reviewers |
|---|---|---|---|---|---|
| Review type | Pre-execution spec feasibility | Three-phase spec generation (EARS) | Mini-spec with active elicitation | Configurable via markdown rules | Pattern-dependent |
| Automation | Coordinator assigns Critique automatically | Guided three-phase workflow | Phases, YOLO, or Epic mode | Manual (subagents, hooks) | Fully manual |
| Spec-code sync | Living specs update as agents work | Static (known limitation) | Post-execution verification loop | None by design | None |
| Codebase awareness | Context Engine across 400,000+ files | Steering files + codebase analysis | Context across phases and tasks | Project files + CLAUDE.md for additional context | Varies by setup |
| IDE integration | Standalone macOS app | Full IDE | VS Code, Cursor, Windsurf extension | Terminal CLI, any Claude Code interface | Any LLM interface |
| Agent compatibility | Augment agents, Claude Code, Codex, OpenCode | Kiro agents only | Cursor, Claude Code, Windsurf, others | Claude Code only | Any LLM |
| CI/CD hooks | GitHub Actions via Auggie CLI | Agent hooks on file events | GitHub repo integration; PR/CI/CD hooks unconfirmed | Stop hooks, code review plugin | Custom implementation |
| Team collaboration | Shared workspaces, custom specialists | Steering files in git | Real-time collaborative artifacts | CLAUDE.md committed to git | Shared prompts/templates |
| Starting price | Free (limited credits) | Free (50 credits) | Free Pro trial (no card required) | Eligible Claude plan (e.g., Pro or Max) | API costs only |
| Additive cost | Uses Augment credits | Standalone | On top of agent subscription | Subagent compute | API calls per review |
When Spec Review Prevents Downstream Rework
Moving the highest-leverage review point earlier in the workflow is what prevents downstream rework. Thoughtworks' analysis of agile engineering practices in 2025 describes this shift directly: the human review point in AI-assisted development is moving upstream, from code toward specification.
This fits a larger pattern in agentic development workflows, where review checkpoints move from code-level to direction-level decisions. Augment Cosmos, the operating system for agentic software development (currently in research preview), frames the change as condensing the eight typical human interruptions across an SDLC down to three checkpoints where humans steer the work: reviewing priorities, reviewing the spec, and a final intent review before ship. Reviewing the spec sits squarely at the second of those checkpoints, and tools like Intent put a dedicated persona on it.
Three scenarios show where spec review pays off quickly.
Cross-service features with shared contracts. When an agent implements a new API endpoint, it follows whatever the spec says about request and response formats. If the spec misses three other services that expect a specific event signature, the agent can build a working but incompatible endpoint. Spec review catches the contract mismatch before code exists. This is the kind of cross-service problem that running parallel agents safely is specifically designed to address.
Legacy system modifications with implicit constraints. Legacy codebases have rules that live in developer memory rather than documentation. An agent operating from a spec that doesn't encode these constraints can produce syntactically valid code that violates unwritten rules. In codebase-aware systems, repository analysis can surface patterns and constraints the spec omitted because the review process reasons over existing architecture across the full codebase. We cover this dynamic in more depth in our writeup on spec-driven development for brownfield codebases.
Multi-sprint features where scope evolves. Specs written in sprint one drift from reality by sprint three, and static specs create a growing gap between the plan and the code. Tools with living spec architectures (Intent) or verification loops (Traycer) catch this drift continuously, while tools with static specs (Kiro, CLAUDE.md) require manual discipline to keep the spec current.
Technology Radar Volume 34 captures the macro trend: "cycle times increase as engineers raise PRs filled with insufficiently reviewed AI output, leading to repeated back-and-forth with reviewers." Spec review moves that back-and-forth upstream, where corrections happen before broad implementation spreads.
A practical starting point for teams evaluating spec review is simple: pick your next medium-complexity feature, write the spec in markdown, and run it through two different review approaches from this list before handing it to your coding agent. Measure how many direction-level corrections the review catches. That gives you a baseline for whether the process is reducing rework in your workflow.
Adopt Spec Review Before Your Next Multi-File Feature
Every tool in this list addresses the same structural problem: AI coding agents execute faithfully against whatever spec they receive, and the effort to unwind a wrong-direction implementation rises sharply once that work spreads across many files. The right choice depends on your team's existing tools, tolerance for workflow change, and whether you need living spec synchronization or can maintain static specs with discipline.
For teams building complex features where multiple agents work in parallel, the coordination problem means spec quality drives output quality. Intent's architecture, with Critique running before implementation and living specs that update as agents work, addresses that coordination gap directly.
See how Intent's Critique persona and living specs prevent the rework that flawed specifications create in multi-agent workflows.
Free tier available · VS Code extension · Takes 2 minutes
Written by

Paula Hingel
Technical Writer
Paula writes about the patterns that make AI coding agents actually work — spec-driven development, multi-agent orchestration, and the context engineering layer most teams skip. Her guides draw on real build examples and focus on what changes when you move from a single AI assistant to a full agentic codebase.