
AI Spec Template: What to Include and Leave Out

Apr 8, 2026
Paula Hingel

An effective AI spec template includes four mandatory sections: user scenarios and testing, requirements, success criteria, and assumptions. In practice, teams often add implementation-adjacent sections such as scope boundaries, contracts, guardrails, and test plans, because these elements give AI agents sufficient behavioral constraints to generate correct code without prescribing the implementation. An effective template excludes implementation hints, pseudo-code, and vague quality attributes.

TL;DR

Community analyses of open agent spec repositories, including GitHub Spec Kit, show that most failures arise from vague requirements rather than format choice. The question is whether writing a full seven-section executable spec is the right investment for your feature. For complex features with validation logic, multi-step workflows, or compliance requirements, detailed executable specifications reduce correction cycles and improve regeneration fidelity. For a simple two-field form with one error state, the same template takes longer than the feature. This article covers what goes in each section, which sections are load-bearing when time is short, and what to leave out regardless of feature size.

Why AI Agents Need a Different Kind of Spec

AI agents need a different kind of spec because ambiguity produces plausible code, not reliable code. Traditional specifications serve humans who fill gaps with experience and judgment. AI agents do not fill gaps; they generate statistically plausible completions that may have no relationship to actual requirements.

When a specification contains ambiguity, the model does not reason through it the way a human engineer would. Addy Osmani, engineering lead on Google Chrome, captures the boundary precisely: "If you underspecify a task, the AI may do something unexpected; if you overspecify, you may as well code it yourself." The practical challenge is landing between those extremes. A concrete example of what underspecification produces: a spec that says "validate the payment input and return an error if invalid" will typically generate a function that returns a boolean or throws an exception, with no structured error object, no field-level error codes, and no handling of the case where multiple fields are invalid simultaneously. The agent's output is not obviously wrong; it passes a basic test. It breaks when the downstream UI tries to display which field failed, or when a retry system needs to distinguish a card expiry error from a CVV error. The spec produced plausible code. It did not produce the right contract.
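To make the contrast concrete, here is a hypothetical sketch of the two contracts that same vague sentence can produce. The names (`validateLoose`, `validateStructured`) and error codes are illustrative, not from any real payment library:

```typescript
type FieldError = { field: string; code: string; message: string };
type ValidationResult = { valid: boolean; errors: FieldError[] };

// What "return an error if invalid" typically yields: a bare boolean
// that cannot tell the downstream UI which field failed.
function validateLoose(expiry: string, cvv: string): boolean {
  return /^(0[1-9]|1[0-2])[0-9]{2}$/.test(expiry) && /^[0-9]{3,4}$/.test(cvv);
}

// What the spec should demand: one structured result that can carry
// several field-level failures at once.
function validateStructured(expiry: string, cvv: string): ValidationResult {
  const errors: FieldError[] = [];
  if (!/^(0[1-9]|1[0-2])[0-9]{2}$/.test(expiry)) {
    errors.push({ field: "expiry", code: "INVALID_EXPIRY", message: "Use MMYY" });
  }
  if (!/^[0-9]{3,4}$/.test(cvv)) {
    errors.push({ field: "cvv", code: "INVALID_CVV", message: "3-4 digits" });
  }
  return { valid: errors.length === 0, errors };
}
```

Both functions pass a basic "rejects bad input" test; only the second supports a UI that highlights the failing field or a retry system that branches on error codes.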

On benchmarks such as OpenAI's HumanEval, GPT-4-series models achieve around 88–92% functional correctness; structured specs are theorized to improve reliability in task execution, though peer-reviewed confirmation remains ongoing. Claude models demonstrate strong code-generation accuracy on evaluation tasks, though standardized HumanEval scoring for Claude 3.7 has not yet been publicly released. Specs need to be detailed about behavioral outcomes; the arXiv paper on specification quality metrics shows that vague behavioral constraints, not format choice, are the primary source of agent failure.

Both GitHub Spec Kit and AWS Kiro structure spec authoring into intent and implementation phases, separating what should be built from how to build it, a pattern consistent with leading research on agent reliability. This prevents agents from jumping to implementation before the scope is fully defined. Intent applies the same principle by structuring spec authoring around outcome contracts and acceptance criteria, ensuring agents receive behavioral boundaries rather than implementation prescriptions.

See how Intent's living specs keep changing requirements synchronized across active agent work.

Build with Intent

Free tier available · VS Code extension · Takes 2 minutes


The template that follows synthesizes patterns from these tools, practitioner reports, and peer-reviewed research into a format any team can adopt.

Required Sections: The Seven Elements Every AI Spec Needs

The minimum viable template has four mandatory sections. The seven elements below expand that core into an AI-executable format teams can use in production. Each section addresses a specific failure mode documented in production AI agent workflows. Omitting any one creates a gap that the agent fills with hallucinated assumptions. That said, not every feature warrants all seven sections. A feature with complex validation logic, multiple error states, or compliance requirements benefits from the full template. A simple two-field search filter with one error state probably does not. When time is short, the three sections that prevent the most common failures are: acceptance criteria with concrete inputs and outputs, input/output contracts with machine-readable schemas, and a "Not Included" scope boundary. Those three together eliminate the output shape ambiguity and scope creep that account for the majority of agent correction cycles. The remaining sections matter most when the feature touches external systems, has performance requirements, or will be regenerated later.

1. Objective and Scope

Objective and scope define what the agent should build and what it must ignore. A clear, singular goal in one to three sentences prevents AI agents from over-engineering or adding unrequested features. GitHub Spec Kit enforces this by requiring the spec to focus on "WHAT and WHY" and to be written for business stakeholders rather than developers.

Objective
Build a payment validator that checks the card number, expiry, CVV, and currency before processing. No external API calls.

The "Not Included" section is equally important. Explicitly listing what the agent should not implement prevents scope creep. Without it, agents routinely add features that seem logical but were never requested. When using Intent's spec scaffolding, teams defining feature scope are encouraged to document boundaries and constraints before implementation begins.

2. Tech Stack and Versions

A tech stack and its versions prevent framework drift by removing default assumptions. AI model defaults vary across frameworks and versions. A spec that says "use React" without specifying the version may result in code that mixes React 18 patterns with React 19 APIs. Explicit stack declarations eliminate this class of failures.

Kiro addresses this by using steering files in .kiro/steering/, which maintain conventions across sessions, so teams do not have to repeat stack declarations in every feature spec. GitHub Spec Kit uses a constitution file at .specify/memory/constitution.md (or /memory/constitution.md) for global governance that subsequent specs, plans, and implementation steps reference and enforce.

3. Input/Output Contracts

Input/output contracts constrain behavior using schemas rather than prose. AI agents cannot reliably infer types from prose descriptions. Machine-readable schemas using JSON Schema, Zod, or OpenAPI constrain the output space without prescribing implementation steps. They tell the model what must be true about inputs and outputs without specifying how to achieve it.

```typescript
import { z } from "zod";

// Input Schema (Zod)
const PaymentInput = z.object({
  cardNumber: z.string().regex(/^[0-9]{13,19}$/),
  expiry: z.string().regex(/^(0[1-9]|1[0-2])[0-9]{2}$/),
  cvv: z.string().regex(/^[0-9]{3,4}$/),
  currency: z.string().regex(/^[A-Z]{3}$/),
  amount: z.number().positive().max(10000)
});

// Output Schema
type ValidationResult = {
  valid: boolean;
  errors: Array<{
    field: string;
    code: string;
    message: string;
  }>;
};
```

The format decision follows the consumer. Use Zod or JSON Schema when the contract lives within a single TypeScript or JavaScript codebase, and the agent both generates and consumes it: the schema serves as both runtime validation and a spec. Use OpenAPI when the contract crosses a service boundary, is consumed by agents or clients in multiple languages, or needs to be readable by non-engineers. Use plain JSON Schema when you need strict output conformance from the model itself, particularly for function calling: OpenAI reports near-100% format conformance with JSON Schema in strict mode versus under 40% with prompting alone. Without any explicit contract, in whatever format fits, agents hallucinate API shapes, and integration breaks downstream in ways that are hard to trace back to the spec.
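As a hedged sketch of the strict-mode case: per the OpenAI Structured Outputs documentation, strict mode takes a plain JSON Schema in the `response_format` parameter and requires `additionalProperties: false` and exhaustive `required` lists on every object. The schema below mirrors the `ValidationResult` output contract above; the `name` value is illustrative:

```typescript
// Chat Completions `response_format` fragment for strict Structured Outputs.
// Shape follows the OpenAI docs; verify against the current API reference.
const responseFormat = {
  type: "json_schema",
  json_schema: {
    name: "validation_result", // illustrative name
    strict: true,
    schema: {
      type: "object",
      properties: {
        valid: { type: "boolean" },
        errors: {
          type: "array",
          items: {
            type: "object",
            properties: {
              field: { type: "string" },
              code: { type: "string" },
              message: { type: "string" },
            },
            required: ["field", "code", "message"],
            additionalProperties: false,
          },
        },
      },
      required: ["valid", "errors"],
      // Strict mode requires this on every object in the schema.
      additionalProperties: false,
    },
  },
} as const;
```

The same schema can live in the spec's contracts section and be passed verbatim to the model, which is what makes the contract machine-enforceable rather than advisory.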

4. Business Rules and Constraints

Business rules and constraints must be deterministic because they become tests. Every rule must be deterministic and testable. Each rule maps directly to an acceptance criterion, creating a verification chain from requirement to test.

Business Rules
  1. Card number must pass the Luhn algorithm check.
  2. Expiry date must be a future month/year (UTC).
  3. CVV must be 3 digits (4 for Amex prefix 34/37).
  4. Currency must be a valid ISO 4217 currency code.
  5. Amount must be > 0 and ≤ 10,000 per transaction.
  6. Validation must be deterministic: the same input produces the same output.
  7. No external API calls during validation.
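Rule 1 illustrates what "deterministic and testable" means in practice. A minimal sketch of the standard Luhn check (this is the public algorithm, not code from any of the tools discussed):

```typescript
// Business Rule 1: card number must pass the Luhn algorithm check.
function luhnCheck(cardNumber: string): boolean {
  if (!/^[0-9]{13,19}$/.test(cardNumber)) return false;
  let sum = 0;
  // Walk right to left, doubling every second digit;
  // digits that double past 9 have 9 subtracted.
  for (let i = 0; i < cardNumber.length; i++) {
    let digit = Number(cardNumber[cardNumber.length - 1 - i]);
    if (i % 2 === 1) {
      digit *= 2;
      if (digit > 9) digit -= 9;
    }
    sum += digit;
  }
  return sum % 10 === 0;
}
```

Because the rule is deterministic, it maps one-to-one onto acceptance criteria: the same card number always yields the same pass/fail result.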

5. Acceptance Criteria

Acceptance criteria give both the agent and the team a binary oracle. Concrete, verifiable test cases with sample inputs and expected outputs serve as the spec's oracle. The agent can self-validate against these criteria, and the team has an unambiguous pass/fail signal.

| ID  | Input | Expected Output |
|-----|-------|-----------------|
| AC1 | Valid Visa (4111111111111111, 12/28, 123, USD, 100) | { valid: true, errors: [] } |
| AC2 | Expired card (01/24) | { valid: false, errors: [{ field: "expiry", code: "EXPIRED" }] } |
| AC3 | Invalid CVV (2 digits) | { valid: false, errors: [{ field: "cvv", code: "INVALID_CVV" }] } |
| AC4 | Amount exceeds 10,000 | { valid: false, errors: [{ field: "amount", code: "EXCEEDS_LIMIT" }] } |
| AC5 | Non-ISO currency (XYZ) | { valid: false, errors: [{ field: "currency", code: "INVALID_CURRENCY" }] } |

The Jfokus 2026 presentation discusses natural-language requirements for coding agents. Whether teams use Given/When/Then syntax or input/output tables, the principle is the same: criteria must resolve to a binary pass/fail.
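The binary oracle can itself be code. A sketch of a comparator that checks an actual result against an AC table row by the `valid` flag and the (field, code) pairs, order-insensitively; the helper names here are illustrative, not from any tool mentioned in this article:

```typescript
type AcError = { field: string; code: string };
type AcResult = { valid: boolean; errors: AcError[] };

// Returns true only if actual matches the expected AC row:
// same valid flag, same set of (field, code) pairs in any order.
function matchesCriterion(actual: AcResult, expected: AcResult): boolean {
  if (actual.valid !== expected.valid) return false;
  const key = (e: AcError) => `${e.field}:${e.code}`;
  const got = actual.errors.map(key).sort();
  const want = expected.errors.map(key).sort();
  return got.length === want.length && got.every((k, i) => k === want[i]);
}
```

Looping this comparator over the AC table gives the agent a self-check it can run before reporting completion, and gives the team the same pass/fail signal in CI.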

See how Intent's living specs keep acceptance criteria aligned with agent output as implementations evolve.


6. Boundaries and Guardrails

Boundaries and guardrails limit autonomous action before it becomes costly. A three-tier boundary system defines what agents must always do, what requires human confirmation first, and what is strictly prohibited.

The tier that teams most commonly omit is "Ask First." Without it, agents treat ambiguous situations as either always permitted or always prohibited, neither of which reflects how engineering decisions actually work. A user notification feature illustrates the difference:

| Tier | Examples |
|------|----------|
| Always | Log all notification delivery attempts and outcomes; use UTC for scheduling; respect opt-out preferences |
| Ask First | Adding a new notification channel not in the original spec; changing retry intervals; sending to more than 1,000 recipients in a single batch |
| Never | Send notifications without a verified opt-in record; modify the unsubscribe flow; access user contact data outside the notification service boundary |

Without the Ask First tier, an agent building the retry logic will pick an interval on its own. With it, the agent pauses and surfaces the decision to the team before writing code that will need to be changed. For guidance on encoding these constraints as persistent project-level instructions, see the related guide on AGENTS.md.

7. Test Plan and Self-Verification Instructions

A test plan closes the loop by forcing the agent to prove conformance. Building a self-audit instruction directly into the spec closes the loop. After implementing, the agent compares the result with the spec and confirms all requirements are met.

The evidence gate concept comes from Blake Crosley's guidance on avoiding phantom verification, which requires that every completion report cite specific, observable evidence rather than vague claims of completion. Instead of allowing "tests should pass" or "implementation looks correct," the spec requires the agent to cite specific test output and file paths.

Complete Template: Copy, Customize, Ship

The following template assembles all seven required sections into a single, copy-paste-ready document. Replace the payment validator example with your feature, and keep all section headers.

```markdown
# Feature Spec: Payment Validator
<!-- Replace with your feature name -->

## Objective
<!-- 1-3 sentences: what this feature does and why it exists -->
Build a payment validator that checks card number, expiry,
CVV, and currency before processing. No external API calls.

### Not Included
<!-- List what the agent should NOT build -->
- Retry logic or rate limiting
- Database persistence
- Stripe API integration

## Tech Stack
<!-- Pin exact versions; no ranges -->
- Runtime: Node.js 20 LTS
- Language: TypeScript 5.4 (strict mode)
- Validation: Zod 3.23
- Testing: Jest 29.7

## Input/Output Contracts
<!-- Use Zod, JSON Schema, or OpenAPI, not prose -->
```

```typescript
import { z } from "zod";

// Input Schema (Zod)
const PaymentInput = z.object({
  cardNumber: z.string().regex(/^[0-9]{13,19}$/),
  expiry: z.string().regex(/^(0[1-9]|1[0-2])[0-9]{2}$/),
  cvv: z.string().regex(/^[0-9]{3,4}$/),
  currency: z.string().regex(/^[A-Z]{3}$/),
  amount: z.number().positive().max(10000)
});

// Output Schema
type ValidationResult = {
  valid: boolean;
  errors: Array<{
    field: string;
    code: string;
    message: string;
  }>;
};
```

```markdown
## Business Rules
<!-- Every rule must be deterministic and testable -->
1. Card number must pass the Luhn algorithm check.
2. Expiry date must be a future month/year (UTC).
3. CVV must be 3 digits (4 for American Express cards).
4. Currency must be a valid ISO 4217 currency code.
5. Amount must be > 0 and <= 10,000 per transaction.
6. Validation must be deterministic: same input produces same output.
7. No external API calls during validation.

## Acceptance Criteria
<!-- Concrete inputs -> expected outputs; each must resolve to pass/fail -->
| ID  | Input | Expected Output |
|-----|-------|-----------------|
| AC1 | Valid Visa (4111111111111111, 12/28, 123, USD, 100) | { valid: true, errors: [] } |
| AC2 | Expired card (01/24) | { valid: false, errors: [{ field: "expiry", code: "EXPIRED" }] } |
| AC3 | Invalid CVV (2 digits) | { valid: false, errors: [{ field: "cvv", code: "INVALID_CVV" }] } |
| AC4 | Amount exceeds 10,000 | { valid: false, errors: [{ field: "amount", code: "EXCEEDS_LIMIT" }] } |
| AC5 | Non-ISO currency (XYZ) | { valid: false, errors: [{ field: "currency", code: "INVALID_CURRENCY" }] } |

## Boundaries
<!-- Three tiers: non-negotiable, requires approval, hard prohibitions -->
- Always: Run tests before commits, use UTC for dates, log validation failures.
- Ask First: Adding dependencies, changing error codes, modifying rate limits.
- Never: External API calls, commit secrets, modify DB schemas, log raw PII.

## Test Plan
<!-- Specific commands, coverage targets, and self-verification -->
- Run: npm test -- payment-validator.test.ts
- Coverage: as defined by the project's quality guidelines.
- Self-Verify: After implementation, run all 5 ACs. List any unmet criteria before submitting.

### Prohibited Completion Phrases
<!-- Force evidence-based reporting, not vague claims -->
- "tests should pass"
- "follows best practices"
- "implementation looks correct"

Required instead: cite specific test output and file paths.
```

Each section maps to a failure mode documented above. Remove none; customize all.

Optional Sections: Include When Relevant

Optional sections improve a spec only when they constrain real failure modes. Not every spec needs every section. Including irrelevant sections wastes context tokens and dilutes the instructions that matter.

| Optional Section | When to Include | Example |
|------------------|-----------------|---------|
| Performance targets | The feature has latency or throughput requirements | "P99 latency < 50ms for payloads under 10KB." |
| Migration notes | Replacing or upgrading existing functionality | "Must maintain backward compatibility with v1 API responses." |
| UI state definitions | Frontend features with complex state transitions | "Loading, error, empty, and populated states for the dashboard widget" |
| Error handling taxonomy | The feature surfaces errors to end users | "4xx errors return structured JSON; 5xx errors return generic message" |
| Compliance constraints | Regulated domains (healthcare, finance) | "All PII fields must be encrypted at rest per HIPAA §164.312." |

GitHub Spec Kit models this well: its template removes optional sections if they are not applicable, rather than leaving them as "N/A." Empty sections are noise that dilutes the signal.

Anti-Patterns: What to Leave Out

AI spec anti-patterns fail in one of two directions. Over-constraining implementation locks the agent into a solution that may conflict with codebase conventions or miss a better approach the agent would have found if given behavioral latitude. Under-constraining behavior leaves gaps that the agent fills with statistically plausible defaults that appear correct in isolation but break during integration. The anti-patterns below map to specific observable failures, not to general writing-quality concerns.

Implementation Hints That Constrain Agent Solutions

Implementation hints constrain the solution space without improving behavioral clarity. Specifying data structures, iteration mechanisms, or initialization patterns eliminates the agent's ability to find a more appropriate solution. Prescribing "use a HashMap" implicitly determines thread safety, memory layout, and serialization behavior.

```text
# Before (over-specified)
Use a HashMap to store sessions. Initialize in constructor.
Clean up with for-each loop.

# After (outcome-specified)
Session lookup must be O(1). Sessions expire after 30 minutes
of inactivity. Cleanup must not block request handling.
```
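The outcome-spec still admits many implementations. One sketch that satisfies it, assuming in-process storage (a Map happens to give O(1) lookup, but an agent could equally choose an LRU cache or an external store; the names and injected clock here are illustrative):

```typescript
type Session = { userId: string; lastActive: number };

class SessionStore {
  private sessions = new Map<string, Session>();

  // ttlMs and a clock are injected so expiry is testable.
  constructor(
    private ttlMs = 30 * 60 * 1000,
    private now: () => number = () => Date.now(),
  ) {}

  get(id: string): Session | undefined {
    const s = this.sessions.get(id); // O(1) lookup, as the spec requires
    if (!s) return undefined;
    if (this.now() - s.lastActive > this.ttlMs) {
      // Lazy expiry on read: no background sweep blocking request handling.
      this.sessions.delete(id);
      return undefined;
    }
    return s;
  }

  put(id: string, userId: string): void {
    this.sessions.set(id, { userId, lastActive: this.now() });
  }
}
```

Note that nothing in the spec forced the lazy-expiry choice; a different agent run could schedule incremental cleanup instead, and both would conform.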

Pseudo-Code That Agents Treat as Literal

Pseudo-code often becomes implementation by accident. Agents can misinterpret high-level guidance and translate it too directly into code. The implicit decisions in pseudo-code (sequential processing, no error recovery, no batching, coupled operations) carry forward into production code without scrutiny.

```text
# Before (pseudo-code)
for each user in users:
    if user.lastLogin > 30 days:
        send_reminder_email(user)

# After (behavioral spec)
Send reminder emails to users inactive 30+ days.
- Idempotent: safe to run multiple times without duplicates
- Email failures must not prevent status updates
- Process in batches of 100 to avoid memory pressure
```
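A sketch of code the behavioral spec permits; `sendEmail` is a hypothetical injected dependency (synchronous here for brevity), and the 100-item batch size comes straight from the spec text:

```typescript
type User = { id: string; inactiveDays: number; reminded: boolean };

function sendReminders(
  users: User[],
  sendEmail: (u: User) => void,
  batchSize = 100,
): { sent: string[]; failed: string[] } {
  // Idempotent: users already reminded are skipped on re-runs.
  const due = users.filter((u) => u.inactiveDays >= 30 && !u.reminded);
  const sent: string[] = [];
  const failed: string[] = [];
  for (let i = 0; i < due.length; i += batchSize) {
    // Bounded batches avoid loading every due user's work at once.
    for (const u of due.slice(i, i + batchSize)) {
      try {
        sendEmail(u);
        u.reminded = true; // status update per user, not per run
        sent.push(u.id);
      } catch {
        // One failed email must not prevent the rest of the batch.
        failed.push(u.id);
      }
    }
  }
  return { sent, failed };
}
```

The literal pseudo-code translation would have none of this: one exception would abort the loop, and a re-run would email everyone twice.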

Prescriptive Architecture

Prescriptive architecture creates conflict with the codebase's real conventions. Pre-specifying class hierarchies and injection patterns prevents the agent from matching the existing codebase's actual conventions. The fix, explained in the Pete Hodgson article, is to point to an existing example: "add instrumentation to the UpdateAllProjects modal; there's an existing example in UpdateCompany," lets the agent infer the correct pattern by reading the codebase directly.

Vague Quality Attributes

Vague quality attributes fail because they cannot be verified. "Make it fast," "ensure high security," and "handle errors gracefully" give agents no actionable target and no verification criterion. Replace adjectives with verifiable behaviors: "return structured error codes for all 4xx/5xx responses," "P99 latency < 50ms," "pass OWASP ZAP scan."

What the Codebase Already Contains

Restating codebase details wastes context and goes stale. As the Pete Hodgson article notes, restating existing implementation details also introduces a risk of staleness: if the implementation changes, the spec becomes actively misleading. Point to files instead: "The auth pattern is in src/middleware/auth.ts. Follow the same approach."

| Anti-Pattern | Failure Mode | Fix |
|--------------|--------------|-----|
| Implementation hints | Eliminates better solutions | Specify outcomes, not mechanisms |
| Pseudo-code | Agent copies flawed structure literally | Use behavioral specs with acceptance criteria |
| Prescriptive architecture | Forces inconsistency with codebase | Point to existing files as examples |
| Vague quality attributes | No verification criterion | Replace adjectives with measurable conditions |
| Restating codebase content | Context waste; staleness risk | Reference source files directly |
| Copying entire style guides | Context bloat; conflicts with priors | Document only deviations from conventions |

Format Guidance: Markdown, YAML, and Gherkin Tradeoffs

Format guidance matters because specs and contracts do different jobs. The most important format decision is not picking one format. It is recognizing that input specs and output constraints are two different jobs requiring different formats.

The Input/Output Format Split

The input/output split is the practical rule most teams should remember. Agent input specs (AGENTS.md, spec.md, requirements.md) work best in Markdown. The Thoughtworks Radar placed AGENTS.md in the "Trial" ring as a common format for agent instruction files.


Agent output constraints, including API contracts, function calling schemas, and inter-agent data, work best in structured formats. The OpenAI docs report near-100% output-format conformance with JSON Schema in strict mode, compared to less than 40% when prompting alone.

| Format | Best For | Strengths | Weaknesses |
|--------|----------|-----------|------------|
| Markdown | Agent instructions, feature specs, planning docs | Lowest authoring difficulty; dominant standard | No schema enforcement; vagueness at scale |
| YAML | API contracts (OpenAPI), config metadata, skill definitions | Recommended in the OpenAI Model Spec for structured/untrusted prompt data (alongside JSON and XML) | Indentation errors are silent; less natural for narrative |
| JSON Schema | Output format constraints, function calling, data validation | OpenAI reports near-100% conformance in strict mode on its internal evals for Structured Outputs | Token-intensive; brittle for human authoring |
| Gherkin | Acceptance criteria executable as tests | Bridges stakeholders and QA; executable via Cucumber | The SBES study analyzes the effectiveness and variability of LLM-generated Gherkin but does not systematically document failure modes |

The Layered Approach

A layered approach keeps each format aligned with the job it does best. All three major spec-driven tools, GitHub Spec Kit, AWS Kiro, and Tessl, use layered, multi-document specs rather than a single format. The Jfokus slides identify four components: intent (Markdown), interfaces (OpenAPI/YAML), requirements (Markdown), and acceptance criteria (Gherkin). Different concerns are expressed in the format best suited to each.

For most teams, the practical recommendation is simple: write the spec body in Markdown, define API contracts in YAML (OpenAPI), embed data schemas as Zod or JSON Schema code blocks, and express acceptance criteria as Given/When/Then scenarios or input/output tables.

Validation Checklist: Could an Agent Rebuild This Feature?

Validation checks whether the spec is complete enough to regenerate behavior, not just code. Before handing a spec to an agent, run it through this checklist. Each item addresses a documented failure mode.

Structural Completeness

  • Objective states a single, clear goal (no compound features)
  • Tech stack lists exact frameworks and versions
  • Input/output contracts use machine-readable schemas (not prose)
  • Every business rule maps to at least one acceptance criterion
  • Acceptance criteria have concrete inputs and expected outputs
  • Boundaries define Always, Ask First, and Never tiers
  • Test plan includes specific commands and coverage targets
  • "Not Included" section explicitly scopes out adjacent features

Anti-Pattern Scan

  • No implementation hints (data structures, iteration patterns)
  • No pseudo-code that the agent could treat as literal
  • No prescriptive architecture (class hierarchies, injection patterns)
  • No vague quality attributes ("fast," "secure," "clean")
  • No restated information already in the codebase
  • No future work or "nice-to-haves" mixed in

The Regeneration Test

The regeneration test asks whether the spec can recreate equivalent behavior. The ultimate validation question: could an agent rebuild this feature from this spec alone and produce behaviorally identical output?

This is not hypothetical. Teams using spec-driven development regenerate implementations rather than performing risky surgery on poorly understood legacy systems. Regeneration fidelity requires behavioral tests: code that passes every functional test but produces different error messages or UI states is not equivalent. When regeneration does produce different behavior, the divergence is diagnostic. If the output structure changes across runs (e.g., different error field names or status codes), the input/output contract section is missing or too vague. If the scope expands and the agent adds unrequested features, the "Not Included" boundary is incomplete. If edge cases handled in the first run are missing in the second, the acceptance criteria table does not explicitly cover them. Divergence is not a model reliability problem; it is a signal of spec completeness. Each difference between two regeneration points indicates a constraint that should have been in the spec but was not.

The arXiv paper on specification validation identifies the core bottleneck: "since there is no oracle for specification correctness other than the user, we need spec metrics that can assess specification quality." Executable tests serve as that proxy oracle. Without them, spec-driven development is prompt-driven development with extra steps.

Continuous Validation

Continuous validation turns unexpected agent behavior into better specs. A spec is not a write-once document. Cursor rules guidance is often summarized as: begin with basic rules, then add more as needed when repeated mistakes emerge. When an agent produces unexpected output, the correct intervention is to identify which constraint was absent or ambiguous and then update the spec so that the next run avoids the same failure.

GitHub Spec Kit's /speckit.analyze command runs a Spec Kit check across spec, plan, and task files for inconsistencies, ambiguities, coverage gaps, underspecification, and constitution alignment. Automated checks like these catch drift that manual review misses.

Start With One Executable Spec

The gap between AI-generated code that works and code that requires hours of manual fix-up is usually a spec-quality problem, not a model-capability problem. The template above encodes a practical principle: specify outcomes, contracts, and constraints; leave implementation decisions to the agent.

The next step is concrete: pick one feature where an agent-generated bug would be hard to trace, write an executable spec using the three load-bearing sections at minimum (acceptance criteria, input/output contract, scope boundary), generate code and tests from it, then regenerate one week later and compare behavioral parity. If the regeneration diverges, the delta tells you exactly which constraint was missing. Each spec you write and regenerate calibrates the template to your codebase's actual failure modes faster than any checklist can.

See how Intent's living specs keep agent work aligned as implementations evolve.



Written by

Paula Hingel


Paula writes about the patterns that make AI coding agents actually work — spec-driven development, multi-agent orchestration, and the context engineering layer most teams skip. Her guides draw on real build examples and focus on what changes when you move from a single AI assistant to a full agentic codebase.
