
AI Spec Template: What to Include and Leave Out

Apr 8, 2026
Paula Hingel

An effective AI spec template includes four mandatory sections: user scenarios and testing, requirements, success criteria, and assumptions. In practice, teams often add implementation-adjacent sections such as scope boundaries, contracts, guardrails, and test plans, because these elements give AI agents sufficient behavioral constraints to generate correct code without prescribing the implementation. An effective template excludes implementation hints, pseudo-code, and vague quality attributes.

TL;DR

Community analyses of open agent spec repositories, including GitHub Spec Kit, show that most failures arise from vague requirements rather than format choice. The question is whether writing a full seven-section executable spec is the right investment for your feature. For complex features with validation logic, multi-step workflows, or compliance requirements, detailed executable specifications reduce correction cycles and improve regeneration fidelity. For a simple two-field form with one error state, the same template takes longer than the feature. This article covers what goes in each section, which sections are load-bearing when time is short, and what to leave out regardless of feature size.

Why AI Agents Need a Different Kind of Spec

AI agents need a different kind of spec because ambiguity produces plausible code, not reliable code. Traditional specifications serve humans who fill gaps with experience and judgment. AI agents do not fill gaps; they generate statistically plausible completions that may have no relationship to actual requirements.

When a specification contains ambiguity, the model does not reason through it the way a human engineer would. Addy Osmani, engineering lead on Google Chrome, captures the boundary precisely: "If you underspecify a task, the AI may do something unexpected; if you overspecify, you may as well code it yourself." The practical challenge is landing between those extremes. A concrete example of what underspecification produces: a spec that says "validate the payment input and return an error if invalid" will typically generate a function that returns a boolean or throws an exception, with no structured error object, no field-level error codes, and no handling of the case where multiple fields are invalid simultaneously. The agent's output is not obviously wrong; it passes a basic test. It breaks when the downstream UI tries to display which field failed, or when a retry system needs to distinguish a card expiry error from a CVV error. The spec produced plausible code. It did not produce the right contract.
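To make the contrast concrete, here is a hypothetical sketch of the two contracts that same vague sentence can produce. The names (`validateLoose`, `validateStructured`) and error codes are illustrative, not from any real payment library:

```typescript
type FieldError = { field: string; code: string; message: string };
type ValidationResult = { valid: boolean; errors: FieldError[] };

// What "return an error if invalid" typically yields: a bare boolean
// that cannot tell the downstream UI which field failed.
function validateLoose(expiry: string, cvv: string): boolean {
  return /^(0[1-9]|1[0-2])[0-9]{2}$/.test(expiry) && /^[0-9]{3,4}$/.test(cvv);
}

// What the spec should demand: one structured result that can carry
// several field-level failures at once.
function validateStructured(expiry: string, cvv: string): ValidationResult {
  const errors: FieldError[] = [];
  if (!/^(0[1-9]|1[0-2])[0-9]{2}$/.test(expiry)) {
    errors.push({ field: "expiry", code: "INVALID_EXPIRY", message: "Use MMYY" });
  }
  if (!/^[0-9]{3,4}$/.test(cvv)) {
    errors.push({ field: "cvv", code: "INVALID_CVV", message: "3-4 digits" });
  }
  return { valid: errors.length === 0, errors };
}
```

Both functions pass a basic "rejects bad input" test; only the second supports a UI that highlights the failing field or a retry system that branches on error codes.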

On benchmarks such as OpenAI's HumanEval, GPT-4-series models achieve around 88–92% functional correctness; structured specs are theorized to improve reliability in task execution, though peer-reviewed confirmation remains ongoing. Claude models demonstrate strong code-generation accuracy on evaluation tasks, though standardized HumanEval scoring for Claude 3.7 has not yet been publicly released. Specs need to be detailed about behavioral outcomes; the arXiv paper on specification quality metrics shows that vague behavioral constraints, not format choice, are the primary source of agent failure.

Both GitHub Spec Kit and AWS Kiro structure spec authoring into intent and implementation phases, separating what should be built from how to build it, a pattern consistent with leading research on agent reliability. This prevents agents from jumping to implementation before the scope is fully defined. Intent applies the same principle by structuring spec authoring around outcome contracts and acceptance criteria, ensuring agents receive behavioral boundaries rather than implementation prescriptions.

See how Intent's living specs keep changing requirements synchronized across active agent work.

Build with Intent

Free tier available · VS Code extension · Takes 2 minutes


The template that follows synthesizes patterns from these tools, practitioner reports, and peer-reviewed research into a format any team can adopt.

Required Sections: The Seven Elements Every AI Spec Needs

The minimum viable template has four mandatory sections. The seven elements below expand that core into an AI-executable format teams can use in production. Each section addresses a specific failure mode documented in production AI agent workflows. Omitting any one creates a gap that the agent fills with hallucinated assumptions. That said, not every feature warrants all seven sections. A feature with complex validation logic, multiple error states, or compliance requirements benefits from the full template. A simple two-field search filter with one error state probably does not. When time is short, the three sections that prevent the most common failures are: acceptance criteria with concrete inputs and outputs, input/output contracts with machine-readable schemas, and a "Not Included" scope boundary. Those three together eliminate the output shape ambiguity and scope creep that account for the majority of agent correction cycles. The remaining sections matter most when the feature touches external systems, has performance requirements, or will be regenerated later.

1. Objective and Scope

Objective and scope define what the agent should build and what it must ignore. A clear, singular goal in one to three sentences prevents AI agents from over-engineering or adding unrequested features. GitHub Spec Kit enforces this by requiring the spec to focus on "WHAT and WHY" and to be written for business stakeholders rather than developers.

Objective
Build a payment validator that checks the card number, expiry, CVV, and currency before processing. No external API calls.

The "Not Included" section is equally important. Explicitly listing what the agent should not implement prevents scope creep. Without it, agents routinely add features that seem logical but were never requested. When using Intent's spec scaffolding, teams defining feature scope are encouraged to document boundaries and constraints before implementation begins.

2. Tech Stack and Versions

A tech stack and its versions prevent framework drift by removing default assumptions. AI model defaults vary across frameworks and versions. A spec that says "use React" without specifying the version may result in code that mixes React 18 patterns with React 19 APIs. Explicit stack declarations eliminate this class of failures.

Kiro addresses this by using steering files in .kiro/steering/, which maintain conventions across sessions, so teams do not have to repeat stack declarations in every feature spec. GitHub Spec Kit uses a constitution file at .specify/memory/constitution.md (or /memory/constitution.md) for global governance that subsequent specs, plans, and implementation steps reference and enforce.

3. Input/Output Contracts

Input/output contracts constrain behavior using schemas rather than prose. AI agents cannot reliably infer types from prose descriptions. Machine-readable schemas using JSON Schema, Zod, or OpenAPI constrain the output space without prescribing implementation steps. They tell the model what must be true about inputs and outputs without specifying how to achieve it.

```typescript
import { z } from "zod";

// Input Schema (Zod)
const PaymentInput = z.object({
  cardNumber: z.string().regex(/^[0-9]{13,19}$/),
  expiry: z.string().regex(/^(0[1-9]|1[0-2])[0-9]{2}$/),
  cvv: z.string().regex(/^[0-9]{3,4}$/),
  currency: z.string().regex(/^[A-Z]{3}$/),
  amount: z.number().positive().max(10000)
});

// Output Schema
type ValidationResult = {
  valid: boolean;
  errors: Array<{
    field: string;
    code: string;
    message: string;
  }>;
};
```

The format decision follows the consumer. Use Zod or JSON Schema when the contract lives within a single TypeScript or JavaScript codebase, and the agent both generates and consumes it: the schema serves as both runtime validation and a spec. Use OpenAPI when the contract crosses a service boundary, is consumed by agents or clients in multiple languages, or needs to be readable by non-engineers. Use plain JSON Schema when you need strict output conformance from the model itself, particularly for function calling: OpenAI reports near-100% format conformance with JSON Schema in strict mode versus under 40% with prompting alone. Without any explicit contract, in whatever format fits, agents hallucinate API shapes, and integration breaks downstream in ways that are hard to trace back to the spec.
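As a hedged sketch of the strict-mode case: per the OpenAI Structured Outputs documentation, strict mode takes a plain JSON Schema in the `response_format` parameter and requires `additionalProperties: false` and exhaustive `required` lists on every object. The schema below mirrors the `ValidationResult` output contract above; the `name` value is illustrative:

```typescript
// Chat Completions `response_format` fragment for strict Structured Outputs.
// Shape follows the OpenAI docs; verify against the current API reference.
const responseFormat = {
  type: "json_schema",
  json_schema: {
    name: "validation_result", // illustrative name
    strict: true,
    schema: {
      type: "object",
      properties: {
        valid: { type: "boolean" },
        errors: {
          type: "array",
          items: {
            type: "object",
            properties: {
              field: { type: "string" },
              code: { type: "string" },
              message: { type: "string" },
            },
            required: ["field", "code", "message"],
            additionalProperties: false,
          },
        },
      },
      required: ["valid", "errors"],
      // Strict mode requires this on every object in the schema.
      additionalProperties: false,
    },
  },
} as const;
```

The same schema can live in the spec's contracts section and be passed verbatim to the model, which is what makes the contract machine-enforceable rather than advisory.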

4. Business Rules and Constraints

Business rules and constraints must be deterministic because they become tests. Every rule must be deterministic and testable. Each rule maps directly to an acceptance criterion, creating a verification chain from requirement to test.

Business Rules
  1. Card number must pass the Luhn algorithm check.
  2. Expiry date must be a future month/year (UTC).
  3. CVV must be 3 digits (4 for Amex prefix 34/37).
  4. Currency must be a valid ISO 4217 currency code.
  5. Amount must be > 0 and ≤ 10,000 per transaction.
  6. Validation must be deterministic: the same input produces the same output.
  7. No external API calls during validation.
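Rule 1 illustrates what "deterministic and testable" means in practice. A minimal sketch of the standard Luhn check (this is the public algorithm, not code from any of the tools discussed):

```typescript
// Business Rule 1: card number must pass the Luhn algorithm check.
function luhnCheck(cardNumber: string): boolean {
  if (!/^[0-9]{13,19}$/.test(cardNumber)) return false;
  let sum = 0;
  // Walk right to left, doubling every second digit;
  // digits that double past 9 have 9 subtracted.
  for (let i = 0; i < cardNumber.length; i++) {
    let digit = Number(cardNumber[cardNumber.length - 1 - i]);
    if (i % 2 === 1) {
      digit *= 2;
      if (digit > 9) digit -= 9;
    }
    sum += digit;
  }
  return sum % 10 === 0;
}
```

Because the rule is deterministic, it maps one-to-one onto acceptance criteria: the same card number always yields the same pass/fail result.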

5. Acceptance Criteria

Acceptance criteria give both the agent and the team a binary oracle. Concrete, verifiable test cases with sample inputs and expected outputs serve as the spec's oracle. The agent can self-validate against these criteria, and the team has an unambiguous pass/fail signal.

| ID  | Input | Expected Output |
|-----|-------|-----------------|
| AC1 | Valid Visa (4111111111111111, 12/28, 123, USD, 100) | { valid: true, errors: [] } |
| AC2 | Expired card (01/24) | { valid: false, errors: [{ field: "expiry", code: "EXPIRED" }] } |
| AC3 | Invalid CVV (2 digits) | { valid: false, errors: [{ field: "cvv", code: "INVALID_CVV" }] } |
| AC4 | Amount exceeds 10,000 | { valid: false, errors: [{ field: "amount", code: "EXCEEDS_LIMIT" }] } |
| AC5 | Non-ISO currency (XYZ) | { valid: false, errors: [{ field: "currency", code: "INVALID_CURRENCY" }] } |

The Jfokus 2026 presentation discusses natural-language requirements for coding agents. Whether teams use Given/When/Then syntax or input/output tables, the principle is the same: criteria must resolve to a binary pass/fail.
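The binary oracle can itself be code. A sketch of a comparator that checks an actual result against an AC table row by the `valid` flag and the (field, code) pairs, order-insensitively; the helper names here are illustrative, not from any tool mentioned in this article:

```typescript
type AcError = { field: string; code: string };
type AcResult = { valid: boolean; errors: AcError[] };

// Returns true only if actual matches the expected AC row:
// same valid flag, same set of (field, code) pairs in any order.
function matchesCriterion(actual: AcResult, expected: AcResult): boolean {
  if (actual.valid !== expected.valid) return false;
  const key = (e: AcError) => `${e.field}:${e.code}`;
  const got = actual.errors.map(key).sort();
  const want = expected.errors.map(key).sort();
  return got.length === want.length && got.every((k, i) => k === want[i]);
}
```

Looping this comparator over the AC table gives the agent a self-check it can run before reporting completion, and gives the team the same pass/fail signal in CI.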

See how Intent's living specs keep acceptance criteria aligned with agent output as implementations evolve.


6. Boundaries and Guardrails

Boundaries and guardrails limit autonomous action before it becomes costly. A three-tier boundary system defines what agents must always do, what requires human confirmation first, and what is strictly prohibited.

The tier that teams most commonly omit is "Ask First." Without it, agents treat ambiguous situations as either always permitted or always prohibited, neither of which reflects how engineering decisions actually work. A user notification feature illustrates the difference:

| Tier | Examples |
|------|----------|
| Always | Log all notification delivery attempts and outcomes; use UTC for scheduling; respect opt-out preferences |
| Ask First | Adding a new notification channel not in the original spec; changing retry intervals; sending to more than 1,000 recipients in a single batch |
| Never | Send notifications without a verified opt-in record; modify the unsubscribe flow; access user contact data outside the notification service boundary |

Without the Ask First tier, an agent building the retry logic will pick an interval on its own. With it, the agent pauses and surfaces the decision to the team before writing code that will need to be changed. For guidance on encoding these constraints as persistent project-level instructions, see the related guide on AGENTS.md.

7. Test Plan and Self-Verification Instructions

A test plan closes the loop by forcing the agent to prove conformance. Building a self-audit instruction directly into the spec closes the loop. After implementing, the agent compares the result with the spec and confirms all requirements are met.

The evidence gate concept comes from Blake Crosley's guidance on avoiding phantom verification, which requires that every completion report cite specific, observable evidence rather than vague claims of completion. Instead of allowing "tests should pass" or "implementation looks correct," the spec requires the agent to cite specific test output and file paths.

Complete Template: Copy, Customize, Ship

The following template assembles all seven required sections into a single, copy-paste-ready document. Replace the payment validator example with your feature, and keep all section headers.

```markdown
# Feature Spec: Payment Validator
<!-- Replace with your feature name -->

## Objective
<!-- 1-3 sentences: what this feature does and why it exists -->
Build a payment validator that checks card number, expiry,
CVV, and currency before processing. No external API calls.

### Not Included
<!-- List what the agent should NOT build -->
- Retry logic or rate limiting
- Database persistence
- Stripe API integration

## Tech Stack
<!-- Pin exact versions; no ranges -->
- Runtime: Node.js 20 LTS
- Language: TypeScript 5.4 (strict mode)
- Validation: Zod 3.23
- Testing: Jest 29.7

## Input/Output Contracts
<!-- Use Zod, JSON Schema, or OpenAPI, not prose -->
```

```typescript
import { z } from "zod";

// Input Schema (Zod)
const PaymentInput = z.object({
  cardNumber: z.string().regex(/^[0-9]{13,19}$/),
  expiry: z.string().regex(/^(0[1-9]|1[0-2])[0-9]{2}$/),
  cvv: z.string().regex(/^[0-9]{3,4}$/),
  currency: z.string().regex(/^[A-Z]{3}$/),
  amount: z.number().positive().max(10000)
});

// Output Schema
type ValidationResult = {
  valid: boolean;
  errors: Array<{
    field: string;
    code: string;
    message: string;
  }>;
};
```

```markdown
## Business Rules
<!-- Every rule must be deterministic and testable -->
1. Card number must pass the Luhn algorithm check.
2. Expiry date must be a future month/year (UTC).
3. CVV must be 3 digits (4 for American Express cards).
4. Currency must be a valid ISO 4217 currency code.
5. Amount must be > 0 and <= 10,000 per transaction.
6. Validation must be deterministic: same input produces same output.
7. No external API calls during validation.

## Acceptance Criteria
<!-- Concrete inputs -> expected outputs; each must resolve to pass/fail -->
| ID  | Input | Expected Output |
|-----|-------|-----------------|
| AC1 | Valid Visa (4111111111111111, 12/28, 123, USD, 100) | { valid: true, errors: [] } |
| AC2 | Expired card (01/24) | { valid: false, errors: [{ field: "expiry", code: "EXPIRED" }] } |
| AC3 | Invalid CVV (2 digits) | { valid: false, errors: [{ field: "cvv", code: "INVALID_CVV" }] } |
| AC4 | Amount exceeds 10,000 | { valid: false, errors: [{ field: "amount", code: "EXCEEDS_LIMIT" }] } |
| AC5 | Non-ISO currency (XYZ) | { valid: false, errors: [{ field: "currency", code: "INVALID_CURRENCY" }] } |

## Boundaries
<!-- Three tiers: non-negotiable, requires approval, hard prohibitions -->
- Always: Run tests before commits, use UTC for dates, log validation failures.
- Ask First: Adding dependencies, changing error codes, modifying rate limits.
- Never: External API calls, commit secrets, modify DB schemas, log raw PII.

## Test Plan
<!-- Specific commands, coverage targets, and self-verification -->
- Run: npm test -- payment-validator.test.ts
- Coverage: as defined by the project's quality guidelines.
- Self-Verify: After implementation, run all 5 ACs. List any unmet criteria before submitting.

### Prohibited Completion Phrases
<!-- Force evidence-based reporting, not vague claims -->
- "tests should pass"
- "follows best practices"
- "implementation looks correct"

Required instead: cite specific test output and file paths.
```

Each section maps to a failure mode documented above. Remove none; customize all.

Optional Sections: Include When Relevant

Optional sections improve a spec only when they constrain real failure modes. Not every spec needs every section. Including irrelevant sections wastes context tokens and dilutes the instructions that matter.

| Optional Section | When to Include | Example |
|------------------|-----------------|---------|
| Performance targets | The feature has latency or throughput requirements | "P99 latency < 50ms for payloads under 10KB." |
| Migration notes | Replacing or upgrading existing functionality | "Must maintain backward compatibility with v1 API responses." |
| UI state definitions | Frontend features with complex state transitions | "Loading, error, empty, and populated states for the dashboard widget" |
| Error handling taxonomy | The feature surfaces errors to end users | "4xx errors return structured JSON; 5xx errors return generic message" |
| Compliance constraints | Regulated domains (healthcare, finance) | "All PII fields must be encrypted at rest per HIPAA §164.312." |

GitHub Spec Kit models this well: its template removes optional sections if they are not applicable, rather than leaving them as "N/A." Empty sections are noise that dilutes the signal.

Anti-Patterns: What to Leave Out

AI spec anti-patterns fail in one of two directions. Over-constraining implementation locks the agent into a solution that may conflict with codebase conventions or miss a better approach the agent would have found if given behavioral latitude. Under-constraining behavior leaves gaps that the agent fills with statistically plausible defaults that appear correct in isolation but break during integration. The anti-patterns below map to specific observable failures, not to general writing-quality concerns.

Implementation Hints That Constrain Agent Solutions

Implementation hints constrain the solution space without improving behavioral clarity. Specifying data structures, iteration mechanisms, or initialization patterns eliminates the agent's ability to find a more appropriate solution. Prescribing "use a HashMap" implicitly determines thread safety, memory layout, and serialization behavior.

```text
# Before (over-specified)
Use a HashMap to store sessions. Initialize in constructor.
Clean up with for-each loop.

# After (outcome-specified)
Session lookup must be O(1). Sessions expire after 30 minutes
of inactivity. Cleanup must not block request handling.
```
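The outcome-spec still admits many implementations. One sketch that satisfies it, assuming in-process storage (a Map happens to give O(1) lookup, but an agent could equally choose an LRU cache or an external store; the names and injected clock here are illustrative):

```typescript
type Session = { userId: string; lastActive: number };

class SessionStore {
  private sessions = new Map<string, Session>();

  // ttlMs and a clock are injected so expiry is testable.
  constructor(
    private ttlMs = 30 * 60 * 1000,
    private now: () => number = () => Date.now(),
  ) {}

  get(id: string): Session | undefined {
    const s = this.sessions.get(id); // O(1) lookup, as the spec requires
    if (!s) return undefined;
    if (this.now() - s.lastActive > this.ttlMs) {
      // Lazy expiry on read: no background sweep blocking request handling.
      this.sessions.delete(id);
      return undefined;
    }
    return s;
  }

  put(id: string, userId: string): void {
    this.sessions.set(id, { userId, lastActive: this.now() });
  }
}
```

Note that nothing in the spec forced the lazy-expiry choice; a different agent run could schedule incremental cleanup instead, and both would conform.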

Pseudo-Code That Agents Treat as Literal

Pseudo-code often becomes implementation by accident. Agents can misinterpret high-level guidance and translate it too directly into code. The implicit decisions in pseudo-code (sequential processing, no error recovery, no batching, coupled operations) carry forward into production code without scrutiny.

```text
# Before (pseudo-code)
for each user in users:
    if user.lastLogin > 30 days:
        send_reminder_email(user)

# After (behavioral spec)
Send reminder emails to users inactive 30+ days.
- Idempotent: safe to run multiple times without duplicates
- Email failures must not prevent status updates
- Process in batches of 100 to avoid memory pressure
```
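A sketch of code the behavioral spec permits; `sendEmail` is a hypothetical injected dependency (synchronous here for brevity), and the 100-item batch size comes straight from the spec text:

```typescript
type User = { id: string; inactiveDays: number; reminded: boolean };

function sendReminders(
  users: User[],
  sendEmail: (u: User) => void,
  batchSize = 100,
): { sent: string[]; failed: string[] } {
  // Idempotent: users already reminded are skipped on re-runs.
  const due = users.filter((u) => u.inactiveDays >= 30 && !u.reminded);
  const sent: string[] = [];
  const failed: string[] = [];
  for (let i = 0; i < due.length; i += batchSize) {
    // Bounded batches avoid loading every due user's work at once.
    for (const u of due.slice(i, i + batchSize)) {
      try {
        sendEmail(u);
        u.reminded = true; // status update per user, not per run
        sent.push(u.id);
      } catch {
        // One failed email must not prevent the rest of the batch.
        failed.push(u.id);
      }
    }
  }
  return { sent, failed };
}
```

The literal pseudo-code translation would have none of this: one exception would abort the loop, and a re-run would email everyone twice.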

Prescriptive Architecture

Prescriptive architecture creates conflict with the codebase's real conventions. Pre-specifying class hierarchies and injection patterns prevents the agent from matching the existing codebase's actual conventions. The fix, explained in the Pete Hodgson article, is to point to an existing example: "add instrumentation to the UpdateAllProjects modal; there's an existing example in UpdateCompany," lets the agent infer the correct pattern by reading the codebase directly.

Vague Quality Attributes

Vague quality attributes fail because they cannot be verified. "Make it fast," "ensure high security," and "handle errors gracefully" give agents no actionable target and no verification criterion. Replace adjectives with verifiable behaviors: "return structured error codes for all 4xx/5xx responses," "P99 latency < 50ms," "pass OWASP ZAP scan."

What the Codebase Already Contains

Restating codebase details wastes context and goes stale. As the Pete Hodgson article notes, restating existing implementation details also introduces a risk of staleness: if the implementation changes, the spec becomes actively misleading. Point to files instead: "The auth pattern is in src/middleware/auth.ts. Follow the same approach."

| Anti-Pattern | Failure Mode | Fix |
|--------------|--------------|-----|
| Implementation hints | Eliminates better solutions | Specify outcomes, not mechanisms |
| Pseudo-code | Agent copies flawed structure literally | Use behavioral specs with acceptance criteria |
| Prescriptive architecture | Forces inconsistency with codebase | Point to existing files as examples |
| Vague quality attributes | No verification criterion | Replace adjectives with measurable conditions |
| Restating codebase content | Context waste; staleness risk | Reference source files directly |
| Copying entire style guides | Context bloat; conflicts with priors | Document only deviations from conventions |

Format Guidance: Markdown, YAML, and Gherkin Tradeoffs

Format guidance matters because specs and contracts do different jobs. The most important format decision is not picking one format. It is recognizing that input specs and output constraints are two different jobs requiring different formats.

The Input/Output Format Split

The input/output split is the practical rule most teams should remember. Agent input specs (AGENTS.md, spec.md, requirements.md) work best in Markdown. The Thoughtworks Radar placed AGENTS.md in the "Trial" ring as a common format for agent instruction files.


Agent output constraints, including API contracts, function calling schemas, and inter-agent data, work best in structured formats. The OpenAI docs report near-100% output-format conformance with JSON Schema in strict mode, compared to less than 40% when prompting alone.

| Format | Best For | Strengths | Weaknesses |
|--------|----------|-----------|------------|
| Markdown | Agent instructions, feature specs, planning docs | Lowest authoring difficulty; dominant standard | No schema enforcement; vagueness at scale |
| YAML | API contracts (OpenAPI), config metadata, skill definitions | Recommended in the OpenAI Model Spec for structured/untrusted prompt data (alongside JSON and XML) | Indentation errors are silent; less natural for narrative |
| JSON Schema | Output format constraints, function calling, data validation | OpenAI reports near-100% conformance in strict mode on its internal evals for Structured Outputs | Token-intensive; brittle for human authoring |
| Gherkin | Acceptance criteria executable as tests | Bridges stakeholders and QA; executable via Cucumber | The SBES study analyzes the effectiveness and variability of LLM-generated Gherkin but does not systematically document failure modes |

The Layered Approach

A layered approach keeps each format aligned with the job it does best. All three major spec-driven tools, GitHub Spec Kit, AWS Kiro, and Tessl, use layered, multi-document specs rather than a single format. The Jfokus slides identify four components: intent (Markdown), interfaces (OpenAPI/YAML), requirements (Markdown), and acceptance criteria (Gherkin). Different concerns are expressed in the format best suited to each.

For most teams, the practical recommendation is simple: write the spec body in Markdown, define API contracts in YAML (OpenAPI), embed data schemas as Zod or JSON Schema code blocks, and express acceptance criteria as Given/When/Then scenarios or input/output tables.

Validation Checklist: Could an Agent Rebuild This Feature?

Validation checks whether the spec is complete enough to regenerate behavior, not just code. Before handing a spec to an agent, run it through this checklist. Each item addresses a documented failure mode.

Structural Completeness

  • Objective states a single, clear goal (no compound features)
  • Tech stack lists exact frameworks and versions
  • Input/output contracts use machine-readable schemas (not prose)
  • Every business rule maps to at least one acceptance criterion
  • Acceptance criteria have concrete inputs and expected outputs
  • Boundaries define Always, Ask First, and Never tiers
  • Test plan includes specific commands and coverage targets
  • "Not Included" section explicitly scopes out adjacent features

Anti-Pattern Scan

  • No implementation hints (data structures, iteration patterns)
  • No pseudo-code that the agent could treat as literal
  • No prescriptive architecture (class hierarchies, injection patterns)
  • No vague quality attributes ("fast," "secure," "clean")
  • No restated information already in the codebase
  • No future work or "nice-to-haves" mixed in

The Regeneration Test

The regeneration test asks whether the spec can recreate equivalent behavior. The ultimate validation question: could an agent rebuild this feature from this spec alone and produce behaviorally identical output?

This is not hypothetical. Teams using spec-driven development regenerate implementations rather than performing risky surgery on poorly understood legacy systems. Regeneration fidelity requires behavioral tests: code that passes every functional test but produces different error messages or UI states is not equivalent. When regeneration does produce different behavior, the divergence is diagnostic. If the output structure changes across runs (e.g., different error field names or status codes), the input/output contract section is missing or too vague. If the scope expands and the agent adds unrequested features, the "Not Included" boundary is incomplete. If edge cases handled in the first run are missing in the second, the acceptance criteria table does not explicitly cover them. Divergence is not a model reliability problem; it is a signal of spec completeness. Each difference between two regeneration points indicates a constraint that should have been in the spec but was not.

The arXiv paper on specification validation identifies the core bottleneck: "since there is no oracle for specification correctness other than the user, we need spec metrics that can assess specification quality." Executable tests serve as that proxy oracle. Without them, spec-driven development is prompt-driven development with extra steps.

Continuous Validation

Continuous validation turns unexpected agent behavior into better specs. A spec is not a write-once document. Cursor rules guidance is often summarized as: begin with basic rules, then add more as needed when repeated mistakes emerge. When an agent produces unexpected output, the correct intervention is to identify which constraint was absent or ambiguous and then update the spec so that the next run avoids the same failure.

GitHub Spec Kit's /speckit.analyze command runs a Spec Kit check across spec, plan, and task files for inconsistencies, ambiguities, coverage gaps, underspecification, and constitution alignment. Automated checks like these catch drift that manual review misses.

Start With One Executable Spec

The gap between AI-generated code that works and code that requires hours of manual fix-up is usually a spec-quality problem, not a model-capability problem. The template above encodes a practical principle: specify outcomes, contracts, and constraints; leave implementation decisions to the agent.

The next step is concrete: pick one feature where an agent-generated bug would be hard to trace, write an executable spec using the three load-bearing sections at minimum (acceptance criteria, input/output contract, scope boundary), generate code and tests from it, then regenerate one week later and compare behavioral parity. If the regeneration diverges, the delta tells you exactly which constraint was missing. Each spec you write and regenerate calibrates the template to your codebase's actual failure modes faster than any checklist can.

See how Intent's living specs keep agent work aligned as implementations evolve.



Written by

Paula Hingel


Paula writes about the patterns that make AI coding agents actually work — spec-driven development, multi-agent orchestration, and the context engineering layer most teams skip. Her guides draw on real build examples and focus on what changes when you move from a single AI assistant to a full agentic codebase.
