AI agents fail E2E tests because they generate code from static file analysis without runtime observability, cross-session memory, or shared contract enforcement, producing brittle selectors, implicit timing assumptions, and schema drift that compound into flaky, unreliable test suites.
TL;DR
AI-generated E2E failures cluster into five modes, each with a different contract fix. Consistent failures trace to selector brittleness or schema drift. Intermittent CI-only failures trace to hardcoded timing. Tests that pass but ship wrong behavior trace to hardcoded assertions. Identifying the mode first determines the minimum contract investment required to resolve it.
Teams usually discover the AI testing problem in the most expensive place possible: after code generation appears successful, but the E2E suite starts failing for reasons that are hard to reproduce. A frontend flow renders, a backend route returns JSON, and a regenerated Playwright test still fails because the agent guessed the wrong selector, assumed the wrong timing, or followed an outdated response shape. Those failures are structural, not incidental. E2E testing validates behavior across layers, while most coding agents generate from partial context inside isolated sessions.
That gap is exactly where spec-driven development becomes practical rather than theoretical. Intent's living specs give agents stable, machine-readable artifacts to read before they write code, keeping parallel agents aligned as requirements evolve and implementations change. This guide explains the main failure modes behind AI-generated E2E breakage, then shows what a stable app contract looks like in practice with OpenAPI, Zod, state machines, Playwright configuration, and CI enforcement.
See how Intent's living specs give agents a stable contract to generate against, eliminating the schema drift and context loss that cause E2E failures.
Free tier available · VS Code extension · Takes 2 minutes
The Core Problem: Agents Optimize Locally, Tests Validate Globally
E2E tests validate that an entire application works as a user expects: the frontend renders the correct data, the backend returns the correct responses, and state transitions follow business rules. AI coding agents, by contrast, operate on isolated files in isolated sessions. An agent generating a backend handler has no awareness of what the frontend test expects. An agent writing a Playwright test has no access to a running browser.
Traditional testing assumes identical input produces identical output, but agents break that assumption because LLMs are non-deterministic and agent workflows span multiple steps. Many failure modes in AI agent E2E testing stem from the gap between static code generation and runtime behavior validation.
The fix is not better prompting or more retries. The right contract depends on which failure mode is causing the breakage, and the five modes below have different root causes and different remedies.
Five Failure Modes: How Agents Break E2E Suites
Understanding these failure modes explains why E2E testing is uniquely difficult for AI agents: each mode traces back to a missing contract. The mode also determines which contract layer to apply first; starting with the right one avoids layering all three contract types onto a suite that only needed one.
Failure Mode 1: Brittle, Hallucinated Selectors
Agents default to CSS class selectors (.btn-primary, .form-submit) that break when design systems update. They reference data-testid attributes never added to actual components and produce XPath expressions based on assumed DOM nesting that do not match the rendered tree. Playwright docs explicitly recommend avoiding implementation details such as CSS class names and function names.
Failure Mode 2: Timing Assumptions and Race Conditions
Fixed page.waitForTimeout(2000) calls are hard waits that introduce flakiness and unnecessary delays compared with condition-based waits. In CI environments with different network latency and CPU allocation, hardcoded delays either expire too early or add unnecessary wait time. Test flakiness from timing issues is among the most notable challenges in AI-generated E2E automation.
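The principle behind the fix is condition-based waiting: poll until the condition holds, and fail only when a real timeout expires. A minimal dependency-free sketch of that principle (in Playwright itself, the equivalent is a web-first assertion like expect(locator).toBeVisible(), which auto-retries):

```typescript
// Minimal sketch of condition-based waiting, the principle behind
// Playwright's auto-retrying assertions. Parameters are illustrative.
async function waitForCondition(
  check: () => boolean,
  timeoutMs = 2000,
  intervalMs = 50,
): Promise<boolean> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (check()) return true; // resolves as soon as the condition holds
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  return false; // timed out: surface a real failure instead of masking it
}
```

Unlike a fixed waitForTimeout(2000), this returns immediately when the app is fast and only consumes the full budget when something is genuinely wrong.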
Failure Mode 3: Schema Drift Across Layers
Frontend and backend code are often generated in separate sessions. Without explicit contract tooling such as OpenAPI or contract testing, there is no automatic mechanism to enforce API consistency. A concrete failure: tests break after an API change because an agent's invocation logic no longer matches the provider's updated schema. Typed schemas and constrained actions only work when consistently enforced across sessions.
Failure Mode 4: Non-Deterministic Test Suites
Two agent sessions generating tests for the same flow produce different selector strategies, assertion granularities, and setup or teardown assumptions. Non-deterministic model behavior makes strict pass/fail evaluation fragile when workflows span multiple agent steps.
Failure Mode 5: Hardcoded Assertions That Conceal Bugs
The most dangerous failure mode is an agent optimizing for green tests instead of a correct implementation. A concrete example: an agent implementing a payment flow generates a test that asserts the confirmation page heading is visible after submission. The test passes. The agent also hardcoded the payment amount as $0.00 in the mock response because that matched the test fixture. The test is green; the bug ships. The state machine coverage check in Layer 3 catches this class of failure because "payment confirmed with zero amount" is not a modeled state, forcing the agent to implement the full payment flow rather than a shortcut version.
| Root Cause | Failure Modes |
|---|---|
| No runtime DOM access | Brittle selectors, timing violations |
| No cross-session memory | Schema drift, context drift, state pollution |
| Optimization for passing tests | Over-mocking, hardcoded assertions |
| Inherent LLM non-determinism | Inconsistent suites, CI non-determinism |
| No inter-agent contract enforcement | Schema mismatch, multi-agent inconsistency |
What a Stable App Contract Looks Like
A stable app contract is a machine-readable artifact that defines API endpoints, request and response shapes, data types, and validation rules before any implementation code is written. The contract operates across three layers, each catching a different class of failure.
Layer 1: OpenAPI Specification (HTTP Contract)
The OpenAPI spec defines endpoints, methods, status codes, and schemas as the canonical reference for both human developers and AI agents: a standard, language-agnostic interface description that allows both humans and computers to understand a service without requiring access to source code.
Key decisions that stabilize AI agent output: operationId values serve as stable function-name anchors across generated files. additionalProperties: false on request bodies tells agents the exact field set. All 4xx and 5xx responses are enumerated with schemas, so error handling is complete, not inferred.
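A hypothetical fragment showing all three decisions in place (the endpoint, fields, and descriptions are illustrative, not from a real spec):

```yaml
# Hypothetical OpenAPI fragment illustrating the three stabilizing decisions.
paths:
  /orders:
    post:
      operationId: createOrder            # stable function-name anchor for codegen
      requestBody:
        required: true
        content:
          application/json:
            schema:
              type: object
              additionalProperties: false # agents see the exact field set
              required: [sku, quantity]
              properties:
                sku: { type: string }
                quantity: { type: integer, minimum: 1 }
      responses:
        '201':
          description: Order created
        '400':
          description: Validation error   # enumerated, not inferred
        '500':
          description: Internal server error
```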
Layer 2: Zod Schemas (Runtime Enforcement)
Zod provides runtime contract enforcement with TypeScript type inference, making the same schema the source of truth for backend validation, frontend parsing, and test assertions.
When both the backend handler and the frontend client import and parse using the same Zod schema, any backend change that violates the contract produces a ZodError with the exact field path and actionable diagnostics, instead of a mysterious E2E failure.
Layer 3: State Machine Definitions (Behavioral Contract)
State machines make behavioral contracts the authoritative artifact from which tests are derived. Per Stately docs, @xstate/test utilities automatically generate test cases from state machines, ensuring comprehensive coverage of all possible paths.
The testCoverage() method fails if any state was never reached during test execution. An AI agent that implements only the happy path cannot pass this check, thereby automatically enforcing error path coverage. XState is the right tool when state complexity justifies it. A checkout flow with five states and two error paths earns the machine definition overhead. A simple form submission probably does not. Playwright's test.step blocks that name each state transition explicitly can achieve most of the same coverage benefit without adding XState to the stack.
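The idea behind the coverage check can be sketched without any library: enumerate the modeled states, record which ones the tests actually visited, and fail if any were never reached. This is a dependency-free illustration with a hypothetical checkout model, not the @xstate/test API itself:

```typescript
// Hypothetical checkout model: the states and their names are illustrative.
type CheckoutState =
  | 'cart' | 'shipping' | 'payment'
  | 'confirmed' | 'payment_failed' | 'address_invalid';

const ALL_STATES: CheckoutState[] = [
  'cart', 'shipping', 'payment', 'confirmed', 'payment_failed', 'address_invalid',
];

class CoverageTracker {
  private visited = new Set<CheckoutState>();

  visit(state: CheckoutState): void {
    this.visited.add(state);
  }

  // Mirrors the idea of testCoverage(): throw if any modeled state was
  // never reached, so a happy-path-only suite cannot pass.
  assertFullCoverage(): void {
    const missing = ALL_STATES.filter((s) => !this.visited.has(s));
    if (missing.length > 0) {
      throw new Error(`States never reached: ${missing.join(', ')}`);
    }
  }
}
```

A suite that only walks cart → shipping → payment → confirmed fails this check until the error paths are exercised too, which is exactly what blocks the hardcoded-$0.00 shortcut from Failure Mode 5.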
Playwright Configuration for Agent-Friendly Testing
Playwright docs establish the foundational principle: automated tests should verify that the application code works for end users while avoiding reliance on implementation details such as CSS class names.
Explicitly declaring testIdAttribute: 'data-testid' makes the team's selector convention machine-readable for every engineer and AI agent reading the config. Setting forbidOnly: !!process.env.CI prevents .only from silently narrowing test runs in CI.
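Both settings together look like this in a minimal playwright.config.ts (a sketch: any other options your project needs would sit alongside these):

```typescript
// playwright.config.ts — minimal sketch showing the two settings discussed above.
import { defineConfig } from '@playwright/test';

export default defineConfig({
  // Fail the CI run if a stray .only silently narrows the suite.
  forbidOnly: !!process.env.CI,
  use: {
    // Make the team's selector convention explicit and machine-readable
    // for every engineer and AI agent reading the config.
    testIdAttribute: 'data-testid',
  },
});
```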
Testing Library and Playwright both recommend prioritizing user-facing locators, with test IDs used as a last resort:
| Strategy | AI Regen Resilience | When to Use |
|---|---|---|
| getByRole('button', { name: 'Submit' }) | High | Interactive elements with stable accessible names |
| getByLabel('Email address') | High | Form inputs with label associations |
| getByTestId('login-form-submit') | Highest | Dynamic text, i18n, complex components |
| locator('.btn-primary') | Fragile | Never |
| locator('xpath=//button[2]') | Fragile | Never |
Component-scoped, descriptive, kebab-case data-testid names prevent collisions. An agent seeing data-testid="btn" generates getByTestId('btn'). When a second agent later adds a Cancel button to the same form using the same pattern, Playwright throws a strict mode violation because the locator matches multiple elements. With data-testid="login-form-submit", the ID is unique by construction, so collisions cannot occur.
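One way to make the convention mechanical rather than aspirational is a small helper that every agent-facing rules file can point at. This helper and its name are hypothetical, shown only to illustrate the component-scoped kebab-case pattern:

```typescript
// Hypothetical convention helper: builds component-scoped, kebab-case
// test IDs so agents working on different components cannot collide.
function testId(component: string, element: string): string {
  const kebab = (s: string) =>
    s
      .replace(/([a-z0-9])([A-Z])/g, '$1-$2') // split camelCase boundaries
      .replace(/[\s_]+/g, '-')                // normalize spaces and underscores
      .toLowerCase();
  return `${kebab(component)}-${kebab(element)}`;
}
```

For example, testId('LoginForm', 'submit') yields 'login-form-submit', matching the unique-by-construction IDs described above.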
Enforcing Contracts in CI: The Merge Gate
Contracts only prevent E2E failures when CI enforces them. A complete pipeline chains schema linting, breaking change detection, and compatibility checks.
A non-zero exit on severity: error findings from Spectral linting blocks the merge. Combined with a Pact broker's can-i-deploy check for consumer-provider compatibility, the pipeline catches contract violations before they reach E2E test execution.
| Enforcement Layer | Tool | Blocks Merge |
|---|---|---|
| Schema style/lint | Spectral | Yes |
| Breaking change detection | oasdiff | Yes (configurable) |
| Consumer-provider compatibility | Pact can-i-deploy | Yes |
| Schema syntax validation | openapi-generator-cli validate | Partial: syntax only |
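The enforcement layers above chain together as CI steps. A hedged sketch as GitHub Actions steps, where the file paths, pacticipant name, and environment are placeholders for your own:

```yaml
# Hypothetical GitHub Actions steps; paths and names are placeholders.
- name: Lint OpenAPI spec (style and structure)
  run: npx @stoplight/spectral-cli lint openapi.yaml --fail-severity=error

- name: Detect breaking changes against main
  run: oasdiff breaking main/openapi.yaml openapi.yaml --fail-on ERR

- name: Check consumer-provider compatibility
  run: |
    pact-broker can-i-deploy \
      --pacticipant web-frontend \
      --version "$GITHUB_SHA" \
      --to-environment production
```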
See how Intent's Coordinator and Verifier agents keep contract changes aligned before drift reaches CI.
Making Agents Contract-Aware: Rules Files and Context Injection
Contracts exist as files. Agents need explicit instructions to read them. Anthropic's context engineering research recommends curating diverse, canonical examples rather than listing exhaustive rules.
In multi-agent setups, separate agents receive explicit interface contracts or scoped specifications for their part of the work, reducing overlap and confusion. Intent's Coordinator agent implements this pattern by delegating to parallel Implementor agents, each of which receives only the contract slice relevant to their task.
How Intent's Architecture Maps to These Contracts
The three-layer contract architecture maps directly onto Intent's workflow. A Coordinator agent analyzes the codebase and drafts the living spec. Implementor agents execute tasks in parallel, reading from and writing to the spec. A Verifier agent checks results against the original specification before human review. When requirements change, Intent automatically propagates updates to all active agents.
Research on the planner-coder gap identifies why this coordination layer is architecturally necessary: when planning agents decompose requirements into underspecified plans, coding agents misinterpret intricate logic. A living spec addresses this gap by maintaining completeness as the authoritative artifact throughout implementation.
The implementation layer becomes the disposable artifact that AI agents regenerate against a stable contract, rather than the contract itself.
Start with One Contract Boundary Before the Next Agent Regeneration
Diagnose before adding contracts. Review the last five E2E failures in CI history. Consistent failures on the same assertion and selector mean the minimum fix is a Zod schema shared between the backend handler and the frontend test, plus a data-testid convention scoped by component and role. Intermittent timing failures that pass on retry mean the minimum fix is replacing hardcoded waits with condition-based waits. Tests that pass while shipping wrong behavior mean the minimum fix is a state machine with testCoverage() enforced in CI. Most teams need one of these three, not all three at once.
Intent's living specs act as the authoritative shared contract for parallel agents, so spec and contract updates propagate across active work before mismatches cause downstream integration failures.
Intent's living specs keep parallel agents aligned across changing requirements, so schema drift stops before it reaches your test suite.