
Why AI Coding Agents Fail E2E Tests (And What a Stable App Contract Looks Like)

Apr 10, 2026
Paula Hingel

AI agents fail E2E tests because they generate code from static file analysis without runtime observability, cross-session memory, or shared contract enforcement, producing brittle selectors, implicit timing assumptions, and schema drift that compound into flaky, unreliable test suites.

TL;DR

AI-generated E2E failures cluster into five modes, each with a different contract fix. Consistent failures trace to selector brittleness or schema drift. Intermittent CI-only failures trace to hardcoded timing. Tests that pass but ship wrong behavior trace to hardcoded assertions. Identifying the mode first determines the minimum contract investment required to resolve it.

Teams usually discover the AI testing problem in the most expensive place possible: after code generation appears successful, but the E2E suite starts failing for reasons that are hard to reproduce. A frontend flow renders, a backend route returns JSON, and a regenerated Playwright test still fails because the agent guessed the wrong selector, assumed the wrong timing, or followed an outdated response shape. Those failures are structural, not incidental. E2E testing validates behavior across layers, while most coding agents generate from partial context inside isolated sessions.

That gap is exactly where spec-driven development becomes practical rather than theoretical. Intent's living specs give agents stable, machine-readable artifacts to read before they write code, keeping parallel agents aligned as requirements evolve and implementations change. This guide explains the main failure modes behind AI-generated E2E breakage, then shows what a stable app contract looks like in practice with OpenAPI, Zod, state machines, Playwright configuration, and CI enforcement.

See how Intent's living specs give agents a stable contract to generate against, eliminating the schema drift and context loss that cause E2E failures.

Build with Intent

Free tier available · VS Code extension · Takes 2 minutes


The Core Problem: Agents Optimize Locally, Tests Validate Globally

E2E tests validate that an entire application works as a user expects: the frontend renders the correct data, the backend returns the correct responses, and state transitions follow business rules. AI coding agents, by contrast, operate on isolated files in isolated sessions. An agent generating a backend handler has no awareness of what the frontend test expects. An agent writing a Playwright test has no access to a running browser.

Traditional testing assumes identical input produces identical output, but agents break that assumption because LLMs are non-deterministic and agent workflows span multiple steps. Many failure modes in AI agent E2E testing stem from the gap between static code generation and runtime behavior validation.

The fix is not better prompting or more retries. The right contract depends on which failure mode is causing the breakage, and the five modes below have different root causes and different remedies.

Five Failure Modes: How Agents Break E2E Suites

Each of these failure modes traces back to a missing contract, which is why E2E testing is uniquely difficult for AI agents. The mode you observe also determines which contract layer to apply first; starting with the right one avoids the mistake of layering all three contract types onto a suite that only needed one.

Failure Mode 1: Brittle, Hallucinated Selectors

Agents default to CSS class selectors (.btn-primary, .form-submit) that break when design systems update. They reference data-testid attributes never added to actual components and produce XPath expressions based on assumed DOM nesting that do not match the rendered tree. Playwright docs explicitly recommend avoiding implementation details such as CSS class names and function names.

Failure Mode 2: Timing Assumptions and Race Conditions

Fixed page.waitForTimeout(2000) calls are hard waits that introduce flakiness and unnecessary delays compared with condition-based waits. In CI environments with different network latency and CPU allocation, hardcoded delays either expire too early or add unnecessary wait time. Test flakiness from timing issues is among the most notable challenges in AI-generated E2E automation.

Failure Mode 3: Schema Drift Across Layers

Frontend and backend code are often generated in separate sessions. Without explicit contract tooling such as OpenAPI or contract testing, there is no automatic mechanism to enforce API consistency. A concrete failure: tests break after an API change because an agent's invocation logic no longer matches the provider's updated schema. Typed schemas and constrained actions only work when consistently enforced across sessions.
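The drift mechanism is easy to reproduce in miniature. The sketch below uses hypothetical field names and no contract library: a frontend accessor written against a v1 response shape silently yields undefined after the backend renames a field, the exact failure a shared schema would surface at the boundary instead.

```typescript
// v1 response shape the frontend agent generated against.
interface OrderV1 { order_id: string; status: string }

// A later backend session renamed the field; the frontend was never updated.
const v2ResponseFromBackend = { orderId: "a1b2", status: "pending" };

// Without a shared contract, the stale accessor returns undefined and the
// bug surfaces far away, as a confusing E2E assertion failure.
const staleRead = (v2ResponseFromBackend as unknown as OrderV1).order_id;

// A boundary check turns the same drift into an immediate, located error.
function missingFields(obj: Record<string, unknown>, fields: string[]): string[] {
  return fields.filter((f) => !(f in obj)); // names of absent fields
}
const missing = missingFields(v2ResponseFromBackend, ["order_id"]);
```

The boundary check reports exactly which contract field disappeared, which is the behavior the Zod layer below provides with full type information.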

Failure Mode 4: Non-Deterministic Test Suites

Two agent sessions generating tests for the same flow produce different selector strategies, assertion granularities, and setup or teardown assumptions. Non-deterministic model behavior makes strict pass/fail evaluation fragile when workflows span multiple agent steps.

Failure Mode 5: Hardcoded Assertions That Conceal Bugs

The most dangerous failure mode is an agent optimizing for green tests instead of a correct implementation. A concrete example: an agent implementing a payment flow generates a test that asserts the confirmation page heading is visible after submission. The test passes. The agent also hardcoded the payment amount as $0.00 in the mock response because that matched the test fixture. The test is green; the bug ships. The state machine coverage check in Layer 3 catches this class of failure because "payment confirmed with zero amount" is not a modeled state, forcing the agent to implement the full payment flow rather than a shortcut version.

| Root Cause | Failure Modes |
| --- | --- |
| No runtime DOM access | Brittle selectors, timing violations |
| No cross-session memory | Schema drift, context drift, state pollution |
| Optimization for passing tests | Over-mocking, hardcoded assertions |
| Inherent LLM non-determinism | Inconsistent suites, CI non-determinism |
| No inter-agent contract enforcement | Schema mismatch, multi-agent inconsistency |

What a Stable App Contract Looks Like

A stable app contract is a machine-readable artifact that defines API endpoints, request and response shapes, data types, and validation rules before any implementation code is written. The contract operates across three layers, each catching a different class of failure.

Layer 1: OpenAPI Specification (HTTP Contract)

The OpenAPI spec defines endpoints, methods, status codes, and schemas as the canonical reference for both human developers and AI agents: a standard, language-agnostic interface description that allows both humans and computers to understand a service without requiring access to source code.

yaml
# orders.v1.yaml: Source of truth for order endpoints
openapi: "3.1.0"
info:
  title: Order API
  version: "1.0.0"
paths:
  /orders:
    post:
      operationId: createOrder
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: "#/components/schemas/CreateOrderRequest"
      responses:
        "201":
          description: Order created
          content:
            application/json:
              schema:
                $ref: "#/components/schemas/OrderResponse"
        "422":
          description: Request failed validation
          content:
            application/json:
              schema:
                $ref: "#/components/schemas/ValidationErrorResponse"
components:
  schemas:
    CreateOrderRequest:
      type: object
      required: [items, shippingAddress]
      additionalProperties: false
      properties:
        items:
          type: array
          minItems: 1
        shippingAddress:
          type: string
    OrderResponse:
      type: object
      required: [orderId, orderStatus, amount, createdAt]
      properties:
        orderId:
          type: string
          format: uuid
        orderStatus:
          type: string
          enum: [pending, shipped, delivered]
        amount:
          type: number
          minimum: 0
        createdAt:
          type: string
          format: date-time

Key decisions that stabilize AI agent output: operationId values serve as stable function-name anchors across generated files. additionalProperties: false on request bodies tells agents the exact field set. All 4xx and 5xx responses are enumerated with schemas, so error handling is complete, not inferred.

Layer 2: Zod Schemas (Runtime Enforcement)

Zod provides runtime contract enforcement with TypeScript type inference, making the same schema the source of truth for backend validation, frontend parsing, and test assertions.

typescript
// packages/contracts/src/order.schema.ts
import { z } from "zod";

export const OrderStatusSchema = z.enum(["pending", "shipped", "delivered"]);

export const CreateOrderRequestSchema = z.object({
  items: z
    .array(
      z.object({
        productId: z.string().uuid(),
        quantity: z.number().int().positive(),
      })
    )
    .min(1),
  shippingAddress: z.string().min(1),
});

export const OrderResponseSchema = z.object({
  orderId: z.string().uuid(),
  orderStatus: OrderStatusSchema,
  amount: z.number().nonnegative(),
  createdAt: z.string().datetime(),
});

export type CreateOrderRequest = z.infer<typeof CreateOrderRequestSchema>;
export type OrderResponse = z.infer<typeof OrderResponseSchema>;

When both the backend handler and the frontend client import and parse using the same Zod schema, any backend change that violates the contract produces a ZodError with the exact field path and actionable diagnostics, instead of a mysterious E2E failure.

Layer 3: State Machine Definitions (Behavioral Contract)

State machines make behavioral contracts the authoritative artifact from which tests are derived. Per Stately docs, @xstate/test utilities automatically generate test cases from state machines, ensuring comprehensive coverage of all possible paths.

typescript
// checkout-flow.machine.ts: Behavioral contract
import { createMachine } from 'xstate';
import { expect, type Page } from '@playwright/test';

export const checkoutMachine = createMachine({
  id: 'checkout',
  initial: 'cart',
  states: {
    cart: {
      meta: {
        test: async (page: Page) => {
          await expect(page.getByRole('region', { name: 'Shopping Cart' }))
            .toBeVisible();
        }
      },
      on: { PROCEED_TO_CHECKOUT: 'payment' }
    },
    payment: {
      meta: {
        test: async (page: Page) => {
          await expect(page.getByRole('form', { name: 'Payment Details' }))
            .toBeVisible();
        }
      },
      on: { SUBMIT_PAYMENT: 'processing', RETURN_TO_CART: 'cart' }
    },
    processing: {
      meta: {
        test: async (page: Page) => {
          // Assumes the in-flight spinner exposes role="status"
          await expect(page.getByRole('status')).toBeVisible();
        }
      },
      on: { PAYMENT_SUCCESS: 'confirmed', PAYMENT_FAILURE: 'paymentError' }
    },
    confirmed: {
      meta: {
        test: async (page: Page) => {
          await expect(
            page.getByRole('heading', { name: /Order Confirmed/i })
          ).toBeVisible();
        }
      },
      type: 'final'
    },
    paymentError: {
      meta: {
        test: async (page: Page) => {
          await expect(page.getByRole('alert')).toBeVisible();
          await expect(page.getByRole('button', { name: 'Try Again' }))
            .toBeEnabled();
        }
      },
      on: { RETRY_PAYMENT: 'payment' }
    }
  }
});

The testCoverage() method fails if any state was never reached during test execution. An AI agent that implements only the happy path cannot pass this check, thereby automatically enforcing error path coverage. XState is the right tool when state complexity justifies it. A checkout flow with five states and two error paths earns the machine definition overhead. A simple form submission probably does not. Playwright's test.step blocks that name each state transition explicitly can achieve most of the same coverage benefit without adding XState to the stack.
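The coverage guarantee does not depend on XState specifically; the core idea reduces to a reachability check over a transition map. A minimal sketch (hypothetical helper, not @xstate/test, with a transition map mirroring the checkout states discussed above):

```typescript
// Transition map for the checkout flow: state -> { event -> target state }.
const transitions: Record<string, Record<string, string>> = {
  cart: { PROCEED_TO_CHECKOUT: "payment" },
  payment: { SUBMIT_PAYMENT: "processing", RETURN_TO_CART: "cart" },
  processing: { PAYMENT_SUCCESS: "confirmed", PAYMENT_FAILURE: "paymentError" },
  confirmed: {},
  paymentError: { RETRY_PAYMENT: "payment" },
};

// States actually visited by a happy-path-only test run.
const visited = new Set(["cart", "payment", "processing", "confirmed"]);

// Coverage check in the spirit of testCoverage(): every modeled state must
// have been reached during the run, or the suite fails naming the gaps.
function uncoveredStates(
  machine: Record<string, Record<string, string>>,
  seen: Set<string>
): string[] {
  return Object.keys(machine).filter((state) => !seen.has(state));
}

const uncovered = uncoveredStates(transitions, visited);
```

A happy-path suite leaves the error state uncovered, and the check names it, which is exactly the pressure that forces an agent to implement and exercise the failure path.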

Playwright Configuration for Agent-Friendly Testing

Playwright docs establish the foundational principle: automated tests should verify that the application code works for end users while avoiding reliance on implementation details such as CSS class names.

typescript
// playwright.config.ts
import { defineConfig, devices } from '@playwright/test';

export default defineConfig({
  testDir: './tests',
  timeout: 30_000,
  expect: { timeout: 5_000 },
  fullyParallel: true,
  forbidOnly: !!process.env.CI,
  retries: process.env.CI ? 2 : 0,
  workers: process.env.CI ? 1 : undefined,
  use: {
    baseURL: process.env.BASE_URL ?? 'http://localhost:3000',
    testIdAttribute: 'data-testid',
    trace: 'on-first-retry',
    screenshot: 'only-on-failure',
    video: 'retain-on-failure',
  },
  projects: [
    { name: 'chromium', use: { ...devices['Desktop Chrome'] } },
    { name: 'mobile-chrome', use: { ...devices['Pixel 5'] } },
  ],
});

Explicitly declaring testIdAttribute: 'data-testid' makes the team's selector convention machine-readable for every engineer and AI agent reading the config. Setting forbidOnly: !!process.env.CI prevents .only from silently narrowing test runs in CI.

Testing Library and Playwright both recommend prioritizing user-facing locators, with test IDs used as a last resort:

| Strategy | AI Regen Resilience | When to Use |
| --- | --- | --- |
| getByRole('button', { name: 'Submit' }) | High | Interactive elements with stable accessible names |
| getByLabel('Email address') | High | Form inputs with label associations |
| getByTestId('login-form-submit') | Highest | Dynamic text, i18n, complex components |
| locator('.btn-primary') | Fragile | Never |
| locator('xpath=//button[2]') | Fragile | Never |

Component-scoped, descriptive, kebab-case data-testid names prevent collisions. An agent seeing data-testid="btn" generates getByTestId('btn'). When a second agent later adds a Cancel button to the same form using the same pattern, Playwright throws a strict mode violation because the locator matches multiple elements. With data-testid="login-form-submit", the component scope is part of the name, so a button in a different form cannot collide with it.
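The convention can even be linted. A small hypothetical checker (not a real ESLint rule, just the predicate one would wrap in a rule) that enforces kebab-case test IDs with at least a component scope and a role segment:

```typescript
// Hypothetical lint predicate: data-testid values must be kebab-case with
// at least two dash-separated segments, e.g. "login-form-submit",
// never a bare "btn" or a CamelCase name.
function isValidTestId(id: string): boolean {
  const kebabWithScope = /^[a-z][a-z0-9]*(-[a-z0-9]+)+$/;
  return kebabWithScope.test(id);
}

const scoped = isValidTestId("login-form-submit"); // component + role
const bare = isValidTestId("btn");                 // no component scope
const camel = isValidTestId("LoginSubmit");        // not kebab-case
```

Running this over generated diffs in CI turns the naming convention from a prose guideline into a machine-checked rule that agents cannot drift away from.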

Enforcing Contracts in CI: The Merge Gate

Contracts only prevent E2E failures when CI enforces them. A complete pipeline chains schema linting, breaking change detection, and compatibility checks.

yaml
# .github/workflows/contract-check.yml
name: API Contract Enforcement
on:
  pull_request:
    paths: ['openapi.yaml']
jobs:
  breaking-changes:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - name: Detect breaking changes
        uses: oasdiff/oasdiff-action@latest
        with:
          err-ignore: '.oasdiff-ignore'

Spectral linting exits non-zero on any severity: error finding, blocking the merge. Combined with the Pact Broker's can-i-deploy check for consumer-provider compatibility, the pipeline catches contract violations before they ever reach E2E test execution.

| Enforcement Layer | Tool | Blocks Merge |
| --- | --- | --- |
| Schema style/lint | Spectral | Yes |
| Breaking change detection | oasdiff | Yes (configurable) |
| Consumer-provider compatibility | Pact can-i-deploy | Yes |
| Schema syntax validation | openapi-generator-cli validate | Partial: syntax only |
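For illustration, a minimal Spectral ruleset that enforces the operationId anchor described earlier (the file name and rule selection are assumptions; both rules are built into the spectral:oas ruleset):

```yaml
# .spectral.yaml: lint the OpenAPI contract; "error" severity exits non-zero
extends: ["spectral:oas"]
rules:
  operation-operationId: error   # every operation needs a stable name anchor
  oas3-unused-component: error   # no dead schemas drifting in the spec
```

Escalating these two built-in rules to error severity is usually enough to make the contract's function-name anchors non-negotiable in CI.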

See how Intent's Coordinator and Verifier agents keep contract changes aligned before drift reaches CI.


Making Agents Contract-Aware: Rules Files and Context Injection

Contracts exist as files. Agents need explicit instructions to read them. Anthropic's context engineering research recommends curating diverse, canonical examples rather than listing exhaustive rules.

markdown
---
# .cursor/rules/api-contract.mdc
description: "API contract compliance for route handlers and API clients"
alwaysApply: false
---

## Contract Sources
- OpenAPI spec: docs/openapi.yaml (source of truth)
- Zod schemas: src/schemas/ (runtime validation)
- Generated types: src/types/api.generated.ts (do not hand-edit)

## Rules
1. Before generating any route handler, READ the relevant path in docs/openapi.yaml
2. Request validation MUST use Zod schemas from src/schemas/
3. Response objects MUST include only fields declared in the OpenAPI spec
4. If you cannot find the schema for a type, STOP and ask

In multi-agent setups, each agent receives an explicit interface contract or scoped specification for its part of the work, reducing overlap and confusion. Intent's Coordinator agent implements this pattern by delegating to parallel Implementor agents, each of which receives only the contract slice relevant to its task.

How Intent's Architecture Maps to These Contracts

The three-layer contract architecture maps directly onto Intent's workflow. A Coordinator agent analyzes the codebase and drafts the living spec. Implementor agents execute tasks in parallel, reading from and writing to the spec. A Verifier agent checks results against the original specification before human review. When requirements change, Intent automatically propagates updates to all active agents.

Research on the planner-coder gap identifies why this coordination layer is architecturally necessary: when planning agents decompose requirements into underspecified plans, coding agents misinterpret intricate logic. A living spec addresses this gap by maintaining completeness as the authoritative artifact throughout implementation.

text
┌──────────────────────────────────────────────────────┐
│ SPEC LAYER (Living spec, human-owned) │
│ OpenAPI schemas + state machine contracts │
│ Intent's Coordinator drafts; humans refine │
└──────────────────┬───────────────────────────────────┘
│ derives
┌──────────────────────────────────────────────────────┐
│ TEST LAYER (Contract-derived, CI-enforced) │
│ Zod validation, Playwright assertions, oasdiff │
│ Intent's Verifier checks against spec │
└──────────────────┬───────────────────────────────────┘
│ validates
┌──────────────────────────────────────────────────────┐
│ IMPLEMENTATION LAYER (Agent-generated, replaceable) │
│ Route handlers, components, API clients │
│ Intent's Implementor agents execute in parallel │
└──────────────────────────────────────────────────────┘

The implementation layer becomes the disposable artifact that AI agents regenerate against a stable contract, rather than the contract itself.

Start with One Contract Boundary Before the Next Agent Regeneration

Diagnose before adding contracts. Review the last five E2E failures in CI history. If the same assertion fails consistently on the same selector, the minimum fix is a Zod schema shared between the backend handler and the frontend test, plus a data-testid convention scoped by component and role. If failures are intermittent timing issues that pass on retry, the minimum fix is replacing hardcoded waits with condition-based waits. If tests pass but wrong behavior ships, the minimum fix is a state machine with testCoverage() enforced in CI. Most teams need one of these three, not all three at once.

Intent's living specs act as the authoritative shared contract for parallel agents, so spec and contract updates propagate across active work before mismatches cause downstream integration failures.


