Harness engineering is the discipline of designing environments, constraints, and feedback loops that make AI coding agents reliable at scale, shifting engineers from writing code to designing the systems that govern how agents write code.
TL;DR
AI coding agents generate code faster than teams can review it, and the failures they introduce follow a predictable pattern: architecture drift, inconsistent security controls, and compliance gaps that pass all tests before surfacing in production. The question is not whether to constrain agent behavior, but which type of constraint addresses the specific failure mode. Agents producing inconsistent output across sessions need rules files and a verified architectural context. Agents introducing security gaps need deterministic enforcement at the CI layer. Agents drifting from spec need a verification loop that checks implementations against a persistent contract.
Why AI Agents Need Constraints, Not Just Instructions
Engineering teams adopting AI coding agents face a paradox: agents generate code at unprecedented speed, but that speed creates a verification bottleneck. The DORA report found that higher AI adoption correlates with increases in both software delivery throughput and software delivery instability. Time saved writing code is often re-spent auditing it.
Apiiro's Sep 2025 analysis found that AI-generated code introduced more than 10,000 new security findings per month by June 2025 across the studied repositories (a 10x increase from Dec 2024). Spotify's Honk system, which has merged 1,500+ AI-generated PRs across hundreds of repositories since mid-2024, addresses similar agent reliability challenges through verification loops.
The root cause is architectural, not behavioral. Telling an agent "follow our coding standards" in a prompt is fundamentally different from wiring a linter that blocks the PR when standards are violated. The first approach relies on probabilistic compliance; the second enforces deterministic constraints. Harness engineering formalizes this distinction.
What Is Harness Engineering: Origin and Definition
The term "harness engineering" is commonly attributed to Mitchell Hashimoto, co-founder of HashiCorp and creator of Terraform, in a blog post published in early February 2026. The original post URL has not been independently recovered, so the attribution rests on secondary reports. The core principle those reports describe: whenever an agent makes a mistake, engineers should build a solution that ensures the agent never makes that specific mistake again.
The concept gained a formal definition in an OpenAI post by Ryan Lopopolo on February 11, 2026, drawing on the experience of shipping a production application with zero manually written lines of code. Its tagline: "Humans steer. Agents execute." A LangChain post condensed the model further: "Agent = Model + Harness."
A common misattribution: Andrej Karpathy coined context engineering (Dec 2025) and agentic engineering (Feb 2026), not harness engineering.
| Term | Attributed To | Primary Source | Date |
|---|---|---|---|
| Harness engineering | Mitchell Hashimoto (per secondary reports) | Personal blog, "My AI Adoption Journey." | Early Feb 2026 |
| Harness engineering (formal) | OpenAI / Ryan Lopopolo | openai.com/index/harness-engineering | Feb 11, 2026 |
| Agentic engineering | Andrej Karpathy | x.com/karpathy | Feb 2026 |
| Context engineering | Andrej Karpathy | x.com/karpathy | Dec 19, 2025 |
Harness Engineering vs. Prompt Engineering vs. Context Engineering
Harness engineering occupies a distinct architectural layer from prompt engineering and context engineering. Understanding the boundaries prevents teams from applying single-turn techniques to multi-session, multi-agent problems.
Prompt engineering optimizes instructions for a single interaction. Context engineering curates the token set across turns within one context window. Harness engineering operates outside both: it introduces context resets, structured handoff artifacts, and phase gates that enable coherent, goal-directed work across multiple context windows.
An arXiv analysis describes the harness as orchestrating tool dispatch, context management, and safety enforcement at runtime, with the tool registry and safety system as distinct components alongside the prompt composition layer.
| Dimension | Prompt Engineering | Context Engineering | Harness Engineering |
|---|---|---|---|
| Scope | Single interaction | Model's context window across turns | Entire agent system |
| What it controls | Instruction wording | Token selection, ordering, compaction | Tool orchestration, state persistence, verification loops, error recovery |
| Failure modes addressed | Unclear instructions | Wrong or missing information in context | Agent errors, doom loops, multi-session drift, unsafe actions |
| Temporal boundary | One turn | One context window | Multiple context windows; full task lifetime |
See how Intent's Coordinator, Verifier, and living specs keep multi-agent workflows aligned.
Free tier available · VS Code extension · Takes 2 minutes
The Three Harness Layers: Constraints, Feedback Loops, Quality Gates
Harness engineering operates through reinforcing layers: preventive controls that stop unwanted outputs before they occur, and corrective controls that detect violations and trigger a response. Teams starting from zero should implement them in order:
1. Constraint harnesses first, because they reduce failure volume before anything else is in place.
2. Feedback loops second, because they enable self-correction without human intervention.
3. Quality gates third, because they enforce what the first two layers could not prevent.
One tradeoff applies to all three: over-constraining is a real failure mode. Complexity limits set too low flag legitimate refactoring; lint rules that reject valid patterns slow agents without improving output quality. Start narrow, measure, then expand.
Layer 1: Constraint Harnesses (Feedforward)
Constraint harnesses reduce the agent's solution space before generation begins. Rules files, architectural lint configurations, and type systems all function as feedforward controls. They encode what the correct code looks like, so the agent converges faster on compliant output.
OpenAI's production system enforces what it calls taste invariants: a small set of rules that encode the team's engineering standards and design philosophy, including general coding conventions and reliability requirements. All are enforced as hard CI failures, not warnings.
Layer 2: Feedback Loops (Corrective)
Feedback loops return structured error signals to the agent, enabling autonomous self-correction. The critical implementation detail is that the lint message itself becomes a prompt. A lint error saying "violation detected" requires human interpretation. A lint error saying "use logger.info({event: 'name', ...data}) instead of console.log" enables the agent to fix the violation without human intervention.
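As a minimal sketch of that idea (the check, file path, and remediation wording here are illustrative, not taken from any specific team's tooling), a linter-style pass can emit the fix instructions directly in its findings:

```javascript
// Feedback-loop sketch: the error message doubles as a prompt the agent
// can act on. The rule and remediation text are illustrative.
function checkForConsoleLog(source, filePath) {
  const findings = [];
  source.split("\n").forEach((line, i) => {
    if (line.includes("console.log(")) {
      findings.push({
        file: filePath,
        line: i + 1,
        // Actionable remediation, not just "violation detected":
        message:
          "Replace console.log with logger.info({event: '<name>', ...data}) " +
          "so logs are structured and queryable.",
      });
    }
  });
  return findings;
}

const sample = "const x = 1;\nconsole.log('debug', x);\n";
const findings = checkForConsoleLog(sample, "src/app.js");
```

When findings like these are piped back into the agent's context, the correction loop closes without a human translating the error first.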
One implementation detail most teams overlook: inline-disable rules, such as // eslint-disable-next-line, should be disabled to prevent agents from suppressing violations rather than fixing them.
Layer 3: Quality Gates (Enforcement)
Quality gates prevent non-compliant code from being merged. Standard CI pipelines are insufficient for AI-generated code because agents introduce problems that conventional checks miss. Some teams add purpose-built staleness gates to catch dependency choices that do not match the codebase's current strategy.
All rules are set to "error", not "warn", so they function as hard gates rather than advisory signals.
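A minimal ESLint flat-config sketch of this posture (the specific rule list and limits are illustrative; `noInlineConfig` and `reportUnusedDisableDirectives` are standard ESLint linter options):

```javascript
// eslint.config.cjs — illustrative hard-gate configuration.
module.exports = [
  {
    linterOptions: {
      // Prevent agents from suppressing violations with inline comments
      // such as // eslint-disable-next-line.
      noInlineConfig: true,
      reportUnusedDisableDirectives: "error",
    },
    rules: {
      // Every rule at "error" so CI fails instead of warning.
      "no-console": "error",
      "complexity": ["error", { max: 10 }],
      "max-lines": ["error", 400],
    },
  },
];
```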
Rules Files as Constraint Harnesses
Rules files are persistent, repository-scoped instruction sets injected into an agent's context at session start. The mechanical distinction from simple prompts is precise: rules files survive across sessions, are injected automatically by the runtime, scope to an entire directory tree, and compose hierarchically across parent and child directories.
The Cross-Tool Landscape
The AGENTS.md spec, released in August 2025 as an open standard emerging from collaboration across the AI coding ecosystem including OpenAI, Google, Cursor, Factory, and others, now functions as a shared cross-tool convention. OpenAI's repository uses 88 AGENTS.md files across subcomponents, demonstrating monorepo-scale constraint composition.
| Tool | File | Scope |
|---|---|---|
| OpenAI Codex | AGENTS.md | Hierarchical, Git root to CWD |
| Claude Code | CLAUDE.md | Project root + ~/.claude |
| Cursor | .cursor/rules/*.mdc | Project-scoped |
| GitHub Copilot | .github/copilot-instructions.md | Repo-wide + path-specific |
| Intent | AGENTS.md, CLAUDE.md, plus native Augment Rules | Cross-tool compatible |
Augment Rules: Three-Type Constraint Architecture
Intent implements a rules system called Augment Rules with three rule types, giving granular control over both agent behavior and when rules are loaded into context. As documented in the rules reference: always_apply rules are included in every prompt automatically; agent_requested rules load when the agent determines they are relevant; and manual rules load only when explicitly invoked. This selective loading preserves constraint enforcement while keeping the context window focused on the task.
Workspace rules live in the repository and are shared via version control. User-level rules in the home directory apply across all projects. The architectural distinction: Intent's Context Engine handles what can be inferred from the codebase; rules files are reserved for what cannot be inferred, such as naming conventions, logging standards, and architectural boundaries.
Why Rules Files Alone Are Insufficient
Rules files are one layer of the harness, not the complete solution. LLM compliance with instructions is probabilistic, not deterministic. They must be combined with deterministic outer-harness constraints, such as linters and CI gates, to be reliable at scale.
GitHub's analysis of 2,500+ repositories using AGENTS.md files recommends a three-tier boundary pattern:
| Tier | Examples |
|---|---|
| Always | Log all notification delivery attempts; use UTC for scheduling |
| Ask First | Adding a new notification channel, changing retry intervals |
| Never | Send notifications without verified opt-in; modify the unsubscribe flow |
Without the "Ask First" tier, an agent building retry logic picks an interval on its own.
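Rendered as an AGENTS.md fragment, the three tiers might look like the following (the section wording is illustrative, not a quoted example from GitHub's analysis):

```markdown
## Boundaries

### Always
- Log every notification delivery attempt.
- Use UTC for all scheduling logic.

### Ask First
- Adding a new notification channel.
- Changing retry intervals or backoff policy.

### Never
- Send a notification without verified opt-in.
- Modify the unsubscribe flow.
```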
The PEV Loop: Plan, Execute, Verify as a Harness Pattern
The Plan-Execute-Verify (PEV) pattern is a three-phase agent architecture that separates planning from execution and enforces verification as a structured feedback loop. Rather than asking an LLM to solve a multi-step problem in one pass, PEV instructs the agent to decompose the problem into an explicit plan, execute against that plan, then verify the output against both the plan and external quality criteria.
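In code, the phase boundaries can be sketched as an explicit loop (the planner, executor, and verifier here are stand-in stubs, not any particular framework's API):

```javascript
// PEV skeleton: plan once, then execute and verify per step with bounded
// retries. planTask, executeStep, and verifyStep are illustrative stubs.
function runPEV(task, { planTask, executeStep, verifyStep, maxRetries = 2 }) {
  const plan = planTask(task); // explicit decomposition with acceptance criteria
  const results = [];
  for (const step of plan.steps) {
    let attempt = 0;
    let outcome = executeStep(step);
    let check = verifyStep(step, outcome);
    while (!check.ok && attempt < maxRetries) {
      attempt += 1;
      // Feed the structured failure back into the next attempt,
      // rather than discarding it as a bare pass/fail.
      outcome = executeStep(step, { feedback: check.reason });
      check = verifyStep(step, outcome);
    }
    if (!check.ok) throw new Error(`Step failed verification: ${check.reason}`);
    results.push(outcome);
  }
  return results;
}
```

The retry loop is where PEV differs from a one-shot pipeline: the verifier's reason string re-enters the executor as context instead of terminating the run.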
How PEV Differs from Generate-and-Check
The distinction is architectural, not cosmetic. A generate-and-check workflow runs tests after the agent produces output. PEV enforces phase boundaries with gates at every transition.
| Dimension | Generate-and-Check | PEV Loop |
|---|---|---|
| Planning | None; LLM generates directly | Explicit decomposition with acceptance criteria |
| Execution scope | Unconstrained | Bounded by plan; harness gates fire on every tool call |
| Verification timing | Post-hoc only | Pre-execution + runtime + post-execution + plan alignment |
| Feedback signal | Binary pass/fail | Error messages with context looped back into agent reasoning |
| Human involvement | Review output artifacts | Maintain harness; approve at high-leverage decision points |
Pre-execution gates fire before any tool call: Is this a known tool? Are the arguments valid? Does this action require user approval? Is the requested path inside the workspace?
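A minimal pre-execution gate might run those four checks before dispatching a tool call (the tool registry, approval list, and workspace root are hypothetical):

```javascript
// Pre-execution gate sketch: reject a tool call before it runs.
// The registry, approval set, and workspace root are illustrative.
const TOOL_REGISTRY = { read_file: ["path"], write_file: ["path", "content"] };
const REQUIRES_APPROVAL = new Set(["write_file"]);
const WORKSPACE_ROOT = "/repo/";

function gateToolCall(call, { approved = false } = {}) {
  const schema = TOOL_REGISTRY[call.tool];
  if (!schema) return { allow: false, reason: `unknown tool: ${call.tool}` };
  for (const arg of schema) {
    if (!(arg in call.args)) return { allow: false, reason: `missing argument: ${arg}` };
  }
  if (REQUIRES_APPROVAL.has(call.tool) && !approved) {
    return { allow: false, reason: "user approval required" };
  }
  // Naive containment check; a production gate would resolve ".." first.
  if ("path" in call.args && !call.args.path.startsWith(WORKSPACE_ROOT)) {
    return { allow: false, reason: "path escapes workspace" };
  }
  return { allow: true };
}
```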
A distinct gate type addresses what test suites alone cannot: plan alignment. Did the agent use the existing auth middleware or create a new one? Did it follow the response format convention? These are architectural questions invisible to standard test runners.
Why PEV Addresses Non-Determinism
AI agents introduce non-determinism into application logic: the same task at different times may produce different reasoning paths. PEV addresses this by reducing degrees of freedom at the Plan phase, rejecting out-of-scope tool calls pre-execution, and catching architecturally non-compliant paths that test suites cannot see.
Explore how Intent's Coordinator agent decomposes specs into structured plans that constrain parallel Implementor agents.
Free tier available · VS Code extension · Takes 2 minutes
Intent's Verifier Agent as an Automated Harness
Intent implements a Coordinator-orchestrated multi-agent architecture with three core agent roles. The Coordinator analyzes the codebase via the Context Engine, drafts a living spec covering goals, tasks, decisions, and trade-offs, and then delegates to Implementor agents to execute tasks in parallel. The Verifier agent closes the loop.
How the Verifier Functions as a Harness Primitive
The Verifier agent checks each specialist's implementation against the living spec, validates cross-service dependencies via semantic analysis through the Context Engine, and flags violations before code reaches the PR stage. When the Verifier rejects an implementation, the rejection becomes a structured context for correction rather than being silently dropped.
| Failure Type | Response | Owner |
|---|---|---|
| Spec violation (code diverges from contract) | Block merge; inject failure context into agent retry loop | Agent or author |
| Integration regression (change breaks consumer) | Block deployment; notify dependent teams | Provider team |
| Infrastructure failure (verification tooling unavailable) | Pause gated deployments; investigate separately | Platform team |
Intent's Context Engine provides a shared codebase understanding, enabling cross-service verification. When multiple specialist agents run in parallel, they access the same semantic context and a shared, continuously updated specification. The Context Engine processes 400,000+ files with real-time indexing, so local changes are reflected in context queries in near real time.
A Documented Limitation
No automated system catches every semantic drift in the spec itself. If the specification is incorrect, downstream validation is affected too. Human review of the spec remains part of the workflow: developers can stop the Coordinator at any time to edit the spec before implementation begins.
Measuring Harness Effectiveness: Agent Success Rate, Rework Rate, and Beyond
Harness engineering without measurement is guesswork. DORA guidance defines software delivery performance using core metrics that track changes in throughput and stability over time.
| Metric | What It Measures | Measurement Method |
|---|---|---|
| Task Resolution Rate | Percentage of tasks an agent resolves correctly, verified by automated tests | Per-commit or per-PR test pass/fail |
| Code Churn Rate | Percentage of code written, then discarded or rewritten within two weeks | Weekly, attributed by authorship signal |
| Verification Tax | Time engineers spend auditing AI-generated code, offsetting generation-speed gains | Delta between time-to-first-commit and time-to-PR-approval |
| Harness Constraint Effect | Improvement in task success from constraining the agent environment, independent of model changes | Success rate comparison: constrained vs. unconstrained on identical tasks |
| Defect Escape Rate | Rate of defects in AI-generated code reaching production | Monthly, tagged by AI vs. non-AI commit metadata |
| Pass@1 Rate | Whether the agent resolves correctly on the first attempt, without retries | Per evaluation run |
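A sketch of computing two of these metrics from raw evaluation records (the record shape is an assumption, not a standard format):

```javascript
// Compute Task Resolution Rate and Pass@1 from evaluation records.
// The record shape { resolved, attempts } is illustrative.
function scoreRuns(records) {
  const total = records.length;
  const resolved = records.filter((r) => r.resolved).length;
  const firstTry = records.filter((r) => r.resolved && r.attempts === 1).length;
  return {
    taskResolutionRate: resolved / total, // any number of attempts
    passAt1: firstTry / total, // first attempt only, no retries
  };
}

const runs = [
  { resolved: true, attempts: 1 },
  { resolved: true, attempts: 3 }, // resolved, but only after retries
  { resolved: false, attempts: 3 },
  { resolved: true, attempts: 1 },
];
const scores = scoreRuns(runs);
// taskResolutionRate = 3/4, passAt1 = 2/4
```

Tracking the gap between the two numbers shows how much the harness's retry loops, rather than the model alone, are carrying task success.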
Benchmarks and Baselines
Top agents achieve 65–76.8% resolution rates on SWE-bench Verified Python tasks. METR found, however, that many benchmark-passing PRs would not be merged by actual maintainers. Teams need both signals: benchmark resolution and maintainer-standard mergeability.
The DORA report found 30% of developers reported little to no trust in AI-generated code. A harness that makes agent behavior predictable reduces the cognitive burden of review, which DORA frames explicitly as a trust architecture problem.
One finding underscores the power of constraint design: Vercel has reported that reducing an agent's available tools improved task success rates. However, the specific claim that this improvement exceeded any model upgrade is not supported by available evidence and should not be treated as a verified result.
What to Avoid
DORA warns against relying on narrow, output-based metrics to measure actual productivity. Metrics like "lines of code accepted" or "number of AI suggestions used" measure volume, not reliability, and should not serve as primary signals of harness effectiveness.
Start with Three Constraints Before Your Next Agent-Generated PR
The diagnostic question: which architectural invariants must hold regardless of who writes the code? OpenAI started with structured logging, naming conventions, file size limits, and reliability requirements.
Audit the last five agent-generated PRs for recurring debt patterns. Pick three constraints that would have blocked those issues. Encode them as lint rules with remediation-instruction error messages, fail CI on violations, and measure review time and defect escape rate before and after.
Intent's pre-merge verification architecture implements this pattern at scale: the Verifier agent, Augment Rules, and the Context Engine combine to constrain multi-agent workflows without slowing generation down.
See how Intent's Verifier agent, Augment Rules, and the Context Engine constrain multi-agent workflows across large codebases.
Free tier available · VS Code extension · Takes 2 minutes
Written by

Molisha Shah
Molisha is an early GTM and Customer Champion at Augment Code, where she focuses on helping developers understand and adopt modern AI coding practices. She writes about clean code principles, agentic development environments, and how teams are restructuring their workflows around AI agents. She holds a degree in Business and Cognitive Science from UC Berkeley.