We added Karpathy-inspired coding rules to the root AGENTS.md and ran three coding agents through 40 OpenClaw PRs. The judge scored code quality as basically unchanged, but the agents got to the same answers with less work: fewer tool calls and file reads, and consistently lower time and cost.
What we tested
We tested Auggie, Claude Code, and Codex against 40 PRs from OpenClaw. For each PR, we ran each agent twice: once with the existing AGENTS.md, and once with AGENTS.md plus about 2.5k characters of Karpathy-style coding rules prepended to the file.
The Karpathy rules are the kind of thing you'd write at the top of a team style guide.
Here is the summary:
1. Think before coding: state assumptions, surface tradeoffs, ask if unclear
2. Simplicity first: no speculative features, no single-use abstractions, minimum code
3. Surgical changes: don't touch adjacent code, match existing style, no drive-by refactors
4. Goal-driven execution: define verifiable success criteria, loop until met
In short: solve the ticket, stay local, don't wander.
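For concreteness, here's a minimal sketch of how a Karpathy variant can be produced by prepending a rules block to an existing AGENTS.md. The file names mirror the setup below, but the rules text is an abbreviated stand-in for the real ~2.5k-character block:

```python
from pathlib import Path

# Abbreviated stand-in for the ~2.5k-character Karpathy-style rules block.
KARPATHY_RULES = """\
# Coding rules
1. Think before coding: state assumptions, surface tradeoffs, ask if unclear.
2. Simplicity first: no speculative features, no single-use abstractions.
3. Surgical changes: don't touch adjacent code, match existing style.
4. Goal-driven execution: define verifiable success criteria, loop until met.
"""

def make_karpathy_variant(repo_root: str) -> Path:
    """Write AGENTS-karpathy.md: the rules block prepended to the repo's AGENTS.md."""
    baseline = Path(repo_root) / "AGENTS.md"
    variant = Path(repo_root) / "AGENTS-karpathy.md"
    variant.write_text(KARPATHY_RULES + "\n" + baseline.read_text())
    return variant
```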
The setup:
- 40 hand-selected PRs from OpenClaw, mid-complexity (100–300 LOC excluding tests)
- Three runners: Auggie on Opus 4.7, Claude Code on Opus 4.7, Codex on GPT-5.4
- Two variants per PR: baseline `AGENTS.md` (~18K chars) vs. `AGENTS-karpathy.md` (~20.5K chars)
- 6 runs per config, 18 repeats per individual PR in total
- Scored by an LLM judge on completeness, correctness, best practices, code reuse, and unsolicited documentation (a minimal sketch of the rubric's shape follows this list)
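Here is that sketch, assuming each sub-metric is scored in [−1.0, 1.0] (as in the charts below) and that the overall score is a plain mean; the real judge's weighting may differ, and the numbers in the usage lines are made up:

```python
from dataclasses import dataclass, fields

@dataclass
class JudgeScore:
    # Assumption: each sub-metric is scored in [-1.0, 1.0].
    completeness: float
    correctness: float
    best_practices: float
    code_reuse: float
    unsolicited_docs: float  # higher = less unrequested documentation

    def overall(self) -> float:
        # Assumption: the overall score is an unweighted mean of the sub-metrics.
        values = [getattr(self, f.name) for f in fields(self)]
        return sum(values) / len(values)

# Made-up scores, purely to show the shape of the comparison:
baseline = JudgeScore(0.61, 0.58, 0.70, 0.64, 0.52)
karpathy = JudgeScore(0.55, 0.51, 0.67, 0.59, 0.57)
print(f"overall delta = {karpathy.overall() - baseline.overall():+.2f}")
```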
What we did not test:
- We didn't test the effect of Karpathy-style guidelines on easy or hard PRs, so the effect might be different there.
- It also doesn't apply to every `AGENTS.md`: we ran a test on another repo where the instructions heavily overlap with the Karpathy skill, and, unsurprisingly, there was no statistically significant difference.
Results
Cost, time, and tokens dropped
Every runner used fewer tool calls and finished faster. Each dot represents a separate, full 40-PR run. Relative change is computed as the percentage difference from the average of the baseline (no-Karpathy) runs.

Baseline vs Karpathy — Per-Run Results
| | Auggie | Claude Code | Codex |
|---|---|---|---|
| Duration | −3% | −7% | −7% |
| Tool Calls | −3% | −8% | −6% |
| Cost | −3% | −10% | −8% |
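To make the relative-change definition above concrete, here is a minimal sketch of the computation. The durations in the usage line are made up, not our measurements:

```python
def relative_change(karpathy_runs: list[float], baseline_runs: list[float]) -> float:
    """Percentage change of the Karpathy average vs. the baseline average."""
    base = sum(baseline_runs) / len(baseline_runs)
    karp = sum(karpathy_runs) / len(karpathy_runs)
    return (karp - base) / base * 100.0

# Made-up per-run durations (minutes for a full 40-PR run):
print(f"{relative_change([279, 281, 278], [300, 299, 301]):+.1f}%")  # ≈ -7%
```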
The reduction came mostly from fewer searches and fewer file reads: the agents found what they needed in fewer lookups, though not always without impact on quality (more on that below). Tool-call failure rates stayed flat at 5–8%, and the directional change was the same for every runner.
Output tokens fell by similar margins across runners. Per PR, Karpathy was faster and cheaper on about 30 of 40 PRs. The pattern held across all three agents.
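One way to sanity-check that winning on about 30 of 40 PRs isn't noise is a simple sign test; this is a sketch of that check, not part of the original analysis:

```python
from scipy.stats import binomtest

# Under the null hypothesis of no effect, Karpathy wins each PR with p = 0.5.
result = binomtest(k=30, n=40, p=0.5, alternative="two-sided")
print(f"p-value for 30/40 wins: {result.pvalue:.4f}")  # comfortably below 0.05
```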
A 3–10% efficiency gain from a small prompt change isn't a model breakthrough. If you're running a coding agent at scale, it's still real money, real latency, and real capacity.
Less sidequesting with the same quality (mostly)
| | Auggie | Claude Code | Codex |
|---|---|---|---|
| Quality Score | +0.00 | −0.07 | +0.00 |
By the judge's score, the code didn't get better.
Open circles are baseline, solid green markers are Karpathy. Each score is in the [−1.0, 1.0] range; the delta is shown on the chart.

Results on per-run quality score
Auggie and Codex held steady, but Claude Code dropped by 0.07 (a delta similar to an Opus → Sonnet drop). It degraded on multiple sub-metrics: correctness fell by 0.07, completeness by 0.06. A closer look at the data reveals why: with the Karpathy guidelines, Claude Code takes more conservative trajectories, touching ~5% fewer files per task.
Karpathy-style guidelines don't transfer uniformly across agent harnesses and repositories. The objectives themselves are meaningful for software engineering tasks in general, but the baseline system prompt and orchestration of each harness differ. In Codex, the guidelines likely add useful structure (improving efficiency). In Augment, the baseline prompt already encodes similar constraints, so the marginal impact is smaller. In Claude Code, the system prompt may already be highly constrained, so layering additional constraints could reduce exploration and degrade performance.
Less extra documentation
| | Auggie | Claude Code | Codex |
|---|---|---|---|
| Overall Score | +0.00 | −0.07 | +0.00 |
| Completeness | −0.02 | −0.06 | +0.01 |
| Correctness | −0.01 | −0.07 | −0.02 |
| Best Practices | −0.01 | −0.03 | −0.00 |
| Code Reuse | −0.00 | −0.05 | +0.02 |
| Unsolicited Docs | +0.06 | +0.05 | +0.06 |
One sub-metric moved in the same direction for every runner: Unsolicited Docs scored better.
That tracks. The Karpathy rules told the agents not to add features beyond the ask and not to "improve" nearby code, comments, or formatting. The agents listened and did less unrequested work.
Per PR
Averaged across all runners per PR:
- Cost, Duration: Karpathy is faster and cheaper on 30 of 40 PRs, consistent across all runners
- Quality: Karpathy is stable, with wins split roughly 20/20, apart from Claude Code
The code itself didn't reliably get better, but the path got shorter.
AGENTS.md is a cheap control layer - use it wisely
A good AGENTS.md is a cheap control layer and, in a sense, an extension of your system prompt tailored to a specific repo. It provides behavioral and domain constraints fitted to your repository. This experiment shows that such guidelines consistently yield faster, cheaper runs with fewer tool calls; quality, however, stays the same or drops, depending on the harness.
If you're running coding agents at scale, adding Karpathy-style guidelines could give you a free lunch, but it should be calibrated to your existing setup.
Written by

Slava Zhenylenko
Member of Technical Staff
Slava is an applied AI engineer with over a decade of experience who has worked across a wide range of domains, such as GenAI, CV, deep learning, and the operationalization of AI systems.
