We added Karpathy-inspired coding rules to the root AGENTS.md and ran three coding agents through 40 OpenClaw PRs. The judge scored code quality as basically unchanged, but the agents got to the same answers with less work: fewer tool calls and file reads, and consistently lower time and cost.
What we tested
We tested Auggie, Claude Code, and Codex against 40 PRs from OpenClaw. For each PR, we ran each agent twice: once with the existing AGENTS.md, and once with AGENTS.md plus about 2.5k characters of Karpathy-style coding rules prepended to the file.
The Karpathy rules are the kind of thing you'd write at the top of a team style guide.
Here is the summary:
1. Think before coding: state assumptions, surface tradeoffs, ask if unclear
2. Simplicity first: no speculative features, no single-use abstractions, minimum code
3. Surgical changes: don't touch adjacent code, match existing style, no drive-by refactors
4. Goal-driven execution: define verifiable success criteria, loop until met
In short: solve the ticket, stay local, don't wander.
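For concreteness, here's a minimal sketch of how a Karpathy variant can be produced by prepending a rules block to an existing AGENTS.md. The file names mirror the setup below, but the rules text is an abbreviated stand-in for the real ~2.5k-character block:

```python
from pathlib import Path

# Abbreviated stand-in for the ~2.5k-character Karpathy-style rules block.
KARPATHY_RULES = """\
# Coding rules
1. Think before coding: state assumptions, surface tradeoffs, ask if unclear.
2. Simplicity first: no speculative features, no single-use abstractions.
3. Surgical changes: don't touch adjacent code, match existing style.
4. Goal-driven execution: define verifiable success criteria, loop until met.
"""

def make_karpathy_variant(repo_root: str) -> Path:
    """Write AGENTS-karpathy.md: the rules block prepended to the repo's AGENTS.md."""
    baseline = Path(repo_root) / "AGENTS.md"
    variant = Path(repo_root) / "AGENTS-karpathy.md"
    variant.write_text(KARPATHY_RULES + "\n" + baseline.read_text())
    return variant
```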
The setup:
- 40 hand-selected PRs from OpenClaw, mid-complexity (100–300 LOC excluding tests)
- Three runners: Auggie on Opus 4.7, Claude Code on Opus 4.7, Codex on GPT-5.4
- Two variants per PR: baseline `AGENTS.md` (~18K chars) vs. `AGENTS-karpathy.md` (~20.5K chars)
- 6 runs per config, 18 repeats per individual PR in total
- Scored by an LLM judge on completeness, correctness, best practices, code reuse, and unsolicited documentation (a minimal sketch of the rubric's shape follows this list)
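Here is that sketch, assuming each sub-metric is scored in [−1.0, 1.0] (as in the charts below) and that the overall score is a plain mean; the real judge's weighting may differ, and the numbers in the usage lines are made up:

```python
from dataclasses import dataclass, fields

@dataclass
class JudgeScore:
    # Assumption: each sub-metric is scored in [-1.0, 1.0].
    completeness: float
    correctness: float
    best_practices: float
    code_reuse: float
    unsolicited_docs: float  # higher = less unrequested documentation

    def overall(self) -> float:
        # Assumption: the overall score is an unweighted mean of the sub-metrics.
        values = [getattr(self, f.name) for f in fields(self)]
        return sum(values) / len(values)

# Made-up scores, purely to show the shape of the comparison:
baseline = JudgeScore(0.61, 0.58, 0.70, 0.64, 0.52)
karpathy = JudgeScore(0.55, 0.51, 0.67, 0.59, 0.57)
print(f"overall delta = {karpathy.overall() - baseline.overall():+.2f}")
```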
What we did not test:
- We didn't test the effect of Karpathy-style guidelines on easy or hard PRs, so the effect might be different there.
- It also doesn't apply to every `AGENTS.md`: we ran a test on another repo where the instructions heavily overlap with the Karpathy skill, and, unsurprisingly, there was no statistically significant difference.
Results
Cost, time, and tokens dropped
Every runner used fewer tool calls and finished faster. Each dot represents a separate, full 40-PR run. Relative change is computed as the percentage difference from the average of the baseline (no-Karpathy) runs.

Baseline vs Karpathy — Per-Run Results
| | Auggie | Claude Code | Codex |
|---|---|---|---|
| Duration | −3% | −7% | −7% |
| Tool Calls | −3% | −8% | −6% |
| Cost | −3% | −10% | −8% |
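To make the relative-change definition above concrete, here is a minimal sketch of the computation. The durations in the usage line are made up, not our measurements:

```python
def relative_change(karpathy_runs: list[float], baseline_runs: list[float]) -> float:
    """Percentage change of the Karpathy average vs. the baseline average."""
    base = sum(baseline_runs) / len(baseline_runs)
    karp = sum(karpathy_runs) / len(karpathy_runs)
    return (karp - base) / base * 100.0

# Made-up per-run durations (minutes for a full 40-PR run):
print(f"{relative_change([279, 281, 278], [300, 299, 301]):+.1f}%")  # ≈ -7%
```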
The reduction came mostly from fewer searches and fewer file reads: the agents found what they needed in fewer lookups, though not always without impact on quality (more on that below). Tool-call failure rates stayed flat at 5–8%, and the directional change was the same for every runner.
Output tokens fell by similar margins across runners. Per PR, Karpathy was faster and cheaper on about 30 of 40 PRs. The pattern held across all three agents.
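One way to sanity-check that winning on about 30 of 40 PRs isn't noise is a simple sign test; this is a sketch of that check, not part of the original analysis:

```python
from scipy.stats import binomtest

# Under the null hypothesis of no effect, Karpathy wins each PR with p = 0.5.
result = binomtest(k=30, n=40, p=0.5, alternative="two-sided")
print(f"p-value for 30/40 wins: {result.pvalue:.4f}")  # comfortably below 0.05
```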
A 3–10% efficiency gain from a small prompt change isn't a model breakthrough. If you're running a coding agent at scale, it's still real money, real latency, and real capacity.
Less sidequesting with the same quality (mostly)
| | Auggie | Claude Code | Codex |
|---|---|---|---|
| Quality Score | +0.00 | −0.07 | +0.00 |
By the judge's score, the code didn't get better.
Open circles are baseline, solid green markers are Karpathy. Each score is in the [−1.0, 1.0] range; the delta is shown on the chart.

Results on per-run quality score
Auggie and Codex held steady, but Claude Code dropped by 0.07 (a delta similar to an Opus → Sonnet drop). It degraded on multiple sub-metrics: correctness fell by 0.07, completeness by 0.06. A closer look at the data reveals why: with the Karpathy guidelines, Claude Code takes more conservative trajectories, touching ~5% fewer files per task.
Karpathy-style guidelines don't transfer uniformly across agent harnesses and repositories. The objectives themselves are meaningful for software engineering tasks in general, but the baseline system prompt and orchestration of each harness differ. In Codex, the guidelines likely add useful structure (improving efficiency). In Augment, the baseline prompt already encodes similar constraints, so the marginal impact is smaller. In Claude Code, the system prompt may already be highly constrained, so layering additional constraints could reduce exploration and degrade performance.
Less extra documentation
| | Auggie | Claude Code | Codex |
|---|---|---|---|
| Overall Score | +0.00 | −0.07 | +0.00 |
| Completeness | −0.02 | −0.06 | +0.01 |
| Correctness | −0.01 | −0.07 | −0.02 |
| Best Practices | −0.01 | −0.03 | −0.00 |
| Code Reuse | −0.00 | −0.05 | +0.02 |
| Unsolicited Docs | +0.06 | +0.05 | +0.06 |
One sub-metric moved in the same direction for every runner: Unsolicited Docs scored better.
That tracks. The Karpathy rules told the agents not to add features beyond the ask and not to "improve" nearby code, comments, or formatting. The agents listened and did less unrequested work.
Per PR
Averaged across all runners per PR:
- Cost, Duration: Karpathy is faster and cheaper on 30 of 40 PRs, consistent across all runners
- Quality: Karpathy is stable, with wins split roughly 20/20, apart from Claude Code
The code itself didn't reliably get better, but the path got shorter.
AGENTS.md is a cheap control layer - use it wisely
A good AGENTS.md is a cheap control layer and, in a sense, an extension of your system prompt tailored to a specific repo. It provides behavioral and domain constraints fitted to your repository. This experiment shows that such guidelines consistently yield faster, cheaper runs with fewer tool calls; quality, however, stays the same or drops, depending on the harness.
If you're running coding agents at scale, adding Karpathy-style guidelines could give you a free lunch, but it should be calibrated to your existing setup.
Written by

Slava Zhenylenko
Member of Technical Staff
Slava is an applied AI engineer with over a decade of experience who has worked across a wide range of domains, such as GenAI, CV, deep learning, and the operationalization of AI systems.
