The bottleneck moved to verification. So we automated that too.

Internally, a significant share of engineering time goes to manually verifying that the code an agent writes actually works end to end in a dev environment. We set out to change that.

We built an internal agent, the Verifier, to take over that step. On every run it deploys the change to a live environment, exercises the affected behavior, and posts what it observed (logs, API responses, and screenshots) back to the PR. It runs on Cosmos, our cloud-agent platform.

Why not write more tests?

We considered maintaining a suite of end-to-end and Playwright tests that run against real environments, and decided against it on conviction as much as cost. The cost is real: such a suite has to be extended by hand for every feature, it flakes, and it needs a permanent owner. The deeper objection is about direction. A hand-maintained suite grows its upkeep with every change; we would rather build something an agent drives than something we have to keep maintaining.

Building on Cosmos

Choosing an agent over a test suite is easy to say but, until recently, was hard to justify building. A verification agent needs a lot before it does anything useful: compute, a deployable copy of the codebase, access to the systems it has to drive, and a way to start itself when a pull request appears. Assembling that from scratch is a platform project of its own, and the size of that project is what used to make an agent the impractical choice next to a test suite you could grow incrementally.

Cosmos removed that barrier. It provides the agent runtime out of the box:

VMs with real dev tooling and a browser, where the Verifier checks out, builds, deploys, runs commands, and drives the deployed UI.
Triggers that start a run when a pull request is published, or when its author issues an explicit run command in a comment.
An expert model, in which the whole workflow is packaged as an expert: a reusable agent definition shared across the team.

That turned the work from building infrastructure into composing a workflow on top of it, which is what made the Verifier practical to build.

Cosmos gives us the runtime, but the test environment is still ours to manage. For each run, the Verifier deploys the change to its own isolated instance of our application, exercises that instance end to end, then tears it down or resets it before the next run. That isolation keeps concurrent verification runs from interfering with one another.
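The lease, exercise, and teardown cycle described above can be sketched as a context manager. This is a minimal illustration, not the Verifier's implementation: `NamespacePool`, `verify`, and the namespace names are hypothetical, and a real pool would deploy to and reset actual environments where this one only tracks names.

```python
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class NamespacePool:
    """Hypothetical pool of isolated dev namespaces (not a Cosmos API)."""
    free: list[str]
    leased: set[str] = field(default_factory=set)
    torn_down: list[str] = field(default_factory=list)

    @contextmanager
    def lease(self):
        """Hand out one namespace for the duration of a run."""
        if not self.free:
            raise RuntimeError("no free namespace available for this run")
        name = self.free.pop()
        self.leased.add(name)
        try:
            yield name
        finally:
            # Runs even when verification raises, so a failed run
            # cannot leak state into the next one.
            self.torn_down.append(name)
            self.leased.discard(name)
            self.free.append(name)


def verify(pool, change_id, exercise):
    """Run one verification inside its own leased namespace."""
    with pool.lease() as namespace:
        return exercise(namespace, change_id)
```

Because each run holds a distinct namespace until its `finally` block returns it, two concurrent runs never share an instance, which is the isolation property the paragraph above relies on.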

Making it extensible

The Verifier runs on a library of composable skills. Each SKILL.md holds the exact commands and edge cases for one task: mint a token for an internal gRPC service, create a user or a team, provision a paid admin, drive a flow in the customer UI. A top-level verification index points at all of them.

On each run the Verifier reads the index, works out which surfaces the diff touches, reads the relevant skills, and combines them, since a single PR often spans more than one surface. Supporting a new surface then means writing one more skill, not rewriting the Verifier.
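To make the index-then-skills step concrete, here is a small deterministic stand-in. The Verifier is an agent and presumably reasons over the index and the diff directly; the sketch below replaces that with path globs only to show the shape of the lookup. The index layout, surface names, paths, and skill file locations are all assumptions for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical verification index: surface -> (path globs, skill file to read).
VERIFICATION_INDEX = {
    "grpc-auth": (["services/auth/*"], "skills/mint-grpc-token/SKILL.md"),
    "customer-ui": (["web/customer/*"], "skills/drive-customer-ui/SKILL.md"),
    "billing": (["services/billing/*"], "skills/provision-paid-admin/SKILL.md"),
}


def skills_for_diff(changed_paths, index=VERIFICATION_INDEX):
    """Return the skill files for every surface the diff touches."""
    selected = []
    for surface, (globs, skill) in index.items():
        touched = any(
            fnmatchcase(path, pattern)
            for path in changed_paths
            for pattern in globs
        )
        if touched:
            selected.append(skill)
    return selected
```

Under this layout, supporting a new surface is one new index entry plus one new SKILL.md; `skills_for_diff` itself does not change.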

These skills are not locked to the Verifier. They live in the repo, so any expert can pull them into its own workflow. The skill that provisions a paid admin for a verification run provisions it just as well for anything else.

Anatomy of a run

[Figure: Architecture of a verification run. Cosmos triggers (PR published, author comment, or manual launch) start a run on a Cosmos VM, which leases a dev-e2e namespace, deploys the change, provisions state, exercises it end-to-end, and reports back. The verification skills library feeds the exercise step, logs and screenshots are captured to Cosmos artifacts, the report is posted to the GitHub pull request, and a human reviewer reads it and decides to merge.]

Here is one run, from June 29, 2026, that verified a change to the Cosmos frontend itself: a pull request that added a copy-link button for experts (the configurable agents users build in Cosmos). The button copies a shareable link to an expert, and it appears both on the expert's own page and in the Actions menu of the experts list. The report the Verifier posted, lightly trimmed:

Verifier Report for PR #57466

TL;DR: Drove the new expert copy-link affordance end-to-end, both the experts-table row action and the detail-page header button, against the live Cosmos web app on a dev namespace, capturing the exact URL each writes to the clipboard.

Walkthrough

The detail-page header gains a link-icon button that copies a deep link to the expert. I intercepted the clipboard and captured the written URL (/home?expertId=...), and a Link copied toast fired.

[Screenshot: The expert detail-page header with the new link-icon button, between Version history and Duplicate]

The experts-table row menu gains a Copy link item that copies the same deep link, with the same toast.

[Screenshot: The experts-table row Actions menu open, showing the new Copy link item]

Both surfaces emit the identical string, confirming the shared helper and the wiring on both call sites.

Falsifiable proof: before each action I monkey-patched navigator.clipboard.writeText to record every written string, then triggered the affordance and read it back. The row action and the detail button produced the same URL.

Scope. Under test: the copy-link affordance for experts, exercised against a live dev namespace. Not tested: helper unit cases covered by CI, and round-trip navigation of the copied link.
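The "falsifiable proof" step in the report is a record-and-compare pattern: wrap the write function, trigger each affordance, then compare what was written. The real run patched navigator.clipboard.writeText in the browser; the sketch below is a pure-Python stand-in for the same idea. `FakeClipboard`, `expert_link`, the two call-site functions, and the `example-id` value are hypothetical.

```python
class FakeClipboard:
    """Stand-in for the browser clipboard object."""

    def write_text(self, text):
        pass  # A real clipboard would hand the text to the OS.


def record_writes(clipboard):
    """Wrap clipboard.write_text so every written string is also recorded."""
    written = []
    original = clipboard.write_text

    def recording_write_text(text):
        written.append(text)
        return original(text)

    clipboard.write_text = recording_write_text
    return written


def expert_link(expert_id):
    """Hypothetical shared helper used by both call sites."""
    return f"/home?expertId={expert_id}"


def row_action_copy(clipboard, expert_id):
    clipboard.write_text(expert_link(expert_id))


def detail_button_copy(clipboard, expert_id):
    clipboard.write_text(expert_link(expert_id))
```

The check is falsifiable because the recorded list is the only source of truth: if either call site skipped the shared helper or wrote a different URL, the two recorded strings would differ.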

Sometimes the run finds something. On a PR meant to preserve message-sender attribution, every test passed and CI was green, but the real path told a different story. The Verifier exercised the endpoint the clients actually call and found the fix was never reaching it:

Drove the new Poseidon sender-attribution forwarding end-to-end against a live dev namespace and found it never takes effect on the streaming chat endpoint that the real agent clients actually call. […] The fix is wired into the non-streaming /chat handler only, but the Poseidon CLI agent uses /chat-stream […] so the attribution the PR aims to preserve is silently dropped on the production path. The PR's unit test exercises only run_chat (the /chat path), so CI stays green.

To confirm that the missing sender ID was a real failure and not a broken test, the Verifier added a control: on the same /chat-stream request, an unrelated field saved to storage correctly while the sender ID did not. The storage path worked; only the change's field was dropped, pointing the failure at the PR rather than the pipeline.
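The reasoning behind that control can be written down as a three-way classification. This is a sketch of the logic only, assuming the sent and stored records are available as dictionaries; `classify`, `Finding`, and the field names in the usage example are hypothetical.

```python
from enum import Enum


class Finding(Enum):
    FIELD_PERSISTED = "the field under test was stored"
    FIELD_DROPPED = "storage works, but the field under test was dropped"
    INCONCLUSIVE = "the control is missing too; suspect the pipeline or the test"


def classify(sent, stored, target, control):
    """Separate a real regression from a broken storage path or test."""
    # If the unrelated control field did not survive, nothing can be
    # concluded about the field under test.
    if stored.get(control) != sent[control]:
        return Finding.INCONCLUSIVE
    if stored.get(target) == sent[target]:
        return Finding.FIELD_PERSISTED
    return Finding.FIELD_DROPPED
```

For example, with `sent = {"sender_id": "u-1", "note": "control"}` and `stored = {"note": "control"}`, `classify(sent, stored, "sender_id", "note")` returns `Finding.FIELD_DROPPED`, the case the Verifier reported.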

Earning trust, and retaining it

We made a handful of deliberate decisions about what goes into a report, each in the interest of earning the trust of the author and reviewer alike:

Honest limits. When the Verifier cannot drive a change, it does not pretend to; it states why. If it can validate part of a change but not all of it, it does that much and reports exactly how far it got.
Evidence you can check. Every finding is backed by something the reviewer can open. The run's logs, metrics, screenshots, and captured outputs are saved to Cosmos's artifacts service, including the Playwright trace, so a reviewer can replay the exact run in their own browser locally.
No verdict. The Verifier never posts a pass-or-fail result. A clean run means only that the behavior we tested matched the author's intent, not that the PR is sound. It gathers evidence and leaves the judgment to the reviewer.
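One way to hold a report to these three decisions is to encode them in its data shape. The sketch below is an assumption about how that could look, not the Verifier's actual schema: `Evidence`, `Report`, `render`, and the artifact paths are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Evidence:
    claim: str     # what was observed
    artifact: str  # hypothetical path to something the reviewer can open

    def __post_init__(self):
        # "Evidence you can check": a finding without an artifact is rejected.
        if not self.artifact:
            raise ValueError("every finding needs an artifact the reviewer can open")


@dataclass(frozen=True)
class Report:
    summary: str
    tested: list[str]
    not_tested: list[str]  # "Honest limits": stated explicitly, never implied
    evidence: list[Evidence] = field(default_factory=list)
    # "No verdict": there is deliberately no passed/failed field here.


def render(report):
    """Format the report as plain text for a PR comment."""
    lines = [f"TL;DR: {report.summary}", "", "Evidence:"]
    for item in report.evidence:
        lines.append(f"- {item.claim} ({item.artifact})")
    lines.append("")
    lines.append(
        "Scope. Under test: " + "; ".join(report.tested)
        + ". Not tested: " + "; ".join(report.not_tested) + "."
    )
    return "\n".join(lines)
```

Leaving the verdict out of the type, not just out of the template, means no code path can post a pass-or-fail result by accident.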

What verification unlocks

Verification is a means, not the end. What follows has not shipped yet, but it is where we are headed:

A tour of the change, in the product. The report is already a walk-through of what changed; the next step is to give it inside Cosmos as an interactive tour of the PR, so a reviewer steps through the change and its evidence in the product instead of reading a comment.
Verification inside the coding loop. The highest-value place to run it is while a coding agent is still working, so it writes, verifies against runtime evidence, and iterates on its own.
Whole change types, automated end-to-end. Once the agent can prove its own work, entire classes of change can be owned start to finish: read the bug report, write the fix, prove it behaves, and open the PR with the evidence attached.

None of this is finished. Today the Verifier runs on every pull request, exercises the change, and puts the evidence in front of the reviewer. We started with agents writing the code and agents reviewing it; now an agent verifies it too.