AI / ML

Prompt eval regression gate for LLM releases

Run the eval suite for every model checkpoint, compare quality, safety, latency, and cost against baseline, and block regressions.

mlllmevalsprompt engineeringregression testingrelease gatingllmopsmodel evaluation

[ workflow / ai ]

Prompt eval regression gate for LLM releases

Cosmos runs on every model checkpoint or release-candidate branch. It fetches the last shipped baseline, reruns the same scenarios, and blocks critical regressions with a Linear or Jira incident. Clean runs promote the new baseline; near misses are queued for ML triage and logged for trend dashboards.

11 nodes

10 edges

Trigger[trigger]

Checkpoint or RC push

New model or release candidate

System step[fetch-baseline]

Fetch baseline

Last shipped scores + outputs

System step[run-eval]

Run eval suite

Accuracy, safety, latency, cost

AI Agent step[compare]

Compare to baseline

Per-scenario score deltas

Decision

Any critical regression?

Past floor on critical scenario

Yes

Output / Result[open-incident]

Open release-block incident

Linear or Jira with deltas

Output / Result[open-incident]

Open release-block incident

Linear or Jira with deltas

Decision

Any critical regression?

Past floor on critical scenario

YES

Safety filter[flag-near-miss]

Flag near-miss edges

Inside warning band, queue review

System step[promote-baseline]

Promote new baseline

Candidate scores become baseline

Output / Result[allow-release]

Allow release

Mark gate green for deploy

Monitor path[log-history]

Append eval history

Append-only trend log

Workflow prompt

Paste this into Augment to reproduce the workflow end-to-end.

Build a Cosmos workflow that gates every release candidate of an LLM-powered feature on a prompt-evaluation regression suite. Replay the same eval set against the candidate, compare every scenario to the last shipped baseline, block the release if anything critical regressed, and only promote the baseline once the candidate is clean.

Trigger: every new model checkpoint published to the model registry, plus every push to a release-candidate branch in the repo (for example branches matching `release/*` or `rc-*`).

Steps:
1. Fetch the baseline scores and example outputs from the last shipped version (the current production checkpoint or release tag) so every comparison is apples-to-apples.
2. Run the eval suite against the candidate. Replay the full scenario set: golden prompts, adversarial probes, safety jailbreak attempts, latency / cost benchmarks: and capture per-scenario scores for accuracy, safety, latency (p50 / p95) and cost per call.
3. Compare candidate to baseline. Compute deltas per scenario and per metric, plus aggregate movement on each metric. Flag any scenario that crossed its configured floor (accuracy drop, safety score drop, latency or cost ceiling breached).
4. Decision: "Any critical scenario regressed?". A critical regression is a scenario tagged critical that crosses its floor, or aggregate movement past the release-gate threshold.
- If yes, open an incident in Linear / Jira with the failing scenarios, candidate vs baseline scores, the actual model outputs side-by-side, and a one-paragraph summary. Block the release gate and notify the ML team channel so the fix loop can start. The workflow exits to the eval-history log; the next checkpoint or RC push re-enters the gate.
- If no, continue.
5. Flag near-miss edge cases: any scenario inside the warning band, close to the floor but not over it: and queue them for the ML team to triage asynchronously so the next baseline isn't blind to drift.
6. Promote the candidate scores and example outputs as the new baseline so the next candidate is compared against this version, not the old one.

The clean path ends by marking the release gate green so the deploy pipeline can ship the candidate. Every run: block or allow: appends the run id, candidate identifier, baseline identifier, every per-scenario score and delta, the decision, the near-miss queue and the incident link when one was opened to the append-only eval-history log so trend dashboards (regression rate, recurring failure scenarios, accuracy / safety drift, MTTR on blocked releases) can be built on top.

Constraints:
- Never ship a candidate with a known critical regression. The gate is mandatory; no manual override that bypasses the recorded comparison.
- Always attach example outputs and per-scenario score deltas to the incident; the ML team must be able to reproduce the regression without rerunning the suite.
- Always keep the eval-history log append-only: never overwrite a prior run record, even when the same candidate is re-evaluated later.
- Always update the baseline only after a clean gate, never before, so a regression cannot poison the comparison floor for the next candidate.

← All Workflows