DevOps

Canary deployment SLO breach auto-rollback

Watch canary traffic against the stable baseline, halt the ramp when latency or error rate breaks SLOs, and roll back automatically.

kubernetes, canary, slo, progressive delivery, auto-rollback, datadog, prometheus, p99 latency, error rate, pagerduty

[ workflow / devops ]

Cosmos watches canary deployments and compares p50/p95/p99 latency, error rate, and saturation against stable traffic. Healthy canaries keep ramping; sustained SLO breaches halt the rollout, execute rollback, open an incident, and write an audit entry. Missing baselines or weak traffic fail safely to manual review.
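As a rough illustration of what a "sustained SLO breach" could mean in code, the sketch below applies the thresholds stated later in the workflow prompt (error rate above 2× baseline or p99 latency above 1.5× baseline, for two consecutive windows). The names `WindowMetrics`, `window_breached`, and `BreachTracker` are hypothetical and are not part of Cosmos; this is a minimal sketch, not the product's implementation.

```python
from dataclasses import dataclass

# Thresholds taken from the workflow text; everything else is illustrative.
ERROR_RATE_RATIO_LIMIT = 2.0   # breach if canary error rate > 2x baseline
P99_RATIO_LIMIT = 1.5          # breach if canary p99 latency > 1.5x baseline
REQUIRED_CONSECUTIVE = 2       # "sustained" = two consecutive breached windows


@dataclass(frozen=True)
class WindowMetrics:
    """Hypothetical container for one observation window."""
    error_rate: float        # fraction of failed requests, e.g. 0.01 for 1%
    p99_latency_ms: float


def window_breached(canary: WindowMetrics, baseline: WindowMetrics) -> bool:
    """Return True if this single window violates either ratio limit."""
    # Multiply instead of dividing so a zero baseline error rate cannot
    # raise ZeroDivisionError. Assumption: with a zero baseline, any canary
    # error counts as a breach; a real setup may want a minimum floor.
    error_breach = canary.error_rate > ERROR_RATE_RATIO_LIMIT * baseline.error_rate
    latency_breach = canary.p99_latency_ms > P99_RATIO_LIMIT * baseline.p99_latency_ms
    return error_breach or latency_breach


class BreachTracker:
    """Counts consecutive breached windows; a healthy window resets the count."""

    def __init__(self) -> None:
        self.consecutive = 0

    def record(self, breached: bool) -> bool:
        """Record one window verdict; return True when rollback should fire."""
        self.consecutive = self.consecutive + 1 if breached else 0
        return self.consecutive >= REQUIRED_CONSECUTIVE
```

Resetting the counter on any healthy window is what keeps a single noisy window from triggering a rollback.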

14 nodes

11 edges

Trigger[trigger]

New canary deployed

K8s rollout event

System step[baseline]

Pull baseline SLOs

Datadog + Grafana, 7d

Decision

Baseline available?

Sufficient stable data

Human-in-the-loop[manual-gate]

Manual approval gate

Operator confirms ramp

YES

System step[sample]

Sample canary metrics

5-min window + probe

Decision

Network healthy?

Probes return on time

Output / Result[pause-alert]

Pause + alert ops

Slack ping, no rollback

YES

Decision

Significant traffic?

≥ statistical threshold

No, extend

System step[sample]

Sample canary metrics

5-min window + probe

YES

AI Agent step[compare]

Compare vs baseline

Latency, errors, saturation

Decision

Breached 2 windows?

Err >2× or p99 >1.5×

Yes

Output / Result[rollback]

Rollback + page on-call

Undo k8s + PagerDuty + log

NO

Decision

At 100% traffic?

End of ramp ladder

Yes

Output / Result[success]

Promote + Slack notify

Audit log entry

NO

System step[advance]

Advance ramp step

10→25→50→100

Workflow prompt

Paste this into Augment to reproduce the workflow end-to-end.

Build a Cosmos workflow that gates a Kubernetes canary deployment on live SLO metrics, advances traffic when the canary is healthy, and rolls back automatically when it isn't.

Trigger: a new canary deployment event from the Kubernetes control plane (Argo Rollouts / Flagger / native Deployment with a canary track), keyed on the workload name and the canary revision.

Steps:
1. Pull the baseline SLOs for the workload from Datadog and Prometheus through Grafana: p50, p95, p99 latency, error rate, and saturation over the last 7 days of stable traffic on the same service. Cache the baseline keyed by workload + route + method.
2. Decision: "Baseline available?". Sufficient stable history (≥ 24h of clean traffic, ≥ N requests / route).
- If no, fail safe to a manual approval gate before any traffic shift.
- If yes, continue to sampling.
3. Manual approval gate: page the release operator on Slack, surface the workload, the missing-baseline reason, and the proposed ramp ladder. Operator must explicitly approve before the canary receives traffic.
4. Sample canary metrics over a 5-minute window: p50/p95/p99 latency, error rate (5xx + application errors), saturation (CPU, memory, in-flight requests). In the same step, probe canary pods for reachability from the SLO scraper.
5. Decision: "Network healthy?". All scrape probes returned within timeout, no partition between SLO collectors and the canary.
- If no, pause the rollout (hold the current traffic split, do not roll back), alert release-ops on Slack, and stop: a partitioned reading is not a real SLO breach.
- If yes, continue.
6. Decision: "Significant traffic?". Canary received ≥ the per-route minimum sample size for a meaningful comparison (configurable; default ~1k requests / route or the same threshold the baseline used).
- If no, extend the observation window: wait one more 5-minute cycle and resample.
- If yes, continue.
7. Compare canary vs baseline. Score each metric against the baseline window: error_rate_canary / error_rate_baseline, p99_canary / p99_baseline, saturation deltas. Record the verdict for this window in the canary audit log.
8. Decision: "Breached for 2 consecutive windows?". Breach = error rate > 2× baseline OR p99 latency > 1.5× baseline, sustained across the previous and current 5-minute windows.
- If yes, halt the traffic ramp, execute the automated rollback (kubectl rollout undo / Argo Rollouts abort), open a PagerDuty incident with the breach evidence and audit trail, and stop.
- If no, continue to the ramp evaluation.
9. Decision: "At 100% traffic?". The canary is currently receiving full production traffic and has just cleared the final healthy window.
- If yes, promote the canary to stable, post a Slack notification with the rollout summary, append the success entry to the audit log, and stop.
- If no, continue to advance.
10. Advance the ramp one step on the 10% → 25% → 50% → 100% ladder, keeping the full ramp within a 30–60 minute total budget, then loop back to sampling for the next window.

Constraints:
- Never roll back on a single bad window: require two consecutive breaches to filter transient noise.
- Never roll back on a network-partition reading: pause and alert instead, because the data is unreliable.
- Always gate on a baseline; missing baseline means human approval, never auto-advance.
- Always write every window verdict, the final outcome, and the rollback / promotion action to an append-only audit log keyed by workload + revision so trends and post-mortems can be built later.
- Never expose raw scrape credentials or PagerDuty routing keys in Slack messages or the audit log.
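
Separately from the prompt text, the per-window decision order in steps 5 to 10 can be sketched as a small pure function. This is only an illustration under stated assumptions: `Action`, `next_ramp_step`, and `decide` are hypothetical names, the three boolean inputs are assumed to be computed elsewhere, and the baseline gate (steps 2 and 3) is left out because it runs before any traffic shift.

```python
from enum import Enum
from typing import Optional

RAMP_LADDER = (10, 25, 50, 100)  # traffic percentages from the workflow text


class Action(Enum):
    """Hypothetical outcomes of evaluating one observation window."""
    PAUSE_AND_ALERT = "pause_and_alert"   # hold the split, Slack alert, no rollback
    EXTEND_WINDOW = "extend_window"       # wait one more cycle and resample
    ROLLBACK = "rollback"                 # undo rollout, open incident, audit
    PROMOTE = "promote"                   # canary becomes stable
    ADVANCE = "advance"                   # move to the next ramp step


def next_ramp_step(current_percent: int) -> Optional[int]:
    """Return the next rung on the ladder, or None if already at the top."""
    index = RAMP_LADDER.index(current_percent)  # ValueError if off the ladder
    return RAMP_LADDER[index + 1] if index + 1 < len(RAMP_LADDER) else None


def decide(
    network_healthy: bool,
    significant_traffic: bool,
    breached_two_windows: bool,
    current_percent: int,
) -> Action:
    """Apply the gates in the same order as the numbered steps."""
    if not network_healthy:
        # A partitioned reading is unreliable, so it never triggers rollback.
        return Action.PAUSE_AND_ALERT
    if not significant_traffic:
        return Action.EXTEND_WINDOW
    if breached_two_windows:
        return Action.ROLLBACK
    if next_ramp_step(current_percent) is None:
        return Action.PROMOTE
    return Action.ADVANCE
```

Checking network health before anything else mirrors the constraint that a partition reading must pause rather than roll back, even when the metrics in that window look like a breach.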
