Observability

Runbook auto-execution on incidents

Match incidents to runbooks, execute safe remediation steps automatically, and page a human only when the runbook fails or requires judgment.

incident-response · runbook · automation · on-call · observability · sre

Cosmos enriches PagerDuty or Opsgenie alerts with live telemetry, matches the alert signature to the relevant runbook, and executes the first safe remediation steps automatically. If the alert clears, on-call is notified without a page. If remediation fails or a step needs human judgment, the on-call engineer is paged with the full execution transcript.

08 nodes

06 edges

Trigger[trigger]

Alert fired

PagerDuty / Opsgenie

AI Agent step[enrich]

Enrich alert

Metrics · logs · recent deploys

AI Agent step[match]

Match runbook

Confluence / Notion library

AI Agent step[execute]

Execute auto-safe steps

Restarts · flushes · scaling

Monitor path[monitor]

Check resolution

Poll alert status after each step

Decision

Alert resolved?

Human-in-the-loop[escalate]

Page on-call

Full transcript + next step

YES

Output / Result[close]

Close incident

Slack summary · no page
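
The page does not say how the "Match runbook" node compares an alert signature against the library. As a rough illustration only, the sketch below assumes each runbook carries a set of signature keywords and picks the runbook with the largest keyword overlap. The `Runbook` class, its `signature` field, and `match_runbook` are hypothetical names, not part of Cosmos, Confluence, or Notion.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Runbook:
    title: str
    # Assumption: keywords that identify the alerts this runbook covers.
    signature: Set[str] = field(default_factory=set)


def match_runbook(alert_text: str, library: List[Runbook]) -> Optional[Runbook]:
    """Return the runbook with the largest keyword overlap, or None if nothing matches."""
    tokens = set(alert_text.lower().split())
    best: Optional[Runbook] = None
    best_score = 0
    for runbook in library:
        score = len(runbook.signature & tokens)
        if score > best_score:
            best, best_score = runbook, score
    return best
```

Returning `None` when no keyword matches gives the caller a clear signal to escalate rather than run an unrelated runbook.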

Workflow prompt

Paste this into Augment to reproduce the workflow end-to-end.

Cosmos, when an alert fires in PagerDuty or Opsgenie, do the following: (1) Enrich the alert with the last 30 minutes of metrics and error logs from the affected service, plus any deployments from the last 2 hours. (2) Match the alert signature against the runbook library (stored in Confluence or Notion) to find the relevant runbook. (3) Execute the runbook steps marked 'auto-safe' (restarts, cache flushes, queue drains, scaling actions) in sequence, capturing the output of each step. (4) After each step, check whether the alert has resolved. If it has, post a summary to the incident Slack channel and close the incident in the originating tool (PagerDuty or Opsgenie) without paging on-call. (5) If all auto-safe steps are exhausted without resolution, or a step requires human judgment, immediately page the on-call engineer with the full execution transcript and the next manual step highlighted.
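
Steps (3) to (5) of the prompt amount to a loop: run a step, check the alert, stop on resolution or escalate. The sketch below is one possible reading of that logic, not Cosmos's implementation. `Step`, `Outcome`, `run_auto_safe_steps`, and the `is_resolved` callback are hypothetical names, and stopping at the first step that is not auto-safe or that raises an error is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    name: str
    auto_safe: bool
    run: Callable[[], str]  # performs the action and returns its captured output


@dataclass
class Outcome:
    resolved: bool
    transcript: List[str]
    next_manual_step: Optional[str]  # step to highlight when paging on-call


def run_auto_safe_steps(steps: List[Step], is_resolved: Callable[[], bool]) -> Outcome:
    transcript: List[str] = []
    for step in steps:
        if not step.auto_safe:
            # Assumption: a step needing human judgment stops automation here.
            return Outcome(False, transcript, step.name)
        try:
            output = step.run()
        except Exception as exc:
            # Assumption: a failed step also escalates, with the failure recorded.
            transcript.append(f"{step.name}: FAILED ({exc})")
            return Outcome(False, transcript, step.name)
        transcript.append(f"{step.name}: {output}")
        if is_resolved():  # poll alert status after each step
            return Outcome(True, transcript, None)
    # All auto-safe steps exhausted without resolution.
    return Outcome(False, transcript, None)
```

A caller would post the transcript to Slack and close the incident when `resolved` is true, and otherwise page on-call with `transcript` and `next_manual_step`.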
