Runbook auto-execution on incidents
Match incidents to runbooks, execute safe remediation steps automatically, and page a human only when the runbook fails or requires judgment.
[ workflow / observability ]
Cosmos enriches PagerDuty or Opsgenie alerts with live telemetry, matches the alert signature to the right runbook, and executes the first safe remediation steps automatically. If the alert clears, on-call is notified without a page. If remediation fails or a step needs judgment, the on-call engineer is paged with the full transcript.
Workflow diagram (8 nodes, 6 edges): PagerDuty / Opsgenie alert; metrics, logs, and recent deploys; Confluence / Notion runbook library; restarts, flushes, and scaling; poll alert status after each step; decision: alert resolved? If yes, Slack summary with no page; if no, full transcript plus next step.
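As a rough illustration of the matching stage, the sketch below picks a runbook by testing a pattern against the alert's service and title. The `Runbook` type, its `signature` field, and the `match_runbook` helper are hypothetical names invented for this example; how Cosmos actually represents or matches signatures is not described here.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Runbook:
    title: str
    # Assumption: a signature is a regex tested against "service:alert title".
    signature: str


def match_runbook(service: str, alert_title: str,
                  library: List[Runbook]) -> Optional[Runbook]:
    """Return the first runbook whose signature matches the alert, else None."""
    key = f"{service}:{alert_title}"
    for runbook in library:
        if re.search(runbook.signature, key, flags=re.IGNORECASE):
            return runbook
    return None


# Hypothetical library entries, standing in for pages in Confluence or Notion.
library = [
    Runbook("Cache saturation", r"^checkout:.*cache"),
    Runbook("Queue backlog", r"queue depth"),
]
```

Returning `None` when nothing matches leaves the caller free to page a human straight away rather than guess at a runbook.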
Workflow prompt
Paste this into Augment to reproduce the workflow end-to-end.
Cosmos, when an alert fires in PagerDuty or Opsgenie, perform the following:
(1) Enrich the alert with the last 30 minutes of metrics and error logs from the affected service, plus any deployments in the last 2 hours.
(2) Match the alert signature against the runbook library (stored in Confluence/Notion) to find the relevant runbook.
(3) Execute the runbook steps marked 'auto-safe' (restarts, cache flushes, queue drains, scaling actions) in sequence, capturing output at each step.
(4) After each step, check whether the alert has resolved. If it has, stop executing further steps, post a summary to the incident Slack channel, and close the incident in PagerDuty or Opsgenie without paging on-call.
(5) If all auto-safe steps are exhausted without resolution, or a step fails or requires human judgment, immediately page the on-call engineer with a full execution transcript and the next manual step highlighted.
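The control flow this prompt asks for can be sketched as a small loop: run each auto-safe step, record its output, check the alert, and stop as soon as it resolves or a human is needed. Everything named here (`Step`, `Outcome`, `run_auto_safe`, the `is_resolved` callback) is a hypothetical stand-in, not a Cosmos, PagerDuty, or Opsgenie API. The sketch also assumes that execution halts at the first step not marked auto-safe, which is one reading of the prompt.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    name: str
    auto_safe: bool            # True only if the runbook marks the step 'auto-safe'
    run: Callable[[], str]     # performs the action and returns its output


@dataclass
class Outcome:
    resolved: bool
    transcript: List[str] = field(default_factory=list)
    next_manual_step: str = ""  # empty when nothing is left for a human


def run_auto_safe(steps: List[Step], is_resolved: Callable[[], bool]) -> Outcome:
    """Run auto-safe steps in order, checking the alert after each one."""
    transcript: List[str] = []
    for step in steps:
        if not step.auto_safe:
            # Assumption: a step needing judgment stops automation and escalates.
            return Outcome(False, transcript, step.name)
        try:
            output = step.run()
        except Exception as exc:
            # A failed step is escalated with the error recorded in the transcript.
            transcript.append(f"{step.name}: FAILED ({exc})")
            return Outcome(False, transcript, step.name)
        transcript.append(f"{step.name}: {output}")
        if is_resolved():
            return Outcome(True, transcript)  # caller posts the Slack summary
    # Every auto-safe step ran and the alert is still firing.
    return Outcome(False, transcript)
```

A caller would post the transcript to Slack when `resolved` is true, and otherwise page on-call with the transcript and `next_manual_step`.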