Incident Investigator
Incident Management
Modern observability still makes engineers reconstruct context under pressure.
With Cosmos, agents drive the repetitive investigation work of incident management and pull humans in primarily for judgment, prioritization, and remediation decisions.
Condition “Completions API, HTTP 5xx rate” was over the threshold for the last 5m
💨 Urgency: Low
🌐 Service: Completions API
🔍 investigating
Incident Investigator · View report
✅
Incident Summary:
Completions API streaming began returning HTTP 500s at 9:42, ~2 min after deploy b4e2f7. Error rate rose from ~0.1% to 4.2%.
Likely Cause:
Deploy b4e2f7 cut the upstream gRPC deadline 30s → 5s; slow responses now exceed it.
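To see why a shorter deadline alone can raise the error rate, here is a minimal sketch. It assumes a call fails whenever its upstream latency exceeds the deadline. The latency values and the `deadline_impact` helper are hypothetical, made up for illustration, and are not taken from the incident or from Cosmos.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeadlineImpact:
    deadline_s: float
    failed: int
    total: int

    @property
    def error_rate(self) -> float:
        # Fraction of calls that exceeded the deadline.
        return self.failed / self.total if self.total else 0.0


def deadline_impact(latencies_s, deadline_s):
    """Count calls whose upstream latency exceeds the deadline."""
    failed = sum(1 for latency in latencies_s if latency > deadline_s)
    return DeadlineImpact(deadline_s, failed, len(latencies_s))


# Hypothetical upstream latencies in seconds, invented for illustration only.
SAMPLE_LATENCIES_S = [0.4, 0.9, 1.2, 2.5, 4.8, 5.5, 7.0, 12.0, 3.1, 0.7]

before = deadline_impact(SAMPLE_LATENCIES_S, deadline_s=30.0)
after = deadline_impact(SAMPLE_LATENCIES_S, deadline_s=5.0)

print(f"30s deadline: {before.failed}/{before.total} calls time out")
print(f"5s deadline:  {after.failed}/{after.total} calls time out")
```

The same traffic that was healthy under a 30s deadline starts failing under a 5s one, with no change in upstream behavior. That is the pattern the Likely Cause line describes.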
Meet Cosmos
The Incident Investigator triages every alert, correlates the evidence, and posts a structured RCA in Slack before a human has even looked. You stay in the loop only for the judgment calls.
Measured on Augment’s own on-call
Cosmos orchestrates the Incident Investigator across triage, evidence-gathering, root-cause analysis, and remediation, handing code fixes to a PR Author that works with the review experts. Each loop is a Cosmos Expert, and they all share memory.
Reacts to the PagerDuty alert in-thread, then classifies urgency, scope, and the affected service.
Pulls logs, metrics, recent deploys, GitHub history, ownership, and prior incidents on the channel.
Forms and validates hypotheses, then posts a structured RCA: summary, cause, evidence, owners.
Recommends a code fix, rollback, escalation, or monitoring action, then hands code fixes to the PR Author.
The on-call engineer scans the RCA and makes the call, in under a minute for the average alert.
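The steps above can be sketched as a small data model. This is an illustration only: the `RCA`, `Action`, `Recommendation`, and `next_handler` names are hypothetical and do not describe Cosmos's actual API. The four RCA fields and the four action types come from the step descriptions. The routing rule, where anything other than a code fix waits on a human, is an assumption.

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    # The four recommendation types named in the remediation step.
    CODE_FIX = "code_fix"
    ROLLBACK = "rollback"
    ESCALATE = "escalate"
    MONITOR = "monitor"


@dataclass
class RCA:
    # Structured RCA fields: summary, cause, evidence, owners.
    summary: str
    cause: str
    evidence: list[str] = field(default_factory=list)
    owners: list[str] = field(default_factory=list)


@dataclass
class Recommendation:
    action: Action
    rca: RCA


def next_handler(rec: Recommendation) -> str:
    """Return who picks up the work next (hypothetical routing rule)."""
    if rec.action is Action.CODE_FIX:
        return "pr-author"
    # Assumption: every other action waits for the on-call engineer's decision.
    return "on-call-engineer"


rca = RCA(
    summary="Completions API streaming began returning HTTP 500s at 9:42.",
    cause="Deploy b4e2f7 cut the upstream gRPC deadline 30s -> 5s.",
    evidence=["Error rate rose from ~0.1% to 4.2%"],
    owners=[],  # owners are not given in the sample alert, so left empty
)
rec = Recommendation(action=Action.CODE_FIX, rca=rca)
print(next_handler(rec))
```

Keeping the RCA as a plain record with fixed fields is what lets a human scan it quickly: the same four sections appear in the same order for every alert.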
Incident Memory
Captures tribal knowledge · Distills runbooks · Shared with Code Review
Where humans stay in the loop
The bottleneck was never judgment. It was the repetitive investigation before anyone could make a call. Cosmos drives that work and pulls you in only for the decisions that touch production.
We deployed the Incident Investigator across five on-call channels and compared the month before and after. Human investigation dropped, RCAs landed faster, and on-call engineers got their focus back.
Talk to Cosmos Advisor to tailor the Incident Investigator: runbooks, escalation rules, scope boundaries, and the queries it runs. Swap in your own observability stack while keeping the same operational workflow.
Slack and PagerDuty. GCP Cloud Logging and Managed Prometheus. GitHub. SAML, OIDC, SCIM. Audit logs and SIEM, with self-hosted deployment available.