Incident Management

When agents write and review the code, on-call becomes the bottleneck.

Modern observability still makes engineers reconstruct context under pressure.

With Cosmos, agents drive the repetitive investigation work in incident management. Tag Cosmos in Slack or trigger it from an alert, and it pulls humans in primarily for judgment, prioritization, and remediation decisions.

Meet Cosmos

An expert on every alert.

The Incident Investigator triages every alert or Slack tag, correlates Datadog metrics, Sentry errors, GitLab changes, and code context, then posts a structured RCA before a human has even looked. You stay in the loop only for the judgment calls.

Measured on Augment’s own on-call

81%

Less human on-call investigation

6.2 min

Median time to first RCA

44%

More PRs merged by on-call engineers

19.9 min

Median time to resolution

A fleet of experts takes ownership of incident response.

Cosmos orchestrates the Incident Investigator across triage, evidence gathering, root-cause analysis, and remediation, handing code fixes to a PR Author that works with the review experts. Each loop is a Cosmos Expert, and they all share memory.

Incident lifecycle: one alert, end to end

Triage

Incident Investigator

Reacts to a Slack tag, PagerDuty alert, or Datadog alert, then classifies urgency, scope, and the affected service.
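As a rough illustration of the triage step, the classification could be sketched as below. This is a minimal sketch, not how the Incident Investigator is implemented: the `Alert` and `Triage` shapes, the keyword list, and the rules are all hypothetical.

```python
from dataclasses import dataclass


# Hypothetical alert shape; field names are illustrative, not a Cosmos schema.
@dataclass
class Alert:
    source: str          # "slack", "pagerduty", or "datadog"
    service: str         # service named in the alert
    message: str         # alert or Slack message text
    affected_hosts: int  # number of hosts reporting the problem


@dataclass
class Triage:
    urgency: str  # "high" or "low"
    scope: str    # "single-host" or "multi-host"
    service: str


def triage(alert: Alert) -> Triage:
    """Classify urgency, scope, and affected service with toy rules."""
    text = alert.message.lower()
    # Assumption: pages and outage-like keywords are treated as urgent.
    urgent = alert.source == "pagerduty" or any(
        keyword in text for keyword in ("outage", "down", "5xx")
    )
    scope = "multi-host" if alert.affected_hosts > 1 else "single-host"
    return Triage("high" if urgent else "low", scope, alert.service)
```

A real classifier would draw on service ownership and prior incidents rather than keywords; the sketch only shows the three outputs named above.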

Investigate

Evidence Gathering

Pulls Datadog logs and metrics, Sentry issues, recent GitLab or GitHub changes, ownership, and prior incidents on the channel.

Root cause

Root-Cause Analysis

Forms and validates hypotheses, then posts a structured RCA: summary, cause, evidence, owners.
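The four RCA sections named above (summary, cause, evidence, owners) map naturally onto a small record type. The sketch below is one possible shape, assuming plain-text output; the `RCA` class and its `render` method are hypothetical and not a Cosmos API.

```python
from dataclasses import dataclass, field


@dataclass
class RCA:
    """Hypothetical structured RCA with the four sections named in the text."""
    summary: str
    cause: str
    evidence: list[str] = field(default_factory=list)
    owners: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Format the RCA as plain text suitable for posting to a channel."""
        lines = [f"Summary: {self.summary}", f"Cause: {self.cause}", "Evidence:"]
        lines += [f"- {item}" for item in self.evidence]
        lines.append("Owners: " + ", ".join(self.owners))
        return "\n".join(lines)
```

Keeping the sections as separate fields means a human can scan the same layout on every incident.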

Remediate

Remediation

Writes the fix and opens the PR when code is the answer. Recommends rollback, escalation, or a monitor action when it is not.
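The choice between a fix PR and the three non-code recommendations can be pictured as a small decision function. This is a sketch under stated assumptions: the input flags and their priority order are hypothetical, chosen only to illustrate the branching described above.

```python
from enum import Enum


class Action(Enum):
    OPEN_PR = "open_pr"
    ROLLBACK = "rollback"
    MONITOR_ACTION = "monitor_action"
    ESCALATE = "escalate"


def recommend(cause_in_code: bool, recent_deploy: bool, noisy_monitor: bool) -> Action:
    """Pick a remediation path; the flags and their ordering are assumptions."""
    if cause_in_code:
        return Action.OPEN_PR         # code is the answer: write the fix
    if recent_deploy:
        return Action.ROLLBACK        # a recent change is the likely trigger
    if noisy_monitor:
        return Action.MONITOR_ACTION  # the alert itself needs tuning
    return Action.ESCALATE            # no safe automated path: hand to a human
```

Whatever the function returns is a recommendation only; approving it remains a human decision.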

Human

Scans the RCA and makes the call in under a minute for the average alert.

Outcome: RCA posted · fix PR opened

Incident Memory

Captures tribal knowledge · Distills runbooks · Shared with Code Review

Fig 1 · Incident response fleet

Where humans stay in the loop

Agents investigate. Humans decide.

The bottleneck was never judgment. It was the repetitive investigation before anyone could make a call. Cosmos drives that work and pulls you in only for the decisions that touch production.

How Cosmos automated our on-call

Driven by agents

The repetitive investigation

Correlating Datadog, Sentry, deploys, logs, and metrics
Searching Slack threads and dashboards
Finding ownership and prior incidents
Drafting the RCA and next steps
Writing the fix and opening the PR or MR

Owned by humans

The judgment calls

Prioritization under pressure
Approving the remediation
Production-impacting decisions

Watch it work

From a Datadog alert to a fix PR.

A live run of the incident automation: a memory leak trips a Datadog monitor, and the webhook spins up an incident expert in Cosmos. The expert investigates through the Datadog MCP server, finds the cause, and opens the fix PR in GitHub for review.
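The first hop in that run, from monitor webhook to investigation, might look like the sketch below. Datadog webhook payloads are defined by a user-written template, so the field names here (`alert_transition`, `title`, `service`) are assumptions, and `parse_alert` is a hypothetical helper rather than part of Cosmos or Datadog.

```python
import json
from typing import Optional


def parse_alert(body: bytes) -> Optional[dict]:
    """Return investigation inputs for a newly triggered alert, else None."""
    try:
        payload = json.loads(body)
    except ValueError:  # covers malformed JSON and undecodable bytes
        return None
    if not isinstance(payload, dict):
        return None
    # Assumption: the payload template forwards the monitor transition.
    if payload.get("alert_transition") != "Triggered":
        return None
    return {
        "monitor": payload.get("title", ""),
        "service": payload.get("service", "unknown"),
    }
```

Recoveries and malformed payloads return `None`, so only new alerts would start an investigation.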

Measured on our own on-call.

We deployed the Incident Investigator across five on-call channels and compared the month before and after. Human investigation dropped, RCAs landed faster, and on-call engineers got their focus back.

Incident ownership [fig. 01 / ownership]

Share of incidents handled by agents: manual vs. agent-led, before and after

Before: 0.4% handled by agents; 99.6% handled manually.
After: 81.3% handled by agents; 18.7% still needed manual handling.

Time to resolve [fig. 02 / speed]

Median time to resolve: time to first RCA and MTTR, in minutes

First RCA: 6.2 min, down from 30.1 min before Cosmos.
MTTR: 19.9 min, down from 29.5 min before Cosmos.
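For readers who want the relative change, the before and after medians above convert to percentage reductions as follows. The helper name is ours; the inputs are the figures quoted in this section.

```python
def pct_reduction(before: float, after: float) -> float:
    """Percentage drop from `before` to `after`, rounded to one decimal."""
    return round((before - after) / before * 100, 1)


first_rca_drop = pct_reduction(30.1, 6.2)   # 79.4 (% faster to first RCA)
mttr_drop = pct_reduction(29.5, 19.9)       # 32.5 (% faster to resolution)
```

On these numbers, time to first RCA fell by about 79% and MTTR by about 33%.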

Human effort [fig. 03 / effort]

Human time per incident: active triage time, before and after

Before: 30 min of active human triage per incident.
After: <1 min of active human triage per incident.

Slack · PagerDuty · Datadog · Sentry · GitLab · GCP Logging · Prometheus · GitLab CI

Highly customizable to your stack.

Talk to Cosmos Advisor to tailor the Incident Investigator: runbooks, escalation rules, scope boundaries, and the queries it runs. Swap in your own observability, issue, and source-control stack while keeping the same operational workflow.

Works with GitHub and GitLab, with audit logs, SIEM, and self-hosted deployment available.

Less time on-call. More time shipping.