What should an incident postmortem include?

An incident postmortem should cover summary, timeline, impact, root cause, contributing factors, action items, and lessons learned. Those sections make incidents searchable and comparable across the same repository. Each action item needs an owner, a verifiable verb, a measurable outcome, a tracker entry, and a deadline.

How long should it take to write a postmortem?

Postmortem writing takes longer when teams lack centralized evidence for an on-call incident. Teams without automated internal tooling often have to reconstruct timelines by hand from Slack messages, alerts, deployment events, and monitoring snapshots, and that work gets harder as memory of the incident fades. Start reconstruction within 24 hours of resolution, while logs and chat records are still easy to reconcile.

What does blameless mean in a postmortem?

Blameless means the document focuses on contributing causes without indicting any individual or team. A blameless postmortem assumes that everyone acted in good faith with the information available to them. The goal is to fix systems and processes, because guardrails added to production systems support better decisions under pressure.

Are the five whys enough for root cause analysis?

The five whys work well for incidents with a single root cause but fall short for complex failures. Strict linear five-whys thinking oversimplifies how incidents actually develop. Apply the five whys to each contributing factor separately and accept that most incidents have multiple root causes.

How do postmortems connect to incident response runbooks?

Postmortem action items often seed or update an incident response runbook. During post-incident analysis, teams should ask directly whether the findings suggest adding or improving runbook steps. Treating "update the runbook" as an owned, dated action item keeps procedures current and prevents engineers from following stale steps during the next incident.

How does the error budget formula work in a postmortem?

The formula is: Error Budget = (1 − SLO target) × total time period. For a 99.9% SLO over 30 days, the error budget is 43.2 minutes. An incident lasting 32 minutes consumes roughly 74% of that budget. Expressing downtime as a percentage of the error budget consumed helps leadership prioritize remediation based on measurable risk.

How to Write an Incident Postmortem: Template and Process Guide

Teams use an incident postmortem to review an incident's summary, timeline, impact, root cause, contributing factors, action items, and lessons learned. A fixed template ensures every postmortem answers the same questions, keeping incidents searchable within the same repository. Engineers who were not in the room can compare incidents because each document follows the same structure.

The seven searchable postmortem fields are: summary, timeline, impact, root cause, contributing factors, action items, and lessons learned.
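A fixed template is easy to check mechanically. The sketch below is illustrative only: it assumes a postmortem is stored as a plain dictionary, and the key names and the `missing_fields` helper are hypothetical, not part of any particular tool.

```python
# Hypothetical key names for the seven fields listed above.
REQUIRED_FIELDS = (
    "summary",
    "timeline",
    "impact",
    "root_cause",
    "contributing_factors",
    "action_items",
    "lessons_learned",
)


def missing_fields(postmortem: dict) -> list:
    """Return required fields that are absent or empty, in template order."""
    return [name for name in REQUIRED_FIELDS if not postmortem.get(name)]


# Example draft with invented content; root_cause is present but still empty.
draft = {
    "summary": "Checkout API returned errors for 32 minutes.",
    "timeline": ["02:47 alert fired", "02:49 on-call paged"],
    "impact": "32 minutes of downtime",
    "root_cause": "",
}

print(missing_fields(draft))
```

A check like this could run before a document moves from Draft to In Review, so incomplete postmortems never reach the repository.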

TL;DR

A postmortem combines logs, chat, deployment records, and alerts into a single evidence-backed timeline. Action items need five closure fields: owner, verifiable verb, measurable outcome, tracker entry, and deadline. Runbook updates turn outage findings into reusable operational knowledge when each follow-up carries all five fields.

Incident postmortem writing breaks down when outage reviews rely on memory rather than on sourced logs, chat, deploys, and alerts. Evidence-first drafting preserves the timeline before diagnostic context disappears. First-time postmortem authors often write after off-hours incidents, when Slack threads have scrolled past, monitoring timestamps do not align with alerting records, and diagnostic decisions were made on a call nobody recorded.

Engineering teams routinely reconstruct what happened from scattered sources. When an incident spans multiple microservices and several teams, no single person can rebuild the sequence from memory. That reconstruction effort is wasted when the postmortem has no named owner, action items lack deadlines, and root-cause analysis stops at a single shallow symptom rather than tracing contributing factors.

This guide explains how to reconstruct a timeline from real data sources, quantify impact in terms that leadership understands, trace root causes behind symptoms using the five whys, and write action items that actually close. Those steps break down when evidence is scattered across code history, linked issues, PR feedback, and ticketing context, each living in a different tool. Augment Cosmos pulls those sources into one view before the review begins, so reconstruction starts from what actually happened rather than what people remember.

Why Postmortems Matter

Incident postmortems turn a reviewed outage into searchable institutional knowledge through documented timelines, contributing causes, and owned preventive actions. They usually serve three goals: document the incident, understand contributing root causes, and define effective preventive actions.

The learning loop turns outages into knowledge. Without a formal process of learning from incidents, the same failure patterns recur. A postmortem closes that loop by making findings discoverable and turning unresolved findings into concrete action items. Future team members use the postmortem document to understand what happened, even if they did not attend the review. That record gives new engineers the failure sequence, decisions, and follow-up actions behind past production decisions. Teams can use it alongside AI developer onboarding workflows to ensure that incident context is preserved as new hires join.

Google's SRE practice holds that the primary goals of writing a postmortem are to ensure the incident is documented, that all contributing root causes are well understood, and that effective preventive actions are in place to reduce the likelihood or impact of recurrence.

Prerequisites and Setup

Incident postmortem setup starts with evidence collection before the review meeting. Sourced logs, alerts, chat, deploy records, and recordings keep the document grounded in observable events. Create the postmortem artifact before the review meeting starts, not during it. The most important discipline is sourcing: keep unsourced claims out of the timeline and move them to an unverified section at the bottom. Treat the timeline as log data, not memory.

The postmortem owner collects these inputs before drafting:

Alert logs from PagerDuty or Opsgenie for hard timestamp anchors, including when the alert fired and when on-call was paged
Monitoring snapshots from Datadog, Grafana, or Prometheus with the time windows preserved
Slack and chat threads for decisions, hypotheses, and human context
Deployment and CI/CD history covering every deploy, config change, and scaling event in the 30 minutes before impact
Call recordings or transcripts, since half the decisions happen on calls that nobody documents

Those inputs give the postmortem owner enough sourced evidence to draft before the review meeting starts.
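The 30-minute lookback on deploys, config changes, and scaling events can be applied as a simple filter once the change history is exported. This is a minimal sketch under stated assumptions: the record shape (`at`, `kind`, `ref`), the sample events, the date, and the `changes_before_impact` helper are all hypothetical.

```python
from datetime import datetime, timedelta, timezone


def changes_before_impact(changes, impact_start, window_minutes=30):
    """Return changes inside the lookback window before impact, oldest first."""
    window_start = impact_start - timedelta(minutes=window_minutes)
    in_window = [c for c in changes if window_start <= c["at"] <= impact_start]
    return sorted(in_window, key=lambda c: c["at"])


def utc(hour, minute):
    # Hypothetical incident date; always keep timestamps timezone-aware.
    return datetime(2024, 1, 15, hour, minute, tzinfo=timezone.utc)


# Invented change history for illustration.
changes = [
    {"at": utc(2, 41), "kind": "config", "ref": "flag-change"},
    {"at": utc(1, 50), "kind": "deploy", "ref": "build-101"},
    {"at": utc(2, 30), "kind": "deploy", "ref": "build-102"},
    {"at": utc(2, 55), "kind": "scaling", "ref": "scale-up"},
]

for change in changes_before_impact(changes, impact_start=utc(2, 47)):
    print(f"{change['at']:%H:%M} UTC {change['kind']} {change['ref']}")
```

Only the 02:30 deploy and the 02:41 config change fall inside the window, which gives the review a short list of candidate changes to examine first.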

Assign a single owner immediately after resolution, ideally the Incident Commander. Naming that owner within 24 hours reduces ownership ambiguity because one person controls evidence collection, review scheduling, and tracker handoff. Start reconstruction within 24 hours of resolution as well: inside that window, log data and chat records are easier to reconcile, and narrative has not yet replaced evidence.

Step-by-Step Workflow

The incident postmortem workflow turns raw alerts, deploy records, monitoring data, chat evidence, and call notes into a reviewed artifact that preserves incident knowledge. Reconstruct the timeline first, then assess impact, analyze root cause, identify contributing factors, and write action items before assembling the template. Each step produces a section of the final document.

Step 1: Reconstruct the Incident Timeline

Incident timeline reconstruction anchors the postmortem on hard timestamps first. The owner then fills in the gaps with evidence from sources, allowing reviewers to follow the diagnostic sequence. Alert logs from PagerDuty and Opsgenie give fixed points: "alert fired at 02:47," "on-call paged at 02:49." Deployment markers from the CI/CD pipeline give more fixed points.

The reconstruction gap sits between hard anchors. The span from "alert fired at 02:47" to "root cause identified at 03:12" contains diagnostic steps and decisions that rarely appear in alert logs. The postmortem owner reconstructs the span from Slack scroll-back, call recordings, and monitoring snapshots.

Capture timestamped hypotheses along with final facts. A responder hypothesis, such as "the issue correlates with the 2:30 AM deployment," belongs in the timeline even if it turns out to be wrong. Three weeks later, pattern review may show that the team initially suspected deployments while the real culprit was a configuration change. Use exact UTC timestamps, abstract lengthy transcripts into summaries, and link the unedited versions.
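The merge described in this step can be sketched as a small data structure: every entry carries a UTC timestamp, a pointer to its evidence, and a flag for hypotheses, and anything without a source is split out as unverified. The `Entry` class, the helper functions, the source labels, the date, and the unsourced sample entry are hypothetical illustrations, not a prescribed format.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Entry:
    at: datetime               # timezone-aware timestamp
    text: str                  # short summary; link the unedited transcript
    source: Optional[str]      # evidence reference, or None if from memory
    hypothesis: bool = False   # True for responder guesses, even wrong ones


def build_timeline(entries):
    """Split entries into (sourced, unverified), each sorted by time."""
    sourced = sorted((e for e in entries if e.source), key=lambda e: e.at)
    unverified = sorted((e for e in entries if not e.source), key=lambda e: e.at)
    return sourced, unverified


def render(entry):
    tag = " (hypothesis)" if entry.hypothesis else ""
    return f"{entry.at.astimezone(timezone.utc):%H:%M} UTC {entry.text}{tag}"


def utc(hour, minute):
    return datetime(2024, 1, 15, hour, minute, tzinfo=timezone.utc)  # hypothetical date


entries = [
    Entry(utc(3, 12), "Root cause identified", "call-recording"),
    Entry(utc(2, 47), "Alert fired", "alert-log"),
    Entry(utc(2, 55), "Issue correlates with the 2:30 AM deployment", "chat-thread", hypothesis=True),
    Entry(utc(3, 0), "A worker was restarted", None),
    Entry(utc(2, 49), "On-call paged", "alert-log"),
]

sourced, unverified = build_timeline(entries)
for entry in sourced:
    print(render(entry))
print("Unverified:", [render(e) for e in unverified])
```

Keeping the hypothesis flag on the entry, rather than deleting wrong guesses, is what lets a later pattern review see what the team suspected first.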

Step 2: Assess and Quantify Impact

Incident impact assessment quantifies duration, affected users, SLO burn, and business impact, enabling engineering and leadership to prioritize remediation based on measurable harm. State duration in minutes and, when the team tracks resolution metrics, record it as the time to resolution that feeds Mean Time to Resolution (MTTR). Express affected users as a percentage: "This SEV1 incident affected X% of customers." Report SLO or error budget burn using the formula: Error Budget = (1 − SLO target) × total time period.

Record impact in the same format each time:

Duration: state the incident length in minutes
Affected users: express the affected population as a percentage
SLO or error budget burn: calculate the budget consumed during the incident
Revenue or business impact: add business effect where it applies

With a 99.9% SLO over 30 days, the error budget is 0.001 × 43,200 minutes = 43.2 minutes, so 32 minutes of downtime burns roughly 74% of it.
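The error budget arithmetic is easy to get wrong by hand when the SLO or window changes, so a small helper can keep the numbers consistent across postmortems. This is a minimal sketch; the function names are illustrative, and it assumes the SLO is time-based and expressed as a fraction.

```python
def error_budget_minutes(slo_target: float, period_days: int) -> float:
    """Error Budget = (1 - SLO target) x total time period, in minutes."""
    return (1 - slo_target) * period_days * 24 * 60


def budget_consumed_pct(downtime_minutes: float, slo_target: float, period_days: int) -> float:
    """Share of the period's error budget consumed by one incident."""
    return 100 * downtime_minutes / error_budget_minutes(slo_target, period_days)


budget = error_budget_minutes(0.999, 30)
consumed = budget_consumed_pct(32, 0.999, 30)
print(f"Error budget: {budget:.1f} minutes")   # Error budget: 43.2 minutes
print(f"Budget consumed: {consumed:.0f}%")     # Budget consumed: 74%
```

The same two functions work for other targets, for example a 99.95% SLO or a 7-day window, as long as the downtime and the period use the same unit.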

Severity framing sets remediation priority when impact is expressed as percentages, minutes, or error-budget burn. A single recurrence that burns 40% of a monthly error budget may warrant more urgent architectural remediation than three low-impact recurrences that together consume 10%. Use exact timestamps in this section too. "11:14 AM UTC" lets reviewers measure the interval between when the impact started and when the team notified customers, which "around 11" never could. Incident impact data pairs directly with code quality metrics when teams want to connect outage frequency to measurable engineering follow-through.

Step 3: Analyze Root Cause Without Stopping at Symptoms

Root-cause analysis distinguishes symptoms from root cause by tracing causal chains past the first visible failure. That tracing lets remediation prevent a class of incidents rather than one observed error. Shallow root-cause analysis limits action items to impermanent fixes or surface patches. A symptom is "the database was locked." The root cause sits at the point in the causal chain where a change prevents the entire class of incidents.

The five whys is a common pattern for getting there. Tool output from automated analysis does not replace a written causal chain; a reviewed explanation of why the failure became possible is what makes remediation defensible and repeatable.

Use the five whys as a written sequence: describe the problem in writing, ask why it happened, record the answer, and repeat until the team agrees they have found the root cause. That sequence keeps the analysis grounded in the causal chain instead of stopping at the first visible symptom.

A worked example shows the depth:

The application had an outage because the database was locked → The database locked because there were too many writes → The service change created elevated writes the team did not expect → The development process did not include load testing for this type of change → Load testing had not been treated as necessary before the system reached this level of scale.

Treat the five whys as a starting point. Strict linear five-whys thinking can force a simplistic explanation onto failures that actually depend on multiple interacting conditions. Real incidents usually have multiple contributing conditions, each necessary but only jointly sufficient. Apply the five whys to each contributing factor separately, and resist the urge to crown a single root cause when the evidence points to several.

Step 4: Identify Contributing Factors

Contributing-factor analysis separates the root cause from the systemic, process, and tooling gaps that allowed it to persist or worsen. This section explains why the incident became possible, or worse than it had to be, without attributing blame to individuals. List two to five contributing factors.

Common contributing-factor categories include process gaps, tool limitations, missing alerts, and documentation failures. These categories let the postmortem describe fixable conditions without turning the review into individual blame.

Blameless wording shapes whether this section produces fixable changes. Instead of "the team missed a warning sign," write "runbooks did not document warning signs, making them easy to miss under pressure." The first sentence indicts people; the second produces a system change. A blameless review asks "what about the system or operating process allowed this mistake to happen?" rather than "who made an error?"

Cosmos's Context Engine processes entire codebases across 400,000+ files through semantic dependency graph analysis, including call graphs and cross-repo dependencies that commit messages rarely capture. Teams investigating which change introduced a fault can use that analysis to examine code locations, dependencies, and usage sites before choosing a remediation path, all in the same view, alongside code change history, linked incident tickets, and runbook context.

Step 5: Write Action Items That Actually Get Resolved

Incident postmortem action items close the learning loop by including five fields: owner, verifiable verb, measurable outcome, tracker entry, and deadline. Without those fields, reviewers cannot verify closure. The verb matters. Use add, remove, change, update, test, or deploy, not "review," "explore," or "investigate."

Compare weak action-item wording against wording that produces a verifiable end state:

Remediation area	Poorly worded	Better	Verifiable verb	Measurable outcome
Monitoring	Investigate monitoring for this scenario.	Add alerting for all cases where this service returns more than 1% errors.	Add	An alert fires when the service's error rate exceeds 1%.
Input handling	Fix the issue that caused the outage.	Safely handle invalid postal codes in the user address form input.	Handle	The form handles invalid postal code input safely.
Schema changes	Make sure the engineer checks the schema before updating.	Add an automated presubmit check for schema changes.	Add	Presubmit check covers schema changes.

Assign every item to a named person, never a team, to avoid ownership ambiguity during tracker handoff. Separate mitigative actions, which fix the immediate gap, from preventative actions, which prevent the class of problem. Put each item in Linear, Jira, or Asana with a deadline, and set 30-, 60-, and 90-day follow-ups.
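The five closure fields can be checked before an item enters the tracker. The sketch below is one possible shape under stated assumptions: the `ActionItem` class, the `closure_problems` helper, the owner name, and the due date are hypothetical, and the check only catches the three vague verbs named above when they lead the sentence.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Verbs the guide names as unverifiable.
VAGUE_VERBS = {"review", "explore", "investigate"}


@dataclass
class ActionItem:
    action: str            # verb-led description
    owner: str             # a named person, never a team
    outcome: str           # measurable end state
    tracker_id: str        # ticket reference in the team's tracker
    due: Optional[date]    # deadline


def closure_problems(item: ActionItem) -> list:
    """Return reasons a reviewer could not verify this item as closed."""
    problems = []
    words = item.action.split()
    if not words or words[0].lower() in VAGUE_VERBS:
        problems.append("action does not start with a verifiable verb")
    if not item.owner.strip():
        problems.append("no named owner")
    if not item.outcome.strip():
        problems.append("no measurable outcome")
    if not item.tracker_id.strip():
        problems.append("no tracker entry")
    if item.due is None:
        problems.append("no deadline")
    return problems


weak = ActionItem("Investigate monitoring for this scenario.", "", "", "", None)
better = ActionItem(
    "Add alerting for all cases where this service returns more than 1% errors.",
    owner="A. Engineer",            # hypothetical name
    outcome="An alert fires when the error rate exceeds 1%.",
    tracker_id="JIRA-123",
    due=date(2024, 2, 1),           # hypothetical deadline
)

print(closure_problems(weak))
print(closure_problems(better))
```

A check like this cannot tell a person from a team name, so the "named person" rule still needs a human reviewer.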

When remediation depends on deployment, test, or review automation, keep the tracker handoff visible so the action item owner can confirm the fix went in. A review should ask whether the team should add or improve runbook steps. An item like "Update certificate runbook with current infrastructure details," with an owner and a due date, bridges the postmortem and the runbook update. Some teams make this explicit in their completion criteria: the postmortem is not done until runbook updates are merged and reviewed.

Step 6: Copyable Postmortem Template

Use a fixed incident postmortem template to standardize the seven fields from this guide. The template below follows common incident-review conventions. Copy it directly into the incident document.

Post-mortem: [Incident Title]

Date: YYYY-MM-DD | Severity: P1 / P2 / P3 | Owner: [Name] | Status: Draft / In Review / Published

Summary

[2-3 sentences: what broke, when, for how long, what was affected]

Impact

Impact field	Value
Users affected	[number or %]
Duration	[X minutes]
SLO budget consumed	[X%]
Revenue/business impact	[if applicable]

Timeline (UTC)

Time	Event
HH:MM	First alert fired in [tool]
HH:MM	On-call paged
HH:MM	Incident channel created, [name] assigned as IC
HH:MM	[Key diagnostic action/hypothesis]
HH:MM	Root cause identified: [brief description]
HH:MM	Fix deployed / mitigation applied
HH:MM	Incident resolved

Root Cause

[Five whys chain ending at the systemic cause]

Contributing Factors

#	Factor
1	[Process/tool/system gap, framed blamelessly]
2	[Process/tool/system gap, framed blamelessly]

Action Items

Action (verb-led)	Owner	Type	Tracker #	Due
Add alerting for >1% error rate	[Name]	Preventative	JIRA-123	Sprint 14

Lessons Learned

Reflection area	Note
What went well	[behavior or practice that helped]
What went poorly	[gap to address]

Once drafted, the incident postmortem needs to be reviewed. At least one reviewer should validate the timeline, root cause, and action items before the document enters the learning loop. Tag it with the relevant service names so the failure mode stays searchable, then add it to the team or organization repository.

From Reviewed Document to Real Change

Writing the postmortem produces the reviewed record. Closing action items produces the system, process, or runbook change that the next on-call engineer inherits. The postmortem owner should choose the highest-priority action item from the most recent incident, move it into the real task tracker, and then check whether that item updates an incident response runbook. A stale runbook followed under pressure makes the next incident worse.

Cosmos connects code, tickets, documentation, and external sources through linked services via MCP, pulling those sources into the same view teams use to scope the fix. That view covers the tracker entry, the runbook update, and the code paths named in the postmortem, without switching between tools. The same organizational memory layer that speeds evidence collection before the review meeting also accelerates remediation routing after it.

Written by

Ani Galstian
