
How to Write an Incident Postmortem: Template and Process Guide

Jun 23, 2026
Ani Galstian

Teams use an incident postmortem to review an incident's summary, timeline, impact, root cause, contributing factors, action items, and lessons learned. A fixed template ensures every postmortem answers the same questions, keeping incidents searchable within the same repository. Engineers who were not in the room can compare incidents because each document follows the same structure.

The seven searchable postmortem fields are: summary, timeline, impact, root cause, contributing factors, action items, and lessons learned.

TL;DR

A postmortem combines logs, chat, deployment records, and alerts into a single evidence-backed timeline. Action items need five closure fields: owner, verifiable verb, measurable outcome, tracker entry, and deadline. Runbook updates turn outage findings into reusable operational knowledge when each follow-up carries all five fields.

Incident postmortem writing breaks down when outage reviews rely on memory rather than on sourced logs, chat, deploys, and alerts. Evidence-first drafting preserves the timeline before diagnostic context disappears. First-time postmortem authors often write after off-hours incidents, when Slack threads have scrolled past, monitoring timestamps do not align with alerting records, and diagnostic decisions were made on a call nobody recorded.

Engineering teams routinely reconstruct what happened from scattered sources. When an incident spans multiple microservices and several teams, no single person can rebuild the sequence from memory. That reconstruction effort loses its closure value when the postmortem has no named owner, action items lack deadlines, and root-cause analysis stops at a single shallow symptom rather than tracing contributing factors.

This guide explains how to reconstruct a timeline from real data sources, quantify impact in terms that leadership understands, trace root causes behind symptoms using the five whys, and write action items that actually close. Those steps break down when evidence is scattered across code history, linked issues, PR feedback, and ticketing context, each living in a different tool. Augment Cosmos pulls those sources into one view before the review begins, so reconstruction starts from what actually happened rather than what people remember.

Why Postmortems Matter

Incident postmortems turn a reviewed outage into searchable institutional knowledge through documented timelines, contributing causes, and owned preventive actions. They usually serve three goals: document the incident, understand contributing root causes, and define effective preventive actions.

The learning loop turns outages into knowledge. Without a formal process of learning from incidents, the same failure patterns recur. A postmortem closes that loop by making findings discoverable and turning unresolved findings into concrete action items. Future team members use the postmortem document to understand what happened, even if they did not attend the review. That record gives new engineers the failure sequence, decisions, and follow-up actions behind past production decisions. Teams can use it alongside AI developer onboarding workflows to ensure that incident context is preserved as new hires join.

Google's SRE practice holds that the primary goals of writing a postmortem are to ensure the incident is documented, that all contributing root causes are well understood, and that effective preventive actions are in place to reduce the likelihood or impact of recurrence.

Prerequisites and Setup

Incident postmortem setup starts with evidence collection before the review meeting. Sourced logs, alerts, chat, deploy records, and recordings keep the document grounded in observable events. Create the postmortem artifact before the review meeting starts, not during it. The most important discipline is sourcing: keep unsourced claims out of the timeline and move them to an unverified section at the bottom. Treat the timeline as log data, not memory.

The postmortem owner collects these inputs before drafting:

  • Alert logs from PagerDuty or Opsgenie for hard timestamp anchors, including when the alert fired and when on-call was paged
  • Monitoring snapshots from Datadog, Grafana, or Prometheus with the time windows preserved
  • Slack and chat threads for decisions, hypotheses, and human context
  • Deployment and CI/CD history covering every deploy, config change, and scaling event in the 30 minutes before impact
  • Call recordings or transcripts, since half the decisions happen in calls nobody documents

Those inputs give the postmortem owner enough sourced evidence to draft before the review meeting starts.
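The 30-minute change window from the list above can be sketched as a simple filter. This is a minimal illustration, not a definitive implementation: the record fields, service names, dates, and the `changes_before_impact` helper are all hypothetical, and a real version would read from the team's CI/CD export.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical change records; real ones would come from the CI/CD tool's history.
changes = [
    {"at": datetime(2026, 1, 10, 2, 5, tzinfo=timezone.utc), "kind": "deploy", "what": "checkout-service v41"},
    {"at": datetime(2026, 1, 10, 2, 30, tzinfo=timezone.utc), "kind": "deploy", "what": "checkout-service v42"},
    {"at": datetime(2026, 1, 10, 2, 41, tzinfo=timezone.utc), "kind": "config", "what": "connection pool size changed"},
    {"at": datetime(2026, 1, 10, 2, 52, tzinfo=timezone.utc), "kind": "scaling", "what": "autoscaler added 2 nodes"},
]


def changes_before_impact(records, impact_start, window_minutes=30):
    """Return records inside [impact_start - window, impact_start], oldest first."""
    window_start = impact_start - timedelta(minutes=window_minutes)
    in_window = [r for r in records if window_start <= r["at"] <= impact_start]
    return sorted(in_window, key=lambda r: r["at"])


# Assumption for this sketch: impact began when the first alert fired.
impact_start = datetime(2026, 1, 10, 2, 47, tzinfo=timezone.utc)
candidates = changes_before_impact(changes, impact_start)
for record in candidates:
    print(record["at"].strftime("%H:%M"), record["kind"], record["what"])
```

Keeping every timestamp timezone-aware avoids mixing local and UTC times when records come from different tools.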

Assign a single owner immediately after resolution, ideally the Incident Commander. A single owner reduces ownership ambiguity because one person controls evidence collection, review scheduling, and tracker handoff. Start reconstruction within 24 hours of resolution, while log data and chat records are still easy to reconcile and before narrative replaces evidence.

Step-by-Step Workflow

The incident postmortem workflow turns raw alerts, deploy records, monitoring data, chat evidence, and call notes into a reviewed artifact that preserves incident knowledge. Reconstruct the timeline first, then assess impact, analyze root cause, identify contributing factors, and write action items before assembling the template. Each step produces a section of the final document.

Step 1: Reconstruct the Incident Timeline

Incident timeline reconstruction anchors the postmortem on hard timestamps first. The owner then fills in the gaps with evidence from sources, allowing reviewers to follow the diagnostic sequence. Alert logs from PagerDuty and Opsgenie give fixed points: "alert fired at 02:47," "on-call paged at 02:49." Deployment markers from the CI/CD pipeline give more fixed points.

The reconstruction gap sits between hard anchors. The span from "alert fired at 02:47" to "root cause identified at 03:12" contains diagnostic steps and decisions that rarely appear in alert logs. The postmortem owner reconstructs the span from Slack scroll-back, call recordings, and monitoring snapshots.

Capture timestamped hypotheses along with final facts. A responder hypothesis, such as "the issue correlates with the 2:30 AM deployment," belongs in the timeline even if it turns out to be wrong. Three weeks later, pattern review may show that the team initially suspected deployments while the real culprit was a configuration change. Use exact UTC timestamps, abstract lengthy transcripts into summaries, and link the unedited versions.
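One way to picture this step is a merge of entries from every source into a single UTC-ordered list, with hypotheses labeled and unsourced claims set aside. The sketch below is illustrative only: the `TimelineEntry` type, the source labels, and the sample events are assumptions, not part of any specific tool.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class TimelineEntry:
    at: datetime             # timezone-aware UTC timestamp
    event: str
    source: Optional[str]    # evidence identifier; None means "from memory"
    hypothesis: bool = False


def utc(hour, minute):
    # Hypothetical incident date used only for this example.
    return datetime(2026, 1, 10, hour, minute, tzinfo=timezone.utc)


entries = [
    TimelineEntry(utc(2, 49), "On-call paged", "pager-log"),
    TimelineEntry(utc(2, 47), "Alert fired", "pager-log"),
    TimelineEntry(utc(2, 55), "Issue correlates with the 02:30 deployment", "chat-thread", hypothesis=True),
    TimelineEntry(utc(3, 5), "Someone restarted a worker", None),
    TimelineEntry(utc(3, 12), "Root cause identified", "call-recording"),
]


def build_timeline(items):
    """Sort by timestamp, then split sourced entries from unsourced ones."""
    ordered = sorted(items, key=lambda e: e.at)
    sourced = [e for e in ordered if e.source is not None]
    unverified = [e for e in ordered if e.source is None]
    return sourced, unverified


def render(entry):
    label = "Hypothesis: " if entry.hypothesis else ""
    return f"{entry.at:%H:%M} UTC {label}{entry.event} [{entry.source or 'unsourced'}]"


timeline, unverified = build_timeline(entries)
for entry in timeline:
    print(render(entry))
print("Unverified:")
for entry in unverified:
    print(render(entry))
```

Wrong hypotheses stay in the main timeline because they are sourced; only claims with no evidence behind them move to the unverified section.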

Step 2: Assess and Quantify Impact

Incident impact assessment quantifies duration, affected users, SLO burn, and business impact, enabling engineering and leadership to prioritize remediation based on measurable harm. State duration in minutes and, when the team tracks resolution metrics, record it as the incident's time to resolution so it feeds the team's Mean Time to Resolution (MTTR). Express affected users as a percentage: "This SEV1 incident affected X% of customers." Report SLO or error budget burn using the formula: Error Budget = (1 − SLO target) × total time period.

Record impact in the same format each time:

  • Duration: state the incident length in minutes
  • Affected users: express the affected population as a percentage
  • SLO or error budget burn: calculate the budget consumed during the incident
  • Revenue or business impact: add business effect where it applies

With a 99.9% SLO over 30 days, the error budget is 43.2 minutes, so 32 minutes of downtime burns roughly 74% of it.
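The arithmetic behind that figure can be checked with a few lines of Python. This is a minimal sketch of the formula as stated in this guide; the function names are illustrative.

```python
def error_budget_minutes(slo_target, period_days):
    """Error Budget = (1 - SLO target) x total time period, in minutes."""
    total_minutes = period_days * 24 * 60
    return (1 - slo_target) * total_minutes


def budget_burn_percent(downtime_minutes, slo_target, period_days):
    """Share of the period's error budget consumed by the downtime."""
    return 100 * downtime_minutes / error_budget_minutes(slo_target, period_days)


budget = error_budget_minutes(0.999, 30)
burn = budget_burn_percent(32, 0.999, 30)
print(f"Error budget: {budget:.1f} minutes")  # Error budget: 43.2 minutes
print(f"Budget burned: {burn:.0f}%")          # Budget burned: 74%
```

The sketch treats the whole incident as full downtime. A partial outage would need the downtime weighted by the share of failed requests, which depends on how the team defines its SLO.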

Severity framing sets remediation priority when impact is expressed as percentages, minutes, or error-budget burn. A single incident that burns 40% of a monthly error budget may warrant more urgent architectural remediation than three low-impact incidents that together consume 10%. Use exact timestamps in this section too. "11:14 AM UTC" lets reviewers measure the interval between when the impact started and when the team notified customers, which "around 11" never could. Incident impact data pairs directly with code quality metrics when teams want to connect outage frequency to measurable engineering follow-through.

Step 3: Analyze Root Cause Without Stopping at Symptoms

Root-cause analysis distinguishes symptoms from root cause by tracing causal chains past the first visible failure. That tracing lets remediation prevent a class of incidents rather than one observed error. Shallow root-cause analysis limits action items to temporary fixes or surface patches. A symptom is "the database was locked." The root cause sits at the point in the causal chain where a change prevents the entire class of incidents.

The five whys is a common pattern for getting there. Tool output from automated analysis does not replace a written causal chain; a reviewed explanation of why the failure became possible is what makes remediation defensible and repeatable.

Use the five whys as a written sequence: describe the problem in writing, ask why it happened, record the answer, and repeat until the team agrees they have found the root cause. That sequence keeps the analysis grounded in the causal chain instead of stopping at the first visible symptom.

A worked example shows the depth:

The application had an outage because the database was locked → The database locked because there were too many writes → The service change created elevated writes the team did not expect → The development process did not include load testing for this type of change → Load testing had not been treated as necessary before the system reached this level of scale.

Treat the five whys as a starting point. Strict linear five-whys thinking can force a simplistic explanation onto failures that actually depend on multiple interacting conditions. Real incidents usually have multiple contributing conditions, each necessary but only jointly sufficient. Apply the five whys to each contributing factor separately, and resist the urge to crown a single root cause when the evidence points to several.

Step 4: Identify Contributing Factors

Contributing-factor analysis separates the root cause from the systemic, process, and tooling gaps that allowed it to persist or worsen. This section explains why the incident became possible or worse without attributing blame to individuals. List two to five contributing factors.

Common contributing-factor categories include process gaps, tool limitations, missing alerts, and documentation failures. These categories let the postmortem describe fixable conditions without turning the review into individual blame.

Blameless wording shapes whether this section produces fixable changes. Instead of "the team missed a warning sign," write "runbooks did not document warning signs, making them easy to miss under pressure." The first sentence blames people; the second points to a system change. A blameless review asks "what about the system or operating process allowed this mistake to happen?" rather than "who made an error?"

Cosmos's Context Engine processes entire codebases across 400,000+ files through semantic dependency graph analysis, including call graphs and cross-repo dependencies that commit messages rarely capture. Teams investigating which change introduced a fault can use that analysis to examine code locations, dependencies, and usage sites before choosing a remediation path, all in the same view, alongside code change history, linked incident tickets, and runbook context.

Step 5: Write Action Items That Actually Get Resolved

Incident postmortem action items close the learning loop by including five fields: owner, verifiable verb, measurable outcome, tracker entry, and deadline. Without those fields, reviewers cannot verify closure. The verb matters. Use add, remove, change, update, test, or deploy, not "review," "explore," or "investigate."
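A closure check along these lines can be automated before an item enters the tracker. The sketch below is one possible shape, not a prescribed tool: the `ActionItem` type, the `closure_problems` helper, the owner name, and the due date are hypothetical, and the verb list is taken from the sentence above, so a team would extend it to match its own conventions.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Verbs taken from this guide; extend the set to fit local conventions.
VERIFIABLE_VERBS = {"add", "remove", "change", "update", "test", "deploy"}


@dataclass
class ActionItem:
    action: str                 # verb-led description
    owner: Optional[str]        # a named person, not a team
    outcome: Optional[str]      # measurable end state
    tracker_id: Optional[str]   # tracker entry
    due: Optional[date]         # deadline


def closure_problems(item):
    """Return a list of reasons the item cannot be verified as closed."""
    problems = []
    words = item.action.split()
    verb = words[0].lower() if words else ""
    if verb not in VERIFIABLE_VERBS:
        problems.append(f"verb '{verb}' is not verifiable")
    for field_name in ("owner", "outcome", "tracker_id", "due"):
        if not getattr(item, field_name):
            problems.append(f"missing {field_name}")
    return problems


good = ActionItem(
    action="Add alerting for all cases where this service returns more than 1% errors",
    owner="A. Engineer",  # hypothetical owner
    outcome="Alert fires when the error rate exceeds 1%",
    tracker_id="JIRA-123",
    due=date(2026, 2, 1),  # hypothetical deadline
)
weak = ActionItem("Investigate monitoring for this scenario", None, None, None, None)

print(closure_problems(good))
print(closure_problems(weak))
```

A check like this only confirms that the fields exist. Whether the outcome is truly measurable still needs a reviewer's judgment.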

Compare weak action-item wording against wording that produces a verifiable end state:

Action item | Poorly Worded | Better | Verifiable Verb | Measurable Outcome
Monitoring remediation | Investigate monitoring for this scenario. | Add alerting for all cases where this service returns more than 1% errors. | Add | Alerting fires whenever the service's error rate exceeds 1%.
Input handling remediation | Fix the issue that caused the outage. | Safely handle invalid postal codes in the user address form input. | Handle | The form handles invalid postal code input safely.
Schema-change remediation | Make sure the engineer checks the schema before updating. | Add an automated presubmit check for schema changes. | Add | Presubmit check covers schema changes.

Assign every item to a named person, never a team, to avoid ownership ambiguity during tracker handoff. Separate mitigative actions, which fix the immediate gap, from preventative actions, which prevent the class of problem. Put each item in Linear, Jira, or Asana with a deadline, and set 30-, 60-, and 90-day follow-ups.

When remediation depends on deployment, test, or review automation, keep the tracker handoff visible so the action item owner can confirm the fix went in. A review should ask whether the team should add or improve runbook steps. An item like "Update certificate runbook with current infrastructure details," with an owner and a due date, bridges the postmortem and the runbook update. Some teams make this explicit in their completion criteria: the postmortem is not done until runbook updates are merged and reviewed.

Step 6: Copyable Postmortem Template

Use a fixed incident postmortem template to standardize the seven fields from this guide. The template below follows common incident-review conventions. Copy it directly into the incident document.

Postmortem: [Incident Title]

Date: YYYY-MM-DD | Severity: P1 / P2 / P3 | Owner: [Name] | Status: Draft / In Review / Published

Summary

[2-3 sentences: what broke, when, for how long, what was affected]

Impact

Impact field | Value
Users affected | [number or %]
Duration | [X minutes]
SLO budget consumed | [X%]
Revenue/business impact | [if applicable]

Timeline (UTC)

Time | Event
HH:MM | First alert fired in [tool]
HH:MM | On-call paged
HH:MM | Incident channel created, [name] assigned as IC
HH:MM | [Key diagnostic action/hypothesis]
HH:MM | Root cause identified: [brief description]
HH:MM | Fix deployed / mitigation applied
HH:MM | Incident resolved

Root Cause

[Five whys chain ending at the systemic cause]

Contributing Factors

# | Factor
1 | [Process/tool/system gap, framed blamelessly]
2 | [Process/tool/system gap, framed blamelessly]

Action Items

Action (verb-led) | Owner | Type | Tracker # | Due
Add alerting for >1% error rate | [Name] | Preventative | JIRA-123 | Sprint 14

Lessons Learned

Reflection area | Note
What went well | [behavior or practice that helped]
What went poorly | [gap to address]

Once drafted, the incident postmortem needs to be reviewed. At least one reviewer should validate the timeline, root cause, and action items before the document enters the learning loop. Tag it with the relevant service names so the failure mode stays searchable, then add it to the team or organization repository.

From Reviewed Document to Real Change

Writing the postmortem produces the reviewed record. Closing action items produces the system, process, or runbook change that the next on-call engineer inherits. The postmortem owner should choose the highest-priority action item from the most recent incident, move it into the real task tracker, and then check whether that item updates an incident response runbook. A stale runbook followed under pressure makes the next incident worse.

Cosmos connects code, tickets, documentation, and external sources through linked services via MCP, pulling those sources into the same view teams use to scope the fix. That view covers the tracker entry, the runbook update, and the code paths named in the postmortem, without switching between tools. The same organizational memory layer that speeds evidence collection before the review meeting also accelerates remediation routing after it.

Written by

Ani Galstian

Ani writes about enterprise-scale AI coding tool evaluation, agentic development security, and the operational patterns that make AI agents reliable in production. His guides cover topics like AGENTS.md context files, spec-as-source-of-truth workflows, and how engineering teams should assess AI coding tools across dimensions like auditability and security compliance.
