Post-Mortem: [INCIDENT TITLE]
Metadata
- Incident ID: INC-XXXX
- Severity: P1/P2/P3
- Date: YYYY-MM-DD
- Duration: X hours Y minutes
- Start Time: HH:MM UTC
- End Time: HH:MM UTC
- Authors: @engineer1, @engineer2
- Status: Draft/Final
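The Duration field can be derived from the Start Time and End Time rather than worked out by hand. The following is a minimal Python sketch, assuming both timestamps are in UTC and written as `YYYY-MM-DD HH:MM` (the date is included so incidents that cross midnight are handled). The function name `incident_duration` and the sample values are hypothetical.

```python
from datetime import datetime


def incident_duration(start: str, end: str) -> str:
    """Return the elapsed time between two UTC timestamps as 'X hours Y minutes'."""
    fmt = "%Y-%m-%d %H:%M"  # assumed input format, not mandated by the template
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    if delta.total_seconds() < 0:
        raise ValueError("end time is before start time")
    total_minutes = int(delta.total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours} hours {minutes} minutes"


# Hypothetical example values, not taken from a real incident.
print(incident_duration("2024-01-15 23:40", "2024-01-16 01:25"))
```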
Executive Summary
[1-2 sentences: what happened, customer impact, duration]
Impact
- Customers affected: X / Y (Z%)
- Requests failed: X
- Revenue impact: $X
- Error budget consumed: X% of 30-day budget
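The percentages in this section are simple ratios, and the sketch below shows one way to compute them. It assumes a request-based availability SLO, so the 30-day error budget is the number of requests allowed to fail in that window; teams using a time-based SLO would need a different formula. The function names, parameters, and sample numbers are all hypothetical.

```python
def percent_customers_affected(affected: int, total: int) -> float:
    """Z% in 'Customers affected: X / Y (Z%)'."""
    if total <= 0:
        raise ValueError("total customers must be positive")
    return 100.0 * affected / total


def error_budget_consumed(failed_requests: int,
                          total_requests_30d: int,
                          slo_target: float) -> float:
    """Percent of the 30-day error budget consumed by the failed requests.

    Assumes a request-based SLO: the budget is the share of requests
    allowed to fail, (1 - slo_target), applied to the 30-day request volume.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, e.g. 0.999")
    if total_requests_30d <= 0:
        raise ValueError("total_requests_30d must be positive")
    allowed_failures = (1.0 - slo_target) * total_requests_30d
    return 100.0 * failed_requests / allowed_failures


# Hypothetical figures for illustration only.
print(f"{percent_customers_affected(120, 4000):.1f}% of customers affected")
print(f"{error_budget_consumed(50_000, 100_000_000, 0.999):.1f}% of 30-day budget")
```

A result above 100% means the incident alone exhausted the budget for the window.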
Timeline (UTC)
| Time | Event | Action |
|---|---|---|
| HH:MM | Alert fired | On-call paged |
| HH:MM | Root cause identified | [What was found] |
| HH:MM | Mitigation applied | [What was done] |
| HH:MM | Service restored | [Confirmation] |
| HH:MM | All-clear | Incident closed |
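Elapsed times between timeline rows (for example, minutes from the alert to mitigation) are often useful when reviewing the response. The sketch below computes minutes elapsed since the first row. It assumes the rows are in chronological order, use `HH:MM` in UTC, and are less than 24 hours apart. The helper name `elapsed_since_first` and the sample rows are hypothetical.

```python
from datetime import datetime, timedelta


def elapsed_since_first(rows: list[tuple[str, str]]) -> list[tuple[str, int]]:
    """Return (event, minutes since the first row) for each (HH:MM, event) row."""
    result = []
    first = None
    previous = None
    day_offset = timedelta(0)
    for time_str, event in rows:
        current = datetime.strptime(time_str, "%H:%M") + day_offset
        # A time earlier than the previous row means the timeline crossed midnight.
        if previous is not None and current < previous:
            day_offset += timedelta(days=1)
            current += timedelta(days=1)
        if first is None:
            first = current
        previous = current
        result.append((event, int((current - first).total_seconds() // 60)))
    return result


# Hypothetical timeline for illustration only.
timeline = [
    ("23:50", "Alert fired"),
    ("00:05", "Root cause identified"),
    ("00:20", "Mitigation applied"),
]
for event, minutes in elapsed_since_first(timeline):
    print(f"+{minutes:>3} min  {event}")
```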
Root Cause
[5 Whys analysis]
- Why did the incident occur?
- Why was that condition present?
- Why was that not caught?
- Why was there no automated prevention?
- Why was this not in our risk model?
What Went Well
- [Detection was fast / alert was clear / etc.]
What Went Wrong
- [Response was slow / runbook was missing / etc.]
Action Items
| # | Action | Owner | Priority | Due Date | Type |
|---|---|---|---|---|---|
| 1 | [Fix] | @eng | P1 | YYYY-MM-DD | Remediate |
| 2 | [Prevent] | @eng | P2 | YYYY-MM-DD | Automate |
| 3 | [Detect] | @eng | P2 | YYYY-MM-DD | Monitor |
Lessons Learned
- [Key takeaway 1]
- [Key takeaway 2]
Appendices
- Grafana dashboard screenshots
- Alert screenshots
- Log excerpts