| # Post-Mortem: [INCIDENT TITLE] |
|
|
| ## Metadata |
| - **Incident ID**: INC-XXXX |
| - **Severity**: P1/P2/P3 |
| - **Date**: YYYY-MM-DD |
| - **Duration**: X hours Y minutes |
| - **Start Time**: HH:MM UTC |
| - **End Time**: HH:MM UTC |
| - **Authors**: @engineer1, @engineer2 |
| - **Status**: Draft/Final |
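
A minimal sketch for filling in **Duration** from the **Date**, **Start Time**, and **End Time** fields above. The function name `incident_duration` and the sample values are hypothetical, and it assumes that an End Time earlier than the Start Time means the incident ran past midnight UTC (the template records only one date).

```python
from datetime import datetime, timedelta, timezone


def incident_duration(date: str, start: str, end: str) -> str:
    """Return 'X hours Y minutes' from the template's Date, Start Time, End Time (all UTC)."""
    fmt = "%Y-%m-%d %H:%M"
    started = datetime.strptime(f"{date} {start}", fmt).replace(tzinfo=timezone.utc)
    ended = datetime.strptime(f"{date} {end}", fmt).replace(tzinfo=timezone.utc)
    if ended < started:
        # Assumption: the incident crossed midnight, so the end falls on the next day.
        ended += timedelta(days=1)
    total_minutes = int((ended - started).total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours} hours {minutes} minutes"


# Hypothetical values, for illustration only.
print(incident_duration("2024-01-15", "14:05", "16:47"))  # 2 hours 42 minutes
```

Incidents that span more than 24 hours would need an explicit end date, which this sketch does not handle.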
|
|
| ## Executive Summary |
| [1-2 sentences: what happened, customer impact, duration] |
|
|
| ## Impact |
| - **Customers affected**: X / Y (Z%) |
| - **Requests failed**: X |
| - **Revenue impact**: $X |
| - **Error budget consumed**: X% of 30-day budget |
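
A hedged sketch of the arithmetic behind **Customers affected** and **Error budget consumed**. It assumes a request-based SLO, where the 30-day budget is the number of requests allowed to fail under the target; teams using a time-based SLO would compute this differently. The function names, the 99.9% target, and all numbers are hypothetical.

```python
def percent_affected(affected: int, total: int) -> float:
    """Z% for the 'X / Y (Z%)' field: share of customers affected."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * affected / total


def error_budget_consumed(failed_requests: int, total_requests_30d: int, slo_target: float) -> float:
    """Percent of the 30-day error budget used by this incident (request-based SLO assumed)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1, e.g. 0.999")
    # Requests that may fail over the window without breaching the SLO.
    allowed_failures = (1.0 - slo_target) * total_requests_30d
    if allowed_failures <= 0:
        raise ValueError("total_requests_30d must be positive")
    return 100.0 * failed_requests / allowed_failures


# Hypothetical values, for illustration only.
print(f"{percent_affected(150, 6000):.1f}%")                            # 2.5%
print(f"{error_budget_consumed(25_000, 100_000_000, 0.999):.1f}%")      # 25.0%
```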
|
|
| ## Timeline (UTC) |
| | Time | Event | Action | |
| |------|-------|--------| |
| | HH:MM | Alert fired | On-call paged | |
| | HH:MM | Root cause identified | [What was found] | |
| | HH:MM | Mitigation applied | [What was done] | |
| | HH:MM | Service restored | [Confirmation] | |
| | HH:MM | All-clear | Incident closed | |
|
|
| ## Root Cause |
| [5 Whys analysis] |
| 1. Why did the incident occur? |
| 2. Why was that condition present? |
| 3. Why was that not caught? |
| 4. Why was there no automated prevention? |
| 5. Why was this not in our risk model? |
|
|
| ## What Went Well |
| - [Detection was fast / alert was clear / etc.] |
|
|
| ## What Went Wrong |
| - [Response was slow / runbook was missing / etc.] |
|
|
| ## Action Items |
| | # | Action | Owner | Priority | Due Date | Type | |
| |---|--------|-------|----------|----------|------| |
| | 1 | [Fix] | @eng | P1 | YYYY-MM-DD | Remediate | |
| | 2 | [Prevent] | @eng | P2 | YYYY-MM-DD | Automate | |
| | 3 | [Detect] | @eng | P2 | YYYY-MM-DD | Monitor | |
|
|
| ## Lessons Learned |
| - [Key takeaway 1] |
| - [Key takeaway 2] |
|
|
| ## Appendices |
| - Grafana dashboard screenshots |
| - Alert screenshots |
| - Log excerpts |
|
|