# Post-Mortem: [INCIDENT TITLE]
## Metadata
- **Incident ID**: INC-XXXX
- **Severity**: P1/P2/P3
- **Date**: YYYY-MM-DD
- **Duration**: X hours Y minutes
- **Start Time**: HH:MM UTC
- **End Time**: HH:MM UTC
- **Authors**: @engineer1, @engineer2
- **Status**: Draft/Final
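
The **Duration** field can be derived from **Start Time** and **End Time** rather than computed by hand. The following is a minimal sketch, not part of the template itself: the function name `incident_duration` and the `days_spanned` parameter are hypothetical, and it assumes both times are given as `HH:MM` in UTC.

```python
from datetime import datetime, timedelta


def incident_duration(start_hhmm: str, end_hhmm: str, days_spanned: int = 0) -> str:
    """Format the time between two HH:MM UTC timestamps as 'X hours Y minutes'.

    days_spanned is a hypothetical helper parameter: the number of midnights
    crossed between start and end (0 for a same-day incident).
    """
    start = datetime.strptime(start_hhmm, "%H:%M")
    end = datetime.strptime(end_hhmm, "%H:%M") + timedelta(days=days_spanned)
    if end < start:
        raise ValueError("end time is before start time; check days_spanned")
    total_minutes = int((end - start).total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours} hours {minutes} minutes"


if __name__ == "__main__":
    print(incident_duration("14:05", "16:47"))
    print(incident_duration("23:30", "00:15", days_spanned=1))
```

Passing `days_spanned` explicitly avoids silently guessing when an incident crosses midnight.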
## Executive Summary
[1-2 sentences: what happened, customer impact, duration]
## Impact
- **Customers affected**: X / Y (Z%)
- **Requests failed**: X
- **Revenue impact**: $X
- **Error budget consumed**: X% of the 30-day error budget
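
One way to fill in the percentage fields above is sketched below. It assumes a request-based availability SLO (for example 99.9% of requests succeed over 30 days); the function names, the SLO target, and the request counts are illustrative assumptions, not values from this template. A time-based SLO would need a different calculation.

```python
def percent_affected(affected: int, total: int) -> float:
    """Share of customers affected, as a percentage (the Z% field)."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * affected / total


def error_budget_consumed(slo_target: float, window_requests: int, failed_requests: int) -> float:
    """Percentage of the window's error budget used by the incident.

    Assumes a request-based SLO: the budget is the number of requests
    allowed to fail in the window, (1 - slo_target) * window_requests.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, e.g. 0.999")
    if window_requests <= 0:
        raise ValueError("window_requests must be positive")
    allowed_failures = (1.0 - slo_target) * window_requests
    return 100.0 * failed_requests / allowed_failures


if __name__ == "__main__":
    # Illustrative numbers only: 99.9% SLO, 10M requests in 30 days, 2,500 failed.
    consumed = error_budget_consumed(0.999, 10_000_000, 2_500)
    print(f"Error budget consumed: {consumed:.1f}% of the 30-day budget")
    print(f"Customers affected: {percent_affected(120, 4_800):.1f}%")
```

A result above 100% means the incident alone exhausted the budget for the window.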
## Timeline (UTC)
| Time | Event | Action |
|------|-------|--------|
| HH:MM | Alert fired | On-call paged |
| HH:MM | Root cause identified | [What was found] |
| HH:MM | Mitigation applied | [What was done] |
| HH:MM | Service restored | [Confirmation] |
| HH:MM | All-clear | Incident closed |
## Root Cause
[5 Whys analysis]
1. Why did the incident occur?
2. Why was that condition present?
3. Why was that not caught?
4. Why was there no automated prevention?
5. Why was this not in our risk model?
## What Went Well
- [Detection was fast / alert was clear / etc.]
## What Went Wrong
- [Response was slow / runbook was missing / etc.]
## Action Items
| # | Action | Owner | Priority | Due Date | Type |
|---|--------|-------|----------|----------|------|
| 1 | [Fix] | @eng | P1 | YYYY-MM-DD | Remediate |
| 2 | [Prevent] | @eng | P2 | YYYY-MM-DD | Automate |
| 3 | [Detect] | @eng | P2 | YYYY-MM-DD | Monitoring |
## Lessons Learned
- [Key takeaway 1]
- [Key takeaway 2]
## Appendices
- Grafana dashboard screenshots
- Alert screenshots
- Log excerpts