Upload incident-response/postmortem/template.md with huggingface_hub
incident-response/postmortem/template.md
ADDED
@@ -0,0 +1,59 @@
# Post-Mortem: [INCIDENT TITLE]

## Metadata
- **Incident ID**: INC-XXXX
- **Severity**: P1/P2/P3
- **Date**: YYYY-MM-DD
- **Duration**: X hours Y minutes
- **Start Time**: HH:MM UTC
- **End Time**: HH:MM UTC
- **Authors**: @engineer1, @engineer2
- **Status**: Draft/Final
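
The **Duration** field can be derived from **Start Time** and **End Time** rather than typed by hand. The sketch below is one possible way to do that; the function name `incident_duration` is hypothetical, and it assumes both times are `HH:MM` in UTC and that an end time earlier than the start time means the incident crossed midnight once.

```python
from datetime import datetime, timedelta


def incident_duration(start_hhmm: str, end_hhmm: str) -> str:
    """Format the gap between two HH:MM UTC times as 'X hours Y minutes'."""
    fmt = "%H:%M"
    start = datetime.strptime(start_hhmm, fmt)
    end = datetime.strptime(end_hhmm, fmt)
    # Assumption: an end time earlier than the start means the incident
    # crossed midnight UTC exactly once.
    if end < start:
        end += timedelta(days=1)
    total_minutes = int((end - start).total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours} hours {minutes} minutes"


print(incident_duration("14:05", "16:47"))  # 2 hours 42 minutes
```

Because the template records only a time of day, this approach cannot represent incidents longer than 24 hours; full timestamps would be needed for those.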

## Executive Summary
[1-2 sentences: what happened, customer impact, duration]

## Impact
- **Customers affected**: X / Y (Z%)
- **Requests failed**: X
- **Revenue impact**: $X
- **Error budget consumed**: X% of 30d budget
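
As a rough illustration of how the percentage fields above might be filled in, the sketch below computes the share of customers affected and the share of a 30-day error budget consumed. It assumes a request-based availability SLO, which the template does not specify; the function names, the SLO target, and the sample numbers are all hypothetical.

```python
def percent_affected(affected: int, total: int) -> float:
    """Customers affected as a percentage of all customers (Z in 'X / Y (Z%)')."""
    return 100.0 * affected / total


def error_budget_consumed(failed_requests: int,
                          total_requests_30d: int,
                          slo_target: float) -> float:
    """Percentage of the 30-day error budget used by this incident.

    Assumption: the budget is the number of requests allowed to fail
    over 30 days under a request-based SLO, (1 - slo_target) * total.
    """
    allowed_failures = (1.0 - slo_target) * total_requests_30d
    return 100.0 * failed_requests / allowed_failures


# Hypothetical numbers for illustration only.
print(f"{percent_affected(120, 4000):.1f}%")
print(f"{error_budget_consumed(50_000, 100_000_000, 0.999):.1f}% of 30d budget")
```

A time-based SLO (minutes of downtime allowed per 30 days) would need a different denominator, so the formula should be adapted to however the service's SLO is actually defined.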

## Timeline (UTC)
| Time | Event | Action |
|------|-------|--------|
| HH:MM | Alert fired | On-call paged |
| HH:MM | Root cause identified | [What was found] |
| HH:MM | Mitigation applied | [What was done] |
| HH:MM | Service restored | [Confirmation] |
| HH:MM | All-clear | Incident closed |

## Root Cause
[5 Whys analysis]
1. Why did the incident occur?
2. Why was that condition present?
3. Why was that not caught?
4. Why was there no automated prevention?
5. Why was this not in our risk model?

## What Went Well
- [Detection was fast / alert was clear / etc.]

## What Went Wrong
- [Response was slow / runbook was missing / etc.]

## Action Items
| # | Action | Owner | Priority | Due Date | Type |
|---|--------|-------|----------|----------|------|
| 1 | [Fix] | @eng | P1 | YYYY-MM-DD | Remediate |
| 2 | [Prevent] | @eng | P2 | YYYY-MM-DD | Automate |
| 3 | [Detect] | @eng | P2 | YYYY-MM-DD | Monitoring |

## Lessons Learned
- [Key takeaway 1]
- [Key takeaway 2]

## Appendices
- Grafana dashboard screenshots
- Alert screenshots
- Log excerpts