shaikhsalman committed on
Commit 845a3fb · verified · 1 Parent(s): dd25ceb

Upload incident-response/postmortem/template.md with huggingface_hub

incident-response/postmortem/template.md ADDED
@@ -0,0 +1,59 @@
+ # Post-Mortem: [INCIDENT TITLE]
+
+ ## Metadata
+ - **Incident ID**: INC-XXXX
+ - **Severity**: P1/P2/P3
+ - **Date**: YYYY-MM-DD
+ - **Duration**: X hours Y minutes
+ - **Start Time**: HH:MM UTC
+ - **End Time**: HH:MM UTC
+ - **Authors**: @engineer1, @engineer2
+ - **Status**: Draft/Final
+
+ ## Executive Summary
+ [1-2 sentences: what happened, customer impact, duration]
+
+ ## Impact
+ - **Customers affected**: X / Y (Z%)
+ - **Requests failed**: X
+ - **Revenue impact**: $X
+ - **Error budget consumed**: X% of 30d budget
+
+ ## Timeline (UTC)
+ | Time | Event | Action |
+ |------|-------|--------|
+ | HH:MM | Alert fired | On-call paged |
+ | HH:MM | Root cause identified | [What was found] |
+ | HH:MM | Mitigation applied | [What was done] |
+ | HH:MM | Service restored | [Confirmation] |
+ | HH:MM | All-clear | Incident closed |
+
+ ## Root Cause
+ [5 Whys analysis]
+ 1. Why did the incident occur?
+ 2. Why was that condition present?
+ 3. Why was that not caught?
+ 4. Why was there no automated prevention?
+ 5. Why was this not in our risk model?
+
+ ## What Went Well
+ - [Detection was fast / alert was clear / etc.]
+
+ ## What Went Wrong
+ - [Response was slow / runbook was missing / etc.]
+
+ ## Action Items
+ | # | Action | Owner | Priority | Due Date | Type |
+ |---|--------|-------|----------|----------|------|
+ | 1 | [Fix] | @eng | P1 | YYYY-MM-DD | Remediate |
+ | 2 | [Prevent] | @eng | P2 | YYYY-MM-DD | Automate |
+ | 3 | [Detect] | @eng | P2 | YYYY-MM-DD | Monitoring |
+
+ ## Lessons Learned
+ - [Key takeaway 1]
+ - [Key takeaway 2]
+
+ ## Appendices
+ - Grafana dashboard screenshots
+ - Alert screenshots
+ - Log excerpts
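
The Duration, Customers affected, and Error budget consumed fields in the template are derived values. The following is a minimal sketch of how they might be filled in, assuming a request-based SLO measured over the 30-day window. The function names, the `slo_target` parameter, and the sample numbers are hypothetical and are not part of the committed template.

```python
from datetime import datetime, timedelta


def incident_duration(start_hhmm: str, end_hhmm: str) -> str:
    """Format the gap between two HH:MM UTC times as 'X hours Y minutes'."""
    fmt = "%H:%M"
    start = datetime.strptime(start_hhmm, fmt)
    end = datetime.strptime(end_hhmm, fmt)
    if end < start:
        # Assumption: the incident crossed midnight exactly once.
        end += timedelta(days=1)
    minutes = int((end - start).total_seconds() // 60)
    return f"{minutes // 60} hours {minutes % 60} minutes"


def percent_affected(affected: int, total: int) -> float:
    """Z in 'Customers affected: X / Y (Z%)'."""
    return round(100.0 * affected / total, 2)


def error_budget_consumed(failed_requests: int,
                          total_requests_30d: int,
                          slo_target: float) -> float:
    """Share of the 30-day error budget used by this incident, in percent.

    Assumption: the budget is the number of requests allowed to fail,
    i.e. (1 - slo_target) * total requests in the window.
    """
    allowed_failures = (1.0 - slo_target) * total_requests_30d
    return round(100.0 * failed_requests / allowed_failures, 2)


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    print(incident_duration("23:30", "01:15"))
    print(percent_affected(150, 2000))
    print(error_budget_consumed(5000, 10_000_000, 0.999))
```

A time-based SLO (minutes of downtime against allowed downtime) would need a different budget formula; the request-based form is used here only because the template lists "Requests failed" next to the budget field.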