# Runbook: Silent Data Corruption

## What Makes This Hard
Silent data corruption does NOT trigger standard error-rate or latency alerts.
All services appear healthy. The signal is in business-logic metrics:
- Price mismatches in validation logs (WARN level, not ERROR)
- Anomalous average order values in analytics
- Write operations succeeding (HTTP 200) but writing wrong values

## How to Detect
1. Read logs for price-validation-service: look for PRICE_MISMATCH warnings
2. Read metrics for analytics-service: look for avg_order_value anomalies
3. Read logs for data-pipeline-service: check for a recent deployment
4. Correlate: did the mismatch rate spike immediately after a pipeline deployment?
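
The correlation in step 4 can be sketched as a before/after comparison of the mismatch rate around the deployment time. This is a minimal sketch, not the services' actual tooling: the event shape (`ts`, `mismatch`), the 30-minute window, and the `factor` and `floor` thresholds are all assumptions chosen for illustration.

```python
from datetime import datetime, timedelta

def mismatch_rate(events, start, end):
    """Fraction of validation events in [start, end) flagged as a mismatch."""
    window = [e for e in events if start <= e["ts"] < end]
    if not window:
        return 0.0
    return sum(e["mismatch"] for e in window) / len(window)

def spiked_after_deploy(events, deploy_ts, window=timedelta(minutes=30),
                        factor=3.0, floor=0.01):
    """True if the mismatch rate after deploy_ts is above `floor` and more
    than `factor` times the rate in the same-sized window before it.
    Both thresholds are hypothetical defaults, not tuned values."""
    before = mismatch_rate(events, deploy_ts - window, deploy_ts)
    after = mismatch_rate(events, deploy_ts, deploy_ts + window)
    return after >= floor and after > factor * before
```

Each event is assumed to be a dict parsed from a price-validation-service log line, with `mismatch` set to `True` for PRICE_MISMATCH warnings.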

## Root Cause Pattern
A data pipeline deployment introduced a bug that writes incorrect values
to the product catalog. Writes succeed at the DB level (no errors),
but the values are wrong (e.g., a misplaced decimal point makes prices off by a factor of 10).
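
To illustrate the decimal-shift pattern, here is a minimal sketch of a check that flags a catalog price differing from a trusted reference price by an exact power of ten. The function name and the `max_shift` bound are hypothetical; prices are passed as strings so that `Decimal` compares them exactly.

```python
from decimal import Decimal

def decimal_shift(expected, actual, max_shift=3):
    """Return k if actual == expected * 10**k for some nonzero k with
    |k| <= max_shift, else None. Inputs should be strings or Decimals."""
    expected, actual = Decimal(expected), Decimal(actual)
    if expected == 0 or actual == 0:
        return None
    for k in range(-max_shift, max_shift + 1):
        # scaleb(k) shifts the decimal point by k places without rounding
        if k != 0 and actual == expected.scaleb(k):
            return k
    return None
```

For example, `decimal_shift("19.99", "199.90")` returns `1`, which matches the 10x case described above, while an ordinary price change such as `"19.99"` to `"21.00"` returns `None`.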

## Remediation — Two Steps Required

### Step 1: Stop the corruption
Roll back the pipeline service to stop new corrupt writes.

```
action: rollback
service: data-pipeline-service
version: previous
```

### Step 2: Audit existing corrupt data
Rollback stops NEW corruption but does NOT fix data already written.
You MUST page the data engineering team to run a correction job.

```
action: alert_oncall
reason: Data corruption detected — price-validation mismatch rate 15%. 
        Pipeline rolled back. Need audit and correction of product-catalog prices.
```
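
When handing off to data engineering, it may help to scope the audit to rows written while the faulty version was live. The sketch below assumes each product-catalog row carries an `updated_at` timestamp and an `id`; both field names are hypothetical, and the real correction job may use a different selection method.

```python
from datetime import datetime

def rows_to_audit(rows, deploy_ts, rollback_ts):
    """Return ids of rows last written between the faulty deployment
    and the rollback (inclusive). These are candidates for correction,
    not confirmed corrupt rows."""
    return [r["id"] for r in rows
            if deploy_ts <= r["updated_at"] <= rollback_ts]
```

Rows updated again after the rollback fall outside this window, so the result is a starting point for the audit rather than a complete list.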

## Do NOT
- Restart services (won't fix written data)
- Scale up services (more replicas = more corrupt writes)
- Close the incident after rollback only — corrupted data persists until corrected