Runbook: Silent Data Corruption
What Makes This Hard
Silent data corruption does NOT trigger standard error-rate or latency alerts. All services appear healthy. The signal shows up only in business-logic logs and metrics:
- Price mismatches in validation logs (WARN level, not ERROR)
- Anomalous average order values in analytics
- Write operations succeeding (HTTP 200) but writing wrong values
How to Detect
- Read logs for price-validation-service — look for PRICE_MISMATCH warnings
- Read metrics for analytics-service — look for avg_order_value anomalies
- Read logs for data-pipeline-service — check for a recent deployment
- Correlate: did the mismatch rate spike immediately after a pipeline deployment?
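The correlation step can be sketched as a before/after comparison around the deployment time. This is a minimal illustration, not the team's tooling: it assumes the PRICE_MISMATCH log lines have already been parsed into (timestamp, is_mismatch) pairs, and the function names, the 30-minute window, the 5x factor and the 1% floor are all hypothetical values to tune.

```python
from datetime import datetime, timedelta


def mismatch_rate(samples, start, end):
    """Fraction of validations in [start, end) flagged as mismatches.

    `samples` is a list of (timestamp, is_mismatch) pairs (assumed shape).
    Returns None when the window holds no samples.
    """
    window = [is_mismatch for ts, is_mismatch in samples if start <= ts < end]
    if not window:
        return None
    return sum(1 for flag in window if flag) / len(window)


def spiked_after_deploy(samples, deploy_time,
                        window=timedelta(minutes=30),
                        factor=5.0, floor=0.01):
    """True when the post-deploy mismatch rate is at least `floor` and at
    least `factor` times the pre-deploy rate. Thresholds are illustrative."""
    before = mismatch_rate(samples, deploy_time - window, deploy_time)
    after = mismatch_rate(samples, deploy_time, deploy_time + window)
    if before is None or after is None:
        return False  # not enough data on one side to compare
    return after >= max(floor, factor * before)
```

A True result only supports the hypothesis that the deployment caused the spike; it is still worth confirming against the deployment log before rolling back.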
Root Cause Pattern
A data pipeline deployment introduced a bug that writes incorrect values to the product catalog. Writes succeed at the DB level (no errors), but the values are wrong (e.g., a misplaced decimal point that makes prices off by a factor of 10).
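For the decimal-point example above, one way to recognize the pattern is to check whether a written price is close to the expected price multiplied by a power of ten. The sketch below is an assumption-laden illustration: the function name, the exponent range of -3 to 3, and the 1% tolerance are hypothetical, and other corruption patterns would need different checks.

```python
from decimal import Decimal


def looks_like_decimal_shift(expected, written, tolerance=Decimal("0.01")):
    """True when `written` is roughly `expected` * 10**k for a nonzero k
    between -3 and 3, i.e. the decimal point appears to have moved.

    Both arguments are Decimal prices; non-positive values are rejected.
    """
    if expected <= 0 or written <= 0:
        return False
    ratio = written / expected
    for k in (-3, -2, -1, 1, 2, 3):
        target = Decimal(10) ** k
        # Relative tolerance so the check scales with the target ratio.
        if abs(ratio - target) <= tolerance * target:
            return True
    return False
```

For instance, an expected price of 19.99 written as 199.90 or as 1.999 would be flagged, while an ordinary price change from 19.99 to 24.99 would not.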
Remediation — Two Steps Required
Step 1: Stop the corruption
Roll back the pipeline service to stop new corrupt writes.
action: rollback
service: data-pipeline-service
version: previous
Step 2: Audit existing corrupt data
Rollback stops NEW corruption but does NOT fix data already written. You MUST page the data engineering team to run a correction job.
action: alert_oncall
reason: Data corruption detected — price-validation mismatch rate 15%.
  Pipeline rolled back. Need audit and correction of product-catalog prices.
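To help the data engineering team scope the audit, the suspect records are those written while the faulty pipeline version was live. The sketch below assumes catalog writes can be exported with a write timestamp; the CatalogWrite record and its field names are hypothetical, and this only narrows the audit, it does not replace the correction job.

```python
from datetime import datetime
from decimal import Decimal
from typing import NamedTuple


class CatalogWrite(NamedTuple):
    """One product-catalog write (hypothetical export shape)."""
    sku: str
    price: Decimal
    written_at: datetime


def suspect_writes(writes, deployed_at, rolled_back_at):
    """Return writes made between the bad deployment and the rollback,
    inclusive at both ends, in their original order."""
    return [w for w in writes
            if deployed_at <= w.written_at <= rolled_back_at]
```

Including the deployment and rollback timestamps in the page gives the on-call engineer the same window to start from.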
Do NOT
- Restart services (won't fix written data)
- Scale up services (more replicas = more corrupt writes)
- Close the incident after the rollback alone (corrupted data persists until corrected)