| # Runbook: Deployment Rollback | |
| ## When to Roll Back | |
| - Error rate spike immediately following a deployment | |
| - Latency increase correlated with a new version going live | |
| - A service was recently deployed (`last_deployed` within the last hour) | |
| - Logs show errors that did not exist before the deployment | |
| ## How to Identify the Bad Deployment | |
| 1. Check `current_version` and `last_deployed` in service metrics | |
| 2. Correlate the deployment timestamp with the incident start time | |
| 3. Read the service logs: errors that first appear after the deployment point to it as the likely cause | |
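Steps 1 and 2 can be sketched in Python as a timestamp comparison. This is a minimal illustration, not part of the runbook tooling: the function name `deployment_precedes_incident`, the one-hour default window, and the example timestamps are all assumptions, and `last_deployed` is assumed to be available as a timezone-aware datetime.

```python
from datetime import datetime, timedelta, timezone


def deployment_precedes_incident(
    last_deployed: datetime,
    incident_start: datetime,
    window: timedelta = timedelta(hours=1),  # assumed window, mirrors "within the last hour"
) -> bool:
    """Return True if the deployment happened at most `window` before the incident began."""
    gap = incident_start - last_deployed
    # A negative gap means the incident started before the deployment,
    # so the deployment cannot be the cause.
    return timedelta(0) <= gap <= window


# Hypothetical example values
deployed = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
incident = datetime(2024, 1, 1, 12, 10, tzinfo=timezone.utc)
print(deployment_precedes_incident(deployed, incident))
```

A `True` result only shows that the timing fits; the log check in step 3 is still needed before treating the deployment as the cause.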
| ## Remediation | |
| ``` | |
| action: rollback | |
| service: <service-that-was-deployed> | |
| version: <previous-stable-version> | |
| ``` | |
| If you don't know the exact previous version, use `previous` and the | |
| system will revert to the last known-good artifact. | |
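As a sketch of how the action above might be assembled programmatically, the following Python helper fills in `previous` when no explicit version is supplied. The helper name `build_rollback_action` and the example service name are hypothetical; only the three fields and the `previous` fallback come from this runbook.

```python
from typing import Dict, Optional


def build_rollback_action(service: str, version: Optional[str] = None) -> Dict[str, str]:
    """Build the rollback action, falling back to `previous` when no version is known."""
    if not service:
        raise ValueError("service name is required")
    return {
        "action": "rollback",
        "service": service,
        # `previous` asks the system to revert to the last known-good artifact
        "version": version or "previous",
    }


# Hypothetical service name, version unknown
print(build_rollback_action("checkout-service"))
```

Passing an explicit version is preferable when it is known, since it makes the target of the rollback unambiguous in the incident record.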
| ## Post-Rollback | |
| - Monitor error rate for 5 minutes to confirm recovery | |
| - Downstream services should recover automatically as upstream stabilises | |
| - Alert the owning team so they can investigate the bad release | |
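The recovery check in the first bullet can be expressed as a small predicate over the error-rate samples collected during the 5-minute window. This is a sketch under assumptions: the function name `has_recovered`, the sampling scheme, and the threshold value are illustrative and not defined by this runbook.

```python
from typing import Sequence


def has_recovered(error_rates: Sequence[float], threshold: float) -> bool:
    """Return True only if every sample in the monitoring window is at or below the threshold."""
    # An empty window is treated as "not confirmed" rather than as a recovery.
    return len(error_rates) > 0 and all(rate <= threshold for rate in error_rates)


# Hypothetical samples, one per minute, as fractions of failed requests
samples = [0.004, 0.003, 0.002, 0.002, 0.001]
print(has_recovered(samples, threshold=0.01))
```

Requiring every sample to pass, rather than the average, avoids declaring recovery while the error rate is still spiking intermittently.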
| ## Do NOT | |
| - Roll back services that were NOT recently deployed | |
| - Roll back before confirming that the new deployment is actually the cause | |
| - Restart services instead of rolling back (a restart keeps the bad version running) | |