Runbook: Deployment Rollback

When to Roll Back

  • Error rate spike immediately following a deployment
  • Latency increase correlated with a new version going live
  • A service was recently deployed (last_deployed within the last hour)
  • Logs show errors that did not exist before the deployment
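As a rough illustration, the criteria above could be checked against a service's metrics like this. The `last_deployed` field comes from this runbook; the other field names (`error_rate`, `baseline_error_rate`, `p99_latency_ms`, `baseline_p99_latency_ms`) and the 2x and 1.5x thresholds are assumptions made for the sketch, not values defined by the system.

```python
from datetime import datetime, timedelta, timezone

RECENT_WINDOW = timedelta(hours=1)  # "within the last hour" from the list above


def recently_deployed(last_deployed: datetime, now: datetime) -> bool:
    """True if the deployment happened within the last hour."""
    return timedelta(0) <= now - last_deployed <= RECENT_WINDOW


def rollback_signals(service: dict, now: datetime) -> list:
    """Return the rollback criteria that this service currently matches."""
    signals = []
    if recently_deployed(service["last_deployed"], now):
        signals.append("recent_deployment")
    # Thresholds below are illustrative assumptions, not runbook values.
    if service["error_rate"] > 2 * service["baseline_error_rate"]:
        signals.append("error_rate_spike")
    if service["p99_latency_ms"] > 1.5 * service["baseline_p99_latency_ms"]:
        signals.append("latency_increase")
    return signals
```

A recent deployment on its own is only a hint; it becomes a rollback candidate when it appears together with an error or latency signal.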

How to Identify the Bad Deployment

  1. Check current_version and last_deployed in service metrics
  2. Correlate the deployment timestamp with the incident start time
  3. Read the service logs: errors that first appear after the deployment point to it as the likely cause
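The three steps above can be sketched as two small helpers. This assumes each service is a dict carrying the `last_deployed` timestamp from its metrics, and that error log lines are available as `(timestamp, message)` tuples; both shapes, the function names, and the one-hour `max_gap` default are hypothetical choices for illustration.

```python
from datetime import datetime, timedelta


def suspect_services(services, incident_start, max_gap=timedelta(hours=1)):
    """Services deployed shortly before the incident started, most recent first."""
    suspects = [
        s for s in services
        if timedelta(0) <= incident_start - s["last_deployed"] <= max_gap
    ]
    return sorted(suspects, key=lambda s: s["last_deployed"], reverse=True)


def new_errors(log_entries, deployed_at):
    """Error messages seen after the deployment but never before it."""
    before = {msg for ts, msg in log_entries if ts < deployed_at}
    after = {msg for ts, msg in log_entries if ts >= deployed_at}
    return after - before
```

A service that appears in `suspect_services` and also has a non-empty `new_errors` result is the strongest rollback candidate.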

Remediation

action: rollback
service: <service-that-was-deployed>
version: <previous-stable-version>

If you don't know the exact previous version, set version to previous and the system will revert to the last known-good artifact.
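A minimal sketch of building that action programmatically, falling back to `previous` when the exact version is unknown. The function name is hypothetical; only the three keys and the `previous` value come from this runbook.

```python
from typing import Optional


def build_rollback_action(service: str, previous_version: Optional[str] = None) -> dict:
    """Build the rollback action, using "previous" if no exact version is known."""
    return {
        "action": "rollback",
        "service": service,
        "version": previous_version or "previous",
    }
```

For example, `build_rollback_action("checkout")` yields a payload with `version` set to `previous`, where `checkout` is a made-up service name.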

Post-Rollback

  • Monitor error rate for 5 minutes to confirm recovery
  • Downstream services should recover automatically as upstream stabilises
  • Alert the owning team so they can investigate the bad release
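The 5-minute check could look roughly like the following. The `get_error_rate` callable, the 30-second polling interval, and the rule that the second half of the window must stay at or below the threshold are all assumptions for this sketch; the runbook only specifies the 5-minute duration.

```python
import time


def confirm_recovery(get_error_rate, threshold, duration_s=300, interval_s=30,
                     sleep=time.sleep):
    """Sample the error rate for duration_s seconds and report whether it recovered."""
    samples = []
    for _ in range(duration_s // interval_s):
        samples.append(get_error_rate())
        sleep(interval_s)
    # Assumption: early samples may still be elevated right after the rollback,
    # so only the second half of the window has to be healthy.
    tail = samples[len(samples) // 2:]
    return all(rate <= threshold for rate in tail)
```

Passing `sleep` as a parameter lets the check be exercised without waiting the full five minutes.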

Do NOT

  • Roll back services that were NOT recently deployed
  • Roll back before confirming the new deployment is actually the cause
  • Restart services instead of rolling back (a restart keeps the bad version running)
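These rules can be expressed as a pre-flight check on a proposed action. This is a sketch under stated assumptions: the function name, the `cause_confirmed` flag, and the `restart` action name are hypothetical, and the one-hour window mirrors the "recently deployed" criterion earlier in this runbook.

```python
from datetime import datetime, timedelta


def check_action(action, last_deployed, now, cause_confirmed):
    """Return objections to a proposed action; an empty list means none."""
    recent = timedelta(0) <= now - last_deployed <= timedelta(hours=1)
    problems = []
    if action == "rollback" and not recent:
        problems.append("service was not recently deployed")
    if action == "rollback" and not cause_confirmed:
        problems.append("deployment not yet confirmed as the cause")
    if action == "restart" and recent and cause_confirmed:
        problems.append("restart keeps the bad version; roll back instead")
    return problems
```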