Runbook: Deployment Rollback

When to Roll Back

Error rate spike immediately following a deployment
Latency increase correlated with a new version going live
A service was recently deployed (last_deployed within the last hour)
Logs show errors that did not exist before the deployment
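The signals above can be checked mechanically before anyone reaches for a rollback. The following is a minimal sketch, not the environment's actual logic: only last_deployed comes from this runbook, while the other metric keys (error_rate_before, error_rate_after, p99_latency_before, p99_latency_after, new_error_signatures) and the spike_factor threshold are hypothetical names chosen for illustration.

```python
from datetime import datetime, timedelta, timezone

RECENT_WINDOW = timedelta(hours=1)  # "within the last hour", from the runbook


def rollback_signals(metrics: dict, now: datetime, spike_factor: float = 2.0) -> list:
    """List which rollback signals are present for one service.

    Every key except last_deployed is hypothetical, and spike_factor is an
    assumed threshold; the runbook does not define what counts as a spike.
    """
    signals = []
    age = now - metrics["last_deployed"]
    if timedelta(0) <= age <= RECENT_WINDOW:
        signals.append("recently_deployed")
    if metrics["error_rate_after"] > spike_factor * metrics["error_rate_before"]:
        signals.append("error_rate_spike")
    if metrics["p99_latency_after"] > spike_factor * metrics["p99_latency_before"]:
        signals.append("latency_increase")
    if metrics["new_error_signatures"]:
        signals.append("new_log_errors")
    return signals


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
example = {
    "last_deployed": now - timedelta(minutes=20),
    "error_rate_before": 0.01,
    "error_rate_after": 0.15,
    "p99_latency_before": 120.0,
    "p99_latency_after": 130.0,
    "new_error_signatures": ["NullPointerException in checkout handler"],
}
print(rollback_signals(example, now))
```

A service that shows recently_deployed together with at least one symptom is a reasonable rollback candidate; a symptom without a recent deployment points elsewhere.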

How to Identify the Bad Deployment

Check current_version and last_deployed in service metrics
Correlate the deployment timestamp with the incident start time
Read the service logs: errors that first appear after the deployment point to it as the likely cause
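The timestamp correlation can be sketched as a filter over service metrics. This is an illustrative sketch under assumptions: current_version and last_deployed come from the runbook, but the name key, the list-of-dicts shape, and the one-hour lookback window are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def suspect_deployments(services, incident_start, window=timedelta(hours=1)):
    """Return (name, current_version) for services deployed shortly before
    the incident began, closest deployment first."""
    suspects = []
    for svc in services:
        gap = incident_start - svc["last_deployed"]
        # Only deployments that happened before the incident, within the window.
        if timedelta(0) <= gap <= window:
            suspects.append((gap, svc["name"], svc["current_version"]))
    suspects.sort()  # smallest gap between deployment and incident first
    return [(name, version) for _, name, version in suspects]


incident_start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
services = [
    {"name": "payments", "current_version": "v2.4.0",
     "last_deployed": incident_start - timedelta(minutes=5)},
    {"name": "auth", "current_version": "v1.9.2",
     "last_deployed": incident_start - timedelta(days=3)},
]
print(suspect_deployments(services, incident_start))
```

The result is only a shortlist. The logs of each suspect still need to be read to confirm that the errors are new.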

Remediation

action: rollback
service: <service-that-was-deployed>
version: <previous-stable-version>

If you don't know the exact previous version, set version to previous and the system will revert to the last known-good artifact.
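A small helper can build this action and apply the fallback automatically. It is a sketch only: the action, service, and version fields and the previous keyword come from this runbook, while the function name build_rollback_action and the example service name are hypothetical. How the action is submitted is not covered here.

```python
from typing import Optional


def build_rollback_action(service: str, previous_version: Optional[str] = None) -> dict:
    """Build the rollback action described in the runbook.

    Falls back to the literal "previous" when the exact stable version
    is unknown, so the system picks the last known-good artifact.
    """
    if not service:
        raise ValueError("service must name the service that was deployed")
    return {
        "action": "rollback",
        "service": service,
        "version": previous_version or "previous",
    }


print(build_rollback_action("payments", "v2.3.1"))
print(build_rollback_action("payments"))
```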

Post-Rollback

Monitor error rate for 5 minutes to confirm recovery
Downstream services should recover automatically as upstream stabilises
Alert the owning team so they can investigate the bad release
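The five-minute check can be sketched as a polling loop. This is an assumption-heavy illustration: get_error_rate is a hypothetical callable standing in for whatever metrics source is available, and the threshold, the 30-second interval, and the rule that the last three samples must be healthy are all choices made for this example, not values from the runbook.

```python
import time
from typing import Callable


def confirm_recovery(
    get_error_rate: Callable[[], float],
    threshold: float,
    duration_s: float = 300.0,   # 5 minutes, per the runbook
    interval_s: float = 30.0,    # assumed polling interval
    settle_samples: int = 3,     # assumed: last 3 samples must be healthy
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Sample the error rate across the monitoring window and report whether
    the final settle_samples readings are all at or below threshold."""
    checks = max(settle_samples, int(duration_s // interval_s) + 1)
    samples = []
    for i in range(checks):
        samples.append(get_error_rate())
        if i < checks - 1:
            sleep(interval_s)  # wait between samples, not after the last one
    return all(rate <= threshold for rate in samples[-settle_samples:])


# Dry run with canned readings and a no-op sleep so it finishes instantly.
readings = iter([0.20, 0.12, 0.06, 0.03, 0.02, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01])
print(confirm_recovery(lambda: next(readings), threshold=0.02, sleep=lambda _: None))
```

Judging only the tail of the window tolerates the first few samples still being elevated while the rolled-back version warms up.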

Do NOT

Roll back services that were NOT recently deployed
Roll back before confirming the new deployment is actually the cause
Restart services instead of rolling back (a restart keeps the bad version running)