devops-incident-response / data /runbooks /database_performance.md
Arijit-07's picture
Add Task 6: Database Performance Degradation (missing index investigation)
0a14522

Database Performance Degradation

This runbook outlines the recommended procedure for handling database performance issues, specifically focusing on slow queries, high CPU caused by sequential table scans, and missing query indexes.

1. Diagnose Database Load

If the database (postgres-primary) is exhibiting high CPU or degraded performance without actual service crashes, use the read_metrics action on the database.

  • Look at the Sequential scans/min.
  • If this value is massively elevated (e.g. 500+ instead of single digits), it means queries are scanning entire tables instead of looking up rows in an index.

2. Check Slow Query Logs

Use read_logs on the database to verify the slow queries.

  • Slow query logs will identify specific query strings taking >1000ms.
  • They will likely append [seq_scan] indicating they hit the table sequentially.
  • The logs may also include automated schema anomaly warnings, such as "MISSING INDEX DETECTED".

3. Resolving Missing Indexes

If a missing index is detected, it is highly likely that a recent schema migration added a field but forgot the index.

  • Action Option 1: Use the create_index action, specifying the target table and column (e.g. table="orders", column="user_segment"). This is the best approach if the data is already deployed, as it fixes the issue instantly without breaking backend code.
  • Action Option 2: Use the rollback action on the database service. This will revert the schema migration. It fixes the performance, but causes downstream code applying to the new schema to error until patched.

4. What NOT to do

  • Do NOT restart_service. Connection pool exhaustion is a symptom, not the cause. Restarting only temporarily drops connections before being immediately overwhelmed again.
  • Do NOT scale_up. Adding more replicas/workers will only hammer the slow database harder, increasing lock contention and further starving the CPU.