# Phase 5 Operations Runbook

**Version**: 1.0
**Last Updated**: 2026-02-04
**Purpose**: Operational procedures for Phase 5 Todo Application

---

## Table of Contents

1. [Daily Operations](#daily-operations)
2. [Alerting & On-Call](#alerting--on-call)
3. [Incident Response](#incident-response)
4. [Common Issues & Solutions](#common-issues--solutions)
5. [Maintenance Procedures](#maintenance-procedures)
6. [Performance Tuning](#performance-tuning)
7. [Capacity Planning](#capacity-planning)
8. [Disaster Recovery](#disaster-recovery)

---
## Daily Operations

### Morning Checklist (Daily 9 AM)

- [ ] Check Grafana dashboards for anomalies
- [ ] Verify all pods are running
- [ ] Check error rates in Prometheus
- [ ] Review overnight alerts
- [ ] Verify backups completed successfully

**Commands**:

```bash
# Check pod health
kubectl get pods -n phase-5

# Check backup jobs
kubectl get cronjobs -n phase-5
kubectl logs -l job-name=database-backup -n phase-5 --tail=-1

# Check system metrics
kubectl top pods -n phase-5
kubectl top nodes
```
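The pod-health step above can also be scripted for the morning check. A minimal sketch that classifies pods from `kubectl get pods -n phase-5 -o json` output (the stubbed payload below is illustrative, not real cluster data):

```python
import json

def unhealthy_pods(kubectl_json: str) -> list[str]:
    """Return names of pods whose phase is not Running or Succeeded.

    Expects the output of `kubectl get pods -n phase-5 -o json`.
    """
    pods = json.loads(kubectl_json)
    bad = []
    for pod in pods.get("items", []):
        phase = pod.get("status", {}).get("phase", "Unknown")
        if phase not in ("Running", "Succeeded"):
            bad.append(pod["metadata"]["name"])
    return bad

# Example with a stubbed payload:
payload = json.dumps({"items": [
    {"metadata": {"name": "backend-abc"}, "status": {"phase": "Running"}},
    {"metadata": {"name": "backend-def"}, "status": {"phase": "Pending"}},
]})
print(unhealthy_pods(payload))  # ['backend-def']
```

Note that pod `phase` is coarse; a pod can be `Running` with a crash-looping container, so this complements rather than replaces the `kubectl get pods` eyeball check.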
### Weekly Review (Friday 4 PM)

- [ ] Review weekly performance metrics
- [ ] Check SSL certificate expiry
- [ ] Review and rotate secrets (if needed)
- [ ] Clean up old backups
- [ ] Review and update runbook

**Commands**:

```bash
# Check certificate expiry
kubectl get certificates -n phase-5
kubectl describe certificate backend-api-tls -n phase-5 | grep "Not After"

# Clean up old backups (keep last 30 days)
aws s3 ls s3://todo-app-backups/snapshots/ | \
  awk '{print $4}' | \
  head -n -30 | \
  xargs -I {} aws s3 rm s3://todo-app-backups/snapshots/{}
```
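The cleanup pipeline above keeps the 30 newest snapshots by relying on lexicographic ordering of the timestamped object names. The same selection can be expressed as a small function (a sketch; the filename pattern is assumed, not taken from the real bucket):

```python
def backups_to_delete(keys: list[str], keep: int = 30) -> list[str]:
    """Given snapshot object keys with zero-padded timestamped names,
    return the keys to delete, keeping the `keep` newest."""
    ordered = sorted(keys)  # lexicographic == chronological for zero-padded timestamps
    return ordered[:-keep] if len(ordered) > keep else []

# 31 days of hypothetical snapshot names -> the single oldest is deleted
keys = [f"todo-app-backup-202601{d:02d}.sql.gz" for d in range(1, 32)]
print(backups_to_delete(keys))  # ['todo-app-backup-20260101.sql.gz']
```

This only holds while names sort chronologically; if the naming scheme ever changes, switch to sorting on the `LastModified` field from `aws s3 ls` instead.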
---

## Alerting & On-Call

### Alert Severity Levels

| Severity | Response Time | Escalation | Examples |
|----------|---------------|------------|----------|
| **P1 - Critical** | 15 minutes | 30 min | Complete service outage, data loss |
| **P2 - High** | 1 hour | 2 hours | Service degradation, high error rate |
| **P3 - Medium** | 4 hours | Next business day | Performance issues, minor bugs |
| **P4 - Low** | 1 week | N/A | Documentation, cosmetic issues |
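The table above can double as a machine-readable policy, e.g. for paging automation. A minimal sketch (the dictionary layout is illustrative, not an existing config):

```python
SEVERITY_POLICY = {
    "P1": {"response": "15 minutes", "escalation": "30 min"},
    "P2": {"response": "1 hour", "escalation": "2 hours"},
    "P3": {"response": "4 hours", "escalation": "Next business day"},
    "P4": {"response": "1 week", "escalation": None},
}

def response_deadline(severity: str) -> str:
    """Look up the required response time for an alert severity."""
    try:
        return SEVERITY_POLICY[severity]["response"]
    except KeyError:
        raise ValueError(f"unknown severity: {severity}")

print(response_deadline("P1"))  # 15 minutes
```

Keeping the policy in one structure avoids the table and the paging tool drifting apart.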
### Common Alerts

#### 1. HighErrorRate (P1)

**Trigger**: Error rate > 5% for 5 minutes

**Investigation**:

```bash
# Check error rate in Prometheus
open 'http://localhost:9090/graph?g0.expr=rate(http_requests_total{status="error"}[5m])'

# View recent logs
kubectl logs -l app=backend -n phase-5 --tail=100

# Check pod status
kubectl get pods -n phase-5
```

**Resolution**:

- Identify failing component from logs
- Check database connectivity
- Verify external services (Kafka, Ollama)
- Rollback if recent deployment caused issues
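The 5% trigger is simply the ratio of error responses to total requests over the window. A sketch of the same arithmetic, useful for sanity-checking an alert by hand:

```python
def error_rate(error_count: int, total_count: int) -> float:
    """Fraction of requests that failed over the alert window."""
    if total_count == 0:
        return 0.0
    return error_count / total_count

def should_fire(error_count: int, total_count: int, threshold: float = 0.05) -> bool:
    """True if the error rate exceeds the alert threshold (default 5%)."""
    return error_rate(error_count, total_count) > threshold

# 120 errors out of 2,000 requests in the window = 6% -> fires
print(should_fire(120, 2000))  # True
```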
#### 2. HighLatency (P2)

**Trigger**: P95 latency > 3 seconds for 5 minutes

**Investigation**:

```bash
# Check latency metrics
open 'http://localhost:9090/graph?g0.expr=histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))'

# Check database query performance
kubectl logs -l app=backend -n phase-5 | grep "slow query"

# Check resource utilization
kubectl top pods -n phase-5
```

**Resolution**:

- Scale up pods if CPU/memory constrained
- Optimize slow database queries
- Restart stuck pods
- Enable caching for frequently accessed data
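`histogram_quantile` estimates the quantile from bucket counts; with raw latency samples the same idea reduces to a sorted-index lookup. A simplified sketch (nearest-rank method, not Prometheus's bucket interpolation):

```python
import math

def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile of raw latency samples (seconds)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(0.95 * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]

# 6% of requests are slow, so the 95th percentile lands on a slow request
latencies = [0.1] * 94 + [4.0] * 6
print(p95(latencies) > 3.0)  # True -> would breach the 3s trigger
```

This also illustrates why P95 is the right trigger here: a small slow tail breaches it while the average stays low.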
#### 3. PodCrashLooping (P1)

**Trigger**: Pod restart count > 5 in 10 minutes

**Investigation**:

```bash
# Check pod status
kubectl get pods -n phase-5

# Describe pod
kubectl describe pod <pod-name> -n phase-5

# View logs from the previous (crashed) container
kubectl logs <pod-name> -n phase-5 --previous
```

**Resolution**:

- Check logs for error messages
- Verify secrets and configuration
- Check resource limits
- Restart deployment if needed
#### 4. DatabaseConnectionFailed (P1)

**Trigger**: Cannot connect to database

**Investigation**:

```bash
# Check PostgreSQL pod
kubectl get pods -n phase-5 -l app=postgres

# Test connection
kubectl exec -it <backend-pod> -n phase-5 -- psql ${DATABASE_URL}

# Check database credentials
kubectl describe secret db-credentials -n phase-5
```

**Resolution**:

- Verify database pod is running
- Check network policies
- Rotate credentials if compromised
- Restart backend pods after fixing
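Transient connection failures are often absorbed by retrying with exponential backoff before this alert ever fires. A generic sketch (the `connect` callable stands in for whatever driver the backend actually uses):

```python
import time

def connect_with_backoff(connect, attempts: int = 5, base_delay: float = 0.5):
    """Call `connect()` until it succeeds, doubling the delay after each failure."""
    for attempt in range(attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # exhausted retries: let the caller (and the alert) see it
            time.sleep(base_delay * (2 ** attempt))

# Simulated flaky connection: fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("refused")
    return "connected"

print(connect_with_backoff(flaky, base_delay=0.01))  # connected
```

Cap the attempts so a genuinely down database still surfaces as a P1 rather than hanging requests indefinitely.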
---

## Incident Response

### Incident Lifecycle

1. **Detection** - Alert triggered
2. **Acknowledgement** - On-call engineer acknowledges
3. **Investigation** - Gather diagnostic information
4. **Mitigation** - Apply workaround or fix
5. **Resolution** - Verify service is restored
6. **Post-Mortem** - Document incident and improvements

### Incident Commands

**Create Incident Channel**:

```bash
# Create Slack channel
/slack create channel #incident-$(date +%Y%m%d)

# Set topic
/slack set topic "P1 - High Error Rate - Investigating"
```

**Declare Incident**:

```bash
# Post to team
echo "🚨 INCIDENT DECLARED: High Error Rate
Severity: P1
Time: $(date)
Lead: @on-call
Channel: #incident-$(date +%Y%m%d)
Status: Investigating" | slack post
```

**Update Incident**:

```bash
# Update status
/slack post "UPDATE: Identified issue in database connection pool. Working on fix."
```

**Close Incident**:

```bash
/slack post "RESOLVED: Error rate back to normal. Post-mortem to follow."
```

### Major Incident Template

```markdown
# Major Incident Report

**Date**: YYYY-MM-DD
**Incident ID**: INC-YYYY-MM
**Severity**: P1/P2/P3
**Duration**: X hours
**Impact**: Y users affected

## Summary
Brief description of what happened

## Timeline
- HH:MM - Incident detected
- HH:MM - Investigation started
- HH:MM - Root cause identified
- HH:MM - Fix applied
- HH:MM - Service restored

## Root Cause
Technical root cause analysis

## Resolution
What was done to fix it

## Prevention
What will be done to prevent recurrence

## Action Items
- [ ] Action item 1
- [ ] Action item 2
```
---

## Common Issues & Solutions

### Issue: API Returns 500 Errors

**Symptoms**:

- API endpoints returning 500 status codes
- Logs show database connection errors

**Diagnosis**:

```bash
# Check backend logs
kubectl logs -l app=backend -n phase-5 --tail=50

# Check database connectivity
kubectl exec -it <backend-pod> -n phase-5 -- psql ${DATABASE_URL}
```

**Solutions**:

1. Restart backend pods
   ```bash
   kubectl rollout restart deployment/backend -n phase-5
   ```
2. Check database credentials
   ```bash
   kubectl describe secret db-credentials -n phase-5
   ```
3. Scale database if needed
   ```bash
   kubectl patch postgresql postgres -n phase-5 --type='json' \
     -p='[{"op": "replace", "path": "/spec/resources/limits/memory", "value":"2Gi"}]'
   ```
### Issue: WebSocket Connections Dropping

**Symptoms**:

- Clients disconnected frequently
- WebSocket errors in logs

**Diagnosis**:

```bash
# Check WebSocket connections
kubectl logs -l app=backend -n phase-5 | grep -i websocket

# Check ingress timeout
kubectl describe ingress websocket-ingress -n phase-5
```

**Solutions**:

1. Increase ingress timeouts (idle WebSocket connections are closed when the proxy read timeout expires)
   ```yaml
   nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
   nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
   ```
2. Ensure HTTP/1.1 proxying; ingress-nginx handles the WebSocket `Upgrade` handshake automatically over HTTP/1.1, so no separate "enable websocket" annotation is needed
   ```yaml
   nginx.ingress.kubernetes.io/proxy-http-version: "1.1"
   ```
### Issue: Reminders Not Sending

**Symptoms**:

- Reminders scheduled but not delivered
- No email notifications received

**Diagnosis**:

```bash
# Check notification service logs
kubectl logs -l app=notification -n phase-5

# Check reminder scheduler logs
kubectl logs -l app=backend -c reminder-scheduler -n phase-5

# Check Kafka topic
docker exec redpanda-1 rpk topic consume reminders -n 10
```

**Solutions**:

1. Restart notification service
   ```bash
   kubectl rollout restart deployment/notification -n phase-5
   ```
2. Verify SendGrid credentials
   ```bash
   kubectl describe secret sendgrid-config -n phase-5
   ```
3. Check Kafka connectivity
   ```bash
   kubectl exec -it <backend-pod> -n phase-5 -- nc -zv kafka 9092
   ```
### Issue: High CPU/Memory Usage

**Symptoms**:

- Pods running out of resources
- OOMKilled errors

**Diagnosis**:

```bash
# Check resource usage
kubectl top pods -n phase-5
kubectl describe pod <pod-name> -n phase-5 | grep -A 5 "Limits"
```

**Solutions**:

1. Adjust resource limits
   ```bash
   kubectl set resources deployment/backend \
     --limits=cpu=2000m,memory=2Gi \
     --requests=cpu=500m,memory=512Mi \
     -n phase-5
   ```
2. Enable HPA
   ```bash
   kubectl apply -f k8s/autoscaler.yaml
   ```
3. Profile application for memory leaks
   ```bash
   kubectl exec -it <backend-pod> -n phase-5 -- python -m memory_profiler src/main.py
   ```
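When comparing limits like `2Gi` or `500m` against `kubectl top` output, it helps to normalize units. A sketch that handles only the quantity suffixes used in this runbook (not the full Kubernetes quantity grammar):

```python
def parse_memory(quantity: str) -> int:
    """Convert a memory quantity like '512Mi' or '2Gi' to bytes.
    Only the binary suffixes used in this runbook are handled."""
    units = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}
    for suffix, factor in units.items():
        if quantity.endswith(suffix):
            return int(quantity[:-2]) * factor
    return int(quantity)  # plain bytes

def parse_cpu(quantity: str) -> float:
    """Convert a CPU quantity like '500m' or '2' to cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)

print(parse_memory("2Gi"))  # 2147483648
print(parse_cpu("500m"))    # 0.5
```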
---

## Maintenance Procedures

### Rolling Update

```bash
# Update image
kubectl set image deployment/backend \
  backend=YOUR_REGISTRY/todo-app-backend:v2.0 \
  -n phase-5

# Watch rollout status
kubectl rollout status deployment/backend -n phase-5

# If issues occur, rollback
kubectl rollout undo deployment/backend -n phase-5
```

### Database Migration

```bash
# Get backend pod
BACKEND_POD=$(kubectl get pod -n phase-5 -l app=backend -o jsonpath='{.items[0].metadata.name}')

# Run migration script
kubectl exec -n phase-5 ${BACKEND_POD} -- python scripts/migrate.py

# Verify migration
kubectl exec -n phase-5 ${BACKEND_POD} -- python scripts/verify_migration.py
```
### Secret Rotation

```bash
# 1. Generate new secret
NEW_PASSWORD=$(openssl rand -base64 32)

# 2. Create the replacement secret
kubectl create secret generic db-credentials-new \
  --from-literal=username=postgres \
  --from-literal=password=${NEW_PASSWORD} \
  --from-literal=host=postgres.postgres.svc.cluster.local \
  -n phase-5

# 3. Update deployment to use new secret
kubectl set env deployment/backend \
  --from=secret/db-credentials-new \
  -n phase-5

# 4. Rollout restart
kubectl rollout restart deployment/backend -n phase-5

# 5. Delete the old secret (kubectl has no rename command;
#    the deployment now references db-credentials-new directly)
kubectl delete secret db-credentials -n phase-5
```
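Step 1's `openssl rand -base64 32` produces a base64-encoded 256-bit password. If `openssl` is unavailable, Python's `secrets` module gives an equivalent (a sketch, not part of the rotation tooling):

```python
import base64
import secrets

def generate_password(n_bytes: int = 32) -> str:
    """Base64 of n cryptographically random bytes, like `openssl rand -base64 32`."""
    return base64.b64encode(secrets.token_bytes(n_bytes)).decode("ascii")

pw = generate_password()
print(len(pw))  # 44 (32 bytes -> 44 base64 characters)
```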
---

## Performance Tuning

### Database Optimization

```sql
-- Check slow queries (requires the pg_stat_statements extension)
SELECT query, mean_exec_time, calls
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;

-- Create indexes
CREATE INDEX idx_tasks_user_id_status ON tasks(user_id, status);
CREATE INDEX idx_tasks_due_date ON tasks(due_date);

-- Analyze query performance
EXPLAIN ANALYZE SELECT * FROM tasks WHERE user_id = 'xxx' AND status = 'active';
```
### API Caching

```python
# In-process caching with lru_cache (note: this cache is per-process
# and never expires; use Redis for cross-pod caching with TTLs)
from functools import lru_cache

@lru_cache(maxsize=1000)
def get_user_tasks(user_id: str):
    # Cached implementation
    pass
```
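Because `lru_cache` never invalidates entries, stale task lists can linger after updates. A minimal TTL wrapper sketch (illustrative only, not the application's actual caching layer):

```python
import time
from functools import wraps

def ttl_cache(ttl_seconds: float):
    """Cache a single-argument function's results for ttl_seconds."""
    def decorator(fn):
        store = {}  # key -> (expiry_timestamp, value)
        @wraps(fn)
        def wrapper(key):
            now = time.monotonic()
            hit = store.get(key)
            if hit and hit[0] > now:
                return hit[1]          # still fresh: serve cached value
            value = fn(key)
            store[key] = (now + ttl_seconds, value)
            return value
        return wrapper
    return decorator

calls = []
@ttl_cache(ttl_seconds=60)
def get_user_tasks(user_id):
    calls.append(user_id)
    return [f"task-for-{user_id}"]

get_user_tasks("u1")
get_user_tasks("u1")  # served from cache
print(len(calls))     # 1
```

Even with a TTL this remains per-process; each backend pod keeps its own cache, which is fine for read-mostly data but not for strict consistency.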
### Connection Pooling

```python
# Adjust database pool size in config.py
import os

DATABASE_POOL_SIZE = int(os.getenv("DATABASE_POOL_SIZE", "20"))
DATABASE_MAX_OVERFLOW = int(os.getenv("DATABASE_MAX_OVERFLOW", "10"))
```
---

## Capacity Planning

### Scaling Metrics

**Current Capacity**:

- Backend: 3-10 pods (via HPA)
- Database: 1 pod (can scale vertically)
- Kafka: 3 brokers, 6 partitions

**When to Scale**:

- CPU > 70% for 5 minutes → Scale up
- Memory > 80% for 5 minutes → Scale up
- Request rate > 100 req/sec/pod → Scale up
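The three scale-up rules above can be encoded directly, e.g. for a dashboard annotation or a pre-HPA sanity check (thresholds copied from the list; the function itself is a sketch):

```python
def should_scale_up(cpu_pct: float, mem_pct: float, req_per_sec_per_pod: float) -> bool:
    """Apply the runbook's scale-up rules to sustained (5-minute) metrics."""
    return cpu_pct > 70 or mem_pct > 80 or req_per_sec_per_pod > 100

print(should_scale_up(cpu_pct=75, mem_pct=40, req_per_sec_per_pod=50))  # True
print(should_scale_up(cpu_pct=30, mem_pct=40, req_per_sec_per_pod=50))  # False
```

Note the inputs must already be sustained averages; firing on instantaneous spikes would cause scaling flapping.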
**Quarterly Review**:

- Analyze growth trends
- Plan capacity upgrades
- Budget for additional resources

---
## Disaster Recovery

### RTO & RPO Targets

| Service | Recovery Time Objective (RTO) | Recovery Point Objective (RPO) |
|---------|-------------------------------|--------------------------------|
| Backend API | 15 minutes | 5 minutes |
| Database | 1 hour | 15 minutes |
| Kafka | 30 minutes | 10 minutes |
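RPO compliance can be checked by comparing the age of the newest backup against the target. A sketch (the timestamps below are illustrative, matching the backup naming used in the recovery procedure):

```python
from datetime import datetime, timedelta

def rpo_satisfied(last_backup: datetime, now: datetime, rpo: timedelta) -> bool:
    """True if the newest backup is recent enough to meet the RPO."""
    return now - last_backup <= rpo

now = datetime(2026, 2, 4, 2, 30)
last_backup = datetime(2026, 2, 4, 2, 20)  # 10 minutes old
print(rpo_satisfied(last_backup, now, timedelta(minutes=15)))  # True -> database RPO met
```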
### Recovery Procedures

**1. Restore from Backup**:

```bash
./scripts/backup-database.sh restore todo-app-backup-20260204_020000.sql.gz
```

**2. Failover to Backup Region**:

```bash
# Update DNS to point to backup region
# Deploy backup stack
kubectl apply -f k8s/backup-region/

# Verify failover
curl https://backup.todo-app.com/health
```

**3. Full Disaster Recovery**:

```bash
# 1. Provision new cluster
# 2. Install dependencies (Dapr, Kafka, etc.)
# 3. Deploy from Helm charts
# 4. Restore database from backup
# 5. Update DNS to new cluster
# 6. Verify all services
```

---

## Contact Information

- **On-Call**: +1-XXX-XXX-XXXX (PagerDuty)
- **Slack**: #todo-app-ops
- **Email**: ops@yourdomain.com
- **GitHub**: https://github.com/your-org/todo-app/issues
---

**Maintained By**: DevOps Team