Phase 5 Operations Runbook

Version: 1.0
Last Updated: 2026-02-04
Purpose: Operational procedures for Phase 5 Todo Application

Daily Operations
Alerting & On-Call
Incident Response
Common Issues & Solutions
Maintenance Procedures
Performance Tuning
Capacity Planning
Disaster Recovery

Daily Operations

Morning Checklist (Daily 9 AM)

Check Grafana dashboards for anomalies
Verify all pods are running
Check error rates in Prometheus
Review overnight alerts
Verify backups completed successfully

Commands:

# Check pod health
kubectl get pods -n phase-5

# Check backup jobs
kubectl get cronjobs -n phase-5
kubectl logs -l job-name=database-backup -n phase-5 --tail=-1

# Check system metrics
kubectl top pods -n phase-5
kubectl top nodes
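The pod check can also be scripted for the morning checklist. The following is a minimal sketch, not part of the deployed tooling: it takes the parsed output of `kubectl get pods -n phase-5 -o json` and lists pods that are not fully ready. The names `unhealthy_pods` and `fetch_pods` are hypothetical.

```python
import json
import subprocess


def unhealthy_pods(pod_list: dict) -> list[str]:
    """Return names of pods that are not Running with all containers ready."""
    bad = []
    for pod in pod_list.get("items", []):
        name = pod["metadata"]["name"]
        status = pod.get("status", {})
        phase = status.get("phase")
        if phase == "Succeeded":
            continue  # completed Job pods (for example backups) are fine
        statuses = status.get("containerStatuses", [])
        all_ready = bool(statuses) and all(c.get("ready", False) for c in statuses)
        if phase != "Running" or not all_ready:
            bad.append(name)
    return bad


def fetch_pods(namespace: str = "phase-5") -> dict:
    """Call kubectl and parse its JSON output (requires cluster access)."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)
```

With cluster access, `unhealthy_pods(fetch_pods())` returns the pods that need a closer look with `kubectl describe pod`.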

Weekly Review (Friday 4 PM)

Review weekly performance metrics
Check SSL certificate expiry
Review and rotate secrets (if needed)
Clean up old backups
Review and update runbook

Commands:

# Check certificate expiry
kubectl get certificates -n phase-5
kubectl describe certificate backend-api-tls -n phase-5 | grep "Not After"

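# Clean up old backups: keeps the 30 most recent snapshots, which equals 30 days
# only with one snapshot per day and names that sort chronologically (head -n -30 needs GNU head)
aws s3 ls s3://todo-app-backups/snapshots/ | \
  awk '{print $4}' | \
  head -n -30 | \
  xargs -I {} aws s3 rm s3://todo-app-backups/snapshots/{}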
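The pipeline above keeps a fixed number of objects rather than a number of days. If retention should follow calendar days, the cutoff can be computed from the timestamp in the file name. A minimal sketch, assuming snapshot names follow the `todo-app-backup-YYYYMMDD_HHMMSS.sql.gz` pattern shown in the restore example under Disaster Recovery; `expired_snapshots` is a hypothetical helper.

```python
import re
from datetime import datetime, timedelta

# Pattern taken from the restore example: todo-app-backup-20260204_020000.sql.gz
SNAPSHOT_RE = re.compile(r"todo-app-backup-(\d{8}_\d{6})\.sql\.gz$")


def expired_snapshots(keys, now, keep_days=30):
    """Return the keys whose embedded timestamp is older than keep_days."""
    cutoff = now - timedelta(days=keep_days)
    expired = []
    for key in keys:
        match = SNAPSHOT_RE.search(key)
        if not match:
            continue  # never delete files that cannot be dated
        taken = datetime.strptime(match.group(1), "%Y%m%d_%H%M%S")
        if taken < cutoff:
            expired.append(key)
    return expired
```

Files that do not match the pattern are skipped on purpose, so an unexpected object in the bucket is left alone instead of being deleted.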

Alerting & On-Call

Alert Severity Levels

Severity	Response Time	Escalation	Examples
P1 - Critical	15 minutes	30 min	Complete service outage, data loss
P2 - High	1 hour	2 hours	Service degradation, high error rate
P3 - Medium	4 hours	Next business day	Performance issues, minor bugs
P4 - Low	1 week	N/A	Documentation, cosmetic issues
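The response times in the table translate directly into acknowledgement deadlines. A small sketch of that arithmetic follows; the function names are illustrative, and escalation is left out because "Next business day" depends on a calendar this runbook does not define.

```python
from datetime import datetime, timedelta

# Response times from the severity table above
RESPONSE_TIME = {
    "P1": timedelta(minutes=15),
    "P2": timedelta(hours=1),
    "P3": timedelta(hours=4),
    "P4": timedelta(weeks=1),
}


def response_deadline(severity: str, triggered_at: datetime) -> datetime:
    """Latest time by which the alert should be acknowledged."""
    return triggered_at + RESPONSE_TIME[severity]


def is_overdue(severity: str, triggered_at: datetime, now: datetime) -> bool:
    """True once the response window for this severity has passed."""
    return now > response_deadline(severity, triggered_at)
```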

Common Alerts

1. HighErrorRate (P1)

Trigger: Error rate > 5% for 5 minutes

Investigation:

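# Check error rate in Prometheus (quote the URL so the shell does not interpret the braces and parentheses;
# open is the macOS command, use xdg-open on Linux)
open 'http://localhost:9090/graph?g0.expr=rate(http_requests_total{status="error"}[5m])'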

# View recent logs
kubectl logs -l app=backend -n phase-5 --tail=100

# Check pod status
kubectl get pods -n phase-5

Resolution:

Identify failing component from logs
Check database connectivity
Verify external services (Kafka, Ollama)
Rollback if recent deployment caused issues
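The HighErrorRate trigger combines a threshold with a duration: the alert fires only if the error rate stays above 5% for the whole 5-minute window, so a single spike does not page anyone. The sketch below illustrates that logic on per-minute request counts. It is not the Prometheus rule itself, and the sample format is an assumption.

```python
def error_ratio(errors: int, total: int) -> float:
    """Fraction of failed requests; 0.0 when there was no traffic."""
    return errors / total if total else 0.0


def should_fire(samples, threshold=0.05, window=5):
    """samples: list of (errors, total) per minute, oldest first.

    Fires only when each of the last `window` minutes exceeds the threshold.
    """
    if len(samples) < window:
        return False
    recent = samples[-window:]
    return all(error_ratio(e, t) > threshold for e, t in recent)
```

One healthy minute inside the window resets the condition, which is why a brief recovery during an incident can clear the alert even though the underlying problem persists.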

2. HighLatency (P2)

Trigger: P95 latency > 3 seconds for 5 minutes

Investigation:

# Check latency metrics (the URL must be quoted: it contains parentheses and a space)
open 'http://localhost:9090/graph?g0.expr=histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))'

# Check database query performance
kubectl logs -l app=backend -n phase-5 | grep "slow query"

# Check resource utilization
kubectl top pods -n phase-5

Resolution:

Scale up pods if CPU/memory constrained
Optimize slow database queries
Restart stuck pods
Enable caching for frequently accessed data
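The P95 value behind the HighLatency alert is an estimate: `histogram_quantile` interpolates linearly inside the bucket that contains the requested rank. The sketch below follows that documented approach for classic histograms to show where the number comes from. The bucket boundaries in the usage note are made up for illustration, and the function expects at least one finite bucket plus the `+Inf` bucket and a quantile between 0 (exclusive) and 1.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets."""
    buckets = sorted(buckets)
    total = buckets[-1][1]  # the +Inf bucket holds the total observation count
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the rank
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    if idx == len(buckets) - 1:
        # Rank falls into the +Inf bucket: report the highest finite bound
        return buckets[-2][0]
    upper, count = buckets[idx]
    lower, prev = (0.0, 0.0) if idx == 0 else buckets[idx - 1]
    # Linear interpolation inside the bucket
    return lower + (upper - lower) * (rank - prev) / (count - prev)
```

For example, with buckets `[(0.5, 80), (1.0, 90), (3.0, 96), (5.0, 99), (math.inf, 100)]` the 95th of 100 observations falls in the 1.0 to 3.0 second bucket, so the estimate lies between those bounds and stays under the 3-second threshold. The result is only as precise as the bucket layout, so a boundary near 3 seconds helps this alert behave predictably.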

3. PodCrashLooping (P1)

Trigger: Pod restart count > 5 in 10 minutes

Investigation:

# Check pod status
kubectl get pods -n phase-5

# Describe pod
kubectl describe pod <pod-name> -n phase-5

# View logs
kubectl logs <pod-name> -n phase-5 --previous

Resolution:

Check logs for error messages
Verify secrets and configuration
Check resource limits
Restart deployment if needed

4. DatabaseConnectionFailed (P1)

Trigger: Cannot connect to database

Investigation:

# Check PostgreSQL pod
kubectl get pods -n phase-5 -l app=postgres

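# Test connection (single quotes make $DATABASE_URL expand inside the pod, not in your local shell;
# assumes the psql client is installed in the backend image)
kubectl exec -it <backend-pod> -n phase-5 -- sh -c 'psql "$DATABASE_URL"'

# Check that the credentials secret exists and has the expected keys (values are not printed)
kubectl describe secret db-credentials -n phase-5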

Resolution:

Verify database pod is running
Check network policies
Rotate credentials if compromised
Restart backend pods after fixing

Incident Response

Incident Lifecycle

Detection - Alert triggered
Acknowledgement - On-call engineer acknowledges
Investigation - Gather diagnostic information
Mitigation - Apply workaround or fix
Resolution - Verify service is restored
Post-Mortem - Document incident and improvements
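The lifecycle is strictly ordered, which matters if incident tooling tracks the current stage. A minimal sketch of that ordering as a state machine follows; the `Stage` enum and `advance` function are illustrative and not part of any existing tool.

```python
from enum import Enum


class Stage(Enum):
    DETECTION = 1
    ACKNOWLEDGEMENT = 2
    INVESTIGATION = 3
    MITIGATION = 4
    RESOLUTION = 5
    POST_MORTEM = 6


def advance(current: Stage) -> Stage:
    """Move to the next lifecycle stage; the final stage has no successor."""
    if current is Stage.POST_MORTEM:
        raise ValueError("incident is already closed out")
    return Stage(current.value + 1)
```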

Incident Commands

Create Incident Channel:

# Create Slack channel
/slack create channel #incident-$(date +%Y%m%d)

# Set topic
/slack set topic "P1 - High Error Rate - Investigating"

Declare Incident:

# Post to team
echo "🚨 INCIDENT DECLARED: High Error Rate
Severity: P1
Time: $(date)
Lead: @on-call
Channel: #incident-$(date +%Y%m%d)
Status: Investigating" | slack post

Update Incident:

# Update status
/slack post "UPDATE: Identified issue in database connection pool. Working on fix."

Close Incident:

/slack post "RESOLVED: Error rate back to normal. Post-mortem to follow."

Major Incident Template

# Major Incident Report

**Date**: YYYY-MM-DD
**Incident ID**: INC-YYYY-MM
**Severity**: P1/P2/P3
**Duration**: X hours
**Impact**: Y users affected

## Summary
Brief description of what happened

## Timeline
- HH:MM - Incident detected
- HH:MM - Investigation started
- HH:MM - Root cause identified
- HH:MM - Fix applied
- HH:MM - Service restored

## Root Cause
Technical root cause analysis

## Resolution
What was done to fix it

## Prevention
What will be done to prevent recurrence

## Action Items
- [ ] Action item 1
- [ ] Action item 2

Common Issues & Solutions

Issue: API Returns 500 Errors

Symptoms:

API endpoints returning 500 status codes
Logs show database connection errors

Diagnosis:

# Check backend logs
kubectl logs -l app=backend -n phase-5 --tail=50

# Check database connectivity (the client binary is psql; $DATABASE_URL is expanded inside the pod)
kubectl exec -it <backend-pod> -n phase-5 -- sh -c 'psql "$DATABASE_URL"'

Solutions:

Restart backend pods

kubectl rollout restart deployment/backend -n phase-5

Check database credentials

kubectl describe secret db-credentials -n phase-5

Scale database if needed

kubectl patch postgresql postgres -n phase-5 --type='json' \
  -p='[{"op": "replace", "path": "/spec/resources/limits/memory", "value":"2Gi"}]'

Issue: WebSocket Connections Dropping

Symptoms:

Clients disconnected frequently
WebSocket errors in logs

Diagnosis:

# Check WebSocket connections
kubectl logs -l app=backend -n phase-5 | grep -i websocket

# Check ingress timeout
kubectl describe ingress websocket-ingress -n phase-5

Solutions:

Increase ingress timeout

nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"

Enable WebSocket keep-alive

nginx.ingress.kubernetes.io/proxy-http-version: "1.1"
nginx.ingress.kubernetes.io/enable-websocket: "true"

Issue: Reminders Not Sending

Symptoms:

Reminders scheduled but not delivered
No email notifications received

Diagnosis:

# Check notification service logs
kubectl logs -l app=notification -n phase-5

# Check reminder scheduler logs
kubectl logs -l app=backend -c reminder-scheduler -n phase-5

# Check Kafka topic
docker exec redpanda-1 rpk topic consume reminders -n 10

Solutions:

Restart notification service

kubectl rollout restart deployment/notification -n phase-5

Verify SendGrid credentials

kubectl describe secret sendgrid-config -n phase-5

Check Kafka connectivity

kubectl exec -it <backend-pod> -n phase-5 -- nc -zv kafka 9092

Issue: High CPU/Memory Usage

Symptoms:

Pods running out of resources
OOMKilled errors

Diagnosis:

# Check resource usage
kubectl top pods -n phase-5
kubectl describe pod <pod-name> -n phase-5 | grep -A 5 "Limits"

Solutions:

Adjust resource limits

kubectl set resources deployment/backend \
  --limits=cpu=2000m,memory=2Gi \
  --requests=cpu=500m,memory=512Mi \
  -n phase-5

Enable HPA
```
kubectl apply -f k8s/autoscaler.yaml
```

Profile application for memory leaks

kubectl exec -it <backend-pod> -n phase-5 -- python -m memory_profiler src/main.py

Maintenance Procedures

Rolling Update

# Update image
kubectl set image deployment/backend \
  backend=YOUR_REGISTRY/todo-app-backend:v2.0 \
  -n phase-5

# Watch rollout status
kubectl rollout status deployment/backend -n phase-5

# If issues occur, rollback
kubectl rollout undo deployment/backend -n phase-5

Database Migration

# Get backend pod
BACKEND_POD=$(kubectl get pod -n phase-5 -l app=backend -o jsonpath='{.items[0].metadata.name}')

# Run migration script
kubectl exec -n phase-5 ${BACKEND_POD} -- python scripts/migrate.py

# Verify migration
kubectl exec -n phase-5 ${BACKEND_POD} -- python scripts/verify_migration.py

Secret Rotation

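# 1. Generate new secret
NEW_PASSWORD=$(openssl rand -base64 32)

# 2. Set the new password in PostgreSQL itself before switching the application
#    (for example with ALTER USER); otherwise the new credentials are rejected

# 3. Create a secret with the new credentials
kubectl create secret generic db-credentials-new \
  --from-literal=username=postgres \
  --from-literal=password="${NEW_PASSWORD}" \
  --from-literal=host=postgres.postgres.svc.cluster.local \
  -n phase-5

# 4. Update deployment to use new secret
kubectl set env deployment/backend \
  --from=secret/db-credentials-new \
  -n phase-5

# 5. Rollout restart and wait for it to finish
kubectl rollout restart deployment/backend -n phase-5
kubectl rollout status deployment/backend -n phase-5

# 6. Restore the original secret name. kubectl has no rename command, so
#    overwrite db-credentials with the new values, point the deployment back
#    at it, and only then delete the temporary secret
kubectl create secret generic db-credentials \
  --from-literal=username=postgres \
  --from-literal=password="${NEW_PASSWORD}" \
  --from-literal=host=postgres.postgres.svc.cluster.local \
  -n phase-5 --dry-run=client -o yaml | kubectl apply -n phase-5 -f -
kubectl set env deployment/backend --from=secret/db-credentials -n phase-5
kubectl rollout status deployment/backend -n phase-5
kubectl delete secret db-credentials-new -n phase-5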

Performance Tuning

Database Optimization

-- Check slow queries (requires the pg_stat_statements extension; mean_exec_time is the column name in PostgreSQL 13+)
SELECT query, mean_exec_time, calls
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;

-- Create indexes
CREATE INDEX idx_tasks_user_id_status ON tasks(user_id, status);
CREATE INDEX idx_tasks_due_date ON tasks(due_date);

-- Analyze query performance
EXPLAIN ANALYZE SELECT * FROM tasks WHERE user_id = 'xxx' AND status = 'active';

API Caching

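# In-process caching with functools.lru_cache: each pod keeps its own copy.
# This is not Redis; use Redis if the cache must be shared across pods.
from functools import lru_cache

@lru_cache(maxsize=1000)
def get_user_tasks(user_id: str):
    # Replace with the real query; the result is cached per user_id
    pass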
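`lru_cache` evicts by size, not by age, so a cached task list can outlive changes to the underlying data. One option is a time-bounded cache. The sketch below is an assumption about how that could look, not the application's actual code: `ttl_cache`, the 30-second TTL, and `get_user_tasks_cached` are all illustrative names and values.

```python
import time
from functools import wraps


def ttl_cache(seconds: float, clock=time.monotonic):
    """Cache results per positional-argument tuple for `seconds`."""
    def decorator(func):
        store = {}  # args -> (expires_at, value); unbounded, fine for a sketch

        @wraps(func)
        def wrapper(*args):
            now = clock()
            hit = store.get(args)
            if hit is not None and hit[0] > now:
                return hit[1]  # still fresh
            value = func(*args)
            store[args] = (now + seconds, value)
            return value

        wrapper.cache_clear = store.clear  # call after writes to the underlying data
        return wrapper
    return decorator


@ttl_cache(seconds=30)
def get_user_tasks_cached(user_id: str):
    # Placeholder for the real database query
    return [f"task-for-{user_id}"]
```

The `clock` parameter exists so the expiry logic can be tested without sleeping. In production the default monotonic clock is the sensible choice.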

Connection Pooling

# Adjust database pool size in config.py
import os

DATABASE_POOL_SIZE = int(os.getenv("DATABASE_POOL_SIZE", "20"))
DATABASE_MAX_OVERFLOW = int(os.getenv("DATABASE_MAX_OVERFLOW", "10"))
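Pool settings multiply across pods, so they should be checked against the database's connection limit before the HPA range is widened. The sketch below does that arithmetic. It assumes one connection pool per pod and that overflow connections count on top of the pool size; the defaults of 100 for `max_connections` and 3 for `superuser_reserved_connections` are PostgreSQL's stock values and may differ on the actual server.

```python
def max_db_connections(pods: int, pool_size: int = 20, max_overflow: int = 10) -> int:
    """Upper bound on connections the backend can open across all pods."""
    return pods * (pool_size + max_overflow)


def fits(pods: int, max_connections: int = 100, reserved: int = 3) -> bool:
    """True if the worst case stays within the PostgreSQL connection limit."""
    return max_db_connections(pods) <= max_connections - reserved
```

With the defaults above, 3 pods can open up to 90 connections and fit, while 10 pods could open up to 300 and would not. Under these assumptions, either the pool settings, the server limit, or a connection pooler in front of the database needs to account for the upper end of the HPA range.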

Capacity Planning

Scaling Metrics

Current Capacity:

Backend: 3-10 pods (via HPA)
Database: 1 pod (can scale vertically)
Kafka: 3 brokers, 6 partitions

When to Scale:

CPU > 70% for 5 minutes → Scale up
Memory > 80% for 5 minutes → Scale up
Request rate > 100 req/sec/pod → Scale up
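For the CPU and memory thresholds, the Horizontal Pod Autoscaler derives the replica count from the ratio of the current metric to its target. The sketch below applies the documented formula, `ceil(currentReplicas * currentMetric / targetMetric)`, clamped to the 3-10 pod range listed above. The real controller also applies a tolerance and stabilization windows, which are omitted here.

```python
import math


def desired_replicas(current: int, current_metric: float, target_metric: float,
                     min_replicas: int = 3, max_replicas: int = 10) -> int:
    """Replica count the HPA formula yields, clamped to the configured bounds."""
    raw = math.ceil(current * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))
```

For example, 3 pods averaging 90% CPU against a 70% target give `ceil(3 * 90 / 70) = 4` pods, while 8 pods at 200% would ask for 23 and be capped at 10.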

Quarterly Review:

Analyze growth trends
Plan capacity upgrades
Budget for additional resources

Disaster Recovery

RTO & RPO Targets

Service	Recovery Time Objective (RTO)	Recovery Point Objective (RPO)
Backend API	15 minutes	5 minutes
Database	1 hour	15 minutes
Kafka	30 minutes	10 minutes
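An RPO is only met if a recovery point newer than the target always exists. The sketch below checks that condition against the table values; the service keys and the `rpo_breached` helper are illustrative.

```python
from datetime import datetime, timedelta

# Targets from the table above
RPO = {
    "backend-api": timedelta(minutes=5),
    "database": timedelta(minutes=15),
    "kafka": timedelta(minutes=10),
}


def rpo_breached(service: str, last_recovery_point: datetime, now: datetime) -> bool:
    """True if a failure right now would lose more data than the RPO allows."""
    return now - last_recovery_point > RPO[service]
```

If the only database recovery point were a nightly dump, this check would fail for most of the day, so a 15-minute target typically needs continuous archiving (for example WAL archiving) or more frequent snapshots.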

Recovery Procedures

1. Restore from Backup:

./scripts/backup-database.sh restore todo-app-backup-20260204_020000.sql.gz

2. Failover to Backup Region:

# Update DNS to point to backup region
# Deploy backup stack
kubectl apply -f k8s/backup-region/

# Verify failover
curl https://backup.todo-app.com/health

3. Full Disaster Recovery:

# 1. Provision new cluster
# 2. Install dependencies (Dapr, Kafka, etc.)
# 3. Deploy from Helm charts
# 4. Restore database from backup
# 5. Update DNS to new cluster
# 6. Verify all services

Contact Information

On-Call: +1-XXX-XXX-XXXX (PagerDuty)
Slack: #todo-app-ops
Email: ops@yourdomain.com
GitHub: https://github.com/your-org/todo-app/issues

Last Updated: 2026-02-04
Version: 1.0
Maintained By: DevOps Team
