| # SENTI AI — OPERATIONS RUNBOOK | |
| ## STARTING THE SYSTEM | |
| Full stack: | |
| docker-compose up -d (databases) | |
| ollama serve & (LLM) | |
| python scratch/start_all.py (16 superpacks) | |
| python senti/main.py (main gateway) | |
| Verify health: | |
| curl http://localhost:8000/health | |
| curl http://localhost:11434/api/tags | |
| python scratch/check_ports.py | |
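The two curl checks above can also be run from one script, which is handy on hosts without curl. This is a minimal sketch, not the project's own tooling: the URLs come from the commands above, while the names `ENDPOINTS`, `is_up`, and `check_all` are hypothetical.

```python
import http.client
import urllib.error
import urllib.request

# URLs taken from the "Verify health" commands in this runbook.
ENDPOINTS = {
    "gateway": "http://localhost:8000/health",
    "ollama": "http://localhost:11434/api/tags",
}


def is_up(url, timeout=5.0):
    """Return True if the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Connection refused, timeout, non-2xx status, or malformed URL.
        return False


def check_all(endpoints=ENDPOINTS, probe=is_up):
    """Map each service name to True (healthy) or False (down or non-200)."""
    return {name: probe(url) for name, url in endpoints.items()}
```

Calling `check_all()` returns a dict such as `{"gateway": ..., "ollama": ...}` with one boolean per service. The `probe` parameter exists only so the function can be exercised without a running stack.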
| ## DAILY HEALTH CHECKS | |
| curl http://localhost:8000/health | |
| python senti/tests/battle/run_all_battle_tests.py | |
| tail -n 50 logs/senti_gateway.log | grep ERROR | |
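On Windows hosts, where `tail` and `grep` may not be available, the same log check can be approximated in Python. This is a sketch under the assumption that the gateway log is a plain UTF-8 text file; `recent_errors` is a hypothetical helper name.

```python
from collections import deque


def recent_errors(log_path, last_n=50, needle="ERROR"):
    """Return the lines containing `needle` among the last `last_n` lines."""
    with open(log_path, encoding="utf-8", errors="replace") as fh:
        # A bounded deque keeps only the final `last_n` lines of the file.
        tail = deque(fh, maxlen=last_n)
    return [line.rstrip("\n") for line in tail if needle in line]
```

For example, `recent_errors("logs/senti_gateway.log")` mirrors the `tail ... | grep ERROR` command above.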
| ## BACKUP PROCEDURE | |
| Run daily (add to cron or Windows Task Scheduler). In a crontab entry, escape each % as \% or put the commands in a script: | |
| # Postgres backup | |
| docker exec senti_ai_postgres_1 pg_dump \ | |
| -U senti senti_db > backups/senti_db_$(date +%Y%m%d).sql | |
| # Keep last 7 days (SQL dumps and Qdrant archives) | |
| find backups/ \( -name "*.sql" -o -name "*.tar.gz" \) -mtime +7 -delete | |
| # Qdrant backup | |
| docker exec senti_ai_qdrant_1 \ | |
| tar czf /tmp/qdrant_backup.tar.gz /qdrant/storage | |
| docker cp senti_ai_qdrant_1:/tmp/qdrant_backup.tar.gz \ | |
| backups/qdrant_$(date +%Y%m%d).tar.gz | |
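Where `find` is not available (for example under Windows Task Scheduler), the 7-day retention step could be sketched in Python. This is an illustration rather than part of the project: `prune_backups` is a hypothetical name, and it assumes file modification time is an acceptable proxy for backup age. Its cutoff is exactly 7 x 24 hours, which differs slightly from the whole-day rounding of `find -mtime +7`.

```python
import time
from pathlib import Path


def prune_backups(backup_dir, max_age_days=7,
                  patterns=("*.sql", "*.tar.gz"), now=None):
    """Delete matching backup files older than max_age_days.

    Returns the sorted names of the files that were deleted.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds in a day
    deleted = []
    for pattern in patterns:
        for path in Path(backup_dir).glob(pattern):
            if path.is_file() and path.stat().st_mtime < cutoff:
                path.unlink()
                deleted.append(path.name)
    return sorted(deleted)
```

Calling `prune_backups("backups")` after the daily dump removes both old SQL dumps and old Qdrant archives in one pass.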
| ## RESTORE PROCEDURE | |
| # Postgres restore | |
| docker exec -i senti_ai_postgres_1 psql \ | |
| -U senti senti_db < backups/senti_db_20260101.sql | |
| # Restart after restore | |
| docker-compose restart | |
| ## ROLLBACK PROCEDURE | |
| If a deployment breaks the system: | |
| # 1. Stop the broken version | |
| docker-compose down | |
| # 2. Revert the code | |
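| git stash                          # if the breakage is in uncommitted local changes | |
| git checkout last-known-good-tag   # or, if it was committed, return to the last known-good tag | |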
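| # 3. Start the databases (the restore needs a running Postgres container) | |
| docker-compose up -d | |
| # 4. Restore the last working database from backups/ (see RESTORE PROCEDURE) | |
| # 5. Start the gateway | |
| python senti/main.py | |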
| ## LOG ROTATION | |
| Logs will grow. Set up rotation: | |
| Add to logrotate (/etc/logrotate.d/senti): | |
| /path/to/senti_ai/logs/*.log { | |
| daily | |
| rotate 14 | |
| compress | |
| missingok | |
| notifempty | |
| } | |
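logrotate is not available on Windows. If the gateway writes its log through Python's standard `logging` module (an assumption, since the runbook does not say), a similar daily, 14-file policy can be sketched with `TimedRotatingFileHandler`. The function name `build_rotating_logger` and the log format are hypothetical.

```python
import logging
from logging.handlers import TimedRotatingFileHandler


def build_rotating_logger(log_path, name="senti_gateway"):
    """Return a logger whose file rotates at midnight, keeping 14 old files."""
    handler = TimedRotatingFileHandler(
        log_path,
        when="midnight",   # rotate once per day, like `daily`
        backupCount=14,    # keep 14 rotated files, like `rotate 14`
        encoding="utf-8",
        delay=True,        # do not create the file until the first record
    )
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    return logger
```

Unlike the logrotate block above, this handler does not compress rotated files. The `compress` directive has no built-in equivalent in the standard library handler.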
| ## MONITORING ALERTS | |
| What to alert on: | |
| - /health endpoint returns non-200 | |
| - Error rate > 5% in last 10 minutes | |
| - LLM response time > 30 seconds | |
| - Database connection errors | |
| - Memory usage > 80% | |
| Simple monitoring script: | |
| python senti/monitoring/health_check.py | |
| (runs every 5 minutes, sends Telegram alert if down) | |
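The alert conditions listed above can be expressed as one pure function, which keeps the thresholds in a single place. This is a sketch only and not the contents of `senti/monitoring/health_check.py`: the `Snapshot` fields and `alerts_for` are hypothetical names, and collecting the metrics and sending the Telegram message are left out.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """One reading of the monitored signals (field names are hypothetical)."""
    health_status: int          # HTTP status returned by /health
    requests_10m: int           # requests seen in the last 10 minutes
    errors_10m: int             # failed requests in the same window
    llm_seconds: float          # most recent LLM response time, in seconds
    db_connection_errors: int   # database connection errors in the window
    memory_percent: float       # memory usage, 0-100


def alerts_for(s: Snapshot) -> list[str]:
    """Return one message per runbook threshold that the snapshot violates."""
    alerts = []
    if s.health_status != 200:
        alerts.append(f"/health returned {s.health_status}")
    if s.requests_10m > 0 and s.errors_10m / s.requests_10m > 0.05:
        alerts.append(
            f"error rate {s.errors_10m}/{s.requests_10m} exceeds 5% in last 10 minutes"
        )
    if s.llm_seconds > 30:
        alerts.append(f"LLM response time {s.llm_seconds:.1f}s exceeds 30s")
    if s.db_connection_errors > 0:
        alerts.append(f"{s.db_connection_errors} database connection error(s)")
    if s.memory_percent > 80:
        alerts.append(f"memory usage {s.memory_percent:.0f}% exceeds 80%")
    return alerts
```

An empty list means nothing needs to be sent. All comparisons are strict, so a reading exactly at a threshold (5% errors, 30 seconds, 80% memory) does not alert.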
| ## WHAT TO DO WHEN SOMETHING BREAKS | |
| LLM returns garbage: | |
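| Restart the Ollama server (the ollama CLI has no "restart" subcommand): | |
| pkill -f "ollama serve"; ollama serve & | |
| (or: sudo systemctl restart ollama, if Ollama runs as a systemd service) | |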
| Check: does senti-shujaa still load? | |
| Fallback: set PRIMARY_MODEL_NAME=qwen2.5 in .env | |
| Database connection errors: | |
| docker-compose restart postgres | |
| Check: is disk full? | |
| Superpack 404 errors: | |
| python scratch/start_all.py | |
| Check: are ports 9201-9216 listening? | |
| python scratch/check_ports.py | |
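For reference, a port check over the 9201-9216 range can be sketched with the standard `socket` module. This is not the contents of `scratch/check_ports.py`. It assumes the superpacks listen on TCP at 127.0.0.1, and the names `SUPERPACK_PORTS`, `port_is_listening`, and `closed_ports` are hypothetical.

```python
import socket

# Ports 9201-9216 inclusive, as listed in this runbook (16 superpacks).
SUPERPACK_PORTS = range(9201, 9217)


def port_is_listening(port, host="127.0.0.1", timeout=0.5):
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise.
        return sock.connect_ex((host, port)) == 0


def closed_ports(ports=SUPERPACK_PORTS, host="127.0.0.1"):
    """Return the ports that refuse the connection or time out."""
    return [p for p in ports if not port_is_listening(p, host)]
```

An empty result from `closed_ports()` means all 16 ports accepted a connection. Any port it returns points at a superpack that needs restarting.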
| Memory leaks: | |
| Restart the gateway | |
| Check Redis: redis-cli info memory | |
| Check: is the session memory TTL still set to 2 hours? | |
| Slow responses (> 30 seconds): | |
| Check: is Ollama using CPU or GPU? | |
| Check: is Redis available? | |
| Check: are simple queries being handled by FEE rather than the LLM? | |
| ## WHAT NEVER TO DO IN PRODUCTION | |
| Never: | |
| rm -rf postgres_data/ | |
| docker-compose down --volumes | |
| Delete senti_db.sqlite without a backup | |
| Change ENCRYPTION_KEY after users have data | |
| (all encrypted data becomes unreadable) | |
| Run migration scripts without testing them on a backup first | |