SENTI AI — OPERATIONS RUNBOOK
STARTING THE SYSTEM
Full stack:
docker-compose up -d                # databases
ollama serve &                      # LLM
python scratch/start_all.py         # 16 superpacks
python senti/main.py                # main gateway
Verify health:
curl http://localhost:8000/health
curl http://localhost:11434/api/tags
python scratch/check_ports.py
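The contents of scratch/check_ports.py are not shown in this runbook. As a rough sketch of what such a check might look like, the following assumes the gateway listens on 8000, Ollama on 11434, and the superpacks on 9201-9216 (the port numbers used elsewhere in this runbook), all on localhost. The function names are hypothetical.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


def closed_ports(host: str, ports) -> list[int]:
    """Return the ports from `ports` that do not accept a connection."""
    return [p for p in ports if not is_port_open(host, p)]


if __name__ == "__main__":
    # Assumed layout: gateway 8000, Ollama 11434, superpacks 9201-9216.
    expected = [8000, 11434, *range(9201, 9217)]
    down = closed_ports("localhost", expected)
    print("all ports listening" if not down else f"not listening: {down}")
```

An open port only shows that something is listening. The curl checks above are still needed to confirm the services answer correctly.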
DAILY HEALTH CHECKS
curl http://localhost:8000/health
python senti/tests/battle/run_all_battle_tests.py
tail -50 logs/senti_gateway.log | grep ERROR
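On hosts without tail and grep (for example a plain Windows shell), the log check can be approximated in Python. This is a minimal sketch, assuming the log is line-oriented text with the literal marker ERROR. The function name is hypothetical.

```python
from collections import deque
from pathlib import Path


def recent_errors(log_path, last_n: int = 50, marker: str = "ERROR") -> list[str]:
    """Return lines containing `marker` among the last `last_n` lines of the log."""
    with Path(log_path).open("r", encoding="utf-8", errors="replace") as fh:
        tail = deque(fh, maxlen=last_n)  # keeps only the final last_n lines
    return [line.rstrip("\n") for line in tail if marker in line]


if __name__ == "__main__":
    log_file = Path("logs/senti_gateway.log")
    if log_file.exists():
        for line in recent_errors(log_file):
            print(line)
```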
BACKUP PROCEDURE
Run daily (add to cron or Windows Task Scheduler). In a crontab entry, write each % as \% because cron treats an unescaped % as a newline.
Postgres backup
docker exec senti_ai_postgres_1 pg_dump \
    -U senti senti_db > backups/senti_db_$(date +%Y%m%d).sql
Keep last 7 days
find backups/ -name "*.sql" -mtime +7 -delete
Qdrant backup
docker exec senti_ai_qdrant_1 \
    tar czf /tmp/qdrant_backup.tar.gz /qdrant/storage
docker cp senti_ai_qdrant_1:/tmp/qdrant_backup.tar.gz \
    backups/qdrant_$(date +%Y%m%d).tar.gz
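Where find is unavailable (the Windows Task Scheduler case), the seven-day retention rule can be approximated in Python. This is a minimal sketch with hypothetical function names. Unlike find, it does not recurse into subdirectories, and it compares raw modification times rather than find's whole-day rounding. It defaults to a dry run so nothing is deleted by accident.

```python
import time
from pathlib import Path


def expired_backups(backup_dir, max_age_days: int = 7, pattern: str = "*.sql", now=None):
    """Return backup files whose modification time is older than max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    return sorted(
        p for p in Path(backup_dir).glob(pattern)
        if p.is_file() and p.stat().st_mtime < cutoff
    )


def prune_backups(backup_dir, max_age_days: int = 7, pattern: str = "*.sql",
                  dry_run: bool = True):
    """Delete expired backups; with dry_run=True only report what would be removed."""
    victims = expired_backups(backup_dir, max_age_days, pattern)
    if not dry_run:
        for path in victims:
            path.unlink()
    return victims
```

The find rule above matches only *.sql, so the Qdrant archives are never pruned by it. Calling prune_backups("backups", pattern="*.tar.gz", dry_run=False) would apply the same window to them, if that is the intent.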
RESTORE PROCEDURE
Postgres restore
docker exec -i senti_ai_postgres_1 psql \
    -U senti senti_db < backups/senti_db_20260101.sql
Restart after restore
docker-compose restart
ROLLBACK PROCEDURE
If a deployment breaks the system:
1. Stop the broken version
docker-compose down
2. Revert the code
git stash                            # if the breakage is in uncommitted local changes
git checkout last-known-good-tag     # if the breakage was committed
3. Restore the last working database
(from backups/)
4. Restart
docker-compose up -d
python senti/main.py
LOG ROTATION
Logs will grow. Set up rotation:
Add to logrotate (/etc/logrotate.d/senti):
/path/to/senti_ai/logs/*.log {
    daily
    rotate 14
    compress
    missingok
    notifempty
}
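On hosts without logrotate (for example Windows), a comparable policy could be applied inside the Python process using the standard library's TimedRotatingFileHandler. This is a swapped-in alternative, not something this runbook says the gateway already does. The logger name and function name are assumptions, and unlike the logrotate rule it does not compress old files.

```python
import logging
from logging.handlers import TimedRotatingFileHandler
from pathlib import Path


def build_rotating_logger(log_file, name: str = "senti_gateway") -> logging.Logger:
    """Return a logger that rotates its file at midnight and keeps 14 old files."""
    Path(log_file).parent.mkdir(parents=True, exist_ok=True)
    handler = TimedRotatingFileHandler(
        log_file, when="midnight", backupCount=14, encoding="utf-8"
    )
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    return logger
```

Use one mechanism or the other. Running logrotate against a file that the process is also rotating itself is likely to produce confusing results.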
MONITORING ALERTS
What to alert on:
- /health endpoint returns non-200
- Error rate > 5% in last 10 minutes
- LLM response time > 30 seconds
- Database connection errors
- Memory usage > 80%
Simple monitoring script (runs every 5 minutes, sends Telegram alert if down):
python senti/monitoring/health_check.py
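The alert thresholds listed in this section can be written as a single pure function, which keeps the rules easy to test apart from however the metrics are collected. This is a sketch only: the parameter names are hypothetical, and it is not the logic of senti/monitoring/health_check.py, which this runbook does not show.

```python
def alerts_for(health_status: int, requests_10m: int, errors_10m: int,
               llm_seconds: float, db_connection_errors: int,
               memory_percent: float) -> list[str]:
    """Return one message per alert condition that the given metrics violate."""
    alerts = []
    if health_status != 200:
        alerts.append(f"/health returned {health_status}")
    # Error rate over the last 10 minutes; skipped when there was no traffic.
    if requests_10m > 0 and errors_10m / requests_10m > 0.05:
        alerts.append(f"error rate {errors_10m / requests_10m:.1%} in last 10 minutes")
    if llm_seconds > 30:
        alerts.append(f"LLM response time {llm_seconds:.0f}s")
    if db_connection_errors > 0:
        alerts.append(f"{db_connection_errors} database connection error(s)")
    if memory_percent > 80:
        alerts.append(f"memory usage {memory_percent:.0f}%")
    return alerts
```

An empty list means nothing needs to be sent. A non-empty list could be joined into one Telegram message, so a single outage does not trigger several notifications.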
WHAT TO DO WHEN SOMETHING BREAKS
LLM returns garbage:
Restart Ollama: stop the running ollama serve process, then start it again with ollama serve & (the ollama CLI has no restart subcommand)
Check: does senti-shujaa still load?
Fallback: set PRIMARY_MODEL_NAME=qwen2.5 in .env

Database connection errors:
docker-compose restart postgres
Check: is disk full?

Superpack 404 errors:
python scratch/start_all.py
Check: are ports 9201-9216 listening?
python scratch/check_ports.py

Memory leaks:
Restart the gateway
Check Redis: redis-cli info memory
Check: is session memory TTL set to 2 hours?

Slow responses (> 30 seconds):
Check: is Ollama using CPU or GPU?
Check: is Redis available?
Check: is FEE handling simple queries (not LLM)?
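The "is disk full?" check for database connection errors can be scripted with the standard library. This is a minimal sketch: the function names and the 90% threshold are assumptions. The path to check is whichever host directory holds the Postgres data, which this runbook does not specify.

```python
import shutil


def disk_percent_used(path: str = ".") -> float:
    """Return the percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def disk_is_nearly_full(path: str = ".", threshold: float = 90.0) -> bool:
    """True when usage is at or above `threshold` percent (90 is an assumed default)."""
    return disk_percent_used(path) >= threshold


if __name__ == "__main__":
    print(f"disk usage: {disk_percent_used('.'):.1f}%")
```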
WHAT NEVER TO DO IN PRODUCTION
Never:
- rm -rf postgres_data/
- docker-compose down --volumes
- Delete senti_db.sqlite without backup
- Change ENCRYPTION_KEY after users have data (all encrypted data becomes unreadable)
- Run migration scripts without testing on backup first