joseph njoroge kariuki

SENTI AI — OPERATIONS RUNBOOK

STARTING THE SYSTEM

Full stack:

docker-compose up -d              # databases
ollama serve &                    # LLM
python scratch/start_all.py       # 16 superpacks
python senti/main.py              # main gateway

Verify health:

curl http://localhost:8000/health
curl http://localhost:11434/api/tags
python scratch/check_ports.py
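The two curl checks above can also be scripted. The following is a minimal sketch, assuming an HTTP 200 from each URL means "healthy"; the function names and the ENDPOINTS dictionary are illustrative and not part of the Senti codebase.

```python
import urllib.error
import urllib.request

# URLs taken from the "Verify health" commands; the dictionary keys are illustrative.
ENDPOINTS = {
    "gateway": "http://localhost:8000/health",
    "ollama": "http://localhost:11434/api/tags",
}


def http_status(url, timeout=5.0):
    """Return the HTTP status code for url, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return None  # connection refused, timeout, DNS failure


def check_endpoints(endpoints, fetch=http_status):
    """Map each endpoint name to True when it answers with HTTP 200."""
    return {name: fetch(url) == 200 for name, url in endpoints.items()}
```

Calling check_endpoints(ENDPOINTS) returns a dictionary such as {"gateway": True, "ollama": False}; the fetch parameter exists only so the check can be exercised without a running stack.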

DAILY HEALTH CHECKS

curl http://localhost:8000/health
python senti/tests/battle/run_all_battle_tests.py
tail -50 logs/senti_gateway.log | grep ERROR
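On hosts without tail and grep (for example Windows), the log check can be approximated in Python. This is a sketch under the assumption that error lines contain the literal text ERROR; tail_errors is a hypothetical helper name.

```python
from collections import deque


def tail_errors(path, lines=50, marker="ERROR"):
    """Return the lines containing marker among the last `lines` lines of path."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        tail = deque(handle, maxlen=lines)  # keeps only the final `lines` lines
    return [line.rstrip("\n") for line in tail if marker in line]
```

For example, tail_errors("logs/senti_gateway.log") returns a list that is empty when the last 50 lines are clean.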

BACKUP PROCEDURE

Run daily (add to cron or Windows Task Scheduler). In a crontab entry, escape each % as \%, because cron treats an unescaped % as a newline.

Postgres backup

docker exec senti_ai_postgres_1 pg_dump -U senti senti_db > backups/senti_db_$(date +%Y%m%d).sql

Keep last 7 days

find backups/ -name "*.sql" -mtime +7 -delete

Qdrant backup

docker exec senti_ai_qdrant_1 tar czf /tmp/qdrant_backup.tar.gz /qdrant/storage
docker cp senti_ai_qdrant_1:/tmp/qdrant_backup.tar.gz backups/qdrant_$(date +%Y%m%d).tar.gz
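The find command above only matches *.sql, so the Qdrant archives are never pruned. A possible portable alternative that covers both file types is sketched below. It is an illustration, not the project's script: prune_backups is a hypothetical name, and its cutoff (strictly older than max_age_days x 24 hours) is close to, but not identical to, find's -mtime +7 rounding.

```python
import time
from pathlib import Path


def prune_backups(directory, patterns=("*.sql", "*.tar.gz"), max_age_days=7, now=None):
    """Delete files matching patterns whose mtime is older than max_age_days.

    Returns the sorted names of the files that were removed.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # 86400 seconds per day
    removed = []
    for pattern in patterns:
        for path in Path(directory).glob(pattern):
            if path.is_file() and path.stat().st_mtime < cutoff:
                path.unlink()
                removed.append(path.name)
    return sorted(removed)
```

Running prune_backups("backups") after the daily dump would keep roughly one week of both Postgres and Qdrant backups.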

RESTORE PROCEDURE

Postgres restore

docker exec -i senti_ai_postgres_1 psql -U senti senti_db < backups/senti_db_20260101.sql

Restart after restore

docker-compose restart
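The restore command hard-codes a dump date. A small helper can pick the newest dump instead; this sketch assumes dumps follow the senti_db_YYYYMMDD.sql naming produced by the backup procedure, and latest_dump is a hypothetical name.

```python
import re
from pathlib import Path

# Matches the file names written by the Postgres backup command.
DUMP_NAME = re.compile(r"^senti_db_(\d{8})\.sql$")


def latest_dump(directory):
    """Return the Path of the newest senti_db_YYYYMMDD.sql dump, or None."""
    dated = []
    for path in Path(directory).iterdir():
        match = DUMP_NAME.match(path.name)
        if match:
            dated.append((match.group(1), path))
    if not dated:
        return None
    # YYYYMMDD strings sort chronologically, so the largest string is the newest.
    return max(dated, key=lambda item: item[0])[1]
```

The returned path can then replace the hard-coded file name in the psql restore command.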

ROLLBACK PROCEDURE

If a deployment breaks the system:

1. Stop the broken version

docker-compose down

2. Revert the code

git stash (shelves uncommitted local changes), or git checkout last-known-good-tag (returns to a known-good release)

3. Restore the last working database

(from backups/)

4. Restart

docker-compose up -d
python senti/main.py

LOG ROTATION

Logs will grow. Set up rotation:

Add to logrotate (/etc/logrotate.d/senti):

/path/to/senti_ai/logs/*.log {
    daily
    rotate 14
    compress
    missingok
    notifempty
}

MONITORING ALERTS

What to alert on:
- /health endpoint returns non-200
- Error rate > 5% in last 10 minutes
- LLM response time > 30 seconds
- Database connection errors
- Memory usage > 80%
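The thresholds above can be expressed as a single evaluation function. This is a sketch only: the Snapshot fields are hypothetical names for metrics that would have to be collected elsewhere, and the source does not say how Senti gathers them.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Hypothetical metrics sample; field names are illustrative."""
    health_status: int         # HTTP status returned by /health
    requests_10m: int          # requests seen in the last 10 minutes
    errors_10m: int            # failed requests in the last 10 minutes
    llm_seconds: float         # most recent LLM response time
    db_connection_errors: int  # database connection errors observed
    memory_percent: float      # memory usage, 0 to 100


def alerts_for(s):
    """Return a list of alert messages for the thresholds in this runbook."""
    alerts = []
    if s.health_status != 200:
        alerts.append("/health returned non-200")
    # Guard against division by zero when there was no traffic.
    if s.requests_10m and s.errors_10m / s.requests_10m > 0.05:
        alerts.append("error rate above 5% in last 10 minutes")
    if s.llm_seconds > 30:
        alerts.append("LLM response time above 30 seconds")
    if s.db_connection_errors > 0:
        alerts.append("database connection errors")
    if s.memory_percent > 80:
        alerts.append("memory usage above 80%")
    return alerts
```

An empty list means nothing crossed a threshold; values exactly at a limit (30 seconds, 80%) do not alert, matching the ">" in the list.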

Simple monitoring script:

python senti/monitoring/health_check.py
(runs every 5 minutes, sends a Telegram alert if down)
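The contents of health_check.py are not shown here. As a rough illustration of the alerting half, the sketch below builds a Telegram Bot API sendMessage request and sends it only when the health status is not 200. The function names are hypothetical, and the bot token and chat id are placeholders to be supplied from configuration.

```python
import json
import urllib.request


def build_telegram_request(token, chat_id, text):
    """Build (but do not send) a Telegram Bot API sendMessage request."""
    url = f"https://api.telegram.org/bot{token}/sendMessage"
    body = json.dumps({"chat_id": chat_id, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def alert_if_down(status, token, chat_id, send=urllib.request.urlopen):
    """Send a Telegram message when status is not 200; return True if sent."""
    if status == 200:
        return False
    message = f"Senti gateway /health returned {status}"
    send(build_telegram_request(token, chat_id, message), timeout=10)
    return True
```

The send parameter defaults to urllib.request.urlopen and is injectable so the logic can be tested without network access. Scheduling every 5 minutes would be handled outside this function, for example by cron.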

WHAT TO DO WHEN SOMETHING BREAKS

LLM returns garbage:
- Restart Ollama (stop the running ollama serve process, then start it again with ollama serve &)
- Check: does senti-shujaa still load?
- Fallback: set PRIMARY_MODEL_NAME=qwen2.5 in .env

Database connection errors:
- docker-compose restart postgres
- Check: is disk full?
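One way to answer the "is disk full?" question from Python is shown below. The 90% threshold is an assumption chosen for illustration, not a value from this runbook, and the path to check depends on where the Postgres data lives.

```python
import shutil


def disk_usage_percent(path="/"):
    """Return the percentage of the filesystem containing path that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def disk_nearly_full(path="/", threshold=90.0):
    """True when usage is at or above threshold (assumed value, adjust as needed)."""
    return disk_usage_percent(path) >= threshold
```

If disk_nearly_full() returns True, free space before restarting Postgres, otherwise the restart is likely to fail the same way.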

Superpack 404 errors:
- python scratch/start_all.py
- Check: are ports 9201-9216 listening? python scratch/check_ports.py
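The contents of scratch/check_ports.py are not shown here. A minimal stand-in, assuming the superpacks listen on TCP ports 9201-9216 on the local machine, could look like this; closed_ports is a hypothetical name.

```python
import socket


def closed_ports(host="127.0.0.1", ports=range(9201, 9217), timeout=0.5):
    """Return the ports in `ports` that do not accept a TCP connection on host."""
    closed = []
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)
            # connect_ex returns 0 on success and an error code otherwise.
            if sock.connect_ex((host, port)) != 0:
                closed.append(port)
    return closed
```

An empty result means all 16 ports accepted a connection; any port in the returned list points to a superpack that needs restarting.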

Memory leaks:
- Restart the gateway
- Check Redis: redis-cli info memory
- Check session memory TTL is 2 hours

Slow responses (> 30 seconds):
- Check: is Ollama using CPU or GPU?
- Check: is Redis available?
- Check: is FEE handling simple queries (not LLM)?

WHAT NEVER TO DO IN PRODUCTION

Never:
- rm -rf postgres_data/
- docker-compose down --volumes
- Delete senti_db.sqlite without backup
- Change ENCRYPTION_KEY after users have data (all encrypted data becomes unreadable)
- Run migration scripts without testing on backup first