# SENTI AI — OPERATIONS RUNBOOK
## STARTING THE SYSTEM
Full stack:
docker-compose up -d              # databases
ollama serve &                    # LLM
python scratch/start_all.py       # 16 superpacks
python senti/main.py              # main gateway
Verify health:
curl http://localhost:8000/health
curl http://localhost:11434/api/tags
python scratch/check_ports.py
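
The three checks above can also be combined into one script. The following is a minimal sketch, not the contents of scratch/check_ports.py: the function names are hypothetical, and the URLs and the 9201-9216 port range are the ones used elsewhere in this runbook.

```python
import http.client
import socket
import urllib.error
import urllib.request

# Values taken from this runbook; adjust them if your deployment differs.
GATEWAY_HEALTH_URL = "http://localhost:8000/health"
OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"
SUPERPACK_PORTS = range(9201, 9217)  # 9201-9216 inclusive


def http_ok(url, timeout=5.0):
    """Return True if a GET on url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, non-2xx status, or a malformed reply.
        return False


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def closed_ports(host, ports, timeout=1.0):
    """Return the ports from `ports` that do not accept connections."""
    return [p for p in ports if not port_open(host, p, timeout)]


def stack_report(host="localhost"):
    """Summarise gateway, Ollama, and superpack status in one dict."""
    return {
        "gateway": http_ok(GATEWAY_HEALTH_URL),
        "ollama": http_ok(OLLAMA_TAGS_URL),
        "superpacks_down": closed_ports(host, SUPERPACK_PORTS),
    }
```

Calling `stack_report()` after startup returns a dict; an empty `superpacks_down` list together with `True` for both `gateway` and `ollama` suggests the full stack is reachable.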
## DAILY HEALTH CHECKS
curl http://localhost:8000/health
python senti/tests/battle/run_all_battle_tests.py
tail -n 50 logs/senti_gateway.log | grep ERROR
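
On hosts without tail and grep (for example a plain Windows shell), the log check above can be approximated in Python. This is a sketch under the assumption that error lines contain the literal text ERROR, as the grep pattern implies; the function name is hypothetical.

```python
from collections import deque


def recent_errors(log_path, last_n=50, marker="ERROR"):
    """Return lines containing `marker` among the last `last_n` lines of the log."""
    with open(log_path, encoding="utf-8", errors="replace") as fh:
        # A bounded deque keeps only the final `last_n` lines in memory.
        tail = deque(fh, maxlen=last_n)
    return [line.rstrip("\n") for line in tail if marker in line]
```

For example, `recent_errors("logs/senti_gateway.log")` returns a list that is empty when the last 50 lines contain no errors.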
## BACKUP PROCEDURE
Run daily (add to cron or Windows Task Scheduler):
# Postgres backup
docker exec senti_ai_postgres_1 pg_dump \
-U senti senti_db > backups/senti_db_$(date +%Y%m%d).sql
# Keep last 7 days
find backups/ -name "*.sql" -mtime +7 -delete
# Qdrant backup
docker exec senti_ai_qdrant_1 \
tar czf /tmp/qdrant_backup.tar.gz /qdrant/storage
docker cp senti_ai_qdrant_1:/tmp/qdrant_backup.tar.gz \
backups/qdrant_$(date +%Y%m%d).tar.gz
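
Note that the find command above prunes only the .sql dumps, so the Qdrant archives accumulate. A minimal Python sketch that applies one retention window to both file types is shown below; the function name and defaults are assumptions, and it deletes files whose modification time is older than keep_days × 24 hours.

```python
import time
from pathlib import Path


def prune_backups(backup_dir, keep_days=7, patterns=("*.sql", "*.tar.gz"), now=None):
    """Delete backup files older than `keep_days` and return the removed paths."""
    now = time.time() if now is None else now
    cutoff = now - keep_days * 86400  # seconds in a day
    removed = []
    for pattern in patterns:
        for path in Path(backup_dir).glob(pattern):
            if path.is_file() and path.stat().st_mtime < cutoff:
                path.unlink()
                removed.append(path)
    return sorted(removed)
```

Running `prune_backups("backups/")` after the daily backup keeps roughly one week of Postgres dumps and Qdrant archives.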
## RESTORE PROCEDURE
# Postgres restore
docker exec -i senti_ai_postgres_1 psql \
-U senti senti_db < backups/senti_db_20260101.sql
# Restart after restore
docker-compose restart
## ROLLBACK PROCEDURE
If a deployment breaks the system:
# 1. Stop the broken version
docker-compose down
# 2. Revert the code
git checkout last-known-good-tag   # git stash only helps if the breakage is uncommitted local edits
# 3. Restore the last working database
# (from backups/)
# 4. Restart
docker-compose up -d
python senti/main.py
## LOG ROTATION
Logs will grow. Set up rotation:
Add to logrotate (/etc/logrotate.d/senti):
/path/to/senti_ai/logs/*.log {
daily
rotate 14
compress
missingok
notifempty
}
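
logrotate is only available on Linux hosts. Where it is not an option, Python's standard `logging.handlers.TimedRotatingFileHandler` can give similar daily rotation with 14 retained files. The sketch below assumes the gateway logs through the stdlib logging module, which this runbook does not confirm; the function and logger names are hypothetical. Unlike the logrotate config above, it does not compress rotated files.

```python
import logging
from logging.handlers import TimedRotatingFileHandler
from pathlib import Path


def build_rotating_logger(log_path, name="senti.gateway"):
    """Return a logger that rotates `log_path` at midnight, keeping 14 old files."""
    Path(log_path).parent.mkdir(parents=True, exist_ok=True)
    handler = TimedRotatingFileHandler(
        log_path,
        when="midnight",   # rotate once per day
        backupCount=14,    # mirrors `rotate 14`
        encoding="utf-8",
    )
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
    )
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    return logger
```

The format string puts the level name in every line, so a grep for ERROR on the gateway log keeps working.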
## MONITORING ALERTS
What to alert on:
- /health endpoint returns non-200
- Error rate > 5% in last 10 minutes
- LLM response time > 30 seconds
- Database connection errors
- Memory usage > 80%
Simple monitoring script:
python senti/monitoring/health_check.py
(runs every 5 minutes, sends Telegram alert if down)
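
The alert thresholds listed above can be expressed as a small evaluation function. This is a sketch only, not the contents of senti/monitoring/health_check.py: the `Snapshot` fields and message texts are hypothetical, and collecting the metrics and sending the Telegram message are left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Snapshot:
    """One reading of the metrics the runbook says to alert on (hypothetical shape)."""
    health_status: int      # HTTP status returned by /health
    requests_10m: int       # requests seen in the last 10 minutes
    errors_10m: int         # failed requests in the last 10 minutes
    llm_seconds: float      # latest LLM response time
    db_errors: int          # database connection errors since the last check
    memory_percent: float   # memory usage, 0-100


def alerts(s: Snapshot) -> List[str]:
    """Return one message per threshold that the snapshot violates."""
    out = []
    if s.health_status != 200:
        out.append("/health returned %d" % s.health_status)
    # Avoid dividing by zero when there was no traffic in the window.
    error_rate = s.errors_10m / s.requests_10m if s.requests_10m else 0.0
    if error_rate > 0.05:
        out.append("error rate %.1f%% over last 10 minutes" % (error_rate * 100))
    if s.llm_seconds > 30:
        out.append("LLM response took %.1f s" % s.llm_seconds)
    if s.db_errors > 0:
        out.append("%d database connection errors" % s.db_errors)
    if s.memory_percent > 80:
        out.append("memory usage at %.0f%%" % s.memory_percent)
    return out
```

Keeping the thresholds in a pure function like this makes them testable without a running stack; the thresholds are strict, so an error rate of exactly 5% does not alert.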
## WHAT TO DO WHEN SOMETHING BREAKS
LLM returns garbage:
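Restart Ollama (the CLI has no restart subcommand):
stop the running "ollama serve" process, then run: ollama serve &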
Check: does senti-shujaa still load?
Fallback: set PRIMARY_MODEL_NAME=qwen2.5 in .env
Database connection errors:
docker-compose restart postgres
Check: is disk full?
Superpack 404 errors:
python scratch/start_all.py
Check: are ports 9201-9216 listening?
python scratch/check_ports.py
Memory leaks:
Restart the gateway
Check Redis: redis-cli info memory
Check session memory TTL is 2 hours
Slow responses (> 30 seconds):
Check: is Ollama using CPU or GPU?
Check: is Redis available?
Check: is FEE handling simple queries (not LLM)?
## WHAT NEVER TO DO IN PRODUCTION
Never:
- rm -rf postgres_data/
- docker-compose down --volumes
- Delete senti_db.sqlite without backup
- Change ENCRYPTION_KEY after users have data (all encrypted data becomes unreadable)
- Run migration scripts without testing on backup first