# SENTI AI — OPERATIONS RUNBOOK
## STARTING THE SYSTEM
Full stack:
docker-compose up -d            # databases
ollama serve &                  # LLM
python scratch/start_all.py     # 16 superpacks
python senti/main.py            # main gateway
Verify health:
curl http://localhost:8000/health
curl http://localhost:11434/api/tags
python scratch/check_ports.py
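
The runbook does not show what `scratch/check_ports.py` contains. As a rough idea of what such a check involves, here is a minimal sketch. It assumes the gateway listens on 8000, Ollama on 11434, and the superpacks on 9201-9216 (the range named in the troubleshooting section). The function name `open_ports` is hypothetical and not taken from the project.

```python
import socket


def open_ports(host, ports, timeout=0.5):
    """Return the subset of `ports` that accept a TCP connection on `host`."""
    listening = []
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)
            # connect_ex returns 0 on success instead of raising an exception
            if sock.connect_ex((host, port)) == 0:
                listening.append(port)
    return listening


# Assumed port layout: gateway, Ollama, then the 16 superpacks.
EXPECTED_PORTS = [8000, 11434] + list(range(9201, 9217))
```

Calling `open_ports("127.0.0.1", EXPECTED_PORTS)` and comparing the result against `EXPECTED_PORTS` shows which services are not listening. A successful TCP connect only proves the port is open, not that the service behind it is healthy, so it complements the `/health` check rather than replacing it.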
## DAILY HEALTH CHECKS
curl http://localhost:8000/health
python senti/tests/battle/run_all_battle_tests.py
tail -50 logs/senti_gateway.log | grep ERROR
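
On hosts without `tail` and `grep` (for example a plain Windows shell), the same log check can be approximated in Python. This is a sketch only; `recent_errors` is a hypothetical helper, and it assumes error lines contain the literal string `ERROR`, as the `grep` above does.

```python
from collections import deque


def recent_errors(log_path, last_n=50, needle="ERROR"):
    """Return lines containing `needle` among the last `last_n` lines of the file."""
    with open(log_path, encoding="utf-8", errors="replace") as fh:
        # A bounded deque keeps only the final `last_n` lines while streaming the file.
        tail = deque(fh, maxlen=last_n)
    return [line.rstrip("\n") for line in tail if needle in line]
```

For example, `recent_errors("logs/senti_gateway.log")` returns an empty list when the last 50 lines are clean.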
## BACKUP PROCEDURE
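Run daily (add to cron, or to Windows Task Scheduler through a POSIX shell such as Git Bash or WSL, since the commands below rely on `date` and `find`):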
# Postgres backup
docker exec senti_ai_postgres_1 pg_dump \
-U senti senti_db > backups/senti_db_$(date +%Y%m%d).sql
# Keep last 7 days
find backups/ -name "*.sql" -mtime +7 -delete
# Qdrant backup
docker exec senti_ai_qdrant_1 \
tar czf /tmp/qdrant_backup.tar.gz /qdrant/storage
docker cp senti_ai_qdrant_1:/tmp/qdrant_backup.tar.gz \
backups/qdrant_$(date +%Y%m%d).tar.gz
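
Note that the `find` command above prunes only `*.sql` files, so the Qdrant archives would accumulate. A possible retention helper that covers both file types is sketched below. `prune_backups` is a hypothetical name, and the 7-day cutoff is compared against file modification time, which is close to but not byte-for-byte identical to `find -mtime +7`.

```python
import time
from pathlib import Path


def prune_backups(backup_dir, patterns=("*.sql", "*.tar.gz"), keep_days=7, now=None):
    """Delete files matching `patterns` that were modified more than `keep_days` days ago.

    Returns the sorted names of the deleted files.
    """
    now = time.time() if now is None else now
    cutoff = now - keep_days * 86400  # seconds per day
    deleted = []
    for pattern in patterns:
        for path in Path(backup_dir).glob(pattern):
            if path.is_file() and path.stat().st_mtime < cutoff:
                path.unlink()
                deleted.append(path.name)
    return sorted(deleted)
```

Running `prune_backups("backups")` after the daily dump keeps roughly one week of both Postgres and Qdrant backups. Files that match neither pattern are left alone.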
## RESTORE PROCEDURE
# Postgres restore
docker exec -i senti_ai_postgres_1 psql \
-U senti senti_db < backups/senti_db_20260101.sql
# Restart after restore
docker-compose restart
## ROLLBACK PROCEDURE
If a deployment breaks the system:
# 1. Stop the broken version
docker-compose down
# 2. Revert the code
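git checkout last-known-good-tag   # or: git stash, if only uncommitted local changes broke it
# 3. Restore the last working database
# (from backups/, see RESTORE PROCEDURE; the Postgres container must be
# running for the restore, so start it first: docker-compose up -d postgres)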
# 4. Restart
docker-compose up -d
python senti/main.py
## LOG ROTATION
Logs will grow. Set up rotation:
Add to logrotate (/etc/logrotate.d/senti):
/path/to/senti_ai/logs/*.log {
daily
rotate 14
compress
missingok
notifempty
}
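
`logrotate` is not available on Windows. If the gateway logs through Python's standard `logging` module (an assumption; the runbook does not say), a comparable policy can be configured in-process. The sketch below mirrors `daily` and `rotate 14`; it does not compress rotated files, and `build_rotating_handler` is a hypothetical name.

```python
import logging
from logging.handlers import TimedRotatingFileHandler


def build_rotating_handler(log_path, keep_days=14):
    """Rotate the log at midnight and keep `keep_days` old files."""
    handler = TimedRotatingFileHandler(
        log_path,
        when="midnight",        # one rotation per day
        backupCount=keep_days,  # older rotated files are deleted
        encoding="utf-8",
        delay=True,             # open the file only on first write
    )
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
    )
    return handler
```

Attach it with `logging.getLogger().addHandler(build_rotating_handler("logs/senti_gateway.log"))`. The level name in the format keeps `grep ERROR` style checks working. Use either this or `logrotate` for a given file, not both.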
## MONITORING ALERTS
What to alert on:
- /health endpoint returns non-200
- Error rate > 5% in last 10 minutes
- LLM response time > 30 seconds
- Database connection errors
- Memory usage > 80%
Simple monitoring script:
python senti/monitoring/health_check.py
(runs every 5 minutes, sends Telegram alert if down)
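
The contents of `senti/monitoring/health_check.py` are not shown here. To make the thresholds above concrete, the following sketch evaluates them against one measurement snapshot. The `Snapshot` fields and the `triggered_alerts` function are hypothetical; how the numbers are collected and how the Telegram message is sent are left out.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    health_status: int         # HTTP status returned by /health
    requests: int              # requests seen in the last 10 minutes
    errors: int                # failed requests in the same window
    llm_seconds: float         # observed LLM response time
    db_connection_errors: int  # database connection errors in the window
    memory_percent: float      # memory usage, 0-100


def triggered_alerts(s):
    """Return a description for every alert rule that the snapshot violates."""
    alerts = []
    if s.health_status != 200:
        alerts.append("/health returned non-200")
    if s.requests > 0 and s.errors / s.requests > 0.05:
        alerts.append("error rate above 5% in last 10 minutes")
    if s.llm_seconds > 30:
        alerts.append("LLM response time above 30 seconds")
    if s.db_connection_errors > 0:
        alerts.append("database connection errors")
    if s.memory_percent > 80:
        alerts.append("memory usage above 80%")
    return alerts
```

An empty list means nothing needs to be sent. The comparisons are strict, so an error rate of exactly 5% or memory at exactly 80% does not alert.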
## WHAT TO DO WHEN SOMETHING BREAKS
LLM returns garbage:
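Restart Ollama: stop the running `ollama serve` process, then run `ollama serve &` again
(the ollama CLI has no `restart` subcommand)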
Check: does senti-shujaa still load?
Fallback: set PRIMARY_MODEL_NAME=qwen2.5 in .env
Database connection errors:
docker-compose restart postgres
Check: is disk full?
Superpack 404 errors:
python scratch/start_all.py
Check: are ports 9201-9216 listening?
python scratch/check_ports.py
Memory leaks:
Restart the gateway
Check Redis: redis-cli info memory
Check that the session memory TTL is 2 hours
Slow responses (> 30 seconds):
Check: is Ollama using CPU or GPU?
Check: is Redis available?
Check: is FEE handling simple queries (rather than the LLM)?
## WHAT NEVER TO DO IN PRODUCTION
Never:
rm -rf postgres_data/
docker-compose down --volumes
Delete senti_db.sqlite without backup
Change ENCRYPTION_KEY after users have data
(all encrypted data becomes unreadable)
Run migration scripts without testing on backup first