# SENTI AI — OPERATIONS RUNBOOK

## STARTING THE SYSTEM

Full stack:
  docker-compose up -d          # databases
  ollama serve &                # LLM
  python scratch/start_all.py   # 16 superpacks
  python senti/main.py          # main gateway

Verify health:
  curl http://localhost:8000/health
  curl http://localhost:11434/api/tags
  python scratch/check_ports.py
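
The two curl checks above can also be scripted. This is a minimal sketch, not the project's own tooling: the function names `http_status` and `verify_stack` are hypothetical, and it assumes that HTTP 200 from each endpoint means "healthy".

```python
import urllib.error
import urllib.request

# Endpoints taken from the curl commands in "Verify health".
ENDPOINTS = {
    "gateway": "http://localhost:8000/health",
    "ollama": "http://localhost:11434/api/tags",
}


def http_status(url, timeout=5.0):
    """Return the HTTP status code for url, or None if it cannot be reached."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return None  # connection refused, timeout, DNS failure, ...


def verify_stack(endpoints=ENDPOINTS, probe=http_status):
    """Map each service name to True (HTTP 200) or False (anything else)."""
    return {name: probe(url) == 200 for name, url in endpoints.items()}
```

Calling `verify_stack()` on the host returns a dict with one entry per service. The `probe` parameter exists only so the logic can be exercised without a running stack.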

## DAILY HEALTH CHECKS

  curl http://localhost:8000/health
  python senti/tests/battle/run_all_battle_tests.py
  tail -50 logs/senti_gateway.log | grep ERROR
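
The `tail | grep` check above has a straightforward Python equivalent, which may help where grep is not available (for example on Windows). A sketch only; `recent_errors` is a hypothetical helper, and it assumes error lines contain the literal text ERROR.

```python
from collections import deque


def recent_errors(lines, tail=50, marker="ERROR"):
    """Return the lines containing marker among the last `tail` lines.

    `lines` can be any iterable of strings, including an open file.
    """
    last_lines = deque(lines, maxlen=tail)  # keeps only the final `tail` items
    return [line.rstrip("\n") for line in last_lines if marker in line]
```

Usage: `with open("logs/senti_gateway.log", encoding="utf-8") as f: print(recent_errors(f))`.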

## BACKUP PROCEDURE

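Run daily (put the commands in a shell script and schedule it with cron or
Windows Task Scheduler; on Windows the script needs a POSIX shell such as
WSL or Git Bash):

  # Make sure the target directory exists
  mkdir -p backups

  # Postgres backup
  docker exec senti_ai_postgres_1 pg_dump \
    -U senti senti_db > backups/senti_db_$(date +%Y%m%d).sql

  # Qdrant backup
  docker exec senti_ai_qdrant_1 \
    tar czf /tmp/qdrant_backup.tar.gz /qdrant/storage
  docker cp senti_ai_qdrant_1:/tmp/qdrant_backup.tar.gz \
    backups/qdrant_$(date +%Y%m%d).tar.gz

  # Keep roughly the last 7 days of both backup types
  find backups/ \( -name "*.sql" -o -name "*.tar.gz" \) -mtime +7 -delete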
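
The 7-day retention rule can also be applied to the date stamp in each filename instead of the file's modification time, which `find -mtime` uses. The sketch below is one way to do that. `expired_backups` is a hypothetical helper, and it assumes the `_YYYYMMDD.sql` and `_YYYYMMDD.tar.gz` naming produced by the backup commands.

```python
import re
from datetime import date, datetime, timedelta

# Matches the stamp written by $(date +%Y%m%d) in the backup filenames.
STAMP = re.compile(r"_(\d{8})\.(?:sql|tar\.gz)$")


def expired_backups(filenames, today=None, keep_days=7):
    """Return the backup filenames whose date stamp is older than keep_days."""
    today = today or date.today()
    cutoff = today - timedelta(days=keep_days)
    expired = []
    for name in filenames:
        match = STAMP.search(name)
        if match is None:
            continue  # not a dated backup; leave it alone
        try:
            stamp = datetime.strptime(match.group(1), "%Y%m%d").date()
        except ValueError:
            continue  # eight digits that are not a valid date
        if stamp < cutoff:
            expired.append(name)
    return expired
```

The function only reports names; deleting them is left to the caller so the list can be reviewed first.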

## RESTORE PROCEDURE

  # Postgres restore (substitute the date of the backup you want)
  docker exec -i senti_ai_postgres_1 psql \
    -U senti senti_db < backups/senti_db_20260101.sql
  
  # Restart after restore
  docker-compose restart

## ROLLBACK PROCEDURE

If a deployment breaks the system:

  # 1. Stop the broken version
  docker-compose down
  
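  # 2. Revert the code
  git stash                          # if the breakage is uncommitted local changes
  # or, if the broken change was committed:
  git checkout last-known-good-tag
  
  # 3. Restore the last working database from backups/
  # (see RESTORE PROCEDURE)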
  
  # 4. Restart
  docker-compose up -d
  python senti/main.py

## LOG ROTATION

Logs will grow. Set up rotation:

  Add to logrotate (/etc/logrotate.d/senti):
    /path/to/senti_ai/logs/*.log {
      daily
      rotate 14
      compress
      missingok
      notifempty
    }
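
Where logrotate is not available (for example on Windows), a similar policy can be set inside the application instead. The sketch below assumes the gateway logs through Python's standard `logging` module, which this runbook does not confirm. `build_rotating_handler` is a hypothetical helper. Unlike the logrotate config above, it does not compress old files.

```python
import logging
from logging.handlers import TimedRotatingFileHandler


def build_rotating_handler(log_path, keep_days=14):
    """Rotate log_path at midnight and keep keep_days rotated files."""
    handler = TimedRotatingFileHandler(
        log_path,
        when="midnight",         # one file per day, like `daily`
        backupCount=keep_days,   # like `rotate 14`
        encoding="utf-8",
        delay=True,              # do not create the file until the first record
    )
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
    )
    return handler
```

Usage: `logging.getLogger().addHandler(build_rotating_handler("logs/senti_gateway.log"))`. Use one mechanism or the other, not both, so rotated files are not renamed twice.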

## MONITORING ALERTS

  What to alert on:
    - /health endpoint returns non-200
    - Error rate > 5% in last 10 minutes
    - LLM response time > 30 seconds
    - Database connection errors
    - Memory usage > 80%

  Simple monitoring script:
    python senti/monitoring/health_check.py
    (runs every 5 minutes, sends Telegram alert if down)
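
The alert conditions listed above can be written as simple threshold checks. This is an illustrative sketch, not the contents of senti/monitoring/health_check.py: the `Snapshot` fields and the `triggered_alerts` function are hypothetical, and how each metric is collected is left open.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Hypothetical metrics sample; field names are illustrative only."""
    health_status: int           # HTTP status returned by /health
    requests_last_10_min: int
    errors_last_10_min: int
    llm_response_seconds: float
    db_connection_errors: int
    memory_used_percent: float


def triggered_alerts(s):
    """Return a human-readable message for every threshold that is exceeded."""
    alerts = []
    if s.health_status != 200:
        alerts.append("/health returned non-200")
    # Guard against division by zero when there was no traffic.
    if s.requests_last_10_min > 0:
        error_rate = s.errors_last_10_min / s.requests_last_10_min
        if error_rate > 0.05:
            alerts.append("error rate above 5% in last 10 minutes")
    if s.llm_response_seconds > 30:
        alerts.append("LLM response time above 30 seconds")
    if s.db_connection_errors > 0:
        alerts.append("database connection errors")
    if s.memory_used_percent > 80:
        alerts.append("memory usage above 80%")
    return alerts
```

An empty list means nothing needs to be sent. All comparisons are strict, so a value exactly at the threshold does not alert.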

## WHAT TO DO WHEN SOMETHING BREAKS

  LLM returns garbage:
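    Restart Ollama: stop the running "ollama serve" process, then run
      ollama serve &
    (on a systemd install: sudo systemctl restart ollama)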
    Check: does senti-shujaa still load?
    Fallback: set PRIMARY_MODEL_NAME=qwen2.5 in .env

  Database connection errors:
    docker-compose restart postgres
    Check: is disk full?

  Superpack 404 errors:
    python scratch/start_all.py
    Check: are ports 9201-9216 listening?
    python scratch/check_ports.py
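
The port check can be reproduced with the standard library. The sketch below is not the contents of scratch/check_ports.py. It assumes the superpacks listen on 127.0.0.1, and `tcp_port_open` and `closed_ports` are hypothetical names.

```python
import socket

# Ports 9201-9216 inclusive, one per superpack.
SUPERPACK_PORTS = range(9201, 9217)


def tcp_port_open(port, host="127.0.0.1", timeout=0.5):
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise.
        return sock.connect_ex((host, port)) == 0


def closed_ports(ports=SUPERPACK_PORTS, is_open=tcp_port_open):
    """Return the ports that are not accepting connections."""
    return [port for port in ports if not is_open(port)]
```

`closed_ports()` returns an empty list when all 16 superpacks are reachable. Any port it lists points at a superpack that needs restarting.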

  Memory leaks:
    Restart the gateway (stop it, then python senti/main.py)
    Check Redis: redis-cli info memory
    Check: is the session memory TTL still 2 hours?

  Slow responses (> 30 seconds):
    Check: is Ollama using CPU or GPU?
    Check: is Redis available?
    Check: are simple queries handled by FEE rather than sent to the LLM?

## WHAT NEVER TO DO IN PRODUCTION

  Never:
    rm -rf postgres_data/
    docker-compose down --volumes
    Delete senti_db.sqlite without a backup
    Change ENCRYPTION_KEY after users have data
      (all encrypted data becomes unreadable)
    Run migration scripts without first testing them on a restored backup