| # Codette Production Deployment Guide |
|
|
| ## Overview |
|
|
| This guide walks through deploying Codette's reasoning engine to production with pre-configured GGUF models and LoRA adapters. |
|
|
| **Status**: Production-Ready ✅ |
| **Current Correctness**: 78.6% (target: 70%+) |
| **Test Suite**: 52/52 passing |
| **Architecture**: 7-layer consciousness stack (Session 13-14) |
|
|
| --- |
|
|
| ## Pre-Deployment Checklist |
|
|
| - [ ] **Hardware**: Min 8GB RAM, 5GB disk (see specs below) |
| - [ ] **Python**: 3.8+ installed (`python --version`) |
| - [ ] **Git**: Repository cloned |
| - [ ] **Ports**: 7860 available (or reconfigure) |
| - [ ] **Network**: Outbound access for API calls (a Hugging Face token is optional) |
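
Some of the checklist items above can be verified with a short script. The following is a minimal sketch, assuming the default port 7860 and the 5 GB disk minimum stated in the checklist. The helper names (`python_ok`, `port_free`, `disk_free_gb`) are illustrative and are not part of the Codette codebase.

```python
import shutil
import socket
import sys


def python_ok(minimum=(3, 8)):
    """Return True if the running interpreter meets the minimum version."""
    return sys.version_info[:2] >= minimum


def port_free(port=7860, host="127.0.0.1"):
    """Return True if we can bind to host:port, i.e. nothing else holds it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


def disk_free_gb(path="."):
    """Return free disk space at `path` in gigabytes (10**9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


if __name__ == "__main__":
    print(f"Python 3.8+:    {python_ok()}")
    print(f"Port 7860 free: {port_free()}")
    print(f"Disk free:      {disk_free_gb():.1f} GB (need 5+)")
```

Run it from the directory where the repository will live so the disk check looks at the right volume.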
|
|
| --- |
|
|
| ## Step 1: Environment Setup |
|
|
| ### 1.1 Clone Repository |
| ```bash |
| git clone https://github.com/YOUR_USERNAME/codette-reasoning.git |
| cd codette-reasoning |
| ``` |
|
|
| ### 1.2 Create Virtual Environment (Recommended) |
| ```bash |
| python -m venv venv |
| |
| # Activate |
| # On Linux/Mac: |
| source venv/bin/activate |
| |
| # On Windows: |
| venv\Scripts\activate |
| ``` |
|
|
| ### 1.3 Install Dependencies |
| ```bash |
| pip install --upgrade pip |
| pip install -r requirements.txt |
| ``` |
|
|
| **Expected output**: All packages install without errors |
|
|
| --- |
|
|
| ## Step 2: Verify Models & Adapters |
|
|
| ### 2.1 Check Model Files |
| ```bash |
| ls -lh models/base/ |
| # Should show: |
| # - Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf (4.6GB) |
| # - llama-3.2-1b-instruct-q8_0.gguf (1.3GB) |
| # - Meta-Llama-3.1-8B-Instruct.F16.gguf (3.4GB) |
| ``` |
|
|
| ### 2.2 Check Adapters |
| ```bash |
| ls -lh adapters/ |
| # Should show 8 .gguf files (27MB each) |
| ``` |
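
The two listing checks above can also be scripted so that a mismatch is flagged explicitly. This is a sketch only, assuming the directory layout and counts given in the comments above (3 models in `models/base/`, 8 adapters in `adapters/`). `list_gguf` is a hypothetical helper, not an existing Codette function.

```python
from pathlib import Path


def list_gguf(directory):
    """Return sorted (name, size in MB) pairs for .gguf files directly in `directory`."""
    root = Path(directory)
    if not root.is_dir():
        return []
    return sorted(
        (p.name, p.stat().st_size / 1e6) for p in root.glob("*.gguf") if p.is_file()
    )


if __name__ == "__main__":
    # Expected counts come from the comments in steps 2.1 and 2.2.
    for directory, expected in (("models/base", 3), ("adapters", 8)):
        files = list_gguf(directory)
        status = "OK" if len(files) == expected else "MISMATCH"
        print(f"{directory}: {len(files)}/{expected} .gguf files [{status}]")
        for name, size_mb in files:
            print(f"  {name}  {size_mb:,.0f} MB")
```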
|
|
| ### 2.3 Verify Model Loader |
| ```bash |
| python -c " |
| from inference.model_loader import ModelLoader |
| loader = ModelLoader() |
| models = loader.list_available_models() |
| print(f'Found {len(models)} models') |
| for m in models: |
|     print(f'  - {m}') |
| " |
| # Expected: Found 3 models |
| ``` |
|
|
| --- |
|
|
| ## Step 3: Run Tests (Pre-Flight Check) |
|
|
| ### 3.1 Run Core Integration Tests |
| ```bash |
| python -m pytest test_integration.py -v |
| # Expected: All passed |
| |
| python -m pytest test_tier2_integration.py -v |
| # Expected: 18 passed |
| |
| python -m pytest test_integration_phase6.py -v |
| # Expected: 7 passed |
| ``` |
|
|
| ### 3.2 Run Correctness Benchmark |
| ```bash |
| python correctness_benchmark.py |
| # Expected output: |
| # Phase 6+13+14 accuracy: 78.6% |
| # Meta-loops reduced: 90% → 5% |
| ``` |
|
|
| **If any test fails**: See the Troubleshooting section below |
|
|
| --- |
|
|
| ## Step 4: Configure for Your Hardware |
|
|
| ### Option A: Default (Llama 3.1 8B Q4 + GPU) |
| ```bash |
| # Automatic - GPU acceleration enabled |
| python inference/codette_server.py |
| ``` |
|
|
| ### Option B: CPU-Only (Lightweight) |
| ```bash |
| # Use Llama 3.2 1B model |
| export CODETTE_MODEL_PATH="models/base/llama-3.2-1b-instruct-q8_0.gguf" |
| export CODETTE_GPU_LAYERS=0 |
| python inference/codette_server.py |
| ``` |
|
|
| ### Option C: Maximum Quality (Llama 3.1 8B F16) |
| ```bash |
| # Use full-precision model (slower, higher quality) |
| export CODETTE_MODEL_PATH="models/base/Meta-Llama-3.1-8B-Instruct.F16.gguf" |
| python inference/codette_server.py |
| ``` |
|
|
| ### Option D: Custom Configuration |
| Edit `inference/codette_server.py` line ~50: |
|
|
| ```python |
| MODEL_CONFIG = { |
|     "model_path": "models/base/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf", |
|     "n_gpu_layers": 32,   # Increase/decrease based on GPU VRAM |
|     "n_threads": 8,       # CPU parallel threads |
|     "n_ctx": 2048,        # Context window (tokens) |
|     "temperature": 0.7,   # 0.0=deterministic, 1.0=creative |
|     "top_k": 40,          # Top-K sampling |
|     "top_p": 0.95,        # Nucleus sampling |
| } |
| ``` |
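
Options B and C rely on the `CODETTE_MODEL_PATH` and `CODETTE_GPU_LAYERS` environment variables overriding the defaults. How `codette_server.py` actually reads them is not shown in this guide, so the following is only a sketch of one plausible overlay. `DEFAULTS` and `resolve_config` are hypothetical names used for illustration.

```python
import os

# Defaults mirror the MODEL_CONFIG values shown above (assumed, not read from the server).
DEFAULTS = {
    "model_path": "models/base/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
    "n_gpu_layers": 32,
    "n_threads": 8,
    "n_ctx": 2048,
}


def resolve_config(env=None):
    """Overlay CODETTE_* environment variables on the default config."""
    env = os.environ if env is None else env
    config = dict(DEFAULTS)
    if env.get("CODETTE_MODEL_PATH"):
        config["model_path"] = env["CODETTE_MODEL_PATH"]
    if env.get("CODETTE_GPU_LAYERS") not in (None, ""):
        # int() raises ValueError on non-numeric input, which surfaces typos early.
        config["n_gpu_layers"] = int(env["CODETTE_GPU_LAYERS"])
    return config


if __name__ == "__main__":
    for key, value in resolve_config().items():
        print(f"{key}: {value}")
```

Printing the resolved values before launch makes it easy to confirm which option is actually in effect.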
|
|
| --- |
|
|
| ## Step 5: Start Server |
|
|
| ### 5.1 Launch |
| ```bash |
| python inference/codette_server.py |
| ``` |
|
|
| **Expected output**: |
| ``` |
| Loading model: models/base/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf... |
| Loading adapters from: adapters/ |
| ✓ consciousness-lora-f16.gguf |
| ✓ davinci-lora-f16.gguf |
| ✓ empathy-lora-f16.gguf |
| ✓ guardian-spindle (logical validation) |
| ✓ colleen-conscience (ethical validation) |
| Starting server on http://0.0.0.0:7860 |
| Ready for requests! |
| ``` |
|
|
| ### 5.2 Check Server Health |
| ```bash |
| # In another terminal: |
| curl http://localhost:7860/api/health |
| |
| # Expected response: |
| # {"status": "ready", "version": "14.0", "model": "Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf"} |
| ``` |
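
Because the model takes a few seconds to load, scripts that start the server and then send queries usually need to wait for the health endpoint first. A minimal sketch, assuming the `/api/health` URL and the `"status": "ready"` field shown above. `wait_until_ready` is an illustrative helper, not part of Codette.

```python
import json
import time
import urllib.error
import urllib.request


def wait_until_ready(url="http://localhost:7860/api/health", timeout=60.0, interval=1.0):
    """Poll `url` until it returns JSON with status == "ready" or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with urllib.request.urlopen(url, timeout=5) as response:
                if json.load(response).get("status") == "ready":
                    return True
        except (urllib.error.URLError, OSError, ValueError, AttributeError):
            pass  # server not up yet, or the response was not the expected JSON object
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


if __name__ == "__main__":
    print("ready" if wait_until_ready() else "server did not become ready in time")
```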
|
|
| --- |
|
|
| ## Step 6: Test Live Queries |
|
|
| ### 6.1 Simple Query |
| ```bash |
| curl -X POST http://localhost:7860/api/chat \ |
| -H "Content-Type: application/json" \ |
| -d '{ |
| "query": "What is quantum computing?", |
| "max_adapters": 3 |
| }' |
| ``` |
|
|
| **Expected**: Multi-perspective response with 3 adapters active |
|
|
| ### 6.2 Complex Reasoning Query |
| ```bash |
| curl -X POST http://localhost:7860/api/chat \ |
| -H "Content-Type: application/json" \ |
| -d '{ |
| "query": "Should we implement AI for hiring decisions? Provide ethical analysis.", |
| "max_adapters": 8 |
| }' |
| ``` |
|
|
| **Expected**: Full consciousness stack (7 layers + ethical validation) |
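
The same requests can be sent from Python without extra dependencies. This sketch assumes the `/api/chat` endpoint and the `query` / `max_adapters` fields from the curl examples above. The 1 to 8 range check is an assumption based on the eight adapters, and the response is assumed to be JSON. `build_chat_request` and `chat` are illustrative names.

```python
import json
import urllib.request


def build_chat_request(query, max_adapters=3, base_url="http://localhost:7860"):
    """Build a POST request for /api/chat with the JSON body used in the curl examples."""
    if not 1 <= max_adapters <= 8:  # assumed range: the guide ships 8 adapters
        raise ValueError("max_adapters must be between 1 and 8")
    body = json.dumps({"query": query, "max_adapters": max_adapters}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/chat",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(query, max_adapters=3, timeout=120):
    """Send the query and return the decoded JSON response (assumes a JSON reply)."""
    request = build_chat_request(query, max_adapters)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    print(chat("What is quantum computing?", max_adapters=3))
```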
|
|
| ### 6.3 Web Interface |
| ``` |
| Visit: http://localhost:7860 |
| ``` |
|
|
| --- |
|
|
| ## Step 7: Performance Validation |
|
|
| ### 7.1 Check Latency |
| ```bash |
| time python -c " |
| from inference.codette_forge_bridge import CodetteForgeBridge |
| bridge = CodetteForgeBridge() |
| response = bridge.reason('Explain photosynthesis') |
| print(f'Response: {response[:100]}...') |
| " |
| # Note execution time |
| ``` |
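
Note that `time python -c ...` includes interpreter startup and model loading, not just inference. To separate those, latency can be sampled over repeated calls inside one process. A small sketch; `measure_latency` is a hypothetical helper, not part of Codette.

```python
import statistics
import time


def measure_latency(fn, runs=5):
    """Call fn() `runs` times and return min/median/max wall-clock seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "max": max(samples),
    }
```

With the bridge from the snippet above already constructed, a call such as `measure_latency(lambda: bridge.reason('Explain photosynthesis'))` would report per-query latency with the model already loaded.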
|
|
| ### 7.2 Monitor Memory Usage |
| ```bash |
| # During server run, in another terminal: |
| # Linux/Mac: |
| watch -n 1 'ps aux | grep codette_server' |
| |
| # Windows: |
| Get-Process -Name python # run in PowerShell |
| ``` |
|
|
| ### 7.3 Validate Adapter Activity |
| ```bash |
| python -c " |
| from reasoning_forge.forge_engine import ForgeEngine |
| engine = ForgeEngine() |
| adapters = engine.get_loaded_adapters() |
| print(f'Active adapters: {len(adapters)}/8') |
| for adapter in adapters: |
|     print(f'  ✓ {adapter}') |
| " |
| ``` |
|
|
| --- |
|
|
| ## Production Deployment Patterns |
|
|
| ### Pattern 1: Local Development |
| ```bash |
| # Simple one-liner for local testing |
| python inference/codette_server.py |
| ``` |
|
|
| ### Pattern 2: Docker Container |
| ```dockerfile |
| FROM python:3.10-slim |
| |
| WORKDIR /app |
| COPY . . |
| |
| RUN pip install -r requirements.txt |
| |
| EXPOSE 7860 |
| |
| CMD ["python", "inference/codette_server.py"] |
| ``` |
|
|
| ```bash |
| docker build -t codette:latest . |
| docker run -p 7860:7860 codette:latest |
| ``` |
|
|
| ### Pattern 3: Kubernetes Deployment |
| ```yaml |
| apiVersion: apps/v1 |
| kind: Deployment |
| metadata: |
|   name: codette |
| spec: |
|   replicas: 2 |
|   selector: |
|     matchLabels: |
|       app: codette |
|   template: |
|     metadata: |
|       labels: |
|         app: codette |
|     spec: |
|       containers: |
|         - name: codette |
|           image: codette:latest |
|           ports: |
|             - containerPort: 7860 |
|           resources: |
|             limits: |
|               memory: "16Gi" |
|               nvidia.com/gpu: 1 |
| ``` |
|
|
| ### Pattern 4: Systemd Service (Linux) |
| Create `/etc/systemd/system/codette.service`: |
|
|
| ```ini |
| [Unit] |
| Description=Codette Reasoning Engine |
| After=network.target |
| |
| [Service] |
| Type=simple |
| User=codette |
| WorkingDirectory=/opt/codette |
| ExecStart=/usr/bin/python /opt/codette/inference/codette_server.py |
| Restart=always |
| RestartSec=10 |
| |
| [Install] |
| WantedBy=multi-user.target |
| ``` |
|
|
| ```bash |
| sudo systemctl daemon-reload # pick up the new unit file |
| sudo systemctl start codette |
| sudo systemctl enable codette |
| sudo systemctl status codette |
| ``` |
|
|
| --- |
|
|
| ## Hardware Configuration Guide |
|
|
| ### Minimal (CPU-Only) |
| ``` |
| Requirements: |
| - CPU: i5 or equivalent |
| - RAM: 8 GB |
| - Disk: 3 GB |
| - GPU: None |
| |
| Setup: |
| export CODETTE_MODEL_PATH="models/base/llama-3.2-1b-instruct-q8_0.gguf" |
| export CODETTE_GPU_LAYERS=0 |
| |
| Performance: |
| - Warmup: 2-3 seconds |
| - Inference: ~2-5 tokens/sec |
| - Batch size: 1-2 |
| ``` |
|
|
| ### Standard (GPU-Accelerated) |
| ``` |
| Requirements: |
| - CPU: i7 or Ryzen 5+ |
| - RAM: 16 GB |
| - Disk: 6 GB |
| - GPU: RTX 3070 or equivalent (8GB VRAM) |
| |
| Setup: |
| # Default configuration |
| python inference/codette_server.py |
| |
| Performance: |
| - Warmup: 3-5 seconds |
| - Inference: ~15-25 tokens/sec |
| - Batch size: 4-8 |
| ``` |
|
|
| ### High-Performance (Production) |
| ``` |
| Requirements: |
| - CPU: Intel Xeon / AMD Ryzen 9 |
| - RAM: 32 GB |
| - Disk: 10 GB (SSD recommended) |
| - GPU: RTX 4090 or A100 (24GB+ VRAM) |
| |
| Setup: |
| export CODETTE_GPU_LAYERS=80 # Max acceleration |
| export CODETTE_BATCH_SIZE=16 |
| python inference/codette_server.py |
| |
| Performance: |
| - Warmup: 4-6 seconds |
| - Inference: ~80-120 tokens/sec |
| - Batch size: 16-32 |
| ``` |
|
|
| --- |
|
|
| ## Troubleshooting |
|
|
| ### Issue: "CUDA device not found" |
| ```bash |
| # Verify GPU availability |
| python -c "import torch; print(torch.cuda.is_available())" |
| |
| # If False, switch to CPU: |
| export CODETTE_GPU_LAYERS=0 |
| python inference/codette_server.py |
| ``` |
|
|
| ### Issue: "out of memory" error |
| ```bash |
| # Reduce GPU layer allocation |
| export CODETTE_GPU_LAYERS=16 # Try 16 instead of 32 |
| |
| # Or use smaller model: |
| export CODETTE_MODEL_PATH="models/base/llama-3.2-1b-instruct-q8_0.gguf" |
| |
| # Check current memory usage: |
| nvidia-smi # For GPU |
| free -h # For system RAM |
| ``` |
|
|
| ### Issue: Model loads slowly |
| ```bash |
| # The first launch reads the model from disk into memory, so some delay is normal. |
| # Typical startup time: 3-6 seconds depending on GPU |
| |
| # If loading is consistently slow: |
| # 1. Check disk read speed: |
| sudo hdparm -t /dev/sda # Linux example; replace /dev/sda with your device |
| |
| # 2. Move models to SSD if on HDD: |
| cp -r models/ /mnt/ssd/codette/ |
| export CODETTE_MODEL_ROOT="/mnt/ssd/codette/models" |
| ``` |
|
|
| ### Issue: Test failures |
| ```bash |
| # Run individual test with verbose output: |
| python -m pytest test_tier2_integration.py::test_intent_analysis_low_risk -vv |
| |
| # Check imports: |
| python -c "from reasoning_forge.forge_engine import ForgeEngine; print('OK')" |
| |
| # If import fails, reinstall: |
| pip install --force-reinstall --no-cache-dir -r requirements.txt |
| ``` |
|
|
| ### Issue: Adapters not loading |
| ```bash |
| # Verify adapter files: |
| ls -lh adapters/ |
| # Should show 8 .gguf files |
| |
| # Check adapter loading: |
| python -c " |
| from reasoning_forge.forge_engine import ForgeEngine |
| engine = ForgeEngine() |
| print(f'Loaded: {len(engine.adapters)} adapters') |
| " |
| |
| # If 0 adapters, check file permissions: |
| chmod 644 adapters/*.gguf |
| ``` |
|
|
| ### Issue: API returns 500 errors |
| ```bash |
| # Check server logs: |
| tail -f reasoning_forge/.logs/codette_errors.log |
| |
| # Test with simpler query: |
| curl -X POST http://localhost:7860/api/chat \ |
| -H "Content-Type: application/json" \ |
| -d '{"query": "test"}' |
| |
| # Check if Colleen/Guardian validation is blocking: |
| # Edit inference/codette_server.py and disable validation temporarily |
| ``` |
|
|
| --- |
|
|
| ## Monitoring & Observability |
|
|
| ### Health Checks |
| ```bash |
| # Every 30 seconds: |
| watch -n 30 curl http://localhost:7860/api/health |
| |
| # In production, use automated monitoring: |
| # Example: Prometheus metrics endpoint |
| curl http://localhost:7860/metrics |
| ``` |
|
|
| ### Log Inspection |
| ```bash |
| # Application logs: |
| tail -f reasoning_forge/.logs/codette_reflection_journal.json |
| |
| # Error logs: |
| grep ERROR reasoning_forge/.logs/codette_errors.log |
| |
| # Performance metrics: |
| jq '.latency[]' observatory_metrics.json |
| ``` |
|
|
| ### Resource Monitoring |
| ```bash |
| # GPU utilization: |
| nvidia-smi -l 1 |
| |
| # System load: |
| top # Or Activity Monitor on macOS, Task Manager on Windows |
| |
| # Memory per process: |
| ps aux | grep codette_server |
| ``` |
|
|
| --- |
|
|
| ## Scaling & Load Testing |
|
|
| ### Load Test 1: Sequential Requests |
| ```bash |
| for i in {1..100}; do |
| curl -s -X POST http://localhost:7860/api/chat \ |
| -H "Content-Type: application/json" \ |
| -d '{"query": "test query '$i'"}' > /dev/null |
| echo "Request $i/100" |
| done |
| ``` |
|
|
| ### Load Test 2: Concurrent Requests |
| ```bash |
| # Using GNU Parallel: |
| seq 1 50 | parallel -j 4 'curl -s http://localhost:7860/api/health' |
| |
| # Or using Apache Bench: |
| ab -n 100 -c 10 http://localhost:7860/api/health |
| ``` |
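
Where GNU Parallel or Apache Bench is not installed, a comparable concurrent test can be run from Python. The following is a sketch under the same assumptions as the commands above (health endpoint on port 7860). `timed_get` and `load_test` are illustrative helpers, and the p95 value is a simple nearest-rank approximation.

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def timed_get(url, timeout=30):
    """Return (ok, seconds) for one GET request; ok is False on any error."""
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            ok = response.status == 200
    except Exception:
        ok = False
    return ok, time.perf_counter() - start


def load_test(call, total=100, concurrency=10):
    """Run call() `total` times across `concurrency` threads and summarize the results."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(lambda _: call(), range(total)))
    elapsed = time.perf_counter() - start
    latencies = sorted(seconds for _, seconds in results)
    return {
        "ok": sum(1 for ok, _ in results if ok),
        "failed": sum(1 for ok, _ in results if not ok),
        "p95_s": latencies[int(0.95 * (len(latencies) - 1))],
        "req_per_min": 60 * total / elapsed,
    }


if __name__ == "__main__":
    url = "http://localhost:7860/api/health"
    print(load_test(lambda: timed_get(url), total=100, concurrency=10))
```

The defaults mirror `ab -n 100 -c 10`. Passing a callable that posts to `/api/chat` instead would give a throughput figure comparable to the req/min numbers in this section.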
|
|
| ### Expected Performance |
| - Llama 3.1 8B Q4 + RTX 3090: **50-60 req/min** sustained |
| - Llama 3.2 1B + CPU: **5-10 req/min** sustained |
|
|
| --- |
|
|
| ## Security Considerations |
|
|
| ### 1. API Authentication (TODO for production) |
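| ```python |
| # Add in inference/codette_server.py (assumes a FastAPI app, as the decorators suggest): |
| import os |
| import secrets |
| |
| from fastapi import Header, HTTPException |
| |
| @app.post("/api/chat") |
| def chat_with_auth(request, token: str = Header(None)): |
|     expected = os.getenv("CODETTE_API_TOKEN") |
|     # Reject if no token is configured or sent, so a missing value never matches |
|     if not expected or not token or not secrets.compare_digest(token, expected): |
|         raise HTTPException(status_code=401, detail="Invalid token") |
|     # Process request |
| ``` |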
|
|
| ### 2. Rate Limiting |
| ```python |
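| from fastapi import Request |
| from slowapi import Limiter, _rate_limit_exceeded_handler |
| from slowapi.errors import RateLimitExceeded |
| from slowapi.util import get_remote_address |
| |
| limiter = Limiter(key_func=get_remote_address) |
| app.state.limiter = limiter |
| app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler) |
| |
| @app.post("/api/chat") |
| @limiter.limit("10/minute") |
| def chat(request: Request):  # slowapi requires an argument named `request` |
|     ...  # handle the request |
| ``` |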
|
|
| ### 3. Input Validation |
| ```python |
| # Validate query length |
| if len(query) > 10000: |
|     raise ValueError("Query too long (max 10000 chars)") |
| |
| # Check for injection attempts (case-insensitive substring match) |
| lowered = query.lower() |
| if any(x in lowered for x in ["<script>", "drop table"]): |
|     raise ValueError("Suspicious input detected") |
| ``` |
|
|
| ### 4. HTTPS in Production |
| ```bash |
| # Use Let's Encrypt: |
| certbot certonly --standalone -d codette.example.com |
| |
| # Configure in inference/codette_server.py: |
| uvicorn.run(app, host="0.0.0.0", port=443, |
| ssl_keyfile="/etc/letsencrypt/live/codette.example.com/privkey.pem", |
| ssl_certfile="/etc/letsencrypt/live/codette.example.com/fullchain.pem") |
| ``` |
|
|
| --- |
|
|
| ## Post-Deployment Checklist |
|
|
| - [ ] Server starts without errors |
| - [ ] All 3 models available (`/api/models`) |
| - [ ] All 8 adapters loaded |
| - [ ] Simple query returns response in <5 sec |
| - [ ] Complex query (max_adapters=8) returns response in <10 sec |
| - [ ] Correctness benchmark still shows 78.6%+ |
| - [ ] No errors in logs |
| - [ ] Memory stable after 1 hour of operation |
| - [ ] GPU utilization efficient (not pegged at 100%) |
| - [ ] Health endpoint responds |
| - [ ] Can toggle between models without restart |
| |
| --- |
| |
| ## Rollback Procedure |
| |
| If anything goes wrong: |
| |
| ```bash |
| # Stop the server: press Ctrl+C in the terminal where it is running |
| |
| # Check last error: |
| tail -20 reasoning_forge/.logs/codette_errors.log |
| |
| # Revert to last known-good config: |
| git checkout inference/codette_server.py |
|
|
| # Or use previous model: |
| export CODETTE_MODEL_PATH="models/base/llama-3.2-1b-instruct-q8_0.gguf" |
| |
| # Restart: |
| python inference/codette_server.py |
| ``` |
| |
| --- |
| |
| ## Support & Further Help |
| |
| For issues: |
| 1. Check **Troubleshooting** section above |
| 2. Review `MODEL_SETUP.md` for model-specific issues |
| 3. Check logs: `reasoning_forge/.logs/` |
| 4. Run tests: `pytest test_*.py -v` |
| 5. Consult `SESSION_14_VALIDATION_REPORT.md` for architecture details |
| |
| --- |
| |
| **Status**: Production Ready ✅ |
| **Last Updated**: 2026-03-20 |
| **Models Included**: 3 (Llama 3.1 8B Q4, Llama 3.2 1B, Llama 3.1 8B F16) |
| **Adapters**: 8 specialized LoRA weights |
| **Expected Correctness**: 78.6% (validation passing) |
| |
| |