Codette Production Deployment Guide
Overview
This guide walks through deploying Codette's reasoning engine to production with pre-configured GGUF models and LoRA adapters.
Status: Production-Ready ✅
Current Correctness: 78.6% (target: 70%+)
Test Suite: 52/52 passing
Architecture: 7-layer consciousness stack (Session 13-14)
Pre-Deployment Checklist
- Hardware: Min 8GB RAM, 5GB disk (see specs below)
- Python: 3.8+ installed (python --version)
- Git: Repository cloned
- Ports: 7860 available (or reconfigure)
- Network: For API calls (optional HuggingFace token)
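The checklist above can be sketched as a small pre-flight script. This is a hypothetical helper, not part of the repository; the port number and disk threshold mirror the requirements listed here:

```python
import shutil
import socket
import sys

def preflight(port: int = 7860, min_disk_gb: float = 5.0) -> dict:
    """Run basic pre-deployment checks from the checklist above."""
    checks = {}
    # Python 3.8+ installed
    checks["python"] = sys.version_info >= (3, 8)
    # At least 5 GB of free disk space
    free_gb = shutil.disk_usage(".").free / 1e9
    checks["disk"] = free_gb >= min_disk_gb
    # Port 7860 available (bind succeeds only if the port is free)
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind(("127.0.0.1", port))
        checks["port"] = True
    except OSError:
        checks["port"] = False
    finally:
        sock.close()
    return checks

if __name__ == "__main__":
    for name, ok in preflight().items():
        print(f"{'✓' if ok else '✗'} {name}")
```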
Step 1: Environment Setup
1.1 Clone Repository
git clone https://github.com/YOUR_USERNAME/codette-reasoning.git
cd codette-reasoning
1.2 Create Virtual Environment (Recommended)
python -m venv venv
# Activate
# On Linux/Mac:
source venv/bin/activate
# On Windows:
venv\Scripts\activate
1.3 Install Dependencies
pip install --upgrade pip
pip install -r requirements.txt
Expected output: All packages install without errors
Step 2: Verify Models & Adapters
2.1 Check Model Files
ls -lh models/base/
# Should show:
# - Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf (4.6GB)
# - llama-3.2-1b-instruct-q8_0.gguf (1.3GB)
# - Meta-Llama-3.1-8B-Instruct.F16.gguf (3.4GB)
2.2 Check Adapters
ls -lh adapters/
# Should show 8 .gguf files (27MB each)
2.3 Verify Model Loader
python -c "
from inference.model_loader import ModelLoader
loader = ModelLoader()
models = loader.list_available_models()
print(f'Found {len(models)} models')
for m in models:
    print(f' - {m}')
"
# Expected: Found 3 models
Step 3: Run Tests (Pre-Flight Check)
3.1 Run Core Integration Tests
python -m pytest test_integration.py -v
# Expected: All passed
python -m pytest test_tier2_integration.py -v
# Expected: 18 passed
python -m pytest test_integration_phase6.py -v
# Expected: 7 passed
3.2 Run Correctness Benchmark
python correctness_benchmark.py
# Expected output:
# Phase 6+13+14 accuracy: 78.6%
# Meta-loops reduced: 90% → 5%
If any test fails: See "Troubleshooting" section below
Step 4: Configure for Your Hardware
Option A: Default (Llama 3.1 8B Q4 + GPU)
# Automatic - GPU acceleration enabled
python inference/codette_server.py
Option B: CPU-Only (Lightweight)
# Use Llama 3.2 1B model
export CODETTE_MODEL_PATH="models/base/llama-3.2-1b-instruct-q8_0.gguf"
export CODETTE_GPU_LAYERS=0
python inference/codette_server.py
Option C: Maximum Quality (Llama 3.1 8B F16)
# Use full-precision model (slower, higher quality)
export CODETTE_MODEL_PATH="models/base/Meta-Llama-3.1-8B-Instruct.F16.gguf"
python inference/codette_server.py
Option D: Custom Configuration
Edit inference/codette_server.py line ~50:
MODEL_CONFIG = {
    "model_path": "models/base/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
    "n_gpu_layers": 32,    # Increase/decrease based on GPU VRAM
    "n_threads": 8,        # CPU parallel threads
    "n_ctx": 2048,         # Context window (tokens)
    "temperature": 0.7,    # 0.0=deterministic, 1.0=creative
    "top_k": 40,           # Top-K sampling
    "top_p": 0.95,         # Nucleus sampling
}
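Options B and C rely on environment variables. A minimal sketch of how such overrides could be merged on top of MODEL_CONFIG (the variable names match this guide, but the merge logic is illustrative, not the server's actual code):

```python
import os

# Defaults mirroring the MODEL_CONFIG shown above
DEFAULTS = {
    "model_path": "models/base/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
    "n_gpu_layers": 32,
    "n_threads": 8,
    "n_ctx": 2048,
}

def resolve_config(env: dict) -> dict:
    """Apply CODETTE_* environment overrides on top of the defaults."""
    config = dict(DEFAULTS)
    if "CODETTE_MODEL_PATH" in env:
        config["model_path"] = env["CODETTE_MODEL_PATH"]
    if "CODETTE_GPU_LAYERS" in env:
        config["n_gpu_layers"] = int(env["CODETTE_GPU_LAYERS"])
    return config

# Example: the CPU-only setup from Option B
cpu_cfg = resolve_config({
    "CODETTE_MODEL_PATH": "models/base/llama-3.2-1b-instruct-q8_0.gguf",
    "CODETTE_GPU_LAYERS": "0",
})
print(cpu_cfg["n_gpu_layers"])  # → 0
```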
Step 5: Start Server
5.1 Launch
python inference/codette_server.py
Expected output:
Loading model: models/base/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf...
Loading adapters from: adapters/
✓ consciousness-lora-f16.gguf
✓ davinci-lora-f16.gguf
✓ empathy-lora-f16.gguf
✓ guardian-spindle (logical validation)
✓ colleen-conscience (ethical validation)
Starting server on http://0.0.0.0:7860
Ready for requests!
5.2 Check Server Health
# In another terminal:
curl http://localhost:7860/api/health
# Expected response:
# {"status": "ready", "version": "14.0", "model": "Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf"}
Step 6: Test Live Queries
6.1 Simple Query
curl -X POST http://localhost:7860/api/chat \
-H "Content-Type: application/json" \
-d '{
"query": "What is quantum computing?",
"max_adapters": 3
}'
Expected: Multi-perspective response with 3 adapters active
6.2 Complex Reasoning Query
curl -X POST http://localhost:7860/api/chat \
-H "Content-Type: application/json" \
-d '{
"query": "Should we implement AI for hiring decisions? Provide ethical analysis.",
"max_adapters": 8
}'
Expected: Full consciousness stack (7 layers + ethical validation)
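The same calls can be issued from Python using only the standard library. A sketch that builds the request (endpoint and fields taken from the curl examples above; sending it requires the server from Step 5 to be running):

```python
import json
import urllib.request

def build_chat_request(query: str, max_adapters: int = 3) -> urllib.request.Request:
    """Build a POST request matching the /api/chat calls above."""
    payload = json.dumps({"query": query, "max_adapters": max_adapters})
    return urllib.request.Request(
        "http://localhost:7860/api/chat",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("What is quantum computing?", max_adapters=3)
# To actually send it (server must be running):
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read()))
```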
6.3 Web Interface
Visit: http://localhost:7860
Step 7: Performance Validation
7.1 Check Latency
time python -c "
from inference.codette_forge_bridge import CodetteForgeBridge
bridge = CodetteForgeBridge()
response = bridge.reason('Explain photosynthesis')
print(f'Response: {response[:100]}...')
"
# Note execution time
7.2 Monitor Memory Usage
# During server run, in another terminal:
# Linux/Mac:
watch -n 1 'ps aux | grep codette_server'
# Windows:
Get-Process -Name python
7.3 Validate Adapter Activity
python -c "
from reasoning_forge.forge_engine import ForgeEngine
engine = ForgeEngine()
adapters = engine.get_loaded_adapters()
print(f'Active adapters: {len(adapters)}/8')
for adapter in adapters:
    print(f' ✓ {adapter}')
"
Production Deployment Patterns
Pattern 1: Local Development
# Simple one-liner for local testing
python inference/codette_server.py
Pattern 2: Docker Container
FROM python:3.10-slim
WORKDIR /app
COPY . .
RUN pip install -r requirements.txt
EXPOSE 7860
CMD ["python", "inference/codette_server.py"]
docker build -t codette:latest .
docker run -p 7860:7860 codette:latest
Pattern 3: Kubernetes Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: codette
spec:
  replicas: 2
  selector:
    matchLabels:
      app: codette
  template:
    metadata:
      labels:
        app: codette
    spec:
      containers:
      - name: codette
        image: codette:latest
        ports:
        - containerPort: 7860
        resources:
          limits:
            memory: "16Gi"
            nvidia.com/gpu: 1
Pattern 4: Systemd Service (Linux)
Create /etc/systemd/system/codette.service:
[Unit]
Description=Codette Reasoning Engine
After=network.target
[Service]
Type=simple
User=codette
WorkingDirectory=/opt/codette
ExecStart=/usr/bin/python /opt/codette/inference/codette_server.py
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
sudo systemctl start codette
sudo systemctl enable codette
sudo systemctl status codette
Hardware Configuration Guide
Minimal (CPU-Only)
Requirements:
- CPU: i5 or equivalent
- RAM: 8 GB
- Disk: 3 GB
- GPU: None
Setup:
export CODETTE_MODEL_PATH="models/base/llama-3.2-1b-instruct-q8_0.gguf"
export CODETTE_GPU_LAYERS=0
Performance:
- Warmup: 2-3 seconds
- Inference: ~2-5 tokens/sec
- Batch size: 1-2
Standard (GPU-Accelerated)
Requirements:
- CPU: i7 or Ryzen 5+
- RAM: 16 GB
- Disk: 6 GB
- GPU: RTX 3070 or equivalent (8GB VRAM)
Setup:
# Default configuration
python inference/codette_server.py
Performance:
- Warmup: 3-5 seconds
- Inference: ~15-25 tokens/sec
- Batch size: 4-8
High-Performance (Production)
Requirements:
- CPU: Intel Xeon / AMD Ryzen 9
- RAM: 32 GB
- Disk: 10 GB (SSD recommended)
- GPU: RTX 4090 or A100 (24GB+ VRAM)
Setup:
export CODETTE_GPU_LAYERS=80 # Max acceleration
export CODETTE_BATCH_SIZE=16
python inference/codette_server.py
Performance:
- Warmup: 4-6 seconds
- Inference: ~80-120 tokens/sec
- Batch size: 16-32
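The CODETTE_GPU_LAYERS values above scale roughly with available VRAM. A hypothetical helper sketching that mapping (the thresholds are illustrative, derived from the three tiers listed here, not measured limits):

```python
def suggest_gpu_layers(vram_gb: float) -> int:
    """Map available VRAM to a CODETTE_GPU_LAYERS value (illustrative)."""
    if vram_gb <= 0:
        return 0    # no GPU: CPU-only tier
    if vram_gb < 8:
        return 16   # partial offload for small cards
    if vram_gb < 24:
        return 32   # standard tier (e.g. 8 GB RTX 3070)
    return 80       # high-performance tier (24 GB+ cards)

print(suggest_gpu_layers(8))   # → 32
print(suggest_gpu_layers(24))  # → 80
```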
Troubleshooting
Issue: "CUDA device not found"
# Verify GPU availability
python -c "import torch; print(torch.cuda.is_available())"
# If False, switch to CPU:
export CODETTE_GPU_LAYERS=0
python inference/codette_server.py
Issue: "out of memory" error
# Reduce GPU layer allocation
export CODETTE_GPU_LAYERS=16 # Try 16 instead of 32
# Or use smaller model:
export CODETTE_MODEL_PATH="models/base/llama-3.2-1b-instruct-q8_0.gguf"
# Check current memory usage:
nvidia-smi # For GPU
free -h # For system RAM
Issue: Model loads slowly
# The model is read from disk into memory on first load - this is normal
# Actual startup time: 3-6 seconds depending on GPU
# If permanently slow:
# 1. Check disk speed:
hdparm -t /dev/sda # Linux example
# 2. Move models to SSD if on HDD:
cp -r models/ /mnt/ssd/codette/
export CODETTE_MODEL_ROOT="/mnt/ssd/codette/models"
Issue: Test failures
# Run individual test with verbose output:
python -m pytest test_tier2_integration.py::test_intent_analysis_low_risk -vv
# Check imports:
python -c "from reasoning_forge.forge_engine import ForgeEngine; print('OK')"
# If import fails, reinstall:
pip install --force-reinstall --no-cache-dir -r requirements.txt
Issue: Adapters not loading
# Verify adapter files:
ls -lh adapters/
# Should show 8 .gguf files
# Check adapter loading:
python -c "
from reasoning_forge.forge_engine import ForgeEngine
engine = ForgeEngine()
print(f'Loaded: {len(engine.adapters)} adapters')
"
# If 0 adapters, check file permissions:
chmod 644 adapters/*.gguf
Issue: API returns 500 errors
# Check server logs:
tail -f reasoning_forge/.logs/codette_errors.log
# Test with simpler query:
curl -X POST http://localhost:7860/api/chat \
-H "Content-Type: application/json" \
-d '{"query": "test"}'
# Check if Colleen/Guardian validation is blocking:
# Edit inference/codette_server.py and disable validation temporarily
Monitoring & Observability
Health Checks
# Every 30 seconds:
watch -n 30 curl http://localhost:7860/api/health
# In production, use automated monitoring:
# Example: Prometheus metrics endpoint
curl http://localhost:7860/metrics
Log Inspection
# Application logs:
tail -f reasoning_forge/.logs/codette_reflection_journal.json
# Error logs:
grep ERROR reasoning_forge/.logs/codette_errors.log
# Performance metrics:
cat observatory_metrics.json | jq '.latency[]'
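If observatory_metrics.json holds a flat list of latencies as the jq query suggests, percentiles can be summarized with the standard library. The file layout here is assumed, not confirmed:

```python
import statistics

def latency_summary(latencies_ms: list) -> dict:
    """Summarize request latencies, e.g. from observatory_metrics.json."""
    pct = statistics.quantiles(latencies_ms, n=100)  # 99 cut points
    return {
        "mean": statistics.mean(latencies_ms),
        "p50": pct[49],
        "p95": pct[94],
        "max": max(latencies_ms),
    }

# Example with synthetic numbers (milliseconds)
summary = latency_summary([120, 140, 135, 150, 900, 130, 145, 138, 142, 133])
print(f"p95: {summary['p95']:.0f} ms")
```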
Resource Monitoring
# GPU utilization:
nvidia-smi -l 1
# System load:
top # Or Activity Monitor on macOS, Task Manager on Windows
# Memory per process:
ps aux | grep codette_server
Scaling & Load Testing
Load Test 1: Sequential Requests
for i in {1..100}; do
  curl -s -X POST http://localhost:7860/api/chat \
    -H "Content-Type: application/json" \
    -d '{"query": "test query '$i'"}' > /dev/null
  echo "Request $i/100"
done
Load Test 2: Concurrent Requests
# Using GNU Parallel:
seq 1 50 | parallel -j 4 'curl -s http://localhost:7860/api/health'
# Or using Apache Bench:
ab -n 100 -c 10 http://localhost:7860/api/health
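A pure-Python alternative to GNU Parallel, using a thread pool. The worker here is a stand-in; in practice it would issue the HTTP request from the curl examples above:

```python
from concurrent.futures import ThreadPoolExecutor

def hit_endpoint(i: int) -> int:
    """Stand-in worker; replace with an HTTP call to /api/health."""
    # e.g. urllib.request.urlopen("http://localhost:7860/api/health")
    return i

# 50 requests, 4 at a time - mirrors the parallel -j 4 example above
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(hit_endpoint, range(1, 51)))

print(f"Completed {len(results)} requests")
```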
Expected Performance
- Llama 3.1 8B Q4 + RTX 3090: 50-60 req/min sustained
- Llama 3.2 1B + CPU: 5-10 req/min sustained
Security Considerations
1. API Authentication (TODO for production)
# Add in inference/codette_server.py:
from fastapi import Header, HTTPException

@app.post("/api/chat")
def chat_with_auth(request, token: str = Header(None)):
    if token != os.getenv("CODETTE_API_TOKEN"):
        raise HTTPException(status_code=401, detail="Invalid token")
    # Process request
2. Rate Limiting
from slowapi import Limiter
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)

@app.post("/api/chat")
@limiter.limit("10/minute")
def chat(request):
    # ...
3. Input Validation
# Validate query length
if len(query) > 10000:
    raise ValueError("Query too long (max 10000 chars)")

# Check for injection attempts (case-insensitive)
if any(x in query.lower() for x in ["<script>", "drop table"]):
    raise ValueError("Suspicious input detected")
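The two checks above can be wrapped into a single helper. This is a sketch; the limit and blocklist come from the snippet above and are not a complete defense against injection:

```python
MAX_QUERY_LEN = 10000
BLOCKLIST = ("<script>", "drop table")

def validate_query(query: str) -> str:
    """Validate an /api/chat query before processing (illustrative)."""
    if len(query) > MAX_QUERY_LEN:
        raise ValueError(f"Query too long (max {MAX_QUERY_LEN} chars)")
    lowered = query.lower()
    if any(bad in lowered for bad in BLOCKLIST):
        raise ValueError("Suspicious input detected")
    return query

validate_query("What is quantum computing?")  # passes unchanged
```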
4. HTTPS in Production
# Use Let's Encrypt:
certbot certonly --standalone -d codette.example.com
# Configure in inference/codette_server.py:
uvicorn.run(app, host="0.0.0.0", port=443,
ssl_keyfile="/etc/letsencrypt/live/codette.example.com/privkey.pem",
ssl_certfile="/etc/letsencrypt/live/codette.example.com/fullchain.pem")
Post-Deployment Checklist
- Server starts without errors
- All 3 models available (/api/models)
- All 8 adapters loaded
- Simple query returns response in <5 sec
- Complex query (max_adapters=8) returns response in <10 sec
- Correctness benchmark still shows 78.6%+
- No errors in logs
- Memory stable after 1 hour of operation
- GPU utilization efficient (not pegged at 100%)
- Health endpoint responds
- Can toggle between models without restart
Rollback Procedure
If anything goes wrong:
# Stop the server (Ctrl+C)
# Check last error:
tail -20 reasoning_forge/.logs/codette_errors.log
# Revert to last known-good config:
git checkout inference/codette_server.py
# Or use previous model:
export CODETTE_MODEL_PATH="models/base/llama-3.2-1b-instruct-q8_0.gguf"
# Restart:
python inference/codette_server.py
Support & Further Help
For issues:
- Check the Troubleshooting section above
- Review MODEL_SETUP.md for model-specific issues
- Check logs: reasoning_forge/.logs/
- Run tests: pytest test_*.py -v
- Consult SESSION_14_VALIDATION_REPORT.md for architecture details
Status: Production Ready ✅
Last Updated: 2026-03-20
Models Included: 3 (Llama 3.1 8B Q4, Llama 3.2 1B, Llama 3.1 8B F16)
Adapters: 8 specialized LoRA weights
Expected Correctness: 78.6% (validation passing)