
BlitzKode Production Runbook

This runbook captures the operational path for serving BlitzKode as a local or self-hosted coding assistant.

1. Release artifacts

Expected production artifacts:

  • blitzkode.gguf — local GGUF model mounted into the container at /app/blitzkode.gguf.
  • Docker image built from Dockerfile — includes server.py and Python dependencies only.
  • Optional HuggingFace repos:
    • neuralbroker/blitzkode — GGUF distribution repo.
    • neuralbroker/blitzkode-1.5b-lora — 1.5B adapter repo.
    • neuralbroker/blitzkode-lora-0.5b — 0.5B adapter repo.

Do not commit model weights, checkpoints, .env files, or HuggingFace tokens to git.

2. Required environment

Minimum runtime:

  • Python 3.11+ when running directly.
  • Docker 24+ when running in containers.
  • 4 GB+ RAM for the Q8_0 1.5B GGUF artifact.
  • Optional NVIDIA container toolkit for GPU offload.

Key server variables:

| Variable | Production guidance |
| --- | --- |
| BLITZKODE_MODEL_PATH | Set to /app/blitzkode.gguf in Docker or an absolute local path outside Docker. |
| BLITZKODE_PRELOAD_MODEL | Use true for production so startup fails fast if the model cannot load. |
| BLITZKODE_API_KEY | Set a strong bearer token for any network-accessible deployment. |
| BLITZKODE_CORS_ORIGINS | Restrict to trusted API client origins instead of *. |
| BLITZKODE_RATE_LIMIT | Keep true unless running behind another trusted limiter. |
| BLITZKODE_RATE_LIMIT_MAX | Tune based on expected users and hardware. |
| BLITZKODE_WEB_SEARCH | Set false for fully offline operation; keep true for research mode. |
| BLITZKODE_GPU_LAYERS | 0 for CPU only, -1 for all possible layers on GPU, or tune gradually. |
| BLITZKODE_N_CTX | Start with 2048; increase to 4096 or higher only if memory allows. |
| BLITZKODE_BATCH / BLITZKODE_UBATCH | Start with 256 / 128; increase only after latency and memory checks. |
| BLITZKODE_PROMPT_CACHE | Keep true for repeated system/history prefixes if supported by the installed llama-cpp-python. |
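
The guidance above can be turned into a pre-flight check that runs before the server starts. The following is a minimal sketch, not part of BlitzKode: `check_env` is a hypothetical helper, and it assumes that an unset BLITZKODE_CORS_ORIGINS behaves like `*`, which this runbook does not confirm.

```python
import os


def check_env(env):
    """Return a list of warnings for a mapping of environment variables.

    Hypothetical helper: the rules mirror the production guidance in this
    runbook, not the server's actual validation logic.
    """
    warnings = []
    if not os.path.isabs(env.get("BLITZKODE_MODEL_PATH", "")):
        warnings.append("BLITZKODE_MODEL_PATH should be an absolute path")
    if env.get("BLITZKODE_PRELOAD_MODEL", "").lower() != "true":
        warnings.append("BLITZKODE_PRELOAD_MODEL should be true in production")
    if not env.get("BLITZKODE_API_KEY"):
        warnings.append("BLITZKODE_API_KEY is not set")
    # Assumption: an unset value is treated the same as "*".
    if env.get("BLITZKODE_CORS_ORIGINS", "*").strip() == "*":
        warnings.append("BLITZKODE_CORS_ORIGINS should list trusted origins")
    return warnings
```

Calling `check_env(os.environ)` in a deployment script and refusing to continue on a non-empty result is one way to apply it.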

3. Pre-deployment validation

Run these checks before tagging or deploying a release:

python -m pytest tests/ -v
python -m ruff check .
python -m mypy server.py --ignore-missing-imports
docker build -t blitzkode:ci .

For CI smoke tests without the real model, start the container with BLITZKODE_PRELOAD_MODEL=false and verify /health returns HTTP 200.
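
A CI job usually has to wait for the container to come up before checking /health. The sketch below polls the endpoint with the standard library only; `http_status` and `wait_for_health` are hypothetical names, and the attempt count and delay are illustrative defaults rather than project settings.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout=5.0):
    """Return the HTTP status code for a GET, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status.
        return exc.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return None


def wait_for_health(url, attempts=30, delay=1.0, fetch=http_status):
    """Poll url until it answers 200; return False after all attempts fail."""
    for attempt in range(attempts):
        if fetch(url) == 200:
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

For example, `wait_for_health("http://localhost:7860/health")` returns True once the service responds with HTTP 200. The `fetch` parameter exists so the loop can be tested without a running server.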

4. CPU Docker deployment

Place blitzkode.gguf next to docker-compose.yml, then run:

docker compose up --build -d

The default compose service mounts the model read-only at /app/blitzkode.gguf and exposes the app on http://localhost:7860.

Check service state:

docker compose ps
docker compose logs --tail=100 blitzkode
curl -sf http://localhost:7860/health
curl -sf http://localhost:7860/info

A healthy deployment should report:

  • status is healthy when the model file exists.
  • model_exists is true.
  • last_error is empty or null.
  • batch, ubatch, and thread settings match the intended deployment profile.
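
The first three conditions can be checked mechanically. This is a minimal sketch that assumes status, model_exists, and last_error are top-level keys of the /health JSON response; the runbook names the fields but not their exact location, so adjust the lookups if they live under /info instead. `health_problems` is a hypothetical helper.

```python
import json


def health_problems(payload):
    """Return a list of reasons a parsed health payload looks unhealthy."""
    problems = []
    if payload.get("status") != "healthy":
        problems.append(f"status is {payload.get('status')!r}, expected 'healthy'")
    if payload.get("model_exists") is not True:
        problems.append("model_exists is not true")
    # Empty string and null both count as "no error".
    if payload.get("last_error"):
        problems.append(f"last_error is set: {payload['last_error']}")
    return problems


# Example with a hand-written payload, not captured server output.
sample = json.loads('{"status": "healthy", "model_exists": true, "last_error": null}')
print(health_problems(sample))
```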

5. GPU Docker deployment

Prerequisites:

  1. NVIDIA driver installed on the host.
  2. nvidia-container-toolkit installed.
  3. Docker configured for the NVIDIA runtime.
  4. A llama-cpp-python build with compatible GPU acceleration.

Start the GPU profile:

BLITZKODE_GPU_LAYERS=35 docker compose --profile gpu up --build -d

If startup fails or inference crashes, lower BLITZKODE_GPU_LAYERS and restart. Use 0 to force CPU-only fallback.
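
One way to lower the value systematically is to halve it on each failed attempt until reaching 0. The halving policy is an assumption made for illustration, not something BlitzKode prescribes, and `gpu_layer_fallbacks` is a hypothetical helper.

```python
def gpu_layer_fallbacks(start):
    """Return layer counts to try in order, halving from start down to 0.

    Expects a concrete, non-negative count; -1 ("all layers") cannot be
    halved without knowing how many layers the model has.
    """
    if start < 0:
        raise ValueError("start must be a concrete, non-negative layer count")
    schedule = []
    layers = start
    while layers > 0:
        schedule.append(layers)
        layers //= 2
    schedule.append(0)  # final CPU-only fallback
    return schedule


print(gpu_layer_fallbacks(35))  # [35, 17, 8, 4, 2, 1, 0]
```

Each value would be exported as BLITZKODE_GPU_LAYERS before restarting the service and re-running the health checks.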

6. Direct local deployment

For non-container operation:

pip install -r requirements.txt
BLITZKODE_MODEL_PATH="$PWD/blitzkode.gguf" BLITZKODE_PRELOAD_MODEL=true python server.py

On Windows, set the variables first using the shell's own syntax (for example, $env:NAME = "value" in PowerShell or set NAME=value in cmd.exe), then run python server.py.

7. Health checks and smoke tests

Recommended checks after each deployment:

curl -sf http://localhost:7860/health
curl -sf http://localhost:7860/info
curl -sf -X POST http://localhost:7860/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt":"Return a short Python hello-world function.","max_tokens":64}'

If BLITZKODE_API_KEY is configured, include Authorization: Bearer <token> on protected requests.
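
The same /generate smoke test can be scripted in Python with the bearer header attached when a key is configured. This is a sketch using only the standard library; `build_generate_request` is a hypothetical helper, and the request body reuses the prompt and max_tokens fields from the curl example above.

```python
import json
import urllib.request


def build_generate_request(base_url, prompt, max_tokens=64, api_key=None):
    """Build a POST request for /generate, adding a bearer token if given."""
    body = json.dumps({"prompt": prompt, "max_tokens": max_tokens}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        f"{base_url}/generate", data=body, headers=headers, method="POST"
    )
```

Passing the result to `urllib.request.urlopen(req, timeout=120)` sends it. Reading the key from `os.environ.get("BLITZKODE_API_KEY")` keeps it out of the script itself.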

8. Rollback plan

Rollback should be artifact-based and fast:

  1. Keep the last known-good Docker image tag available locally or in the registry.
  2. Keep the last known-good blitzkode.gguf artifact available outside the container.
  3. Stop the current service.
  4. Restore the previous image tag and/or previous model file.
  5. Start the service and run the health checks from section 7.

Example container rollback flow:

docker compose down
docker tag blitzkode:previous blitzkode:latest
docker compose up -d
curl -sf http://localhost:7860/health

9. HuggingFace publishing

Use a token only through environment variables or CI secrets:

HF_TOKEN=hf_xxx python scripts/push_all_to_hub.py

Before publishing, confirm:

  • blitzkode.gguf exists and loads locally.
  • Adapter directories contain adapter_config.json and adapter weights.
  • MODEL_CARD.md, README.md, and datasets/MANIFEST.md match the artifact versions.
  • The token has write access to the intended repos.

Never paste real tokens into documentation, committed scripts, or issue comments.
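
A publishing script can enforce this by reading the token only from the environment and masking it in any output. The sketch below is illustrative and does not describe what scripts/push_all_to_hub.py actually does; `read_hf_token` and `mask` are hypothetical helpers.

```python
import os


def read_hf_token(env=os.environ):
    """Return HF_TOKEN from the environment, or raise if it is missing."""
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError("HF_TOKEN is not set; export it or provide it as a CI secret")
    return token


def mask(token):
    """Return a log-safe form that reveals at most the last four characters."""
    if len(token) <= 8:
        # Too short to reveal any part safely.
        return "*" * len(token)
    return "*" * (len(token) - 4) + token[-4:]
```

Logging `mask(token)` instead of the token itself lets operators confirm which credential was used without exposing it.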

10. Common failure modes

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| /health returns degraded | Model file missing from configured path | Mount or copy blitzkode.gguf; verify BLITZKODE_MODEL_PATH. |
| Startup hangs while loading | Large context/batch, or slow CPU or disk | Reduce BLITZKODE_N_CTX / BLITZKODE_BATCH; check disk and RAM. |
| Container exits on first request | llama.cpp cannot load model | Verify GGUF file integrity and llama-cpp-python compatibility. |
| Browser cannot call API | CORS origin mismatch | Set BLITZKODE_CORS_ORIGINS to the deployed UI origin. |
| HTTP 401 | Missing or wrong bearer token | Send Authorization: Bearer <BLITZKODE_API_KEY>. |
| HTTP 429 | Rate limit exceeded | Increase BLITZKODE_RATE_LIMIT_MAX or add an upstream queue/limit policy. |
| Research mode fails | Web search disabled or network blocked | Set BLITZKODE_WEB_SEARCH=true and verify outbound HTTP access. |

11. Operational notes

  • Treat generated code as assistant output, not an automatically trusted patch.
  • Prefer /generate/research for current APIs or documentation-sensitive questions.
  • Keep logs free of prompts if prompts may contain private code or secrets.
  • Rotate BLITZKODE_API_KEY and HuggingFace tokens regularly.
  • Re-run the full validation suite after changing dependencies, model artifacts, or Docker base images.
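
For the note on keeping prompts out of logs, one option is a standard-library logging filter that replaces prompt text with a length-only placeholder. This sketch assumes prompts reach the logger as a `prompt` attribute passed through `extra=`; how server.py actually logs requests is not described here, and `RedactPromptFilter` is a hypothetical class.

```python
import logging


class RedactPromptFilter(logging.Filter):
    """Replace a 'prompt' attribute on log records with its length only."""

    def filter(self, record):
        prompt = getattr(record, "prompt", None)
        if prompt is not None:
            record.prompt = f"<redacted {len(str(prompt))} chars>"
        return True  # keep the record, minus the prompt text


# Attach to a handler so every record passing through it is redacted.
handler = logging.StreamHandler()
handler.addFilter(RedactPromptFilter())
```

Messages that embed the prompt directly in the format string are not covered by this approach and would need to be changed at the call site.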