# BlitzKode Production Runbook

This runbook captures the operational path for serving BlitzKode as a local or self-hosted coding assistant.

## 1. Release artifacts

Expected production artifacts:

- `blitzkode.gguf` — local GGUF model mounted into the container at `/app/blitzkode.gguf`.
- Docker image built from `Dockerfile` — includes `server.py` and Python dependencies only.
- Optional HuggingFace repos:
  - `neuralbroker/blitzkode` — GGUF distribution repo.
  - `neuralbroker/blitzkode-1.5b-lora` — 1.5B adapter repo.
  - `neuralbroker/blitzkode-lora-0.5b` — 0.5B adapter repo.

Do not commit model weights, checkpoints, `.env` files, or HuggingFace tokens to git.

## 2. Required environment

Minimum runtime:

- Python 3.11+ when running directly.
- Docker 24+ when running in containers.
- 4 GB+ RAM for the Q8_0 1.5B GGUF artifact.
- Optional NVIDIA container toolkit for GPU offload.

Key server variables:

| Variable | Production guidance |
|---|---|
| `BLITZKODE_MODEL_PATH` | Set to `/app/blitzkode.gguf` in Docker or an absolute local path outside Docker. |
| `BLITZKODE_PRELOAD_MODEL` | Use `true` for production so startup fails fast if the model cannot load. |
| `BLITZKODE_API_KEY` | Set a strong bearer token for any network-accessible deployment. |
| `BLITZKODE_CORS_ORIGINS` | Restrict to trusted API client origins instead of `*`. |
| `BLITZKODE_RATE_LIMIT` | Keep `true` unless running behind another trusted limiter. |
| `BLITZKODE_RATE_LIMIT_MAX` | Tune based on expected users and hardware. |
| `BLITZKODE_WEB_SEARCH` | Set `false` for fully offline operation; keep `true` for research mode. |
| `BLITZKODE_GPU_LAYERS` | `0` for CPU only, `-1` for all possible layers on GPU, or tune gradually. |
| `BLITZKODE_N_CTX` | Start with `2048`; increase to `4096` or higher only if memory allows. |
| `BLITZKODE_BATCH` / `BLITZKODE_UBATCH` | Start with `256` / `128`; increase only after latency and memory checks. |
| `BLITZKODE_PROMPT_CACHE` | Keep `true` for repeated system/history prefixes if supported by the installed `llama-cpp-python`. |
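
As a rough pre-flight aid, the guidance in the table can be turned into a small check that runs before the server starts. The sketch below is illustrative only: `check_production_env` is a hypothetical helper, not part of `server.py`, and it assumes `BLITZKODE_CORS_ORIGINS` holds a comma-separated list of origins.

```python
import os


def check_production_env(env):
    """Return warnings for settings the table above treats as risky.

    `env` is any mapping of variable names to strings, e.g. os.environ.
    Hypothetical helper for illustration; the server itself may apply
    different defaults when a variable is unset.
    """
    warnings = []
    if not env.get("BLITZKODE_MODEL_PATH"):
        warnings.append("BLITZKODE_MODEL_PATH is not set")
    if env.get("BLITZKODE_PRELOAD_MODEL", "").strip().lower() != "true":
        warnings.append("BLITZKODE_PRELOAD_MODEL is not 'true'")
    if not env.get("BLITZKODE_API_KEY"):
        warnings.append("BLITZKODE_API_KEY is not set")
    # Assumption: origins are comma-separated.
    origins = [o.strip() for o in env.get("BLITZKODE_CORS_ORIGINS", "").split(",")]
    if "*" in origins or origins == [""]:
        warnings.append("BLITZKODE_CORS_ORIGINS is unset or allows '*'")
    return warnings
```

Calling `check_production_env(os.environ)` in a deploy script and refusing to continue on a non-empty result is one possible use; whether a warning should block a deploy is a local policy decision.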

## 3. Pre-deployment validation

Run these checks before tagging or deploying a release:

```bash
python -m pytest tests/ -v
python -m ruff check .
python -m mypy server.py --ignore-missing-imports
docker build -t blitzkode:ci .
```

For CI smoke tests without the real model, start the container with `BLITZKODE_PRELOAD_MODEL=false` and verify `/health` returns HTTP 200.
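
One way to script this smoke test is to poll `/health` until the container answers, since a freshly started container may need a moment before it accepts connections. This is a minimal standard-library sketch; `wait_for_health`, the retry count, and the delay are assumptions for illustration rather than project code.

```python
import time
import urllib.request


def wait_for_health(url, attempts=10, delay=1.0, opener=urllib.request.urlopen):
    """Return True once `url` answers with HTTP 200, False after all attempts.

    `opener` is injectable so the function can be exercised without a server.
    """
    for attempt in range(attempts):
        try:
            with opener(url, timeout=5) as response:
                if response.status == 200:
                    return True
        except OSError:
            # URLError, HTTPError and connection failures all derive from OSError.
            pass
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

In a CI job this could be called as `wait_for_health("http://localhost:7860/health")`, failing the job when it returns `False`.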

## 4. CPU Docker deployment

Place `blitzkode.gguf` next to `docker-compose.yml`, then run:

```bash
docker compose up --build -d
```

The default compose service mounts the model read-only into `/app/blitzkode.gguf` and exposes the app on `http://localhost:7860`.

Check service state:

```bash
docker compose ps
docker compose logs --tail=100 blitzkode
curl -sf http://localhost:7860/health
curl -sf http://localhost:7860/info
```

A healthy deployment should report:

- `status` is `healthy`, which requires the model file to exist at the configured path.
- `model_exists` is `true`.
- `last_error` is empty or `null`.
- `batch`, `ubatch`, and thread settings match the intended deployment profile.
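
The first three checks in this list can be automated once the response has been parsed. The sketch below assumes the `/health` or `/info` payload is a JSON object carrying `status`, `model_exists`, and `last_error` at the top level; the exact payload layout is not specified here, and `find_health_problems` is a hypothetical name.

```python
def find_health_problems(report):
    """Return human-readable problems found in a parsed health report dict."""
    problems = []
    if report.get("status") != "healthy":
        problems.append(f"status is {report.get('status')!r}, expected 'healthy'")
    if report.get("model_exists") is not True:
        problems.append("model_exists is not true")
    if report.get("last_error") not in (None, ""):
        problems.append(f"last_error is set: {report.get('last_error')!r}")
    return problems
```

An empty list means the report matches the expectations above. Batch and thread settings are left out because their expected values depend on the deployment profile.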

## 5. GPU Docker deployment

Prerequisites:

1. NVIDIA driver installed on the host.
2. `nvidia-container-toolkit` installed.
3. Docker configured for the NVIDIA runtime.
4. A `llama-cpp-python` build with compatible GPU acceleration.

Start the GPU profile:

```bash
BLITZKODE_GPU_LAYERS=35 docker compose --profile gpu up --build -d
```

If startup fails or inference crashes, lower `BLITZKODE_GPU_LAYERS` and restart. Use `0` to force CPU-only fallback.
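
If lowering the value by hand becomes tedious, the candidate values can be generated up front. The halving schedule below is only an assumed strategy, not something the project prescribes, and `gpu_layer_fallbacks` is a hypothetical helper.

```python
def gpu_layer_fallbacks(start):
    """Yield successively smaller layer counts, always ending with 0 (CPU only).

    Expects a non-negative explicit count; the special value -1 ("all
    layers") is not interpreted here and simply yields 0.
    """
    layers = start
    while layers > 0:
        yield layers
        layers //= 2  # assumed schedule: halve on every failed start
    yield 0
```

Each yielded value would be exported as `BLITZKODE_GPU_LAYERS` for one restart attempt, stopping at the first value that passes the health checks.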

## 6. Direct local deployment

For non-container operation:

```bash
pip install -r requirements.txt
BLITZKODE_MODEL_PATH=blitzkode.gguf BLITZKODE_PRELOAD_MODEL=true python server.py
```

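The inline `VAR=value command` form above is POSIX shell syntax. On Windows, set the variables first (`$env:BLITZKODE_MODEL_PATH = "blitzkode.gguf"` in PowerShell, `set BLITZKODE_MODEL_PATH=blitzkode.gguf` in `cmd.exe`), then run `python server.py`.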
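
A shell-independent alternative is a small Python launcher that sets the variables itself. This is a sketch under the assumption that `server.py` sits in the current working directory; `build_server_env` and `launch_server` are hypothetical names, not part of the project.

```python
import os
import subprocess
import sys


def build_server_env(model_path, base_env=None):
    """Copy the environment and add the two variables used in the command above."""
    env = dict(os.environ if base_env is None else base_env)
    env["BLITZKODE_MODEL_PATH"] = model_path
    env["BLITZKODE_PRELOAD_MODEL"] = "true"
    return env


def launch_server(model_path):
    """Run server.py with the current interpreter and return its exit code."""
    # sys.executable keeps the launcher inside the active virtual environment.
    return subprocess.call([sys.executable, "server.py"], env=build_server_env(model_path))
```

Calling `launch_server("blitzkode.gguf")` then behaves the same on Windows, macOS, and Linux.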

## 7. Health checks and smoke tests

Recommended checks after each deployment:

```bash
curl -sf http://localhost:7860/health
curl -sf http://localhost:7860/info
curl -sf -X POST http://localhost:7860/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt":"Return a short Python hello-world function.","max_tokens":64}'
```

If `BLITZKODE_API_KEY` is configured, include `Authorization: Bearer <token>` on protected requests.
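
For scripted smoke tests, the same request can be built in Python with the bearer token attached only when one is configured. The sketch assumes the `/generate` payload shape shown in the curl example; `build_generate_request` is a hypothetical helper.

```python
import json
import urllib.request


def build_generate_request(base_url, prompt, max_tokens=64, api_key=None):
    """Build a POST request for /generate, adding the bearer token if given."""
    body = json.dumps({"prompt": prompt, "max_tokens": max_tokens}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        f"{base_url}/generate", data=body, headers=headers, method="POST"
    )
```

Passing the result to `urllib.request.urlopen` sends it. Reading the token from `os.environ` rather than hard-coding it keeps the script safe to commit.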

## 8. Rollback plan

Rollback should be artifact-based and fast:

1. Keep the last known-good Docker image tag available locally or in the registry.
2. Keep the last known-good `blitzkode.gguf` artifact available outside the container.
3. Stop the current service.
4. Restore the previous image tag and/or previous model file.
5. Start the service and run the health checks from section 7.

Example container rollback flow:

```bash
docker compose down
docker tag blitzkode:previous blitzkode:latest
docker compose up -d
curl -sf http://localhost:7860/health
```

## 9. HuggingFace publishing

Use a token only through environment variables or CI secrets:

```bash
HF_TOKEN=hf_xxx python scripts/push_all_to_hub.py
```

Before publishing, confirm:

- `blitzkode.gguf` exists and loads locally.
- Adapter directories contain `adapter_config.json` and adapter weights.
- `MODEL_CARD.md`, `README.md`, and `datasets/MANIFEST.md` match the artifact versions.
- The token has write access to the intended repos.

Never paste real tokens into documentation, committed scripts, or issue comments.
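
The adapter-directory item in the checklist above is easy to verify mechanically. The sketch below assumes the adapter weights use the file names PEFT commonly writes (`adapter_model.safetensors` or `adapter_model.bin`); adjust `WEIGHT_FILENAMES` if the training pipeline saves them differently. `missing_adapter_files` is a hypothetical helper.

```python
from pathlib import Path

# Assumed weight file names; not confirmed by this runbook.
WEIGHT_FILENAMES = ("adapter_model.safetensors", "adapter_model.bin")


def missing_adapter_files(adapter_dir):
    """Return the names of required adapter files absent from `adapter_dir`."""
    directory = Path(adapter_dir)
    missing = []
    if not (directory / "adapter_config.json").is_file():
        missing.append("adapter_config.json")
    if not any((directory / name).is_file() for name in WEIGHT_FILENAMES):
        missing.append("adapter weights")
    return missing
```

An empty result means the directory passes this one check; it does not confirm that the weights load or match the model card.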

## 10. Common failure modes

| Symptom | Likely cause | Fix |
|---|---|---|
| `/health` returns `degraded` | Model file missing from configured path | Mount or copy `blitzkode.gguf`; verify `BLITZKODE_MODEL_PATH`. |
| Startup hangs while loading | Large context/batch, slow CPU, or slow disk load | Reduce `BLITZKODE_N_CTX` / `BLITZKODE_BATCH`, check disk and RAM. |
| Container exits on first request | llama.cpp cannot load model | Verify GGUF file integrity and llama-cpp-python compatibility. |
| Browser cannot call API | CORS origin mismatch | Set `BLITZKODE_CORS_ORIGINS` to the deployed UI origin. |
| HTTP 401 | Missing or wrong bearer token | Send `Authorization: Bearer <BLITZKODE_API_KEY>`. |
| HTTP 429 | Rate limit exceeded | Increase `BLITZKODE_RATE_LIMIT_MAX` or add an upstream queue/limit policy. |
| Research mode fails | Web search disabled or network blocked | Set `BLITZKODE_WEB_SEARCH=true` and verify outbound HTTP access. |

## 11. Operational notes

- Treat generated code as assistant output, not an automatically trusted patch.
- Prefer `/generate/research` for current APIs or documentation-sensitive questions.
- Keep logs free of prompts if prompts may contain private code or secrets.
- Rotate `BLITZKODE_API_KEY` and HuggingFace tokens regularly.
- Re-run the full validation suite after changing dependencies, model artifacts, or Docker base images.
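
To keep prompts out of logs while still being able to correlate requests, one option is to log a length and a short digest instead of the text. This is a sketch of that idea, not the server's actual logging; `describe_prompt` is a hypothetical helper.

```python
import hashlib


def describe_prompt(prompt):
    """Return a log-safe summary: prompt length plus a truncated SHA-256 digest.

    The digest is for correlating log lines only. Very short or guessable
    prompts could still be recovered by brute force, so treat it as
    reduced exposure rather than anonymisation.
    """
    digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:12]
    return f"prompt_len={len(prompt)} prompt_sha256={digest}"
```

A handler would then log `describe_prompt(prompt)` wherever it would otherwise log the prompt itself.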