Commit d5a79fa · verified · committed by neuralbroker · 1 Parent(s): 06b342b

Update clean backend-only project docs and eval

Files changed (1)
  1. README.md +51 -77
README.md CHANGED
@@ -19,7 +19,7 @@ base_model:
19
 
20
  # BlitzKode
21
 
22
- BlitzKode is a local AI coding assistant powered by a fine-tuned Qwen2.5-1.5B-Instruct model. It runs entirely on your machine: no external API calls, no data leaving your device.
23
 
24
  ## Tech Stack
25
 
@@ -30,24 +30,21 @@ BlitzKode is a local AI coding assistant powered by a fine-tuned Qwen2.5-1.5B-In
30
  | Training | HuggingFace Transformers + TRL |
31
  | Inference | llama-cpp-python (GGUF Q8_0) |
32
  | Backend | Python 3.11+, FastAPI, uvicorn |
33
- | Frontend | React 18, Vite, Phosphor Icons |
34
 
35
  ## Features
36
 
37
- - **Local-first** — inference with the bundled GGUF, no cloud dependency
38
- - **Real-time streaming**: SSE token-by-token via `/generate/stream`
39
- - **Web research mode**: DuckDuckGo search → context-augmented generation via `/generate/research`
40
- - **Web search API**: standalone `/search/web` endpoint for raw results
41
- - **React chat UI**: streaming, conversation history, copy controls, research-mode toggle
42
- - **Multi-language**: Python, JavaScript, Java, C++, TypeScript, SQL
43
- - **API key auth + rate limiting**: production-ready security middleware
44
- - **Docker** — multi-stage production image with frontend baked in
45
 
46
  ## Prerequisites
47
 
48
  - Python 3.11+
49
- - Node.js 20+ (for frontend dev/builds only)
50
- - `blitzkode.gguf` at repo root (or set `BLITZKODE_MODEL_PATH`)
51
  - 4 GB+ RAM
52
 
53
  ## Quick Start
@@ -55,26 +52,9 @@ BlitzKode is a local AI coding assistant powered by a fine-tuned Qwen2.5-1.5B-In
55
  ```bash
56
  pip install -r requirements.txt
57
  python server.py
58
- # Open http://localhost:7860
59
- ```
60
-
61
- ## Frontend Development
62
-
63
- ```bash
64
- cd frontend
65
- npm install
66
- npm run dev # http://localhost:5173 — proxies /generate and /health to :7860
67
- ```
68
-
69
- ## Production Frontend Build
70
-
71
- ```bash
72
- cd frontend && npm install && npm run build && cd ..
73
- python server.py
74
  ```
75
 
76
- FastAPI serves `frontend/dist/index.html` and `/assets/*` from the same port.
77
-
78
  ## Docker
79
 
80
  ```bash
@@ -104,7 +84,7 @@ curl -X POST http://localhost:7860/search/web \
104
  -H "Content-Type: application/json" \
105
  -d '{"query":"FastAPI dependency injection","max_results":3}'
106
 
107
- # Research-augmented generation (search → inject → answer)
108
  curl -X POST http://localhost:7860/generate/research \
109
  -H "Content-Type: application/json" \
110
  -d '{"prompt":"How do I use async generators in Python 3.12?","deep_search":true}'
@@ -130,7 +110,7 @@ curl http://localhost:7860/info
130
 
131
  ### Research (`/generate/research`)
132
 
133
- Same as above, plus:
134
 
135
  | Parameter | Type | Default | Description |
136
  |---|---|---|---|
@@ -151,100 +131,94 @@ Same as above, plus:
151
  | Variable | Default | Description |
152
  |---|---|---|
153
  | `BLITZKODE_MODEL_PATH` | `blitzkode.gguf` | GGUF model path |
154
- | `BLITZKODE_FRONTEND_PATH` | `frontend/dist/index.html` | Built frontend |
155
  | `BLITZKODE_HOST` | `0.0.0.0` | Server bind address |
156
  | `BLITZKODE_PORT` | `7860` | Server port |
157
- | `BLITZKODE_GPU_LAYERS` | `0` | GPU layers for llama.cpp |
158
  | `BLITZKODE_N_CTX` | `2048` | Context window |
159
- | `BLITZKODE_THREADS` | auto | CPU worker threads |
160
- | `BLITZKODE_BATCH` | `128` | Batch size |
161
  | `BLITZKODE_MAX_PROMPT_LENGTH` | `4000` | Max prompt chars |
162
  | `BLITZKODE_PRELOAD_MODEL` | `false` | Load model at startup |
163
  | `BLITZKODE_CORS_ORIGINS` | `http://localhost:7860` | CORS origins |
164
  | `BLITZKODE_API_KEY` | empty | Optional bearer token |
165
  | `BLITZKODE_WEB_SEARCH` | `true` | Enable web search endpoints |
166
- | `BLITZKODE_SEARCH_TIMEOUT` | `8` | Search HTTP timeout (s) |
167
  | `BLITZKODE_MAX_SEARCH_RESULTS` | `5` | Max search results |
 
168
  | `BLITZKODE_RATE_LIMIT` | `true` | Enable per-IP rate limiting |
169
  | `BLITZKODE_RATE_LIMIT_MAX` | `30` | Requests per IP per minute |
170
  | `BLITZKODE_MAX_REQUEST_BYTES` | `50000` | Request body size limit |
171
 
172
  ## Training Pipeline
173
 
174
  BlitzKode was fine-tuned through a staged pipeline on an RTX 4060 (8 GB VRAM):
175
 
176
  | Stage | Script | Details |
177
  |---|---|---|
178
- | SFT v1 | `train_sft.py` | LoRA r=32 on 24 curated coding examples |
179
  | Reward-SFT | `train_reward_sft.py` | Reward-heuristic continuation |
180
- | DPO | `train_dpo.py` | 10 chosen/rejected preference pairs |
181
- | SFT v2 | `train_available.py` | LoRA r=16, 100 steps, 99 samples (1.5B) |
182
  | Export | `export_production.py` | Merge → GGUF Q8_0 via llama.cpp |
183
 
184
  ### Re-train from scratch
185
 
186
  ```bash
187
  pip install -r requirements-training.txt
188
-
189
- # Build dataset
190
  python scripts/build_full_dataset.py
191
-
192
- # Train 1.5B LoRA (100 steps, ~5 min on RTX 4060)
193
  python scripts/train_available.py \
194
  --model Qwen/Qwen2.5-1.5B-Instruct \
195
  --quantization none \
196
  --dataset datasets/raw/blitzkode_full_training.json \
197
  --max-steps 100 --seq-len 384 --batch-size 1 --grad-accum 8
198
-
199
- # Export: merge + GGUF
200
  python scripts/export_production.py
201
  ```
202
 
203
- ### Push to HuggingFace
204
-
205
- ```bash
206
- export HF_TOKEN=hf_XXXX # get from https://huggingface.co/settings/tokens
207
- python scripts/push_all_to_hub.py
208
- ```
209
-
210
- This uploads:
211
- - `checkpoints/blitzkode-1.5b-lora/final` → `neuralbroker/blitzkode-1.5b-lora`
212
- - `checkpoints/available-lora-0.5b-full/final` → `neuralbroker/blitzkode-lora-0.5b`
213
- - `blitzkode.gguf` → `neuralbroker/blitzkode`
214
-
215
  ## Project Structure
216
 
217
  ```text
218
  BlitzKode/
219
  server.py FastAPI backend
220
  blitzkode.gguf Local GGUF model (ignored by git)
221
- frontend/ React/Vite web UI
222
- src/App.jsx Chat UI with streaming + research toggle
223
- src/index.css
224
- vite.config.js
225
- scripts/
226
- train_available.py Resource-aware LoRA training
227
- build_full_dataset.py Dataset builder
228
- export_production.py Merge LoRA → GGUF
229
- push_to_hub.py Single-adapter HF push
230
- push_all_to_hub.py Push all artifacts in one command
231
- test_inference.py Adapter smoke test
232
- healthcheck.sh Docker/Compose health probe
233
- tests/test_server.py 20 backend endpoint tests (all passing)
234
  datasets/MANIFEST.md Dataset provenance
235
- docs/PROJECT_OVERVIEW.md Architecture & roadmap
236
- Dockerfile Multi-stage production image
237
  docker-compose.yml CPU + GPU service definitions
238
  requirements.txt Serving dependencies
239
- requirements-training.txt Training dependencies (pinned)
240
  ```
241
 
242
  ## CI
243
 
244
  ```bash
245
- python -m pytest tests/ -v # 20 tests, all pass
246
- python -m ruff check . # lint
247
- npm --prefix frontend run build # frontend build
 
 
248
  ```
249
 
250
  ## License
 
19
 
20
  # BlitzKode
21
 
22
+ BlitzKode is a local, API-first AI coding assistant powered by a fine-tuned Qwen2.5-1.5B-Instruct model. It runs on your machine through `llama-cpp-python` with no external model API calls.
23
 
24
  ## Tech Stack
25
 
 
30
  | Training | HuggingFace Transformers + TRL |
31
  | Inference | llama-cpp-python (GGUF Q8_0) |
32
  | Backend | Python 3.11+, FastAPI, uvicorn |
 
33
 
34
  ## Features
35
 
36
+ - **Local-first inference** with the bundled GGUF model
37
+ - **FastAPI backend only** with `/generate`, `/generate/stream`, `/generate/research`, `/search/web`, `/health`, and `/info`
38
+ - **Real-time streaming** via Server-Sent Events on `/generate/stream`
39
+ - **Web research mode** using DuckDuckGo search context before generation
40
+ - **API key auth, request-size limits, and rate limiting** for production use
41
+ - **Backend/model optimizations**: mmap model loading, configurable GPU layer offload, batch/thread tuning, optional prompt cache, search-result TTL caching, and efficient deque-based rate limiting
42
+ - **Docker** runtime image without Node.js/frontend build steps
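
The "deque-based rate limiting" feature listed above could look roughly like the following. This is a minimal sketch of a per-IP sliding window, not the code in `server.py`; the class name `SlidingWindowRateLimiter` and its parameters are hypothetical, with the default of 30 requests per 60 seconds taken from the `BLITZKODE_RATE_LIMIT_MAX` default.

```python
from __future__ import annotations

import time
from collections import defaultdict, deque


class SlidingWindowRateLimiter:
    """Per-client sliding-window limiter (hypothetical helper, not from server.py)."""

    def __init__(self, max_requests: int = 30, window_seconds: float = 60.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # One deque of request timestamps per client IP, oldest on the left.
        self._hits: dict[str, deque[float]] = defaultdict(deque)

    def allow(self, client_ip: str, now: float | None = None) -> bool:
        """Return True and record the request if the client is under its limit."""
        now = time.monotonic() if now is None else now
        hits = self._hits[client_ip]
        # Drop timestamps that have aged out of the window; popleft is O(1).
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

A deque keeps the expiry step cheap because only the oldest entries are ever removed. A real middleware would also need to evict idle clients so the dictionary does not grow without bound.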
 
43
 
44
  ## Prerequisites
45
 
46
  - Python 3.11+
47
+ - `blitzkode.gguf` at repo root, or set `BLITZKODE_MODEL_PATH`
 
48
  - 4 GB+ RAM
49
 
50
  ## Quick Start
 
52
  ```bash
53
  pip install -r requirements.txt
54
  python server.py
55
+ curl http://localhost:7860/health
56
  ```
57
 
 
 
58
  ## Docker
59
 
60
  ```bash
 
84
  -H "Content-Type: application/json" \
85
  -d '{"query":"FastAPI dependency injection","max_results":3}'
86
 
87
+ # Research-augmented generation
88
  curl -X POST http://localhost:7860/generate/research \
89
  -H "Content-Type: application/json" \
90
  -d '{"prompt":"How do I use async generators in Python 3.12?","deep_search":true}'
 
110
 
111
  ### Research (`/generate/research`)
112
 
113
+ Same as generation, plus:
114
 
115
  | Parameter | Type | Default | Description |
116
  |---|---|---|---|
 
131
  | Variable | Default | Description |
132
  |---|---|---|
133
  | `BLITZKODE_MODEL_PATH` | `blitzkode.gguf` | GGUF model path |
 
134
  | `BLITZKODE_HOST` | `0.0.0.0` | Server bind address |
135
  | `BLITZKODE_PORT` | `7860` | Server port |
136
+ | `BLITZKODE_GPU_LAYERS` | `0` | GPU layers for llama.cpp; use `-1` to offload all supported layers |
137
  | `BLITZKODE_N_CTX` | `2048` | Context window |
138
+ | `BLITZKODE_THREADS` | auto | CPU decode threads |
139
+ | `BLITZKODE_THREADS_BATCH` | auto | CPU prompt-processing threads |
140
+ | `BLITZKODE_BATCH` | `256` | Prompt-processing batch size |
141
+ | `BLITZKODE_UBATCH` | `128` | llama.cpp micro-batch size |
142
+ | `BLITZKODE_PROMPT_CACHE` | `true` | Enable llama.cpp in-memory prompt cache when supported |
143
+ | `BLITZKODE_PROMPT_CACHE_BYTES` | `67108864` | Prompt cache capacity in bytes |
144
+ | `BLITZKODE_USE_MMAP` | `true` | Memory-map the GGUF for faster startup and lower memory pressure |
145
+ | `BLITZKODE_USE_MLOCK` | `false` | Try to lock model pages in RAM |
146
+ | `BLITZKODE_OFFLOAD_KQV` | `true` | Offload K/Q/V operations when GPU layers are enabled |
147
  | `BLITZKODE_MAX_PROMPT_LENGTH` | `4000` | Max prompt chars |
148
  | `BLITZKODE_PRELOAD_MODEL` | `false` | Load model at startup |
149
  | `BLITZKODE_CORS_ORIGINS` | `http://localhost:7860` | CORS origins |
150
  | `BLITZKODE_API_KEY` | empty | Optional bearer token |
151
  | `BLITZKODE_WEB_SEARCH` | `true` | Enable web search endpoints |
152
+ | `BLITZKODE_SEARCH_TIMEOUT` | `8` | Search HTTP timeout in seconds |
153
  | `BLITZKODE_MAX_SEARCH_RESULTS` | `5` | Max search results |
154
+ | `BLITZKODE_SEARCH_CACHE_TTL` | `300` | Search result cache TTL in seconds |
155
  | `BLITZKODE_RATE_LIMIT` | `true` | Enable per-IP rate limiting |
156
  | `BLITZKODE_RATE_LIMIT_MAX` | `30` | Requests per IP per minute |
157
  | `BLITZKODE_MAX_REQUEST_BYTES` | `50000` | Request body size limit |
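
As an illustration of how the variables in this table might be consumed, the sketch below reads a subset of them into a settings dictionary using the documented defaults. The helper names `env_bool` and `load_settings`, the accepted truthy strings, and the dictionary keys (chosen to resemble llama-cpp-python constructor arguments) are assumptions, not taken from `server.py`.

```python
from __future__ import annotations

import os
from collections.abc import Mapping

# Assumption: these strings count as "true"; the real server may parse differently.
_TRUE = {"1", "true", "yes", "on"}


def env_bool(env: Mapping[str, str], name: str, default: bool) -> bool:
    """Parse a boolean environment variable, falling back to the default."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in _TRUE


def load_settings(env: Mapping[str, str] | None = None) -> dict[str, object]:
    """Collect model-loading settings using the defaults from the table above."""
    env = os.environ if env is None else env
    return {
        "model_path": env.get("BLITZKODE_MODEL_PATH", "blitzkode.gguf"),
        "n_gpu_layers": int(env.get("BLITZKODE_GPU_LAYERS", "0")),
        "n_ctx": int(env.get("BLITZKODE_N_CTX", "2048")),
        "n_batch": int(env.get("BLITZKODE_BATCH", "256")),
        "n_ubatch": int(env.get("BLITZKODE_UBATCH", "128")),
        "use_mmap": env_bool(env, "BLITZKODE_USE_MMAP", True),
        "use_mlock": env_bool(env, "BLITZKODE_USE_MLOCK", False),
        "offload_kqv": env_bool(env, "BLITZKODE_OFFLOAD_KQV", True),
    }
```

For example, `load_settings({"BLITZKODE_GPU_LAYERS": "-1"})` would request full GPU offload while leaving every other value at its default.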
158
 
159
+ ## Model Evaluation
160
+
161
+ Latest local GGUF evaluation: **2026-05-16** using `python scripts/evaluate_model.py` on CPU (`n_ctx=2048`, `threads=8`, `batch=256`, `gpu_layers=0`). Full machine-readable results are stored in `docs/evaluation_results.json`.
162
+
163
+ | Eval case | Result | Notes |
164
+ |---|---:|---|
165
+ | Python factorial with negative-input handling | ✅ Pass | Generated a correct iterative implementation with `ValueError` for negative input. |
166
+ | Iterative binary search | ✅ Pass | Generated a valid loop-based search returning index or `-1`. |
167
+ | SQL top users by order count | ✅ Pass | Generated `JOIN`, `GROUP BY`, `ORDER BY`, and `LIMIT 5`. |
168
+ | Unknown fictional API uncertainty | ❌ Fail | The raw model hallucinated a plausible signature for `imaginary_blitz_api`; the backend guard still blocks direct unknown-signature prompts on `/generate` and `/generate/stream`. |
169
+
170
+ Summary: **3 / 4 passed (75%)**. Total generation time was **28.864 s** after a **0.312 s** model load. This is a lightweight heuristic smoke eval, not a comprehensive benchmark: it is useful for regression tracking and quick sanity checks, but generated code should still be reviewed and tested. Future eval work should add executable unit tests for generated code and larger benchmark suites such as HumanEval/MBPP-style tasks.
171
+
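
To make "heuristic smoke eval" concrete, the sketch below shows one way such a check could work: each case passes if the generated text contains a set of required substrings, and the summary line is computed from the pass count. This is an assumption about the general approach, not the contents of `scripts/evaluate_model.py`; `EvalCase`, `passes`, and `summarize` are hypothetical names.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    """A smoke-test case: the output must mention every required token."""
    name: str
    required: tuple[str, ...]


def passes(case: EvalCase, output: str) -> bool:
    """Case-insensitive substring check; it does not execute the generated code."""
    text = output.lower()
    return all(token.lower() in text for token in case.required)


def summarize(results: dict[str, bool]) -> str:
    """Format a pass-rate line in the style of the summary above."""
    passed = sum(results.values())
    total = len(results)
    return f"{passed} / {total} passed ({passed / total:.0%})"


# Hypothetical case mirroring the SQL row in the table above.
sql_case = EvalCase(
    name="SQL top users by order count",
    required=("JOIN", "GROUP BY", "ORDER BY", "LIMIT 5"),
)
```

Substring checks like this are cheap but can pass incorrect code, which is why executable unit tests are the natural next step.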
172
  ## Training Pipeline
173
 
174
  BlitzKode was fine-tuned through a staged pipeline on an RTX 4060 (8 GB VRAM):
175
 
176
  | Stage | Script | Details |
177
  |---|---|---|
178
+ | SFT v1 | `train_sft.py` | LoRA r=32 on curated coding examples |
179
  | Reward-SFT | `train_reward_sft.py` | Reward-heuristic continuation |
180
+ | DPO | `train_dpo.py` | Chosen/rejected preference pairs |
181
+ | SFT v2 | `train_available.py` | LoRA r=16 resource-aware training |
182
  | Export | `export_production.py` | Merge → GGUF Q8_0 via llama.cpp |
183
 
184
  ### Re-train from scratch
185
 
186
  ```bash
187
  pip install -r requirements-training.txt
 
 
188
  python scripts/build_full_dataset.py
 
 
189
  python scripts/train_available.py \
190
  --model Qwen/Qwen2.5-1.5B-Instruct \
191
  --quantization none \
192
  --dataset datasets/raw/blitzkode_full_training.json \
193
  --max-steps 100 --seq-len 384 --batch-size 1 --grad-accum 8
 
 
194
  python scripts/export_production.py
195
  ```
196
 
197
  ## Project Structure
198
 
199
  ```text
200
  BlitzKode/
201
  server.py FastAPI backend
202
  blitzkode.gguf Local GGUF model (ignored by git)
203
+ scripts/ Training, export, evaluation, and utility scripts
204
+ docs/evaluation_results.json Latest local model evaluation output
205
+ tests/test_server.py Backend endpoint tests
206
  datasets/MANIFEST.md Dataset provenance
207
+ docs/ Architecture and production docs
208
+ Dockerfile Python runtime image
209
  docker-compose.yml CPU + GPU service definitions
210
  requirements.txt Serving dependencies
211
+ requirements-training.txt Training dependencies
212
  ```
213
 
214
  ## CI
215
 
216
  ```bash
217
+ python -m pytest tests/ -v
218
+ python -m ruff check .
219
+ python -m mypy server.py --ignore-missing-imports
220
+ python scripts/evaluate_model.py
221
+ docker build -t blitzkode:ci .
222
  ```
223
 
224
  ## License