---
language:
  - en
license: mit
library_name: llama-cpp-python
pipeline_tag: text-generation
tags:
  - code-generation
  - coding-assistant
  - gguf
  - llama.cpp
  - qwen2.5
  - python
  - javascript
  - fine-tuned
base_model:
  - Qwen/Qwen2.5-1.5B-Instruct
---

# BlitzKode

BlitzKode is a local, API-first AI coding assistant powered by a fine-tuned Qwen2.5-1.5B-Instruct model. It runs entirely on your machine through `llama-cpp-python` and makes no calls to external model APIs.

## Tech Stack

| Layer | Tech |
|---|---|
| Base model | Qwen2.5-1.5B-Instruct |
| Fine-tuning | LoRA (r=16, α=32) via PEFT |
| Training | HuggingFace Transformers + TRL |
| Inference | llama-cpp-python (GGUF Q8_0) |
| Backend | Python 3.11+, FastAPI, uvicorn |

## Features

- **Local-first inference** with the bundled GGUF model
- **API-only FastAPI backend** (no bundled frontend) exposing `/generate`, `/generate/stream`, `/generate/research`, `/search/web`, `/health`, and `/info`
- **Real-time streaming** via Server-Sent Events on `/generate/stream`
- **Web research mode** using DuckDuckGo search context before generation
- **API key auth, request-size limits, and rate limiting** for production use
- **Backend/model optimizations**: mmap model loading, configurable GPU layer offload, batch/thread tuning, optional prompt cache, search-result TTL caching, and efficient deque-based rate limiting
- **Docker** runtime image without Node.js/frontend build steps
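
The deque-based rate limiting mentioned in the list above can be illustrated with a sliding-window limiter. This is a minimal sketch, not the code in `server.py`: the class name, the injectable `clock`, and the use of the client IP as the key are assumptions made for illustration. The defaults mirror `BLITZKODE_RATE_LIMIT_MAX` (30 requests per IP per minute).

```python
import time
from collections import defaultdict, deque


class SlidingWindowRateLimiter:
    """Allow at most `max_requests` per `window_seconds` for each client key.

    Hypothetical helper for illustration; the real server may differ.
    """

    def __init__(self, max_requests=30, window_seconds=60.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._hits = defaultdict(deque)  # client key -> timestamps of accepted requests

    def allow(self, client_key):
        now = self.clock()
        hits = self._hits[client_key]
        # Timestamps are appended in order, so expired ones are always at the left.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

A deque fits here because expiring old timestamps only ever touches the left end, so each check costs amortized constant time instead of rescanning a list.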

## Prerequisites

- Python 3.11+
- `blitzkode.gguf` at repo root, or set `BLITZKODE_MODEL_PATH`
- 4 GB+ RAM

## Quick Start

```bash
pip install -r requirements.txt
python server.py
curl http://localhost:7860/health
```

## Docker

```bash
# CPU
docker build -t blitzkode .
docker run -p 7860:7860 -v "$(pwd)/blitzkode.gguf:/app/blitzkode.gguf" blitzkode

# GPU (with nvidia-docker)
docker compose --profile gpu up
```

## API Examples

```bash
# Streaming generation (Server-Sent Events; -N turns off curl's output buffering)
curl -N -X POST http://localhost:7860/generate/stream \
  -H "Content-Type: application/json" \
  -d '{"prompt":"Write a Python function to reverse a linked list"}'

# Non-streaming
curl -X POST http://localhost:7860/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt":"Binary search in Python","max_tokens":128}'

# Web search only
curl -X POST http://localhost:7860/search/web \
  -H "Content-Type: application/json" \
  -d '{"query":"FastAPI dependency injection","max_results":3}'

# Research-augmented generation
curl -X POST http://localhost:7860/generate/research \
  -H "Content-Type: application/json" \
  -d '{"prompt":"How do I use async generators in Python 3.12?","deep_search":true}'

# Health / info
curl http://localhost:7860/health
curl http://localhost:7860/info
```
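
The same calls can be made from Python with only the standard library. The sketch below is hedged: the README does not document the exact payload format of the stream, so `iter_sse_data` only extracts raw `data:` lines as defined by the Server-Sent Events format. Sending the optional API key as an `Authorization: Bearer` header is an assumption based on the "bearer token" wording in the environment table. All function names are hypothetical.

```python
import json
import urllib.request


def iter_sse_data(lines):
    """Yield the payload of each `data:` line from an iterable of text lines."""
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith("data:"):
            payload = line[len("data:"):]
            # The SSE format allows one optional space after the colon.
            yield payload[1:] if payload.startswith(" ") else payload


def build_request(url, body, api_key=None):
    """Build a JSON POST request; add a bearer token only when one is given."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"  # assumed header scheme
    data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(url, data=data, headers=headers, method="POST")


def stream_generate(prompt, base_url="http://localhost:7860", api_key=None):
    """Print raw `data:` payloads from /generate/stream as they arrive."""
    req = build_request(f"{base_url}/generate/stream", {"prompt": prompt}, api_key)
    with urllib.request.urlopen(req) as resp:
        text_lines = (raw.decode("utf-8") for raw in resp)
        for payload in iter_sse_data(text_lines):
            print(payload)
```

With a server running locally, `stream_generate("Write a Python function to reverse a linked list")` would print each event payload as it is received.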

## API Parameters

### Generation (`/generate`, `/generate/stream`)

| Parameter | Type | Default | Description |
|---|---|---|---|
| `prompt` | string | required | User request |
| `messages` | array | `[]` | Conversation history (max 20) |
| `temperature` | float | `0.5` | Sampling randomness `0.0–2.0` |
| `max_tokens` | int | `256` | Max generated tokens (cap 512) |
| `top_p` | float | `0.95` | Nucleus sampling threshold |
| `top_k` | int | `20` | Top-k sampling |
| `repeat_penalty` | float | `1.05` | Repetition penalty |
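
To make the defaults and limits in the table concrete, here is a small sketch of a request model that applies them. The README does not say whether the server clamps out-of-range values or rejects them, so clamping and keeping only the 20 most recent messages are assumptions; `GenerationParams` and `normalized` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class GenerationParams:
    prompt: str
    messages: list = field(default_factory=list)
    temperature: float = 0.5
    max_tokens: int = 256
    top_p: float = 0.95
    top_k: int = 20
    repeat_penalty: float = 1.05

    def normalized(self):
        """Return a copy with values forced into the documented ranges."""
        if not self.prompt.strip():
            raise ValueError("prompt is required")
        return GenerationParams(
            prompt=self.prompt,
            messages=self.messages[-20:],  # assumed: keep the 20 most recent entries
            temperature=min(max(self.temperature, 0.0), 2.0),
            max_tokens=min(max(self.max_tokens, 1), 512),
            top_p=self.top_p,
            top_k=self.top_k,
            repeat_penalty=self.repeat_penalty,
        )
```

For example, `GenerationParams(prompt="x", max_tokens=9999).normalized().max_tokens` evaluates to `512` under these assumptions.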

### Research (`/generate/research`)

Same as generation, plus:

| Parameter | Type | Default | Description |
|---|---|---|---|
| `search_query` | string | prompt | Override query for web search |
| `search_results` | int | `5` | Results to inject |
| `deep_search` | bool | `false` | Also search documentation/best-practices variants |

### Web search (`/search/web`)

| Parameter | Type | Default | Description |
|---|---|---|---|
| `query` | string | required | Search query |
| `max_results` | int | `5` | Results to return |
| `deep` | bool | `false` | Multi-variant deep search |
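
Search results are cached with a TTL (`BLITZKODE_SEARCH_CACHE_TTL`, 300 seconds by default). The following is one possible shape for such a cache, shown as a sketch rather than the server's implementation. `TTLCache`, `cache_key`, and the key normalization are assumptions for illustration.

```python
import time


class TTLCache:
    """In-memory cache whose entries expire `ttl_seconds` after being set."""

    def __init__(self, ttl_seconds=300.0, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable for testing
        self._entries = {}  # key -> (expiry time, value)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._entries[key]  # evict lazily on read
            return None
        return value

    def set(self, key, value):
        self._entries[key] = (self.clock() + self.ttl_seconds, value)


def cache_key(query, max_results=5, deep=False):
    """Normalize case and whitespace so near-identical queries share an entry."""
    return (" ".join(query.lower().split()), max_results, deep)
```

Including `max_results` and `deep` in the key keeps a shallow three-result lookup from being served in place of a deep search for the same query.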

## Environment Variables

| Variable | Default | Description |
|---|---|---|
| `BLITZKODE_MODEL_PATH` | `blitzkode.gguf` | GGUF model path |
| `BLITZKODE_HOST` | `0.0.0.0` | Server bind address |
| `BLITZKODE_PORT` | `7860` | Server port |
| `BLITZKODE_GPU_LAYERS` | `0` | GPU layers for llama.cpp; use `-1` to offload all supported layers |
| `BLITZKODE_N_CTX` | `2048` | Context window |
| `BLITZKODE_THREADS` | auto | CPU decode threads |
| `BLITZKODE_THREADS_BATCH` | auto | CPU prompt-processing threads |
| `BLITZKODE_BATCH` | `256` | Prompt-processing batch size |
| `BLITZKODE_UBATCH` | `128` | llama.cpp micro-batch size |
| `BLITZKODE_PROMPT_CACHE` | `true` | Enable llama.cpp in-memory prompt cache when supported |
| `BLITZKODE_PROMPT_CACHE_BYTES` | `67108864` | Prompt cache capacity in bytes |
| `BLITZKODE_USE_MMAP` | `true` | Memory-map the GGUF for faster startup and lower memory pressure |
| `BLITZKODE_USE_MLOCK` | `false` | Try to lock model pages in RAM |
| `BLITZKODE_OFFLOAD_KQV` | `true` | Offload K/Q/V operations when GPU layers are enabled |
| `BLITZKODE_MAX_PROMPT_LENGTH` | `4000` | Max prompt chars |
| `BLITZKODE_PRELOAD_MODEL` | `false` | Load model at startup |
| `BLITZKODE_CORS_ORIGINS` | `http://localhost:7860` | CORS origins |
| `BLITZKODE_API_KEY` | empty | Optional bearer token |
| `BLITZKODE_WEB_SEARCH` | `true` | Enable web search endpoints |
| `BLITZKODE_SEARCH_TIMEOUT` | `8` | Search HTTP timeout in seconds |
| `BLITZKODE_MAX_SEARCH_RESULTS` | `5` | Max search results |
| `BLITZKODE_SEARCH_CACHE_TTL` | `300` | Search result cache TTL in seconds |
| `BLITZKODE_RATE_LIMIT` | `true` | Enable per-IP rate limiting |
| `BLITZKODE_RATE_LIMIT_MAX` | `30` | Requests per IP per minute |
| `BLITZKODE_MAX_REQUEST_BYTES` | `50000` | Request body size limit |
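
As a rough illustration of how the model-related variables could be consumed, the sketch below reads them with the defaults from the table and maps them onto keyword arguments of `llama_cpp.Llama`. The keyword names are constructor parameters in recent llama-cpp-python releases, but the mapping itself, the helper names, and the accepted boolean spellings are assumptions; `server.py` may do this differently. Variables listed as `auto` become `None`, which leaves the thread count to the library.

```python
import os


def env_bool(name, default, env=os.environ):
    """Parse a boolean variable; unset or empty falls back to the default."""
    raw = env.get(name)
    if raw is None or raw.strip() == "":
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def env_int(name, default, env=os.environ):
    """Parse an integer variable; unset or empty falls back to the default."""
    raw = env.get(name)
    if raw is None or raw.strip() == "":
        return default
    return int(raw)


def llama_kwargs(env=os.environ):
    """Build keyword arguments for llama_cpp.Llama from BLITZKODE_* variables."""
    return {
        "model_path": env.get("BLITZKODE_MODEL_PATH", "blitzkode.gguf"),
        "n_gpu_layers": env_int("BLITZKODE_GPU_LAYERS", 0, env),
        "n_ctx": env_int("BLITZKODE_N_CTX", 2048, env),
        "n_threads": env_int("BLITZKODE_THREADS", None, env),
        "n_threads_batch": env_int("BLITZKODE_THREADS_BATCH", None, env),
        "n_batch": env_int("BLITZKODE_BATCH", 256, env),
        "n_ubatch": env_int("BLITZKODE_UBATCH", 128, env),
        "use_mmap": env_bool("BLITZKODE_USE_MMAP", True, env),
        "use_mlock": env_bool("BLITZKODE_USE_MLOCK", False, env),
        "offload_kqv": env_bool("BLITZKODE_OFFLOAD_KQV", True, env),
    }
```

With llama-cpp-python installed, the result could be passed as `Llama(**llama_kwargs())`.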

## Model Evaluation

Latest local GGUF evaluation: **2026-05-16** using `python scripts/evaluate_model.py` on CPU (`n_ctx=2048`, `threads=8`, `batch=256`, `gpu_layers=0`). Full machine-readable results are stored in `docs/evaluation_results.json`.

| Eval case | Result | Notes |
|---|---:|---|
| Python factorial with negative-input handling | ✅ Pass | Generated a correct iterative implementation with `ValueError` for negative input. |
| Iterative binary search | ✅ Pass | Generated a valid loop-based search returning index or `-1`. |
| SQL top users by order count | ✅ Pass | Generated `JOIN`, `GROUP BY`, `ORDER BY`, and `LIMIT 5`. |
| Unknown fictional API uncertainty | ❌ Fail | The raw model hallucinated a plausible signature for `imaginary_blitz_api`; the backend guard still blocks direct unknown-signature prompts on `/generate` and `/generate/stream`. |

Summary: **3 / 4 passed (75%)**. Total generation time was **28.864 s** after a **0.312 s** model load.

Limitations: this is a lightweight heuristic smoke eval, not a comprehensive benchmark. It is useful for regression tracking and quick sanity checks, but generated code should still be reviewed and tested. Future eval work should add executable unit tests for generated code and larger benchmark suites such as HumanEval/MBPP-style tasks.
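
An executable check for generated code, of the kind suggested as future eval work, might look like the sketch below. It is not part of `scripts/evaluate_model.py`; the function name and the `(args, expected)` case format are assumptions. Note that `exec` runs arbitrary code, so real model output should only be executed inside a sandbox.

```python
def _is_exception_class(obj):
    return isinstance(obj, type) and issubclass(obj, BaseException)


def run_generated_code(source, func_name, cases):
    """Execute generated source and check `func_name` against test cases.

    Each case is (args, expected), where `expected` is either a return value
    or an exception class that the call is expected to raise.
    """
    namespace = {}
    try:
        exec(source, namespace)  # WARNING: unsandboxed; illustration only
        func = namespace[func_name]
    except Exception:
        return False  # source failed to run, or the function is missing
    for args, expected in cases:
        try:
            result = func(*args)
        except Exception as exc:
            if not (_is_exception_class(expected) and isinstance(exc, expected)):
                return False
        else:
            if _is_exception_class(expected) or result != expected:
                return False
    return True
```

For the factorial eval case, the cases could be `[((0,), 1), ((5,), 120), ((-1,), ValueError)]`, which checks both the computed values and the negative-input handling.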

## Training Pipeline

BlitzKode was fine-tuned through a staged pipeline on an RTX 4060 (8 GB VRAM):

| Stage | Script | Details |
|---|---|---|
| SFT v1 | `train_sft.py` | LoRA r=32 on curated coding examples |
| Reward-SFT | `train_reward_sft.py` | Reward-heuristic continuation |
| DPO | `train_dpo.py` | Chosen/rejected preference pairs |
| SFT v2 | `train_available.py` | LoRA r=16 resource-aware training |
| Export | `export_production.py` | Merge → GGUF Q8_0 via llama.cpp |

### Re-train from scratch

```bash
pip install -r requirements-training.txt
python scripts/build_full_dataset.py
python scripts/train_available.py \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --quantization none \
  --dataset datasets/raw/blitzkode_full_training.json \
  --max-steps 100 --seq-len 384 --batch-size 1 --grad-accum 8
python scripts/export_production.py
```
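
For a sense of scale, the flags in the command above imply a fairly small training budget. The arithmetic below assumes that `--max-steps` counts optimizer steps (as `max_steps` does in the Transformers `Trainer`) and that training runs on a single GPU. `training_budget` is a hypothetical helper, and the token figure is an upper bound because sequences shorter than `--seq-len` contribute fewer tokens.

```python
def training_budget(max_steps, batch_size, grad_accum, seq_len):
    """Estimate how much data a run sees, given its step and batch settings."""
    effective_batch = batch_size * grad_accum  # sequences per optimizer step
    sequences = max_steps * effective_batch
    return {
        "effective_batch": effective_batch,
        "sequences_seen": sequences,
        "max_tokens_seen": sequences * seq_len,
    }


budget = training_budget(max_steps=100, batch_size=1, grad_accum=8, seq_len=384)
print(budget)
# {'effective_batch': 8, 'sequences_seen': 800, 'max_tokens_seen': 307200}
```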

## Project Structure

```text
BlitzKode/
  server.py                    FastAPI backend
  blitzkode.gguf               Local GGUF model (ignored by git)
  scripts/                     Training, export, evaluation, and utility scripts
  docs/evaluation_results.json Latest local model evaluation output
  tests/test_server.py         Backend endpoint tests
  datasets/MANIFEST.md         Dataset provenance
  docs/                        Architecture and production docs
  Dockerfile                   Python runtime image
  docker-compose.yml           CPU + GPU service definitions
  requirements.txt             Serving dependencies
  requirements-training.txt    Training dependencies
```

## CI

```bash
python -m pytest tests/ -v
python -m ruff check .
python -m mypy server.py --ignore-missing-imports
python scripts/evaluate_model.py
docker build -t blitzkode:ci .
```

## License

MIT. See `LICENSE`. When redistributing model weights, also comply with the [Qwen2.5 upstream license](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct).

---

*Created by [Sajad (neuralbroker)](https://github.com/neuralbroker)*