File size: 8,454 Bytes
2a2cc3c
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
# BlitzKode Project Overview

## What this project is

BlitzKode is a local coding-assistant application built around a fine-tuned Qwen2.5-1.5B GGUF model. The repository contains:

- `server.py`: FastAPI backend that exposes API endpoints and performs llama.cpp inference.
- `tests/`: backend endpoint tests with a stubbed llama.cpp model.
- `scripts/`: dataset, training, checkpoint smoke-test, and GGUF export utilities.
- `blitzkode.gguf`: local model artifact, intentionally ignored by git.

The runtime design is intentionally small and local-first: API client -> FastAPI -> `llama-cpp-python` -> local GGUF model.

## Backend context

`server.py` is the central orchestration layer.

### Request lifecycle

1. `create_app()` builds the FastAPI app and stores `Settings`, `ModelService`, and `WebSearchService` on `app.state`.
2. Middleware applies CORS, request-size limits, and optional per-IP rate limiting.
3. `/generate` and `/generate/stream` validate prompts, enforce optional bearer auth, and serialize model access with an `asyncio.Lock`.
4. `ModelService` lazily loads `blitzkode.gguf` through `llama_cpp.Llama` and formats requests with a Qwen-style ChatML prompt.
5. Streaming generation runs llama.cpp in a worker thread and bridges tokens into SSE messages through a queue.

### Inference efficiency controls

The following environment variables control CPU/GPU usage:

- `BLITZKODE_GPU_LAYERS`: offload layers to GPU when llama.cpp was built with GPU support.
- `BLITZKODE_THREADS`: CPU decode thread count.
- `BLITZKODE_THREADS_BATCH`: CPU prompt-processing thread count.
- `BLITZKODE_N_CTX`: context window.
- `BLITZKODE_BATCH`: llama.cpp prompt-processing batch size.
- `BLITZKODE_UBATCH`: llama.cpp micro-batch size.
- `BLITZKODE_PROMPT_CACHE`: in-memory prompt cache for repeated prefixes.
- `BLITZKODE_PRELOAD_MODEL`: load model at startup instead of first request.

For CPU-only machines, keep `BLITZKODE_GPU_LAYERS=0`, set `BLITZKODE_THREADS` close to physical performance cores, and use a quantized GGUF such as Q4_K_M for lower memory pressure. For GPU machines, increase `BLITZKODE_GPU_LAYERS` gradually until VRAM is stable.

## Search and research augmentation

The backend now has optional web search primitives:

- `POST /search/web`: DuckDuckGo Instant Answer search.
- `POST /generate/research`: searches, injects snippets into the prompt as untrusted context, and generates an answer.

API clients can use `/generate/stream` for normal token streaming or `/generate/research` when retrieval-augmented output is needed.

Important safety behavior:

- Search context is marked as untrusted.
- The system prompt tells the model not to invent APIs, citations, files, or execution results.
- Research mode asks the model to cite URLs when it relies on search results.



## Training pipeline context

The scripts directory currently represents several generations of training experimentation:

- `build_dataset.py` and `build_full_dataset.py`: local synthetic/curated dataset builders.
- `train_sft.py`: baseline LoRA SFT over curated coding examples.
- `train_reward_sft.py`: reward-inspired continuation, but not true GRPO.
- `train_dpo.py`: preference optimization over handcrafted chosen/rejected samples.
- `train_v2.py`, `train_continue.py`, `train_max.py`: larger public-dataset/QLoRA experiments.
- `export_gguf.py`: merge LoRA adapters and prepare for llama.cpp conversion.
- `test_inference.py`: smoke-test adapter checkpoints.

The scripts are intentionally excluded from Ruff in `pyproject.toml`, so backend CI remains focused and fast. They should be treated as operator tools rather than library code.

## Research-informed optimization notes

DeepSeek-R1 emphasizes staged post-training: cold-start reasoning data, RL with GRPO, rejection sampling, further SFT, and preference/RL alignment. DeepSeek-V3 emphasizes efficient architecture and training systems: MoE, MLA, FP8, multi-token prediction, and strong post-training/distillation.

For this repository and a 1.5B Qwen-family model, the practical subset is:

1. Use high-quality SFT data first; small models are very sensitive to noisy data.
2. Add preference data with DPO or ORPO to reduce low-quality or hallucinated answers.
3. If GRPO is available in the installed TRL version, use verifiable coding rewards on unit-tested tasks rather than subjective string heuristics.
4. Distill reasoning behavior from larger open models only where licensing permits, and store provenance for every dataset.
5. Use QLoRA / 4-bit loading for efficient GPU training.
6. Keep inference GGUF quantized for CPU/GPU-local deployment.

Do not indiscriminately scrape “all social/public internet” data. That creates legal, privacy, quality, and contamination issues. Prefer permissively licensed datasets and retain dataset cards/provenance.

## Cleanup completed in this pass

- Removed `.mypy_cache/` generated cache.
- Removed the malformed empty directory `USERPROFILE.ollamamodels/`.
- Added tested web search and research-generation endpoints.
- Removed the React/Vite frontend and made the project API-only.
- Tightened request models for message roles and history size.
- Hardened request-size middleware against invalid `Content-Length`.
- Added tests for search, disabled search, and research prompt injection.
- Updated README API/environment documentation.

## Current verification commands

- `python -m pytest tests/ -v` — 22 backend endpoint tests
- `python -m ruff check .`
- `python -m mypy server.py --ignore-missing-imports`

Run these after backend changes.

## Production quality audit results (v2.1 — Q8_0 GGUF, 1.53 GB)



All tests ran on RTX 4060 Laptop GPU, using CPU inference, load time 0.35 s.



| Task | Result | Notes |

|---|---|---|

| Floyd cycle detection | ✅ Correct | O(n)/O(1) with explanation |

| Flatten nested list | ✅ Correct | Recursive, clean |

| Kadane's max subarray | ✅ Correct | With complexity analysis |

| Dijkstra shortest path | ✅ Correct | Heap-based, full implementation |

| Fix zero-division bug | ✅ Correct | Detected and fixed |

| Refactor to context manager | ✅ Correct | One-liner |

| Decorator explanation | ✅ Correct | 2-sentence, accurate |

| JavaScript debounce | ✅ Correct | With explanation |

| SQL top-5 users | ✅ Correct | GROUP BY + ORDER BY + LIMIT |

| BFS traversal | ✅ Correct | deque-based |

| TypeScript typed fetch | ⚠️ Partial | Mixes RxJS/fetch patterns |

| Refactor global→class | ⚠️ Weak | Didn't complete the OOP refactor |

| Non-existent API hallucination | ❌ Hallucinated | Small model limitation |



**Root causes and mitigations:**



1. **TypeScript/OOP refactor gaps** — training data was Python-heavy. Adding more TypeScript and OOP refactor examples to `datasets/raw/available_training.jsonl` and re-running `train_available.py` with `--max-steps 200` would improve these.



2. **Hallucination of fictional APIs** — this is a fundamental 1.5B model limitation. The system prompt already instructs against it (`Do not invent APIs…`) but small models don't reliably follow system instructions they weren't trained on. Mitigations: add negative examples to DPO pairs, or use the `/generate/research` endpoint to ground answers in real web results.



3. **n_ctx warning** — `n_ctx_per_seq (2048) < n_ctx_train (32768)` is informational only. For most coding tasks 2048 tokens is sufficient. Increase `BLITZKODE_N_CTX` if you need longer code generation.

## Next recommended work

1. **Expand dataset** — add TypeScript, OOP refactoring, and "I don't know" (grounding) examples to `datasets/raw/available_training.jsonl`, then re-run `train_available.py --max-steps 300`.
2. **DPO for hallucination** — add preference pairs where `chosen = "I don't know, here's how to check"` and `rejected = <hallucinated answer>`, run `train_dpo.py`.
3. **API contract tests** — add tests for OpenAPI schema stability and streaming client behavior.
4. **Search cache tuning** — tune `BLITZKODE_SEARCH_CACHE_TTL` based on expected research workload.
5. **Context window** — increase `BLITZKODE_N_CTX=4096` in production for longer codebases, or `8192` with GPU offload.
6. **GRPO reward training** — use TRL `GRPOTrainer` with unit-test-verified coding rewards once a suitable test harness is available.