neuralbroker commited on
Commit
2a2cc3c
·
verified ·
1 Parent(s): 6561a10

Update clean backend-only project docs and eval

Browse files
Files changed (1) hide show
  1. docs/PROJECT_OVERVIEW.md +141 -0
docs/PROJECT_OVERVIEW.md ADDED
@@ -0,0 +1,141 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # BlitzKode Project Overview
2
+
3
+ ## What this project is
4
+
5
+ BlitzKode is a local coding-assistant application built around a fine-tuned Qwen2.5-1.5B GGUF model. The repository contains:
6
+
7
+ - `server.py`: FastAPI backend that exposes API endpoints and performs llama.cpp inference.
8
+ - `tests/`: backend endpoint tests with a stubbed llama.cpp model.
9
+ - `scripts/`: dataset, training, checkpoint smoke-test, and GGUF export utilities.
10
+ - `blitzkode.gguf`: local model artifact, intentionally ignored by git.
11
+
12
+ The runtime design is intentionally small and local-first: API client -> FastAPI -> `llama-cpp-python` -> local GGUF model.
13
+
14
+ ## Backend context
15
+
16
+ `server.py` is the central orchestration layer.
17
+
18
+ ### Request lifecycle
19
+
20
+ 1. `create_app()` builds the FastAPI app and stores `Settings`, `ModelService`, and `WebSearchService` on `app.state`.
21
+ 2. Middleware applies CORS, request-size limits, and optional per-IP rate limiting.
22
+ 3. `/generate` and `/generate/stream` validate prompts, enforce optional bearer auth, and serialize model access with an `asyncio.Lock`.
23
+ 4. `ModelService` lazily loads `blitzkode.gguf` through `llama_cpp.Llama` and formats requests with a Qwen-style ChatML prompt.
24
+ 5. Streaming generation runs llama.cpp in a worker thread and bridges tokens into SSE messages through a queue.
25
+
26
+ ### Inference efficiency controls
27
+
28
+ The following environment variables control CPU/GPU usage:
29
+
30
+ - `BLITZKODE_GPU_LAYERS`: offload layers to GPU when llama.cpp was built with GPU support.
31
+ - `BLITZKODE_THREADS`: CPU decode thread count.
32
+ - `BLITZKODE_THREADS_BATCH`: CPU prompt-processing thread count.
33
+ - `BLITZKODE_N_CTX`: context window.
34
+ - `BLITZKODE_BATCH`: llama.cpp prompt-processing batch size.
35
+ - `BLITZKODE_UBATCH`: llama.cpp micro-batch size.
36
+ - `BLITZKODE_PROMPT_CACHE`: in-memory prompt cache for repeated prefixes.
37
+ - `BLITZKODE_PRELOAD_MODEL`: load model at startup instead of first request.
38
+
39
+ For CPU-only machines, keep `BLITZKODE_GPU_LAYERS=0`, set `BLITZKODE_THREADS` close to physical performance cores, and use a quantized GGUF such as Q4_K_M for lower memory pressure. For GPU machines, increase `BLITZKODE_GPU_LAYERS` gradually until VRAM is stable.
40
+
41
+ ## Search and research augmentation
42
+
43
+ The backend now has optional web search primitives:
44
+
45
+ - `POST /search/web`: DuckDuckGo Instant Answer search.
46
+ - `POST /generate/research`: searches, injects snippets into the prompt as untrusted context, and generates an answer.
47
+
48
+ API clients can use `/generate/stream` for normal token streaming or `/generate/research` when retrieval-augmented output is needed.
49
+
50
+ Important safety behavior:
51
+
52
+ - Search context is marked as untrusted.
53
+ - The system prompt tells the model not to invent APIs, citations, files, or execution results.
54
+ - Research mode asks the model to cite URLs when it relies on search results.
55
+
56
+
57
+
58
+ ## Training pipeline context
59
+
60
+ The scripts directory currently represents several generations of training experimentation:
61
+
62
+ - `build_dataset.py` and `build_full_dataset.py`: local synthetic/curated dataset builders.
63
+ - `train_sft.py`: baseline LoRA SFT over curated coding examples.
64
+ - `train_reward_sft.py`: reward-inspired continuation, but not true GRPO.
65
+ - `train_dpo.py`: preference optimization over handcrafted chosen/rejected samples.
66
+ - `train_v2.py`, `train_continue.py`, `train_max.py`: larger public-dataset/QLoRA experiments.
67
+ - `export_gguf.py`: merge LoRA adapters and prepare for llama.cpp conversion.
68
+ - `test_inference.py`: smoke-test adapter checkpoints.
69
+
70
+ The scripts are intentionally excluded from Ruff in `pyproject.toml`, so backend CI remains focused and fast. They should be treated as operator tools rather than library code.
71
+
72
+ ## Research-informed optimization notes
73
+
74
+ DeepSeek-R1 emphasizes staged post-training: cold-start reasoning data, RL with GRPO, rejection sampling, further SFT, and preference/RL alignment. DeepSeek-V3 emphasizes efficient architecture and training systems: MoE, MLA, FP8, multi-token prediction, and strong post-training/distillation.
75
+
76
+ For this repository and a 1.5B Qwen-family model, the practical subset is:
77
+
78
+ 1. Use high-quality SFT data first; small models are very sensitive to noisy data.
79
+ 2. Add preference data with DPO or ORPO to reduce low-quality or hallucinated answers.
80
+ 3. If GRPO is available in the installed TRL version, use verifiable coding rewards on unit-tested tasks rather than subjective string heuristics.
81
+ 4. Distill reasoning behavior from larger open models only where licensing permits, and store provenance for every dataset.
82
+ 5. Use QLoRA / 4-bit loading for efficient GPU training.
83
+ 6. Keep inference GGUF quantized for CPU/GPU-local deployment.
84
+
85
+ Do not indiscriminately scrape “all social/public internet” data. That creates legal, privacy, quality, and contamination issues. Prefer permissively licensed datasets and retain dataset cards/provenance.
86
+
87
+ ## Cleanup completed in this pass
88
+
89
+ - Removed `.mypy_cache/` generated cache.
90
+ - Removed the malformed empty directory `USERPROFILE.ollamamodels/`.
91
+ - Added tested web search and research-generation endpoints.
92
+ - Removed the React/Vite frontend and made the project API-only.
93
+ - Tightened request models for message roles and history size.
94
+ - Hardened request-size middleware against invalid `Content-Length`.
95
+ - Added tests for search, disabled search, and research prompt injection.
96
+ - Updated README API/environment documentation.
97
+
98
+ ## Current verification commands
99
+
100
+ - `python -m pytest tests/ -v` — 22 backend endpoint tests
101
+ - `python -m ruff check .`
102
+ - `python -m mypy server.py --ignore-missing-imports`
103
+
104
+ Run these after backend changes.
105
+
106
+ ## Production quality audit results (v2.1 — Q8_0 GGUF, 1.53 GB)
107
+
108
+ All tests ran on RTX 4060 Laptop GPU, using CPU inference, load time 0.35 s.
109
+
110
+ | Task | Result | Notes |
111
+ |---|---|---|
112
+ | Floyd cycle detection | ✅ Correct | O(n)/O(1) with explanation |
113
+ | Flatten nested list | ✅ Correct | Recursive, clean |
114
+ | Kadane's max subarray | ✅ Correct | With complexity analysis |
115
+ | Dijkstra shortest path | ✅ Correct | Heap-based, full implementation |
116
+ | Fix zero-division bug | ✅ Correct | Detected and fixed |
117
+ | Refactor to context manager | ✅ Correct | One-liner |
118
+ | Decorator explanation | ✅ Correct | 2-sentence, accurate |
119
+ | JavaScript debounce | ✅ Correct | With explanation |
120
+ | SQL top-5 users | ✅ Correct | GROUP BY + ORDER BY + LIMIT |
121
+ | BFS traversal | ✅ Correct | deque-based |
122
+ | TypeScript typed fetch | ⚠️ Partial | Mixes RxJS/fetch patterns |
123
+ | Refactor global→class | ⚠️ Weak | Didn't complete the OOP refactor |
124
+ | Non-existent API hallucination | ❌ Hallucinated | Small model limitation |
125
+
126
+ **Root causes and mitigations:**
127
+
128
+ 1. **TypeScript/OOP refactor gaps** — training data was Python-heavy. Adding more TypeScript and OOP refactor examples to `datasets/raw/available_training.jsonl` and re-running `train_available.py` with `--max-steps 200` would improve these.
129
+
130
+ 2. **Hallucination of fictional APIs** — this is a fundamental 1.5B model limitation. The system prompt already instructs against it (`Do not invent APIs…`) but small models don't reliably follow system instructions they weren't trained on. Mitigations: add negative examples to DPO pairs, or use the `/generate/research` endpoint to ground answers in real web results.
131
+
132
+ 3. **n_ctx warning** — `n_ctx_per_seq (2048) < n_ctx_train (32768)` is informational only. For most coding tasks 2048 tokens is sufficient. Increase `BLITZKODE_N_CTX` if you need longer code generation.
133
+
134
+ ## Next recommended work
135
+
136
+ 1. **Expand dataset** — add TypeScript, OOP refactoring, and "I don't know" (grounding) examples to `datasets/raw/available_training.jsonl`, then re-run `train_available.py --max-steps 300`.
137
+ 2. **DPO for hallucination** — add preference pairs where `chosen = "I don't know, here's how to check"` and `rejected = <hallucinated answer>`, run `train_dpo.py`.
138
+ 3. **API contract tests** — add tests for OpenAPI schema stability and streaming client behavior.
139
+ 4. **Search cache tuning** — tune `BLITZKODE_SEARCH_CACHE_TTL` based on expected research workload.
140
+ 5. **Context window** — increase `BLITZKODE_N_CTX=4096` in production for longer codebases, or `8192` with GPU offload.
141
+ 6. **GRPO reward training** — use TRL `GRPOTrainer` with unit-test-verified coding rewards once a suitable test harness is available.