blitzkode / docs /PROJECT_OVERVIEW.md
neuralbroker's picture
Update clean backend-only project docs and eval
2a2cc3c verified

BlitzKode Project Overview

What this project is

BlitzKode is a local coding-assistant application built around a fine-tuned Qwen2.5-1.5B GGUF model. The repository contains:

  • server.py: FastAPI backend that exposes API endpoints and performs llama.cpp inference.
  • tests/: backend endpoint tests with a stubbed llama.cpp model.
  • scripts/: dataset, training, checkpoint smoke-test, and GGUF export utilities.
  • blitzkode.gguf: local model artifact, intentionally ignored by git.

The runtime design is intentionally small and local-first: API client -> FastAPI -> llama-cpp-python -> local GGUF model.

Backend context

server.py is the central orchestration layer.

Request lifecycle

  1. create_app() builds the FastAPI app and stores Settings, ModelService, and WebSearchService on app.state.
  2. Middleware applies CORS, request-size limits, and optional per-IP rate limiting.
  3. /generate and /generate/stream validate prompts, enforce optional bearer auth, and serialize model access with an asyncio.Lock.
  4. ModelService lazily loads blitzkode.gguf through llama_cpp.Llama and formats requests with a Qwen-style ChatML prompt.
  5. Streaming generation runs llama.cpp in a worker thread and bridges tokens into SSE messages through a queue.

Inference efficiency controls

The following environment variables control CPU/GPU usage:

  • BLITZKODE_GPU_LAYERS: offload layers to GPU when llama.cpp was built with GPU support.
  • BLITZKODE_THREADS: CPU decode thread count.
  • BLITZKODE_THREADS_BATCH: CPU prompt-processing thread count.
  • BLITZKODE_N_CTX: context window.
  • BLITZKODE_BATCH: llama.cpp prompt-processing batch size.
  • BLITZKODE_UBATCH: llama.cpp micro-batch size.
  • BLITZKODE_PROMPT_CACHE: in-memory prompt cache for repeated prefixes.
  • BLITZKODE_PRELOAD_MODEL: load model at startup instead of first request.

For CPU-only machines, keep BLITZKODE_GPU_LAYERS=0, set BLITZKODE_THREADS close to physical performance cores, and use a quantized GGUF such as Q4_K_M for lower memory pressure. For GPU machines, increase BLITZKODE_GPU_LAYERS gradually until VRAM is stable.

Search and research augmentation

The backend now has optional web search primitives:

  • POST /search/web: DuckDuckGo Instant Answer search.
  • POST /generate/research: searches, injects snippets into the prompt as untrusted context, and generates an answer.

API clients can use /generate/stream for normal token streaming or /generate/research when retrieval-augmented output is needed.

Important safety behavior:

  • Search context is marked as untrusted.
  • The system prompt tells the model not to invent APIs, citations, files, or execution results.
  • Research mode asks the model to cite URLs when it relies on search results.

Training pipeline context

The scripts directory currently represents several generations of training experimentation:

  • build_dataset.py and build_full_dataset.py: local synthetic/curated dataset builders.
  • train_sft.py: baseline LoRA SFT over curated coding examples.
  • train_reward_sft.py: reward-inspired continuation, but not true GRPO.
  • train_dpo.py: preference optimization over handcrafted chosen/rejected samples.
  • train_v2.py, train_continue.py, train_max.py: larger public-dataset/QLoRA experiments.
  • export_gguf.py: merge LoRA adapters and prepare for llama.cpp conversion.
  • test_inference.py: smoke-test adapter checkpoints.

The scripts are intentionally excluded from Ruff in pyproject.toml, so backend CI remains focused and fast. They should be treated as operator tools rather than library code.

Research-informed optimization notes

DeepSeek-R1 emphasizes staged post-training: cold-start reasoning data, RL with GRPO, rejection sampling, further SFT, and preference/RL alignment. DeepSeek-V3 emphasizes efficient architecture and training systems: MoE, MLA, FP8, multi-token prediction, and strong post-training/distillation.

For this repository and a 1.5B Qwen-family model, the practical subset is:

  1. Use high-quality SFT data first; small models are very sensitive to noisy data.
  2. Add preference data with DPO or ORPO to reduce low-quality or hallucinated answers.
  3. If GRPO is available in the installed TRL version, use verifiable coding rewards on unit-tested tasks rather than subjective string heuristics.
  4. Distill reasoning behavior from larger open models only where licensing permits, and store provenance for every dataset.
  5. Use QLoRA / 4-bit loading for efficient GPU training.
  6. Keep inference GGUF quantized for CPU/GPU-local deployment.

Do not indiscriminately scrape “all social/public internet” data. That creates legal, privacy, quality, and contamination issues. Prefer permissively licensed datasets and retain dataset cards/provenance.

Cleanup completed in this pass

  • Removed .mypy_cache/ generated cache.
  • Removed the malformed empty directory USERPROFILE.ollamamodels/.
  • Added tested web search and research-generation endpoints.
  • Removed the React/Vite frontend and made the project API-only.
  • Tightened request models for message roles and history size.
  • Hardened request-size middleware against invalid Content-Length.
  • Added tests for search, disabled search, and research prompt injection.
  • Updated README API/environment documentation.

Current verification commands

  • python -m pytest tests/ -v — 22 backend endpoint tests
  • python -m ruff check .
  • python -m mypy server.py --ignore-missing-imports

Run these after backend changes.

Production quality audit results (v2.1 — Q8_0 GGUF, 1.53 GB)

All tests ran on RTX 4060 Laptop GPU, using CPU inference, load time 0.35 s.

Task Result Notes
Floyd cycle detection ✅ Correct O(n)/O(1) with explanation
Flatten nested list ✅ Correct Recursive, clean
Kadane's max subarray ✅ Correct With complexity analysis
Dijkstra shortest path ✅ Correct Heap-based, full implementation
Fix zero-division bug ✅ Correct Detected and fixed
Refactor to context manager ✅ Correct One-liner
Decorator explanation ✅ Correct 2-sentence, accurate
JavaScript debounce ✅ Correct With explanation
SQL top-5 users ✅ Correct GROUP BY + ORDER BY + LIMIT
BFS traversal ✅ Correct deque-based
TypeScript typed fetch ⚠️ Partial Mixes RxJS/fetch patterns
Refactor global→class ⚠️ Weak Didn't complete the OOP refactor
Non-existent API hallucination ❌ Hallucinated Small model limitation

Root causes and mitigations:

  1. TypeScript/OOP refactor gaps — training data was Python-heavy. Adding more TypeScript and OOP refactor examples to datasets/raw/available_training.jsonl and re-running train_available.py with --max-steps 200 would improve these.

  2. Hallucination of fictional APIs — this is a fundamental 1.5B model limitation. The system prompt already instructs against it (Do not invent APIs…) but small models don't reliably follow system instructions they weren't trained on. Mitigations: add negative examples to DPO pairs, or use the /generate/research endpoint to ground answers in real web results.

  3. n_ctx warningn_ctx_per_seq (2048) < n_ctx_train (32768) is informational only. For most coding tasks 2048 tokens is sufficient. Increase BLITZKODE_N_CTX if you need longer code generation.

Next recommended work

  1. Expand dataset — add TypeScript, OOP refactoring, and "I don't know" (grounding) examples to datasets/raw/available_training.jsonl, then re-run train_available.py --max-steps 300.
  2. DPO for hallucination — add preference pairs where chosen = "I don't know, here's how to check" and rejected = <hallucinated answer>, run train_dpo.py.
  3. API contract tests — add tests for OpenAPI schema stability and streaming client behavior.
  4. Search cache tuning — tune BLITZKODE_SEARCH_CACHE_TTL based on expected research workload.
  5. Context window — increase BLITZKODE_N_CTX=4096 in production for longer codebases, or 8192 with GPU offload.
  6. GRPO reward training — use TRL GRPOTrainer with unit-test-verified coding rewards once a suitable test harness is available.