---
language:
  - en
license: mit
library_name: llama-cpp-python
pipeline_tag: text-generation
tags:
  - code-generation
  - coding-assistant
  - gguf
  - llama.cpp
  - qwen2.5
  - python
  - javascript
  - fine-tuned
base_model:
  - Qwen/Qwen2.5-1.5B-Instruct
---

# BlitzKode

BlitzKode is a local, API-first AI coding assistant powered by a fine-tuned Qwen2.5-1.5B-Instruct model. It runs entirely on your machine through llama-cpp-python, with no external model API calls.

## Tech Stack

| Layer | Tech |
|---|---|
| Base model | Qwen2.5-1.5B-Instruct |
| Fine-tuning | LoRA (r=16, α=32) via PEFT |
| Training | HuggingFace Transformers + TRL |
| Inference | llama-cpp-python (GGUF Q8_0) |
| Backend | Python 3.11+, FastAPI, uvicorn |
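
For orientation, here is a minimal sketch of what the inference layer looks like with llama-cpp-python. The parameter values mirror the documented env-var defaults; the actual wiring lives in `server.py` and may differ:

```python
# Sketch only: loading the bundled GGUF as the backend is described.
from llama_cpp import Llama

llm = Llama(
    model_path="blitzkode.gguf",  # BLITZKODE_MODEL_PATH
    n_ctx=2048,                   # BLITZKODE_N_CTX
    n_gpu_layers=0,               # BLITZKODE_GPU_LAYERS (-1 offloads all layers)
    n_batch=256,                  # BLITZKODE_BATCH
    use_mmap=True,                # BLITZKODE_USE_MMAP
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a Python hello world."}],
    max_tokens=128,
    temperature=0.5,
)
print(out["choices"][0]["message"]["content"])
```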

## Features

- Local-first inference with the bundled GGUF model
- Backend-only FastAPI service exposing `/generate`, `/generate/stream`, `/generate/research`, `/search/web`, `/health`, and `/info`
- Real-time streaming via Server-Sent Events on `/generate/stream` (see the client sketch after this list)
- Web research mode that gathers DuckDuckGo search context before generation
- API key auth, request-size limits, and rate limiting for production use
- Backend/model optimizations: mmap model loading, configurable GPU layer offload, batch/thread tuning, optional prompt cache, search-result TTL caching, and efficient deque-based rate limiting
- Docker runtime image with no Node.js/frontend build steps
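
A minimal Python client for the streaming endpoint, assuming standard SSE `data:` lines; the exact event payload format is defined by `server.py` and may be JSON rather than raw text:

```python
# Sketch: consume /generate/stream as Server-Sent Events with requests.
import requests

with requests.post(
    "http://localhost:7860/generate/stream",
    json={"prompt": "Write a Python function to reverse a linked list"},
    stream=True,
    timeout=120,
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        # SSE frames arrive as "data: <payload>" lines separated by blanks.
        if line and line.startswith("data:"):
            print(line[len("data:"):].strip(), flush=True)
```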

## Prerequisites

- Python 3.11+
- `blitzkode.gguf` at the repo root, or set `BLITZKODE_MODEL_PATH`
- 4 GB+ RAM

## Quick Start

```bash
pip install -r requirements.txt
python server.py
curl http://localhost:7860/health
```

## Docker

```bash
# CPU
docker build -t blitzkode .
docker run -p 7860:7860 -v ./blitzkode.gguf:/app/blitzkode.gguf blitzkode

# GPU (with nvidia-docker)
docker compose --profile gpu up
```

## API Examples

```bash
# Streaming generation
curl -X POST http://localhost:7860/generate/stream \
  -H "Content-Type: application/json" \
  -d '{"prompt":"Write a Python function to reverse a linked list"}'

# Non-streaming generation
curl -X POST http://localhost:7860/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt":"Binary search in Python","max_tokens":128}'

# Web search only
curl -X POST http://localhost:7860/search/web \
  -H "Content-Type: application/json" \
  -d '{"query":"FastAPI dependency injection","max_results":3}'

# Research-augmented generation
curl -X POST http://localhost:7860/generate/research \
  -H "Content-Type: application/json" \
  -d '{"prompt":"How do I use async generators in Python 3.12?","deep_search":true}'

# Health / info
curl http://localhost:7860/health
curl http://localhost:7860/info
```
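
The same endpoints can be called from Python. The snippet below assumes the optional API key is sent as an `Authorization: Bearer` header, which is how the env-var table describes it:

```python
import requests

BASE = "http://localhost:7860"
# Only needed when BLITZKODE_API_KEY is set on the server.
HEADERS = {"Authorization": "Bearer <your-api-key>"}

resp = requests.post(
    f"{BASE}/generate",
    json={"prompt": "Binary search in Python", "max_tokens": 128},
    headers=HEADERS,
    timeout=120,
)
resp.raise_for_status()
print(resp.json())
```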

## API Parameters

### Generation (`/generate`, `/generate/stream`)

| Parameter | Type | Default | Description |
|---|---|---|---|
| `prompt` | string | required | User request |
| `messages` | array | `[]` | Conversation history (max 20 messages) |
| `temperature` | float | 0.5 | Sampling randomness, 0.0–2.0 |
| `max_tokens` | int | 256 | Max generated tokens (capped at 512) |
| `top_p` | float | 0.95 | Nucleus sampling threshold |
| `top_k` | int | 20 | Top-k sampling cutoff |
| `repeat_penalty` | float | 1.05 | Repetition penalty |
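
Putting the table together, a hypothetical request body with a short conversation history (OpenAI-style `role`/`content` message objects are assumed; the README does not pin down the exact `messages` schema):

```python
payload = {
    "prompt": "Now make it iterative instead of recursive.",
    "messages": [  # prior turns, capped at 20
        {"role": "user", "content": "Write a factorial function in Python."},
        {"role": "assistant", "content": "def factorial(n): ..."},
    ],
    "temperature": 0.3,
    "max_tokens": 256,
    "top_p": 0.95,
    "top_k": 20,
    "repeat_penalty": 1.05,
}
```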

### Research (`/generate/research`)

Same as generation, plus:

| Parameter | Type | Default | Description |
|---|---|---|---|
| `search_query` | string | the prompt | Override query for web search |
| `search_results` | int | 5 | Number of results to inject |
| `deep_search` | bool | false | Also search documentation/best-practices query variants |

### Web search (`/search/web`)

| Parameter | Type | Default | Description |
|---|---|---|---|
| `query` | string | required | Search query |
| `max_results` | int | 5 | Results to return |
| `deep` | bool | false | Multi-variant deep search |

## Environment Variables

| Variable | Default | Description |
|---|---|---|
| `BLITZKODE_MODEL_PATH` | `blitzkode.gguf` | GGUF model path |
| `BLITZKODE_HOST` | `0.0.0.0` | Server bind address |
| `BLITZKODE_PORT` | `7860` | Server port |
| `BLITZKODE_GPU_LAYERS` | `0` | GPU layers for llama.cpp; `-1` offloads all supported layers |
| `BLITZKODE_N_CTX` | `2048` | Context window |
| `BLITZKODE_THREADS` | auto | CPU decode threads |
| `BLITZKODE_THREADS_BATCH` | auto | CPU prompt-processing threads |
| `BLITZKODE_BATCH` | `256` | Prompt-processing batch size |
| `BLITZKODE_UBATCH` | `128` | llama.cpp micro-batch size |
| `BLITZKODE_PROMPT_CACHE` | `true` | Enable llama.cpp in-memory prompt cache when supported |
| `BLITZKODE_PROMPT_CACHE_BYTES` | `67108864` | Prompt cache capacity in bytes |
| `BLITZKODE_USE_MMAP` | `true` | Memory-map the GGUF for faster startup and lower memory pressure |
| `BLITZKODE_USE_MLOCK` | `false` | Try to lock model pages in RAM |
| `BLITZKODE_OFFLOAD_KQV` | `true` | Offload K/Q/V operations when GPU layers are enabled |
| `BLITZKODE_MAX_PROMPT_LENGTH` | `4000` | Max prompt length in characters |
| `BLITZKODE_PRELOAD_MODEL` | `false` | Load the model at startup |
| `BLITZKODE_CORS_ORIGINS` | `http://localhost:7860` | Allowed CORS origins |
| `BLITZKODE_API_KEY` | empty | Optional bearer token |
| `BLITZKODE_WEB_SEARCH` | `true` | Enable web search endpoints |
| `BLITZKODE_SEARCH_TIMEOUT` | `8` | Search HTTP timeout in seconds |
| `BLITZKODE_MAX_SEARCH_RESULTS` | `5` | Max search results |
| `BLITZKODE_SEARCH_CACHE_TTL` | `300` | Search-result cache TTL in seconds |
| `BLITZKODE_RATE_LIMIT` | `true` | Enable per-IP rate limiting |
| `BLITZKODE_RATE_LIMIT_MAX` | `30` | Requests per IP per minute |
| `BLITZKODE_MAX_REQUEST_BYTES` | `50000` | Request body size limit in bytes |
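
These are plain environment variables, so shell exports work directly (e.g. `BLITZKODE_GPU_LAYERS=-1 python server.py`). For completeness, one way to launch with overrides from Python:

```python
import os
import subprocess

env = {
    **os.environ,
    "BLITZKODE_GPU_LAYERS": "-1",      # offload all supported layers
    "BLITZKODE_API_KEY": "change-me",  # enable bearer-token auth
    "BLITZKODE_RATE_LIMIT_MAX": "60",
}
subprocess.run(["python", "server.py"], env=env, check=True)
```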

## Model Evaluation

Latest local GGUF evaluation: 2026-05-16, run with `python scripts/evaluate_model.py` on CPU (`n_ctx=2048`, `threads=8`, `batch=256`, `gpu_layers=0`). Full machine-readable results are stored in `docs/evaluation_results.json`.

| Eval case | Result | Notes |
|---|---|---|
| Python factorial with negative-input handling | ✅ Pass | Correct iterative implementation raising `ValueError` on negative input. |
| Iterative binary search | ✅ Pass | Valid loop-based search returning the index or -1. |
| SQL top users by order count | ✅ Pass | Correct `JOIN`, `GROUP BY`, `ORDER BY`, and `LIMIT 5`. |
| Unknown fictional API uncertainty | ❌ Fail | The raw model hallucinated a plausible signature for `imaginary_blitz_api`; the backend guard still blocks direct unknown-signature prompts on `/generate` and `/generate/stream`. |

Summary: 3 / 4 passed (75%). Total generation time was 28.864 s after a 0.312 s model load. Caveat: this is a lightweight heuristic smoke eval, not a comprehensive benchmark; it is useful for regression tracking and quick sanity checks, but generated code should still be reviewed and tested. Future eval work should add executable unit tests for generated code (see the sketch below) and larger benchmark suites such as HumanEval/MBPP-style tasks.
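
A minimal sketch of that executable-test idea (not part of the current eval script): run the generated source in a scratch namespace and assert on behavior.

```python
# Sketch: execute generated code and check the factorial eval case directly.
# Only run model output like this inside a sandbox.
def check_factorial(generated_source: str) -> bool:
    namespace: dict = {}
    try:
        exec(generated_source, namespace)
        factorial = namespace["factorial"]
        assert factorial(0) == 1
        assert factorial(5) == 120
        try:
            factorial(-1)
        except ValueError:
            return True  # negative input correctly rejected
        return False     # accepted a negative input
    except Exception:
        return False     # wrong answer, missing function, or crash
```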

## Training Pipeline

BlitzKode was fine-tuned through a staged pipeline on an RTX 4060 (8 GB VRAM):

| Stage | Script | Details |
|---|---|---|
| SFT v1 | `train_sft.py` | LoRA r=32 on curated coding examples |
| Reward-SFT | `train_reward_sft.py` | Reward-heuristic continuation |
| DPO | `train_dpo.py` | Chosen/rejected preference pairs |
| SFT v2 | `train_available.py` | LoRA r=16, resource-aware training |
| Export | `export_production.py` | Merge → GGUF Q8_0 via llama.cpp |
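
For reference, a sketch of the SFT v2 adapter configuration as PEFT code: `r` and `lora_alpha` come from the tables above, while `target_modules` and `lora_dropout` are assumptions, not copied from `train_available.py`:

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,               # SFT v2 rank from the table above
    lora_alpha=32,      # α from the tech-stack table
    lora_dropout=0.05,  # assumed value
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
)
```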

### Re-train from scratch

```bash
pip install -r requirements-training.txt
python scripts/build_full_dataset.py
python scripts/train_available.py \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --quantization none \
  --dataset datasets/raw/blitzkode_full_training.json \
  --max-steps 100 --seq-len 384 --batch-size 1 --grad-accum 8
python scripts/export_production.py
```

## Project Structure

```text
BlitzKode/
  server.py                     FastAPI backend
  blitzkode.gguf                Local GGUF model (ignored by git)
  scripts/                      Training, export, evaluation, and utility scripts
  docs/evaluation_results.json  Latest local model evaluation output
  tests/test_server.py          Backend endpoint tests
  datasets/MANIFEST.md          Dataset provenance
  docs/                         Architecture and production docs
  Dockerfile                    Python runtime image
  docker-compose.yml            CPU + GPU service definitions
  requirements.txt              Serving dependencies
  requirements-training.txt     Training dependencies
```

## CI

```bash
python -m pytest tests/ -v
python -m ruff check .
python -m mypy server.py --ignore-missing-imports
python scripts/evaluate_model.py
docker build -t blitzkode:ci .
```

## License

MIT. See `LICENSE`. Also comply with the upstream Qwen2.5 license when redistributing model weights.


Created by Sajad (neuralbroker)