Instructions to use neuralbroker/blitzkode with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use neuralbroker/blitzkode with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="neuralbroker/blitzkode", filename="blitzkode.gguf", )
llm.create_chat_completion( messages = [ { "role": "user", "content": "What is the capital of France?" } ] ) - llama-cpp-python
How to use neuralbroker/blitzkode with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="neuralbroker/blitzkode", filename="blitzkode.gguf", )
llm.create_chat_completion( messages = [ { "role": "user", "content": "What is the capital of France?" } ] ) - Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use neuralbroker/blitzkode with llama.cpp:
Install from brew
brew install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf neuralbroker/blitzkode # Run inference directly in the terminal: llama-cli -hf neuralbroker/blitzkode
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf neuralbroker/blitzkode # Run inference directly in the terminal: llama-cli -hf neuralbroker/blitzkode
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf neuralbroker/blitzkode # Run inference directly in the terminal: ./llama-cli -hf neuralbroker/blitzkode
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf neuralbroker/blitzkode # Run inference directly in the terminal: ./build/bin/llama-cli -hf neuralbroker/blitzkode
Use Docker
docker model run hf.co/neuralbroker/blitzkode
- LM Studio
- Jan
- vLLM
How to use neuralbroker/blitzkode with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "neuralbroker/blitzkode" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "neuralbroker/blitzkode", "messages": [ { "role": "user", "content": "What is the capital of France?" } ] }'Use Docker
docker model run hf.co/neuralbroker/blitzkode
- Ollama
How to use neuralbroker/blitzkode with Ollama:
ollama run hf.co/neuralbroker/blitzkode
- Unsloth Studio new
How to use neuralbroker/blitzkode with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for neuralbroker/blitzkode to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for neuralbroker/blitzkode to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for neuralbroker/blitzkode to start chatting
- Pi new
How to use neuralbroker/blitzkode with Pi:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf neuralbroker/blitzkode
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "llama-cpp": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "neuralbroker/blitzkode" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use neuralbroker/blitzkode with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf neuralbroker/blitzkode
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default neuralbroker/blitzkode
Run Hermes
hermes
- Docker Model Runner
How to use neuralbroker/blitzkode with Docker Model Runner:
docker model run hf.co/neuralbroker/blitzkode
- Lemonade
How to use neuralbroker/blitzkode with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull neuralbroker/blitzkode
Run and chat with the model
lemonade run user.blitzkode-{{QUANT_TAG}}List all available models
lemonade list
BlitzKode Project Overview
What this project is
BlitzKode is a local coding-assistant application built around a fine-tuned Qwen2.5-1.5B GGUF model. The repository contains:
server.py: FastAPI backend that exposes API endpoints and performs llama.cpp inference.tests/: backend endpoint tests with a stubbed llama.cpp model.scripts/: dataset, training, checkpoint smoke-test, and GGUF export utilities.blitzkode.gguf: local model artifact, intentionally ignored by git.
The runtime design is intentionally small and local-first: API client -> FastAPI -> llama-cpp-python -> local GGUF model.
Backend context
server.py is the central orchestration layer.
Request lifecycle
create_app()builds the FastAPI app and storesSettings,ModelService, andWebSearchServiceonapp.state.- Middleware applies CORS, request-size limits, and optional per-IP rate limiting.
/generateand/generate/streamvalidate prompts, enforce optional bearer auth, and serialize model access with anasyncio.Lock.ModelServicelazily loadsblitzkode.ggufthroughllama_cpp.Llamaand formats requests with a Qwen-style ChatML prompt.- Streaming generation runs llama.cpp in a worker thread and bridges tokens into SSE messages through a queue.
Inference efficiency controls
The following environment variables control CPU/GPU usage:
BLITZKODE_GPU_LAYERS: offload layers to GPU when llama.cpp was built with GPU support.BLITZKODE_THREADS: CPU decode thread count.BLITZKODE_THREADS_BATCH: CPU prompt-processing thread count.BLITZKODE_N_CTX: context window.BLITZKODE_BATCH: llama.cpp prompt-processing batch size.BLITZKODE_UBATCH: llama.cpp micro-batch size.BLITZKODE_PROMPT_CACHE: in-memory prompt cache for repeated prefixes.BLITZKODE_PRELOAD_MODEL: load model at startup instead of first request.
For CPU-only machines, keep BLITZKODE_GPU_LAYERS=0, set BLITZKODE_THREADS close to physical performance cores, and use a quantized GGUF such as Q4_K_M for lower memory pressure. For GPU machines, increase BLITZKODE_GPU_LAYERS gradually until VRAM is stable.
Search and research augmentation
The backend now has optional web search primitives:
POST /search/web: DuckDuckGo Instant Answer search.POST /generate/research: searches, injects snippets into the prompt as untrusted context, and generates an answer.
API clients can use /generate/stream for normal token streaming or /generate/research when retrieval-augmented output is needed.
Important safety behavior:
- Search context is marked as untrusted.
- The system prompt tells the model not to invent APIs, citations, files, or execution results.
- Research mode asks the model to cite URLs when it relies on search results.
Training pipeline context
The scripts directory currently represents several generations of training experimentation:
build_dataset.pyandbuild_full_dataset.py: local synthetic/curated dataset builders.train_sft.py: baseline LoRA SFT over curated coding examples.train_reward_sft.py: reward-inspired continuation, but not true GRPO.train_dpo.py: preference optimization over handcrafted chosen/rejected samples.train_v2.py,train_continue.py,train_max.py: larger public-dataset/QLoRA experiments.export_gguf.py: merge LoRA adapters and prepare for llama.cpp conversion.test_inference.py: smoke-test adapter checkpoints.
The scripts are intentionally excluded from Ruff in pyproject.toml, so backend CI remains focused and fast. They should be treated as operator tools rather than library code.
Research-informed optimization notes
DeepSeek-R1 emphasizes staged post-training: cold-start reasoning data, RL with GRPO, rejection sampling, further SFT, and preference/RL alignment. DeepSeek-V3 emphasizes efficient architecture and training systems: MoE, MLA, FP8, multi-token prediction, and strong post-training/distillation.
For this repository and a 1.5B Qwen-family model, the practical subset is:
- Use high-quality SFT data first; small models are very sensitive to noisy data.
- Add preference data with DPO or ORPO to reduce low-quality or hallucinated answers.
- If GRPO is available in the installed TRL version, use verifiable coding rewards on unit-tested tasks rather than subjective string heuristics.
- Distill reasoning behavior from larger open models only where licensing permits, and store provenance for every dataset.
- Use QLoRA / 4-bit loading for efficient GPU training.
- Keep inference GGUF quantized for CPU/GPU-local deployment.
Do not indiscriminately scrape “all social/public internet” data. That creates legal, privacy, quality, and contamination issues. Prefer permissively licensed datasets and retain dataset cards/provenance.
Cleanup completed in this pass
- Removed
.mypy_cache/generated cache. - Removed the malformed empty directory
USERPROFILE.ollamamodels/. - Added tested web search and research-generation endpoints.
- Removed the React/Vite frontend and made the project API-only.
- Tightened request models for message roles and history size.
- Hardened request-size middleware against invalid
Content-Length. - Added tests for search, disabled search, and research prompt injection.
- Updated README API/environment documentation.
Current verification commands
python -m pytest tests/ -v— 22 backend endpoint testspython -m ruff check .python -m mypy server.py --ignore-missing-imports
Run these after backend changes.
Production quality audit results (v2.1 — Q8_0 GGUF, 1.53 GB)
All tests ran on RTX 4060 Laptop GPU, using CPU inference, load time 0.35 s.
| Task | Result | Notes |
|---|---|---|
| Floyd cycle detection | ✅ Correct | O(n)/O(1) with explanation |
| Flatten nested list | ✅ Correct | Recursive, clean |
| Kadane's max subarray | ✅ Correct | With complexity analysis |
| Dijkstra shortest path | ✅ Correct | Heap-based, full implementation |
| Fix zero-division bug | ✅ Correct | Detected and fixed |
| Refactor to context manager | ✅ Correct | One-liner |
| Decorator explanation | ✅ Correct | 2-sentence, accurate |
| JavaScript debounce | ✅ Correct | With explanation |
| SQL top-5 users | ✅ Correct | GROUP BY + ORDER BY + LIMIT |
| BFS traversal | ✅ Correct | deque-based |
| TypeScript typed fetch | ⚠️ Partial | Mixes RxJS/fetch patterns |
| Refactor global→class | ⚠️ Weak | Didn't complete the OOP refactor |
| Non-existent API hallucination | ❌ Hallucinated | Small model limitation |
Root causes and mitigations:
TypeScript/OOP refactor gaps — training data was Python-heavy. Adding more TypeScript and OOP refactor examples to
datasets/raw/available_training.jsonland re-runningtrain_available.pywith--max-steps 200would improve these.Hallucination of fictional APIs — this is a fundamental 1.5B model limitation. The system prompt already instructs against it (
Do not invent APIs…) but small models don't reliably follow system instructions they weren't trained on. Mitigations: add negative examples to DPO pairs, or use the/generate/researchendpoint to ground answers in real web results.n_ctx warning —
n_ctx_per_seq (2048) < n_ctx_train (32768)is informational only. For most coding tasks 2048 tokens is sufficient. IncreaseBLITZKODE_N_CTXif you need longer code generation.
Next recommended work
- Expand dataset — add TypeScript, OOP refactoring, and "I don't know" (grounding) examples to
datasets/raw/available_training.jsonl, then re-runtrain_available.py --max-steps 300. - DPO for hallucination — add preference pairs where
chosen = "I don't know, here's how to check"andrejected = <hallucinated answer>, runtrain_dpo.py. - API contract tests — add tests for OpenAPI schema stability and streaming client behavior.
- Search cache tuning — tune
BLITZKODE_SEARCH_CACHE_TTLbased on expected research workload. - Context window — increase
BLITZKODE_N_CTX=4096in production for longer codebases, or8192with GPU offload. - GRPO reward training — use TRL
GRPOTrainerwith unit-test-verified coding rewards once a suitable test harness is available.