Instructions to use neuralbroker/blitzkode with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use neuralbroker/blitzkode with llama-cpp-python:
# !pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="neuralbroker/blitzkode",
    filename="blitzkode.gguf",
)

llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ]
)
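If you prefer tokens as they are generated rather than one final reply, the same `llm` object from the snippet above also supports streaming; a minimal sketch (the example prompt is illustrative):

```python
# Streaming variant: reuse the Llama object created above.
stream = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a haiku about local inference."}],
    stream=True,
)
for chunk in stream:
    delta = chunk["choices"][0]["delta"]
    # The first chunk carries only the role; later chunks carry content tokens.
    if "content" in delta:
        print(delta["content"], end="", flush=True)
print()
```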
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use neuralbroker/blitzkode with llama.cpp:
Install from brew
brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf neuralbroker/blitzkode
# Run inference directly in the terminal:
llama-cli -hf neuralbroker/blitzkode
Install from WinGet (Windows)
winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf neuralbroker/blitzkode
# Run inference directly in the terminal:
llama-cli -hf neuralbroker/blitzkode
Use pre-built binary
# Download a pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf neuralbroker/blitzkode
# Run inference directly in the terminal:
./llama-cli -hf neuralbroker/blitzkode
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf neuralbroker/blitzkode
# Run inference directly in the terminal:
./build/bin/llama-cli -hf neuralbroker/blitzkode
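However you installed it, llama-server exposes an OpenAI-compatible API (port 8080 by default), so any OpenAI client can talk to it. A minimal Python sketch using the `openai` package; the model id passed below is an assumption, check `GET /v1/models` on your running server for the exact name:

```python
# pip install openai
from openai import OpenAI

# No real key is needed for a local llama-server instance.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="neuralbroker/blitzkode",  # assumption: verify against /v1/models
    messages=[{"role": "user", "content": "Write a Python function that reverses a string."}],
)
print(response.choices[0].message.content)
```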
Use Docker
docker model run hf.co/neuralbroker/blitzkode
- LM Studio
- Jan
- vLLM
How to use neuralbroker/blitzkode with vLLM:
Install from pip and serve model
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "neuralbroker/blitzkode"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "neuralbroker/blitzkode",
    "messages": [
      { "role": "user", "content": "What is the capital of France?" }
    ]
  }'
Use Docker
docker model run hf.co/neuralbroker/blitzkode
- Ollama
How to use neuralbroker/blitzkode with Ollama:
ollama run hf.co/neuralbroker/blitzkode
- Unsloth Studio
How to use neuralbroker/blitzkode with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for neuralbroker/blitzkode to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for neuralbroker/blitzkode to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for neuralbroker/blitzkode to start chatting
- Pi
How to use neuralbroker/blitzkode with Pi:
Start the llama.cpp server
# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf neuralbroker/blitzkode
Configure the model in Pi
# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "neuralbroker/blitzkode" }
      ]
    }
  }
}
Run Pi
# Start Pi in your project directory:
pi
- Hermes Agent
How to use neuralbroker/blitzkode with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf neuralbroker/blitzkode
Configure Hermes
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default neuralbroker/blitzkode
Run Hermes
hermes
- Docker Model Runner
How to use neuralbroker/blitzkode with Docker Model Runner:
docker model run hf.co/neuralbroker/blitzkode
- Lemonade
How to use neuralbroker/blitzkode with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/
lemonade pull neuralbroker/blitzkode
Run and chat with the model
lemonade run user.blitzkode-{{QUANT_TAG}}
List all available models
lemonade list
language:
- en
license: mit
library_name: llama-cpp-python
pipeline_tag: text-generation
tags:
- code-generation
- coding-assistant
- gguf
- llama.cpp
- qwen2.5
- python
- javascript
- fine-tuned
base_model:
- Qwen/Qwen2.5-1.5B-Instruct
BlitzKode
BlitzKode is a local API-first AI coding assistant powered by a fine-tuned Qwen2.5-1.5B-Instruct model. It runs on your machine through llama-cpp-python with no external model API calls.
Tech Stack
| Layer | Tech |
|---|---|
| Base model | Qwen2.5-1.5B-Instruct |
| Fine-tuning | LoRA (r=16, α=32) via PEFT |
| Training | HuggingFace Transformers + TRL |
| Inference | llama-cpp-python (GGUF Q8_0) |
| Backend | Python 3.11+, FastAPI, uvicorn |
Features
- Local-first inference with the bundled GGUF model
- Backend-only FastAPI service exposing /generate, /generate/stream, /generate/research, /search/web, /health, and /info
- Real-time streaming via Server-Sent Events on /generate/stream
- Web research mode using DuckDuckGo search context before generation
- API key auth, request-size limits, and rate limiting for production use
- Backend/model optimizations: mmap model loading, configurable GPU layer offload, batch/thread tuning, optional prompt cache, search-result TTL caching, and efficient deque-based rate limiting (sketched after this list)
- Docker runtime image without Node.js/frontend build steps
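The actual rate limiter lives in server.py and is not reproduced here; the following is only an illustrative sketch of the deque-based sliding-window approach referenced above, using the documented default of 30 requests per IP per minute:

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_REQUESTS = 30  # matches the BLITZKODE_RATE_LIMIT_MAX default

_hits: dict[str, deque] = defaultdict(deque)

def allow_request(client_ip: str) -> bool:
    """Return True if this client is still under its per-minute budget."""
    now = time.monotonic()
    window = _hits[client_ip]
    # Evict timestamps that have fallen out of the sliding window.
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    if len(window) >= MAX_REQUESTS:
        return False
    window.append(now)
    return True
```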
Prerequisites
- Python 3.11+
- blitzkode.gguf at repo root, or set BLITZKODE_MODEL_PATH
- 4 GB+ RAM
Quick Start
pip install -r requirements.txt
python server.py
curl http://localhost:7860/health
Docker
# CPU
docker build -t blitzkode .
docker run -p 7860:7860 -v ./blitzkode.gguf:/app/blitzkode.gguf blitzkode
# GPU (with nvidia-docker)
docker compose --profile gpu up
API Examples
# Standard generation (streaming)
curl -X POST http://localhost:7860/generate/stream \
-H "Content-Type: application/json" \
-d '{"prompt":"Write a Python function to reverse a linked list"}'
# Non-streaming
curl -X POST http://localhost:7860/generate \
-H "Content-Type: application/json" \
-d '{"prompt":"Binary search in Python","max_tokens":128}'
# Web search only
curl -X POST http://localhost:7860/search/web \
-H "Content-Type: application/json" \
-d '{"query":"FastAPI dependency injection","max_results":3}'
# Research-augmented generation
curl -X POST http://localhost:7860/generate/research \
-H "Content-Type: application/json" \
-d '{"prompt":"How do I use async generators in Python 3.12?","deep_search":true}'
# Health / info
curl http://localhost:7860/health
curl http://localhost:7860/info
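The same endpoints are easy to call from Python. The sketch below consumes the SSE stream from /generate/stream with `requests`; the JSON shape inside each `data:` event is defined by server.py, so the example just prints the raw payloads:

```python
import requests

resp = requests.post(
    "http://localhost:7860/generate/stream",
    json={"prompt": "Write a Python function to reverse a linked list"},
    stream=True,
    timeout=120,
)
resp.raise_for_status()

for line in resp.iter_lines(decode_unicode=True):
    # Server-Sent Events arrive as lines prefixed with "data: ".
    if line and line.startswith("data: "):
        print(line[len("data: "):])
```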
API Parameters
Generation (/generate, /generate/stream)
| Parameter | Type | Default | Description |
|---|---|---|---|
| `prompt` | string | required | User request |
| `messages` | array | `[]` | Conversation history (max 20) |
| `temperature` | float | `0.5` | Sampling randomness 0.0–2.0 |
| `max_tokens` | int | `256` | Max generated tokens (cap 512) |
| `top_p` | float | `0.95` | Nucleus sampling threshold |
| `top_k` | int | `20` | Top-k sampling |
| `repeat_penalty` | float | `1.05` | Repetition penalty |
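For reference, a non-streaming request that sets every documented sampling parameter might look like the sketch below; the values shown are the documented defaults, not tuned recommendations:

```python
import requests

payload = {
    "prompt": "Refactor this loop into a list comprehension: for x in xs: ys.append(x * 2)",
    "messages": [],          # optional prior turns, at most 20
    "temperature": 0.5,
    "max_tokens": 256,       # capped at 512 server-side
    "top_p": 0.95,
    "top_k": 20,
    "repeat_penalty": 1.05,
}
# If BLITZKODE_API_KEY is set, also pass headers={"Authorization": "Bearer <key>"}.
resp = requests.post("http://localhost:7860/generate", json=payload, timeout=120)
print(resp.json())
```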
Research (/generate/research)
Same as generation, plus:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `search_query` | string | prompt | Override query for web search |
| `search_results` | int | `5` | Results to inject |
| `deep_search` | bool | `false` | Also search documentation/best-practices variants |
Web search (/search/web)
| Parameter | Type | Default | Description |
|---|---|---|---|
| `query` | string | required | Search query |
| `max_results` | int | `5` | Results to return |
| `deep` | bool | `false` | Multi-variant deep search |
Environment Variables
| Variable | Default | Description |
|---|---|---|
| `BLITZKODE_MODEL_PATH` | `blitzkode.gguf` | GGUF model path |
| `BLITZKODE_HOST` | `0.0.0.0` | Server bind address |
| `BLITZKODE_PORT` | `7860` | Server port |
| `BLITZKODE_GPU_LAYERS` | `0` | GPU layers for llama.cpp; use -1 to offload all supported layers |
| `BLITZKODE_N_CTX` | `2048` | Context window |
| `BLITZKODE_THREADS` | auto | CPU decode threads |
| `BLITZKODE_THREADS_BATCH` | auto | CPU prompt-processing threads |
| `BLITZKODE_BATCH` | `256` | Prompt-processing batch size |
| `BLITZKODE_UBATCH` | `128` | llama.cpp micro-batch size |
| `BLITZKODE_PROMPT_CACHE` | `true` | Enable llama.cpp in-memory prompt cache when supported |
| `BLITZKODE_PROMPT_CACHE_BYTES` | `67108864` | Prompt cache capacity in bytes |
| `BLITZKODE_USE_MMAP` | `true` | Memory-map the GGUF for faster startup and lower memory pressure |
| `BLITZKODE_USE_MLOCK` | `false` | Try to lock model pages in RAM |
| `BLITZKODE_OFFLOAD_KQV` | `true` | Offload K/Q/V operations when GPU layers are enabled |
| `BLITZKODE_MAX_PROMPT_LENGTH` | `4000` | Max prompt chars |
| `BLITZKODE_PRELOAD_MODEL` | `false` | Load model at startup |
| `BLITZKODE_CORS_ORIGINS` | `http://localhost:7860` | CORS origins |
| `BLITZKODE_API_KEY` | empty | Optional bearer token |
| `BLITZKODE_WEB_SEARCH` | `true` | Enable web search endpoints |
| `BLITZKODE_SEARCH_TIMEOUT` | `8` | Search HTTP timeout in seconds |
| `BLITZKODE_MAX_SEARCH_RESULTS` | `5` | Max search results |
| `BLITZKODE_SEARCH_CACHE_TTL` | `300` | Search result cache TTL in seconds |
| `BLITZKODE_RATE_LIMIT` | `true` | Enable per-IP rate limiting |
| `BLITZKODE_RATE_LIMIT_MAX` | `30` | Requests per IP per minute |
| `BLITZKODE_MAX_REQUEST_BYTES` | `50000` | Request body size limit |
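Most of the model-related variables map directly onto keyword arguments of llama-cpp-python's `Llama` constructor. The sketch below is an assumption about how server.py wires them up, shown only to clarify what each knob controls; the keyword arguments themselves are real llama-cpp-python parameters in recent versions:

```python
import os
from llama_cpp import Llama

def env_flag(name: str, default: str) -> bool:
    return os.getenv(name, default).lower() == "true"

llm = Llama(
    model_path=os.getenv("BLITZKODE_MODEL_PATH", "blitzkode.gguf"),
    n_ctx=int(os.getenv("BLITZKODE_N_CTX", "2048")),
    n_gpu_layers=int(os.getenv("BLITZKODE_GPU_LAYERS", "0")),  # -1 offloads all supported layers
    n_batch=int(os.getenv("BLITZKODE_BATCH", "256")),
    n_ubatch=int(os.getenv("BLITZKODE_UBATCH", "128")),
    use_mmap=env_flag("BLITZKODE_USE_MMAP", "true"),
    use_mlock=env_flag("BLITZKODE_USE_MLOCK", "false"),
    offload_kqv=env_flag("BLITZKODE_OFFLOAD_KQV", "true"),
)
```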
Model Evaluation
Latest local GGUF evaluation: 2026-05-16 using python scripts/evaluate_model.py on CPU (n_ctx=2048, threads=8, batch=256, gpu_layers=0). Full machine-readable results are stored in docs/evaluation_results.json.
| Eval case | Result | Notes |
|---|---|---|
| Python factorial with negative-input handling | ✅ Pass | Generated a correct iterative implementation with ValueError for negative input. |
| Iterative binary search | ✅ Pass | Generated a valid loop-based search returning index or -1. |
| SQL top users by order count | ✅ Pass | Generated JOIN, GROUP BY, ORDER BY, and LIMIT 5. |
| Unknown fictional API uncertainty | ❌ Fail | The raw model hallucinated a plausible signature for imaginary_blitz_api; the backend guard still blocks direct unknown-signature prompts on /generate and /generate/stream. |
Summary: 3 / 4 passed (75%). Total generation time was 28.864 s after a 0.312 s model load. Evaluation-of-the-evaluation: this is a lightweight heuristic smoke eval, not a comprehensive benchmark; it is useful for regression tracking and quick sanity checks, but code should still be reviewed and tested. Future eval work should add executable unit tests for generated code and larger benchmark suites such as HumanEval/MBPP-style tasks.
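As one concrete direction for the "executable unit tests" mentioned above, a future eval case could exec the generated snippet in a scratch namespace and assert on its behaviour. This is an illustrative sketch only, not part of scripts/evaluate_model.py, and it assumes the prompt asks for a function named `factorial`:

```python
def passes_factorial_case(generated_code: str) -> bool:
    """Execute generated code and check the factorial contract from the first eval case."""
    namespace: dict = {}
    try:
        exec(generated_code, namespace)  # local, trusted eval context only
        factorial = namespace["factorial"]
        if factorial(5) != 120:
            return False
        try:
            factorial(-1)
        except ValueError:
            return True   # negative input must raise ValueError
        return False
    except Exception:
        return False
```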
Training Pipeline
BlitzKode was fine-tuned through a staged pipeline on an RTX 4060 (8 GB VRAM):
| Stage | Script | Details |
|---|---|---|
| SFT v1 | `train_sft.py` | LoRA r=32 on curated coding examples |
| Reward-SFT | `train_reward_sft.py` | Reward-heuristic continuation |
| DPO | `train_dpo.py` | Chosen/rejected preference pairs |
| SFT v2 | `train_available.py` | LoRA r=16 resource-aware training |
| Export | `export_production.py` | Merge → GGUF Q8_0 via llama.cpp |
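For orientation, the LoRA setup described above (r=16, α=32 on Qwen2.5-1.5B-Instruct) corresponds roughly to a PEFT/TRL configuration like the sketch below; the target modules and dataset formatting are assumptions, see scripts/train_available.py for the actual values:

```python
# pip install transformers peft trl datasets
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTTrainer

# Dataset path taken from the re-training instructions below; field handling omitted.
train_dataset = load_dataset(
    "json", data_files="datasets/raw/blitzkode_full_training.json", split="train"
)

lora_config = LoraConfig(
    r=16,                  # SFT v2 rank
    lora_alpha=32,
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    train_dataset=train_dataset,
    peft_config=lora_config,
)
trainer.train()
```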
Re-train from scratch
pip install -r requirements-training.txt
python scripts/build_full_dataset.py
python scripts/train_available.py \
--model Qwen/Qwen2.5-1.5B-Instruct \
--quantization none \
--dataset datasets/raw/blitzkode_full_training.json \
--max-steps 100 --seq-len 384 --batch-size 1 --grad-accum 8
python scripts/export_production.py
Project Structure
BlitzKode/
server.py FastAPI backend
blitzkode.gguf Local GGUF model (ignored by git)
scripts/ Training, export, evaluation, and utility scripts
docs/evaluation_results.json Latest local model evaluation output
tests/test_server.py Backend endpoint tests
datasets/MANIFEST.md Dataset provenance
docs/ Architecture and production docs
Dockerfile Python runtime image
docker-compose.yml CPU + GPU service definitions
requirements.txt Serving dependencies
requirements-training.txt Training dependencies
CI
python -m pytest tests/ -v
python -m ruff check .
python -m mypy server.py --ignore-missing-imports
python scripts/evaluate_model.py
docker build -t blitzkode:ci .
License
MIT. See LICENSE. Also comply with the upstream Qwen2.5 license when redistributing model weights.
Created by Sajad (neuralbroker)