Instructions for using neuralbroker/blitzkode with libraries, inference providers, notebooks, and local apps.

Libraries

How to use neuralbroker/blitzkode with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="neuralbroker/blitzkode",
	filename="blitzkode.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)
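
For interactive use it can help to print tokens as they arrive. The sketch below assumes the `llm` object created above and relies on `create_chat_completion(..., stream=True)`, which yields OpenAI-style chunks. The helper names `iter_stream_text` and `stream_reply` are illustrative and not part of the model repository.

```python
from typing import Iterable, Iterator


def iter_stream_text(chunks: Iterable[dict]) -> Iterator[str]:
    """Yield the text pieces carried by OpenAI-style streaming chunks."""
    for chunk in chunks:
        delta = chunk["choices"][0].get("delta", {})
        piece = delta.get("content")
        if piece:  # some chunks carry only the role or an empty delta
            yield piece


def stream_reply(llm, prompt: str, max_tokens: int = 256) -> str:
    """Print a reply while it streams and return the full text."""
    chunks = llm.create_chat_completion(
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        stream=True,
    )
    parts = []
    for piece in iter_stream_text(chunks):
        print(piece, end="", flush=True)
        parts.append(piece)
    print()
    return "".join(parts)
```

With a loaded model, `stream_reply(llm, "Write a Python function that reverses a list.")` prints the answer incrementally and returns it as one string.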

Local Apps

llama.cpp

How to use neuralbroker/blitzkode with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf neuralbroker/blitzkode
# Run inference directly in the terminal:
llama-cli -hf neuralbroker/blitzkode

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf neuralbroker/blitzkode
# Run inference directly in the terminal:
llama-cli -hf neuralbroker/blitzkode

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf neuralbroker/blitzkode
# Run inference directly in the terminal:
./llama-cli -hf neuralbroker/blitzkode

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf neuralbroker/blitzkode
# Run inference directly in the terminal:
./build/bin/llama-cli -hf neuralbroker/blitzkode

Use Docker

docker model run hf.co/neuralbroker/blitzkode

vLLM

How to use neuralbroker/blitzkode with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "neuralbroker/blitzkode"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "neuralbroker/blitzkode",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
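
The same request can be sent from Python using only the standard library. This is a minimal sketch that assumes the server started above is listening on `http://localhost:8000/v1`; the function names are illustrative. For `llama-server`, the base URL would instead be `http://localhost:8080/v1`, the address used in the Pi and Hermes configurations on this page.

```python
import json
import urllib.request


def build_chat_payload(prompt: str, model: str = "neuralbroker/blitzkode") -> dict:
    """Build the JSON body for an OpenAI-compatible chat completion request."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }


def chat(prompt: str, base_url: str = "http://localhost:8000/v1",
         timeout: float = 60.0) -> str:
    """POST the prompt to /chat/completions and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_chat_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```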

Use Docker

docker model run hf.co/neuralbroker/blitzkode

Ollama
How to use neuralbroker/blitzkode with Ollama:
```
ollama run hf.co/neuralbroker/blitzkode
```

Unsloth Studio

How to use neuralbroker/blitzkode with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for neuralbroker/blitzkode to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for neuralbroker/blitzkode to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for neuralbroker/blitzkode to start chatting

Pi

How to use neuralbroker/blitzkode with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf neuralbroker/blitzkode

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "neuralbroker/blitzkode"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use neuralbroker/blitzkode with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf neuralbroker/blitzkode

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default neuralbroker/blitzkode

Run Hermes

hermes

Docker Model Runner
How to use neuralbroker/blitzkode with Docker Model Runner:
```
docker model run hf.co/neuralbroker/blitzkode
```

Lemonade

How to use neuralbroker/blitzkode with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull neuralbroker/blitzkode

Run and chat with the model

lemonade run user.blitzkode-{{QUANT_TAG}}

List all available models

lemonade list

	---
	language:
	- en
	license: mit
	library_name: llama-cpp-python
	pipeline_tag: text-generation
	tags:
	- code-generation
	- coding-assistant
	- gguf
	- llama.cpp
	- qwen2.5
	- python
	- javascript
	- fine-tuned
	- lora
	- peft
	base_model:
	- Qwen/Qwen2.5-1.5B-Instruct
	- Qwen/Qwen2.5-0.5B-Instruct
	---

	# BlitzKode

	BlitzKode is a local AI coding assistant fine-tuned from the Qwen2.5 family. It
	ships as a GGUF model (1.5B, Q8_0, ~1.53 GB) for fast offline inference with
	llama.cpp, and as LoRA adapters for PEFT-based research and further
	fine-tuning.

	> Creator: [Sajad (neuralbroker)](https://github.com/neuralbroker)
	> GitHub: <https://github.com/neuralbroker/blitzkode>
	> GGUF model: [`neuralbroker/blitzkode`](https://huggingface.co/neuralbroker/blitzkode)
	> LoRA adapter: [`neuralbroker/blitzkode-lora-0.5b`](https://huggingface.co/neuralbroker/blitzkode-lora-0.5b)

	---

	## Model Variants

	| Variant | Version | Base Model | Format | Size | Runtime |
	|---|---|---|---|---|---|
	| GGUF (production) | 2.0 | `Qwen/Qwen2.5-1.5B-Instruct` | GGUF Q8_0 | ~1.53 GB | llama.cpp / llama-cpp-python |
	| LoRA adapter (research) | 2.1 | `Qwen/Qwen2.5-0.5B-Instruct` | PEFT safetensors | ~100 MB | PEFT + Transformers |

	---

	## Architecture

	| Property | GGUF (1.5B) | LoRA Adapter (0.5B) |
	|---|---|---|
	| Model type | Transformer (Qwen2) | Transformer (Qwen2) + LoRA |
	| Parameters | 1.5 B | 0.5 B + adapter weights |
	| Quantization | GGUF Q8_0 | bfloat16 / float16 |
	| LoRA rank (r) | — | 16 |
	| LoRA alpha | — | 32 |
	| LoRA target modules | — | q, k, v, o, gate, up, down projections |
	| Context window | 2 048 tokens | 2 048 tokens |
	| Vocabulary | 151 936 | 151 936 |
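
The adapter column of the table can be written as a PEFT configuration. This is a sketch rather than the project's actual training config: it assumes the listed projections correspond to the Qwen2 module names used in Transformers (`q_proj`, `k_proj`, and so on), and it omits settings the table does not state, such as dropout.

```python
# Values taken from the Architecture table; module names are an assumption
# based on how Qwen2 layers are named in Hugging Face Transformers.
LORA_KWARGS = {
    "r": 16,
    "lora_alpha": 32,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    "task_type": "CAUSAL_LM",
}

# Standard LoRA scales the low-rank update by alpha / r.
LORA_SCALING = LORA_KWARGS["lora_alpha"] / LORA_KWARGS["r"]


def make_lora_config():
    """Build a peft.LoraConfig; PEFT is imported lazily so the dict is usable alone."""
    from peft import LoraConfig

    return LoraConfig(**LORA_KWARGS)
```

With rank 16 and alpha 32, the adapter update is scaled by a factor of 2 before being added to the frozen base weights.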

	---

	## Training Pipeline

	BlitzKode was produced by four fine-tuning stages followed by a merge-and-export stage:

	### Stage 1 — SFT (Supervised Fine-Tuning)
	LoRA fine-tuning (`r=32`, base: Qwen2.5-1.5B-Instruct) on 71 curated algorithmic
	coding problems covering arrays, strings, trees, dynamic programming, graphs,
	sorting, hash tables, binary search, and more.

	- Adapter checkpoint: `checkpoints/sft-1.5b-v1/`
	- Library: PEFT + HuggingFace Transformers

	### Stage 2 — Reward-SFT
	Continued SFT with heuristic reward functions to reinforce code correctness,
	formatting quality, and concise explanation style. This is a standard SFT
	training loop using scalar reward signals, not full GRPO.

	- Adapter checkpoint: `checkpoints/grpo-v1/` (label is historical)
	- Library: TRL / Transformers

	### Stage 3 — DPO (Direct Preference Optimization)
	Preference optimization on handcrafted chosen/rejected pairs to improve answer
	clarity, reduce verbosity, and penalize hallucinated APIs or filenames.

	- Adapter checkpoint: `checkpoints/dpo-v1/`
	- Library: TRL
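
TRL's DPO trainer expects records with `prompt`, `chosen`, and `rejected` fields. The sketch below validates that shape. The example pair is hypothetical and written only for illustration; it is not taken from the BlitzKode preference data.

```python
REQUIRED_KEYS = ("prompt", "chosen", "rejected")


def validate_pair(record: dict) -> None:
    """Raise ValueError if a preference record is malformed."""
    for key in REQUIRED_KEYS:
        value = record.get(key)
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"missing or empty field: {key}")
    if record["chosen"].strip() == record["rejected"].strip():
        raise ValueError("chosen and rejected must differ")


# Hypothetical pair: the rejected answer invents an API that does not exist.
example_pair = {
    "prompt": "Reverse a string in Python.",
    "chosen": "def reverse(s: str) -> str:\n    return s[::-1]",
    "rejected": "Call the built-in s.reverse() method.",  # str has no reverse()
}
validate_pair(example_pair)
```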

	### Stage 4 — Continued LoRA SFT (Published Adapter)
	Final LoRA fine-tuning (`r=16`, base: Qwen2.5-0.5B-Instruct) on 99 samples
	drawn from the 199-sample full dataset. Training ran for 50 steps; final loss
	reached ~0.48.

	- Adapter checkpoint: `checkpoints/available-lora-0.5b-full/final` ✅ (publicly available)
	- Library: PEFT + Transformers

	### Stage 5 — Merge & Export (GGUF)
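	LoRA adapters from Stages 1–3 were merged into the 1.5B base model using
	`merge_and_unload()`, then converted to GGUF Q8_0 format with llama.cpp. The
	Stage 4 adapter targets the 0.5B base and is published separately, so it is not
	part of the GGUF.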

	- Script: `scripts/export_gguf.py`
	- Artifact: `blitzkode.gguf` (~1.53 GB, git-ignored)
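
A merge-and-export step of this kind might look like the sketch below. It is not the contents of `scripts/export_gguf.py`: the function names, the single-adapter merge, and the location of the llama.cpp checkout are all assumptions. The conversion uses llama.cpp's `convert_hf_to_gguf.py` script with `--outtype q8_0`.

```python
from pathlib import Path


def merge_adapter(base_id: str, adapter_dir: str, out_dir: str) -> None:
    """Fold one LoRA adapter into its base model and save the merged weights."""
    # Heavy dependencies are imported lazily so the command helper works alone.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype="auto")
    merged = PeftModel.from_pretrained(base, adapter_dir).merge_and_unload()
    merged.save_pretrained(out_dir)
    # The converter also needs the tokenizer files next to the weights.
    AutoTokenizer.from_pretrained(base_id).save_pretrained(out_dir)


def gguf_convert_command(llama_cpp_dir: str, model_dir: str,
                         outfile: str = "blitzkode.gguf") -> list:
    """Return the argv for converting a merged model directory to GGUF Q8_0."""
    script = Path(llama_cpp_dir) / "convert_hf_to_gguf.py"
    return ["python", str(script), model_dir,
            "--outfile", outfile, "--outtype", "q8_0"]
```

After `merge_adapter("Qwen/Qwen2.5-1.5B-Instruct", "checkpoints/dpo-v1", "merged")`, the command from `gguf_convert_command("llama.cpp", "merged")` can be run with `subprocess.run(cmd, check=True)`. Using `checkpoints/dpo-v1` as the adapter to merge is a guess at how the stages chain together.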

	---

	## Training Data

	Total: 199 samples across 3 subsets

	\| Subset \| Count \| Source \| License \| Purpose \|
	\|---\|---\|---\|---\|---\|
	\| Curated algorithmic problems \| 71 \| Custom (local) \| MIT \| Core coding skills: arrays, strings, trees, DP, graphs, sorting, searching \|
	\| MetaMathQA samples \| 100 \| [`meta-math/MetaMathQA`](https://huggingface.co/datasets/meta-math/MetaMathQA) \| CC BY 4.0 \| Math reasoning transfer to improve step-by-step problem solving \|
	\| Python/JavaScript patterns \| 28 \| Custom (local) \| MIT \| Practical patterns: decorators, context managers, data classes, async, CLI tools \|
	\| Total \| 199 \| \| \| \|

	See [`datasets/MANIFEST.md`](datasets/MANIFEST.md) for full dataset provenance,
	preprocessing notes, and per-sample license details.

	---

	## Features

	- Multi-language code generation — Python, JavaScript, Java, C++, TypeScript, SQL
	- Code explanation — clear inline comments and documentation
	- Bug fixing — debug and fix common code issues
	- Algorithm assistance — data structures and algorithms (LeetCode-style)
	- Offline operation — fully local, no internet required at inference time
	- Fast CPU inference — GGUF Q8_0 runs on commodity CPUs
	- API-first serving — FastAPI backend with REST and SSE streaming endpoints
	- Optimized local inference — configurable llama.cpp GPU offload, mmap loading, batching, and prompt cache

	---

	## Usage

	### Production: GGUF with llama.cpp

	```bash
	# Clone and install
	git clone https://github.com/neuralbroker/blitzkode
	cd blitzkode
	pip install -r requirements.txt

	# Start the API server (place blitzkode.gguf in repo root first)
	python server.py
	curl http://localhost:7860/health
	```
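
When the server is started from a script, it can be useful to wait for `/health` before sending requests. The sketch below only checks for an HTTP 200 response, since the card does not document the response body; the function names and retry settings are illustrative.

```python
import time
import urllib.error
import urllib.request


def is_healthy(base_url: str = "http://localhost:7860", timeout: float = 2.0) -> bool:
    """Return True if GET /health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


def wait_until_healthy(base_url: str = "http://localhost:7860",
                       attempts: int = 10, delay: float = 1.0) -> bool:
    """Poll /health until it succeeds or the attempts run out."""
    for _ in range(attempts):
        if is_healthy(base_url):
            return True
        time.sleep(delay)
    return False
```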

	### Research: LoRA Adapter with PEFT

	```python
	from peft import PeftModel
	from transformers import AutoModelForCausalLM, AutoTokenizer

	base_model_id = "Qwen/Qwen2.5-0.5B-Instruct"
	adapter_repo = "neuralbroker/blitzkode-lora-0.5b"

	tokenizer = AutoTokenizer.from_pretrained(base_model_id, trust_remote_code=True)
	model = AutoModelForCausalLM.from_pretrained(
	    base_model_id, torch_dtype="auto", device_map="auto", trust_remote_code=True
	)
	model = PeftModel.from_pretrained(model, adapter_repo)
	model.eval()
	```
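
Once the adapter is loaded, a generation call could look like the sketch below. It takes the `model` and `tokenizer` objects from the block above as arguments; the helper names and the `max_new_tokens` default are illustrative, not settings from the project.

```python
from typing import List, Optional


def build_messages(prompt: str, system_prompt: Optional[str] = None) -> List[dict]:
    """Build a chat message list, with an optional system message first."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": prompt})
    return messages


def generate(model, tokenizer, prompt: str,
             system_prompt: Optional[str] = None, max_new_tokens: int = 256) -> str:
    """Render the chat template, generate, and return only the new text."""
    import torch  # imported lazily; only needed when actually generating

    text = tokenizer.apply_chat_template(
        build_messages(prompt, system_prompt),
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens so only the completion is decoded.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```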

	### Prompt Format (ChatML)

	All variants use the Qwen ChatML template:

	```
	<|im_start|>system
	You are BlitzKode, an AI coding assistant created by Sajad. You are an expert
	in Python, JavaScript, Java, C++, and other languages. Write clean, efficient,
	and well-documented code. Keep responses concise and practical.<|im_end|>
	<|im_start|>user
	{your prompt}<|im_end|>
	<|im_start|>assistant
	```
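
For raw completion calls that bypass a chat API, the template can be rendered by hand. The sketch below assumes the line breaks inside the system prompt above are only text wrapping, so it joins them into one paragraph; the function name is illustrative.

```python
SYSTEM_PROMPT = (
    "You are BlitzKode, an AI coding assistant created by Sajad. You are an expert "
    "in Python, JavaScript, Java, C++, and other languages. Write clean, efficient, "
    "and well-documented code. Keep responses concise and practical."
)


def build_chatml_prompt(user_prompt: str, system_prompt: str = SYSTEM_PROMPT) -> str:
    """Render a single-turn ChatML prompt that ends at the assistant turn."""
    return (
        f"<|im_start|>system\n{system_prompt}<|im_end|>\n"
        f"<|im_start|>user\n{user_prompt}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )
```

With llama-cpp-python, the result can be passed to a raw completion call such as `llm(build_chatml_prompt("Write a binary search."), stop=["<|im_end|>"])` so that generation stops at the end of the assistant turn.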

	---

	## Intended Use

	### Best For
	- Local offline coding assistance
	- Algorithm and data structure problem solving
	- Code generation and explanation
	- Educational programming support
	- Code review, refactoring, and debugging

	### Out of Scope
	- Production code without thorough expert review
	- Security-critical or cryptographic applications
	- Multi-modal tasks (images not supported)
	- Long-context repository analysis (> 2 048 tokens)
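
Because the prompt and the generated tokens share the 2 048-token window, it is worth checking the budget before sending a long input. This is a minimal sketch; `count_tokens` assumes a llama-cpp-python `Llama` instance, and the function names are illustrative.

```python
N_CTX = 2048  # default context window stated in this card


def fits_context(prompt_tokens: int, max_new_tokens: int, n_ctx: int = N_CTX) -> bool:
    """Return True if the prompt plus the requested completion fits in the window."""
    return prompt_tokens + max_new_tokens <= n_ctx


def count_tokens(llm, text: str) -> int:
    """Count tokens with the model's own tokenizer (llama-cpp-python expects bytes)."""
    return len(llm.tokenize(text.encode("utf-8")))
```

For example, a 1 800-token prompt with `max_new_tokens=256` needs 2 056 tokens and does not fit, so the prompt would have to be shortened or the completion budget reduced.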

	---

	## Evaluation

	Latest local GGUF smoke evaluation was run with `python scripts/evaluate_model.py` on CPU (`n_ctx=2048`, `threads=8`, `batch=256`, `gpu_layers=0`). Full machine-readable results are available in [`docs/evaluation_results.json`](docs/evaluation_results.json).

	| Eval case | Result | Notes |
	|---|---:|---|
	| Python factorial with negative-input handling | ✅ Pass | Correct iterative implementation with negative-input validation. |
	| Iterative binary search | ✅ Pass | Valid loop-based implementation returning index or `-1`. |
	| SQL top users by order count | ✅ Pass | Correct `JOIN`, `GROUP BY`, `ORDER BY`, and `LIMIT 5` structure. |
	| Unknown fictional API uncertainty | ❌ Fail | Raw model hallucinated a plausible signature; the FastAPI backend adds a guard for direct unknown-signature prompts. |

	Summary: 3 / 4 passed (75%). This is a lightweight heuristic regression smoke test, not a benchmark suite. Stronger future evaluation should include executable unit tests and larger coding benchmarks such as HumanEval/MBPP-style tasks.
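
A heuristic check of this kind can be as simple as looking for expected keywords in the output. The sketch below is a hypothetical illustration, not the logic in `scripts/evaluate_model.py`; the case names and the SQL keyword list are taken from the table above.

```python
import re
from typing import Dict


def check_sql_top_users(output: str) -> bool:
    """Heuristic: the query should contain the clauses noted in the table."""
    text = re.sub(r"\s+", " ", output.upper())
    return all(clause in text for clause in ("JOIN", "GROUP BY", "ORDER BY", "LIMIT 5"))


def pass_rate(results: Dict[str, bool]) -> float:
    """Fraction of eval cases that passed."""
    if not results:
        raise ValueError("no eval results")
    return sum(results.values()) / len(results)


# Outcomes as reported in the table above.
reported = {
    "factorial": True,
    "binary_search": True,
    "sql_top_users": True,
    "unknown_api": False,
}
```

Here `pass_rate(reported)` gives 0.75, matching the summary. Keyword checks like this can pass code that does not run, which is why executable unit tests would make a stronger evaluation.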

	---

	## Limitations

	- Text-only input — no image or file-upload support
	- 2 048-token default context — CPU-friendly but limits long conversation history
	- Verify all outputs — always review and test generated code
	- Small model — 0.5B–1.5B scale; may produce incorrect code on complex tasks
	- Raw model hallucination risk — the API server includes guardrails, but direct GGUF prompting can still invent unsupported API details
	- No real-time data — knowledge cutoff follows the Qwen2.5 base model unless the optional research endpoint is used
	- Math reasoning — MetaMathQA transfer helps basic reasoning; not a math specialist

	---

	## Environment Variables (Inference Server)

	| Variable | Default | Description |
	|---|---|---|
	| `BLITZKODE_GPU_LAYERS` | `0` | Number of layers to offload to GPU |
	| `BLITZKODE_THREADS` | system | CPU decode thread count |
	| `BLITZKODE_THREADS_BATCH` | system | CPU prompt-processing thread count |
	| `BLITZKODE_N_CTX` | `2048` | Context window size |
	| `BLITZKODE_BATCH` | `256` | llama.cpp prompt-processing batch size |
	| `BLITZKODE_UBATCH` | `128` | llama.cpp micro-batch size |
	| `BLITZKODE_PROMPT_CACHE` | `true` | Enable in-memory prompt cache when supported |
	| `BLITZKODE_PRELOAD_MODEL` | `false` | Load model at startup vs first request |
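
One way to read these variables with the defaults from the table is sketched below. It is an illustration, not the code in `server.py`: the `InferenceConfig` class, the helper names, and the accepted boolean spellings are assumptions. A value of `None` for the thread settings stands for the "system" default, leaving the choice to the runtime.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


def _env_int(env: Mapping[str, str], name: str, default: Optional[int]) -> Optional[int]:
    """Parse an integer variable, falling back to the default when unset or empty."""
    raw = env.get(name)
    return int(raw) if raw not in (None, "") else default


def _env_bool(env: Mapping[str, str], name: str, default: bool) -> bool:
    """Parse a boolean variable; the accepted spellings are an assumption."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class InferenceConfig:
    gpu_layers: int
    threads: Optional[int]
    threads_batch: Optional[int]
    n_ctx: int
    batch: int
    ubatch: int
    prompt_cache: bool
    preload_model: bool


def load_config(env: Mapping[str, str] = os.environ) -> InferenceConfig:
    """Read the BLITZKODE_* variables using the defaults from the table."""
    return InferenceConfig(
        gpu_layers=_env_int(env, "BLITZKODE_GPU_LAYERS", 0),
        threads=_env_int(env, "BLITZKODE_THREADS", None),
        threads_batch=_env_int(env, "BLITZKODE_THREADS_BATCH", None),
        n_ctx=_env_int(env, "BLITZKODE_N_CTX", 2048),
        batch=_env_int(env, "BLITZKODE_BATCH", 256),
        ubatch=_env_int(env, "BLITZKODE_UBATCH", 128),
        prompt_cache=_env_bool(env, "BLITZKODE_PROMPT_CACHE", True),
        preload_model=_env_bool(env, "BLITZKODE_PRELOAD_MODEL", False),
    )
```

Passing a plain dict, for example `load_config({"BLITZKODE_GPU_LAYERS": "20"})`, makes the parsing easy to test without touching the real environment.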

	---

	## Project Structure

	```text
	BlitzKode/
	    server.py                      # FastAPI backend (inference + search)
	    blitzkode.gguf                 # GGUF model artifact (~1.53 GB, git-ignored)
	    scripts/
	        evaluate_model.py          # Lightweight GGUF evaluation harness
	        train_sft.py               # Stage 1: SFT training
	        train_reward_sft.py        # Stage 2: Reward-SFT
	        train_dpo.py               # Stage 3: DPO
	        train_available.py         # Stage 4: LoRA fine-tune (0.5B)
	        export_gguf.py             # Stage 5: Merge & convert to GGUF
	        push_to_hub.py             # Push adapter to HuggingFace Hub
	        build_full_dataset.py      # Dataset builder (algorithmic + HF datasets)
	    docs/
	        evaluation_results.json    # Latest smoke-eval output
	        PROJECT_OVERVIEW.md        # Architecture and design notes
	    datasets/
	        MANIFEST.md                # Dataset provenance and license info
	    checkpoints/
	        available-lora-0.5b-full/  # Published LoRA adapter (0.5B)
	    tests/
	        test_server.py             # HTTP integration tests
	    README.md                      # Full project documentation
	    MODEL_CARD.md                  # This file
	```

	---

	## License

	MIT — see [LICENSE](https://github.com/neuralbroker/blitzkode/blob/main/LICENSE).

	You must also comply with the upstream Qwen2.5 license when redistributing any
	fine-tuned weights derived from it.

	- [Qwen2.5-0.5B-Instruct license](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct)
	- [Qwen2.5-1.5B-Instruct license](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct)

	Training data subsets carry their own licenses:
	- MetaMathQA: [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)
	- Custom/local samples: MIT

	---

	## Contact

	- GitHub Issues: <https://github.com/neuralbroker/blitzkode/issues>
	- Portfolio: <https://neuralbroker.vercel.app>

	Contributions and feedback are welcome!

	---

	## Citation

	```bibtex
	@software{blitzkode2025,
	author = {Sajad},
	title = {BlitzKode: A Local AI Coding Assistant},
	year = {2025},
	url = {https://github.com/neuralbroker/blitzkode}
	}
	```