Update clean backend-only project docs and eval
MODEL_CARD.md +37 -19
@@ -23,9 +23,9 @@ base_model:
 # BlitzKode

 **BlitzKode** is a local AI coding assistant fine-tuned from the Qwen2.5 family. It
-ships as a **GGUF model** (1.5B,
-llama.cpp, and as
-

 > **Creator:** [Sajad (neuralbroker)](https://github.com/neuralbroker)
 > **GitHub:** <https://github.com/neuralbroker/blitzkode>
@@ -38,7 +38,7 @@ further fine-tuning.

 | Variant | Version | Base Model | Format | Size | Runtime |
 |---|---|---|---|---|---|
-| **GGUF** (production) | 2.0 | `Qwen/Qwen2.5-1.5B-Instruct` | GGUF
 | **LoRA adapter** (research) | 2.1 | `Qwen/Qwen2.5-0.5B-Instruct` | PEFT safetensors | ~100 MB | PEFT + Transformers |

 ---
@@ -49,7 +49,7 @@ further fine-tuning.
 |---|---|---|
 | **Model type** | Transformer (Qwen2) | Transformer (Qwen2) + LoRA |
 | **Parameters** | 1.5 B | 0.5 B + adapter weights |
-| **Quantization** | GGUF
 | **LoRA rank (r)** | – | 16 |
 | **LoRA alpha** | – | 32 |
 | **LoRA target modules** | – | q, k, v, o, gate, up, down projections |
@@ -95,10 +95,10 @@ reached **~0.48**.

 ### Stage 5 – Merge & Export (GGUF)
 LoRA adapters from Stage 1–3 were merged into the 1.5B base model using
-`merge_and_unload()`, then converted to GGUF

 - **Script:** `scripts/export_gguf.py`
-- **Artifact:** `blitzkode.gguf` (~

 ---

@@ -126,8 +126,8 @@ preprocessing notes, and per-sample license details.
 - **Algorithm assistance** – data structures and algorithms (LeetCode-style)
 - **Offline operation** – fully local, no internet required at inference time
 - **Fast CPU inference** – GGUF F16 runs on commodity CPUs
-- **
-- **

 ---

@@ -141,12 +141,9 @@ git clone https://github.com/neuralbroker/blitzkode
 cd blitzkode
 pip install -r requirements.txt

-#
-cd frontend && npm install && npm run build && cd ..
-
-# Start the server (place blitzkode.gguf in repo root first)
 python server.py
-
 ```

 ### Research: LoRA Adapter with PEFT
@@ -199,13 +196,29 @@ and well-documented code. Keep responses concise and practical.<|im_end|>

 ---

 ## Limitations

 - **Text-only input** – no image or file-upload support
-- **2 048-token context** – CPU-friendly but limits long conversation history
 - **Verify all outputs** – always review and test generated code
 - **Small model** – 0.5B–1.5B scale; may produce incorrect code on complex tasks
-- **
 - **Math reasoning** – MetaMathQA transfer helps basic reasoning; not a math specialist

 ---
@@ -215,9 +228,12 @@ and well-documented code. Keep responses concise and practical.<|im_end|>
 | Variable | Default | Description |
 |---|---|---|
 | `BLITZKODE_GPU_LAYERS` | `0` | Number of layers to offload to GPU |
-| `BLITZKODE_THREADS` | system | CPU
 | `BLITZKODE_N_CTX` | `2048` | Context window size |
-| `BLITZKODE_BATCH` | `
 | `BLITZKODE_PRELOAD_MODEL` | `false` | Load model at startup vs first request |

 ---
@@ -228,8 +244,8 @@ and well-documented code. Keep responses concise and practical.<|im_end|>
 BlitzKode/
 server.py # FastAPI backend (inference + search)
 blitzkode.gguf # GGUF model artifact (~3 GB, git-ignored)
-frontend/ # React/Vite web UI
 scripts/
 train_sft.py # Stage 1: SFT training
 train_reward_sft.py # Stage 2: Reward-SFT
 train_dpo.py # Stage 3: DPO
@@ -237,6 +253,8 @@ BlitzKode/
 export_gguf.py # Merge & convert to GGUF
 push_to_hub.py # Push adapter to HuggingFace Hub
 build_full_dataset.py # Dataset builder (algorithmic + HF datasets)
 datasets/
 MANIFEST.md # Dataset provenance and license info
 checkpoints/
@@ -23,9 +23,9 @@ base_model:
 # BlitzKode

 **BlitzKode** is a local AI coding assistant fine-tuned from the Qwen2.5 family. It
+ships as a **GGUF model** (1.5B, Q8_0, ~1.53 GB) for fast offline inference with
+llama.cpp, and as **LoRA adapters** for PEFT-based research and further
+fine-tuning.

 > **Creator:** [Sajad (neuralbroker)](https://github.com/neuralbroker)
 > **GitHub:** <https://github.com/neuralbroker/blitzkode>
@@ -38,7 +38,7 @@ further fine-tuning.

 | Variant | Version | Base Model | Format | Size | Runtime |
 |---|---|---|---|---|---|
+| **GGUF** (production) | 2.0 | `Qwen/Qwen2.5-1.5B-Instruct` | GGUF Q8_0 | ~1.53 GB | llama.cpp / llama-cpp-python |
 | **LoRA adapter** (research) | 2.1 | `Qwen/Qwen2.5-0.5B-Instruct` | PEFT safetensors | ~100 MB | PEFT + Transformers |

 ---
@@ -49,7 +49,7 @@ further fine-tuning.
 |---|---|---|
 | **Model type** | Transformer (Qwen2) | Transformer (Qwen2) + LoRA |
 | **Parameters** | 1.5 B | 0.5 B + adapter weights |
+| **Quantization** | GGUF Q8_0 | bfloat16 / float16 |
 | **LoRA rank (r)** | – | 16 |
 | **LoRA alpha** | – | 32 |
 | **LoRA target modules** | – | q, k, v, o, gate, up, down projections |
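
Reviewer note: the new ~1.53 GB figure for the Q8_0 artifact can be sanity-checked with rough arithmetic. The sketch below is an estimate only. It assumes the Q8_0 block layout used by llama.cpp (32 int8 weights plus one 2-byte scale per block) and a parameter count of about 1.54 billion for the base model. Neither number comes from this diff, and file metadata and tensors kept in other types are ignored.

```python
# Rough size estimate for a Q8_0 GGUF file.
# Assumptions (not stated in the diff): Q8_0 stores weights in blocks of
# 32 int8 values plus one 2-byte fp16 scale, and the base model has
# roughly 1.54e9 parameters.
Q8_0_BLOCK_WEIGHTS = 32
Q8_0_BLOCK_BYTES = 32 + 2  # 32 int8 weights + one fp16 scale


def estimate_q8_0_bytes(n_params: int) -> int:
    """Return the approximate byte size of n_params weights stored as Q8_0."""
    n_blocks = -(-n_params // Q8_0_BLOCK_WEIGHTS)  # ceiling division
    return n_blocks * Q8_0_BLOCK_BYTES


if __name__ == "__main__":
    size = estimate_q8_0_bytes(1_540_000_000)
    print(f"{size / 1e9:.2f} GB, {size / 2**30:.2f} GiB")
```

Under these assumptions the estimate is about 1.64 GB in decimal units, or about 1.52 GiB. That is consistent with the stated ~1.53 GB if that figure is in binary units.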
@@ -95,10 +95,10 @@ reached **~0.48**.

 ### Stage 5 – Merge & Export (GGUF)
 LoRA adapters from Stage 1–3 were merged into the 1.5B base model using
+`merge_and_unload()`, then converted to GGUF Q8_0 format with llama.cpp.

 - **Script:** `scripts/export_gguf.py`
+- **Artifact:** `blitzkode.gguf` (~1.53 GB, git-ignored)

 ---

@@ -126,8 +126,8 @@ preprocessing notes, and per-sample license details.
 - **Algorithm assistance** – data structures and algorithms (LeetCode-style)
 - **Offline operation** – fully local, no internet required at inference time
 - **Fast CPU inference** – GGUF F16 runs on commodity CPUs
+- **API-first serving** – FastAPI backend with REST and SSE streaming endpoints
+- **Optimized local inference** – configurable llama.cpp GPU offload, mmap loading, batching, and prompt cache

 ---

@@ -141,12 +141,9 @@ git clone https://github.com/neuralbroker/blitzkode
 cd blitzkode
 pip install -r requirements.txt

+# Start the API server (place blitzkode.gguf in repo root first)
 python server.py
+curl http://localhost:7860/health
 ```

 ### Research: LoRA Adapter with PEFT
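
Reviewer note: the quick start now ends with a `curl` call against `/health` on port 7860. The same check can be scripted from Python using only the standard library, as in the sketch below. The helper name `check_health` is hypothetical. The diff does not show the endpoint's response format, so the body is returned as parsed JSON when possible and as raw text otherwise.

```python
import json
import urllib.error
import urllib.request


def check_health(base_url: str = "http://localhost:7860", timeout: float = 5.0) -> dict:
    """GET {base_url}/health and return a small status summary."""
    url = base_url.rstrip("/") + "/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            status = resp.status
    except (urllib.error.URLError, OSError) as exc:
        # Covers connection refused, timeouts, and HTTP error statuses.
        return {"ok": False, "url": url, "error": str(exc)}
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        payload = body  # response schema is not documented in the diff
    return {"ok": 200 <= status < 300, "url": url, "status": status, "payload": payload}


if __name__ == "__main__":
    print(check_health())
```

Returning a dictionary instead of raising keeps the helper usable in a startup script that polls until the server is ready.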
@@ -199,13 +196,29 @@ and well-documented code. Keep responses concise and practical.<|im_end|>

 ---

+## Evaluation
+
+Latest local GGUF smoke evaluation was run with `python scripts/evaluate_model.py` on CPU (`n_ctx=2048`, `threads=8`, `batch=256`, `gpu_layers=0`). Full machine-readable results are available in [`docs/evaluation_results.json`](docs/evaluation_results.json).
+
+| Eval case | Result | Notes |
+|---|---:|---|
+| Python factorial with negative-input handling | ✅ Pass | Correct iterative implementation with negative-input validation. |
+| Iterative binary search | ✅ Pass | Valid loop-based implementation returning index or `-1`. |
+| SQL top users by order count | ✅ Pass | Correct `JOIN`, `GROUP BY`, `ORDER BY`, and `LIMIT 5` structure. |
+| Unknown fictional API uncertainty | ❌ Fail | Raw model hallucinated a plausible signature; the FastAPI backend adds a guard for direct unknown-signature prompts. |
+
+Summary: **3 / 4 passed (75%)**. This is a lightweight heuristic regression smoke test, not a benchmark suite. Stronger future evaluation should include executable unit tests and larger coding benchmarks such as HumanEval/MBPP-style tasks.
+
+---
+
 ## Limitations

 - **Text-only input** – no image or file-upload support
+- **2 048-token default context** – CPU-friendly but limits long conversation history
 - **Verify all outputs** – always review and test generated code
 - **Small model** – 0.5B–1.5B scale; may produce incorrect code on complex tasks
+- **Raw model hallucination risk** – the API server includes guardrails, but direct GGUF prompting can still invent unsupported API details
+- **No real-time data** – knowledge cutoff follows the Qwen2.5 base model unless the optional research endpoint is used
 - **Math reasoning** – MetaMathQA transfer helps basic reasoning; not a math specialist

 ---
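
Reviewer note: the Evaluation section describes the run as a heuristic smoke test, and the SQL case is judged on the presence of `JOIN`, `GROUP BY`, `ORDER BY`, and `LIMIT 5`. The sketch below shows one way such a keyword check could look. It is not the contents of `scripts/evaluate_model.py`, which this diff does not include. The pattern list and both function names are hypothetical.

```python
import re

# Hypothetical case definition: every pattern must appear in the model output.
SQL_TOP_USERS_PATTERNS = [
    r"\bJOIN\b",
    r"\bGROUP\s+BY\b",
    r"\bORDER\s+BY\b",
    r"\bLIMIT\s+5\b",
]


def heuristic_pass(output: str, required_patterns: list[str]) -> bool:
    """Return True if every required regex matches the output (case-insensitive)."""
    return all(re.search(p, output, flags=re.IGNORECASE) for p in required_patterns)


def summarize(results: dict[str, bool]) -> str:
    """Format a pass-count summary such as '3 / 4 passed (75%)'."""
    passed = sum(results.values())
    total = len(results)
    pct = round(100 * passed / total) if total else 0
    return f"{passed} / {total} passed ({pct}%)"


if __name__ == "__main__":
    sample = (
        "SELECT u.name, COUNT(o.id) AS n FROM users u "
        "JOIN orders o ON o.user_id = u.id "
        "GROUP BY u.name ORDER BY n DESC LIMIT 5;"
    )
    print(heuristic_pass(sample, SQL_TOP_USERS_PATTERNS))
```

A check like this only confirms that the expected clauses are present. It cannot tell whether the query is correct, which is why the summary recommends executable unit tests for stronger evaluation.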
@@ -215,9 +228,12 @@ and well-documented code. Keep responses concise and practical.<|im_end|>
 | Variable | Default | Description |
 |---|---|---|
 | `BLITZKODE_GPU_LAYERS` | `0` | Number of layers to offload to GPU |
+| `BLITZKODE_THREADS` | system | CPU decode thread count |
+| `BLITZKODE_THREADS_BATCH` | system | CPU prompt-processing thread count |
 | `BLITZKODE_N_CTX` | `2048` | Context window size |
+| `BLITZKODE_BATCH` | `256` | llama.cpp prompt-processing batch size |
+| `BLITZKODE_UBATCH` | `128` | llama.cpp micro-batch size |
+| `BLITZKODE_PROMPT_CACHE` | `true` | Enable in-memory prompt cache when supported |
 | `BLITZKODE_PRELOAD_MODEL` | `false` | Load model at startup vs first request |

 ---
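
Reviewer note: the expanded variable table can be read into typed settings with a small loader, sketched below. The variable names and defaults are copied from the table. The function name `load_settings`, the returned key names, and the choice to represent the "system" default as `None` are assumptions. How `server.py` actually parses these variables is not shown in this diff.

```python
import os

# Defaults copied from the environment-variable table in the diff.
DEFAULTS = {
    "BLITZKODE_GPU_LAYERS": "0",
    "BLITZKODE_N_CTX": "2048",
    "BLITZKODE_BATCH": "256",
    "BLITZKODE_UBATCH": "128",
    "BLITZKODE_PROMPT_CACHE": "true",
    "BLITZKODE_PRELOAD_MODEL": "false",
}


def _as_bool(value: str) -> bool:
    """Interpret common truthy spellings; everything else is False."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


def _optional_int(env, name: str):
    """Return an int if the variable is set, else None ("system" default)."""
    raw = env.get(name)
    return int(raw) if raw not in (None, "") else None


def load_settings(env=None) -> dict:
    """Build a settings dict from environment variables (hypothetical helper)."""
    env = os.environ if env is None else env

    def get(name: str) -> str:
        return env.get(name, DEFAULTS[name])

    return {
        "gpu_layers": int(get("BLITZKODE_GPU_LAYERS")),
        "threads": _optional_int(env, "BLITZKODE_THREADS"),
        "threads_batch": _optional_int(env, "BLITZKODE_THREADS_BATCH"),
        "n_ctx": int(get("BLITZKODE_N_CTX")),
        "batch": int(get("BLITZKODE_BATCH")),
        "ubatch": int(get("BLITZKODE_UBATCH")),
        "prompt_cache": _as_bool(get("BLITZKODE_PROMPT_CACHE")),
        "preload_model": _as_bool(get("BLITZKODE_PRELOAD_MODEL")),
    }


if __name__ == "__main__":
    print(load_settings())
```

Passing a plain dictionary as `env` makes the loader easy to exercise in tests without touching the real process environment.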
@@ -228,8 +244,8 @@ and well-documented code. Keep responses concise and practical.<|im_end|>
 BlitzKode/
 server.py # FastAPI backend (inference + search)
 blitzkode.gguf # GGUF model artifact (~3 GB, git-ignored)
 scripts/
+evaluate_model.py # Lightweight GGUF evaluation harness
 train_sft.py # Stage 1: SFT training
 train_reward_sft.py # Stage 2: Reward-SFT
 train_dpo.py # Stage 3: DPO
@@ -237,6 +253,8 @@ BlitzKode/
 export_gguf.py # Merge & convert to GGUF
 push_to_hub.py # Push adapter to HuggingFace Hub
 build_full_dataset.py # Dataset builder (algorithmic + HF datasets)
+docs/
+evaluation_results.json # Latest smoke-eval output
 datasets/
 MANIFEST.md # Dataset provenance and license info
 checkpoints/