Instructions for using neuralbroker/blitzkode with libraries, notebooks, and local apps.

Libraries

How to use neuralbroker/blitzkode with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="neuralbroker/blitzkode",
	filename="blitzkode.gguf",
)

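# Ask a single question; the result is an OpenAI-style completion dict.
response = llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)
# Print only the assistant's reply text.
print(response["choices"][0]["message"]["content"])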
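
For longer answers it can be more convenient to stream tokens as they are generated. The sketch below assumes that `create_chat_completion(..., stream=True)` yields OpenAI-style chunk dictionaries whose text sits under `choices[0]["delta"]["content"]`. `collect_stream` is a hypothetical helper name used for illustration; it is not part of llama-cpp-python.

```python
from typing import Any, Dict, Iterable


def collect_stream(chunks: Iterable[Dict[str, Any]]) -> str:
    """Join the text fragments of streamed chat-completion chunks."""
    parts = []
    for chunk in chunks:
        choices = chunk.get("choices") or []
        if not choices:
            # Some chunks may carry no choices; skip them.
            continue
        delta = choices[0].get("delta", {})
        text = delta.get("content")
        if text:
            # Role-only or empty deltas are ignored.
            parts.append(text)
    return "".join(parts)
```

With the `llm` object created above, `collect_stream(llm.create_chat_completion(messages=[...], stream=True))` should return the full reply as one string. To show tokens live, print each fragment inside the loop instead of collecting it.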

Notebooks

Google Colab
Kaggle

Local Apps

llama.cpp

How to use neuralbroker/blitzkode with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf neuralbroker/blitzkode
# Run inference directly in the terminal:
llama-cli -hf neuralbroker/blitzkode

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf neuralbroker/blitzkode
# Run inference directly in the terminal:
llama-cli -hf neuralbroker/blitzkode

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf neuralbroker/blitzkode
# Run inference directly in the terminal:
./llama-cli -hf neuralbroker/blitzkode

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf neuralbroker/blitzkode
# Run inference directly in the terminal:
./build/bin/llama-cli -hf neuralbroker/blitzkode

Use Docker

docker model run hf.co/neuralbroker/blitzkode

LM Studio
Jan

vLLM

How to use neuralbroker/blitzkode with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "neuralbroker/blitzkode"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "neuralbroker/blitzkode",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
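
The same request can be made from Python with only the standard library. This is a minimal sketch, assuming the server exposes the OpenAI-compatible `/v1/chat/completions` route shown in the curl call; `build_chat_request` and `ask` are hypothetical helper names, not part of vLLM.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(base_url: str, model: str, prompt: str, timeout: float = 60.0) -> str:
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the vLLM server running, `ask("http://localhost:8000/v1", "neuralbroker/blitzkode", "What is the capital of France?")` would return the reply text. The base URL is a parameter so the same helper can target another OpenAI-compatible server on a different port.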

Use Docker

docker model run hf.co/neuralbroker/blitzkode

Ollama

How to use neuralbroker/blitzkode with Ollama:
```
ollama run hf.co/neuralbroker/blitzkode
```

Unsloth Studio

How to use neuralbroker/blitzkode with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for neuralbroker/blitzkode to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for neuralbroker/blitzkode to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for neuralbroker/blitzkode to start chatting

Pi

How to use neuralbroker/blitzkode with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf neuralbroker/blitzkode

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "neuralbroker/blitzkode"
        }
      ]
    }
  }
}
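
If `~/.pi/agent/models.json` already contains other providers, the entry can be merged in rather than pasted over the file. The following is a sketch only: it assumes the file holds the JSON structure shown above, and `add_llama_cpp_provider` and `update_models_file` are hypothetical helper names, not part of Pi.

```python
import json
from pathlib import Path
from typing import Any, Dict


def add_llama_cpp_provider(config: Dict[str, Any], model_id: str,
                           base_url: str = "http://localhost:8080/v1") -> Dict[str, Any]:
    """Return a copy of config with a llama-cpp provider entry for model_id."""
    merged = json.loads(json.dumps(config))  # deep copy via JSON round-trip
    providers = merged.setdefault("providers", {})
    provider = providers.setdefault("llama-cpp", {})
    # Field names and values mirror the snippet above.
    provider.update({"baseUrl": base_url, "api": "openai-completions", "apiKey": "none"})
    models = provider.setdefault("models", [])
    if not any(m.get("id") == model_id for m in models):
        models.append({"id": model_id})
    return merged


def update_models_file(path: Path, model_id: str) -> None:
    """Read path if it exists, merge the provider entry, and write it back."""
    config = json.loads(path.read_text()) if path.exists() else {}
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(add_llama_cpp_provider(config, model_id), indent=2) + "\n")
```

Calling `update_models_file(Path.home() / ".pi" / "agent" / "models.json", "neuralbroker/blitzkode")` would leave existing providers in place and add the model only once.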

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use neuralbroker/blitzkode with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf neuralbroker/blitzkode

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default neuralbroker/blitzkode

Run Hermes

hermes

Docker Model Runner

How to use neuralbroker/blitzkode with Docker Model Runner:
```
docker model run hf.co/neuralbroker/blitzkode
```

Lemonade

How to use neuralbroker/blitzkode with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull neuralbroker/blitzkode

Run and chat with the model

lemonade run user.blitzkode-{{QUANT_TAG}}

List all available models

lemonade list

neuralbroker committed 1 day ago

Commit d5a79fa (verified)

1 Parent(s): 06b342b

Update clean backend-only project docs and eval

Files changed (1)

README.md +51 -77


@@ -19,7 +19,7 @@ base_model:
 # BlitzKode
-BlitzKode is a local AI coding assistant powered by a fine-tuned Qwen2.5-1.5B-Instruct model. It runs entirely on your machine — no external API calls, no data leaving your device.
 ## Tech Stack
@@ -30,24 +30,21 @@ BlitzKode is a local AI coding assistant powered by a fine-tuned Qwen2.5-1.5B-In
 | Training | HuggingFace Transformers + TRL |
 | Inference | llama-cpp-python (GGUF Q8_0) |
 | Backend | Python 3.11+, FastAPI, uvicorn |
-| Frontend | React 18, Vite, Phosphor Icons |
 ## Features
-- **Local-first** — inference with the bundled GGUF, no cloud dependency
-- **Real-time streaming** — SSE token-by-token via `/generate/stream`
-- **Web research mode** — DuckDuckGo search → context-augmented generation via `/generate/research`
-- **Web search API** — standalone `/search/web` endpoint for raw results
-- **React chat UI** — streaming, conversation history, copy controls, research-mode toggle
-- **Multi-language** — Python, JavaScript, Java, C++, TypeScript, SQL
-- **API key auth + rate limiting** — production-ready security middleware
-- **Docker** — multi-stage production image with frontend baked in
 ## Prerequisites
 - Python 3.11+
-- Node.js 20+ (for frontend dev/builds only)
-- `blitzkode.gguf` at repo root (or set `BLITZKODE_MODEL_PATH`)
 - 4 GB+ RAM
 ## Quick Start
@@ -55,26 +52,9 @@ BlitzKode is a local AI coding assistant powered by a fine-tuned Qwen2.5-1.5B-In
 ```bash
 pip install -r requirements.txt
 python server.py
-# Open http://localhost:7860
-```
-## Frontend Development
-```bash
-cd frontend
-npm install
-npm run dev      # http://localhost:5173 — proxies /generate and /health to :7860
-```
-## Production Frontend Build
-```bash
-cd frontend && npm install && npm run build && cd ..
-python server.py
 ```
-FastAPI serves `frontend/dist/index.html` and `/assets/*` from the same port.
 ## Docker
 ```bash
@@ -104,7 +84,7 @@ curl -X POST http://localhost:7860/search/web \
   -H "Content-Type: application/json" \
   -d '{"query":"FastAPI dependency injection","max_results":3}'
-# Research-augmented generation (search → inject → answer)
 curl -X POST http://localhost:7860/generate/research \
   -H "Content-Type: application/json" \
   -d '{"prompt":"How do I use async generators in Python 3.12?","deep_search":true}'
@@ -130,7 +110,7 @@ curl http://localhost:7860/info
 ### Research (`/generate/research`)
-Same as above, plus:
 | Parameter | Type | Default | Description |
 |---|---|---|---|
@@ -151,100 +131,94 @@ Same as above, plus:
 | Variable | Default | Description |
 |---|---|---|
 | `BLITZKODE_MODEL_PATH` | `blitzkode.gguf` | GGUF model path |
-| `BLITZKODE_FRONTEND_PATH` | `frontend/dist/index.html` | Built frontend |
 | `BLITZKODE_HOST` | `0.0.0.0` | Server bind address |
 | `BLITZKODE_PORT` | `7860` | Server port |
-| `BLITZKODE_GPU_LAYERS` | `0` | GPU layers for llama.cpp |
 | `BLITZKODE_N_CTX` | `2048` | Context window |
-| `BLITZKODE_THREADS` | auto | CPU worker threads |
-| `BLITZKODE_BATCH` | `128` | Batch size |
 | `BLITZKODE_MAX_PROMPT_LENGTH` | `4000` | Max prompt chars |
 | `BLITZKODE_PRELOAD_MODEL` | `false` | Load model at startup |
 | `BLITZKODE_CORS_ORIGINS` | `http://localhost:7860` | CORS origins |
 | `BLITZKODE_API_KEY` | empty | Optional bearer token |
 | `BLITZKODE_WEB_SEARCH` | `true` | Enable web search endpoints |
-| `BLITZKODE_SEARCH_TIMEOUT` | `8` | Search HTTP timeout (s) |
 | `BLITZKODE_MAX_SEARCH_RESULTS` | `5` | Max search results |
 | `BLITZKODE_RATE_LIMIT` | `true` | Enable per-IP rate limiting |
 | `BLITZKODE_RATE_LIMIT_MAX` | `30` | Requests per IP per minute |
 | `BLITZKODE_MAX_REQUEST_BYTES` | `50000` | Request body size limit |
 ## Training Pipeline
 BlitzKode was fine-tuned through a staged pipeline on an RTX 4060 (8 GB VRAM):
 | Stage | Script | Details |
 |---|---|---|
-| SFT v1 | `train_sft.py` | LoRA r=32 on 24 curated coding examples |
 | Reward-SFT | `train_reward_sft.py` | Reward-heuristic continuation |
-| DPO | `train_dpo.py` | 10 chosen/rejected preference pairs |
-| SFT v2 | `train_available.py` | LoRA r=16, 100 steps, 99 samples (1.5B) |
 | Export | `export_production.py` | Merge → GGUF Q8_0 via llama.cpp |
 ### Re-train from scratch
 ```bash
 pip install -r requirements-training.txt
-# Build dataset
 python scripts/build_full_dataset.py
-# Train 1.5B LoRA (100 steps, ~5 min on RTX 4060)
 python scripts/train_available.py \
   --model Qwen/Qwen2.5-1.5B-Instruct \
   --quantization none \
   --dataset datasets/raw/blitzkode_full_training.json \
   --max-steps 100 --seq-len 384 --batch-size 1 --grad-accum 8
-# Export: merge + GGUF
 python scripts/export_production.py
 ```
-### Push to HuggingFace
-```bash
-export HF_TOKEN=hf_XXXX        # get from https://huggingface.co/settings/tokens
-python scripts/push_all_to_hub.py
-```
-This uploads:
-- `checkpoints/blitzkode-1.5b-lora/final` → `neuralbroker/blitzkode-1.5b-lora`
-- `checkpoints/available-lora-0.5b-full/final` → `neuralbroker/blitzkode-lora-0.5b`
-- `blitzkode.gguf` → `neuralbroker/blitzkode`
 ## Project Structure
 ```text
 BlitzKode/
   server.py                    FastAPI backend
   blitzkode.gguf               Local GGUF model (ignored by git)
-  frontend/                    React/Vite web UI
-    src/App.jsx                Chat UI with streaming + research toggle
-    src/index.css
-    vite.config.js
-  scripts/
-    train_available.py         Resource-aware LoRA training
-    build_full_dataset.py      Dataset builder
-    export_production.py       Merge LoRA → GGUF
-    push_to_hub.py             Single-adapter HF push
-    push_all_to_hub.py         Push all artifacts in one command
-    test_inference.py          Adapter smoke test
-    healthcheck.sh             Docker/Compose health probe
-  tests/test_server.py         20 backend endpoint tests (all passing)
   datasets/MANIFEST.md         Dataset provenance
-  docs/PROJECT_OVERVIEW.md     Architecture & roadmap
-  Dockerfile                   Multi-stage production image
   docker-compose.yml           CPU + GPU service definitions
   requirements.txt             Serving dependencies
-  requirements-training.txt    Training dependencies (pinned)
 ```
 ## CI
 ```bash
-python -m pytest tests/ -v          # 20 tests, all pass
-python -m ruff check .              # lint
-npm --prefix frontend run build     # frontend build
 ```
 ## License

 # BlitzKode
+BlitzKode is a local API-first AI coding assistant powered by a fine-tuned Qwen2.5-1.5B-Instruct model. It runs on your machine through `llama-cpp-python` with no external model API calls.
 ## Tech Stack
 | Training | HuggingFace Transformers + TRL |
 | Inference | llama-cpp-python (GGUF Q8_0) |
 | Backend | Python 3.11+, FastAPI, uvicorn |
 ## Features
+- **Local-first inference** with the bundled GGUF model
+- **FastAPI backend only** with `/generate`, `/generate/stream`, `/generate/research`, `/search/web`, `/health`, and `/info`
+- **Real-time streaming** via Server-Sent Events on `/generate/stream`
+- **Web research mode** using DuckDuckGo search context before generation
+- **API key auth, request-size limits, and rate limiting** for production use
+- **Backend/model optimizations**: mmap model loading, configurable GPU layer offload, batch/thread tuning, optional prompt cache, search-result TTL caching, and efficient deque-based rate limiting
+- **Docker** runtime image without Node.js/frontend build steps
 ## Prerequisites
 - Python 3.11+
+- `blitzkode.gguf` at repo root, or set `BLITZKODE_MODEL_PATH`
 - 4 GB+ RAM
 ## Quick Start
 ```bash
 pip install -r requirements.txt
 python server.py
+curl http://localhost:7860/health
 ```
 ## Docker
 ```bash
   -H "Content-Type: application/json" \
   -d '{"query":"FastAPI dependency injection","max_results":3}'
+# Research-augmented generation
 curl -X POST http://localhost:7860/generate/research \
   -H "Content-Type: application/json" \
   -d '{"prompt":"How do I use async generators in Python 3.12?","deep_search":true}'
 ### Research (`/generate/research`)
+Same as generation, plus:
 | Parameter | Type | Default | Description |
 |---|---|---|---|
 | Variable | Default | Description |
 |---|---|---|
 | `BLITZKODE_MODEL_PATH` | `blitzkode.gguf` | GGUF model path |
 | `BLITZKODE_HOST` | `0.0.0.0` | Server bind address |
 | `BLITZKODE_PORT` | `7860` | Server port |
+| `BLITZKODE_GPU_LAYERS` | `0` | GPU layers for llama.cpp; use `-1` to offload all supported layers |
 | `BLITZKODE_N_CTX` | `2048` | Context window |
+| `BLITZKODE_THREADS` | auto | CPU decode threads |
+| `BLITZKODE_THREADS_BATCH` | auto | CPU prompt-processing threads |
+| `BLITZKODE_BATCH` | `256` | Prompt-processing batch size |
+| `BLITZKODE_UBATCH` | `128` | llama.cpp micro-batch size |
+| `BLITZKODE_PROMPT_CACHE` | `true` | Enable llama.cpp in-memory prompt cache when supported |
+| `BLITZKODE_PROMPT_CACHE_BYTES` | `67108864` | Prompt cache capacity in bytes |
+| `BLITZKODE_USE_MMAP` | `true` | Memory-map the GGUF for faster startup and lower memory pressure |
+| `BLITZKODE_USE_MLOCK` | `false` | Try to lock model pages in RAM |
+| `BLITZKODE_OFFLOAD_KQV` | `true` | Offload K/Q/V operations when GPU layers are enabled |
 | `BLITZKODE_MAX_PROMPT_LENGTH` | `4000` | Max prompt chars |
 | `BLITZKODE_PRELOAD_MODEL` | `false` | Load model at startup |
 | `BLITZKODE_CORS_ORIGINS` | `http://localhost:7860` | CORS origins |
 | `BLITZKODE_API_KEY` | empty | Optional bearer token |
 | `BLITZKODE_WEB_SEARCH` | `true` | Enable web search endpoints |
+| `BLITZKODE_SEARCH_TIMEOUT` | `8` | Search HTTP timeout in seconds |
 | `BLITZKODE_MAX_SEARCH_RESULTS` | `5` | Max search results |
+| `BLITZKODE_SEARCH_CACHE_TTL` | `300` | Search result cache TTL in seconds |
 | `BLITZKODE_RATE_LIMIT` | `true` | Enable per-IP rate limiting |
 | `BLITZKODE_RATE_LIMIT_MAX` | `30` | Requests per IP per minute |
 | `BLITZKODE_MAX_REQUEST_BYTES` | `50000` | Request body size limit |
+## Model Evaluation
+Latest local GGUF evaluation: **2026-05-16** using `python scripts/evaluate_model.py` on CPU (`n_ctx=2048`, `threads=8`, `batch=256`, `gpu_layers=0`). Full machine-readable results are stored in `docs/evaluation_results.json`.
+| Eval case | Result | Notes |
+|---|---:|---|
+| Python factorial with negative-input handling | ✅ Pass | Generated a correct iterative implementation with `ValueError` for negative input. |
+| Iterative binary search | ✅ Pass | Generated a valid loop-based search returning index or `-1`. |
+| SQL top users by order count | ✅ Pass | Generated `JOIN`, `GROUP BY`, `ORDER BY`, and `LIMIT 5`. |
+| Unknown fictional API uncertainty | ❌ Fail | The raw model hallucinated a plausible signature for `imaginary_blitz_api`; the backend guard still blocks direct unknown-signature prompts on `/generate` and `/generate/stream`. |
+Summary: **3 / 4 passed (75%)**. Total generation time was **28.864 s** after a **0.312 s** model load. Evaluation-of-the-evaluation: this is a lightweight heuristic smoke eval, not a comprehensive benchmark; it is useful for regression tracking and quick sanity checks, but code should still be reviewed and tested. Future eval work should add executable unit tests for generated code and larger benchmark suites such as HumanEval/MBPP-style tasks.
 ## Training Pipeline
 BlitzKode was fine-tuned through a staged pipeline on an RTX 4060 (8 GB VRAM):
 | Stage | Script | Details |
 |---|---|---|
+| SFT v1 | `train_sft.py` | LoRA r=32 on curated coding examples |
 | Reward-SFT | `train_reward_sft.py` | Reward-heuristic continuation |
+| DPO | `train_dpo.py` | Chosen/rejected preference pairs |
+| SFT v2 | `train_available.py` | LoRA r=16 resource-aware training |
 | Export | `export_production.py` | Merge → GGUF Q8_0 via llama.cpp |
 ### Re-train from scratch
 ```bash
 pip install -r requirements-training.txt
 python scripts/build_full_dataset.py
 python scripts/train_available.py \
   --model Qwen/Qwen2.5-1.5B-Instruct \
   --quantization none \
   --dataset datasets/raw/blitzkode_full_training.json \
   --max-steps 100 --seq-len 384 --batch-size 1 --grad-accum 8
 python scripts/export_production.py
 ```
 ## Project Structure
 ```text
 BlitzKode/
   server.py                    FastAPI backend
   blitzkode.gguf               Local GGUF model (ignored by git)
+  scripts/                     Training, export, evaluation, and utility scripts
+  docs/evaluation_results.json Latest local model evaluation output
+  tests/test_server.py         Backend endpoint tests
   datasets/MANIFEST.md         Dataset provenance
+  docs/                        Architecture and production docs
+  Dockerfile                   Python runtime image
   docker-compose.yml           CPU + GPU service definitions
   requirements.txt             Serving dependencies
+  requirements-training.txt    Training dependencies
 ```
 ## CI
 ```bash
+python -m pytest tests/ -v
+python -m ruff check .
+python -m mypy server.py --ignore-missing-imports
+python scripts/evaluate_model.py
+docker build -t blitzkode:ci .
 ```
 ## License