Instructions to use neuralbroker/blitzkode with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use neuralbroker/blitzkode with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="neuralbroker/blitzkode",
	filename="blitzkode.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)

llama-cpp-python

How to use neuralbroker/blitzkode with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="neuralbroker/blitzkode",
	filename="blitzkode.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)

Notebooks
Google Colab
Kaggle
Local Apps

llama.cpp

How to use neuralbroker/blitzkode with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf neuralbroker/blitzkode
# Run inference directly in the terminal:
llama-cli -hf neuralbroker/blitzkode

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf neuralbroker/blitzkode
# Run inference directly in the terminal:
llama-cli -hf neuralbroker/blitzkode

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf neuralbroker/blitzkode
# Run inference directly in the terminal:
./llama-cli -hf neuralbroker/blitzkode

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf neuralbroker/blitzkode
# Run inference directly in the terminal:
./build/bin/llama-cli -hf neuralbroker/blitzkode

Use Docker

docker model run hf.co/neuralbroker/blitzkode

LM Studio
Jan

vLLM

How to use neuralbroker/blitzkode with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "neuralbroker/blitzkode"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "neuralbroker/blitzkode",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/neuralbroker/blitzkode

Ollama
How to use neuralbroker/blitzkode with Ollama:
```
ollama run hf.co/neuralbroker/blitzkode
```

Unsloth Studio new

How to use neuralbroker/blitzkode with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for neuralbroker/blitzkode to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for neuralbroker/blitzkode to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for neuralbroker/blitzkode to start chatting

Pi new

How to use neuralbroker/blitzkode with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf neuralbroker/blitzkode

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "neuralbroker/blitzkode"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent new

How to use neuralbroker/blitzkode with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf neuralbroker/blitzkode

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default neuralbroker/blitzkode

Run Hermes

hermes

Docker Model Runner
How to use neuralbroker/blitzkode with Docker Model Runner:
```
docker model run hf.co/neuralbroker/blitzkode
```

Lemonade

How to use neuralbroker/blitzkode with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull neuralbroker/blitzkode

Run and chat with the model

lemonade run user.blitzkode-{{QUANT_TAG}}

List all available models

lemonade list

blitzkode / docs /PROJECT_OVERVIEW.md

neuralbroker

Update clean backend-only project docs and eval

2a2cc3c verified 10 days ago

preview code

raw

history blame contribute delete

8.45 kB

BlitzKode Project Overview

What this project is

BlitzKode is a local coding-assistant application built around a fine-tuned Qwen2.5-1.5B GGUF model. The repository contains:

server.py: FastAPI backend that exposes API endpoints and performs llama.cpp inference.
tests/: backend endpoint tests with a stubbed llama.cpp model.
scripts/: dataset, training, checkpoint smoke-test, and GGUF export utilities.
blitzkode.gguf: local model artifact, intentionally ignored by git.

The runtime design is intentionally small and local-first: API client -> FastAPI -> llama-cpp-python -> local GGUF model.

Backend context

server.py is the central orchestration layer.

Request lifecycle

create_app() builds the FastAPI app and stores Settings, ModelService, and WebSearchService on app.state.
Middleware applies CORS, request-size limits, and optional per-IP rate limiting.
/generate and /generate/stream validate prompts, enforce optional bearer auth, and serialize model access with an asyncio.Lock.
ModelService lazily loads blitzkode.gguf through llama_cpp.Llama and formats requests with a Qwen-style ChatML prompt.
Streaming generation runs llama.cpp in a worker thread and bridges tokens into SSE messages through a queue.

Inference efficiency controls

The following environment variables control CPU/GPU usage:

BLITZKODE_GPU_LAYERS: offload layers to GPU when llama.cpp was built with GPU support.
BLITZKODE_THREADS: CPU decode thread count.
BLITZKODE_THREADS_BATCH: CPU prompt-processing thread count.
BLITZKODE_N_CTX: context window.
BLITZKODE_BATCH: llama.cpp prompt-processing batch size.
BLITZKODE_UBATCH: llama.cpp micro-batch size.
BLITZKODE_PROMPT_CACHE: in-memory prompt cache for repeated prefixes.
BLITZKODE_PRELOAD_MODEL: load model at startup instead of first request.

For CPU-only machines, keep BLITZKODE_GPU_LAYERS=0, set BLITZKODE_THREADS close to physical performance cores, and use a quantized GGUF such as Q4_K_M for lower memory pressure. For GPU machines, increase BLITZKODE_GPU_LAYERS gradually until VRAM is stable.

Search and research augmentation

The backend now has optional web search primitives:

POST /search/web: DuckDuckGo Instant Answer search.
POST /generate/research: searches, injects snippets into the prompt as untrusted context, and generates an answer.

API clients can use /generate/stream for normal token streaming or /generate/research when retrieval-augmented output is needed.

Important safety behavior:

Search context is marked as untrusted.
The system prompt tells the model not to invent APIs, citations, files, or execution results.
Research mode asks the model to cite URLs when it relies on search results.

Training pipeline context

The scripts directory currently represents several generations of training experimentation:

build_dataset.py and build_full_dataset.py: local synthetic/curated dataset builders.
train_sft.py: baseline LoRA SFT over curated coding examples.
train_reward_sft.py: reward-inspired continuation, but not true GRPO.
train_dpo.py: preference optimization over handcrafted chosen/rejected samples.
train_v2.py, train_continue.py, train_max.py: larger public-dataset/QLoRA experiments.
export_gguf.py: merge LoRA adapters and prepare for llama.cpp conversion.
test_inference.py: smoke-test adapter checkpoints.

The scripts are intentionally excluded from Ruff in pyproject.toml, so backend CI remains focused and fast. They should be treated as operator tools rather than library code.

Research-informed optimization notes

DeepSeek-R1 emphasizes staged post-training: cold-start reasoning data, RL with GRPO, rejection sampling, further SFT, and preference/RL alignment. DeepSeek-V3 emphasizes efficient architecture and training systems: MoE, MLA, FP8, multi-token prediction, and strong post-training/distillation.

For this repository and a 1.5B Qwen-family model, the practical subset is:

Use high-quality SFT data first; small models are very sensitive to noisy data.
Add preference data with DPO or ORPO to reduce low-quality or hallucinated answers.
If GRPO is available in the installed TRL version, use verifiable coding rewards on unit-tested tasks rather than subjective string heuristics.
Distill reasoning behavior from larger open models only where licensing permits, and store provenance for every dataset.
Use QLoRA / 4-bit loading for efficient GPU training.
Keep inference GGUF quantized for CPU/GPU-local deployment.

Do not indiscriminately scrape “all social/public internet” data. That creates legal, privacy, quality, and contamination issues. Prefer permissively licensed datasets and retain dataset cards/provenance.

Cleanup completed in this pass

Removed .mypy_cache/ generated cache.
Removed the malformed empty directory USERPROFILE.ollamamodels/.
Added tested web search and research-generation endpoints.
Removed the React/Vite frontend and made the project API-only.
Tightened request models for message roles and history size.
Hardened request-size middleware against invalid Content-Length.
Added tests for search, disabled search, and research prompt injection.
Updated README API/environment documentation.

Current verification commands

python -m pytest tests/ -v — 22 backend endpoint tests
python -m ruff check .
python -m mypy server.py --ignore-missing-imports

Run these after backend changes.

Production quality audit results (v2.1 — Q8_0 GGUF, 1.53 GB)

All tests ran on RTX 4060 Laptop GPU, using CPU inference, load time 0.35 s.

Task	Result	Notes
Floyd cycle detection	✅ Correct	O(n)/O(1) with explanation
Flatten nested list	✅ Correct	Recursive, clean
Kadane's max subarray	✅ Correct	With complexity analysis
Dijkstra shortest path	✅ Correct	Heap-based, full implementation
Fix zero-division bug	✅ Correct	Detected and fixed
Refactor to context manager	✅ Correct	One-liner
Decorator explanation	✅ Correct	2-sentence, accurate
JavaScript debounce	✅ Correct	With explanation
SQL top-5 users	✅ Correct	GROUP BY + ORDER BY + LIMIT
BFS traversal	✅ Correct	deque-based
TypeScript typed fetch	⚠️ Partial	Mixes RxJS/fetch patterns
Refactor global→class	⚠️ Weak	Didn't complete the OOP refactor
Non-existent API hallucination	❌ Hallucinated	Small model limitation

Root causes and mitigations:

TypeScript/OOP refactor gaps — training data was Python-heavy. Adding more TypeScript and OOP refactor examples to datasets/raw/available_training.jsonl and re-running train_available.py with --max-steps 200 would improve these.
Hallucination of fictional APIs — this is a fundamental 1.5B model limitation. The system prompt already instructs against it (Do not invent APIs…) but small models don't reliably follow system instructions they weren't trained on. Mitigations: add negative examples to DPO pairs, or use the /generate/research endpoint to ground answers in real web results.
n_ctx warning — n_ctx_per_seq (2048) < n_ctx_train (32768) is informational only. For most coding tasks 2048 tokens is sufficient. Increase BLITZKODE_N_CTX if you need longer code generation.

Next recommended work

Expand dataset — add TypeScript, OOP refactoring, and "I don't know" (grounding) examples to datasets/raw/available_training.jsonl, then re-run train_available.py --max-steps 300.
DPO for hallucination — add preference pairs where chosen = "I don't know, here's how to check" and rejected = <hallucinated answer>, run train_dpo.py.
API contract tests — add tests for OpenAPI schema stability and streaming client behavior.
Search cache tuning — tune BLITZKODE_SEARCH_CACHE_TTL based on expected research workload.
Context window — increase BLITZKODE_N_CTX=4096 in production for longer codebases, or 8192 with GPU offload.
GRPO reward training — use TRL GRPOTrainer with unit-test-verified coding rewards once a suitable test harness is available.