Tags: Text Generation, llama-cpp-python, GGUF, English, code-generation, coding-assistant, llama.cpp, qwen2.5, python, javascript, fine-tuned, conversational
Instructions for using neuralbroker/blitzkode with libraries, notebooks, and local apps.
- Libraries
- llama-cpp-python
How to use neuralbroker/blitzkode with llama-cpp-python:
```python
# !pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="neuralbroker/blitzkode",
    filename="blitzkode.gguf",
)

llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ]
)
```
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use neuralbroker/blitzkode with llama.cpp:
Install (macOS, Linux)
```sh
curl -LsSf https://llama.app/install.sh | sh
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf neuralbroker/blitzkode
# Run inference directly in the terminal:
llama cli -hf neuralbroker/blitzkode
```
Install from WinGet (Windows)
```powershell
winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf neuralbroker/blitzkode
# Run inference directly in the terminal:
llama cli -hf neuralbroker/blitzkode
```
Use pre-built binary
```sh
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf neuralbroker/blitzkode
# Run inference directly in the terminal:
./llama-cli -hf neuralbroker/blitzkode
```
Build from source code
```sh
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf neuralbroker/blitzkode
# Run inference directly in the terminal:
./build/bin/llama-cli -hf neuralbroker/blitzkode
```
Use Docker
docker model run hf.co/neuralbroker/blitzkode
- LM Studio
- Jan
- vLLM
How to use neuralbroker/blitzkode with vLLM:
Install from pip and serve model
```sh
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "neuralbroker/blitzkode"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "neuralbroker/blitzkode",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```
Use Docker
docker model run hf.co/neuralbroker/blitzkode
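The curl request in the vLLM section can also be sent from Python using only the standard library. This is a minimal sketch, assuming a server that implements the OpenAI-compatible `/v1/chat/completions` endpoint at `http://localhost:8000/v1` (the vLLM URL used above; the llama.cpp server configured for Pi and Hermes below listens on port 8080). The helpers `build_chat_request`, `extract_reply`, and `ask` are hypothetical names, not part of any library.

```python
import json
import urllib.request
from typing import Any

# Assumed endpoint: the vLLM URL from the curl example above.
# For a local llama.cpp server, use "http://localhost:8080/v1" instead.
BASE_URL = "http://localhost:8000/v1"
MODEL_ID = "neuralbroker/blitzkode"


def build_chat_request(prompt: str, base_url: str = BASE_URL, model: str = MODEL_ID) -> urllib.request.Request:
    """Build a POST request equivalent to the curl example (hypothetical helper)."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_json: dict[str, Any]) -> str:
    """Return the assistant text from an OpenAI-style chat completion response."""
    return response_json["choices"][0]["message"]["content"]


def ask(prompt: str, base_url: str = BASE_URL) -> str:
    """Send one chat request to a running server and return the reply text."""
    request = build_chat_request(prompt, base_url)
    with urllib.request.urlopen(request, timeout=120) as response:
        return extract_reply(json.load(response))
```

With `vllm serve` running, `print(ask("What is the capital of France?"))` should print the model's answer. Passing `base_url="http://localhost:8080/v1"` targets a llama.cpp server instead, assuming it accepts the same request body.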
- Ollama
How to use neuralbroker/blitzkode with Ollama:
ollama run hf.co/neuralbroker/blitzkode
- Unsloth Studio
How to use neuralbroker/blitzkode with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
```sh
curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for neuralbroker/blitzkode to start chatting
```
Install Unsloth Studio (Windows)
```powershell
irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for neuralbroker/blitzkode to start chatting
```
Using HuggingFace Spaces for Unsloth
```sh
# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for neuralbroker/blitzkode to start chatting
```
- Pi
How to use neuralbroker/blitzkode with Pi:
Start the llama.cpp server
```sh
# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf neuralbroker/blitzkode
```
Configure the model in Pi
```sh
# Install Pi:
npm install -g @mariozechner/pi-coding-agent
```
Add to ~/.pi/agent/models.json:
```json
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "neuralbroker/blitzkode" }
      ]
    }
  }
}
```
Run Pi
```sh
# Start Pi in your project directory:
pi
```
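Pasting the provider block into an existing `~/.pi/agent/models.json` by hand can overwrite providers that are already configured. The sketch below merges only the `llama-cpp` entry shown in the Pi configuration step and appends the BlitzKode model id if it is missing. It assumes the file's top-level shape is `{"providers": {...}}`, as in that example; `add_llama_cpp_provider` is a hypothetical helper, not part of Pi.

```python
import json
from pathlib import Path
from typing import Any

MODEL_ID = "neuralbroker/blitzkode"
# Connection settings copied from the models.json example above.
PROVIDER_SETTINGS = {
    "baseUrl": "http://localhost:8080/v1",
    "api": "openai-completions",
    "apiKey": "none",
}


def add_llama_cpp_provider(config_path: Path) -> dict[str, Any]:
    """Merge the llama-cpp provider into config_path, keeping other providers intact."""
    if config_path.exists():
        config = json.loads(config_path.read_text(encoding="utf-8"))
    else:
        config = {}
    provider = config.setdefault("providers", {}).setdefault("llama-cpp", {})
    provider.update(PROVIDER_SETTINGS)
    # Append the model only once, so re-running the helper is harmless.
    models = provider.setdefault("models", [])
    if not any(model.get("id") == MODEL_ID for model in models):
        models.append({"id": MODEL_ID})
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

Under these assumptions, a call like `add_llama_cpp_provider(Path.home() / ".pi" / "agent" / "models.json")` would apply the configuration before running `pi`.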
- Hermes Agent
How to use neuralbroker/blitzkode with Hermes Agent:
Start the llama.cpp server
```sh
# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf neuralbroker/blitzkode
```
Configure Hermes
```sh
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default neuralbroker/blitzkode
```
Run Hermes
hermes
- Atomic Chat
- Docker Model Runner
How to use neuralbroker/blitzkode with Docker Model Runner:
docker model run hf.co/neuralbroker/blitzkode
- Lemonade
How to use neuralbroker/blitzkode with Lemonade:
Pull the model
```sh
# Download Lemonade from https://lemonade-server.ai/
lemonade pull neuralbroker/blitzkode
```
Run and chat with the model
lemonade run user.blitzkode-{{QUANT_TAG}}
List all available models
lemonade list
```python
#!/usr/bin/env python3
"""Run a small deterministic BlitzKode GGUF evaluation.

This is intentionally lightweight: it verifies practical coding behavior on a
small, repeatable prompt set and writes machine-readable results. It is not a
replacement for a benchmark such as HumanEval, MBPP, or SWE-bench.
"""

from __future__ import annotations

import argparse
import json
import os
import re
import time
from collections.abc import Callable
from dataclasses import dataclass
from pathlib import Path
from typing import Any, cast

import llama_cpp

REPO_ROOT = Path(__file__).resolve().parents[1]
DEFAULT_MODEL_PATH = REPO_ROOT / "blitzkode.gguf"
DEFAULT_OUTPUT_PATH = REPO_ROOT / "docs" / "evaluation_results.json"
# Stop at the end of the assistant turn, or if the model starts a new user turn.
STOP_TOKENS = ["<|im_end|>", "<|im_start|>user"]

SYSTEM_PROMPT = (
    "<|im_start|>system\n"
    "You are BlitzKode, an AI coding assistant created by Sajad. "
    "Write clean, efficient, and practical code. If you do not know something, say so."
    "<|im_end|>"
)


@dataclass
class EvalCase:
    """One prompt plus the checks its response must pass."""

    name: str
    prompt: str
    checks: list[Callable[[str], bool]]
    max_tokens: int = 180


def contains_all(*needles: str) -> Callable[[str], bool]:
    """Return a check that passes when every needle appears (case-insensitive)."""

    def _check(text: str) -> bool:
        lowered = text.lower()
        return all(needle.lower() in lowered for needle in needles)

    return _check


def regex(pattern: str) -> Callable[[str], bool]:
    """Return a check that passes when the pattern matches anywhere in the text."""
    compiled = re.compile(pattern, re.IGNORECASE | re.DOTALL)

    def _check(text: str) -> bool:
        return compiled.search(text) is not None

    return _check


def build_prompt(user_prompt: str) -> str:
    """Build a ChatML prompt that ends with an open assistant turn."""
    return "\n".join(
        [
            SYSTEM_PROMPT,
            f"<|im_start|>user\n{user_prompt}<|im_end|>",
            "<|im_start|>assistant\n",
        ]
    )


def eval_cases() -> list[EvalCase]:
    return [
        EvalCase(
            name="python_factorial",
            prompt="Write a Python function named factorial that handles 0, positive integers, and rejects negative input.",
            checks=[contains_all("def factorial", "return"), regex(r"raise\s+ValueError|if\s+n\s*<\s*0")],
        ),
        EvalCase(
            name="binary_search",
            prompt="Implement iterative binary search in Python. Return the index or -1.",
            checks=[contains_all("def binary_search", "mid", "-1"), regex(r"while\s+\w+\s*<=\s*\w+"), regex(r"//\s*2")],
        ),
        EvalCase(
            name="sql_top_users",
            prompt="Write SQL to return the top 5 users by order count from users and orders tables.",
            checks=[contains_all("select", "join", "group by", "order by"), regex(r"limit\s+5|top\s+5")],
        ),
        EvalCase(
            name="unknown_api_uncertainty",
            prompt="What is the exact signature of imaginary_blitz_api()? If you are not sure, say you do not know.",
            checks=[regex(r"do not know|don't know|not sure|not have enough|cannot verify")],
            max_tokens=96,
        ),
    ]


def load_model(model_path: Path, n_ctx: int, n_threads: int, n_batch: int, n_gpu_layers: int) -> llama_cpp.Llama:
    """Load the GGUF file with a fixed seed so runs are repeatable."""
    return llama_cpp.Llama(
        model_path=str(model_path),
        n_ctx=n_ctx,
        n_threads=n_threads,
        n_batch=n_batch,
        n_gpu_layers=n_gpu_layers,
        use_mmap=True,
        use_mlock=False,
        verbose=False,
        seed=42,
    )


def run_case(llm: llama_cpp.Llama, case: EvalCase) -> dict[str, Any]:
    """Generate a response for one case and score it against its checks."""
    started = time.perf_counter()
    raw = cast(
        dict[str, Any],
        llm(
            build_prompt(case.prompt),
            max_tokens=case.max_tokens,
            temperature=0.0,  # greedy decoding for repeatable output
            top_p=0.95,
            top_k=20,
            repeat_penalty=1.05,
            stop=STOP_TOKENS,
        ),
    )
    elapsed = time.perf_counter() - started
    text = str(raw["choices"][0]["text"]).strip()
    check_results = [check(text) for check in case.checks]
    return {
        "name": case.name,
        "passed": all(check_results),
        "checks_passed": sum(check_results),
        "checks_total": len(check_results),
        "latency_seconds": round(elapsed, 3),
        "prompt": case.prompt,
        "response": text,
    }


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("--model", type=Path, default=Path(os.getenv("BLITZKODE_MODEL_PATH", DEFAULT_MODEL_PATH)))
    parser.add_argument("--output", type=Path, default=DEFAULT_OUTPUT_PATH)
    parser.add_argument("--ctx", type=int, default=int(os.getenv("BLITZKODE_N_CTX", "2048")))
    parser.add_argument("--threads", type=int, default=int(os.getenv("BLITZKODE_THREADS", str(max(1, min(8, os.cpu_count() or 1))))))
    parser.add_argument("--batch", type=int, default=int(os.getenv("BLITZKODE_BATCH", "256")))
    parser.add_argument("--gpu-layers", type=int, default=int(os.getenv("BLITZKODE_GPU_LAYERS", "0")))
    return parser.parse_args()


def main() -> None:
    args = parse_args()
    model_path = args.model.resolve()
    if not model_path.exists():
        raise SystemExit(f"Model file not found: {model_path}")

    started = time.perf_counter()
    llm = load_model(model_path, args.ctx, args.threads, args.batch, args.gpu_layers)
    load_seconds = time.perf_counter() - started

    cases = eval_cases()
    results = [run_case(llm, case) for case in cases]
    passed = sum(1 for result in results if result["passed"])
    total = len(results)
    total_latency = sum(float(result["latency_seconds"]) for result in results)

    payload = {
        "model_path": str(model_path),
        "load_seconds": round(load_seconds, 3),
        "settings": {
            "ctx": args.ctx,
            "threads": args.threads,
            "batch": args.batch,
            "gpu_layers": args.gpu_layers,
        },
        "summary": {
            "passed": passed,
            "total": total,
            "pass_rate": round(passed / total, 3),
            "total_generation_seconds": round(total_latency, 3),
        },
        "results": results,
    }

    args.output.parent.mkdir(parents=True, exist_ok=True)
    args.output.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    print(json.dumps(payload["summary"], indent=2))
    print(f"Wrote {args.output}")


if __name__ == "__main__":
    main()
```
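The evaluation script writes its full payload to `docs/evaluation_results.json` by default, but only prints the summary. A minimal sketch for inspecting that file afterwards, assuming the schema built in `main()` (a `summary` object plus a `results` list whose entries carry `name`, `passed`, `checks_passed`, and `checks_total`); `failed_cases` and `report` are hypothetical helpers, not part of the repository.

```python
import json
from pathlib import Path
from typing import Any


def failed_cases(payload: dict[str, Any]) -> list[str]:
    """Describe each case that missed at least one check, e.g. 'binary_search: 2/3 checks'."""
    return [
        f"{result['name']}: {result['checks_passed']}/{result['checks_total']} checks"
        for result in payload["results"]
        if not result["passed"]
    ]


def report(path: Path) -> str:
    """Summarize a results file written by the evaluation script."""
    payload = json.loads(path.read_text(encoding="utf-8"))
    summary = payload["summary"]
    lines = [f"passed {summary['passed']}/{summary['total']} (pass_rate={summary['pass_rate']})"]
    lines.extend(failed_cases(payload))
    return "\n".join(lines)
```

Because each result also stores the raw `response`, a failing case can be diagnosed by reading the model's output next to the check that rejected it.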