Kazi — Qwen3.6-27B Agentic SWE Fine-Tune (Q8_0 GGUF)
This is Kazi v1 — a QLoRA fine-tune of Qwen3.6-27B-Heretic2-Uncensored-Finetune-Thinking, distilled on 19,000 judge-filtered multi-turn agentic SWE traces from sterlixlol/kazi-agentic-traces-19k.
What it does differently
Measured shifts (50-prompt eval vs base)
| Metric | Base | Kazi (final) | Δ |
|---|---|---|---|
| Eval loss | 0.4762 | 0.4359 | -8.5% |
| Text-only responses | 22 / 50 (44%) | 5 / 50 (10%) | -77% |
| Avg completion tokens per turn | 569 | 131 | -77% |
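The Δ column is a relative change against the base value. A minimal sketch of that arithmetic, using only the numbers from the table (the function and variable names are illustrative, not taken from the eval harness):

```python
def relative_change(base: float, tuned: float) -> float:
    """Percent change from base to tuned; negative when the value dropped."""
    return (tuned - base) / base * 100.0


# (base, Kazi) pairs copied from the table above.
metrics = {
    "eval_loss": (0.4762, 0.4359),
    "text_only_responses": (22, 5),
    "avg_completion_tokens": (569, 131),
}

# Round to one decimal place so the values line up with the table.
deltas = {
    name: round(relative_change(base, tuned), 1)
    for name, (base, tuned) in metrics.items()
}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f}%")
```

The two -77% rows come from different ratios (17/22 and 438/569) that happen to round to the same figure.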
Behavioral upgrades distilled from the dataset
The training corpus is 19K judge-filtered traces from frontier models doing real SWE work. Anything that survived the judge had to be: (1) correct, (2) terse, (3) tool-grounded, (4) coherent across turns. The LoRA pulls the base model toward that joint distribution. In practice:
- Calls the tool instead of describing it — biggest single shift (the 77% drop in text-only replies)
- Reads before writing — looks at files / runs grep before producing edits, instead of guessing at API shapes
- Diff/patch-shaped edits — produces minimal scoped changes rather than full-file rewrites
- Multi-turn plan coherence — keeps plan state across 5–15 turn conversations without drifting
- Recovers from tool errors — when a command fails, debugs the actual error rather than retrying blind
- Convention-matching code — picks up existing style (naming, indent, framework idioms) from files it just read
- Shell + git fluency — find, grep, rg, git log, git diff, conditional pipelines — used like a human dev would
- Anti-yap discipline — no apology preambles, no "Certainly! I'd be happy to…", no recap-the-task openings, no recap-what-I-just-did closings
- Honest uncertainty — admits when it doesn't know rather than hallucinating function names or flags
- Refusal-free — base is uncensored; fine-tune doesn't add safety filters back
The first three are directly measured. The rest are properties of the source distribution that the LoRA was strong enough to absorb at r=32 / α=64 over 2 epochs.
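One way a "text-only response" rate could be counted is sketched below. It assumes assistant messages in the OpenAI chat-completions shape, where a tool invocation appears under a `tool_calls` key. The actual eval harness is not described in this card, so the helper names and sample messages are hypothetical.

```python
from typing import Any


def is_text_only(message: dict[str, Any]) -> bool:
    """True when an assistant message carries no tool calls."""
    return not message.get("tool_calls")


def text_only_rate(messages: list[dict[str, Any]]) -> float:
    """Fraction of assistant messages that answered in prose only."""
    if not messages:
        return 0.0
    return sum(is_text_only(m) for m in messages) / len(messages)


# Hypothetical sample: two tool calls and one prose-only reply.
sample = [
    {"role": "assistant", "content": None,
     "tool_calls": [{"type": "function",
                     "function": {"name": "run_shell",
                                  "arguments": "{\"command\": \"rg TODO\"}"}}]},
    {"role": "assistant", "content": "You could run grep to find it."},
    {"role": "assistant", "content": None,
     "tool_calls": [{"type": "function",
                     "function": {"name": "run_shell",
                                  "arguments": "{\"command\": \"git diff\"}"}}]},
]

rate = text_only_rate(sample)
print(f"text-only: {rate:.0%}")
```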
Use it
Easiest path — llama.cpp server:
hf download sterlixlol/kazi --local-dir .
llama-server \
-m kazi-final-q8_0.gguf \
-ngl -1 \
-c 262144 \
--jinja \
--temp 1.0 \
--parallel 1 \
--host 0.0.0.0 \
--port 8080
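Hardware needed: ~32 GB VRAM at full 262K context. A single A100 or L40S fits that. The 3090 and 4090 have 24 GB each, less than the 27 GB GGUF, so a single card of that class needs partial CPU offload (a lower -ngl value). A 4×3090 layer split also works.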
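Once the server is up, it can be called through its OpenAI-compatible chat endpoint. The sketch below builds such a request with the standard library only. The `run_shell` tool is hypothetical (this card does not define a tool set), and the `model` value is a placeholder, on the assumption that llama-server answers with the single loaded model whatever name is sent.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # matches --port 8080 in the command above

# Hypothetical tool definition, for illustration only.
RUN_SHELL_TOOL = {
    "type": "function",
    "function": {
        "name": "run_shell",
        "description": "Run a shell command in the project directory.",
        "parameters": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
}


def build_request(prompt: str) -> urllib.request.Request:
    """Build a chat-completions POST that offers the model one tool."""
    payload = {
        "model": "kazi",  # placeholder name (assumption, see lead-in)
        "messages": [{"role": "user", "content": prompt}],
        "tools": [RUN_SHELL_TOOL],
        "temperature": 1.0,  # same value as --temp in the server command
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> dict:
    """Send the request and decode the JSON reply; needs a running server."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        return json.load(resp)
```

With the server running, `chat("List the Python files in this repo")` returns the decoded response, and any tool call should appear under `choices[0]["message"]["tool_calls"]`.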
Training recipe
| Setting | Value |
|---|---|
| Base | DavidAU/Qwen3.6-27B-Heretic2-Uncensored-Finetune-Thinking |
| Method | QLoRA 4-bit (Unsloth + bnb) |
| Rank | 32 |
| Alpha | 64 |
| Target modules | All 7 (q,k,v,o + gate,up,down) |
| LR | 2e-4, cosine schedule |
| Epochs | 2 |
| Hardware | 1× A100 40GB SXM4 (Vast.ai) |
| Total cost | $63 |
| Best checkpoint | step 2176 (final), eval loss 0.4359 |
| Kernels | flash-linear-attention + causal-conv1d (3.7× speedup on Gated DeltaNet) |
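The rank and alpha rows together fix how strongly the adapter update is weighted. The sketch below writes the recipe down as a plain data structure and derives that factor, assuming standard LoRA scaling (alpha / rank) rather than a rank-stabilized variant. The `*_proj` module names are the usual Hugging Face names for these layers and are an assumption here; the card only gives the short forms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraRecipe:
    rank: int
    alpha: int
    target_modules: tuple[str, ...]
    learning_rate: float
    epochs: int

    @property
    def scaling(self) -> float:
        """Standard LoRA multiplies the low-rank update by alpha / rank."""
        return self.alpha / self.rank


recipe = LoraRecipe(
    rank=32,
    alpha=64,
    # Assumed full names for "q,k,v,o + gate,up,down".
    target_modules=(
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ),
    learning_rate=2e-4,
    epochs=2,
)

print(f"modules: {len(recipe.target_modules)}, scaling: {recipe.scaling}")
```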
Eval loss progression across saved checkpoints:
0.4762 → 0.4576 → 0.4461 → 0.4524 → 0.4445 → 0.4383 → 0.4364 → 0.4359
base step290 step580 step1160 ... final
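The progression is not strictly monotonic. A short sketch that locates the best checkpoint and any step where the loss went up, using only the eight values listed above (indices are positions in that list, since not every step label is given):

```python
# Eval losses in the order listed above; index 0 is the base model.
losses = [0.4762, 0.4576, 0.4461, 0.4524, 0.4445, 0.4383, 0.4364, 0.4359]

# Position of the lowest eval loss.
best_index = min(range(len(losses)), key=losses.__getitem__)

# Positions where the loss rose relative to the previous checkpoint.
regressions = [i for i in range(1, len(losses)) if losses[i] > losses[i - 1]]

# Overall improvement from base to final, in percent.
total_drop_pct = (losses[0] - losses[-1]) / losses[0] * 100

print(f"best checkpoint index: {best_index}")
print(f"regressions at indices: {regressions}")
print(f"total drop: {total_drop_pct:.1f}%")
```

The single regression is the fourth value (0.4524, up from 0.4461). Every later checkpoint improves on the one before it, which is consistent with the final step being the best checkpoint.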
Dataset
sterlixlol/kazi-agentic-traces-19k — 19K multi-turn agentic SWE traces, judge-filtered from a larger pool of completions across multiple frontier LLMs.
Files
kazi-final-q8_0.gguf — 27 GB, Q8_0 quantization (8.50 BPW). The merged weights, what you actually want.
Limitations
- Base is Heretic2-Uncensored — refusal rate is ~0%. Treat it like an unrestricted dev tool, not a content filter.
- Tuned for English. Multilingual ability is inherited from the base model and was not tested after fine-tuning.
- No vision tower — text only.
- Not a chatbot for casual chitchat. It will try to call tools or write code.
License
Apache 2.0, inherited from Qwen3 base. Use freely.