Instructions for using sterlixlol/kazi with libraries and local apps.

Libraries

How to use sterlixlol/kazi with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="sterlixlol/kazi",
	filename="kazi-final-q8_0.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)

Local Apps

llama.cpp

How to use sterlixlol/kazi with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf sterlixlol/kazi:Q8_0
# Run inference directly in the terminal:
llama-cli -hf sterlixlol/kazi:Q8_0

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf sterlixlol/kazi:Q8_0
# Run inference directly in the terminal:
llama-cli -hf sterlixlol/kazi:Q8_0

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf sterlixlol/kazi:Q8_0
# Run inference directly in the terminal:
./llama-cli -hf sterlixlol/kazi:Q8_0

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf sterlixlol/kazi:Q8_0
# Run inference directly in the terminal:
./build/bin/llama-cli -hf sterlixlol/kazi:Q8_0

Use Docker

docker model run hf.co/sterlixlol/kazi:Q8_0

vLLM

How to use sterlixlol/kazi with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "sterlixlol/kazi"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "sterlixlol/kazi",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Ollama
How to use sterlixlol/kazi with Ollama:
```
ollama run hf.co/sterlixlol/kazi:Q8_0
```

Unsloth Studio

How to use sterlixlol/kazi with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for sterlixlol/kazi to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for sterlixlol/kazi to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for sterlixlol/kazi to start chatting

Pi

How to use sterlixlol/kazi with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf sterlixlol/kazi:Q8_0

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "sterlixlol/kazi:Q8_0"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use sterlixlol/kazi with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf sterlixlol/kazi:Q8_0

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default sterlixlol/kazi:Q8_0

Run Hermes

hermes

Docker Model Runner
How to use sterlixlol/kazi with Docker Model Runner:
```
docker model run hf.co/sterlixlol/kazi:Q8_0
```

Lemonade

How to use sterlixlol/kazi with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull sterlixlol/kazi:Q8_0

Run and chat with the model

lemonade run user.kazi-Q8_0

List all available models

lemonade list

Kazi — Qwen3.6-27B Agentic SWE Fine-Tune (Q8_0 GGUF)

This is Kazi v1 — a QLoRA fine-tune of Qwen3.6-27B-Heretic2-Uncensored-Finetune-Thinking, distilled on 19,000 judge-filtered multi-turn agentic SWE traces from sterlixlol/kazi-agentic-traces-19k.

What it does differently

Measured shifts (50-prompt eval vs base)

Metric	Base	Kazi (final)	Δ
Eval loss	0.4762	0.4359	-8.5%
Text-only responses	22 / 50 (44%)	5 / 50 (10%)	-77%
Avg completion tokens per turn	569	131	-77%
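
The Δ column is the relative change from base to fine-tune. A minimal Python check of that arithmetic, using only the numbers in the table (the function name `relative_change` is illustrative):

```python
def relative_change(base: float, tuned: float) -> float:
    """Return the change from base to tuned as a percentage of base."""
    return (tuned - base) / base * 100.0


# Values copied from the table above.
metrics = {
    "Eval loss": (0.4762, 0.4359),
    "Text-only responses (of 50)": (22, 5),
    "Avg completion tokens per turn": (569, 131),
}

for name, (base, tuned) in metrics.items():
    print(f"{name}: {relative_change(base, tuned):+.1f}%")
# Eval loss: -8.5%
# Text-only responses (of 50): -77.3%
# Avg completion tokens per turn: -77.0%
```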

Behavioral upgrades distilled from the dataset

The training corpus is 19K judge-filtered traces from frontier models doing real SWE work. Anything that survived the judge had to be: (1) correct, (2) terse, (3) tool-grounded, (4) coherent across turns. The LoRA pulls the base model toward that joint distribution. In practice:

Calls the tool instead of describing it — biggest single shift (the 77% drop in text-only replies)
Reads before writing — looks at files / runs grep before producing edits, instead of guessing at API shapes
Diff/patch-shaped edits — produces minimal scoped changes rather than full-file rewrites
Multi-turn plan coherence — keeps plan state across 5–15 turn conversations without drifting
Recovers from tool errors — when a command fails, debugs the actual error rather than retrying blind
Convention-matching code — picks up existing style (naming, indent, framework idioms) from files it just read
Shell + git fluency — find, grep, rg, git log, git diff, conditional pipelines — used like a human dev would
Anti-yap discipline — no apology preambles, no "Certainly! I'd be happy to…", no recap-the-task openings, no recap-what-I-just-did closings
Honest uncertainty — admits when it doesn't know rather than hallucinating function names or flags
Refusal-free — base is uncensored; fine-tune doesn't add safety filters back

The first three are directly measured. The rest are properties of the source distribution that the LoRA was strong enough to absorb at r=32 / α=64 over 2 epochs.

Use it

Easiest path — llama.cpp server:

hf download sterlixlol/kazi --local-dir .

llama-server \
    -m kazi-final-q8_0.gguf \
    -ngl -1 \
    -c 262144 \
    --jinja \
    --temp 1.0 \
    --parallel 1 \
    --host 0.0.0.0 \
    --port 8080

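Hardware needed: ~32 GB VRAM at full 262K context. A single A100 or L40S holds the whole model. 24 GB cards such as the 3090/4090 cannot fit the 27 GB of weights on their own, so they need partial CPU offload or a multi-GPU layer split (4×3090 works).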
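
Once the server is up, any OpenAI-compatible client can talk to it. The sketch below uses only the Python standard library and assumes the llama-server command above is listening on localhost:8080. The `run_shell` tool, the helper names, and the `"kazi"` model label are hypothetical and only illustrate the request shape. Because the model is tuned to call tools, a reply may carry `tool_calls` instead of text, so the parser returns both.

```python
import json
import urllib.request

# Assumption: llama-server from the command above is reachable here.
BASE_URL = "http://localhost:8080/v1"

# Hypothetical tool for illustration; swap in the tools your agent exposes.
RUN_SHELL_TOOL = {
    "type": "function",
    "function": {
        "name": "run_shell",
        "description": "Run a shell command in the project directory.",
        "parameters": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
}


def build_payload(prompt: str) -> dict:
    """Build an OpenAI-style chat completion request body with one tool."""
    return {
        "model": "kazi",  # hypothetical label; assumes a single-model server
        "messages": [{"role": "user", "content": prompt}],
        "tools": [RUN_SHELL_TOOL],
    }


def parse_reply(response: dict) -> tuple:
    """Return (text, tool_calls) from a chat completion response."""
    message = response["choices"][0]["message"]
    return message.get("content"), message.get("tool_calls") or []


def chat(prompt: str) -> tuple:
    """Send one user turn to the local server and parse the reply."""
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as resp:
        return parse_reply(json.load(resp))


# Usage (requires the server to be running):
#   text, tool_calls = chat("Find every TODO comment in this repository.")
#   for call in tool_calls:
#       print(call["function"]["name"], call["function"]["arguments"])
```

Your agent loop is responsible for executing each returned tool call and sending the result back as a `tool` message on the next turn.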

Training recipe

Setting	Value
Base	`DavidAU/Qwen3.6-27B-Heretic2-Uncensored-Finetune-Thinking`
Method	QLoRA 4-bit (Unsloth + bnb)
Rank	32
Alpha	64
Target modules	All 7 (q,k,v,o + gate,up,down)
LR	2e-4, cosine schedule
Epochs	2
Hardware	1× A100 40GB SXM4 (Vast.ai)
Total cost	$63
Best checkpoint	step 2176 (final), eval loss 0.4359
Kernels	flash-linear-attention + causal-conv1d (3.7× speedup on Gated DeltaNet)
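
For readers less familiar with LoRA, the rank and alpha in the table fix two quantities: the scaling applied to the adapter update (alpha / r, assuming the standard, non-rank-stabilized formulation) and the number of trainable parameters each adapted projection adds (r × (d_in + d_out)). A small sketch of that arithmetic follows. The 4096 × 4096 projection shape is a made-up example, not the real Qwen3.6-27B dimension.

```python
RANK = 32    # from the recipe table
ALPHA = 64   # from the recipe table


def lora_scale(rank: int, alpha: int) -> float:
    """Scaling factor applied to the low-rank update B @ A in standard LoRA."""
    return alpha / rank


def adapter_params(rank: int, d_in: int, d_out: int) -> int:
    """Trainable parameters for one adapted layer: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


print(lora_scale(RANK, ALPHA))            # 2.0
# Hypothetical 4096 x 4096 projection, for illustration only.
print(adapter_params(RANK, 4096, 4096))   # 262144
```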

Eval loss progression across saved checkpoints:

0.4762 → 0.4576 → 0.4461 → 0.4524 → 0.4445 → 0.4383 → 0.4364 → 0.4359
 base    step290  step580  step1160  ...                          final
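
The progression is not strictly monotonic: the fourth value ticks up before the loss resumes falling. A short check over the eight values listed above (indexed by position, since only some checkpoints are labeled) confirms that the final checkpoint is the minimum:

```python
# Eval losses in the order listed above; index 0 is the base model.
losses = [0.4762, 0.4576, 0.4461, 0.4524, 0.4445, 0.4383, 0.4364, 0.4359]

# Position of the lowest eval loss.
best_index = min(range(len(losses)), key=losses.__getitem__)
# Positions where the loss went up relative to the previous entry.
regressions = [i for i in range(1, len(losses)) if losses[i] > losses[i - 1]]

print(best_index)    # 7, the last (final) entry
print(regressions)   # [3]
```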

Dataset

sterlixlol/kazi-agentic-traces-19k — 19K multi-turn agentic SWE traces, judge-filtered from a larger pool of completions across multiple frontier LLMs.

Files

kazi-final-q8_0.gguf — 27 GB, Q8_0 quantization (8.50 BPW). The merged weights, what you actually want.
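
The file size follows from the parameter count and the bits per weight. A rough check, assuming a nominal 27 billion parameters (the exact count is not stated here) and ignoring metadata overhead:

```python
def gguf_size_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in bytes."""
    return n_params * bits_per_weight / 8


size = gguf_size_bytes(27e9, 8.50)   # nominal parameter count, 8.50 BPW from above
print(f"{size / 1e9:.1f} GB")        # 28.7 GB (decimal units)
print(f"{size / 2**30:.1f} GiB")     # 26.7 GiB (binary units)
```

The estimate lands close to the listed 27 GB when read in binary units, though that reading is an inference rather than something the file listing states.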

Limitations

Base is Heretic2-Uncensored — refusal rate is ~0%. Treat it like an unrestricted dev tool, not a content filter.
Tuned for English. Multilingual ability is inherited from the base model and has not been tested after fine-tuning.
No vision tower — text only.
Not a chatbot for casual chitchat. It will try to call tools or write code.

License

Apache 2.0, inherited from the Qwen3.6 base. Use freely.

Downloads last month: 86

Format: GGUF
Model size: 27B params
Architecture: qwen35
Hardware compatibility: 8-bit

Model tree for sterlixlol/kazi

Base model: trohrbaugh/Qwen3.6-27B-heretic-ara
Finetuned: DavidAU/Qwen3.6-27B-Heretic2-Uncensored-Finetune-Thinking
Quantized (11): this model