How to use from
llama.cpp
Install from brew
brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf evalengine/unbound-q-0.8b-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf evalengine/unbound-q-0.8b-GGUF:Q4_K_M
Install from WinGet (Windows)
winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf evalengine/unbound-q-0.8b-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf evalengine/unbound-q-0.8b-GGUF:Q4_K_M
Use pre-built binary
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf evalengine/unbound-q-0.8b-GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf evalengine/unbound-q-0.8b-GGUF:Q4_K_M
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf evalengine/unbound-q-0.8b-GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf evalengine/unbound-q-0.8b-GGUF:Q4_K_M
Use Docker
docker model run hf.co/evalengine/unbound-q-0.8b-GGUF:Q4_K_M
Quick Links

Unbound Q-0.8B GGUF — because there is no boundary

No guarantee — use at your own risk. Reduced safety filtering; can produce harmful or false output. This is the smallest Unbound build — substantially less reliable than E2B/E4B. See the benchmark on the main card.

GGUF quants of evalengine/unbound-q-0.8b for Ollama, llama.cpp, and LM Studio. Built by Chromia & Eval Engine.

Available quants

Single-file GGUFs (no shards — small enough at this size class).

Quant Size Notes
Q4_K_M 530 MB Recommended default — phone-deployable
bf16 1.5 GB Full precision; reference quality

Run

# llama.cpp
./llama-cli -m unbound-q-0.8b-Q4_K_M.gguf \
  --jinja -ngl 99 \
  --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.0

Sampling matches Qwen3.5's non-thinking preset (see main card for details). For factual / brand questions drop --temp to ~0.3–0.5.

Headline benchmark

(See main card for the full table — corrected single-judge mimo-v2-pro numbers.)

refusal useful_compl. hallucination SimpleQA correct KL vs base
Unbound Q-0.8B 5.00% 6.35% 35.77% 1.50% 0.605

Refusal collapses 91% → 5% (−86 pts); KL ~5× cleaner than the larger Unbound E2B/E4B; useful-compliance and hallucination are materially worse than E2B/E4B — Q-0.8B is not a peer of the larger Unbound builds on quality. It is the on-phone footprint pick: ~530 MB vs 1.5/3.4 GB.

Acknowledgements

Fine-tuned with Unsloth + HF TRL. Abliteration via heretic. Compliance training data distilled from AEON and audited row-by-row; 48 major-fabrication rows decontaminated before this build.

Links

License

Apache-2.0, inherited from Qwen/Qwen3.5-0.8B.

Downloads last month
338
GGUF
Model size
0.8B params
Architecture
qwen35
Hardware compatibility
Log In to add your hardware

4-bit

16-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for evalengine/unbound-q-0.8b-GGUF

Quantized
(1)
this model

Collection including evalengine/unbound-q-0.8b-GGUF