Use from the llama-cpp-python library
# !pip install llama-cpp-python

from llama_cpp import Llama

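# filename accepts a glob pattern; "*-Q4_K.gguf" matches the main GGUF described under Files
llm = Llama.from_pretrained(
	repo_id="zenlm/zen-5-gguf",
	filename="*-Q4_K.gguf",
)
# Assumption: image input likely also requires a multimodal chat handler loaded with
# mmproj-model-f16.gguf; without one the image part may be ignored.
llm.create_chat_completion(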
	messages = [
		{
			"role": "user",
			"content": [
				{
					"type": "text",
					"text": "Describe this image in one sentence."
				},
				{
					"type": "image_url",
					"image_url": {
						"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
					}
				}
			]
		}
	]
)

Zen5

Canonical default of the Zen5 family. Multimodal sparse MoE (image + text in → text out) with 35B total / 3B active parameters per token, 256K context. The everyday Zen5 model — agentic-trained, fast at scale, frontier-quality vision-language reasoning at a 3B-active compute budget.

Part of the canonical Zen5 ladder:

| SKU | Hardware fit | Repo |
|---|---|---|
| zen5-flash | anything (4 GB VRAM) | zen-5-flash-gguf |
| zen5-mini | 32 GB | zen-5-mini-gguf |
| zen5 (default) | 24 GB+ VRAM (Q4_K) | ← you are here |
| zen5-pro | Mac M4 Max / DGX Spark / H100 80GB | zen-5-pro-gguf |
| zen5-max | Mac Studio M3 Ultra 512GB / 8x H100 | zen-5-max-gguf |
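
The hardware fit for the default SKU can be sanity-checked with rough arithmetic. The sketch below assumes about 4.5 bits per weight for Q4_K (an approximation; real files mix tensor types and carry metadata) and ignores the KV cache and the vision projector, so the result is a lower bound rather than a measured file size.

```python
def approx_weight_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes) for a quantized model."""
    return total_params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 35e9    # total parameters, from the model description
ACTIVE_PARAMS = 3e9    # parameters active per token, from the model description
Q4_K_BPW = 4.5         # assumption: approximate bits per weight for Q4_K

weights_gb = approx_weight_gb(TOTAL_PARAMS, Q4_K_BPW)
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

print(f"~{weights_gb:.1f} GB of weights, "
      f"{active_fraction:.1%} of parameters active per token")
```

Under these assumptions the weights come to roughly 19.7 GB, which is consistent with the 24 GB+ guidance once cache and runtime overhead are added.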

Files

| File | Format |
|---|---|
| main GGUF (*-Q4_K.gguf) | GGUF Q4_K (text + vision), refusal-orthogonalized |
| mmproj-model-f16.gguf | multimodal vision projector; load alongside the main GGUF for image input |
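
Because the main GGUF is identified by a pattern rather than an exact name, a loader has to pick it out of the downloaded files. A minimal sketch of that selection, assuming a flat list of file names; `pick_files` is a hypothetical helper, not part of any library, and the example file name is invented for illustration.

```python
from fnmatch import fnmatch

MAIN_PATTERN = "*-Q4_K.gguf"            # pattern given for the main GGUF
MMPROJ_NAME = "mmproj-model-f16.gguf"   # vision projector file name


def pick_files(names):
    """Return (main_gguf, mmproj_or_None) from a list of file names."""
    mains = sorted(n for n in names if fnmatch(n, MAIN_PATTERN))
    if not mains:
        raise FileNotFoundError(f"no file matching {MAIN_PATTERN}")
    # The projector is optional: text-only use needs just the main GGUF.
    mmproj = MMPROJ_NAME if MMPROJ_NAME in names else None
    return mains[0], mmproj


# Hypothetical listing; the real main file name is not stated in this card.
example = ["README.md", "example-Q4_K.gguf", "mmproj-model-f16.gguf"]
print(pick_files(example))
```

Returning `None` for a missing projector lets the caller fall back to text-only chat instead of failing.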

Run

Hosted via the Hanzo gateway (api.hanzo.ai) as zen5.

Local with llama.cpp (CLI / server) or zen5-engine:

hf download zenlm/zen-5-gguf --local-dir gguf
MAIN=$(ls gguf/*-Q4_K.gguf | head -1)

# text-only chat
llama-cli -m "$MAIN" -p "Explain MoE inference."

# vision-language (image input)
llama-cli -m "$MAIN" \
          --mmproj gguf/mmproj-model-f16.gguf \
          --image path/to/screenshot.png \
          -p "Describe this UI and propose a fix."
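
For the server route, llama.cpp's `llama-server` exposes an OpenAI-compatible `/v1/chat/completions` endpoint and, when started with `--mmproj`, accepts images as `image_url` content parts. The sketch below only builds the request body. The `zen5` model label, the local address in the comment, and the helper name are assumptions for illustration; the image is embedded as a base64 data URI.

```python
import base64
import json


def build_vision_request(image_bytes: bytes, prompt: str,
                         mime: str = "image/png") -> dict:
    """Build an OpenAI-style chat request with one text part and one image part."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    data_uri = f"data:{mime};base64,{encoded}"
    return {
        "model": "zen5",  # assumption: label only, not a confirmed server setting
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": data_uri}},
            ],
        }],
    }


# Placeholder bytes stand in for a real screenshot read from disk.
payload = build_vision_request(b"\x89PNG placeholder bytes",
                               "Describe this UI and propose a fix.")
body = json.dumps(payload).encode("utf-8")
# To send: POST `body` with Content-Type: application/json to
# http://localhost:8080/v1/chat/completions (assumed local server address).
```

Keeping payload construction separate from the HTTP call makes the request shape easy to inspect before any server is running.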

Acknowledgements

Built on Qwen/Qwen3.6-35B-A3B (Apache-2.0, multimodal MoE). Abliterated GGUF variant + MTP draft-token support by huihui-ai. Mirrored here for the Zen5 canonical distribution. Native FP8 weights are also available upstream at Qwen/Qwen3.6-35B-A3B-FP8 for higher-precision inference on H100/H200.

Model size: 36B params (GGUF). Architecture: qwen35moe.