Zen5 Pro

The high-quality Zen5 tier that fits on a single 128 GB unified-memory machine. It is a sparse MoE model with 284B total parameters (37B active per token) and a 1M-token context. Quantization is asymmetric: routed experts use IQ2_XXS for the up/gate projections and Q2_K for the down projection, while the shared experts, attention projections, routing logits, and LM head are left at higher precision.

Runs with the zen5-engine on a single 128 GB Apple Silicon machine (M3/M4 Max), a DGX Spark (GB10), or an H100 80 GB.

Part of the canonical Zen5 ladder:

| SKU | Hardware fit | Repo |
| --- | --- | --- |
| zen5-flash | anything | zen-5-flash-gguf |
| zen5-mini | 32 GB | zen-5-mini-gguf |
| zen5 (default) | 24 GB+ VRAM | zen-5-gguf |
| zen5-pro | 128 GB single-machine | ← you are here |
| zen5-max | 512 GB / 8x H100 | zen-5-max-gguf |

Files

| File pattern | Size | Notes |
| --- | --- | --- |
| `*-IQ2XXS-w2Q2K-*-chat-v2-imatrix.gguf` | 81 GB | Recommended: IQ2_XXS routed-MoE with imatrix calibration, fits 128 GB |
| `*-Layers37-42Q4KExperts-*-imatrix-fixed.gguf` | 91 GB | Mixed-quant: higher quality at the boundary layers, fits 128 GB |
| `*-Q4KExperts-F16HC-*-imatrix.gguf` | 153 GB | Q4-imatrix: for 256 GB+ unified-memory machines |
| `*-MTP-Q4K-Q8_0-F32.gguf` | 3.5 GB | Optional speculative-decoding draft heads (use with --mtp) |
| `*-IQ2XXS-w2Q2K-*-chat-v2.gguf` (non-imatrix) | 81 GB | Legacy Q2: prefer the -imatrix version above |
| `*-Q4KExperts-F16HC-*-chat-v2.gguf` (non-imatrix) | 153 GB | Legacy Q4: prefer the -imatrix version above |
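
As a rough sanity check on these sizes, a file's size can be divided by the 284B total parameter count to get an average bits-per-weight figure. The sketch below assumes "GB" means 10^9 bytes and ignores GGUF metadata and the fact that different tensors use different quantization types. `bits_per_weight` is an illustrative helper, not part of the zen5-engine.

```shell
# Average bits per weight for a GGUF file (illustrative arithmetic only).
# Assumption: 1 GB = 10^9 bytes; metadata overhead is ignored.
bits_per_weight() {
  # $1 = file size in GB, $2 = total parameters in billions
  awk -v gb="$1" -v params="$2" 'BEGIN { printf "%.2f\n", gb * 8 / params }'
}

bits_per_weight 81 284     # recommended IQ2_XXS file  -> 2.28
bits_per_weight 91 284     # mixed-quant file          -> 2.56
bits_per_weight 153 284    # Q4 experts file           -> 4.31
```

The recommended file lands a little above 2 bits per weight, which is consistent with routed experts at IQ2_XXS/Q2_K and the remaining tensors kept at higher precision.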

Run

Hosted via the Hanzo gateway (api.hanzo.ai) as zen5-pro.

Local with the zen5-engine:

git clone https://github.com/zenlm/zen5-engine
cd zen5-engine && make                  # macOS Metal
                       # or: make cuda-spark    # DGX Spark GB10
                       # or: make cuda-generic  # generic CUDA box

./download_model.sh q2-imatrix          # pulls this repo's recommended GGUF
./zen5 -p "Hello"
./zen5-server --ctx 100000 --kv-disk-dir /tmp/zen5-kv --kv-disk-space-mb 8192
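
Before downloading, it can help to check which file from the Files section suits the machine. The sketch below is a hypothetical pre-flight helper (`pick_quant` is not part of the zen5-engine). Its thresholds simply mirror the notes in the Files section: 128 GB for the 81 GB file and 256 GB for the 153 GB file.

```shell
# Suggest a GGUF from the Files section for a given amount of unified memory.
# Assumption: thresholds follow the "fits 128 GB" / "256 GB+" notes only;
# real headroom also depends on context length and KV-cache settings.
pick_quant() {
  ram_gb="$1"   # whole gigabytes of unified memory
  if [ "$ram_gb" -ge 256 ]; then
    echo "Q4KExperts-F16HC imatrix (153 GB)"
  elif [ "$ram_gb" -ge 128 ]; then
    echo "IQ2XXS-w2Q2K imatrix (81 GB)"
  else
    echo "none: consider a smaller Zen5 SKU"
  fi
}

pick_quant 128
pick_quant 512
pick_quant 64
```

On macOS the memory size in bytes can be read with `sysctl -n hw.memsize` and converted to gigabytes before calling the helper.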

Performance (Metal, --ctx 32768 --nothink)

| Machine | Prompt | Prefill | Generation |
| --- | --- | --- | --- |
| MacBook Pro M3 Max 128 GB | short | 58.5 t/s | 26.7 t/s |
| MacBook Pro M3 Max 128 GB | 11709 tok | 250.1 t/s | 21.5 t/s |
| Mac Studio M3 Ultra 512 GB | short | 84.4 t/s | 36.9 t/s |
| Mac Studio M3 Ultra 512 GB | 12018 tok (Q4) | 448.8 t/s | 26.6 t/s |
| DGX Spark GB10 128 GB | 7047 tok | 343.8 t/s | 13.7 t/s |
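
These figures can be turned into a rough end-to-end latency estimate: prompt tokens divided by the prefill rate, plus generated tokens divided by the generation rate. The sketch assumes both rates stay constant over the request, which is a simplification because the table itself shows generation slowing with longer prompts. `estimate_seconds` and the 500-token reply length are illustrative choices.

```shell
# Estimated request time in seconds:
#   prompt_tokens / prefill_tps + generated_tokens / generation_tps
# Assumption: rates are constant for the whole request.
estimate_seconds() {
  # $1 = prompt tokens, $2 = prefill t/s, $3 = generated tokens, $4 = generation t/s
  awk -v p="$1" -v pre="$2" -v g="$3" -v gen="$4" \
    'BEGIN { printf "%.1f\n", p / pre + g / gen }'
}

# MacBook Pro M3 Max, 11709-token prompt, 500-token reply
estimate_seconds 11709 250.1 500 21.5    # -> 70.1
```

For the M3 Max row this works out to roughly 47 seconds of prefill and 23 seconds of generation.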

Acknowledgements

Built on deepseek-ai/DeepSeek-V4-Flash. The asymmetric routed-MoE quantization scheme, GGUF layout, MTP draft-token support, imatrix calibration, and inference engine all come from Salvatore Sanfilippo's antirez/ds4 project, on whose shoulders this distribution stands. MIT-licensed; both antirez/ds4 and ggml-org/llama.cpp copyrights are preserved in the zen5-engine LICENSE file.
