Zen5 Pro

High-quality Zen5 tier that fits on a single 128 GB unified-memory machine. It is a sparse MoE with 284B total / 37B active parameters per token and a 1M-token context. Quantization is asymmetric across the routed MoE: routed experts use IQ2_XXS for the up/gate projections and Q2_K for the down projection, while shared experts, attention projections, routing logits, and the LM head are left at higher precision.

Runs on a single 128 GB Apple Silicon machine (M3/M4 Max), a DGX Spark (GB10), or an H100 80 GB, using the zen5-engine.

Part of the canonical Zen5 ladder:

SKU	Hardware fit	Repo
`zen5-flash`	anything	zen-5-flash-gguf
`zen5-mini`	32 GB	zen-5-mini-gguf
`zen5` (default)	24 GB+ VRAM	zen-5-gguf
`zen5-pro`	128 GB single-machine	← you are here
`zen5-max`	512 GB / 8x H100	zen-5-max-gguf

Files

File pattern	Size	Notes
`-IQ2XXS-w2Q2K--chat-v2-imatrix.gguf`	81 GB	Recommended — IQ2_XXS routed-MoE with imatrix calibration, fits 128 GB
`-Layers37-42Q4KExperts--imatrix-fixed.gguf`	91 GB	Mixed-quant — higher quality at the boundary layers, fits 128 GB
`-Q4KExperts-F16HC--imatrix.gguf`	153 GB	Q4-imatrix — for 256 GB+ unified-memory machines
`*-MTP-Q4K-Q8_0-F32.gguf`	3.5 GB	Optional speculative-decoding draft heads (use with `--mtp`)
`-IQ2XXS-w2Q2K--chat-v2.gguf` (non-imatrix)	81 GB	Legacy Q2 — prefer the `-imatrix` version above
`-Q4KExperts-F16HC--chat-v2.gguf` (non-imatrix)	153 GB	Legacy Q4 — prefer the `-imatrix` version above
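
As a rough sanity check on the sizes above, the sketch below converts a file size into average bits per weight and into the memory left over on a 128 GB machine. It assumes that "GB" means 10**9 bytes and that the whole file is weights (metadata and tokenizer overhead are ignored); the function names are illustrative and are not part of zen5-engine.

```python
# Effective bits per weight implied by a GGUF file size.
# Assumptions (not from the model card): "GB" means 10**9 bytes, and the
# whole file is weights (metadata and tokenizer overhead are ignored).

TOTAL_PARAMS_B = 284  # total parameters in billions, from the model description


def bits_per_weight(file_size_gb: float, params_b: float = TOTAL_PARAMS_B) -> float:
    """Average number of bits stored per parameter for a file of this size."""
    return file_size_gb * 8 / params_b


def headroom_gb(file_size_gb: float, machine_gb: float) -> float:
    """Memory left for KV cache, activations, and the OS once weights are loaded."""
    return machine_gb - file_size_gb


if __name__ == "__main__":
    files = [("IQ2XXS imatrix", 81), ("mixed-quant", 91), ("Q4 imatrix", 153)]
    for label, size_gb in files:
        print(
            f"{label}: {bits_per_weight(size_gb):.2f} bits/weight, "
            f"{headroom_gb(size_gb, 128):+.0f} GB headroom on 128 GB"
        )
```

Under these assumptions the 81 GB file averages roughly 2.3 bits per weight and the 153 GB file roughly 4.3, which is consistent with the Q4 build being listed for 256 GB+ machines only.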

Run

Hosted via the Hanzo gateway (api.hanzo.ai) as zen5-pro.

Local with the zen5-engine:

git clone https://github.com/zenlm/zen5-engine
cd zen5-engine && make                  # macOS Metal
# or: make cuda-spark                   # DGX Spark GB10
# or: make cuda-generic                 # generic CUDA box

./download_model.sh q2-imatrix          # pulls this repo's recommended GGUF
./zen5 -p "Hello"
./zen5-server --ctx 100000 --kv-disk-dir /tmp/zen5-kv --kv-disk-space-mb 8192

Performance (Metal, `--ctx 32768 --nothink`)

Machine	Prompt	Prefill	Generation
MacBook Pro M3 Max 128 GB	short	58.5 t/s	26.7 t/s
MacBook Pro M3 Max 128 GB	11709 tok	250.1 t/s	21.5 t/s
Mac Studio M3 Ultra 512 GB	short	84.4 t/s	36.9 t/s
Mac Studio M3 Ultra 512 GB	12018 tok (Q4)	448.8 t/s	26.6 t/s
DGX Spark GB10 128 GB	7047 tok	343.8 t/s	13.7 t/s
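
To turn these throughput figures into wall-clock time, a minimal sketch: total latency is approximately prompt tokens divided by prefill rate plus generated tokens divided by generation rate. The 500-token reply length is a hypothetical value chosen for illustration, and the estimate assumes throughput stays at the measured rates for the whole request.

```python
def estimate_latency_s(
    prompt_tokens: int,
    new_tokens: int,
    prefill_tps: float,
    gen_tps: float,
) -> float:
    """Approximate seconds to process a prompt and generate a reply.

    Assumes constant prefill and generation throughput (tokens per second).
    """
    prefill_s = prompt_tokens / prefill_tps
    generation_s = new_tokens / gen_tps
    return prefill_s + generation_s


if __name__ == "__main__":
    # MacBook Pro M3 Max 128 GB row with the 11709-token prompt.
    # new_tokens=500 is a hypothetical reply length, not a measured value.
    seconds = estimate_latency_s(
        prompt_tokens=11709, new_tokens=500, prefill_tps=250.1, gen_tps=21.5
    )
    print(f"estimated end-to-end time: {seconds:.1f} s")
```

For that row, prefill dominates: about 47 s is spent reading the prompt before the first generated token appears.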

Acknowledgements

Built on deepseek-ai/DeepSeek-V4-Flash. The asymmetric routed-MoE quantization scheme, GGUF layout, MTP draft-token support, imatrix calibration, and inference engine all come from Salvatore Sanfilippo's antirez/ds4 project, on whose shoulders this distribution stands. MIT-licensed; both antirez/ds4 and ggml-org/llama.cpp copyrights are preserved in the zen5-engine LICENSE file.
