Zen5 Pro
High-quality Zen5 tier that fits on a single 128 GB unified-memory machine. Sparse MoE with 284B total / 37B active parameters per token, 1M context, asymmetric routed-MoE quantization (routed IQ2_XXS up/gate, Q2_K down; shared experts, attention projections, routing logits and the LM head left at higher precision).
Runs on a single 128 GB Apple Silicon machine (M3/M4 Max), a DGX Spark (GB10), or an H100 80 GB with the zen5-engine.
Part of the canonical Zen5 ladder:
| SKU | Hardware fit | Repo |
|---|---|---|
| zen5-flash | anything | zen-5-flash-gguf |
| zen5-mini | 32 GB | zen-5-mini-gguf |
| zen5 (default) | 24 GB+ VRAM | zen-5-gguf |
| zen5-pro | 128 GB single-machine | (you are here) |
| zen5-max | 512 GB / 8x H100 | zen-5-max-gguf |
Files
| File pattern | Size | Notes |
|---|---|---|
| `*-IQ2XXS-w2Q2K-*-chat-v2-imatrix.gguf` | 81 GB | Recommended: IQ2_XXS routed-MoE with imatrix calibration, fits 128 GB |
| `*-Layers37-42Q4KExperts-*-imatrix-fixed.gguf` | 91 GB | Mixed-quant: higher quality at the boundary layers, fits 128 GB |
| `*-Q4KExperts-F16HC-*-imatrix.gguf` | 153 GB | Q4-imatrix: for 256 GB+ unified-memory machines |
| `*-MTP-Q4K-Q8_0-F32.gguf` | 3.5 GB | Optional speculative-decoding draft heads (use with `--mtp`) |
| `*-IQ2XXS-w2Q2K-*-chat-v2.gguf` (non-imatrix) | 81 GB | Legacy Q2: prefer the -imatrix version above |
| `*-Q4KExperts-F16HC-*-chat-v2.gguf` (non-imatrix) | 153 GB | Legacy Q4: prefer the -imatrix version above |
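
As a rough sanity check on the sizes listed above, the following sketch derives the average bits per weight each main file implies and the memory left over on a 128 GB machine. It assumes decimal gigabytes, ignores GGUF metadata overhead, and treats all 284B parameters as stored in the file; the function names are illustrative and not part of zen5-engine.

```python
TOTAL_PARAMS = 284e9  # total parameter count quoted for Zen5 Pro

# (label, file size in GB) copied from the Files table
FILES = [
    ("IQ2_XXS routed-MoE, imatrix", 81),
    ("mixed-quant, imatrix-fixed", 91),
    ("Q4K experts, imatrix", 153),
]


def bits_per_weight(size_gb, n_params=TOTAL_PARAMS):
    """Average bits stored per parameter, assuming 1 GB = 1e9 bytes."""
    return size_gb * 1e9 * 8 / n_params


def headroom_gb(size_gb, machine_gb=128):
    """Memory left for KV cache, activations and the OS; negative means no fit."""
    return machine_gb - size_gb


for label, size_gb in FILES:
    print(f"{label}: {bits_per_weight(size_gb):.2f} bits/weight, "
          f"{headroom_gb(size_gb):+.0f} GB headroom on 128 GB")
```

Under these assumptions the recommended file averages a little over 2 bits per weight, which is consistent with the routed experts being quantized hardest while the remaining tensors stay at higher precision. The 153 GB file comes out with negative headroom on 128 GB, matching the table's note that it targets 256 GB+ machines.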
Run
Hosted via the Hanzo gateway (api.hanzo.ai) as zen5-pro.
Local with the zen5-engine:

    git clone https://github.com/zenlm/zen5-engine
    cd zen5-engine && make            # macOS Metal
    # or: make cuda-spark             # DGX Spark GB10
    # or: make cuda-generic           # generic CUDA box
    ./download_model.sh q2-imatrix    # pulls this repo's recommended GGUF
    ./zen5 -p "Hello"
    ./zen5-server --ctx 100000 --kv-disk-dir /tmp/zen5-kv --kv-disk-space-mb 8192

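
If the server is launched from a script, the same invocation can be assembled programmatically. The sketch below is a minimal example that only uses the three flags shown in the command above; it assumes `--kv-disk-space-mb` is a disk budget in megabytes for data written under `--kv-disk-dir`, which this README does not spell out. The helper names `build_server_cmd` and `has_disk_space` are hypothetical.

```python
import shutil
from pathlib import Path


def build_server_cmd(ctx, kv_dir, kv_space_mb, binary="./zen5-server"):
    """Assemble the argv for zen5-server from the flags shown in the README."""
    return [
        binary,
        "--ctx", str(ctx),
        "--kv-disk-dir", str(kv_dir),
        "--kv-disk-space-mb", str(kv_space_mb),
    ]


def has_disk_space(kv_dir, kv_space_mb):
    """True if the filesystem holding kv_dir has at least kv_space_mb MiB free."""
    path = Path(kv_dir)
    # Walk up to the nearest existing directory so disk_usage does not fail
    # when the KV directory has not been created yet.
    while not path.exists():
        path = path.parent
    free_mb = shutil.disk_usage(path).free // (1024 * 1024)
    return free_mb >= kv_space_mb


if __name__ == "__main__":
    cmd = build_server_cmd(100000, "/tmp/zen5-kv", 8192)
    if has_disk_space("/tmp/zen5-kv", 8192):
        print(" ".join(cmd))  # hand cmd to subprocess.run(cmd) to launch
    else:
        print("not enough free disk space for the KV directory")
```

Checking free space before launch is a design choice of this sketch, not something the engine is known to require.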
Performance (Metal, --ctx 32768 --nothink)
| Machine | Prompt | Prefill | Generation |
|---|---|---|---|
| MacBook Pro M3 Max 128 GB | short | 58.5 t/s | 26.7 t/s |
| MacBook Pro M3 Max 128 GB | 11709 tok | 250.1 t/s | 21.5 t/s |
| Mac Studio M3 Ultra 512 GB | short | 84.4 t/s | 36.9 t/s |
| Mac Studio M3 Ultra 512 GB | 12018 tok (Q4) | 448.8 t/s | 26.6 t/s |
| DGX Spark GB10 128 GB | 7047 tok | 343.8 t/s | 13.7 t/s |
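
To turn these rates into an expected wait, one simple model is prefill time plus generation time. The sketch below applies it to the MacBook Pro M3 Max row with the 11709-token prompt; the 500-token reply length is an assumed figure for illustration, and the model assumes both rates stay constant over the request, which real runs may not match.

```python
def request_seconds(prompt_tokens, new_tokens, prefill_tps, gen_tps):
    """Estimate wall time as prompt_tokens/prefill_tps + new_tokens/gen_tps."""
    return prompt_tokens / prefill_tps + new_tokens / gen_tps


# Rates from the MacBook Pro M3 Max 128 GB row (11709-token prompt).
PROMPT_TOKENS = 11709
PREFILL_TPS = 250.1
GEN_TPS = 21.5
NEW_TOKENS = 500  # assumed reply length, not from the benchmark

prefill_s = PROMPT_TOKENS / PREFILL_TPS
total_s = request_seconds(PROMPT_TOKENS, NEW_TOKENS, PREFILL_TPS, GEN_TPS)
print(f"prefill ~{prefill_s:.1f} s, total for {NEW_TOKENS} new tokens ~{total_s:.1f} s")
```

With these inputs, prefill accounts for roughly two thirds of the total, so for long prompts the prefill rate matters more than the generation rate.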
Acknowledgements
Built on deepseek-ai/DeepSeek-V4-Flash. The asymmetric routed-MoE quantization scheme, GGUF layout, MTP draft-token support, imatrix calibration, and inference engine all come from Salvatore Sanfilippo's antirez/ds4 project, on whose shoulders this distribution stands. MIT-licensed; both antirez/ds4 and ggml-org/llama.cpp copyrights are preserved in the zen5-engine LICENSE file.