Zen5 Flash

Smallest and fastest tier in the Zen5 family. A dense 4B-parameter instruct model with sub-100ms time-to-first-token at 32K context, tuned for high-throughput routing and simple agent loops.

Part of the canonical Zen5 ladder:

SKU Hardware fit This repo
zen5-flash anything (4 GB VRAM) ← you are here
zen5-mini hosted only zen-5-mini-gguf
zen5 (default) 24 GB+ VRAM zen-5-gguf
zen5-pro Mac M4 Max / DGX Spark / H100 80GB zen-5-pro-gguf
zen5-max Mac Studio M3 Ultra 512GB / 8x H100 zen-5-max-gguf

Files

File Format
model-00001-of-00002.safetensors + model-00002-of-00002.safetensors sharded safetensors
tokenizer.json, tokenizer_config.json, special_tokens_map.json tokenizer
config.json, generation_config.json model config
chat_template.jinja chat template

Run

Hosted via the Hanzo gateway (api.hanzo.ai) as zen5-flash — see https://docs.hanzo.ai/zen.

Local with the zen5-engine or any transformers-compatible runtime:

from transformers import AutoModelForCausalLM, AutoTokenizer
tok = AutoTokenizer.from_pretrained("zenlm/zen-5-flash-gguf")
model = AutoModelForCausalLM.from_pretrained("zenlm/zen-5-flash-gguf", device_map="auto")

Acknowledgements

Built on Qwen/Qwen3-4B-Instruct-2507 (Apache-2.0) with refusal-direction-orthogonalized weights to improve agentic dual-use task handling.

Downloads last month
-
Safetensors
Model size
4B params
Tensor type
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for zenlm/zen-5-flash-gguf

Finetuned
(1708)
this model

Collection including zenlm/zen-5-flash-gguf