Zen5 Flash

Smallest and fastest tier in the Zen5 family. A dense 4B-parameter instruct model with sub-100ms time-to-first-token at 32K context, tuned for high-throughput routing and simple agent loops.

Part of the canonical Zen5 ladder:

SKU	Hardware fit	This repo
`zen5-flash`	anything (4 GB VRAM)	← you are here
`zen5-mini`	hosted only	zen-5-mini-gguf
`zen5` (default)	24 GB+ VRAM	zen-5-gguf
`zen5-pro`	Mac M4 Max / DGX Spark / H100 80GB	zen-5-pro-gguf
`zen5-max`	Mac Studio M3 Ultra 512GB / 8x H100	zen-5-max-gguf

Files

File	Format
`model-00001-of-00002.safetensors` + `model-00002-of-00002.safetensors`	sharded safetensors
`tokenizer.json`, `tokenizer_config.json`, `special_tokens_map.json`	tokenizer
`config.json`, `generation_config.json`	model config
`chat_template.jinja`	chat template

Run

Hosted via the Hanzo gateway (api.hanzo.ai) as zen5-flash — see https://docs.hanzo.ai/zen.

Local with the zen5-engine or any transformers-compatible runtime:

from transformers import AutoModelForCausalLM, AutoTokenizer
tok = AutoTokenizer.from_pretrained("zenlm/zen-5-flash-gguf")
model = AutoModelForCausalLM.from_pretrained("zenlm/zen-5-flash-gguf", device_map="auto")