Zen5 Chat Ladder
Collection
Canonical Zen5 lineup, smallest to largest. • 6 items • Updated
How to use zenlm/zen-5-flash-gguf with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("text-generation", model="zenlm/zen-5-flash-gguf") # Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("zenlm/zen-5-flash-gguf")
model = AutoModelForCausalLM.from_pretrained("zenlm/zen-5-flash-gguf")How to use zenlm/zen-5-flash-gguf with vLLM:
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "zenlm/zen-5-flash-gguf"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "zenlm/zen-5-flash-gguf",
"prompt": "Once upon a time,",
"max_tokens": 512,
"temperature": 0.5
}'docker model run hf.co/zenlm/zen-5-flash-gguf
How to use zenlm/zen-5-flash-gguf with SGLang:
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
--model-path "zenlm/zen-5-flash-gguf" \
--host 0.0.0.0 \
--port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "zenlm/zen-5-flash-gguf",
"prompt": "Once upon a time,",
"max_tokens": 512,
"temperature": 0.5
}'docker run --gpus all \
--shm-size 32g \
-p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=<secret>" \
--ipc=host \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server \
--model-path "zenlm/zen-5-flash-gguf" \
--host 0.0.0.0 \
--port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "zenlm/zen-5-flash-gguf",
"prompt": "Once upon a time,",
"max_tokens": 512,
"temperature": 0.5
}'How to use zenlm/zen-5-flash-gguf with Docker Model Runner:
docker model run hf.co/zenlm/zen-5-flash-gguf
Smallest and fastest tier in the Zen5 family. A dense 4B-parameter instruct model with sub-100ms time-to-first-token at 32K context, tuned for high-throughput routing and simple agent loops.
Part of the canonical Zen5 ladder:
| SKU | Hardware fit | This repo |
|---|---|---|
zen5-flash |
anything (4 GB VRAM) | ← you are here |
zen5-mini |
hosted only | zen-5-mini-gguf |
zen5 (default) |
24 GB+ VRAM | zen-5-gguf |
zen5-pro |
Mac M4 Max / DGX Spark / H100 80GB | zen-5-pro-gguf |
zen5-max |
Mac Studio M3 Ultra 512GB / 8x H100 | zen-5-max-gguf |
| File | Format |
|---|---|
model-00001-of-00002.safetensors + model-00002-of-00002.safetensors |
sharded safetensors |
tokenizer.json, tokenizer_config.json, special_tokens_map.json |
tokenizer |
config.json, generation_config.json |
model config |
chat_template.jinja |
chat template |
Hosted via the Hanzo gateway (api.hanzo.ai) as zen5-flash — see https://docs.hanzo.ai/zen.
Local with the zen5-engine or any transformers-compatible runtime:
from transformers import AutoModelForCausalLM, AutoTokenizer
tok = AutoTokenizer.from_pretrained("zenlm/zen-5-flash-gguf")
model = AutoModelForCausalLM.from_pretrained("zenlm/zen-5-flash-gguf", device_map="auto")
Built on Qwen/Qwen3-4B-Instruct-2507 (Apache-2.0) with refusal-direction-orthogonalized weights to improve agentic dual-use task handling.
Base model
Qwen/Qwen3-4B-Instruct-2507