Qwen3.5-9B-NVFP4

Quantized variant of Qwen/Qwen3.5-9B, exported in the unified Hugging Face checkpoint format.

Quantization Details

This checkpoint corresponds to an NVFP4 MLP-only export profile:

  • MLP layers: NVFP4
  • Non-MLP layers: kept in higher precision (e.g. BF16)
  • KV cache: left unquantized (kvnone export profile)
  • Vision modules: kept in higher precision to preserve multimodal quality
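
To make the split concrete, here is a minimal sketch of how an MLP-only profile assigns precision by parameter name. The layer names and the `export_precision` helper are hypothetical illustrations in the usual Qwen naming scheme, not values read from this checkpoint:

```python
# Sketch of the MLP-only NVFP4 export split. Layer names below are
# hypothetical examples in the typical Qwen naming scheme, not values
# read from this checkpoint.
MLP_MARKERS = ("mlp.gate_proj", "mlp.up_proj", "mlp.down_proj")

def export_precision(param_name: str) -> str:
    """Precision a parameter would receive under the MLP-only profile."""
    if any(marker in param_name for marker in MLP_MARKERS):
        return "NVFP4"
    # Attention, norms, embeddings, and vision modules stay in high precision.
    return "BF16"

for name in (
    "model.layers.0.mlp.gate_proj.weight",     # MLP -> quantized
    "model.layers.0.self_attn.q_proj.weight",  # attention -> kept in BF16
    "visual.blocks.0.attn.qkv.weight",         # vision -> kept in BF16
):
    print(f"{name}: {export_precision(name)}")
```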

Recommended Runtime (vLLM Nightly)

Use the latest nightly vLLM build:

pip install -U --pre vllm --extra-index-url https://wheels.vllm.ai/nightly

Serve directly from this Hub repo:

# Adjust --gpu-memory-utilization to fit your available VRAM.
# chat_template.jinja ships in the repo root.
vllm serve "ykarout/Qwen3.5-9b-nvfp4" \
  --port 8000 \
  --tensor-parallel-size 1 \
  --max-model-len 65536 \
  --gpu-memory-utilization 0.85 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --chat-template chat_template.jinja \
  --enable-prefix-caching \
  --served-model-name qwen3.5-9b-nvfp4

Quick Test

curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model":"qwen3.5-9b-nvfp4",
    "messages":[{"role":"user","content":"Explain KV cache in 3 bullet points."}],
    "max_tokens":220,
    "temperature":0.7,
    "top_p":0.8,
    "top_k":20,
    "min_p":0.0,
    "presence_penalty":1.5,
    "repetition_penalty":1.0
  }'
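
The same request can be issued from Python using only the standard library. This is a sketch assuming the server started above is listening on 127.0.0.1:8000; the `send` helper is illustrative, not part of any API:

```python
import json
import urllib.request

# Same request body as the curl example above.
body = {
    "model": "qwen3.5-9b-nvfp4",
    "messages": [{"role": "user", "content": "Explain KV cache in 3 bullet points."}],
    "max_tokens": 220,
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,
    "presence_penalty": 1.5,
}

def send(url: str = "http://127.0.0.1:8000/v1/chat/completions") -> dict:
    """POST the request and return the parsed JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# With the server running:
# print(send()["choices"][0]["message"]["content"])
```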

Notes

  • If VRAM is tight, reduce --max-model-len and/or --gpu-memory-utilization.
  • This is a quantized checkpoint; output quality and speed depend on backend/kernel versions.
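
To pick sensible values for those flags, a back-of-envelope KV-cache estimate helps. The model dimensions below are assumptions for a ~9B Qwen-style architecture (GQA with 8 KV heads), not values read from this checkpoint:

```python
# Rough KV-cache sizing to guide --max-model-len / --gpu-memory-utilization.
# Dimensions are assumed for a ~9B Qwen-style model, not read from the checkpoint.
def kv_cache_bytes(seq_len: int,
                   num_layers: int = 36,
                   num_kv_heads: int = 8,
                   head_dim: int = 128,
                   dtype_bytes: int = 2) -> int:
    """Bytes for one sequence's K and V tensors across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes * seq_len

gib = kv_cache_bytes(65536) / 2**30
print(f"{gib:.1f} GiB for a single 65536-token sequence")  # → 9.0 GiB
```

Under these assumptions, a full-length 65536-token context costs ~9 GiB of KV cache on top of the weights, which is why shrinking `--max-model-len` is the first lever when VRAM is tight.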
Checkpoint Stats

  • Model size: 8B params
  • Tensor types: BF16 · F8_E4M3 · U8