# Qwen3.5-9B-Base-NVFP4
NVFP4 (4-bit floating point) quantization of Qwen/Qwen3.5-9B-Base using NVIDIA TensorRT Model Optimizer (modelopt).
## Details
| Property | Value |
|---|---|
| Base model | Qwen/Qwen3.5-9B-Base |
| Parameters | 9.5B |
| Quantization | NVFP4 (group_size=16) |
| Model size | 8.0 GB |
| Compression | ~2.4x vs bf16 (8.0 GB vs ~19 GB) |
| Excluded | lm_head |
| Producer | modelopt 0.37.0 |
| Calibration | 256 samples, CNN/DailyMail, max_seq_len=2048 |
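For intuition on the `group_size=16` entry above: NVFP4 stores weights as 4-bit floats (E2M1, representable magnitudes 0, 0.5, 1, 1.5, 2, 3 ,4, 6) with one shared scale per group of 16 values. A toy fake-quantizer sketch of that scheme follows; it is illustrative only (the real format also carries FP8 group scales and a per-tensor scale), not modelopt's implementation:

```python
# Toy group-wise FP4 (E2M1) fake quantization, group_size=16.
# Each group is scaled so its max magnitude maps to the top FP4 level (6.0),
# then every value snaps to the nearest representable FP4 level.
import numpy as np

E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
LEVELS = np.concatenate([-E2M1_GRID[:0:-1], E2M1_GRID])  # signed, ascending

def fake_quantize_nvfp4(w: np.ndarray, group_size: int = 16) -> np.ndarray:
    """Quantize-dequantize a 1-D weight vector group by group."""
    out = np.empty(len(w), dtype=np.float64)
    for start in range(0, len(w), group_size):
        g = w[start:start + group_size].astype(np.float64)
        scale = np.abs(g).max() / 6.0  # group max maps onto the top FP4 level
        if scale == 0.0:
            scale = 1.0  # all-zero group: any scale works
        # Nearest representable level for each element
        idx = np.abs(g[:, None] / scale - LEVELS[None, :]).argmin(axis=1)
        out[start:start + group_size] = LEVELS[idx] * scale
    return out
```

Values that already sit on a scaled FP4 level survive exactly; everything else incurs at most half the widest level gap (1.0 x the group scale) of rounding error.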
## Architecture
Hybrid Gated DeltaNet + Gated Attention with 32 layers in a repeating 3x linear_attention + 1x full_attention pattern. Includes MTP (Multi-Token Prediction) head.
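The repeating 3:1 pattern above can be sketched as a layer-type list; placing full attention at every 4th position is an assumption for illustration (the exact index assignment comes from the model config):

```python
# Illustrative layout of the 32-layer hybrid stack: three Gated DeltaNet
# (linear attention) layers followed by one full gated-attention layer,
# repeated across the depth of the network.
NUM_LAYERS = 32

layer_types = [
    "full_attention" if (i + 1) % 4 == 0 else "linear_attention"
    for i in range(NUM_LAYERS)
]
```

This yields 24 linear-attention layers and 8 full-attention layers, so only a quarter of the stack pays the quadratic attention cost.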
## Usage with vLLM

```bash
vllm serve Qwen3.5-9B-Base-NVFP4 \
  --quantization modelopt \
  --language-model-only \
  --trust-remote-code \
  --gpu-memory-utilization 0.85
```
Note: `--language-model-only` is required because Qwen3.5 models use a `ForConditionalGeneration` (multimodal) architecture. This flag skips the vision encoder for text-only inference.
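Once serving, vLLM exposes an OpenAI-compatible API. A minimal stdlib-only client sketch, assuming the default port 8000 and the served model name above:

```python
# Minimal client for vLLM's OpenAI-compatible /v1/completions endpoint.
# URL and model name assume the default `vllm serve` invocation shown above.
import json
import urllib.request

def build_completion_request(prompt: str, max_tokens: int = 64):
    """Return the endpoint URL and JSON payload for a completion call."""
    url = "http://localhost:8000/v1/completions"
    payload = {
        "model": "Qwen3.5-9B-Base-NVFP4",
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }
    return url, payload

url, payload = build_completion_request("The capital of France is")
req = urllib.request.Request(
    url,
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# Uncomment with the server running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["text"])
```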
## Quantization

Produced on an NVIDIA DGX Spark (GB10 Grace Blackwell, 128 GB unified memory):

```bash
python quantize_nvfp4.py \
  --model Qwen/Qwen3.5-9B-Base \
  --output Qwen3.5-9B-Base-NVFP4 \
  --calib-size 256
```