Qwen3.5-9B-Base-NVFP4

NVFP4 (4-bit floating point) quantization of Qwen/Qwen3.5-9B-Base using NVIDIA TensorRT Model Optimizer (modelopt).

Details

| Property     | Value |
|--------------|-------|
| Base model   | Qwen/Qwen3.5-9B-Base |
| Parameters   | 9.5B |
| Quantization | NVFP4 (group_size=16) |
| Model size   | 8.0 GB |
| Compression  | ~1.4x vs bf16 |
| Excluded     | lm_head |
| Producer     | modelopt 0.37.0 |
| Calibration  | 256 samples, CNN/DailyMail, max_seq_len=2048 |

Architecture

A hybrid Gated DeltaNet + Gated Attention model with 32 layers in a repeating pattern of 3x linear_attention followed by 1x full_attention. Includes an MTP (Multi-Token Prediction) head.
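The repeating layer pattern can be sketched as follows. This is an illustrative reconstruction of the 3:1 interleave described above, not code read from the model config:

```python
NUM_LAYERS = 32

def layer_types(num_layers: int) -> list[str]:
    """Every 4th layer (0-indexed 3, 7, 11, ...) is full attention;
    the rest are linear (Gated DeltaNet) attention."""
    return [
        "full_attention" if (i + 1) % 4 == 0 else "linear_attention"
        for i in range(num_layers)
    ]

pattern = layer_types(NUM_LAYERS)
print(pattern[:4])  # ['linear_attention', 'linear_attention', 'linear_attention', 'full_attention']
```

With 32 layers this yields 24 linear_attention and 8 full_attention layers.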

Usage with vLLM

vllm serve Qwen3.5-9B-Base-NVFP4 \
    --quantization modelopt \
    --language-model-only \
    --trust-remote-code \
    --gpu-memory-utilization 0.85

Note: --language-model-only is required because Qwen3.5 models use a ForConditionalGeneration (multimodal) architecture; the flag skips the vision encoder for text-only inference.
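Once the server is up, it exposes vLLM's OpenAI-compatible HTTP API. A minimal client sketch (the host/port assume vLLM's default of localhost:8000; only the stdlib is used):

```python
import json
import urllib.request

def build_payload(prompt: str, max_tokens: int = 64) -> dict:
    """Request body for vLLM's /v1/completions endpoint."""
    return {
        "model": "Qwen3.5-9B-Base-NVFP4",
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.7,
    }

def complete(prompt: str, url: str = "http://localhost:8000/v1/completions") -> str:
    req = urllib.request.Request(
        url,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"]

if __name__ == "__main__":
    print(complete("NVFP4 quantization is"))
```

Since this is a base (non-chat) model, the raw /v1/completions endpoint is the natural fit rather than /v1/chat/completions.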

Quantization

Produced on NVIDIA DGX Spark (GB10 Grace Blackwell, 128GB unified memory):

python quantize_nvfp4.py \
    --model Qwen/Qwen3.5-9B-Base \
    --output Qwen3.5-9B-Base-NVFP4 \
    --calib-size 256
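A plausible sketch of what quantize_nvfp4.py could look like. The modelopt calls (mtq.quantize with mtq.NVFP4_DEFAULT_CFG, export_hf_checkpoint) follow the TensorRT Model Optimizer PTQ API, but the script itself and the load_calib_batches helper are assumptions, not the actual source:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(description="NVFP4 post-training quantization via modelopt")
    p.add_argument("--model", required=True)
    p.add_argument("--output", required=True)
    p.add_argument("--calib-size", type=int, default=256)
    p.add_argument("--max-seq-len", type=int, default=2048)
    return p

def main() -> None:
    args = build_parser().parse_args()
    # Heavy imports deferred so the parser stays importable without a GPU.
    import modelopt.torch.quantization as mtq
    from modelopt.torch.export import export_hf_checkpoint
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained(
        args.model, torch_dtype="auto", device_map="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained(args.model)

    def forward_loop(m):
        # Run calibration batches (e.g. CNN/DailyMail samples) through the
        # model so modelopt can observe activation ranges.
        # load_calib_batches is a hypothetical helper, not part of modelopt.
        for batch in load_calib_batches(tokenizer, args.calib_size, args.max_seq_len):
            m(**batch)

    mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)
    export_hf_checkpoint(model, export_dir=args.output)

if __name__ == "__main__":
    main()
```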