---
license: apache-2.0
base_model: Qwen/Qwen3.5-9B-Base
tags:
  - qwen3.5
  - nvfp4
  - quantized
  - modelopt
library_name: transformers
---

# Qwen3.5-9B-Base-NVFP4

NVFP4 (4-bit floating point) quantization of Qwen/Qwen3.5-9B-Base using NVIDIA TensorRT Model Optimizer (modelopt).

## Details

| Property | Value |
|---|---|
| Base model | Qwen/Qwen3.5-9B-Base |
| Parameters | 9.5B |
| Quantization | NVFP4 (group_size=16) |
| Model size | 8.0 GB |
| Compression | ~1.4x vs bf16 |
| Excluded | lm_head |
| Producer | modelopt 0.37.0 |
| Calibration | 256 samples, CNN/DailyMail, max_seq_len=2048 |
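As a rough sanity check on the numbers above: NVFP4 stores each weight as a 4-bit float plus one 8-bit scale shared by each group of 16 weights (group_size=16), so quantized tensors cost about 4.5 bits per weight. The sketch below works that out; the full checkpoint is larger than this idealized figure because lm_head (and other non-quantized tensors) remain in higher precision.

```python
# Idealized NVFP4 storage cost. Assumption: one 8-bit scale per 16-weight
# group; per-tensor global scales are negligible and ignored here.
GROUP_SIZE = 16
BITS_PER_VALUE = 4        # FP4 payload per weight
BITS_PER_GROUP_SCALE = 8  # scale shared across the group

bits_per_weight = BITS_PER_VALUE + BITS_PER_GROUP_SCALE / GROUP_SIZE
print(bits_per_weight)  # 4.5

# Rough size of 9.5B parameters at this density, in GB (1 GB = 1e9 bytes)
params = 9.5e9
approx_gb = params * bits_per_weight / 8 / 1e9
print(round(approx_gb, 1))  # 5.3 -- quantized tensors alone
```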

## Architecture

Hybrid Gated DeltaNet + Gated Attention architecture with 32 layers in a repeating pattern of 3x linear_attention followed by 1x full_attention. Includes an MTP (Multi-Token Prediction) head.
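The repeating pattern can be made concrete with a short sketch (the layer-type names are illustrative labels, not necessarily the keys used in the actual model config):

```python
# Build the per-layer attention schedule: 3x linear_attention followed by
# 1x full_attention, repeated across all 32 layers.
NUM_LAYERS = 32
PATTERN = ["linear_attention"] * 3 + ["full_attention"]

layer_types = [PATTERN[i % len(PATTERN)] for i in range(NUM_LAYERS)]

print(layer_types.count("linear_attention"))  # 24
print(layer_types.count("full_attention"))    # 8
print(layer_types[:4])
# ['linear_attention', 'linear_attention', 'linear_attention', 'full_attention']
```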

## Usage with vLLM

```bash
vllm serve Qwen3.5-9B-Base-NVFP4 \
    --quantization modelopt \
    --language-model-only \
    --trust-remote-code \
    --gpu-memory-utilization 0.85
```

**Note:** `--language-model-only` is required because Qwen3.5 models use a `ForConditionalGeneration` (multimodal) architecture. The flag skips the vision encoder for text-only inference.
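Once the server is up, it exposes vLLM's OpenAI-compatible API (by default at http://localhost:8000/v1; the port and model name below are assumptions based on the serve command above). A minimal completions request, using only the standard library, since this is a base model rather than an instruct model:

```python
import json
import urllib.request

# Assumed endpoint from vLLM's default serve settings.
BASE_URL = "http://localhost:8000/v1"

payload = {
    "model": "Qwen3.5-9B-Base-NVFP4",  # matches the name passed to vllm serve
    "prompt": "The capital of France is",
    "max_tokens": 16,
    "temperature": 0.0,
}

req = urllib.request.Request(
    f"{BASE_URL}/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Uncomment to send the request against a running server:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["text"])
```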

## Quantization

Produced on NVIDIA DGX Spark (GB10 Grace Blackwell, 128GB unified memory):

```bash
python quantize_nvfp4.py \
    --model Qwen/Qwen3.5-9B-Base \
    --output Qwen3.5-9B-Base-NVFP4 \
    --calib-size 256
```
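The `quantize_nvfp4.py` script itself is not part of this repo. As a non-runnable sketch of the general shape such a script takes with the modelopt API (API names are from modelopt 0.37 as best understood; the calibration-loader details are assumptions, and running this requires a GPU plus the base checkpoint):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.5-9B-Base",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen3.5-9B-Base", trust_remote_code=True
)

# Calibration: 256 CNN/DailyMail samples truncated to 2048 tokens, per the
# table above. `calib_batches` is a hypothetical iterable of tokenized
# batches, assumed to be prepared elsewhere in the script.
def forward_loop(model):
    for batch in calib_batches:
        model(batch.to(model.device))

# NVFP4_DEFAULT_CFG applies group_size=16 microblock scaling; lm_head is
# typically left unquantized by the default config, matching the
# "Excluded: lm_head" entry above.
mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)

export_hf_checkpoint(model, export_dir="Qwen3.5-9B-Base-NVFP4")
```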