# Qwen3-VL-4B-Instruct-NVFP4
This is a quantized version of Qwen/Qwen3-VL-4B-Instruct using SmoothQuant + NVFP4 across all text linear layers.
## Quantization Strategy
| Component | Scheme | Details |
|---|---|---|
| All Text Linear Layers | NVFP4 | W4A4 (runs as W4A16 via the Marlin kernel on Ampere GPUs, which lack FP4 hardware support) |
| Vision Encoder | BF16 (unquantized) | Full precision for visual understanding |
| LM Head | BF16 (unquantized) | Full precision for output quality |
## SmoothQuant
Applied with smoothing strength 0.8 to shift activation outliers into the weights before quantization (see the recipe sketch after this list):
- Q/K/V projections ← input_layernorm
- Gate/Up projections ← post_attention_layernorm
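For reference, a minimal llm-compressor recipe matching the setup described above might look like the sketch below. This is an illustration, not the exact recipe used to produce this checkpoint; the regex targets and ignore patterns are assumptions based on the Qwen-VL layer naming used in llm-compressor's multimodal examples.

```python
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.modifiers.quantization import QuantizationModifier

# Sketch of the quantization recipe described above (assumed, not the exact one used).
recipe = [
    # Migrate activation outliers into the preceding norm/weights (strength 0.8).
    SmoothQuantModifier(
        smoothing_strength=0.8,
        mappings=[
            [["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"], "re:.*input_layernorm"],
            [["re:.*gate_proj", "re:.*up_proj"], "re:.*post_attention_layernorm"],
        ],
    ),
    # NVFP4 (W4A4) on all text linear layers; keep the vision tower and lm_head in BF16.
    QuantizationModifier(
        targets="Linear",
        scheme="NVFP4",
        ignore=["lm_head", "re:visual.*"],
    ),
]
```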
## Model Details
- Base Model: Qwen/Qwen3-VL-4B-Instruct (4.4B parameters)
- Quantization Method: compressed-tensors (llm-compressor)
- Model Size: ~4.1 GB (reduced from ~8.9 GB BF16, ~54% compression)
- Calibration: 512 samples from flickr30k, max_seq_length=2048
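A hedged sketch of the calibration call, assuming llm-compressor's `oneshot` entry point and assuming `flickr30k` is available as a registered calibration dataset (the actual script may instead load the dataset and a multimodal data collator manually):

```python
from llmcompressor import oneshot

# Sketch: one-shot calibration matching the parameters listed above.
# `recipe` is the modifier list from the previous sketch; all names here are assumptions.
oneshot(
    model="Qwen/Qwen3-VL-4B-Instruct",
    dataset="flickr30k",              # 512 image-caption calibration samples
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
    output_dir="Qwen3-VL-4B-Instruct-NVFP4",
)
```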
## Usage with vLLM
```bash
vllm serve JEILDLWLRMA/Qwen3-VL-4B-Instruct-NVFP4 \
  --quantization compressed-tensors \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.9 \
  --max-model-len 8192
```
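Once the server is up, you can query it through the OpenAI-compatible chat API. A minimal sketch, assuming the default port 8000; the image URL is just a placeholder:

```python
from openai import OpenAI

# vLLM exposes an OpenAI-compatible endpoint; no real API key is needed.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="JEILDLWLRMA/Qwen3-VL-4B-Instruct-NVFP4",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
            {"type": "text", "text": "Describe this image."},
        ],
    }],
)
print(response.choices[0].message.content)
```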
Or run offline inference from Python:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="JEILDLWLRMA/Qwen3-VL-4B-Instruct-NVFP4",
    quantization="compressed-tensors",
    trust_remote_code=True,
    kv_cache_dtype="fp8",
    max_model_len=8192,
)

sampling_params = SamplingParams(temperature=0.7, top_p=0.95)
prompts = ["Your prompt here"]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```
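Since this is a vision-language model, you will usually want to pass an image as well. One way is vLLM's chat interface, which applies the model's chat template for you; a sketch reusing the `llm` object above, with a placeholder image URL:

```python
# Multimodal offline inference via vLLM's chat API.
messages = [{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
        {"type": "text", "text": "What is in this image?"},
    ],
}]
outputs = llm.chat(messages, sampling_params)
print(outputs[0].outputs[0].text)
```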
## Usage with Transformers
```python
import torch
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration

model_id = "JEILDLWLRMA/Qwen3-VL-4B-Instruct-NVFP4"
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
```
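A minimal generation sketch using the processor's chat template (the exact message schema can vary across transformers versions; the image URL is a placeholder):

```python
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/cat.jpg"},
        {"type": "text", "text": "Describe this image."},
    ],
}]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=128)
# Strip the prompt tokens before decoding.
print(processor.decode(generated[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```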
## License
Apache 2.0, same as the base model.