# Qwen3-VL-4B-Instruct-NVFP4
This is a quantized version of Qwen/Qwen3-VL-4B-Instruct using SmoothQuant + NVFP4 across all text linear layers.
## Quantization Strategy
| Component | Scheme | Details |
|---|---|---|
| All Text Linear Layers | NVFP4 | W4A4 (runs as W4A16 via the Marlin kernel on Ampere GPUs, which lack FP4 hardware support) |
| Vision Encoder | BF16 (unquantized) | Full precision for visual understanding |
| LM Head | BF16 (unquantized) | Full precision for output quality |
## SmoothQuant
Applied with smoothing strength 0.8 to shift activation outliers into the weights before quantization (see the recipe sketch after this list):
- Q/K/V projections ← input_layernorm
- Gate/Up projections ← post_attention_layernorm
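For reference, a minimal llm-compressor recipe matching the setup described above might look like the sketch below. This is an illustration, not the exact recipe used to produce this checkpoint; the regex targets and ignore patterns are assumptions based on the Qwen-VL layer naming used in llm-compressor's multimodal examples.

```python
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.modifiers.quantization import QuantizationModifier

# Sketch of the quantization recipe described above (assumed, not the exact one used).
recipe = [
    # Migrate activation outliers into the preceding norm/weights (strength 0.8).
    SmoothQuantModifier(
        smoothing_strength=0.8,
        mappings=[
            [["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"], "re:.*input_layernorm"],
            [["re:.*gate_proj", "re:.*up_proj"], "re:.*post_attention_layernorm"],
        ],
    ),
    # NVFP4 (W4A4) on all text linear layers; keep the vision tower and lm_head in BF16.
    QuantizationModifier(
        targets="Linear",
        scheme="NVFP4",
        ignore=["lm_head", "re:visual.*"],
    ),
]
```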
## Model Details
- Base Model: Qwen/Qwen3-VL-4B-Instruct (4.4B parameters)
- Quantization Method: compressed-tensors (llm-compressor)
- Model Size: ~4.1 GB (reduced from ~8.9 GB BF16, ~54% compression)
- Calibration: 512 samples from flickr30k, max_seq_length=2048
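A hedged sketch of the calibration call, assuming llm-compressor's `oneshot` entry point and assuming `flickr30k` is available as a registered calibration dataset (the actual script may instead load the dataset and a multimodal data collator manually):

```python
from llmcompressor import oneshot

# Sketch: one-shot calibration matching the parameters listed above.
# `recipe` is the modifier list from the previous sketch; all names here are assumptions.
oneshot(
    model="Qwen/Qwen3-VL-4B-Instruct",
    dataset="flickr30k",              # 512 image-caption calibration samples
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
    output_dir="Qwen3-VL-4B-Instruct-NVFP4",
)
```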
## Usage with vLLM
```bash
vllm serve JEILDLWLRMA/Qwen3-VL-4B-Instruct-NVFP4 \
  --quantization compressed-tensors \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.9 \
  --max-model-len 8192
```
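Once the server is up, you can query it through the OpenAI-compatible chat API. A minimal sketch, assuming the default port 8000; the image URL is just a placeholder:

```python
from openai import OpenAI

# vLLM exposes an OpenAI-compatible endpoint; no real API key is needed.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="JEILDLWLRMA/Qwen3-VL-4B-Instruct-NVFP4",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
            {"type": "text", "text": "Describe this image."},
        ],
    }],
)
print(response.choices[0].message.content)
```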
Or run offline inference from Python:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="JEILDLWLRMA/Qwen3-VL-4B-Instruct-NVFP4",
    quantization="compressed-tensors",
    trust_remote_code=True,
    kv_cache_dtype="fp8",
    max_model_len=8192,
)

sampling_params = SamplingParams(temperature=0.7, top_p=0.95)
prompts = ["Your prompt here"]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```
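Since this is a vision-language model, you will usually want to pass an image as well. One way is vLLM's chat interface, which applies the model's chat template for you; a sketch reusing the `llm` object above, with a placeholder image URL:

```python
# Multimodal offline inference via vLLM's chat API.
messages = [{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
        {"type": "text", "text": "What is in this image?"},
    ],
}]
outputs = llm.chat(messages, sampling_params)
print(outputs[0].outputs[0].text)
```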
## Usage with Transformers
```python
import torch
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration

model_id = "JEILDLWLRMA/Qwen3-VL-4B-Instruct-NVFP4"
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
```
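A minimal generation sketch using the processor's chat template (the exact message schema can vary across transformers versions; the image URL is a placeholder):

```python
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/cat.jpg"},
        {"type": "text", "text": "Describe this image."},
    ],
}]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=128)
# Strip the prompt tokens before decoding.
print(processor.decode(generated[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```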
## License
Apache 2.0, same as the base model.