Qwen3-VL-4B-Instruct-NVFP4

This is a quantized version of Qwen/Qwen3-VL-4B-Instruct, produced by applying SmoothQuant followed by NVFP4 quantization to all text linear layers.

Quantization Strategy

Component                Scheme               Details
All text linear layers   NVFP4 (W4A4)         Runs as W4A16 via the Marlin kernel on Ampere GPUs
Vision encoder           BF16 (unquantized)   Full precision for visual understanding
LM head                  BF16 (unquantized)   Full precision for output quality

SmoothQuant

Applied with smoothing strength 0.8 to shift activation outliers into the weights before quantization:

  • Q/K/V projections ← input_layernorm
  • Gate/Up projections ← post_attention_layernorm
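SmoothQuant divides each activation channel by a per-channel factor s_j = max|X_j|^α / max|W_j|^(1−α) and multiplies the matching weight column by the same factor, so the layer's output is mathematically unchanged while activation outliers shrink. A minimal numpy sketch of the idea (α = 0.8 as above; shapes and names here are illustrative, not the llm-compressor implementation):

```python
import numpy as np

def smooth(X, W, alpha=0.8):
    """Migrate activation outliers into the weights, per input channel.
    X: (tokens, in_features) activations; W: (out_features, in_features) weights."""
    act_max = np.abs(X).max(axis=0)                       # per-channel activation range
    w_max = np.clip(np.abs(W).max(axis=0), 1e-8, None)    # per-channel weight range
    s = np.clip(act_max ** alpha / w_max ** (1 - alpha), 1e-8, None)
    return X / s, W * s                                   # X @ W.T is preserved

rng = np.random.default_rng(0)
# Channels 1 and 7 are simulated outlier channels.
X = rng.standard_normal((32, 8)) * np.array([1, 10, 1, 1, 1, 1, 1, 100.0])
W = rng.standard_normal((16, 8))
Xs, Ws = smooth(X, W, alpha=0.8)
```

After smoothing, the activation dynamic range is much flatter, which is what makes the subsequent 4-bit activation quantization tolerable.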

Model Details

  • Base Model: Qwen/Qwen3-VL-4B-Instruct (4.4B parameters)
  • Quantization Method: compressed-tensors (llm-compressor)
  • Model Size: ~4.1 GB (down from ~8.9 GB in BF16, a ~54% size reduction)
  • Calibration: 512 samples from flickr30k, max_seq_length=2048
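NVFP4 stores weights as 4-bit E2M1 floating-point values with a shared scale per 16-element block (the block scales themselves are kept in FP8). A rough numpy sketch of the fake-quantization round trip, to illustrate the format; the FP8 storage of the scales and the actual kernel layout are omitted:

```python
import numpy as np

# Non-negative E2M1 code points; the signed grid mirrors them.
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
GRID = np.concatenate([-E2M1[::-1], E2M1])

def nvfp4_fakequant(w, block=16):
    """Quantize a 1-D tensor to NVFP4 and dequantize it back."""
    out_shape = w.shape
    w = w.reshape(-1, block)
    scale = np.abs(w).max(axis=1, keepdims=True) / 6.0   # 6 = max E2M1 magnitude
    scale = np.where(scale == 0, 1.0, scale)
    idx = np.abs((w / scale)[..., None] - GRID).argmin(-1)  # round to nearest code
    return (GRID[idx] * scale).reshape(out_shape)

rng = np.random.default_rng(0)
w = rng.standard_normal(64).astype(np.float32)
wq = nvfp4_fakequant(w)
```

Because the per-block scale maps the block maximum onto ±6 exactly, the worst-case rounding error within a block is bounded by scale × 1 (half the largest gap in the E2M1 grid, between 4 and 6).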

Usage with vLLM

Serve an OpenAI-compatible endpoint:

vllm serve JEILDLWLRMA/Qwen3-VL-4B-Instruct-NVFP4 \
    --quantization compressed-tensors \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 8192

Or run offline inference with the Python API:

from vllm import LLM, SamplingParams

llm = LLM(
    model="JEILDLWLRMA/Qwen3-VL-4B-Instruct-NVFP4",
    quantization="compressed-tensors",
    trust_remote_code=True,
    kv_cache_dtype="fp8",
    max_model_len=8192,
)

sampling_params = SamplingParams(temperature=0.7, top_p=0.95)
prompts = ["Your prompt here"]
outputs = llm.generate(prompts, sampling_params)
print(outputs[0].outputs[0].text)

Usage with Transformers

from transformers import AutoProcessor, Qwen3VLForConditionalGeneration
import torch

# Loading the quantized checkpoint requires the compressed-tensors package
# (pip install compressed-tensors).
model_id = "JEILDLWLRMA/Qwen3-VL-4B-Instruct-NVFP4"
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

License

Apache 2.0, same as the base model.
