# Qwen3-VL-2B-Instruct — OpenVINO NF4 (Intel NPU)

Qwen/Qwen3-VL-2B-Instruct exported to OpenVINO IR with NF4 channel-wise weight compression for Intel NPU inference.

> ⚠️ NF4 requires an Intel® Core™ Ultra Series 2 (Lunar Lake) NPU or newer.
> For older NPU hardware, use an INT4 channel-wise model instead.

## Quantization details

| Property | Value |
|---|---|
| Weight format | `nf4` |
| Quantization mode | channel-wise (`--group-size -1`) |
| Symmetry | symmetric (`--sym`) |
| 4-bit ratio | 1.0 (100% of eligible layers) |
| Tool | `optimum-intel` + NNCF |

## NNCF bitwidth distribution

| Component | Mode |
|---|---|
| Language model backbone (196 layers) | `nf4`, per-channel |
| Embeddings / LM head (1 layer) | `int8_asym`, per-channel |
| Vision encoder (104 layers) | `int8_sym`, per-channel |
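For intuition: NF4 stores each weight as one of 16 fixed "normal-float" levels plus a per-channel scale. Below is a minimal numpy sketch of symmetric per-channel NF4 rounding, using the 16-level codebook from the QLoRA paper. It illustrates the arithmetic only and is not NNCF's actual implementation:

```python
import numpy as np

# The 16 NF4 levels (normal-float codebook from the QLoRA paper).
NF4_LEVELS = np.array([
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
])

def nf4_quantize_channelwise(w: np.ndarray) -> np.ndarray:
    """Symmetric per-channel NF4: scale each output channel (row) by its
    absolute maximum, snap to the nearest NF4 level, then rescale."""
    scales = np.abs(w).max(axis=1, keepdims=True)      # one scale per channel
    scales = np.where(scales == 0, 1.0, scales)        # guard all-zero rows
    normalized = w / scales                            # now in [-1, 1]
    idx = np.abs(normalized[..., None] - NF4_LEVELS).argmin(axis=-1)
    return NF4_LEVELS[idx] * scales                    # dequantized weights
```

Because `--group-size -1` uses a single scale per output channel (rather than per group of 64/128 weights), no group-scale metadata is stored, which is what makes this layout NPU-friendly.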

## Installation

```shell
pip install openvino-genai openvino openvino-tokenizers
```

## Usage

```python
import openvino_genai as ov_genai

pipe = ov_genai.LLMPipeline("Qwen3-VL-2B-Instruct-ov-nf4", "NPU")
print(pipe.generate("Hello!", max_new_tokens=200))

# Optional: tune the NPU runtime for best throughput
pipeline_config = {
    "MAX_PROMPT_LEN": 1024,
    "MIN_RESPONSE_LEN": 256,
    "GENERATE_HINT": "BEST_PERF",
    "CACHE_DIR": ".npucache",
}
pipe = ov_genai.LLMPipeline("Qwen3-VL-2B-Instruct-ov-nf4", "NPU", pipeline_config)
```
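Since Qwen3-VL is a vision-language model, image inputs go through `openvino_genai.VLMPipeline` rather than `LLMPipeline`. A minimal sketch, assuming a local image file `example.jpg` (the filename is illustrative):

```python
import numpy as np
import openvino as ov
import openvino_genai as ov_genai
from PIL import Image

# VLMPipeline drives both the vision encoder and the language model
pipe = ov_genai.VLMPipeline("Qwen3-VL-2B-Instruct-ov-nf4", "NPU")

# Images are passed as an openvino.Tensor built from an HWC uint8 array
image = ov.Tensor(np.array(Image.open("example.jpg").convert("RGB")))

print(pipe.generate("Describe this image.", image=image, max_new_tokens=128))
```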

## Export command

```shell
optimum-cli export openvino \
    --model Qwen/Qwen3-VL-2B-Instruct \
    --trust-remote-code \
    --weight-format nf4 \
    --sym \
    --ratio 1.0 \
    --group-size -1 \
    Qwen3-VL-2B-Instruct-ov-nf4
```
