# Qwen3-VL-2B-Instruct — OpenVINO NF4 (Intel NPU)

Qwen/Qwen3-VL-2B-Instruct exported to OpenVINO IR with NF4 channel-wise weight compression for Intel NPU inference.

> ⚠️ NF4 requires an Intel® Core™ Ultra Series 2 (Lunar Lake) NPU or newer.
> For older NPU hardware, use an INT4 channel-wise model instead.

## Quantization details

| Property | Value |
|---|---|
| Weight format | `nf4` |
| Quantization mode | channel-wise (`--group-size -1`) |
| Symmetry | symmetric (`--sym`) |
| 4-bit ratio | 1.0 (100% of eligible layers) |
| Tool | `optimum-intel` + NNCF |

## NNCF bitwidth distribution

| Component | Mode |
|---|---|
| Language model backbone (196 layers) | `nf4`, per-channel |
| Embeddings / LM head (1 layer) | `int8_asym`, per-channel |
| Vision encoder (104 layers) | `int8_sym`, per-channel |
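For intuition: NF4 stores each weight as one of 16 fixed "normal-float" levels plus a per-channel scale. Below is a minimal numpy sketch of symmetric per-channel NF4 rounding, using the 16-level codebook from the QLoRA paper. It illustrates the arithmetic only and is not NNCF's actual implementation:

```python
import numpy as np

# The 16 NF4 levels (normal-float codebook from the QLoRA paper).
NF4_LEVELS = np.array([
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
])

def nf4_quantize_channelwise(w: np.ndarray) -> np.ndarray:
    """Symmetric per-channel NF4: scale each output channel (row) by its
    absolute maximum, snap to the nearest NF4 level, then rescale."""
    scales = np.abs(w).max(axis=1, keepdims=True)      # one scale per channel
    scales = np.where(scales == 0, 1.0, scales)        # guard all-zero rows
    normalized = w / scales                            # now in [-1, 1]
    idx = np.abs(normalized[..., None] - NF4_LEVELS).argmin(axis=-1)
    return NF4_LEVELS[idx] * scales                    # dequantized weights
```

Because `--group-size -1` uses a single scale per output channel (rather than per group of 64/128 weights), no group-scale metadata is stored, which is what makes this layout NPU-friendly.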

## Installation

```shell
pip install openvino-genai openvino openvino-tokenizers
```

## Usage

```python
import openvino_genai as ov_genai

pipe = ov_genai.LLMPipeline("Qwen3-VL-2B-Instruct-ov-nf4", "NPU")
print(pipe.generate("Hello!", max_new_tokens=200))

# Optional: tune the NPU runtime for best throughput
pipeline_config = {
    "MAX_PROMPT_LEN": 1024,
    "MIN_RESPONSE_LEN": 256,
    "GENERATE_HINT": "BEST_PERF",
    "CACHE_DIR": ".npucache",
}
pipe = ov_genai.LLMPipeline("Qwen3-VL-2B-Instruct-ov-nf4", "NPU", pipeline_config)
```
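Since Qwen3-VL is a vision-language model, image inputs go through `openvino_genai.VLMPipeline` rather than `LLMPipeline`. A minimal sketch, assuming a local image file `example.jpg` (the filename is illustrative):

```python
import numpy as np
import openvino as ov
import openvino_genai as ov_genai
from PIL import Image

# VLMPipeline drives both the vision encoder and the language model
pipe = ov_genai.VLMPipeline("Qwen3-VL-2B-Instruct-ov-nf4", "NPU")

# Images are passed as an openvino.Tensor built from an HWC uint8 array
image = ov.Tensor(np.array(Image.open("example.jpg").convert("RGB")))

print(pipe.generate("Describe this image.", image=image, max_new_tokens=128))
```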

## Export command

```shell
optimum-cli export openvino \
    --model Qwen/Qwen3-VL-2B-Instruct \
    --trust-remote-code \
    --weight-format nf4 \
    --sym \
    --ratio 1.0 \
    --group-size -1 \
    Qwen3-VL-2B-Instruct-ov-nf4
```
