Qwen3-VL-2B-Instruct for RKLLM v1.2.3 (RK3588 NPU)

Pre-converted Qwen3-VL-2B-Instruct for the Rockchip RK3588 NPU using rknn-llm runtime v1.2.3.

Runs on Orange Pi 5 Plus, Rock 5B, Radxa NX5, and other RK3588-based SBCs with 8GB+ RAM.

Files

| File | Size | Description |
|------|------|-------------|
| qwen3-vl-2b-instruct_w8a8_rk3588.rkllm | 2.3 GB | LLM decoder (W8A8 quantized) — shared by all vision resolutions |
| qwen3-vl-2b_vision_448_rk3588.rknn | 812 MB | Vision encoder @ 448×448 (default, 196 tokens) |
| qwen3-vl-2b_vision_672_rk3588.rknn | 854 MB | Vision encoder @ 672×672 (441 tokens) ⭐ Recommended |
| qwen3-vl-2b_vision_896_rk3588.rknn | 923 MB | Vision encoder @ 896×896 (784 tokens) |

Choosing a Vision Encoder Resolution

The LLM decoder (.rkllm) is resolution-independent — only the vision encoder (.rknn) changes. Place exactly one .rknn file alongside the .rkllm in your model directory, and rename any alternatives to .rknn.alt to disable them.
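Because exactly one .rknn file should be active at a time, it can be handy to check which encoder a model directory will load. A minimal sketch (the `active_vision_encoder` helper and its filename parsing are my own convenience; the API server itself reads the resolution from the .rknn file's input tensor attributes, not the filename):

```python
from pathlib import Path
import re

def active_vision_encoder(model_dir):
    """Return the name and resolution of the single active .rknn encoder.

    Files renamed to .rknn.alt are ignored, matching the activation
    convention this model card uses.
    """
    encoders = sorted(Path(model_dir).glob("*.rknn"))  # .rknn.alt does not match
    if len(encoders) != 1:
        raise RuntimeError(f"expected exactly one active .rknn, found {len(encoders)}")
    match = re.search(r"vision_(\d+)_", encoders[0].name)
    resolution = int(match.group(1)) if match else None
    return encoders[0].name, resolution
```

Run against the directory layout shown under Quick Start, this reports the 672 encoder as active.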

| Resolution | Visual Tokens | Encoder Time* | Total Response* | Best For |
|------------|---------------|---------------|-----------------|----------|
| 448×448 | 196 (14×14) | ~2 s | ~5–10 s | General scene description, fast responses |
| 672×672 ⭐ | 441 (21×21) | ~4 s | ~9–11 s | Balanced: good detail + reasonable speed |
| 896×896 | 784 (28×28) | ~12 s | ~25–28 s | Maximum detail, fine text/OCR tasks |

*Measured on Orange Pi 5 Plus (16GB) with 14MB JPEG input, single image.

Resolution Math

Qwen3-VL uses patch_size=16 and merge_size=2, so:

  • Resolution must be divisible by 32 (16 × 2)
  • Visual tokens = (height/32) × (width/32) — 196 / 441 / 784 for square inputs of 448 / 672 / 896

Higher resolution = more visual tokens = better fine detail but:

  • Proportionally more NPU compute for the vision encoder
  • More tokens for the LLM to process (longer prefill)
  • Same decode speed (~15 tok/s) — only "time to first token" increases
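The token arithmetic above can be sketched directly from patch_size=16 and merge_size=2:

```python
def visual_tokens(height, width, patch_size=16, merge_size=2):
    """Number of visual tokens Qwen3-VL produces for a given input resolution."""
    stride = patch_size * merge_size  # each token covers a 32x32 pixel region
    if height % stride or width % stride:
        raise ValueError(f"resolution must be divisible by {stride}")
    return (height // stride) * (width // stride)

for res in (448, 672, 896):
    print(f"{res}x{res}: {visual_tokens(res, res)} tokens")
```

This reproduces the 196 / 441 / 784 counts in the table above.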

Quick Start

Directory Structure

~/models/Qwen3-VL-2b/
    qwen3-vl-2b-instruct_w8a8_rk3588.rkllm   # LLM decoder (always needed)
    qwen3-vl-2b_vision_672_rk3588.rknn        # Active vision encoder
    qwen3-vl-2b_vision_448_rk3588.rknn.alt    # Alternative (inactive)
    qwen3-vl-2b_vision_896_rk3588.rknn.alt    # Alternative (inactive)

Switching Resolution

To switch to a different resolution, rename the files:

cd ~/models/Qwen3-VL-2b/

# Deactivate current encoder
mv qwen3-vl-2b_vision_672_rk3588.rknn qwen3-vl-2b_vision_672_rk3588.rknn.alt

# Activate the 896 encoder
mv qwen3-vl-2b_vision_896_rk3588.rknn.alt qwen3-vl-2b_vision_896_rk3588.rknn

# Restart your API server
sudo systemctl restart rkllm-api
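The same rename dance can be scripted. A sketch (the `switch_encoder` helper is hypothetical, not part of this repo; it only performs the .rknn / .rknn.alt renames shown above, and you still need to restart the server afterwards):

```python
from pathlib import Path

def switch_encoder(model_dir, target_res):
    """Deactivate the current .rknn encoder and activate the one for target_res."""
    model_dir = Path(model_dir)
    target = next(model_dir.glob(f"*vision_{target_res}_*.rknn.alt"), None)
    if target is None:
        raise FileNotFoundError(f"no inactive encoder for {target_res}x{target_res}")
    for active in list(model_dir.glob("*.rknn")):      # deactivate current encoder(s)
        active.rename(active.parent / (active.name + ".alt"))
    target.rename(target.parent / target.name[:-len(".alt")])  # activate target
```

For example, `switch_encoder(Path.home() / "models/Qwen3-VL-2b", 896)` swaps from the 672 to the 896 encoder.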

Using with RKLLM API Server

This model is designed for use with the RKLLM API Server, which provides an OpenAI-compatible API for RK3588 NPU inference. The server auto-detects the vision encoder resolution from the .rknn file's input tensor attributes.
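Since the server exposes an OpenAI-compatible API, a request can be built with only the standard library. A sketch, assuming the server listens on localhost:8080 with a `/v1/chat/completions` endpoint and accepts the model name `qwen3-vl-2b-instruct` (check your server's actual port, route, and model id):

```python
import base64
import json
import urllib.request

def build_vl_request(image_bytes, prompt, model="qwen3-vl-2b-instruct"):
    """Build an OpenAI-style chat payload with an inline base64 image."""
    b64 = base64.b64encode(image_bytes).decode()
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    }

def describe_image(image_path, server="http://localhost:8080"):
    """POST the request and return the model's reply text."""
    with open(image_path, "rb") as f:
        payload = build_vl_request(f.read(), "Describe this image.")
    req = urllib.request.Request(
        f"{server}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

With the 672 encoder active, expect roughly the 9–11 s total response times from the table above for a single image.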

Export Details

LLM Decoder

  • Source: Qwen/Qwen3-VL-2B-Instruct
  • Quantization: W8A8 (8-bit weights, 8-bit activations)
  • Tool: rkllm-toolkit v1.2.3
  • Context: 4096 tokens

Vision Encoders

  • Source: Qwen3-VL-2B-Instruct visual encoder weights
  • Export pipeline: HuggingFace model → ONNX (export_vision.py) → RKNN (export_vision_rknn.py)
  • Tool: rknn-toolkit2 v2.3.2
  • Precision: FP32 (no quantization — vision encoder quality is critical)
  • Target: rk3588

The 448 encoder was converted with default settings from rknn-llm. The 672 and 896 encoders were re-exported with custom --height and --width flags to export_vision.py and export_vision_rknn.py from the rknn-llm multimodal demo.

Re-exporting at a Custom Resolution

To export the vision encoder at a different resolution (must be divisible by 32):

# Activate the export environment
source ~/rkllm-env/bin/activate
cd ~/rknn-llm/examples/multimodal_model_demo

# Step 1: Export HuggingFace model to ONNX
python3 export/export_vision.py \
  --path ~/models-hf/Qwen3-VL-2B-Instruct \
  --model_name qwen3-vl \
  --height 672 --width 672 \
  --device cpu

# Step 2: Convert ONNX to RKNN
python3 export/export_vision_rknn.py \
  --path ./onnx/qwen3-vl_vision.onnx \
  --model_name qwen3-vl \
  --target-platform rk3588 \
  --height 672 --width 672

Memory requirements: ~20 GB RAM (or swap) for 672×672, ~35 GB for 896×896. CPU-only export works fine (no GPU needed).

Dependencies (in a Python 3.10 venv):

  • rknn-toolkit2 >= 2.3.2
  • torch == 2.4.0
  • transformers >= 4.57.0
  • onnx >= 1.18.0

Performance Benchmarks

Tested on Orange Pi 5 Plus (16GB RAM), RK3588 SoC, RKNPU driver 0.9.8:

| Metric | 448×448 | 672×672 | 896×896 |
|--------|---------|---------|---------|
| Vision encode time | ~2 s | ~4 s | ~12 s |
| Total VL response | 5–10 s | 9–11 s | 25–28 s |
| Text-only decode | ~15 tok/s | ~15 tok/s | ~15 tok/s |
| Peak RAM (VL inference) | ~5.5 GB | ~6.5 GB | ~8.5 GB |
| RKNN file size | 812 MB | 854 MB | 923 MB |

Known Limitations

  • OCR accuracy: The 2B-parameter LLM is the bottleneck for OCR tasks, not the vision encoder resolution. Higher resolution helps with fine detail but the model may still misread characters.
  • Fixed resolution: Each .rknn file is compiled for a specific input resolution. Images are automatically resized (with aspect-ratio-preserving padding) to match. There is no dynamic resolution switching within a single model file.
  • REGTASK warnings: The 672 and 896 encoders produce "bit width exceeds limit" register-field warnings during RKNN conversion. These are cosmetic in rknn-toolkit2 v2.3.2 and do not affect runtime inference on the RK3588.
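The aspect-ratio-preserving resize mentioned above amounts to standard letterbox geometry. A sketch of the arithmetic only (the `letterbox_geometry` helper is illustrative; the server's actual pad colour and placement may differ):

```python
def letterbox_geometry(src_w, src_h, target):
    """Compute the scaled size and centered padding needed to fit an image
    into the encoder's fixed square input without distorting aspect ratio."""
    scale = min(target / src_w, target / src_h)
    new_w, new_h = round(src_w * scale), round(src_h * scale)
    pad_x, pad_y = (target - new_w) // 2, (target - new_h) // 2
    return new_w, new_h, pad_x, pad_y
```

For a 1920×1080 photo and the 672 encoder, the image is scaled to 672×378 and padded with 147-pixel bands top and bottom.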

License

Apache 2.0, inherited from Qwen3-VL-2B-Instruct.
