---
language:
- en
- zh
license: apache-2.0
library_name: rkllm
tags:
- rkllm
- rknn
- rk3588
- npu
- qwen3-vl
- vision-language
- orange-pi
- edge-ai
- ocr
base_model: Qwen/Qwen3-VL-2B-Instruct
pipeline_tag: image-text-to-text
---

# Qwen3-VL-2B-Instruct for RKLLM v1.2.3 (RK3588 NPU)

Pre-converted [Qwen3-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct) for the **Rockchip RK3588 NPU** using the [rknn-llm](https://github.com/airockchip/rknn-llm) runtime v1.2.3. Runs on **Orange Pi 5 Plus**, **Rock 5B**, **Radxa NX5**, and other RK3588-based SBCs with 8 GB+ RAM.

## Files

| File | Size | Description |
|---|---|---|
| `qwen3-vl-2b-instruct_w8a8_rk3588.rkllm` | 2.3 GB | LLM decoder (W8A8 quantized) — shared by all vision resolutions |
| `qwen3-vl-2b_vision_448_rk3588.rknn` | 812 MB | Vision encoder @ 448×448 (default, 196 tokens) |
| `qwen3-vl-2b_vision_672_rk3588.rknn` | 854 MB | Vision encoder @ 672×672 (441 tokens) ⭐ **Recommended** |
| `qwen3-vl-2b_vision_896_rk3588.rknn` | 923 MB | Vision encoder @ 896×896 (784 tokens) |

## Choosing a Vision Encoder Resolution

The LLM decoder (`.rkllm`) is resolution-independent — only the vision encoder (`.rknn`) changes. Place **one** `.rknn` file alongside the `.rkllm` in your model directory, or rename alternatives to `.rknn.alt` to disable them.

| Resolution | Visual Tokens | Encoder Time* | Total Response* | Best For |
|---|---|---|---|---|
| **448×448** | 196 (14×14) | ~2 s | ~5–10 s | General scene description, fast responses |
| **672×672** ⭐ | 441 (21×21) | ~4 s | ~9–11 s | **Balanced: good detail + reasonable speed** |
| **896×896** | 784 (28×28) | ~12 s | ~25–28 s | Maximum detail, fine text/OCR tasks |

\*Measured on an Orange Pi 5 Plus (16 GB) with a 14 MB JPEG input, single image.
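The visual token counts in the table above follow directly from Qwen3-VL's patch geometry (a 16-pixel patch merged 2×2, i.e. one token per 32×32 pixel tile). A minimal sketch for checking a candidate resolution:

```python
# One visual token per 32x32 pixel tile:
# 32 = patch_size (16) * merge_size (2) for Qwen3-VL.
EFFECTIVE_PATCH = 16 * 2

def visual_tokens(side: int) -> int:
    """Token count for a side x side input; side must be divisible by 32."""
    if side % EFFECTIVE_PATCH != 0:
        raise ValueError(f"{side} is not divisible by {EFFECTIVE_PATCH}")
    grid = side // EFFECTIVE_PATCH  # tokens per edge of the merged patch grid
    return grid * grid

for side in (448, 672, 896):
    print(side, visual_tokens(side))  # 448 -> 196, 672 -> 441, 896 -> 784
```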
### Resolution Math

Qwen3-VL uses `patch_size=16` and `merge_size=2`, so:

- Resolution must be **divisible by 32** (16 × 2)
- Visual tokens = (height/32)² = 196 / 441 / 784 for 448 / 672 / 896

Higher resolution = more visual tokens = better fine detail, but:

- Proportionally more NPU compute for the vision encoder
- More tokens for the LLM to process (longer prefill)
- Same decode speed (~15 tok/s) — only the time to first token increases

## Quick Start

### Directory Structure

```
~/models/Qwen3-VL-2b/
  qwen3-vl-2b-instruct_w8a8_rk3588.rkllm    # LLM decoder (always needed)
  qwen3-vl-2b_vision_672_rk3588.rknn        # Active vision encoder
  qwen3-vl-2b_vision_448_rk3588.rknn.alt    # Alternative (inactive)
  qwen3-vl-2b_vision_896_rk3588.rknn.alt    # Alternative (inactive)
```

### Switching Resolution

To switch to a different resolution, rename the files:

```bash
cd ~/models/Qwen3-VL-2b/

# Deactivate the current encoder
mv qwen3-vl-2b_vision_672_rk3588.rknn qwen3-vl-2b_vision_672_rk3588.rknn.alt

# Activate the 896 encoder
mv qwen3-vl-2b_vision_896_rk3588.rknn.alt qwen3-vl-2b_vision_896_rk3588.rknn

# Restart your API server
sudo systemctl restart rkllm-api
```

### Using with RKLLM API Server

This model is designed for use with the [RKLLM API Server](https://github.com/jdacostap/rkllm-api), which provides an OpenAI-compatible API for RK3588 NPU inference. The server auto-detects the vision encoder resolution from the `.rknn` file's input tensor attributes.
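Since the server is OpenAI-compatible, you can talk to it with any OpenAI client or plain HTTP. A sketch of building the standard chat-completions vision payload with a base64 data URI — the host/port and model name are placeholders for your own setup, not values mandated by the server:

```python
import base64
import json

# Placeholder endpoint; adjust to wherever your rkllm-api instance listens.
API_URL = "http://localhost:8080/v1/chat/completions"

def build_vl_request(prompt: str, image_bytes: bytes,
                     model: str = "qwen3-vl-2b") -> dict:
    """Build an OpenAI-style vision request with an inline base64 image."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    }

# Fake JPEG bytes for illustration; in practice read your image file.
payload = build_vl_request("Describe this image.", b"\xff\xd8\xff")
print(json.dumps(payload)[:60])
```

POST this payload to the endpoint with `curl`, `requests`, or the `openai` client pointed at the server's base URL.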
## Export Details

### LLM Decoder

- **Source**: [Qwen/Qwen3-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct)
- **Quantization**: W8A8 (8-bit weights, 8-bit activations)
- **Tool**: rkllm-toolkit v1.2.3
- **Context**: 4096 tokens

### Vision Encoders

- **Source**: Qwen3-VL-2B-Instruct visual encoder weights
- **Export pipeline**: HuggingFace model → ONNX (`export_vision.py`) → RKNN (`export_vision_rknn.py`)
- **Tool**: rknn-toolkit2 v2.3.2
- **Precision**: FP32 (no quantization — vision encoder quality is critical)
- **Target**: rk3588

The 448 encoder was converted with the default settings from rknn-llm. The 672 and 896 encoders were re-exported by passing custom `--height` and `--width` flags to `export_vision.py` and `export_vision_rknn.py` from the [rknn-llm multimodal demo](https://github.com/airockchip/rknn-llm/tree/main/examples/multimodal_model_demo/export).

### Re-exporting at a Custom Resolution

To export the vision encoder at a different resolution (must be divisible by 32):

```bash
# Activate the export environment
source ~/rkllm-env/bin/activate
cd ~/rknn-llm/examples/multimodal_model_demo

# Step 1: Export the HuggingFace model to ONNX
python3 export/export_vision.py \
    --path ~/models-hf/Qwen3-VL-2B-Instruct \
    --model_name qwen3-vl \
    --height 672 --width 672 \
    --device cpu

# Step 2: Convert ONNX to RKNN
python3 export/export_vision_rknn.py \
    --path ./onnx/qwen3-vl_vision.onnx \
    --model_name qwen3-vl \
    --target-platform rk3588 \
    --height 672 --width 672
```

**Memory requirements**: ~20 GB RAM (or swap) for 672×672, ~35 GB for 896×896. CPU-only export works fine (no GPU needed).
**Dependencies** (in a Python 3.10 venv):

- `rknn-toolkit2 >= 2.3.2`
- `torch == 2.4.0`
- `transformers >= 4.57.0`
- `onnx >= 1.18.0`

## Performance Benchmarks

Tested on an **Orange Pi 5 Plus (16 GB RAM)**, RK3588 SoC, RKNPU driver 0.9.8:

| Metric | 448×448 | 672×672 | 896×896 |
|---|---|---|---|
| Vision encode time | ~2 s | ~4 s | ~12 s |
| Total VL response | 5–10 s | 9–11 s | 25–28 s |
| Text-only decode | ~15 tok/s | ~15 tok/s | ~15 tok/s |
| Peak RAM (VL inference) | ~5.5 GB | ~6.5 GB | ~8.5 GB |
| RKNN file size | 812 MB | 854 MB | 923 MB |

## Known Limitations

- **OCR accuracy**: The 2B-parameter LLM is the bottleneck for OCR tasks, not the vision encoder resolution. Higher resolution helps with fine detail, but the model may still misread characters.
- **Fixed resolution**: Each `.rknn` file is compiled for a specific input resolution. Images are automatically resized (with aspect-ratio-preserving padding) to match. There is no dynamic resolution switching within a single model file.
- **REGTASK warnings**: The 672 and 896 encoders produce "bit width exceeds limit" register-field warnings during RKNN conversion. These are cosmetic in rknn-toolkit2 v2.3.2 and do not affect runtime inference on the RK3588.

## License

Apache 2.0, inherited from [Qwen3-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct).

## Credits

- **Model**: [Qwen Team](https://huggingface.co/Qwen) for Qwen3-VL-2B-Instruct
- **Runtime**: [Rockchip / airockchip](https://github.com/airockchip/rknn-llm) for rkllm-toolkit and rknn-toolkit2
- **API Server**: [RKLLM API Server](https://github.com/jdacostap/rkllm-api) — OpenAI-compatible server for RK3588 NPU
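## Appendix: Letterbox Geometry

The aspect-ratio-preserving resize mentioned under Known Limitations maps an arbitrary image into the encoder's fixed square input. A minimal illustration of that geometry (not the server's actual implementation, which handles this internally):

```python
# Scale a w x h image to fit a square side x side target while keeping
# the aspect ratio, then pad the remainder symmetrically (letterboxing).
def letterbox(w: int, h: int, side: int):
    """Return (new_w, new_h, pad_x, pad_y) for fitting w x h into side x side."""
    scale = side / max(w, h)          # shrink/grow so the long edge fits exactly
    new_w, new_h = round(w * scale), round(h * scale)
    pad_x = (side - new_w) // 2       # horizontal padding on each side
    pad_y = (side - new_h) // 2       # vertical padding on each side
    return new_w, new_h, pad_x, pad_y

# A 1920x1080 frame into the 672 encoder: full width, padded top/bottom.
print(letterbox(1920, 1080, 672))  # -> (672, 378, 0, 147)
```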