---
language:
- en
- zh
license: apache-2.0
library_name: rkllm
tags:
- rkllm
- rknn
- rk3588
- npu
- qwen3-vl
- vision-language
- orange-pi
- edge-ai
- ocr
base_model: Qwen/Qwen3-VL-2B-Instruct
pipeline_tag: image-text-to-text
---

# Qwen3-VL-2B-Instruct for RKLLM v1.2.3 (RK3588 NPU)

Pre-converted [Qwen3-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct) for the **Rockchip RK3588 NPU** using the [rknn-llm](https://github.com/airockchip/rknn-llm) runtime v1.2.3.

Runs on the **Orange Pi 5 Plus**, **Rock 5B**, **Radxa NX5**, and other RK3588-based SBCs with 8 GB+ RAM.

## Files

| File | Size | Description |
|---|---|---|
| `qwen3-vl-2b-instruct_w8a8_rk3588.rkllm` | 2.3 GB | LLM decoder (W8A8 quantized), shared by all vision resolutions |
| `qwen3-vl-2b_vision_448_rk3588.rknn` | 812 MB | Vision encoder @ 448×448 (default, 196 tokens) |
| `qwen3-vl-2b_vision_672_rk3588.rknn` | 854 MB | Vision encoder @ 672×672 (441 tokens) ⭐ **Recommended** |
| `qwen3-vl-2b_vision_896_rk3588.rknn` | 923 MB | Vision encoder @ 896×896 (784 tokens) |

## Choosing a Vision Encoder Resolution

The LLM decoder (`.rkllm`) is resolution-independent; only the vision encoder (`.rknn`) changes. Place **one** `.rknn` file alongside the `.rkllm` in your model directory, or rename alternatives to `.rknn.alt` to disable them.

| Resolution | Visual Tokens | Encoder Time* | Total Response* | Best For |
|---|---|---|---|---|
| **448×448** | 196 (14×14) | ~2 s | ~5–10 s | General scene description, fast responses |
| **672×672** ⭐ | 441 (21×21) | ~4 s | ~9–11 s | **Balanced: good detail at reasonable speed** |
| **896×896** | 784 (28×28) | ~12 s | ~25–28 s | Maximum detail, fine text/OCR tasks |

\*Measured on an Orange Pi 5 Plus (16 GB) with a 14 MB JPEG input, single image.

### Resolution Math

Qwen3-VL uses `patch_size=16` and `merge_size=2`, so:

- The resolution must be **divisible by 32** (16 × 2)
- Visual tokens = (height/32)² for square inputs: 196 / 441 / 784 for 448 / 672 / 896

Higher resolution means more visual tokens and better fine detail, but also:

- Proportionally more NPU compute for the vision encoder
- More tokens for the LLM to process (longer prefill)
- Decode speed is unchanged (~15 tok/s); only time-to-first-token increases

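The arithmetic above can be checked in a few lines of shell. The `visual_tokens` helper name is illustrative, not part of any tool; it just encodes the divisibility rule and the (side/32)² token count:

```shell
# Visual-token count for a square input: (side / 32)^2, because
# patch_size=16 and merge_size=2 give an effective 32-pixel stride.
visual_tokens() {
  side="$1"
  if [ $((side % 32)) -ne 0 ]; then
    echo "invalid: $side is not divisible by 32"
    return 1
  fi
  grid=$((side / 32))
  echo "$((grid * grid))"
}

visual_tokens 448   # -> 196 (14x14 grid)
visual_tokens 672   # -> 441 (21x21 grid)
visual_tokens 896   # -> 784 (28x28 grid)
```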
## Quick Start

### Directory Structure

```
~/models/Qwen3-VL-2b/
  qwen3-vl-2b-instruct_w8a8_rk3588.rkllm   # LLM decoder (always needed)
  qwen3-vl-2b_vision_672_rk3588.rknn       # Active vision encoder
  qwen3-vl-2b_vision_448_rk3588.rknn.alt   # Alternative (inactive)
  qwen3-vl-2b_vision_896_rk3588.rknn.alt   # Alternative (inactive)
```

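A quick sanity check on that layout (the `count_active_encoders` helper is a sketch, not part of this repo): exactly one non-`.alt` encoder should be active at a time.

```shell
# Count active vision encoders (.rknn files, ignoring .rknn.alt)
# directly inside a model directory.
count_active_encoders() {
  find "$1" -maxdepth 1 -name '*.rknn' | wc -l
}

# Expect exactly 1:
# [ "$(count_active_encoders ~/models/Qwen3-VL-2b)" -eq 1 ] || echo "check .rknn files"
```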
### Switching Resolution

To switch to a different resolution, rename the files:

```bash
cd ~/models/Qwen3-VL-2b/

# Deactivate the current encoder
mv qwen3-vl-2b_vision_672_rk3588.rknn qwen3-vl-2b_vision_672_rk3588.rknn.alt

# Activate the 896 encoder
mv qwen3-vl-2b_vision_896_rk3588.rknn.alt qwen3-vl-2b_vision_896_rk3588.rknn

# Restart your API server
sudo systemctl restart rkllm-api
```

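The two `mv` commands generalize to a small helper. This `switch_vision` function is a sketch, not shipped with this repo; it assumes the file-naming pattern shown above:

```shell
# Sketch: deactivate whichever encoder is active, then activate the
# requested resolution (448, 672, or 896).
switch_vision() {
  res="$1"
  dir="${2:-$HOME/models/Qwen3-VL-2b}"
  # Deactivate every currently active encoder
  for f in "$dir"/qwen3-vl-2b_vision_*_rk3588.rknn; do
    if [ -e "$f" ]; then
      mv "$f" "$f.alt"
    fi
  done
  # Activate the requested one
  target="$dir/qwen3-vl-2b_vision_${res}_rk3588.rknn"
  if [ -e "$target.alt" ]; then
    mv "$target.alt" "$target"
  else
    echo "no encoder found for ${res}" >&2
    return 1
  fi
}

# Usage: switch_vision 896 && sudo systemctl restart rkllm-api
```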
### Using with RKLLM API Server

This model is designed for use with the [RKLLM API Server](https://github.com/jdacostap/rkllm-api), which provides an OpenAI-compatible API for RK3588 NPU inference. The server auto-detects the vision encoder resolution from the `.rknn` file's input tensor attributes.

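Since the server speaks the OpenAI chat protocol, a vision request is a standard `chat/completions` call with a base64 `data:` image URL. The host, port, model name, and `make_vl_payload` helper below are assumptions for illustration; check your server's configuration.

```shell
# Build an OpenAI-style chat payload with one base64-encoded image.
# (Assumed: model name "qwen3-vl-2b" and data: URL support on the server.)
make_vl_payload() {
  img_b64=$(base64 < "$1" | tr -d '\n')   # strip line wrapping
  cat <<EOF
{
  "model": "qwen3-vl-2b",
  "messages": [{
    "role": "user",
    "content": [
      {"type": "image_url",
       "image_url": {"url": "data:image/jpeg;base64,${img_b64}"}},
      {"type": "text", "text": "$2"}
    ]
  }]
}
EOF
}

# Send it (endpoint path and port assumed):
# make_vl_payload photo.jpg "Describe this image." | \
#   curl -s http://localhost:8080/v1/chat/completions \
#        -H "Content-Type: application/json" -d @-
```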
## Export Details

### LLM Decoder

- **Source**: [Qwen/Qwen3-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct)
- **Quantization**: W8A8 (8-bit weights, 8-bit activations)
- **Tool**: rkllm-toolkit v1.2.3
- **Context**: 4096 tokens

### Vision Encoders

- **Source**: Qwen3-VL-2B-Instruct visual encoder weights
- **Export pipeline**: Hugging Face model → ONNX (`export_vision.py`) → RKNN (`export_vision_rknn.py`)
- **Tool**: rknn-toolkit2 v2.3.2
- **Precision**: FP32 (no quantization; vision encoder quality is critical)
- **Target**: rk3588

The 448 encoder was converted with rknn-llm's default settings. The 672 and 896 encoders were re-exported by passing custom `--height` and `--width` flags to `export_vision.py` and `export_vision_rknn.py` from the [rknn-llm multimodal demo](https://github.com/airockchip/rknn-llm/tree/main/examples/multimodal_model_demo/export).

### Re-exporting at a Custom Resolution

To export the vision encoder at a different resolution (must be divisible by 32):

```bash
# Activate the export environment
source ~/rkllm-env/bin/activate
cd ~/rknn-llm/examples/multimodal_model_demo

# Step 1: Export the Hugging Face model to ONNX
python3 export/export_vision.py \
  --path ~/models-hf/Qwen3-VL-2B-Instruct \
  --model_name qwen3-vl \
  --height 672 --width 672 \
  --device cpu

# Step 2: Convert ONNX to RKNN
python3 export/export_vision_rknn.py \
  --path ./onnx/qwen3-vl_vision.onnx \
  --model_name qwen3-vl \
  --target-platform rk3588 \
  --height 672 --width 672
```

**Memory requirements**: ~20 GB of RAM (or swap) for 672×672, ~35 GB for 896×896. CPU-only export works fine (no GPU needed).

**Dependencies** (in a Python 3.10 venv):

- `rknn-toolkit2 >= 2.3.2`
- `torch == 2.4.0`
- `transformers >= 4.57.0`
- `onnx >= 1.18.0`

## Performance Benchmarks

Tested on an **Orange Pi 5 Plus (16 GB RAM)**, RK3588 SoC, RKNPU driver 0.9.8:

| Metric | 448×448 | 672×672 | 896×896 |
|---|---|---|---|
| Vision encode time | ~2 s | ~4 s | ~12 s |
| Total VL response | 5–10 s | 9–11 s | 25–28 s |
| Text-only decode | ~15 tok/s | ~15 tok/s | ~15 tok/s |
| Peak RAM (VL inference) | ~5.5 GB | ~6.5 GB | ~8.5 GB |
| RKNN file size | 812 MB | 854 MB | 923 MB |

## Known Limitations

- **OCR accuracy**: The 2B-parameter LLM, not the vision encoder resolution, is the bottleneck for OCR tasks. Higher resolution helps with fine detail, but the model may still misread characters.
- **Fixed resolution**: Each `.rknn` file is compiled for a specific input resolution. Images are automatically resized (with aspect-ratio-preserving padding) to match. There is no dynamic resolution switching within a single model file.
- **REGTASK warnings**: The 672 and 896 encoders produce "bit width exceeds limit" register-field warnings during RKNN conversion. These are cosmetic in rknn-toolkit2 v2.3.2 and do not affect runtime inference on the RK3588.

## License

Apache 2.0, inherited from [Qwen3-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct).

## Credits

- **Model**: [Qwen Team](https://huggingface.co/Qwen) for Qwen3-VL-2B-Instruct
- **Runtime**: [Rockchip / airockchip](https://github.com/airockchip/rknn-llm) for rkllm-toolkit and rknn-toolkit2
- **API Server**: [RKLLM API Server](https://github.com/jdacostap/rkllm-api), an OpenAI-compatible server for the RK3588 NPU