GatekeeperZA committed on
Commit 5c93381 · verified · 1 Parent(s): 2067bb2

Add RKLLM v1.2.3 model files: LLM decoder (W8A8) + vision encoders at 448/672/896

.gitattributes CHANGED
@@ -33,3 +33,7 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ qwen3-vl-2b-instruct_w8a8_rk3588.rkllm filter=lfs diff=lfs merge=lfs -text
+ qwen3-vl-2b_vision_448_rk3588.rknn filter=lfs diff=lfs merge=lfs -text
+ qwen3-vl-2b_vision_672_rk3588.rknn filter=lfs diff=lfs merge=lfs -text
+ qwen3-vl-2b_vision_896_rk3588.rknn filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,169 @@
+ ---
+ language:
+ - en
+ - zh
+ license: apache-2.0
+ library_name: rkllm
+ tags:
+ - rkllm
+ - rknn
+ - rk3588
+ - npu
+ - qwen3-vl
+ - vision-language
+ - orange-pi
+ - edge-ai
+ - ocr
+ base_model: Qwen/Qwen3-VL-2B-Instruct
+ pipeline_tag: image-text-to-text
+ ---
+
+ # Qwen3-VL-2B-Instruct for RKLLM v1.2.3 (RK3588 NPU)
+
+ Pre-converted [Qwen3-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct) for the **Rockchip RK3588 NPU** using the [rknn-llm](https://github.com/airockchip/rknn-llm) runtime v1.2.3.
+
+ Runs on the **Orange Pi 5 Plus**, **Rock 5B**, **Radxa NX5**, and other RK3588-based SBCs with 8 GB+ RAM.
+
+ ## Files
+
+ | File | Size | Description |
+ |---|---|---|
+ | `qwen3-vl-2b-instruct_w8a8_rk3588.rkllm` | 2.3 GB | LLM decoder (W8A8 quantized) — shared by all vision resolutions |
+ | `qwen3-vl-2b_vision_448_rk3588.rknn` | 812 MB | Vision encoder @ 448×448 (default, 196 tokens) |
+ | `qwen3-vl-2b_vision_672_rk3588.rknn` | 854 MB | Vision encoder @ 672×672 (441 tokens) ⭐ **Recommended** |
+ | `qwen3-vl-2b_vision_896_rk3588.rknn` | 923 MB | Vision encoder @ 896×896 (784 tokens) |
+
+ ## Choosing a Vision Encoder Resolution
+
+ The LLM decoder (`.rkllm`) is resolution-independent — only the vision encoder (`.rknn`) changes. Place **one** `.rknn` file alongside the `.rkllm` in your model directory, or rename alternatives to `.rknn.alt` to disable them.
+
+ | Resolution | Visual Tokens | Encoder Time* | Total Response* | Best For |
+ |---|---|---|---|---|
+ | **448×448** | 196 (14×14) | ~2 s | ~5–10 s | General scene description, fast responses |
+ | **672×672** ⭐ | 441 (21×21) | ~4 s | ~9–11 s | **Balanced: good detail + reasonable speed** |
+ | **896×896** | 784 (28×28) | ~12 s | ~25–28 s | Maximum detail, fine text/OCR tasks |
+
+ \*Measured on an Orange Pi 5 Plus (16 GB) with a 14 MB JPEG input, single image.
+
+ ### Resolution Math
+
+ Qwen3-VL uses `patch_size=16` and `merge_size=2`, so:
+ - Resolution must be **divisible by 32** (16 × 2)
+ - Visual tokens = (height/32)² = 196 / 441 / 784 for 448 / 672 / 896
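The token arithmetic above can be sketched as a small helper (a hypothetical function, not part of the toolkit; the 32-pixel stride follows from `patch_size × merge_size`):

```python
def visual_tokens(height, width, patch_size=16, merge_size=2):
    """Number of visual tokens a Qwen3-VL-style encoder emits for a
    given input resolution (one token per merged patch)."""
    stride = patch_size * merge_size  # 32: resolution must divide evenly
    if height % stride or width % stride:
        raise ValueError(f"resolution must be divisible by {stride}")
    return (height // stride) * (width // stride)

# The three shipped encoders:
for r in (448, 672, 896):
    print(r, visual_tokens(r, r))  # 196, 441, 784
```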
+
+ Higher resolution = more visual tokens = better fine detail, but:
+ - Proportionally more NPU compute for the vision encoder
+ - More tokens for the LLM to process (longer prefill)
+ - Same decode speed (~15 tok/s) — only "time to first token" increases
+
+ ## Quick Start
+
+ ### Directory Structure
+
+ ```
+ ~/models/Qwen3-VL-2b/
+   qwen3-vl-2b-instruct_w8a8_rk3588.rkllm    # LLM decoder (always needed)
+   qwen3-vl-2b_vision_672_rk3588.rknn        # Active vision encoder
+   qwen3-vl-2b_vision_448_rk3588.rknn.alt    # Alternative (inactive)
+   qwen3-vl-2b_vision_896_rk3588.rknn.alt    # Alternative (inactive)
+ ```
+
+ ### Switching Resolution
+
+ To switch to a different resolution, rename the files:
+
+ ```bash
+ cd ~/models/Qwen3-VL-2b/
+
+ # Deactivate the current encoder
+ mv qwen3-vl-2b_vision_672_rk3588.rknn qwen3-vl-2b_vision_672_rk3588.rknn.alt
+
+ # Activate the 896 encoder
+ mv qwen3-vl-2b_vision_896_rk3588.rknn.alt qwen3-vl-2b_vision_896_rk3588.rknn
+
+ # Restart your API server
+ sudo systemctl restart rkllm-api
+ ```
+
+ ### Using with the RKLLM API Server
+
+ This model is designed for use with the [RKLLM API Server](https://github.com/jdacostap/rkllm-api), which provides an OpenAI-compatible API for RK3588 NPU inference. The server auto-detects the vision encoder resolution from the `.rknn` file's input tensor attributes.
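The server reads the resolution from the RKNN tensor attributes; as a rough stand-in that needs no NPU runtime, a sketch that infers the active encoder's resolution from this repo's file-naming convention (`_vision_<res>_`). `active_encoder_resolution` is a hypothetical helper, not part of the server:

```python
import re
from pathlib import Path

def active_encoder_resolution(model_dir):
    """Locate the single active .rknn encoder in a model directory and
    parse its input resolution from the _vision_<res>_ naming scheme.
    Files renamed to .rknn.alt are ignored, per the convention above."""
    active = [p for p in Path(model_dir).iterdir() if p.suffix == ".rknn"]
    if len(active) != 1:
        raise RuntimeError(f"expected exactly one active .rknn, found {len(active)}")
    m = re.search(r"_vision_(\d+)_", active[0].name)
    if not m:
        raise ValueError(f"cannot parse a resolution from {active[0].name}")
    return int(m.group(1))
```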
+
+ ## Export Details
+
+ ### LLM Decoder
+
+ - **Source**: [Qwen/Qwen3-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct)
+ - **Quantization**: W8A8 (8-bit weights, 8-bit activations)
+ - **Tool**: rkllm-toolkit v1.2.3
+ - **Context**: 4096 tokens
+
+ ### Vision Encoders
+
+ - **Source**: Qwen3-VL-2B-Instruct visual encoder weights
+ - **Export pipeline**: HuggingFace model → ONNX (`export_vision.py`) → RKNN (`export_vision_rknn.py`)
+ - **Tool**: rknn-toolkit2 v2.3.2
+ - **Precision**: FP32 (no quantization — vision encoder quality is critical)
+ - **Target**: rk3588
+
+ The 448 encoder was converted with the default settings from rknn-llm. The 672 and 896 encoders were re-exported with custom `--height` and `--width` flags to `export_vision.py` and `export_vision_rknn.py` from the [rknn-llm multimodal demo](https://github.com/airockchip/rknn-llm/tree/main/examples/multimodal_model_demo/export).
+
+ ### Re-exporting at a Custom Resolution
+
+ To export the vision encoder at a different resolution (must be divisible by 32):
+
+ ```bash
+ # Activate the export environment
+ source ~/rkllm-env/bin/activate
+ cd ~/rknn-llm/examples/multimodal_model_demo
+
+ # Step 1: Export the HuggingFace model to ONNX
+ python3 export/export_vision.py \
+     --path ~/models-hf/Qwen3-VL-2B-Instruct \
+     --model_name qwen3-vl \
+     --height 672 --width 672 \
+     --device cpu
+
+ # Step 2: Convert ONNX to RKNN
+ python3 export/export_vision_rknn.py \
+     --path ./onnx/qwen3-vl_vision.onnx \
+     --model_name qwen3-vl \
+     --target-platform rk3588 \
+     --height 672 --width 672
+ ```
+
+ **Memory requirements**: ~20 GB RAM (or swap) for 672×672, ~35 GB for 896×896. CPU-only export works fine (no GPU needed).
+
+ **Dependencies** (in a Python 3.10 venv):
+ - `rknn-toolkit2 >= 2.3.2`
+ - `torch == 2.4.0`
+ - `transformers >= 4.57.0`
+ - `onnx >= 1.18.0`
+
+ ## Performance Benchmarks
+
+ Tested on an **Orange Pi 5 Plus (16 GB RAM)**, RK3588 SoC, RKNPU driver 0.9.8:
+
+ | Metric | 448×448 | 672×672 | 896×896 |
+ |---|---|---|---|
+ | Vision encode time | ~2 s | ~4 s | ~12 s |
+ | Total VL response | 5–10 s | 9–11 s | 25–28 s |
+ | Text-only decode | ~15 tok/s | ~15 tok/s | ~15 tok/s |
+ | Peak RAM (VL inference) | ~5.5 GB | ~6.5 GB | ~8.5 GB |
+ | RKNN file size | 812 MB | 854 MB | 923 MB |
+
+ ## Known Limitations
+
+ - **OCR accuracy**: The 2B-parameter LLM is the bottleneck for OCR tasks, not the vision encoder resolution. Higher resolution helps with fine detail, but the model may still misread characters.
+ - **Fixed resolution**: Each `.rknn` file is compiled for a specific input resolution. Images are automatically resized (with aspect-ratio-preserving padding) to match. There is no dynamic resolution switching within a single model file.
+ - **REGTASK warnings**: The 672 and 896 encoders produce "bit width exceeds limit" register-field warnings during RKNN conversion. These are cosmetic in rknn-toolkit2 v2.3.2 and do not affect runtime inference on the RK3588.
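The aspect-ratio-preserving resize mentioned above amounts to letterbox geometry; a minimal sketch of that computation (`letterbox_geometry` is a hypothetical helper for illustration, and the runtime's actual preprocessing may differ in rounding or padding placement):

```python
def letterbox_geometry(width, height, target):
    """Aspect-ratio-preserving fit into a square target canvas:
    returns the scaled (w, h) and the top-left padding offset."""
    scale = target / max(width, height)
    new_w, new_h = round(width * scale), round(height * scale)
    # Center the scaled image; the remainder becomes padding.
    return (new_w, new_h), ((target - new_w) // 2, (target - new_h) // 2)

# e.g. fitting a 4:3 photo into the 672 encoder's input:
print(letterbox_geometry(800, 600, 672))  # ((672, 504), (0, 84))
```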
+
+ ## License
+
+ Apache 2.0, inherited from [Qwen3-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct).
+
+ ## Credits
+
+ - **Model**: [Qwen Team](https://huggingface.co/Qwen) for Qwen3-VL-2B-Instruct
+ - **Runtime**: [Rockchip / airockchip](https://github.com/airockchip/rknn-llm) for rkllm-toolkit and rknn-toolkit2
+ - **API Server**: [RKLLM API Server](https://github.com/jdacostap/rkllm-api) — OpenAI-compatible server for the RK3588 NPU
qwen3-vl-2b-instruct_w8a8_rk3588.rkllm ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d5474340221fc495c70e1ec2c7dafc4ebf88292ce466db7e771e3a20b99cf21f
+ size 2375022956
qwen3-vl-2b_vision_448_rk3588.rknn ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:3d707ef5dbf0e420ac48e57b5bf0ed6c0fd1d5d048c29d81e1b5a8d051ab7ea8
+ size 850488413
qwen3-vl-2b_vision_672_rk3588.rknn ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f4e6fb4baeb27fa4e2b88b311716b166cc000c10ec218c91e70a4bdb1db3dfe9
+ size 894505821
qwen3-vl-2b_vision_896_rk3588.rknn ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b395dee006d26ac8e7a97b2d9e154473a8c64b8b5594888957e67864d047cc01
+ size 967465181
upload.py ADDED
@@ -0,0 +1,22 @@
+ #!/usr/bin/env python3
+ """Upload model files to the HuggingFace repo."""
+ import os
+ from huggingface_hub import HfApi
+
+ api = HfApi()
+ repo_id = "GatekeeperZA/Qwen3-VL-2B-Instruct-RKLLM-v1.2.3"
+ upload_dir = os.path.expanduser("~/hf-upload")
+
+ print(f"Uploading all files from {upload_dir} to {repo_id}...")
+ files = os.listdir(upload_dir)
+ for f in sorted(files):
+     size_mb = os.path.getsize(os.path.join(upload_dir, f)) / 1024 / 1024
+     print(f"  {f} ({size_mb:.0f} MB)")
+
+ api.upload_folder(
+     folder_path=upload_dir,
+     repo_id=repo_id,
+     repo_type="model",
+     commit_message="Add RKLLM v1.2.3 model files: LLM decoder (W8A8) + vision encoders at 448/672/896",
+ )
+ print("Upload complete!")