---
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
tags:
- rknn
- rkllm
---
|
|
# Qwen2.5-VL-3B-Instruct-RKLLM
|
|
|
|
|
|
|
|
|
|
Run the powerful Qwen2.5-VL-3B-Instruct-RKLLM vision-language model on the RK3588!

- **Inference speed (RK3588)**: vision encoder 3.4 s (3-core parallel) + LLM prefill 2.3 s (320 tokens / 138 tps) + decode 8.2 tps
- **Memory usage (RK3588, context length 1024)**: 6.1 GB
|
|
|
|
|
## How to Use
|
|
|
|
|
1. Clone or download this repository locally. The model files are large, so make sure you have enough disk space.
|
|
|
|
|
2. The RKNPU2 kernel driver on your board must be version `>= 0.9.6` to run a model this large. Check the driver version with root privileges:
|
|
```bash
> cat /sys/kernel/debug/rknpu/version
RKNPU driver: v0.9.8
```
|
|
If the version is too old, update the driver. You may need to update the kernel, or consult the official documentation for help.
|
|
|
|
|
3. Install dependencies:
|
|
|
|
|
```bash
pip install "numpy<2" opencv-python rknn-toolkit-lite2
```
|
|
|
|
|
4. Run the model:
|
|
|
|
|
```bash
python ./run_rkllm.py ./test.jpg ./vision_encoder.rknn ./language_model_w8a8.rkllm 512 1024 3
```
|
|
|
|
|
Parameter descriptions:

- `512`: `max_new_tokens`, the maximum number of tokens to generate.
- `1024`: `max_context_len`, the maximum context length.
- `3`: `npu_core_num`, the number of NPU cores to use.
|
|
|
|
|
If measured performance falls short, set the CPU governor so the CPU stays at its maximum frequency, and pin the inference process to the big cores (`taskset -c 4-7 python ...`); an in-script alternative is sketched below.
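As an alternative to `taskset`, the process can pin itself to the big cores from inside Python. This is a minimal standard-library sketch; the core IDs 4-7 assume the usual RK3588 layout of four little cores (0-3) followed by four big cores (4-7):

```python
import os

# Pin the current process (and threads spawned afterwards) to the
# Cortex-A76 big cores of the RK3588 (CPUs 4-7 on most boards).
os.sched_setaffinity(0, {4, 5, 6, 7})
```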
|
|
|
|
|
test.jpg:


|
|
|
|
|
```
Initializing ONNX Runtime for vision encoder...
W rknn-toolkit-lite2 version: 2.3.2
W Query dynamic range failed. Ret code: RKNN_ERR_MODEL_INVALID. (If it is a static shape RKNN model, please ignore the above warning message.)
Vision encoder loaded successfully.
ONNX Input: pixel_values, ONNX Output: vision_features
Initializing RKLLM Runtime...
I rkllm: rkllm-runtime version: 1.2.1, rknpu driver version: 0.9.8, platform: RK3588
I rkllm: loading rkllm model from ./language_model_w8a8.rkllm
I rkllm: rkllm-toolkit version: 1.2.1, max_context_limit: 4096, npu_core_num: 3, target_platform: RK3588, model_dtype: W8A8
I rkllm: Enabled cpus: [4, 5, 6, 7]
I rkllm: Enabled cpus num: 4
I rkllm: Using mrope
RKLLM initialized successfully.
Preprocessing image...
Running vision encoder...
W The input[0] need NHWC data format, but NCHW set, the data format and data buffer will be changed to NHWC.
视觉编码器推理耗时: 3.5427 秒
Image encoded successfully.
I rkllm: reset chat template:
I rkllm: system_prompt: <|im_start|>system\nYou are a helpful assistant.<|im_end|>\n
I rkllm: prompt_prefix: <|im_start|>user\n
I rkllm: prompt_postfix: <|im_end|>\n<|im_start|>assistant\n
W rkllm: Calling rkllm_set_chat_template will disable the internal automatic chat template parsing, including enable_thinking. Make sure your custom prompt is complete and valid.

**********************可输入以下问题对应序号获取回答/或自定义输入********************

[0] Picture 1: <image> What is in the image?
[1] Picture 1: <image> 这张图片中有什么?

*************************************************************************

user: 0
Picture 1: <image> What is in the image?
robot: n_image_tokens: 289
The image shows a cozy bedroom with several notable features:

- A large bed covered with a blue comforter.
- A wooden dresser next to the bed, topped with various items including a mirror and some decorative objects.
- A window allowing natural light into the room, offering a view of greenery outside.
- A bookshelf filled with numerous books on shelves.
- A basket placed near the foot of the bed.
- A lamp on a side table beside the bed.

The overall ambiance is warm and inviting.

I rkllm: --------------------------------------------------------------------------------------
I rkllm: Model init time (ms)  3361.48
I rkllm: --------------------------------------------------------------------------------------
I rkllm: Stage      Total Time (ms)   Tokens   Time per Token (ms)   Tokens per Second
I rkllm: --------------------------------------------------------------------------------------
I rkllm: Prefill    2201.45           321      6.86                  145.81
I rkllm: Generate   12419.47          102      121.76                8.21
I rkllm: --------------------------------------------------------------------------------------
I rkllm: Peak Memory Usage (GB)
I rkllm: 6.19
I rkllm: --------------------------------------------------------------------------------------

user: 1
Picture 1: <image> 这张图片中有什么?
robot: n_image_tokens: 289
这张照片展示了一个卧室的内部。房间有一扇大窗户,可以看到外面的绿色植物。房间里有各种物品:一个蓝色的大床单覆盖在一张床上;一盏灯放在梳妆台上;一面镜子挂在墙上;书架上摆满了书籍和一些装饰品;还有一些篮子、花盆和其他小物件散落在周围。

I rkllm: --------------------------------------------------------------------------------------
I rkllm: Stage      Total Time (ms)   Tokens   Time per Token (ms)   Tokens per Second
I rkllm: --------------------------------------------------------------------------------------
I rkllm: Prefill    184.35            13       14.18                 70.52
I rkllm: Generate   8711.49           72       120.99                8.26
I rkllm: --------------------------------------------------------------------------------------
I rkllm: Peak Memory Usage (GB)
I rkllm: 6.19
I rkllm: --------------------------------------------------------------------------------------
```
|
|
|
|
|
## Model Conversion
|
|
|
|
|
#### Prerequisites
|
|
|
|
|
1. Install rknn-toolkit2 and rkllm-toolkit:
|
|
```bash
pip install -U rknn-toolkit2
```
|
|
rkllm-toolkit must be downloaded manually from: https://github.com/airockchip/rknn-llm/tree/main/rkllm-toolkit
|
|
|
|
|
2. Download this repository locally; the model files ending in `.rkllm` and `.rknn` are not needed.

3. Download the Qwen2.5-VL-3B-Instruct Hugging Face model repository locally. ( https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct )
|
|
|
|
|
#### Convert LLM
|
|
|
|
|
Copy `rkllm-convert.py` into the Qwen2.5-VL-3B-Instruct model folder and execute:
|
|
```bash
python rkllm-convert.py
```
|
|
It uses w8a8 quantization by default; open the script to change the quantization method and other settings. A sketch of what such a script contains follows.
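For reference, here is a minimal sketch of what such a conversion script typically contains, based on the rkllm-toolkit API; the exact keyword arguments vary between toolkit versions, so check them against the official rkllm-toolkit examples:

```python
from rkllm.api import RKLLM

llm = RKLLM()

# Load the Hugging Face checkpoint from the current directory
# (the Qwen2.5-VL-3B-Instruct model folder).
assert llm.load_huggingface(model='.') == 0

# Quantize to w8a8 for the RK3588 NPU (the default used by this repo).
assert llm.build(do_quantization=True,
                 quantized_dtype='w8a8',
                 target_platform='rk3588') == 0

# Write out the runtime model consumed by run_rkllm.py.
assert llm.export_rkllm('./language_model_w8a8.rkllm') == 0
```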
|
|
|
|
|
#### Convert Vision Encoder
|
|
|
|
|
1. **Export ONNX**
|
|
|
|
|
Copy `export_vision_onnx.py` into the root of the Qwen2.5-VL-3B-Instruct model folder, then execute the following **in that root directory**:
|
|
```bash
mkdir vision
python ./export_vision_onnx.py . --savepath ./vision/vision_encoder.onnx
```
|
|
The vision encoder is exported to `vision/vision_encoder.onnx`. The default height and width are 476; change them with the `--height` and `--width` parameters. A quick sanity check of the export is sketched below.
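With the default 476×476 input the encoder yields 289 image tokens: 476 / 14 = 34 patches per side, and the 2×2 spatial merge leaves a 17×17 = 289 grid, matching the `n_image_tokens: 289` line in the log above. To sanity-check the export before the RKNN conversion, a short onnxruntime script (assuming `onnxruntime` is installed) can run it once on random data:

```python
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("vision/vision_encoder.onnx")
inp = sess.get_inputs()[0]
print("input:", inp.name, inp.shape)   # expect: pixel_values

# The export is static-shape, so the dims are plain integers.
x = np.random.rand(*inp.shape).astype(np.float32)
(features,) = sess.run(None, {inp.name: x})
print("output:", sess.get_outputs()[0].name, features.shape)
```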
|
|
|
|
|
2. **Model Optimization (Optional)**
|
|
|
|
|
Download `split_matmul_onnx_profile.py` from https://github.com/happyme531/rknn-toolkit2-utils, then run:
|
|
```bash
python ./split_matmul_onnx_profile.py --input vision/vision_encoder.onnx --output vision_encoder_opt.onnx --pattern "/visual/blocks\..*?/mlp/down_proj.*" --factor 5
```
|
|
The optimized model is written to `vision_encoder_opt.onnx`.
|
|
|
|
|
3. **Convert to RKNN**
|
|
|
|
|
```bash
python ./convert_vision_encoder.py ./vision_encoder_opt.onnx
```
|
|
(This step may take over 20 minutes.)

The converted model is written to `vision_encoder_opt.rknn`; the sketch after this paragraph shows the core of that conversion flow.
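For reference, the conversion boils down to the standard rknn-toolkit2 flow. This is a minimal sketch assuming an fp16 (non-quantized) build; the actual `convert_vision_encoder.py` may set additional `config` options:

```python
from rknn.api import RKNN

rknn = RKNN()

# Target the RK3588 NPU; with do_quantization=False the model stays fp16.
rknn.config(target_platform='rk3588')
assert rknn.load_onnx(model='./vision_encoder_opt.onnx') == 0
assert rknn.build(do_quantization=False) == 0
assert rknn.export_rknn('./vision_encoder_opt.rknn') == 0
rknn.release()
```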
|
|
|
|
|
To match the command in the "How to Use" section, you can rename it:
|
|
```bash
mv vision_encoder_opt.rknn vision_encoder.rknn
```
|
|
|
|
|
## Known Issues
|
|
|
|
|
- Due to limitations of RKLLM's multimodal input, only one image can be loaded per conversation.
- Multi-turn conversation is not implemented.
- RKLLM's w8a8 quantization appears to introduce a non-trivial loss of accuracy.
- Possibly due to RKNPU2's memory-access patterns, the model runs noticeably (and oddly) faster when the input side lengths are not multiples of 64.
|
|
|
|
|
## References
|
|
|
|
|
- [Qwen/Qwen2.5-VL-3B-Instruct-RKLLM](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct-RKLLM)
|
|
|
|
|
|
|
|