	---
	license: mit
	language:
	- en
	- zh
	base_model:
	- Qwen/Qwen3-VL-2B-Instruct
	- Qwen/Qwen3-VL-4B-Instruct
	- Qwen/Qwen3-VL-8B-Instruct
	pipeline_tag: image-text-to-text
	library_name: transformers
	tags:
	- Qwen3-VL
	- Qwen3-VL-2B-Instruct
	- Qwen3-VL-4B-Instruct
	- Qwen3-VL-8B-Instruct
	- Int8
	- VLM
	- GPTQ
	---

	# Qwen3-VL

	This version of Qwen3-VL-2B-Instruct has been converted to run on the Axera NPU using w8a16 quantization.

	Compatible with Pulsar2 version: 5.0

	## Conversion tool links

	If you are interested in model conversion, you can export the axmodel yourself from the original repositories:

	- https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct
	- https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct

	[Pulsar2: how to convert an LLM from Hugging Face to axmodel](https://pulsar2-docs.readthedocs.io/en/latest/appendix/build_llm.html)

	[AXera NPU HOST LLM Runtime](https://github.com/AXERA-TECH/Qwen3-VL.AXERA)


	## Supported Platforms

	- AX650
	- AX650N DEMO Board
	- [M4N-Dock(爱芯派Pro)](https://wiki.sipeed.com/hardware/zh/maixIV/m4ndock/m4ndock.html)
	- [M.2 Accelerator card](https://axcl-docs.readthedocs.io/zh-cn/latest/doc_guide_hardware.html)

	Image processing

	| Chip | Input size | Image num | Image encoder | TTFT (168 tokens) | Decode speed (w8a16) | CMM | Flash |
	|--|--|--|--|--|--|--|--|
	| AX650 | 384*384 | 1 | 280 ms | 1476 ms | 2.5 tokens/sec | 11.8 GB | 14 GB |

	Video processing

	| Chip | Input size | Image num | Image encoder | TTFT (600 tokens) | Decode speed (w8a16) | CMM | Flash |
	|--|--|--|--|--|--|--|--|
	| AX650 | 384*384 | 8 | 1114 ms | 4520 ms | 2.5 tokens/sec | 11.8 GB | 14 GB |

	The CMM column is the DDR capacity that the model consumes from CMM memory. Ensure that the CMM memory allocation on the development board is greater than this value.
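
The 168 and 600 token counts in the TTFT columns can be reproduced from the input size. The sketch below assumes a 16-pixel patch, a 2x2 spatial merge, and two frames per temporal patch. These values are inferred from the `grid_h 24 grid_w 24`, `offset`, and `pixel_values size` lines in the demo logs further down, not stated by this repository, and the 24 non-visual tokens are simply what remains of the 168-token prompt after the image tokens.

```python
# Assumed values, inferred from the demo logs (not stated by this repository).
PATCH_SIZE = 16        # 384 / 16 = 24, matching "grid_h 24 grid_w 24"
SPATIAL_MERGE = 2      # 2x2 patches merged into one visual token
TEMPORAL_PATCH = 2     # two video frames share one temporal patch


def visual_tokens(height: int, width: int, frames: int = 1) -> int:
    """Return the number of visual tokens for an image or a stack of frames."""
    grid_h = height // PATCH_SIZE
    grid_w = width // PATCH_SIZE
    tokens_per_grid = (grid_h // SPATIAL_MERGE) * (grid_w // SPATIAL_MERGE)
    # A single image still occupies one temporal slot.
    temporal_slots = max(1, frames // TEMPORAL_PATCH)
    return tokens_per_grid * temporal_slots


image_tokens = visual_tokens(384, 384)            # one 384*384 image
video_tokens = visual_tokens(384, 384, frames=8)  # eight 384*384 frames
text_tokens = 168 - image_tokens                  # non-visual tokens in the image demo

print(image_tokens, video_tokens, text_tokens)    # 144 576 24
```

Under these assumptions the video prompt comes to 576 + 24 = 600 tokens, which agrees with the `input_ids size:600` line in the video log.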

	## How to use

	Download all files from this repository to the device.

	If you are using an AX650 board:

	### Prepare tokenizer server

	#### Install transformers

	```
	pip install -r requirements.txt
	```

	### Demo Run

	#### Image understanding demo

	##### Start the tokenizer server for the image understanding demo

	```
	python3 tokenizer_images.py --port 8080
	```
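
The runtime connects to the tokenizer server when it starts (the demo log prints `connect http://127.0.0.1:8080 ok`). Here is a minimal sketch for confirming that something is listening on that port before launching the demo. The helper name `port_is_open` is hypothetical, and it only checks TCP reachability, not that the listener is actually the tokenizer server.

```python
import socket


def port_is_open(host: str = "127.0.0.1", port: int = 8080, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    print("tokenizer server reachable:", port_is_open())
```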

	##### Run the image understanding demo

	- Input text (Chinese for "Describe this image")

	```
	描述这张图片
	```

	- Input image

	![](./images/recoAll_attractions_1.jpg)

	```
	root@ax650 ~/Qwen3-VL-2B-Instruct-GPTQ-Int4 # bash run_image_ax650.sh
	[I][ Init][ 156]: LLM init start
	[I][ Init][ 158]: Total CMM:4353 MB
	[I][ Init][ 34]: connect http://127.0.0.1:8080 ok
	bos_id: -1, eos_id: 151645
	img_start_token: 151652
	img_context_token: 151655
	3% | ██ | 1 / 31 [0.01s<0.46s, 66.67 count/s] tokenizer init ok[I][ Init][ 26]: LLaMaEmbedSelector use mmap
	6% | ███ | 2 / 31 [0.02s<0.34s, 90.91 count/s] embed_selector init ok[I][ Init][ 201]: attr.axmodel_num:28
	103% | ██████████████████████████████████ | 32 / 31 [34.03s<32.96s, 0.94 count/s] init vpm axmodel ok,remain_cmm(854 MB)[I][ Init][ 266]: IMAGE_CONTEXT_TOKEN: 151655, IMAGE_START_TOKEN: 151652
	[I][ Init][ 309]: image encoder output float32

	[I][ Init][ 339]: max_token_len : 2047
	[I][ Init][ 344]: kv_cache_size : 1024, kv_cache_num: 2047
	[I][ Init][ 352]: prefill_token_num : 128
	[I][ Init][ 356]: grp: 1, prefill_max_token_num : 1
	[I][ Init][ 356]: grp: 2, prefill_max_token_num : 128
	[I][ Init][ 356]: grp: 3, prefill_max_token_num : 256
	[I][ Init][ 356]: grp: 4, prefill_max_token_num : 384
	[I][ Init][ 356]: grp: 5, prefill_max_token_num : 512
	[I][ Init][ 356]: grp: 6, prefill_max_token_num : 640
	[I][ Init][ 356]: grp: 7, prefill_max_token_num : 768
	[I][ Init][ 356]: grp: 8, prefill_max_token_num : 896
	[I][ Init][ 356]: grp: 9, prefill_max_token_num : 1024
	[I][ Init][ 356]: grp: 10, prefill_max_token_num : 1152
	[I][ Init][ 360]: prefill_max_token_num : 1152
	[I][ Init][ 372]: LLM init ok
	[I][ Init][ 374]: Left CMM:854 MB
	Type "q" to exit, Ctrl+c to stop current running
	prompt >> 描述这张图片
	image >> images/recoAll_attractions_1.jpg
	[I][ EncodeImage][ 440]: pixel_values size 1
	[I][ EncodeImage][ 441]: grid_h 24 grid_w 24
	[I][ EncodeImage][ 489]: image encode time : 237.778000 ms, size : 1
	[I][ Encode][ 532]: input_ids size:168
	[I][ Encode][ 540]: offset 15
	[I][ Encode][ 569]: img_embed.size:1, 294912
	[I][ Encode][ 583]: out_embed size:344064
	[I][ Encode][ 584]: input_ids size 168
	[I][ Encode][ 586]: position_ids size:168
	[I][ Run][ 607]: input token num : 168, prefill_split_num : 2
	[I][ Run][ 641]: input_num_token:128
	[I][ Run][ 641]: input_num_token:40
	[I][ Run][ 865]: ttft: 313.60 ms
	这是一张在埃及沙漠中拍摄的风景照片。画面中，三座巨大的金字塔在晴朗的天空下矗立，它们是古埃及文明的象征。这些金字塔由巨大的石块堆叠而成，表面因岁月侵蚀而显得斑驳。在金字塔的前方，有几个人影在沙地上行走，这为整个场景提供了比例感和尺度感。整个场景充满了历史的厚重感和神秘的氛围。

	[N][ Run][ 992]: hit eos,avg 14.14 token/s
	```
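
The figures of interest in a run log (CMM usage, image encode time, ttft, and decode speed) can be pulled out with a few regular expressions. This is a sketch that assumes the log format stays exactly as printed in this README; `parse_run_log` is a hypothetical helper, not part of the runtime.

```python
import re

# Patterns copied from the log lines shown in this README.
PATTERNS = {
    "total_cmm_mb": r"Total CMM:(\d+) MB",
    "left_cmm_mb": r"Left CMM:(\d+) MB",
    "image_encode_ms": r"image encode time : ([\d.]+) ms",
    "ttft_ms": r"ttft: ([\d.]+) ms",
    "decode_tokens_per_s": r"avg ([\d.]+) token/s",
}


def parse_run_log(text: str) -> dict:
    """Extract numeric metrics from a demo log; fields not found are omitted."""
    metrics = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        if match:
            metrics[name] = float(match.group(1))
    if "total_cmm_mb" in metrics and "left_cmm_mb" in metrics:
        # CMM taken up once the model is loaded, in MB.
        metrics["used_cmm_mb"] = metrics["total_cmm_mb"] - metrics["left_cmm_mb"]
    return metrics


sample = """
[I][ Init][ 158]: Total CMM:4353 MB
[I][ Init][ 374]: Left CMM:854 MB
[I][ EncodeImage][ 489]: image encode time : 237.778000 ms, size : 1
[I][ Run][ 865]: ttft: 313.60 ms
[N][ Run][ 992]: hit eos,avg 14.14 token/s
"""
print(parse_run_log(sample))
```

For the image run, this reports 3499 MB of CMM in use (4353 - 854).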

	#### Video understanding demo

	##### Start the tokenizer server for the video understanding demo

	```
	python3 tokenizer_video.py --port 8080
	```

	##### Run the video understanding demo

	- Input text (Chinese for "Describe this video")

	```
	描述这个视频
	```

	- Input video (a directory of extracted frames)

	./video

	```
	root@ax650 ~/Qwen3-VL-2B-Instruct-GPTQ-Int4 # bash run_video_ax650.sh
	[I][ Init][ 156]: LLM init start
	[I][ Init][ 158]: Total CMM:7884 MB
	[I][ Init][ 34]: connect http://127.0.0.1:8080 ok
	bos_id: -1, eos_id: 151645
	img_start_token: 151652
	img_context_token: 151656
	3% | ██ | 1 / 31 [0.01s<0.34s, 90.91 count/s] tokenizer init ok[I][ Init][ 26]: LLaMaEmbedSelector use mmap
	6% | ███ | 2 / 31 [0.01s<0.23s, 133.33 count/s] embed_selector init ok[I][ Init][ 201]: attr.axmodel_num:28
	103% | ██████████████████████████████████ | 32 / 31 [32.37s<31.36s, 0.99 count/s] init vpm axmodel ok,remain_cmm(4385 MB)[I][ Init][ 266]: IMAGE_CONTEXT_TOKEN: 151656, IMAGE_START_TOKEN: 151652
	[I][ Init][ 309]: image encoder output float32

	[I][ Init][ 339]: max_token_len : 2047
	[I][ Init][ 344]: kv_cache_size : 1024, kv_cache_num: 2047
	[I][ Init][ 352]: prefill_token_num : 128
	[I][ Init][ 356]: grp: 1, prefill_max_token_num : 1
	[I][ Init][ 356]: grp: 2, prefill_max_token_num : 128
	[I][ Init][ 356]: grp: 3, prefill_max_token_num : 256
	[I][ Init][ 356]: grp: 4, prefill_max_token_num : 384
	[I][ Init][ 356]: grp: 5, prefill_max_token_num : 512
	[I][ Init][ 356]: grp: 6, prefill_max_token_num : 640
	[I][ Init][ 356]: grp: 7, prefill_max_token_num : 768
	[I][ Init][ 356]: grp: 8, prefill_max_token_num : 896
	[I][ Init][ 356]: grp: 9, prefill_max_token_num : 1024
	[I][ Init][ 356]: grp: 10, prefill_max_token_num : 1152
	[I][ Init][ 360]: prefill_max_token_num : 1152
	[I][ Init][ 372]: LLM init ok
	[I][ Init][ 374]: Left CMM:4385 MB
	Type "q" to exit, Ctrl+c to stop current running
	prompt >> 描述这个视频
	video >> video
	video/frame_0000.jpg
	video/frame_0008.jpg
	video/frame_0016.jpg
	video/frame_0024.jpg
	video/frame_0032.jpg
	video/frame_0040.jpg
	video/frame_0048.jpg
	video/frame_0056.jpg
	[I][ EncodeImage][ 440]: pixel_values size 4
	[I][ EncodeImage][ 441]: grid_h 24 grid_w 24
	[I][ EncodeImage][ 489]: image encode time : 751.481018 ms, size : 4
	[I][ Encode][ 532]: input_ids size:600
	[I][ Encode][ 540]: offset 15
	[I][ Encode][ 569]: img_embed.size:4, 294912
	[I][ Encode][ 574]: offset:159
	[I][ Encode][ 574]: offset:303
	[I][ Encode][ 574]: offset:447
	[I][ Encode][ 583]: out_embed size:1228800
	[I][ Encode][ 584]: input_ids size 600
	[I][ Encode][ 586]: position_ids size:600
	[I][ Run][ 607]: input token num : 600, prefill_split_num : 5
	[I][ Run][ 641]: input_num_token:128
	[I][ Run][ 641]: input_num_token:128
	[I][ Run][ 641]: input_num_token:128
	[I][ Run][ 641]: input_num_token:128
	[I][ Run][ 641]: input_num_token:88
	[I][ Run][ 865]: ttft: 843.36 ms
	这是一段关于两只山地旱獭（也称“山地土拨鼠”）在山地环境中互动的视频。

	在画面中，两只山地旱獭正站在布满碎石的山坡上，背景是连绵起伏的山脉和蓝天。它们的毛色以灰、棕、黑相间，脸部和耳朵周围有明显的黑白条纹，显得非常可爱。

	这两只旱獭正在进行一场激烈的“拳击”或“格斗”游戏。它们的前爪高高举起，像在互相击打，但它们的姿势和动作表明它们可能是在进行一场激烈的“拳击”或“格斗”游戏。它们的嘴巴和前爪在空中挥舞，似乎在互相攻击或展示力量。

	整个场景充满了动感和活力，展现了这些小动物在自然环境中充满活力和趣味的一面。

	[N][ Run][ 992]: hit eos,avg 14.16 token/s

	```
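
Both logs show the prompt being prefilled in chunks of at most `prefill_token_num` (128) tokens: 168 tokens become 128 + 40, and 600 tokens become four chunks of 128 plus one of 88. Below is a sketch of that split, assuming the runtime simply fills chunks in order, which is what the `input_num_token` lines suggest. `split_prefill` is a hypothetical name, not a function of the runtime.

```python
def split_prefill(num_tokens: int, chunk_size: int = 128) -> list[int]:
    """Split a prompt length into consecutive prefill chunk sizes."""
    if num_tokens < 0 or chunk_size <= 0:
        raise ValueError("num_tokens must be >= 0 and chunk_size must be > 0")
    full, remainder = divmod(num_tokens, chunk_size)
    chunks = [chunk_size] * full
    if remainder:
        # The final, partially filled chunk.
        chunks.append(remainder)
    return chunks


print(split_prefill(168))  # [128, 40]
print(split_prefill(600))  # [128, 128, 128, 128, 88]
```

The chunk counts (2 and 5) match the `prefill_split_num` values reported in the two runs.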