---
license: mit
language:
- en
- zh
base_model:
- Qwen/Qwen3-VL-2B-Instruct
- Qwen/Qwen3-VL-4B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- Qwen3-VL
- Qwen3-VL-2B-Instruct
- Qwen3-VL-4B-Instruct
- Int8
- VLM
---

# Qwen3-VL

These versions of Qwen3-VL-2B-Instruct and Qwen3-VL-4B-Instruct have been converted to run on the Axera NPU using **w8a16** quantization.

Compatible with Pulsar2 version: 5.0

## Conversion tool links

If you are interested in model conversion, you can export the axmodel yourself from the original repositories:

- https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct
- https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct

[Pulsar2 Link, How to Convert LLM from Huggingface to axmodel](https://pulsar2-docs.readthedocs.io/en/latest/appendix/build_llm.html)

[AXera NPU HOST LLM Runtime](https://github.com/AXERA-TECH/Qwen3-VL.AXERA)

## Supported platforms

- AX650
  - AX650N DEMO Board
  - [M4N-Dock(爱芯派Pro)](https://wiki.sipeed.com/hardware/zh/maixIV/m4ndock/m4ndock.html)
  - [M.2 Accelerator card](https://axcl-docs.readthedocs.io/zh-cn/latest/doc_guide_hardware.html)

**Image processing**

| Chip | input size | image num | image encoder | TTFT (168 tokens) | decode speed (w8a16) | CMM | Flash |
|--|--|--|--|--|--|--|--|
| AX650 | 384*384 | 1 | 236 ms | 907 ms | 4.3 tokens/sec | 7.3 GiB | 7.9 GiB |

**Video processing**

| Chip | input size | image num | image encoder | TTFT (600 tokens) | decode speed (w8a16) | CMM | Flash |
|--|--|--|--|--|--|--|--|
| AX650 | 384*384 | 8 | 778 ms | 2442 ms | 4.3 tokens/sec | 7.3 GiB | 7.9 GiB |

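As a rough reading of the two tables, the time to produce a full reply can be approximated as TTFT plus the number of generated tokens divided by the decode speed. The sketch below is only an illustration: `estimate_latency_s` is a hypothetical helper, the 200-token reply length is an assumed example, and the image encoder time is left out because the tables list it separately and do not say whether TTFT already includes it.

```python
def estimate_latency_s(ttft_ms: float, decode_tokens_per_s: float, new_tokens: int) -> float:
    """Approximate reply latency in seconds: time to first token plus decode time."""
    if decode_tokens_per_s <= 0:
        raise ValueError("decode speed must be positive")
    return ttft_ms / 1000.0 + new_tokens / decode_tokens_per_s


# TTFT and decode speed taken from the tables; 200 new tokens is an assumed reply length.
image_s = estimate_latency_s(907, 4.3, 200)
video_s = estimate_latency_s(2442, 4.3, 200)
print(f"image: {image_s:.1f} s, video: {video_s:.1f} s")
```
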
The CMM column gives the DDR capacity, in the form of CMM memory, that the model consumes. Ensure that the CMM memory allocation on the development board is greater than this value.

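To make the CMM requirement concrete, the sketch below parses the `Total CMM` and `Left CMM` lines that the runtime prints at start-up (as in the demo logs in this README) and reports how much CMM the loaded model occupies. The function name `cmm_usage_mb` is hypothetical, and converting the log's "MB" to GiB by dividing by 1024 is an assumption about the unit.

```python
import re


def cmm_usage_mb(log_text: str) -> tuple[int, int, int]:
    """Return (total, left, used) CMM in the units printed by the runtime log."""
    total = re.search(r"Total CMM:\s*(\d+)\s*MB", log_text)
    left = re.search(r"Left CMM:\s*(\d+)\s*MB", log_text)
    if total is None or left is None:
        raise ValueError("log does not contain both 'Total CMM' and 'Left CMM' lines")
    total_mb, left_mb = int(total.group(1)), int(left.group(1))
    return total_mb, left_mb, total_mb - left_mb


# Two lines copied from the demo log in this README.
sample_log = """
[I][ Init][ 158]: Total CMM:7884 MB
[I][ Init][ 374]: Left CMM:369 MB
"""
total_mb, left_mb, used_mb = cmm_usage_mb(sample_log)
print(f"used {used_mb} MB (~{used_mb / 1024:.1f} GiB), {left_mb} MB left of {total_mb} MB")
```

With the values from the demo logs this gives 7884 - 369 = 7515 MB, roughly 7.3 GiB, which is consistent with the CMM figure in the tables.
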
## How to use

Download all files from this repository to the device.

**If you are using an AX650 board**

### Prepare tokenizer server

#### Install transformers

```
pip install -r requirements.txt
```

### Demo Run

#### Image understanding demo

##### Start the tokenizer server for the image understanding demo

```
python3 qwen3_tokenizer.py --port 8080
```

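Before launching the demo it can help to confirm that something is listening on the tokenizer port. The check below is a generic TCP probe using only the Python standard library; it does not call any API of `qwen3_tokenizer.py`, whose HTTP endpoints are not documented here. The host `127.0.0.1` and port `8080` match the address shown in the demo log, and `port_is_open` is a hypothetical helper name.

```python
import socket


def port_is_open(host: str, port: int, timeout_s: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    if port_is_open("127.0.0.1", 8080):
        print("tokenizer port reachable")
    else:
        print("nothing listening on 127.0.0.1:8080")
```

A successful probe only shows that the port is open, not that the tokenizer itself is healthy; the runtime's own `connect http://127.0.0.1:8080 ok` log line remains the real confirmation.
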
##### Run the image understanding demo

- input text (Chinese for "Describe this image")

```
描述这张图片
```

- input image



```
root@ax650 ~/Qwen3-VL-4B-Instruct # bash run_image_ax650.sh
[I][ Init][ 156]: LLM init start
[I][ Init][ 158]: Total CMM:7884 MB
[I][ Init][ 34]: connect http://127.0.0.1:8080 ok
bos_id: -1, eos_id: 151645
img_start_token: 151652
img_context_token: 151655
2% | █ | 1 / 39 [0.01s<0.58s, 66.67 count/s] tokenizer init ok[I][ Init][ 26]: LLaMaEmbedSelector use mmap
5% | ██ | 2 / 39 [0.02s<0.43s, 90.91 count/s] embed_selector init ok[I][ Init][ 201]: attr.axmodel_num:36
102% | █████████████████████████████████ | 40 / 39 [75.14s<73.27s, 0.53 count/s] init vpm axmodel ok,remain_cmm(369 MB)[I][ Init][ 266]: IMAGE_CONTEXT_TOKEN: 151655, IMAGE_START_TOKEN: 151652
[I][ Init][ 309]: image encoder output float32

[I][ Init][ 339]: max_token_len : 2047
[I][ Init][ 344]: kv_cache_size : 1024, kv_cache_num: 2047
[I][ Init][ 352]: prefill_token_num : 128
[I][ Init][ 356]: grp: 1, prefill_max_token_num : 1
[I][ Init][ 356]: grp: 2, prefill_max_token_num : 128
[I][ Init][ 356]: grp: 3, prefill_max_token_num : 256
[I][ Init][ 356]: grp: 4, prefill_max_token_num : 384
[I][ Init][ 356]: grp: 5, prefill_max_token_num : 512
[I][ Init][ 356]: grp: 6, prefill_max_token_num : 640
[I][ Init][ 356]: grp: 7, prefill_max_token_num : 768
[I][ Init][ 356]: grp: 8, prefill_max_token_num : 896
[I][ Init][ 356]: grp: 9, prefill_max_token_num : 1024
[I][ Init][ 356]: grp: 10, prefill_max_token_num : 1152
[I][ Init][ 360]: prefill_max_token_num : 1152
[I][ Init][ 372]: LLM init ok
[I][ Init][ 374]: Left CMM:369 MB
Type "q" to exit, Ctrl+c to stop current running
prompt >> 描述这张图片
image >> images/recoAll_attractions_1.jpg
[I][ EncodeImage][ 440]: pixel_values size 1
[I][ EncodeImage][ 441]: grid_h 24 grid_w 24
[I][ EncodeImage][ 489]: image encode time : 236.550995 ms, size : 1
[I][ Encode][ 532]: input_ids size:168
[I][ Encode][ 540]: offset 15
[I][ Encode][ 569]: img_embed.size:1, 368640
[I][ Encode][ 583]: out_embed size:430080
[I][ Encode][ 584]: input_ids size 168
[I][ Encode][ 586]: position_ids size:168
[I][ Run][ 607]: input token num : 168, prefill_split_num : 2
[I][ Run][ 641]: input_num_token:128
[I][ Run][ 641]: input_num_token:40
[I][ Run][ 865]: ttft: 907.21 ms
这张图片展示了埃及吉萨金字塔群的壮丽景象,背景是清澈的蓝天,前景是广袤的沙漠。

画面中,最引人注目的是三座宏伟的金字塔,它们是古埃及文明的象征。其中,位于中央的是一座巨大的金字塔,其石块结构清晰可见,显示出古代工匠的精湛技艺。在它的左侧,是一座较小的金字塔,可能是为法老或贵族建造的。在右侧,还有一座金字塔,虽然部分被遮挡,但依然能感受到其雄伟的气势。

金字塔的周围是平坦的沙地,阳光照射下,金字塔的轮廓在蓝天的映衬下显得格外清晰。整个场景充满了历史的厚重感和神秘的氛围,让人不禁感叹古埃及文明的辉煌成就。

这张图片不仅展现了金字塔的建筑之美,也体现了古埃及人对宇宙和永恒的追求。它是一幅令人震撼的自然与人文景观的完美结合。

[N][ Run][ 992]: hit eos,avg 4.29 token/s
```

The model's reply (in Chinese) describes the Giza pyramids in Egypt: three pyramids on a flat desert plain under a clear blue sky.

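The `Run` lines in the image demo log show the 168-token prompt being prefilled in two pieces (128 and 40 tokens), in line with `prefill_token_num : 128`. A minimal sketch of that splitting, assuming the runtime simply cuts the prompt into consecutive chunks of at most 128 tokens (the function name `split_prefill` is hypothetical, and the actual runtime logic may differ):

```python
def split_prefill(num_tokens: int, chunk_size: int = 128) -> list[int]:
    """Split a prompt of num_tokens into consecutive prefill chunks of at most chunk_size."""
    if num_tokens < 0 or chunk_size <= 0:
        raise ValueError("num_tokens must be >= 0 and chunk_size must be > 0")
    full_chunks, remainder = divmod(num_tokens, chunk_size)
    return [chunk_size] * full_chunks + ([remainder] if remainder else [])


print(split_prefill(168))  # [128, 40]
print(split_prefill(600))  # [128, 128, 128, 128, 88]
```

The second call matches the five `input_num_token` lines printed for the 600-token prompt in the video demo log.
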
#### Video understanding demo

##### Start the tokenizer server for the video understanding demo

```
python3 qwen3_tokenizer.py --port 8080
```

##### Run the video understanding demo

- input text (Chinese for "Describe this video")

```
描述这个视频
```

- input video

./video

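The run log for this demo lists eight frames, `video/frame_0000.jpg` through `video/frame_0056.jpg` in steps of 8. How the demo script picks them is not documented here; the sketch below shows one way to reproduce that selection, assuming the `video` directory holds frames named `frame_NNNN.jpg` and that every 8th frame is taken, up to a limit of 8. `sample_frames` and its parameters are hypothetical.

```python
from pathlib import Path


def sample_frames(frame_dir: str, stride: int = 8, max_frames: int = 8) -> list[Path]:
    """Pick every stride-th frame (sorted by file name), keeping at most max_frames."""
    frames = sorted(Path(frame_dir).glob("frame_*.jpg"))
    return frames[::stride][:max_frames]


if __name__ == "__main__":
    for path in sample_frames("video"):
        print(path)
```
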
```
root@ax650 ~/Qwen3-VL-4B-Instruct # bash run_video_ax650.sh
[I][ Init][ 156]: LLM init start
[I][ Init][ 158]: Total CMM:7884 MB
[I][ Init][ 34]: connect http://127.0.0.1:8080 ok
bos_id: -1, eos_id: 151645
img_start_token: 151652
img_context_token: 151656
2% | █ | 1 / 39 [0.01s<0.43s, 90.91 count/s] tokenizer init ok[I][ Init][ 26]: LLaMaEmbedSelector use mmap
5% | ██ | 2 / 39 [0.01s<0.29s, 133.33 count/s] embed_selector init ok[I][ Init][ 201]: attr.axmodel_num:36
102% | █████████████████████████████████ | 40 / 39 [73.00s<71.17s, 0.55 count/s] init vpm axmodel ok,remain_cmm(369 MB)[I][ Init][ 266]: IMAGE_CONTEXT_TOKEN: 151656, IMAGE_START_TOKEN: 151652
[I][ Init][ 309]: image encoder output float32

[I][ Init][ 339]: max_token_len : 2047
[I][ Init][ 344]: kv_cache_size : 1024, kv_cache_num: 2047
[I][ Init][ 352]: prefill_token_num : 128
[I][ Init][ 356]: grp: 1, prefill_max_token_num : 1
[I][ Init][ 356]: grp: 2, prefill_max_token_num : 128
[I][ Init][ 356]: grp: 3, prefill_max_token_num : 256
[I][ Init][ 356]: grp: 4, prefill_max_token_num : 384
[I][ Init][ 356]: grp: 5, prefill_max_token_num : 512
[I][ Init][ 356]: grp: 6, prefill_max_token_num : 640
[I][ Init][ 356]: grp: 7, prefill_max_token_num : 768
[I][ Init][ 356]: grp: 8, prefill_max_token_num : 896
[I][ Init][ 356]: grp: 9, prefill_max_token_num : 1024
[I][ Init][ 356]: grp: 10, prefill_max_token_num : 1152
[I][ Init][ 360]: prefill_max_token_num : 1152
[I][ Init][ 372]: LLM init ok
[I][ Init][ 374]: Left CMM:369 MB
Type "q" to exit, Ctrl+c to stop current running
prompt >> 描述这个视频
video >> video
video/frame_0000.jpg
video/frame_0008.jpg
video/frame_0016.jpg
video/frame_0024.jpg
video/frame_0032.jpg
video/frame_0040.jpg
video/frame_0048.jpg
video/frame_0056.jpg
[I][ EncodeImage][ 440]: pixel_values size 4
[I][ EncodeImage][ 441]: grid_h 24 grid_w 24
[I][ EncodeImage][ 489]: image encode time : 778.210022 ms, size : 4
[I][ Encode][ 532]: input_ids size:600
[I][ Encode][ 540]: offset 15
[I][ Encode][ 569]: img_embed.size:4, 368640
[I][ Encode][ 574]: offset:159
[I][ Encode][ 574]: offset:303
[I][ Encode][ 574]: offset:447
[I][ Encode][ 583]: out_embed size:1536000
[I][ Encode][ 584]: input_ids size 600
[I][ Encode][ 586]: position_ids size:600
[I][ Run][ 607]: input token num : 600, prefill_split_num : 5
[I][ Run][ 641]: input_num_token:128
[I][ Run][ 641]: input_num_token:128
[I][ Run][ 641]: input_num_token:128
[I][ Run][ 641]: input_num_token:128
[I][ Run][ 641]: input_num_token:88
[I][ Run][ 865]: ttft: 2441.51 ms
这个视频展示了一群**土拨鼠**(或称旱獭)在山地环境中嬉戏打闹的生动场景。

**画面内容:**

- **主体动物**:画面中有多只土拨鼠,它们毛色以灰、棕、白相间,体型圆润,四肢短小,尾巴蓬松。它们正互相追逐、扑打、推搡,动作非常活跃,看起来像是在玩耍或争斗。
- **动作细节**:土拨鼠们用前爪互相拍打、推搡,有的甚至用后腿蹬地,姿态充满动感。其中一只土拨鼠的前爪高高举起,似乎在“击打”另一只,画面充满动感和趣味。
- **背景环境**:背景是连绵起伏的山峦,山坡上覆盖着绿色植被,远处可见裸露的岩石和一条蜿蜒的山路。天空湛蓝,阳光明媚,整个场景充满自然野趣。
- **构图与视觉效果**:画面采用近景特写,聚焦于土拨鼠的互动,背景则略显模糊,突出了主体。画面中还出现了轻微的“多重曝光”或“动态模糊”效果,增强了动作的动感和趣味性。

**整体氛围:**

视频充满活力和趣味,展现了野生动物在自然环境中的自然行为,尤其是它们之间充满“斗殴”趣味的互动,让人忍俊不禁。这种“打斗”在动物界中常是社交、领地争夺或玩耍行为,但在这里被拍摄得极具戏剧性和趣味性。

**总结:**

这是一段充满动感和趣味的野生动物视频,展现了土拨鼠在山地环境中活泼好动、互相嬉戏的可爱瞬间,背景壮丽,画面生动,令人印象深刻。

[N][ Run][ 992]: hit eos,avg 4.30 token/s
```

The model's reply (in Chinese) describes a group of marmots chasing and tussling on a mountainside, with green slopes, exposed rock, and a blue sky in the background.

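The sizes printed in the two demo logs are consistent with a simple token count. The sketch below works through that arithmetic under assumptions inferred from the logs rather than stated by the runtime: a 24 x 24 patch grid per 384 x 384 input, a 2 x 2 spatial merge giving 144 visual tokens per encoder output, and an embedding width obtained by dividing the reported `img_embed` size by that token count. All names are illustrative.

```python
GRID_H, GRID_W = 24, 24    # from "grid_h 24 grid_w 24"
MERGE = 2                  # assumed 2x2 spatial merge of patches
IMG_EMBED_SIZE = 368640    # from "img_embed.size:1, 368640"

# Visual tokens per encoder output and the inferred embedding width.
tokens_per_embed = (GRID_H // MERGE) * (GRID_W // MERGE)
hidden_size = IMG_EMBED_SIZE // tokens_per_embed


def out_embed_size(num_input_ids: int) -> int:
    """Total number of floats in the prompt embedding for num_input_ids tokens."""
    return num_input_ids * hidden_size


def visual_offsets(first_offset: int, num_embeds: int) -> list[int]:
    """Start position of each block of visual tokens inside the prompt."""
    return [first_offset + i * tokens_per_embed for i in range(num_embeds)]


print(tokens_per_embed, hidden_size)  # 144 2560
print(out_embed_size(168))            # 430080, as in the image log
print(out_embed_size(600))            # 1536000, as in the video log
print(visual_offsets(15, 4))          # [15, 159, 303, 447], as in the video log
```

Under the same assumptions, the image prompt holds 144 visual tokens plus 24 text tokens (168 in total), and the video prompt holds 4 x 144 = 576 visual tokens plus 24 text tokens (600 in total), since the eight sampled frames produce four encoder outputs in the log.
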