---
license: mit
language:
- en
- zh
base_model:
- Qwen/Qwen3-VL-2B-Instruct
- Qwen/Qwen3-VL-4B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- Qwen3-VL
- Qwen3-VL-2B-Instruct
- Qwen3-VL-4B-Instruct
- Int8
- VLM
---
# Qwen3-VL
These versions of Qwen3-VL-2B-Instruct and Qwen3-VL-4B-Instruct have been converted to run on the Axera NPU using **w8a16** quantization.
Compatible with Pulsar2 version: 5.0
## Conversion tool links
If you are interested in model conversion, you can try exporting the axmodel yourself from the original repositories:
- https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct
- https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct

[Pulsar2 Link, How to Convert LLM from Huggingface to axmodel](https://pulsar2-docs.readthedocs.io/en/latest/appendix/build_llm.html)

[AXera NPU HOST LLM Runtime](https://github.com/AXERA-TECH/Qwen3-VL.AXERA)
## Supported Platforms
- AX650
- AX650N DEMO Board
- [M4N-Dock(爱芯派Pro)](https://wiki.sipeed.com/hardware/zh/maixIV/m4ndock/m4ndock.html)
- [M.2 Accelerator card](https://axcl-docs.readthedocs.io/zh-cn/latest/doc_guide_hardware.html)
**Image Processing**

| Chips | Input size | Image num | Image encoder | TTFT (168 tokens) | Decode speed (w8a16) | CMM | Flash |
|--|--|--|--|--|--|--|--|
| AX650 | 384×384 | 1 | 238 ms | 392 ms | 9.5 tokens/s | 4.1 GiB | 4.2 GiB |
**Video Processing**

| Chips | Input size | Image num | Image encoder | TTFT (600 tokens) | Decode speed (w8a16) | CMM | Flash |
|--|--|--|--|--|--|--|--|
| AX650 | 384×384 | 8 | 751 ms | 1045 ms | 9.5 tokens/s | 4.1 GiB | 4.2 GiB |
The CMM column is the amount of CMM memory (reserved DDR) the model consumes. Make sure the CMM allocation on your development board is larger than this value.
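Before running the demo, you can sanity-check the board's CMM pool against the table above. A minimal sketch that pulls the two memory figures out of the runtime's own init log; the sample lines are copied from the demo transcript, and the 4.1 GiB requirement is the CMM column of the table:

```python
import re

# Sketch: confirm the board's CMM pool is large enough before loading the
# model. The runtime prints "Total CMM:7884 MB" at init and
# "remain_cmm(3678 MB)" after the weights are mapped; the difference is
# the CMM actually consumed, which should match the table's CMM column.
REQUIRED_MIB = int(4.1 * 1024)  # 4.1 GiB from the benchmark table, in MiB

def parse_mib(pattern: str, log_text: str) -> int:
    """Extract a MiB figure from a runtime log line."""
    match = re.search(pattern, log_text)
    if match is None:
        raise ValueError(f"pattern {pattern!r} not found in log")
    return int(match.group(1))

sample_log = (
    "[I][ Init][ 158]: Total CMM:7884 MB\n"
    "init vpm axmodel ok,remain_cmm(3678 MB)\n"
)
total = parse_mib(r"Total CMM:(\d+)\s*MB", sample_log)
remaining = parse_mib(r"remain_cmm\((\d+)\s*MB\)", sample_log)
print(total >= REQUIRED_MIB)  # True: the 7884 MiB pool covers the 4.1 GiB need
print(total - remaining)      # 4206 MiB consumed, i.e. ~4.1 GiB as in the table
```

Note that the consumed amount (7884 − 3678 = 4206 MiB) agrees with the 4.1 GiB figure in the table.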
## How to use
Download all files from this repository to the device.

**If you are using the AX650 board**
### Prepare the tokenizer server
#### Install transformers
```
pip install -r requirements.txt
```
### Demo Run
#### Image understanding demo
##### start tokenizer server for image understanding demo
```
python3 qwen3_tokenizer.py --port 8080
```
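The runtime keeps tokenization out of the on-device C++ process: qwen3_tokenizer.py serves token IDs over HTTP on port 8080, and the runtime connects to it (the `connect http://127.0.0.1:8080 ok` line in the log). A minimal sketch of that client/server pattern using only the standard library; the `/encode` endpoint, the JSON schema, and the toy character-code tokenizer are illustrative assumptions, not the actual qwen3_tokenizer.py protocol:

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen

def toy_encode(text: str) -> list:
    # Stand-in for the real Qwen3 tokenizer (which needs `transformers`).
    return [ord(c) for c in text]

class TokenizerHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read a JSON body like {"text": "..."} and reply with token IDs.
        body = self.rfile.read(int(self.headers["Content-Length"]))
        text = json.loads(body)["text"]
        reply = json.dumps({"input_ids": toy_encode(text)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(reply)

    def log_message(self, *args):
        pass  # silence per-request logging

# Port 0 lets the OS pick a free port; the demo uses a fixed 8080.
server = HTTPServer(("127.0.0.1", 0), TokenizerHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# What the runtime's HTTP client side does, in miniature:
req = Request(
    f"http://127.0.0.1:{server.server_port}/encode",
    data=json.dumps({"text": "hi"}).encode(),
    headers={"Content-Type": "application/json"},
)
with urlopen(req) as resp:
    ids = json.loads(resp.read())["input_ids"]
server.shutdown()
print(ids)  # [104, 105]
```

This split exists because the NPU host runtime cannot easily embed the Python tokenizer, so it delegates encode/decode to the local HTTP service started above.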
##### run image understanding demo
- input text
```
描述这张图片
```
- input image
![](./images/recoAll_attractions_1.jpg)
```
root@ax650 ~/Qwen3-VL-2B-Instruct # bash run_image_ax650.sh
[I][ Init][ 156]: LLM init start
[I][ Init][ 158]: Total CMM:7884 MB
[I][ Init][ 34]: connect http://127.0.0.1:8080 ok
bos_id: -1, eos_id: 151645
img_start_token: 151652
img_context_token: 151655
3% | ██ | 1 / 31 [0.01s<0.31s, 100.00 count/s] tokenizer init ok[I][ Init][ 26]: LLaMaEmbedSelector use mmap
6% | ███ | 2 / 31 [0.01s<0.20s, 153.85 count/s] embed_selector init ok[I][ Init][ 201]: attr.axmodel_num:28
103% | ██████████████████████████████████ | 32 / 31 [13.72s<13.29s, 2.33 count/s] init vpm axmodel ok,remain_cmm(3678 MB)[I][ Init][ 266]: IMAGE_CONTEXT_TOKEN: 151655, IMAGE_START_TOKEN: 151652
[I][ Init][ 309]: image encoder output float32
[I][ Init][ 339]: max_token_len : 2047
[I][ Init][ 344]: kv_cache_size : 1024, kv_cache_num: 2047
[I][ Init][ 352]: prefill_token_num : 128
[I][ Init][ 356]: grp: 1, prefill_max_token_num : 1
[I][ Init][ 356]: grp: 2, prefill_max_token_num : 128
[I][ Init][ 356]: grp: 3, prefill_max_token_num : 256
[I][ Init][ 356]: grp: 4, prefill_max_token_num : 384
[I][ Init][ 356]: grp: 5, prefill_max_token_num : 512
[I][ Init][ 356]: grp: 6, prefill_max_token_num : 640
[I][ Init][ 356]: grp: 7, prefill_max_token_num : 768
[I][ Init][ 356]: grp: 8, prefill_max_token_num : 896
[I][ Init][ 356]: grp: 9, prefill_max_token_num : 1024
[I][ Init][ 356]: grp: 10, prefill_max_token_num : 1152
[I][ Init][ 360]: prefill_max_token_num : 1152
[I][ Init][ 372]: LLM init ok
[I][ Init][ 374]: Left CMM:3678 MB
Type "q" to exit, Ctrl+c to stop current running
prompt >> 描述这张图片
image >> images/recoAll_attractions_1.jpg
[I][ EncodeImage][ 440]: pixel_values size 1
[I][ EncodeImage][ 441]: grid_h 24 grid_w 24
[I][ EncodeImage][ 489]: image encode time : 230.444000 ms, size : 1
[I][ Encode][ 532]: input_ids size:168
[I][ Encode][ 540]: offset 15
[I][ Encode][ 569]: img_embed.size:1, 294912
[I][ Encode][ 583]: out_embed size:344064
[I][ Encode][ 584]: input_ids size 168
[I][ Encode][ 586]: position_ids size:168
[I][ Run][ 607]: input token num : 168, prefill_split_num : 2
[I][ Run][ 641]: input_num_token:128
[I][ Run][ 641]: input_num_token:40
[I][ Run][ 865]: ttft: 392.89 ms
Sure. This is an image of the pyramids of Giza, Egypt.
The image shows the magnificent pyramid complex at Giza: several enormous pyramids rise from the vast desert, built from massive stone blocks in the classic stepped form. They are a masterpiece of ancient Egyptian civilization and a world-famous cultural heritage site.
In the foreground, a few tourists or explorers look tiny next to the pyramids, which underscores how grand and ancient the structures are. The sky is clear and sunny, adding bright color to the scene. The whole picture conveys a sense of historical weight and natural splendor.
[N][ Run][ 992]: hit eos,avg 9.39 token/s
```
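The tensor sizes in the log above are internally consistent and can be re-derived with a little arithmetic. Assuming the Qwen3-VL-2B configuration (ViT patch size 16, 2×2 spatial patch merge, text hidden size 2048; none of these appear in the log itself), a 384×384 image yields the `grid_h 24 grid_w 24`, 144 image tokens, and the embedding element counts printed in the transcript:

```python
# Re-derive the sizes printed in the image-demo log.
# Assumptions (Qwen3-VL-2B config, not stated in the log):
PATCH = 16    # ViT patch size
MERGE = 2     # 2x2 spatial patch merge
HIDDEN = 2048 # text-model hidden size

input_hw = 384
grid = input_hw // PATCH              # 24  -> "grid_h 24 grid_w 24"
image_tokens = (grid // MERGE) ** 2   # 144 merged image tokens per image
img_embed = image_tokens * HIDDEN     # 294912 -> "img_embed.size:1, 294912"

total_tokens = 168                    # "input_ids size:168"
out_embed = total_tokens * HIDDEN     # 344064 -> "out_embed size:344064"
text_tokens = total_tokens - image_tokens  # 24 prompt/template tokens

print(grid, image_tokens, img_embed, out_embed, text_tokens)
```

The `offset 15` line also fits: the image span starts at token 15 and runs for 144 tokens, leaving the remaining text tokens around it.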
#### Video understanding demo
##### start tokenizer server for video understanding demo
```
python qwen3_tokenizer.py --port 8080
```
##### run video understanding demo
- input text
```
描述这个视频
```
- input video
./video
```
root@ax650 ~/Qwen3-VL # bash run_qwen3_vl_2b_video.sh
[I][ Init][ 156]: LLM init start
[I][ Init][ 34]: connect http://127.0.0.1:8080 ok
bos_id: -1, eos_id: 151645
img_start_token: 151652
img_context_token: 151656
3% | ██ | 1 / 31 [0.01s<0.31s, 100.00 count/s] tokenizer init ok[I][ Init][ 26]: LLaMaEmbedSelector use mmap
6% | ███ | 2 / 31 [0.01s<0.20s, 153.85 count/s] embed_selector init ok[I][ Init][ 198]: attr.axmodel_num:28
103% | ██████████████████████████████████ | 32 / 31 [30.34s<29.39s, 1.05 count/s] init vpm axmodel ok,remain_cmm(3678 MB)[I][ Init][ 263]: IMAGE_CONTEXT_TOKEN: 151656, IMAGE_START_TOKEN: 151652
[I][ Init][ 306]: image encoder output float32
[I][ Init][ 336]: max_token_len : 2047
[I][ Init][ 341]: kv_cache_size : 1024, kv_cache_num: 2047
[I][ Init][ 349]: prefill_token_num : 128
[I][ Init][ 353]: grp: 1, prefill_max_token_num : 1
[I][ Init][ 353]: grp: 2, prefill_max_token_num : 128
[I][ Init][ 353]: grp: 3, prefill_max_token_num : 256
[I][ Init][ 353]: grp: 4, prefill_max_token_num : 384
[I][ Init][ 353]: grp: 5, prefill_max_token_num : 512
[I][ Init][ 353]: grp: 6, prefill_max_token_num : 640
[I][ Init][ 353]: grp: 7, prefill_max_token_num : 768
[I][ Init][ 353]: grp: 8, prefill_max_token_num : 896
[I][ Init][ 353]: grp: 9, prefill_max_token_num : 1024
[I][ Init][ 353]: grp: 10, prefill_max_token_num : 1152
[I][ Init][ 357]: prefill_max_token_num : 1152
[I][ Init][ 366]: LLM init ok
Type "q" to exit, Ctrl+c to stop current running
prompt >> 描述这个视频
image >> video
video/frame_0000.jpg
video/frame_0008.jpg
video/frame_0016.jpg
video/frame_0024.jpg
video/frame_0032.jpg
video/frame_0040.jpg
video/frame_0048.jpg
video/frame_0056.jpg
[I][ Encode][ 490]: image encode time : 751.804993 ms, size : 4
[I][ Encode][ 533]: input_ids size:600
[I][ Encode][ 541]: offset 15
[I][ Encode][ 557]: img_embed.size:4, 294912
[I][ Encode][ 562]: offset:159
[I][ Encode][ 562]: offset:303
[I][ Encode][ 562]: offset:447
[I][ Encode][ 571]: out_embed size:1228800
[I][ Encode][ 573]: position_ids size:600
[I][ Run][ 591]: input token num : 600, prefill_split_num : 5
[I][ Run][ 625]: input_num_token:128
[I][ Run][ 625]: input_num_token:128
[I][ Run][ 625]: input_num_token:128
[I][ Run][ 625]: input_num_token:128
[I][ Run][ 625]: input_num_token:88
[I][ Run][ 786]: ttft: 1040.91 ms
Based on the images you provided, this is a video clip of two marmots interacting in a mountain environment.
- **Subjects**: The frame shows two marmots standing on a patch of grass strewn with gravel. Their fur is a mix of grey-brown and black, with distinct dark markings on the face, which is typical of marmots.
- **Behavior**: The two marmots are engaged in what looks like play or social interaction, batting at each other with their forepaws and leaning forward with lively postures. Among marmots this behavior usually signals friendliness, play, or social bonding.
- **Environment**: The background is a range of rolling mountains with green vegetation on the slopes, under a clear, sunny sky. The whole scene feels natural, calm, and full of life.
- **Video style**: Judging from the clarity and sense of motion, this may be a slow-motion or high-definition clip capturing the marmots' lively, animated moments.
In short, the video vividly records two marmots interacting amicably in a natural mountain setting, showing their playful, energetic nature.
[N][ Run][ 913]: hit eos,avg 9.44 token/s
prompt >>
```
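The video log's token accounting can be checked the same way as the image case: 8 frames with a temporal merge of 2 yield 4 embeddings (`img_embed.size:4`), each spanning 144 tokens, which matches the 144-token spacing of the `offset:` lines. The temporal merge factor and the 2048 hidden size are assumptions taken from the Qwen3-VL-2B configuration, not stated in the log:

```python
# Re-derive the token accounting in the video-demo log.
# Assumptions (Qwen3-VL-2B config): 2 frames merged per video embedding,
# 144 tokens per embedding (same 24x24 grid with 2x2 merge as one image),
# text hidden size 2048, prefill chunk size 128 (from "prefill_token_num").
FRAMES = 8
TEMPORAL_MERGE = 2
TOKENS_PER_EMBED = 144
HIDDEN = 2048
PREFILL_CHUNK = 128

num_embeds = FRAMES // TEMPORAL_MERGE  # 4 -> "img_embed.size:4"
first_offset = 15                      # "offset 15"
offsets = [first_offset + i * TOKENS_PER_EMBED for i in range(1, num_embeds)]
# consecutive embeddings start 144 tokens apart -> the "offset:" lines

total_tokens = 600                           # "input_ids size:600"
out_embed = total_tokens * HIDDEN            # 1228800 -> "out_embed size:1228800"
text_tokens = total_tokens - num_embeds * TOKENS_PER_EMBED  # prompt/template tokens
prefill_chunks = -(-total_tokens // PREFILL_CHUNK)  # ceil -> "prefill_split_num : 5"

print(num_embeds, offsets, out_embed, text_tokens, prefill_chunks)
```

The last chunk carries the remainder (600 − 4×128 = 88 tokens), which is exactly the final `input_num_token:88` line in the log.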