Qwen2.5-VL-3B-Instruct

This version of Qwen2.5-VL-3B-Instruct has been converted to run on the Axera NPU using w8a16 quantization.

This model has been optimized with the following LoRA:

Compatible with Pulsar2 version: 3.4

Convert tools links:

If you are interested in model conversion, you can try exporting the axmodel from the original repo: https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct

Pulsar2 Link, How to Convert LLM from Huggingface to axmodel

AXera NPU HOST LLM Runtime

Supported Platforms

Image Process

| Chips | input size | image num | image encoder | ttft (384 tokens) | w8a16 | DDR | Flash |
|---|---|---|---|---|---|---|---|
| AX650 | 448*448 | 1 | 780 ms | 1651 ms | 5.9 tokens/sec | 4.3 GiB | 4.6 GiB |

Video Process

| Chips | input size | image num | image encoder | ttft (512 tokens) | w8a16 | DDR | Flash |
|---|---|---|---|---|---|---|---|
| AX650 | 308*308 | 8 | 1400 ms | 2455 ms | 5.9 tokens/sec | 4.4 GiB | 4.7 GiB |

The DDR capacity refers to the CMM memory that needs to be consumed. Ensure that the CMM memory allocation on the development board is greater than this value.
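
As a rough illustration of the check described above, the following sketch compares the DDR figures from the tables with the `Total CMM` value that the runtime prints at start-up (7478 MB in the logs further down). It assumes the "MB" in the log can be treated as MiB, which this document does not confirm; the function name is hypothetical.

```python
MIB_PER_GIB = 1024


def cmm_is_sufficient(total_cmm_mib: float, required_gib: float) -> bool:
    """Return True if the board's CMM pool covers the model's DDR requirement."""
    return total_cmm_mib >= required_gib * MIB_PER_GIB


# DDR column of the two tables above (GiB).
required = {"image (448*448)": 4.3, "video (308*308 x 8)": 4.4}

# "Total CMM:7478 MB" from the init log; assumed here to mean MiB.
total_cmm_mib = 7478

for name, gib in required.items():
    need = gib * MIB_PER_GIB
    status = "ok" if cmm_is_sufficient(total_cmm_mib, gib) else "too small"
    print(f"{name}: needs about {need:.0f} MiB, board has {total_cmm_mib} MiB -> {status}")
```

If the check fails on your board, the CMM allocation has to be enlarged before the models are loaded.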

How to use

Download all files from this repository to the device

If you are using an AX650 board

root@ax650:/mnt/qtang/llm-test/qwen2.5-vl-3b# tree -L 2
.
├── image
│   └── ssd_car.jpg
├── main
├── main_axcl_x86
├── main_axcl_aarch64
├── python
│   ├── cv_resize.py
│   ├── infer_image.py
│   ├── infer_text.py
│   ├── infer_video.py
│   ├── preprocess.py
│   └── utils.py
├── qwen2_5-vl-3b-image-ax650
│   ├── Qwen2.5-VL-3B-Instruct_vision_nchw448.axmodel
│   ├── model.embed_tokens.weight.bfloat16.bin
│   ├── qwen2_5_vl_p320_l0_together.axmodel
......
│   ├── qwen2_5_vl_p320_l9_together.axmodel
│   └── qwen2_5_vl_post.axmodel
├── qwen2_5-vl-3b-video-ax650
│   ├── Qwen2.5-VL-3B-Instruct_vision_nhwc.axmodel
│   ├── model.embed_tokens.weight.bfloat16.bin
│   ├── qwen2_5_vl_p512_l0_together.axmodel
......
│   ├── qwen2_5_vl_p512_l9_together.axmodel
│   └── qwen2_5_vl_post.axmodel
├── qwen2_5-vl-tokenizer
│   ├── chat_template.json
│   ├── config.json
│   ├── generation_config.json
│   ├── merges.txt
│   ├── model.safetensors.index.json
│   ├── preprocessor_config.json
│   ├── tokenizer.json
│   ├── tokenizer_config.json
│   └── vocab.json
├── qwen2_tokenizer_images.py
├── qwen2_tokenizer_video_308.py
├── run_qwen2_5_vl_image.sh
├── run_qwen2_5_vl_video.sh
├── run_qwen2_5_vl_image_axcl_x86.sh
├── run_qwen2_5_vl_image_axcl_aarch64.sh
├── run_qwen2_5_vl_video_axcl_x86.sh
├── run_qwen2_5_vl_video_axcl_aarch64.sh
└── video
    ├── frame_0075.jpg
......
    └── frame_0089.jpg

Prepare tokenizer server

Install transformers

pip install transformers==4.55.2 jinja2

Demo Run

Image understanding demo

Start the tokenizer server for the image understanding demo:
python3 qwen2_tokenizer_images.py --port 12345
Run the image understanding demo:
  • input text
Describe the image (original prompt: 描述下图片)
  • input image

(base) root@ax650:~/AXERA-TECH/Qwen2.5-VL-3B-Instruct# ./run_qwen2_5_vl_image.sh
[I][                            Init][ 134]: LLM init start
[I][                            Init][ 136]: Total CMM:7478 MB
tokenizer_type = 1
  2% | █                                 |   1 /  39 [0.31s<12.21s, 3.19 count/s] tokenizer init ok
[I][                            Init][  26]: LLaMaEmbedSelector use mmap
  5% | ██                                |   2 /  39 [0.31s<6.10s, 6.39 count/s] embed_selector init ok
[I][                            Init][ 181]: attr.axmodel_num:36
102% | █████████████████████████████████ |  40 /  39 [17.30s<16.86s, 2.31 count/s] init vpm axmodel ok,remain_cmm(2939 MB)
[I][                            Init][ 287]: image encoder output float32

[I][                            Init][ 317]: max_token_len : 1023
[I][                            Init][ 322]: kv_cache_size : 256, kv_cache_num: 1023
[I][                            Init][ 330]: prefill_token_num : 128
[I][                            Init][ 334]: grp: 1, prefill_max_token_num : 1
[I][                            Init][ 334]: grp: 2, prefill_max_token_num : 128
[I][                            Init][ 334]: grp: 3, prefill_max_token_num : 256
[I][                            Init][ 334]: grp: 4, prefill_max_token_num : 384
[I][                            Init][ 334]: grp: 5, prefill_max_token_num : 512
[I][                            Init][ 338]: prefill_max_token_num : 512
[E][                     load_config][ 277]: config file(post_config.json) open failed
[W][                            Init][ 351]: load postprocess config(post_config.json) failed
[I][                            Init][ 355]: LLM init ok
[I][                            Init][ 357]: Left CMM:2939 MB
Type "q" to exit, Ctrl+c to stop current running
prompt >> what in the images?
image >> image/ssd_car.jpg
[I][                     EncodeImage][ 432]: pixel_values size 1
[I][                     EncodeImage][ 433]: grid_h 32 grid_w 32
[I][                     EncodeImage][ 460]: image encode time : 781.932983 ms, size : 1
[I][                          Encode][ 513]: input_ids size:282
[I][                          Encode][ 521]: offset 15
[I][                          Encode][ 537]: img_embed.size:1, 524288
[I][                          Encode][ 553]: out_embed size:577536
[I][                          Encode][ 554]: input_ids size 282
[I][                          Encode][ 556]: position_ids size:282
[I][                             Run][ 575]: input token num : 282, prefill_split_num : 3
[I][                             Run][ 609]: input_num_token:128
[I][                             Run][ 609]: input_num_token:128
[I][                             Run][ 609]: input_num_token:26
[I][                             Run][ 798]: ttft: 1651.51 ms

The image shows a red double-decker bus on a city street. The bus has an advertisement on its side that reads,
"THINGS GET MORE EXITING WHEN YOU SAY 'YES' VirginMoney.co.uk." The bus is parked on the side of the road,
and there is a person standing next to it. The background features a building with large windows and a few pedestrians walking on the sidewalk.
 The street appears to be in an urban area, possibly in a city like London.

[N][                             Run][ 924]: hit eos,avg 5.83 token/s
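
The log above reports `prefill_token_num : 128` and splits the 282 input tokens into chunks of 128, 128 and 26. The sketch below reproduces that arithmetic. It is an illustration of the chunking visible in the log, not the runtime's actual code; the function name and the `max_tokens` check (taken from `prefill_max_token_num : 512`) are assumptions.

```python
from typing import List


def split_prefill(num_tokens: int, chunk: int = 128, max_tokens: int = 512) -> List[int]:
    """Split a prompt into prefill chunks of at most `chunk` tokens."""
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    if num_tokens > max_tokens:
        raise ValueError(f"prompt has {num_tokens} tokens, limit is {max_tokens}")
    full, rest = divmod(num_tokens, chunk)
    return [chunk] * full + ([rest] if rest else [])


# Image demo: "input token num : 282, prefill_split_num : 3"
print(split_prefill(282))
# Video demo: "input token num : 509, prefill_split_num : 4"
print(split_prefill(509))
```

Read this way, the "ttft (384 tokens)" and "ttft (512 tokens)" table headers presumably correspond to three and four 128-token prefill chunks respectively.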

Video understanding demo

Pre-process each frame of the video into a 308x308 image before running the demo.
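
A minimal sketch of this pre-processing step, assuming Pillow is installed and the frames are JPEG files in one directory. The repository ships its own `python/cv_resize.py`, which may be the better choice; the helper names and directory names below are hypothetical, and the plain stretch to 308x308 (no aspect-ratio handling) is an assumption.

```python
from pathlib import Path

TARGET_SIZE = (308, 308)  # width, height expected by the video model


def planned_outputs(src_dir, dst_dir, pattern="*.jpg"):
    """Pair every source frame (sorted by name) with its output path."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    return [(p, dst_dir / p.name) for p in sorted(src_dir.glob(pattern))]


def resize_frames(src_dir, dst_dir, size=TARGET_SIZE):
    """Resize all frames in src_dir to `size` and write them to dst_dir."""
    # Pillow is imported lazily so planned_outputs() works without it.
    from PIL import Image

    Path(dst_dir).mkdir(parents=True, exist_ok=True)
    for src, dst in planned_outputs(src_dir, dst_dir):
        with Image.open(src) as im:
            im.convert("RGB").resize(size, Image.BILINEAR).save(dst)
```

For example, `resize_frames("raw_frames", "video")` would write 308x308 copies of every `raw_frames/*.jpg` into `video/`.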

Run the video understanding demo:
(base) root@ax650:~/AXERA-TECH/Qwen2.5-VL-3B-Instruct# ./run_qwen2_5_vl_video.sh
[I][                            Init][ 134]: LLM init start
[I][                            Init][ 136]: Total CMM:7478 MB
tokenizer_type = 1
  2% | █                                 |   1 /  39 [0.32s<12.36s, 3.15 count/s] tokenizer init ok
[I][                            Init][  26]: LLaMaEmbedSelector use mmap
  5% | ██                                |   2 /  39 [0.32s<6.20s, 6.29 count/s] embed_selector init ok
[I][                            Init][ 181]: attr.axmodel_num:36
102% | █████████████████████████████████ |  40 /  39 [17.79s<17.35s, 2.25 count/s] init vpm axmodel ok,remain_cmm(3094 MB)
[I][                            Init][ 287]: image encoder output float32

[I][                            Init][ 317]: max_token_len : 1023
[I][                            Init][ 322]: kv_cache_size : 256, kv_cache_num: 1023
[I][                            Init][ 330]: prefill_token_num : 128
[I][                            Init][ 334]: grp: 1, prefill_max_token_num : 1
[I][                            Init][ 334]: grp: 2, prefill_max_token_num : 128
[I][                            Init][ 334]: grp: 3, prefill_max_token_num : 256
[I][                            Init][ 334]: grp: 4, prefill_max_token_num : 384
[I][                            Init][ 334]: grp: 5, prefill_max_token_num : 512
[I][                            Init][ 338]: prefill_max_token_num : 512
[E][                     load_config][ 277]: config file(post_config.json) open failed
[W][                            Init][ 351]: load postprocess config(post_config.json) failed
[I][                            Init][ 355]: LLM init ok
[I][                            Init][ 357]: Left CMM:3094 MB
Type "q" to exit, Ctrl+c to stop current running
prompt >> what is this?
video >> video
video/frame_0000.jpg
video/frame_0008.jpg
video/frame_0016.jpg
video/frame_0024.jpg
video/frame_0032.jpg
video/frame_0040.jpg
video/frame_0048.jpg
video/frame_0056.jpg
[I][                     EncodeImage][ 432]: pixel_values size 4
[I][                     EncodeImage][ 433]: grid_h 22 grid_w 22
[I][                     EncodeImage][ 460]: image encode time : 1484.067993 ms, size : 4
[I][                          Encode][ 513]: input_ids size:509
[I][                          Encode][ 521]: offset 15
[I][                          Encode][ 537]: img_embed.size:4, 247808
[I][                          Encode][ 544]: offset:136
[I][                          Encode][ 544]: offset:257
[I][                          Encode][ 544]: offset:378
[I][                          Encode][ 553]: out_embed size:1042432
[I][                          Encode][ 554]: input_ids size 509
[I][                          Encode][ 556]: position_ids size:509
[I][                             Run][ 575]: input token num : 509, prefill_split_num : 4
[I][                             Run][ 609]: input_num_token:128
[I][                             Run][ 609]: input_num_token:128
[I][                             Run][ 609]: input_num_token:128
[I][                             Run][ 609]: input_num_token:125
[I][                             Run][ 798]: ttft: 2455.20 ms

This image shows two ground squirrels, also known as marmots, engaging in a playful interaction.
They are standing on their hind legs and appear to be playfully biting or nipping at each other. The background features a scenic mountain landscape with a clear blue sky.

[N][                             Run][ 924]: hit eos,avg 5.82 token/s
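
The `grid_h`, `grid_w` and `img_embed.size` values in the two logs above can be reproduced with a little arithmetic. The sketch below assumes a 14-pixel patch size, a 2x2 spatial merge, two frames per temporal step, and a hidden size of 2048; these constants are not stated in this document and are chosen because they match the logged numbers.

```python
PATCH = 14          # assumed vision patch size in pixels
MERGE = 2           # assumed spatial merge (2x2 patches -> 1 token)
TEMPORAL_PATCH = 2  # assumed number of frames fused per temporal step
HIDDEN = 2048       # assumed embedding width of the 3B language model


def visual_tokens(height: int, width: int, frames: int = 1):
    """Return (grid_h, grid_w, tokens_per_step, steps) for one input."""
    grid_h, grid_w = height // PATCH, width // PATCH
    tokens_per_step = (grid_h // MERGE) * (grid_w // MERGE)
    steps = max(1, frames // TEMPORAL_PATCH)
    return grid_h, grid_w, tokens_per_step, steps


# Image demo: one 448x448 image.
gh, gw, per_step, steps = visual_tokens(448, 448)
print(gh, gw, per_step * HIDDEN)   # log: grid 32x32, img_embed.size 524288

# Video demo: eight 308x308 frames.
gh, gw, per_step, steps = visual_tokens(308, 308, frames=8)
print(gh, gw, steps, per_step * HIDDEN)  # log: grid 22x22, size 4, 247808
print([15 + i * per_step for i in range(steps)])  # logged offsets 15, 136, 257, 378
```

Under these assumptions the image prompt carries 256 visual tokens out of 282, and the video prompt 484 out of 509, which explains why the video demo needs one more prefill chunk.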

Inference with M.2 Accelerator card

What is M.2 Accelerator card? This demo runs on a Raspberry Pi 5.

Image understanding demo

Start the tokenizer server for the image understanding demo:
python3 qwen2_tokenizer_images.py --port 12345
Run the image understanding demo:
  • input text
Describe this picture (original prompt: 描述这张图片)
  • input image

(base) axera@raspberrypi:~/lhj/Qwen2.5-VL-3B-Instruct $ bash run_qwen2_5_vl_image_axcl_aarch64.sh 
[I][                            Init][ 162]: LLM init start
[I][                            Init][  34]: connect http://127.0.0.1:12345 ok
[I][                            Init][ 267]: IMAGE_CONTEXT_TOKEN: 151655, IMAGE_START_TOKEN: 151652
[I][                            Init][ 328]: image encoder output float32

[I][                            Init][ 340]: max_token_len : 1023
[I][                            Init][ 343]: kv_cache_size : 256, kv_cache_num: 1023
[I][                            Init][ 351]: prefill_token_num : 128
[I][                            Init][ 355]: grp: 1, prefill_max_token_num : 1
[I][                            Init][ 355]: grp: 2, prefill_max_token_num : 128
[I][                            Init][ 355]: grp: 3, prefill_max_token_num : 256
[I][                            Init][ 355]: grp: 4, prefill_max_token_num : 384
[I][                            Init][ 355]: grp: 5, prefill_max_token_num : 512
[I][                            Init][ 359]: prefill_max_token_num : 512
________________________
|    ID| remain cmm(MB)|
========================
|     0|           2286|
¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
[E][                     load_config][ 278]: config file(post_config.json) open failed
[W][                            Init][ 452]: load postprocess config(post_config.json) failed
[I][                            Init][ 456]: LLM init ok
Type "q" to exit, Ctrl+c to stop current running
prompt >> 描述这张图片
image >> image/ssd_car.jpg
[I][                          Encode][ 539]: image encode time : 772.851990 ms, size : 524288
[I][                             Run][ 625]: input token num : 280, prefill_split_num : 3
[I][                             Run][ 659]: input_num_token:128
[I][                             Run][ 659]: input_num_token:128
[I][                             Run][ 659]: input_num_token:24
[I][                             Run][ 796]: ttft: 2067.18 ms

这张图片展示了一条繁忙的城市街道。前景中,一名女子站在人行道上,穿着黑色外套,面带微笑。她旁边是一辆红色的双层巴士,
巴士上有一个广告,上面写着“THINGS GET MORE EXITING WHEN YOU SAY ‘YES’ VirginMoney.co.uk”。巴士的车牌号是“L15”。
巴士旁边停着一辆黑色的面包车。背景中可以看到一些商店和行人,街道两旁有路灯和商店的招牌。整体环境显得非常繁忙和现代。

[N][                             Run][ 949]: hit eos,avg 4.12 token/s
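
In the log above the runtime prints `connect http://127.0.0.1:12345 ok`, so the tokenizer server must already be listening before the run script starts. The sketch below is one hedged way to check that from Python using only the standard library; the function name is hypothetical, and it only tests that the TCP port accepts connections, not that the tokenizer itself responds correctly.

```python
import socket


def tokenizer_server_reachable(host: str = "127.0.0.1", port: int = 12345,
                               timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    if tokenizer_server_reachable():
        print("tokenizer server is up, the demo can be started")
    else:
        print("start the tokenizer server with --port 12345 first")
```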

Video understanding demo

Pre-process each frame of the video into a 308x308 image before running the demo.

Start the tokenizer server for the video understanding demo:
python qwen2_tokenizer_video_308.py --port 12345
Run the video understanding demo:
(base) axera@raspberrypi:~/lhj/Qwen2.5-VL-3B-Instruct $ bash run_qwen2_5_vl_video_axcl_aarch64.sh 
[I][                            Init][ 162]: LLM init start
[I][                            Init][  34]: connect http://127.0.0.1:12345 ok
[I][                            Init][ 267]: IMAGE_CONTEXT_TOKEN: 151656, IMAGE_START_TOKEN: 151652
[I][                            Init][ 328]: image encoder output float32

[I][                            Init][ 340]: max_token_len : 1023
[I][                            Init][ 343]: kv_cache_size : 256, kv_cache_num: 1023
[I][                            Init][ 351]: prefill_token_num : 128
[I][                            Init][ 355]: grp: 1, prefill_max_token_num : 1
[I][                            Init][ 355]: grp: 2, prefill_max_token_num : 128
[I][                            Init][ 355]: grp: 3, prefill_max_token_num : 256
[I][                            Init][ 355]: grp: 4, prefill_max_token_num : 384
[I][                            Init][ 355]: grp: 5, prefill_max_token_num : 512
[I][                            Init][ 359]: prefill_max_token_num : 512
________________________
|    ID| remain cmm(MB)|
========================
|     0|           2464|
¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
[E][                     load_config][ 278]: config file(post_config.json) open failed
[W][                            Init][ 452]: load postprocess config(post_config.json) failed
[I][                            Init][ 456]: LLM init ok
Type "q" to exit, Ctrl+c to stop current running
prompt >> 描述这个视频的内容
image >> video
video/frame_0000.jpg
video/frame_0008.jpg
video/frame_0016.jpg
video/frame_0024.jpg
video/frame_0032.jpg
video/frame_0040.jpg
video/frame_0048.jpg
video/frame_0056.jpg
[I][                          Encode][ 539]: image encode time : 1481.107056 ms, size : 991232
[I][                             Run][ 625]: input token num : 509, prefill_split_num : 4
[I][                             Run][ 659]: input_num_token:128
[I][                             Run][ 659]: input_num_token:128
[I][                             Run][ 659]: input_num_token:128
[I][                             Run][ 659]: input_num_token:125
[I][                             Run][ 796]: ttft: 3049.59 ms

视频展示了两只松鼠在户外的场景。背景是模糊的山脉和蓝天,前景中有松鼠在互动。松鼠的毛色是棕色和灰色的混合,它们的爪子是橙色的。松鼠似乎在互相玩耍或争抢,
它们的爪子和嘴巴都伸向对方。整个场景显得非常自然和生动。

[N][                             Run][ 949]: hit eos,avg 4.15 token/s
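
To compare runs such as the AX650 board and the M.2 card, the `ttft` and `avg ... token/s` lines can be pulled out of a captured log. The sketch below assumes the log format shown in this document; the regular expressions and function name are illustrative, and the sample strings are copied from the image demo logs above.

```python
import re

TTFT_RE = re.compile(r"ttft:\s*([\d.]+)\s*ms")
RATE_RE = re.compile(r"avg\s+([\d.]+)\s+token/s")


def parse_metrics(log_text: str):
    """Return (ttft_ms, tokens_per_sec); either is None if not found."""
    ttft = TTFT_RE.search(log_text)
    rate = RATE_RE.search(log_text)
    return (
        float(ttft.group(1)) if ttft else None,
        float(rate.group(1)) if rate else None,
    )


ax650_log = """
[I][                             Run][ 798]: ttft: 1651.51 ms
[N][                             Run][ 924]: hit eos,avg 5.83 token/s
"""
m2_log = """
[I][                             Run][ 796]: ttft: 2067.18 ms
[N][                             Run][ 949]: hit eos,avg 4.12 token/s
"""

for name, log in (("AX650 board", ax650_log), ("M.2 card", m2_log)):
    ttft_ms, rate = parse_metrics(log)
    print(f"{name}: ttft {ttft_ms} ms, decode {rate} token/s")
```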