AXERA-TECH/Qwen2.5-VL-3B-Instruct

@@ -40,14 +40,14 @@ https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct
   - [M.2 Accelerator card](https://axcl-docs.readthedocs.io/zh-cn/latest/doc_guide_hardware.html)
 **Image Process**
-|Chips| input size | image num | image encoder | ttft(320 tokens) | w8a16 | DDR | Flash |
 |--|--|--|--|--|--|--|--|
-|AX650| 448*448 | 1 | 780 ms | 2857 ms | 6.2 tokens/sec| 4.3 GiB |  4.6 GiB  |
 **Video Process**
 |Chips| input size | image num | image encoder |ttft(512 tokens) | w8a16 | DDR | Flash |
 |--|--|--|--|--|--|--|--|
-|AX650| 308*308 | 8  | 1400 ms | 5400 ms | 6.1 tokens/sec| 4.4 GiB |  4.7 GiB  |
 The DDR column is the amount of CMM memory the model consumes. Make sure the CMM allocation on the development board exceeds this value.
@@ -141,65 +141,90 @@ python3 qwen2_tokenizer_images.py --port 12345
 ![](./image/ssd_car.jpg)
 ```
-root@ax650:/mnt/qtang/llm-test/qwen2.5-vl-3b# ./run_qwen2_5_vl_image.sh
-[I][                            Init][ 129]: LLM init start
-bos_id: -1, eos_id: 151645
-  2% | █                                 |   1 /  40 [0.01s<0.24s, 166.67 count/s] tokenizer init ok
 [I][                            Init][  26]: LLaMaEmbedSelector use mmap
-100% | ████████████████████████████████ |  40 /  40 [38.23s<38.23s, 1.05 count/s] init vpm axmodel ok,remain_cmm(7600 MB)
-[I][                            Init][ 277]: max_token_len : 1023
-[I][                            Init][ 282]: kv_cache_size : 256, kv_cache_num: 1023
-[I][                            Init][ 290]: prefill_token_num : 320
-[I][                            Init][ 292]: vpm_height : 1024,vpm_width : 392
-[I][                            Init][ 301]: LLM init ok
 Type "q" to exit, Ctrl+c to stop current running
-prompt >> who are you?
-image >>
-[I][                             Run][ 638]: ttft: 2854.47 ms
-I am a large language model created by Alibaba Cloud. I am called Qwen.
-[N][                             Run][ 779]: hit eos,avg 6.05 token/s
-prompt >> Describe the image
 image >> image/ssd_car.jpg
-[I][                          Encode][ 416]: image encode time : 795.614014 ms, size : 524288
-[I][                             Run][ 638]: ttft: 2856.88 ms
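-This image shows a busy city street. In the foreground, a woman stands on the sidewalk, wearing a black coat and smiling. Next to her is a red double-decker bus with an advertisement
-that reads “THINGS GET MORE EXITING WHEN YOU SAY ‘YES’”. The bus's license plate is “L15”. A small black van is parked next to the bus. In the background, some shops and pedestrians are visible,
-and the buildings on both sides of the street are modern glass-curtain-wall buildings. The overall atmosphere is busy and lively.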
-[N][                             Run][ 779]: hit eos,avg 5.96 token/s
 ```
 #### Video understand demo
 Pre-process the video frames into 308x308 images before running the demo.
-##### start tokenizer server for image understand demo
-```
-python qwen2_tokenizer_video_308.py --port 12345
-```
 ##### run video understand demo
 ```
-root@ax650:/mnt/qtang/llm-test/qwen2.5-vl-3b# ./run_qwen2_5_vl_video.sh
-[I][                            Init][ 129]: LLM init start
-bos_id: -1, eos_id: 151645
-  2% | █                                 |   1 /  40 [0.00s<0.12s, 333.33 count/s] tokenizer init ok
 [I][                            Init][  26]: LLaMaEmbedSelector use mmap
-100% | ████████████████████████████████ |  40 /  40 [40.05s<40.05s, 1.00 count/s] init vpm axmodel ok,remain_cmm(7680 MB)
-[I][                            Init][ 277]: max_token_len : 1023
-[I][                            Init][ 282]: kv_cache_size : 256, kv_cache_num: 1023
-[I][                            Init][ 290]: prefill_token_num : 512
-[I][                            Init][ 292]: vpm_height : 484,vpm_width : 392
-[I][                            Init][ 301]: LLM init ok
 Type "q" to exit, Ctrl+c to stop current running
-prompt >> Describe the video
-image >> video
 video/frame_0000.jpg
 video/frame_0008.jpg
 video/frame_0016.jpg
@@ -208,9 +233,29 @@ video/frame_0032.jpg
 video/frame_0040.jpg
 video/frame_0048.jpg
 video/frame_0056.jpg
-[I][                          Encode][ 416]: image encode time : 1487.557007 ms, size : 991232
-[I][                             Run][ 638]: ttft: 5488.29 ms
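-The video shows two squirrels in an outdoor setting. The background is a blurred mountain range and blue sky, and in the foreground the squirrels are interacting. Their fur is mainly brown and white, and their paws are orange. The squirrels appear to be playing or tussling with each other, with their paws and mouths reaching toward one another. The whole scene looks very natural and vivid.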
 ```
 #### Inference with M.2 Accelerator card
@@ -269,7 +314,10 @@ image >> image/ssd_car.jpg
 [I][                             Run][ 659]: input_num_token:128
 [I][                             Run][ 659]: input_num_token:24
 [I][                             Run][ 796]: ttft: 2067.18 ms
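-This image shows a busy city street. In the foreground, a woman stands on the sidewalk, wearing a black coat and smiling. Next to her is a red double-decker bus with an advertisement that reads “THINGS GET MORE EXITING WHEN YOU SAY ‘YES’ VirginMoney.co.uk”. The bus's license plate is “L15”. A black van is parked next to the bus. In the background, some shops and pedestrians are visible, and streetlights and shop signs line both sides of the street. The overall setting looks very busy and modern.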
 [N][                             Run][ 949]: hit eos,avg 4.12 token/s
 ```
@@ -328,7 +376,9 @@ video/frame_0056.jpg
 [I][                             Run][ 659]: input_num_token:128
 [I][                             Run][ 659]: input_num_token:125
 [I][                             Run][ 796]: ttft: 3049.59 ms
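-The video shows two squirrels in an outdoor setting. The background is a blurred mountain range and blue sky, and in the foreground the squirrels are interacting. Their fur is a mix of brown and gray, and their paws are orange. The squirrels appear to be playing or tussling with each other, with their paws and mouths reaching toward one another. The whole scene looks very natural and vivid.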
 [N][                             Run][ 949]: hit eos,avg 4.15 token/s
 ```

   - [M.2 Accelerator card](https://axcl-docs.readthedocs.io/zh-cn/latest/doc_guide_hardware.html)
 **Image Process**
+|Chips| input size | image num | image encoder | ttft(384 tokens) | w8a16 | DDR | Flash |
 |--|--|--|--|--|--|--|--|
+|AX650| 448*448 | 1 | 780 ms | 1651 ms | 5.9 tokens/sec| 4.3 GiB |  4.6 GiB  |
 **Video Process**
 |Chips| input size | image num | image encoder |ttft(512 tokens) | w8a16 | DDR | Flash |
 |--|--|--|--|--|--|--|--|
+|AX650| 308*308 | 8  | 1400 ms | 2455 ms | 5.9 tokens/sec| 4.4 GiB |  4.7 GiB  |
 The DDR column is the amount of CMM memory the model consumes. Make sure the CMM allocation on the development board exceeds this value.
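
The CMM figure can be cross-checked against a run log: the runtime prints a `Total CMM` line at start-up and a `Left CMM` line once initialisation finishes, and the difference is roughly what the model occupies. The following is a minimal sketch, assuming the log format shown in the demo output in this README. `cmm_used_mb` is a hypothetical helper, not part of the AXERA tooling, and the GiB conversion assumes the runtime's "MB" means MiB.

```python
import re


def cmm_used_mb(log_text: str) -> int:
    """Return Total CMM minus Left CMM (in MB) parsed from a run log."""
    total = re.search(r"Total CMM:\s*(\d+)\s*MB", log_text)
    left = re.search(r"Left CMM:\s*(\d+)\s*MB", log_text)
    if total is None or left is None:
        raise ValueError("log must contain both 'Total CMM' and 'Left CMM' lines")
    return int(total.group(1)) - int(left.group(1))


# Two lines copied from the image demo log in this README.
sample_log = """\
[I][                            Init][ 136]: Total CMM:7478 MB
[I][                            Init][ 357]: Left CMM:2939 MB
"""

used_mb = cmm_used_mb(sample_log)
used_gib = used_mb / 1024  # assumption: MB here means MiB
print(f"CMM consumed: {used_mb} MB ({used_gib:.2f} GiB)")
```

For the image demo log this gives 4539 MB, about 4.43 GiB, which is in the same range as the DDR column. The measured value also includes anything else holding CMM at the time, so treat it as an estimate.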
 ![](./image/ssd_car.jpg)
 ```
+(base) root@ax650:~/AXERA-TECH/Qwen2.5-VL-3B-Instruct# ./run_qwen2_5_vl_image.sh
+[I][                            Init][ 134]: LLM init start
+[I][                            Init][ 136]: Total CMM:7478 MB
+tokenizer_type = 1
+  2% | █                                 |   1 /  39 [0.31s<12.21s, 3.19 count/s] tokenizer init ok
 [I][                            Init][  26]: LLaMaEmbedSelector use mmap
+  5% | ██                                |   2 /  39 [0.31s<6.10s, 6.39 count/s] embed_selector init ok
+[I][                            Init][ 181]: attr.axmodel_num:36
+102% | █████████████████████████████████ |  40 /  39 [17.30s<16.86s, 2.31 count/s] init vpm axmodel ok,remain_cmm(2939 MB)
+[I][                            Init][ 287]: image encoder output float32
+[I][                            Init][ 317]: max_token_len : 1023
+[I][                            Init][ 322]: kv_cache_size : 256, kv_cache_num: 1023
+[I][                            Init][ 330]: prefill_token_num : 128
+[I][                            Init][ 334]: grp: 1, prefill_max_token_num : 1
+[I][                            Init][ 334]: grp: 2, prefill_max_token_num : 128
+[I][                            Init][ 334]: grp: 3, prefill_max_token_num : 256
+[I][                            Init][ 334]: grp: 4, prefill_max_token_num : 384
+[I][                            Init][ 334]: grp: 5, prefill_max_token_num : 512
+[I][                            Init][ 338]: prefill_max_token_num : 512
+[E][                     load_config][ 277]: config file(post_config.json) open failed
+[W][                            Init][ 351]: load postprocess config(post_config.json) failed
+[I][                            Init][ 355]: LLM init ok
+[I][                            Init][ 357]: Left CMM:2939 MB
 Type "q" to exit, Ctrl+c to stop current running
+prompt >> what in the images?
 image >> image/ssd_car.jpg
+[I][                     EncodeImage][ 432]: pixel_values size 1
+[I][                     EncodeImage][ 433]: grid_h 32 grid_w 32
+[I][                     EncodeImage][ 460]: image encode time : 781.932983 ms, size : 1
+[I][                          Encode][ 513]: input_ids size:282
+[I][                          Encode][ 521]: offset 15
+[I][                          Encode][ 537]: img_embed.size:1, 524288
+[I][                          Encode][ 553]: out_embed size:577536
+[I][                          Encode][ 554]: input_ids size 282
+[I][                          Encode][ 556]: position_ids size:282
+[I][                             Run][ 575]: input token num : 282, prefill_split_num : 3
+[I][                             Run][ 609]: input_num_token:128
+[I][                             Run][ 609]: input_num_token:128
+[I][                             Run][ 609]: input_num_token:26
+[I][                             Run][ 798]: ttft: 1651.51 ms
+The image shows a red double-decker bus on a city street. The bus has an advertisement on its side that reads,
+"THINGS GET MORE EXITING WHEN YOU SAY 'YES' VirginMoney.co.uk." The bus is parked on the side of the road,
+and there is a person standing next to it. The background features a building with large windows and a few pedestrians walking on the sidewalk.
+ The street appears to be in an urban area, possibly in a city like London.
+[N][                             Run][ 924]: hit eos,avg 5.83 token/s
 ```
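
The `prefill_split_num` and `input_num_token` lines in the log above suggest that the prompt is prefilled in fixed chunks of `prefill_token_num` (128) tokens, with a shorter final chunk. The sketch below reproduces that arithmetic. It is inferred from the log output only, and `split_prefill` is a hypothetical name, not the runtime's actual function.

```python
def split_prefill(num_tokens: int, chunk: int = 128) -> list[int]:
    """Split a prompt of num_tokens into prefill chunks of at most `chunk` tokens."""
    if num_tokens < 0 or chunk <= 0:
        raise ValueError("num_tokens must be >= 0 and chunk must be > 0")
    sizes = [chunk] * (num_tokens // chunk)  # full chunks
    remainder = num_tokens % chunk
    if remainder:
        sizes.append(remainder)              # final partial chunk
    return sizes


image_chunks = split_prefill(282)  # input token num from the image demo log
print(len(image_chunks), image_chunks)
```

With 282 input tokens this yields three chunks of 128, 128 and 26 tokens, matching `prefill_split_num : 3` and the three `input_num_token` lines in the log.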
 #### Video understand demo
 Pre-process the video frames into 308x308 images before running the demo.
 ##### run video understand demo
 ```
+(base) root@ax650:~/AXERA-TECH/Qwen2.5-VL-3B-Instruct# ./run_qwen2_5_vl_video.sh
+[I][                            Init][ 134]: LLM init start
+[I][                            Init][ 136]: Total CMM:7478 MB
+tokenizer_type = 1
+  2% | █                                 |   1 /  39 [0.32s<12.36s, 3.15 count/s] tokenizer init ok
 [I][                            Init][  26]: LLaMaEmbedSelector use mmap
+  5% | ██                                |   2 /  39 [0.32s<6.20s, 6.29 count/s] embed_selector init ok
+[I][                            Init][ 181]: attr.axmodel_num:36
+102% | █████████████████████████████████ |  40 /  39 [17.79s<17.35s, 2.25 count/s] init vpm axmodel ok,remain_cmm(3094 MB)
+[I][                            Init][ 287]: image encoder output float32
+[I][                            Init][ 317]: max_token_len : 1023
+[I][                            Init][ 322]: kv_cache_size : 256, kv_cache_num: 1023
+[I][                            Init][ 330]: prefill_token_num : 128
+[I][                            Init][ 334]: grp: 1, prefill_max_token_num : 1
+[I][                            Init][ 334]: grp: 2, prefill_max_token_num : 128
+[I][                            Init][ 334]: grp: 3, prefill_max_token_num : 256
+[I][                            Init][ 334]: grp: 4, prefill_max_token_num : 384
+[I][                            Init][ 334]: grp: 5, prefill_max_token_num : 512
+[I][                            Init][ 338]: prefill_max_token_num : 512
+[E][                     load_config][ 277]: config file(post_config.json) open failed
+[W][                            Init][ 351]: load postprocess config(post_config.json) failed
+[I][                            Init][ 355]: LLM init ok
+[I][                            Init][ 357]: Left CMM:3094 MB
 Type "q" to exit, Ctrl+c to stop current running
+prompt >> what is this?
+video >> video
 video/frame_0000.jpg
 video/frame_0008.jpg
 video/frame_0016.jpg
 video/frame_0040.jpg
 video/frame_0048.jpg
 video/frame_0056.jpg
+[I][                     EncodeImage][ 432]: pixel_values size 4
+[I][                     EncodeImage][ 433]: grid_h 22 grid_w 22
+[I][                     EncodeImage][ 460]: image encode time : 1484.067993 ms, size : 4
+[I][                          Encode][ 513]: input_ids size:509
+[I][                          Encode][ 521]: offset 15
+[I][                          Encode][ 537]: img_embed.size:4, 247808
+[I][                          Encode][ 544]: offset:136
+[I][                          Encode][ 544]: offset:257
+[I][                          Encode][ 544]: offset:378
+[I][                          Encode][ 553]: out_embed size:1042432
+[I][                          Encode][ 554]: input_ids size 509
+[I][                          Encode][ 556]: position_ids size:509
+[I][                             Run][ 575]: input token num : 509, prefill_split_num : 4
+[I][                             Run][ 609]: input_num_token:128
+[I][                             Run][ 609]: input_num_token:128
+[I][                             Run][ 609]: input_num_token:128
+[I][                             Run][ 609]: input_num_token:125
+[I][                             Run][ 798]: ttft: 2455.20 ms
+This image shows two ground squirrels, also known as marmots, engaging in a playful interaction.
+They are standing on their hind legs and appear to be playfully biting or nipping at each other. The background features a scenic mountain landscape with a clear blue sky.
+[N][                             Run][ 924]: hit eos,avg 5.82 token/s
 ```
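
The `grid_h`/`grid_w`, `pixel_values size`, `img_embed.size` and `offset` values in the logs can be reproduced with a little arithmetic. The sketch below is an illustration only. The patch size (14), 2x2 spatial merge, temporal patch of 2 frames and embedding width (2048) are assumptions about Qwen2.5-VL-3B that this README does not state; they are chosen because they are consistent with the logged numbers. `visual_layout` is a hypothetical helper, and the first offset of 15 is taken from the log.

```python
import math

PATCH_SIZE = 14      # assumed vision patch size in pixels
SPATIAL_MERGE = 2    # assumed 2x2 patch merge before the LLM
TEMPORAL_PATCH = 2   # assumed number of frames per temporal patch
HIDDEN_SIZE = 2048   # assumed LLM embedding width


def visual_layout(side: int, frames: int, first_offset: int = 15) -> dict:
    """Estimate grid size, frame groups, tokens and embedding sizes for square inputs."""
    grid = side // PATCH_SIZE                      # patches per side
    tokens = (grid // SPATIAL_MERGE) ** 2          # LLM tokens per frame group
    groups = math.ceil(frames / TEMPORAL_PATCH)    # entries in pixel_values
    return {
        "grid": grid,
        "groups": groups,
        "tokens_per_group": tokens,
        "embed_floats_per_group": tokens * HIDDEN_SIZE,
        "offsets": [first_offset + i * tokens for i in range(groups)],
    }


image = visual_layout(448, frames=1)  # one 448*448 image
video = visual_layout(308, frames=8)  # eight 308*308 frames
print(image)
print(video)
```

Under these assumptions the image case gives a 32x32 grid and 524288 floats, and the video case gives a 22x22 grid, 4 groups of 247808 floats, and offsets 15, 136, 257 and 378, matching the logged values. The same width also accounts for `out_embed size`: 282 x 2048 = 577536 and 509 x 2048 = 1042432.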
 #### Inference with M.2 Accelerator card
 [I][                             Run][ 659]: input_num_token:128
 [I][                             Run][ 659]: input_num_token:24
 [I][                             Run][ 796]: ttft: 2067.18 ms
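+This image shows a busy city street. In the foreground, a woman stands on the sidewalk, wearing a black coat and smiling. Next to her is a red double-decker bus,
+and the bus carries an advertisement that reads “THINGS GET MORE EXITING WHEN YOU SAY ‘YES’ VirginMoney.co.uk”. The bus's license plate is “L15”.
+A black van is parked next to the bus. In the background, some shops and pedestrians are visible, and streetlights and shop signs line both sides of the street. The overall setting looks very busy and modern.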
 [N][                             Run][ 949]: hit eos,avg 4.12 token/s
 ```
 [I][                             Run][ 659]: input_num_token:128
 [I][                             Run][ 659]: input_num_token:125
 [I][                             Run][ 796]: ttft: 3049.59 ms
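+The video shows two squirrels in an outdoor setting. The background is a blurred mountain range and blue sky, and in the foreground the squirrels are interacting. Their fur is a mix of brown and gray, and their paws are orange. The squirrels appear to be playing or tussling with each other,
+with their paws and mouths reaching toward one another. The whole scene looks very natural and vivid.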
 [N][                             Run][ 949]: hit eos,avg 4.15 token/s
 ```