lihongjie committed on
Commit
a0ee61c
·
1 Parent(s): f61442e

first commit

.gitattributes CHANGED
@@ -33,3 +33,14 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ main_ax650 filter=lfs diff=lfs merge=lfs -text
+ video/frame_0000.jpg filter=lfs diff=lfs merge=lfs -text
+ video/frame_0008.jpg filter=lfs diff=lfs merge=lfs -text
+ video/frame_0016.jpg filter=lfs diff=lfs merge=lfs -text
+ video/frame_0024.jpg filter=lfs diff=lfs merge=lfs -text
+ video/frame_0032.jpg filter=lfs diff=lfs merge=lfs -text
+ video/frame_0040.jpg filter=lfs diff=lfs merge=lfs -text
+ video/frame_0048.jpg filter=lfs diff=lfs merge=lfs -text
+ video/frame_0056.jpg filter=lfs diff=lfs merge=lfs -text
+ image/ssd_car.jpg filter=lfs diff=lfs merge=lfs -text
+ image/ssd_horse.jpg filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,3 +1,241 @@
- ---
- license: mit
- ---
+ ---
+ license: mit
+ language:
+ - en
+ - zh
+ base_model:
+ - hfl/Qwen2.5-VL-3B-Instruct-GPTQ-Int4
+ pipeline_tag: image-text-to-text
+ library_name: transformers
+ tags:
+ - Qwen2.5-VL
+ - Qwen2.5-VL-3B-Instruct
+ - Int4
+ - VLM
+ ---
+
+ # Qwen2.5-VL-3B-Instruct
+
+ This version of Qwen2.5-VL-3B-Instruct-GPTQ-Int4 has been converted to run on the Axera NPU using **w4a16** quantization.
+
+ This model has been optimized with the following LoRA:
+
+ Compatible with Pulsar2 version: 3.4
+
+ ## Conversion tool links
+
+ If you are interested in model conversion, you can try exporting the axmodel from the original repo:
+ https://huggingface.co/hfl/Qwen2.5-VL-3B-Instruct-GPTQ-Int4
+
+ [Pulsar2 Link, How to Convert LLM from Huggingface to axmodel](https://pulsar2-docs.readthedocs.io/en/latest/appendix/build_llm.html)
+
+ [AXera NPU HOST LLM Runtime](https://github.com/AXERA-TECH/Qwen2.5-VL-3B-Instruct.axera)
+
+
+ ## Supported Platforms
+
+ - AX650
+ - AX650N DEMO Board
+ - [M4N-Dock(爱芯派Pro)](https://wiki.sipeed.com/hardware/zh/maixIV/m4ndock/m4ndock.html)
+ - [M.2 Accelerator card](https://axcl-docs.readthedocs.io/zh-cn/latest/doc_guide_hardware.html)
+
+ **Image Process**
+ |Chips| input size | image num | image encoder | ttft(320 tokens) | w4a16 | DDR | Flash |
+ |--|--|--|--|--|--|--|--|
+ |AX650| | 1 | ms | ms | tokens/sec| GiB | GiB |
+
+ **Video Process**
+ |Chips| input size | image num | image encoder |ttft(512 tokens) | w4a16 | DDR | Flash |
+ |--|--|--|--|--|--|--|--|
+ |AX650| | 8 | ms | ms | tokens/sec| GiB | GiB |
+
+ The DDR column is the amount of CMM memory the model consumes. Make sure the CMM allocation on the development board is larger than this value.
+
+ ## How to use
+
+ Download all files from this repository to the device.
+
+ **If you are using an AX650 board**
+
+ ### Demo Run
+
+ #### Image understanding demo
+
+ - input text
+
+ ```
+ 描述下图片
+ ```
+
+ - input image
+
+ ![](./image/ssd_car.jpg)
+
+ ```
+ root@ax650:/mnt/qtang/llm-test/qwen2.5-vl-3b# ./run_qwen2_5_vl_image.sh
+ [I][ Init][ 129]: LLM init start
+ bos_id: -1, eos_id: 151645
+ 2% | █ | 1 / 40 [0.01s<0.24s, 166.67 count/s] tokenizer init ok
+ [I][ Init][ 26]: LLaMaEmbedSelector use mmap
+ 100% | ████████████████████████████████ | 40 / 40 [38.23s<38.23s, 1.05 count/s] init vpm axmodel ok,remain_cmm(7600 MB)
+ [I][ Init][ 277]: max_token_len : 1023
+ [I][ Init][ 282]: kv_cache_size : 256, kv_cache_num: 1023
+ [I][ Init][ 290]: prefill_token_num : 320
+ [I][ Init][ 292]: vpm_height : 1024,vpm_width : 392
+ [I][ Init][ 301]: LLM init ok
+ Type "q" to exit, Ctrl+c to stop current running
+
+ prompt >> who are you?
+ image >>
+ [I][ Run][ 638]: ttft: 2854.47 ms
+ I am a large language model created by Alibaba Cloud. I am called Qwen.
+
+ [N][ Run][ 779]: hit eos,avg 6.05 token/s
+
+ prompt >> 描述下图片
+ image >> image/ssd_car.jpg
+ [I][ Encode][ 416]: image encode time : 795.614014 ms, size : 524288
+ [I][ Run][ 638]: ttft: 2856.88 ms
+ 这张图片展示了一条繁忙的城市街道。前景中,一名女子站在人行道上,她穿着黑色外套,面带微笑。她旁边是一辆红色的双层巴士,巴士上有一个广告,
+ 上面写着“THINGS GET MORE EXITING WHEN YOU SAY ‘YES’”。巴士的车牌号是“L15”。巴士旁边停着一辆黑色的小型货车。背景中可以看到一些商店和行人,
+ 街道两旁的建筑物是现代的玻璃幕墙建筑。整体氛围显得繁忙而充满活力。
+
+ [N][ Run][ 779]: hit eos,avg 5.96 token/s
+ ```
+
+ #### Video understanding demo
+
+ Pre-process each frame of the video into a 308x308 image.
+
+ ```
+ root@ax650:/mnt/qtang/llm-test/qwen2.5-vl-3b# ./run_qwen2_5_vl_video.sh
+ [I][ Init][ 129]: LLM init start
+ bos_id: -1, eos_id: 151645
+ 2% | █ | 1 / 40 [0.00s<0.12s, 333.33 count/s] tokenizer init ok
+ [I][ Init][ 26]: LLaMaEmbedSelector use mmap
+ 100% | ████████████████████████████████ | 40 / 40 [40.05s<40.05s, 1.00 count/s] init vpm axmodel ok,remain_cmm(7680 MB)
+ [I][ Init][ 277]: max_token_len : 1023
+ [I][ Init][ 282]: kv_cache_size : 256, kv_cache_num: 1023
+ [I][ Init][ 290]: prefill_token_num : 512
+ [I][ Init][ 292]: vpm_height : 484,vpm_width : 392
+ [I][ Init][ 301]: LLM init ok
+ Type "q" to exit, Ctrl+c to stop current running
+
+ prompt >> 描述下视频
+ image >> video
+ video/frame_0000.jpg
+ video/frame_0008.jpg
+ video/frame_0016.jpg
+ video/frame_0024.jpg
+ video/frame_0032.jpg
+ video/frame_0040.jpg
+ video/frame_0048.jpg
+ video/frame_0056.jpg
+ [I][ Encode][ 416]: image encode time : 1487.557007 ms, size : 991232
+ [I][ Run][ 638]: ttft: 5488.29 ms
+ 视频展示了两只松鼠在户外的场景。背景是模糊的山脉和蓝天,前景中有松鼠在互动。松鼠的毛色主要是棕色和白色,它们的爪子是橙色的。松鼠似乎在互相玩耍或争抢,它们的爪子和嘴巴都伸向对方。整个场景显得非常自然和生动。
+ ```
+
+ #### Inference with M.2 Accelerator card
+ What is the M.2 Accelerator card? This demo runs on a Raspberry Pi 5.
+
+ #### Image understanding demo
+
+ - input text
+
+ ```
+ 描述这张图片
+ ```
+
+ - input image
+
+ ![](./image/ssd_car.jpg)
+
+ ```
+ (base) axera@raspberrypi:~/lhj/Qwen2.5-VL-3B-Instruct $ bash run_qwen2_5_vl_image_axcl_aarch64.sh
+ [I][ Init][ 162]: LLM init start
+ [I][ Init][ 34]: connect http://127.0.0.1:12345 ok
+ [I][ Init][ 267]: IMAGE_CONTEXT_TOKEN: 151655, IMAGE_START_TOKEN: 151652
+ [I][ Init][ 328]: image encoder output float32
+
+ [I][ Init][ 340]: max_token_len : 1023
+ [I][ Init][ 343]: kv_cache_size : 256, kv_cache_num: 1023
+ [I][ Init][ 351]: prefill_token_num : 128
+ [I][ Init][ 355]: grp: 1, prefill_max_token_num : 1
+ [I][ Init][ 355]: grp: 2, prefill_max_token_num : 128
+ [I][ Init][ 355]: grp: 3, prefill_max_token_num : 256
+ [I][ Init][ 355]: grp: 4, prefill_max_token_num : 384
+ [I][ Init][ 355]: grp: 5, prefill_max_token_num : 512
+ [I][ Init][ 359]: prefill_max_token_num : 512
+ ________________________
+ | ID| remain cmm(MB)|
+ ========================
+ | 0| 2286|
+ ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
+ [E][ load_config][ 278]: config file(post_config.json) open failed
+ [W][ Init][ 452]: load postprocess config(post_config.json) failed
+ [I][ Init][ 456]: LLM init ok
+ Type "q" to exit, Ctrl+c to stop current running
+ prompt >> 描述这张图片
+ image >> image/ssd_car.jpg
+ [I][ Encode][ 539]: image encode time : 772.851990 ms, size : 524288
+ [I][ Run][ 625]: input token num : 280, prefill_split_num : 3
+ [I][ Run][ 659]: input_num_token:128
+ [I][ Run][ 659]: input_num_token:128
+ [I][ Run][ 659]: input_num_token:24
+ [I][ Run][ 796]: ttft: 2067.18 ms
+ 这张图片展示了一条繁忙的城市街道。前景中,一名女子站在人行道上,穿着黑色外套,面带微笑。她旁边是一辆红色的双层巴士,巴士上有一个广告,上面写着“THINGS GET MORE EXITING WHEN YOU SAY ‘YES’ VirginMoney.co.uk”。巴士的车牌号是“L15”。巴士旁边停着一辆黑色的面包车。背景中可以看到一些商店和行人,街道两旁有路灯和商店的招牌。整体环境显得非常繁忙和现代。
+
+ [N][ Run][ 949]: hit eos,avg 4.12 token/s
+ ```
+
+ #### Video understanding demo
+
+ Pre-process each frame of the video into a 308x308 image.
+
+ ```
+ (base) axera@raspberrypi:~/lhj/Qwen2.5-VL-3B-Instruct $ bash run_qwen2_5_vl_video_axcl_aarch64.sh
+ [I][ Init][ 162]: LLM init start
+ [I][ Init][ 34]: connect http://127.0.0.1:12345 ok
+ [I][ Init][ 267]: IMAGE_CONTEXT_TOKEN: 151656, IMAGE_START_TOKEN: 151652
+ [I][ Init][ 328]: image encoder output float32
+
+ [I][ Init][ 340]: max_token_len : 1023
+ [I][ Init][ 343]: kv_cache_size : 256, kv_cache_num: 1023
+ [I][ Init][ 351]: prefill_token_num : 128
+ [I][ Init][ 355]: grp: 1, prefill_max_token_num : 1
+ [I][ Init][ 355]: grp: 2, prefill_max_token_num : 128
+ [I][ Init][ 355]: grp: 3, prefill_max_token_num : 256
+ [I][ Init][ 355]: grp: 4, prefill_max_token_num : 384
+ [I][ Init][ 355]: grp: 5, prefill_max_token_num : 512
+ [I][ Init][ 359]: prefill_max_token_num : 512
+ ________________________
+ | ID| remain cmm(MB)|
+ ========================
+ | 0| 2464|
+ ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
+ [E][ load_config][ 278]: config file(post_config.json) open failed
+ [W][ Init][ 452]: load postprocess config(post_config.json) failed
+ [I][ Init][ 456]: LLM init ok
+ Type "q" to exit, Ctrl+c to stop current running
+ prompt >> 描述这个视频的内容
+ image >> video
+ video/frame_0000.jpg
+ video/frame_0008.jpg
+ video/frame_0016.jpg
+ video/frame_0024.jpg
+ video/frame_0032.jpg
+ video/frame_0040.jpg
+ video/frame_0048.jpg
+ video/frame_0056.jpg
+ [I][ Encode][ 539]: image encode time : 1481.107056 ms, size : 991232
+ [I][ Run][ 625]: input token num : 509, prefill_split_num : 4
+ [I][ Run][ 659]: input_num_token:128
+ [I][ Run][ 659]: input_num_token:128
+ [I][ Run][ 659]: input_num_token:128
+ [I][ Run][ 659]: input_num_token:125
+ [I][ Run][ 796]: ttft: 3049.59 ms
+ 视频展示了两只松鼠在户外的场景。背景是模糊的山脉和蓝天,前景中有松鼠在互动。松鼠的毛色是棕色和灰色的混合,它们的爪子是橙色的。松鼠似乎在互相玩耍或争抢,它们的爪子和嘴巴都伸向对方。整个场景显得非常自然和生动。
+
+ [N][ Run][ 949]: hit eos,avg 4.15 token/s
+ ```
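
The M.2 logs in the README report `prefill_token_num : 128` and show the input being fed in pieces (280 tokens as 128, 128, 24; 509 tokens as 128, 128, 128, 125). The sketch below reproduces that arithmetic only; the function name `split_prefill` is hypothetical and this is not the runtime's actual code.

```shell
# Split an input of N tokens into prefill chunks of at most CHUNK tokens.
# CHUNK defaults to 128, the prefill_token_num printed in the M.2 logs.
split_prefill() {
    n=$1
    chunk=${2:-128}
    out=""
    while [ "$n" -gt 0 ]; do
        if [ "$n" -ge "$chunk" ]; then
            step=$chunk
        else
            step=$n
        fi
        # Append the chunk size, space-separated.
        out="${out:+$out }$step"
        n=$((n - step))
    done
    printf '%s\n' "$out"
}

split_prefill 280   # image demo: 280 input tokens
split_prefill 509   # video demo: 509 input tokens
```

The number of chunks printed for each input matches the `prefill_split_num` values in the logs (3 and 4).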
config.json ADDED
File without changes
image/ssd_car.jpg ADDED

Git LFS Details

  • SHA256: 92d459a39a9eef03956257cf9fec84114d9e5df8fb9c0662fb257488cdd4f365
  • Pointer size: 130 Bytes
  • Size of remote file: 50.5 kB
image/ssd_horse.jpg ADDED

Git LFS Details

  • SHA256: ed22f6b4c8c33e50e391e089ede14e8fa9402c623b09dbcf010e804770698fbb
  • Pointer size: 131 Bytes
  • Size of remote file: 123 kB
main_ax650 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9f6cfcf0021a15a3baea0513e0eb6b17bdfbe08928de0e74619e6684a13a1493
+ size 6808392
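
The three added lines are a Git LFS pointer, not the `main_ax650` binary itself: each line is a `key value` pair giving the spec version, the SHA-256 of the real file, and its size in bytes. As a rough sketch, the pointer can be parsed like this (the function name `parse_lfs_pointer` is hypothetical):

```shell
# Read a Git LFS pointer on stdin and print "<sha256> <size-in-bytes>".
parse_lfs_pointer() {
    oid=""
    size=""
    while read -r key value; do
        case $key in
            oid)  oid=${value#sha256:} ;;   # drop the "sha256:" prefix
            size) size=$value ;;
        esac
    done
    printf '%s %s\n' "$oid" "$size"
}

parse_lfs_pointer <<'EOF'
version https://git-lfs.github.com/spec/v1
oid sha256:9f6cfcf0021a15a3baea0513e0eb6b17bdfbe08928de0e74619e6684a13a1493
size 6808392
EOF
```

After the real file has been fetched, the output of `sha256sum main_ax650` and `wc -c < main_ax650` can be compared against these two values to confirm the download is complete.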
qwen2.5_tokenizer.txt ADDED
The diff for this file is too large to render. See raw diff
 
run_qwen2_5_vl_image.sh ADDED
@@ -0,0 +1,22 @@
+ AXMODEL_DIR=./Qwen2.5-VL-3B-Instruct-AX650-chunk_prefill_512
+
+ ./main_ax650 \
+ --template_filename_axmodel "${AXMODEL_DIR}/qwen2_5_vl_p128_l%d_together.axmodel" \
+ --axmodel_num 36 \
+ --filename_image_encoder_axmodedl "${AXMODEL_DIR}/Qwen2.5-VL-3B-Instruct_vision_nchw448.axmodel" \
+ --bos 0 --eos 0 \
+ --dynamic_load_axmodel_layer 0 \
+ --use_mmap_load_embed 1 \
+ --filename_tokenizer_model "qwen2.5_tokenizer.txt" \
+ --filename_post_axmodel "${AXMODEL_DIR}/qwen2_5_vl_post.axmodel" \
+ --use_topk 0 \
+ --filename_tokens_embed "${AXMODEL_DIR}/model.embed_tokens.weight.bfloat16.bin" \
+ --tokens_embed_num 151936 \
+ --tokens_embed_size 2048 \
+ --live_print 1 \
+ --continue 1 \
+ --video 0 \
+ --img_width 448 \
+ --img_height 448 \
+ --vision_start_token_id 151652 \
+ --post_config_path post_config.json
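
The script passes `--tokens_embed_num 151936` and `--tokens_embed_size 2048` alongside a `.bfloat16.bin` embedding file. Assuming that file is a dense matrix of 2-byte bfloat16 values with no header (an assumption; the commit does not describe the file layout), its expected size can be estimated as a sanity check before running the demo. The helper name `embed_bytes` is hypothetical.

```shell
# Expected size in bytes of a headerless [rows x cols] bfloat16 matrix.
# bfloat16 uses 2 bytes per value.
embed_bytes() {
    echo $(( $1 * $2 * 2 ))
}

# Values taken from --tokens_embed_num and --tokens_embed_size.
embed_bytes 151936 2048
```

If the layout assumption holds, `wc -c < "${AXMODEL_DIR}/model.embed_tokens.weight.bfloat16.bin"` should print the same number; a mismatch would suggest a truncated download or a different file format.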
run_qwen2_5_vl_video.sh ADDED
@@ -0,0 +1,22 @@
+ AXMODEL_DIR=./Qwen2.5-VL-3B-Instruct-AX650-chunk_prefill_512
+
+ ./main_ax650 \
+ --template_filename_axmodel "${AXMODEL_DIR}/qwen2_5_vl_p128_l%d_together.axmodel" \
+ --axmodel_num 36 \
+ --filename_image_encoder_axmodedl "${AXMODEL_DIR}/Qwen2.5-VL-3B-Instruct_vision_nhwc.axmodel" \
+ --bos 0 --eos 0 \
+ --dynamic_load_axmodel_layer 0 \
+ --use_mmap_load_embed 1 \
+ --filename_tokenizer_model "qwen2.5_tokenizer.txt" \
+ --filename_post_axmodel "${AXMODEL_DIR}/qwen2_5_vl_post.axmodel" \
+ --use_topk 0 \
+ --filename_tokens_embed "${AXMODEL_DIR}/model.embed_tokens.weight.bfloat16.bin" \
+ --tokens_embed_num 151936 \
+ --tokens_embed_size 2048 \
+ --live_print 1 \
+ --continue 1 \
+ --video 1 \
+ --img_width 308 \
+ --img_height 308 \
+ --vision_start_token_id 151652 \
+ --post_config_path post_config.json
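
The two scripts differ mainly in `--video` and the image size (448x448 for one image, 308x308 for eight frames), and the README logs report encoder output sizes of 524288 and 991232. Those numbers are consistent with the following arithmetic, which assumes 14-pixel patches, 2x2 spatial merging, frames grouped in pairs along time, and 2048 values per vision token (matching `--tokens_embed_size`). These parameters are assumptions about the vision encoder, not something stated in this commit, and `vision_tokens` is a hypothetical helper.

```shell
# Number of vision tokens for FRAMES square frames of side SIDE pixels,
# assuming 14-pixel patches, 2x2 patch merging, and temporal pairs
# (a single image counts as one temporal group).
vision_tokens() {
    side=$1
    frames=$2
    per_side=$((side / 14))
    groups=$(( (frames + 1) / 2 ))
    echo $(( per_side * per_side / 4 * groups ))
}

# Encoder output element counts, at 2048 values per token.
echo $(( $(vision_tokens 448 1) * 2048 ))   # image demo
echo $(( $(vision_tokens 308 8) * 2048 ))   # video demo
```

Under these assumptions the image demo yields 256 vision tokens and the video demo 484, which would also explain why the video path needs the larger 512-token prefill budget.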
video/frame_0000.jpg ADDED

Git LFS Details

  • SHA256: d0cea2769fd052ce3b24c3982a17135dbffd600cd612014c3cffe014c0224ffa
  • Pointer size: 130 Bytes
  • Size of remote file: 54.1 kB
video/frame_0008.jpg ADDED

Git LFS Details

  • SHA256: c812aed3407b41d474d859fedd4d9eaab971482e1dd0e22c5da16a627a740394
  • Pointer size: 130 Bytes
  • Size of remote file: 52.7 kB
video/frame_0016.jpg ADDED

Git LFS Details

  • SHA256: 3cc72377820bd9c47a41ebcae744acd8b3952b54e02854a9cf0b4a70e49def60
  • Pointer size: 130 Bytes
  • Size of remote file: 48.9 kB
video/frame_0024.jpg ADDED

Git LFS Details

  • SHA256: afee75df68ffda9f5ae59b0ba3badf29e56a60acce64554ecc9e49f20854c47c
  • Pointer size: 130 Bytes
  • Size of remote file: 49.2 kB
video/frame_0032.jpg ADDED

Git LFS Details

  • SHA256: 1cea98a54747fb32c1bf7375aae020b3703ee70da6eb967d1a7d590d9f997038
  • Pointer size: 130 Bytes
  • Size of remote file: 49.1 kB
video/frame_0040.jpg ADDED

Git LFS Details

  • SHA256: dc03d027d92549acc1164f01b8450623093b76e3945b9c2eaaa7f0073b827cf5
  • Pointer size: 130 Bytes
  • Size of remote file: 45.5 kB
video/frame_0048.jpg ADDED

Git LFS Details

  • SHA256: f1832ad904c7d25423b1389c769a2287815d1b62acee9474403caa18069d7c52
  • Pointer size: 130 Bytes
  • Size of remote file: 44.9 kB
video/frame_0056.jpg ADDED

Git LFS Details

  • SHA256: 2a000328952b1f092c438f687a55dfaeb822d763b68eb3685ee403c9859d5ebd
  • Pointer size: 130 Bytes
  • Size of remote file: 42.8 kB
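
The eight frame files added under `video/` are numbered 0000 to 0056 in steps of 8, which suggests every eighth frame of the source clip was kept (the sampling method itself is not described in the commit). A small sketch that regenerates the same file list, with the hypothetical helper name `list_frames`:

```shell
# Print COUNT frame paths, sampling every STRIDE-th frame index.
# Defaults mirror the files in this commit: stride 8, 8 frames.
list_frames() {
    stride=${1:-8}
    count=${2:-8}
    i=0
    while [ "$i" -lt "$count" ]; do
        printf 'video/frame_%04d.jpg\n' $((i * stride))
        i=$((i + 1))
    done
}

list_frames
```

The printed paths match the list echoed by the demo after `image >> video` in the README logs.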