wli1995 committed on
Commit 987f503
· verified · 1 Parent(s): 10f34e8

update project

Files changed (1): README.md (+103 −54)
README.md CHANGED
@@ -9,6 +9,8 @@ tags:
 - vlm
 - en
 ---

 # FastVLM-1.5B-GPTQ-Int4

 This version of FastVLM-1.5B-GPTQ-Int4 has been converted to run on the Axera NPU using **w4a16** quantization.
@@ -79,12 +81,23 @@ hf download AXERA-TECH/FastVLM-1.5B-GPTQ-Int4 --local-dir .
 # structure of the downloaded files
 tree -L 3
 .
-└── AXERA-TECH
-    └── FastVLM-1.5B-GPTQ-Int4
-
-
-
-2 directories, 34 files
 ```

 ## Inference with AX650 Host, such as M4N-Dock(爱芯派Pro) or AX650N DEMO Board
@@ -92,21 +105,23 @@ tree -L 3
 ### Run (CLI)

 ```shell
-(base) root@ax650:~# axllm run AXERA-TECH/Qwen3-0.6B/
-[I][ Init][ 127]: LLM init start
-tokenizer_type = 1
-96% | ██████████████████████████████ | 30 / 31 [2.35s<2.42s, 12.79 count/s] init post axmodel ok,remain_cmm(8662 MB)
-[I][ Init][ 188]: max_token_len : 2559
-[I][ Init][ 191]: kv_cache_size : 1024, kv_cache_num: 2559
-[I][ Init][ 194]: prefill_token_num : 128
-[I][ Init][ 198]: grp: 1, prefill_max_kv_cache_num : 1
-[I][ Init][ 198]: grp: 2, prefill_max_kv_cache_num : 512
-[I][ Init][ 198]: grp: 3, prefill_max_kv_cache_num : 1024
-[I][ Init][ 198]: grp: 4, prefill_max_kv_cache_num : 1536
-[I][ Init][ 198]: grp: 5, prefill_max_kv_cache_num : 2048
-[I][ Init][ 203]: prefill_max_token_num : 2048
 [I][ Init][ 27]: LLaMaEmbedSelector use mmap
-100% | ████████████████████████████████ | 31 / 31 [2.35s<2.35s, 13.21 count/s] embed_selector init ok
 [I][ load_config][ 282]: load config:
 {
 "enable_repetition_penalty": false,
@@ -120,49 +135,83 @@ tokenizer_type = 1
 "top_p": 0.8
 }

-[I][ Init][ 224]: LLM init ok
 Type "q" to exit
 Ctrl+c to stop current running
 "reset" to reset kvcache
 "dd" to remove last conversation.
 "pp" to print history.
 ----------------------------------------
 prompt >> who are you
-[I][ SetKVCache][ 357]: prefill_grpid:2 kv_cache_num:512 precompute_len:0 input_num_token:22
-[I][ SetKVCache][ 359]: current prefill_max_token_num:2048
-[I][ SetKVCache][ 360]: first run
-[I][ Run][ 412]: input token num : 22, prefill_split_num : 1
-[I][ Run][ 474]: ttft: 586.40 ms
-<think>
-Okay, the user asked, "Who are you?" I need to respond appropriately. Since I'm an AI assistant, I should acknowledge their question and explain my purpose. I should mention that I'm here to help and that I can assist with various tasks. I should keep the response friendly and open-ended to encourage further interaction. Let me make sure the language is clear and natural.
-</think>
-
-I'm an AI assistant designed to help you with a wide range of questions and tasks. How can I assist you today? 😊
-
-[N][ Run][ 554]: hit eos,avg 15.63 token/s
-
-[I][ GetKVCache][ 331]: precompute_len:130, remaining:1918
 prompt >> q
 ```

 ### Start the Server (OpenAI compatible)

 ```shell
-(base) root@ax650:~# axllm serve AXERA-TECH/Qwen3-0.6B/
-[I][ Init][ 127]: LLM init start
-tokenizer_type = 1
-96% | ██████████████████████████████ | 30 / 31 [2.06s<2.13s, 14.58 count/s] init post axmodel ok,remain_cmm(8662 MB)
-[I][ Init][ 188]: max_token_len : 2559
-[I][ Init][ 191]: kv_cache_size : 1024, kv_cache_num: 2559
-[I][ Init][ 194]: prefill_token_num : 128
-[I][ Init][ 198]: grp: 1, prefill_max_kv_cache_num : 1
-[I][ Init][ 198]: grp: 2, prefill_max_kv_cache_num : 512
-[I][ Init][ 198]: grp: 3, prefill_max_kv_cache_num : 1024
-[I][ Init][ 198]: grp: 4, prefill_max_kv_cache_num : 1536
-[I][ Init][ 198]: grp: 5, prefill_max_kv_cache_num : 2048
-[I][ Init][ 203]: prefill_max_token_num : 2048
 [I][ Init][ 27]: LLaMaEmbedSelector use mmap
-100% | ████████████████████████████████ | 31 / 31 [2.06s<2.06s, 15.07 count/s] embed_selector init ok
 [I][ load_config][ 282]: load config:
 {
 "enable_repetition_penalty": false,
@@ -176,11 +225,11 @@ tokenizer_type = 1
 "top_p": 0.8
 }

-[I][ Init][ 224]: LLM init ok
-Starting server on port 8000 with model 'AXERA-TECH/Qwen3-0.6B'...
 OpenAI API Server starting on http://0.0.0.0:8000
 Max concurrency: 1
-Models: AXERA-TECH/Qwen3-0.6B
 ```

 ### OpenAI Client Example
@@ -189,7 +238,7 @@ Models: AXERA-TECH/Qwen3-0.6B
 from openai import OpenAI

 API_URL = "http://127.0.0.1:8000/v1"
-MODEL = "AXERA-TECH/Qwen3-0.6B"

 messages = [
     {"role": "system", "content": [{"type": "text", "text": "you are a helpful assistant."}]},
@@ -212,7 +261,7 @@ print(completion.choices[0].message.content)
 from openai import OpenAI

 API_URL = "http://127.0.0.1:8000/v1"
-MODEL = "AXERA-TECH/Qwen3-0.6B"

 messages = [
     {"role": "system", "content": [{"type": "text", "text": "you are a helpful assistant."}]},
 
 - vlm
 - en
 ---
+
+
 # FastVLM-1.5B-GPTQ-Int4

 This version of FastVLM-1.5B-GPTQ-Int4 has been converted to run on the Axera NPU using **w4a16** quantization.
 
 # structure of the downloaded files
 tree -L 3
 .
+`-- AXERA-TECH
+    `-- FastVLM-1.5B-GPTQ-Int4
+        |-- FastVLM_tokenizer.txt
+        |-- README.md
+        |-- config.json
+        |-- image.png
+        |-- image_encoder_1024x1024.axmodel
+        |-- image_encoder_512x512.axmodel
+        |-- llava_qwen2_p128_l0_together.axmodel
+        ...
+        |-- llava_qwen2_p128_l9_together.axmodel
+        |-- llava_qwen2_post.axmodel
+        |-- model.embed_tokens.weight.bfloat16.bin
+        |-- post_config.json
+        `-- vision_cache
+
+3 directories, 37 files
 ```
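Before running inference it can help to verify the download is complete. Below is a minimal sanity check of the layout shown in the tree above; the required file names are copied from that listing, and the glob pattern for the per-layer decoder models is an assumption based on the `l0` … `l9` names shown (the elided entries are not spelled out here).

```python
# Sanity-check a downloaded FastVLM-1.5B-GPTQ-Int4 directory against the
# tree listing above. REQUIRED names come from that listing; the glob for
# per-layer decoder models is inferred from the l0/l9 pattern shown.
from pathlib import Path

REQUIRED = [
    "FastVLM_tokenizer.txt",
    "config.json",
    "image_encoder_512x512.axmodel",
    "image_encoder_1024x1024.axmodel",
    "llava_qwen2_post.axmodel",
    "model.embed_tokens.weight.bfloat16.bin",
    "post_config.json",
]

def check_model_dir(root):
    """Return (missing_required_files, decoder_layer_count) for a model dir."""
    root = Path(root)
    missing = [name for name in REQUIRED if not (root / name).exists()]
    layers = sorted(root.glob("llava_qwen2_p128_l*_together.axmodel"))
    return missing, len(layers)
```

An empty `missing` list plus a plausible layer count is a quick signal that `hf download` finished cleanly.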

 ## Inference with AX650 Host, such as M4N-Dock(爱芯派Pro) or AX650N DEMO Board
 
 ### Run (CLI)

 ```shell
+root@ax650:~# axllm run AXERA-TECH/FastVLM-1.5B-GPTQ-Int4/
+[I][ Init][ 138]: LLM init start
+tokenizer_type = 3
+96% | ██████████████████████████████ | 30 / 31 [3.66s<3.78s, 8.20 count/s] init post axmodel ok,remain_cmm(10593 MB)
+[I][ Init][ 199]: max_token_len : 1024
+[I][ Init][ 202]: kv_cache_size : 256, kv_cache_num: 1024
+[I][ Init][ 205]: prefill_token_num : 128
+[I][ Init][ 209]: grp: 1, prefill_max_kv_cache_num : 1
+[I][ Init][ 209]: grp: 2, prefill_max_kv_cache_num : 128
+[I][ Init][ 209]: grp: 3, prefill_max_kv_cache_num : 256
+[I][ Init][ 209]: grp: 4, prefill_max_kv_cache_num : 512
+[I][ Init][ 209]: grp: 5, prefill_max_kv_cache_num : 640
+[I][ Init][ 214]: prefill_max_token_num : 640
 [I][ Init][ 27]: LLaMaEmbedSelector use mmap
+100% | ████████████████████████████████ | 31 / 31 [3.66s<3.66s, 8.47 count/s] embed_selector init ok
+[W][ Init][ 480]: classic vision size override: cfg=448x448 -> model=1024x1024 (from input shape)
+[I][ Init][ 666]: VisionModule init ok: type=FastVLM, tokens_per_block=256, embed_size=1536, out_dtype=fp32
 [I][ load_config][ 282]: load config:
 {
 "enable_repetition_penalty": false,

 "top_p": 0.8
 }

+[I][ Init][ 272]: LLM init ok
 Type "q" to exit
 Ctrl+c to stop current running
 "reset" to reset kvcache
 "dd" to remove last conversation.
 "pp" to print history.
+VLM enabled: after each prompt, input image path (empty = text-only). Use "video:<frames_dir>" for video.
 ----------------------------------------
 prompt >> who are you
+image >>
+[I][ SetKVCache][ 406]: prefill_grpid:2 kv_cache_num:128 precompute_len:0 input_num_token:22
+[I][ SetKVCache][ 408]: current prefill_max_token_num:640
+[I][ SetKVCache][ 409]: first run
+[I][ Run][ 457]: input token num : 22, prefill_split_num : 1
+[I][ Run][ 497]: prefill chunk p=0 history_len=0 grpid=1 kv_cache_num=0 input_tokens=22
+[I][ Run][ 519]: prefill indices shape: p=0 idx_elems=128 idx_rows=1 pos_rows=0
+[I][ Run][ 627]: ttft: 137.01 ms
+I am an AI language model, I am here to help answer any questions you may have. How can I assist you today?
+
+[N][ Run][ 709]: hit eos,avg 14.77 token/s
+
+[I][ GetKVCache][ 380]: precompute_len:48, remaining:592
+prompt >> describe the image
+image >> ./AXERA-TECH/FastVLM-1.5B-GPTQ-Int4/image.png
+[I][ EncodeForContent][ 994]: vision cache store: ./AXERA-TECH/FastVLM-1.5B-GPTQ-Int4/image.png
+[I][ SetKVCache][ 406]: prefill_grpid:4 kv_cache_num:512 precompute_len:48 input_num_token:271
+[I][ SetKVCache][ 408]: current prefill_max_token_num:512
+[I][ Run][ 457]: input token num : 271, prefill_split_num : 3
+[I][ Run][ 497]: prefill chunk p=0 history_len=48 grpid=2 kv_cache_num=128 input_tokens=128
+[I][ Run][ 519]: prefill indices shape: p=0 idx_elems=128 idx_rows=1 pos_rows=0
+[I][ Run][ 497]: prefill chunk p=1 history_len=176 grpid=3 kv_cache_num=256 input_tokens=128
+[I][ Run][ 519]: prefill indices shape: p=1 idx_elems=128 idx_rows=1 pos_rows=0
+[I][ Run][ 497]: prefill chunk p=2 history_len=304 grpid=4 kv_cache_num=512 input_tokens=15
+[I][ Run][ 519]: prefill indices shape: p=2 idx_elems=128 idx_rows=1 pos_rows=0
+[I][ Run][ 627]: ttft: 403.77 ms
+The image depicts three astronauts standing in a forest, wearing full space suits with helmets. The scene is surreal and otherworldly, as the astronauts are dressed in space suits and are surrounded by a natural environment. The image is in black and white, which adds to the surreal and dreamlike quality of the scene. The astronauts appear to be exploring the forest, and the contrast between the natural environment and the space suits creates a striking and thought-provoking image.
+
+[N][ Run][ 709]: hit eos,avg 14.79 token/s
+
+[I][ GetKVCache][ 380]: precompute_len:412, remaining:228
+prompt >> how many people in the image?
+image >>
+[I][ EncodeForContent][ 926]: vision cache hit (mem): ./AXERA-TECH/FastVLM-1.5B-GPTQ-Int4/image.png
+[I][ SetKVCache][ 406]: prefill_grpid:4 kv_cache_num:512 precompute_len:412 input_num_token:17
+[I][ SetKVCache][ 408]: current prefill_max_token_num:128
+[I][ Run][ 457]: input token num : 17, prefill_split_num : 1
+[I][ Run][ 497]: prefill chunk p=0 history_len=412 grpid=4 kv_cache_num=512 input_tokens=17
+[I][ Run][ 519]: prefill indices shape: p=0 idx_elems=128 idx_rows=1 pos_rows=0
+[I][ Run][ 627]: ttft: 168.52 ms
+There are three people in the image.
+
+[N][ Run][ 709]: hit eos,avg 14.69 token/s
+
+[I][ GetKVCache][ 380]: precompute_len:437, remaining:203
 prompt >> q
 ```
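The prefill chunking in the CLI log above can be reproduced arithmetically. The sketch below is a reading of the log, not axllm source code: the constants come from the `Init` lines (`prefill_token_num : 128`, the per-group `prefill_max_kv_cache_num` values), and the rule "pick the first kv-cache group large enough to hold the tokens already cached" is inferred from the `grpid` lines.

```python
# Sketch of the prefill chunk schedule visible in the log above. Constants are
# read from the Init log; the group-selection rule is inferred from the
# "prefill chunk" lines, not taken from the axllm sources.
import math

PREFILL_TOKEN_NUM = 128               # [I][ Init][ 205]: prefill_token_num
GROUP_CAPS = [1, 128, 256, 512, 640]  # prefill_max_kv_cache_num for grp 1..5

def prefill_schedule(history_len, input_tokens):
    """Return (chunk_index, history_len, grpid, chunk_tokens) per prefill chunk."""
    split_num = math.ceil(input_tokens / PREFILL_TOKEN_NUM)
    chunks, done = [], 0
    for p in range(split_num):
        n = min(PREFILL_TOKEN_NUM, input_tokens - done)
        cached = history_len + done
        # first group whose capacity covers what is already in the kv cache
        grpid = next(i + 1 for i, cap in enumerate(GROUP_CAPS) if cap >= cached)
        chunks.append((p, cached, grpid, n))
        done += n
    return chunks

# 271 image+text tokens arriving after 48 cached tokens split into 3 chunks,
# matching the "prefill chunk" lines of the log:
print(prefill_schedule(48, 271))
# -> [(0, 48, 2, 128), (1, 176, 3, 128), (2, 304, 4, 15)]
```

The same function reproduces the text-only turn (`22` tokens, one chunk in `grp 1`) and the follow-up question (`17` tokens after `412` cached, one chunk in `grp 4`).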

 ### Start the Server (OpenAI compatible)

 ```shell
+root@ax650:~# axllm serve AXERA-TECH/FastVLM-1.5B-GPTQ-Int4/
+[I][ Init][ 138]: LLM init start
+tokenizer_type = 3
+96% | ██████████████████████████████ | 30 / 31 [2.72s<2.81s, 11.02 count/s] init post axmodel ok,remain_cmm(10593 MB)
+[I][ Init][ 199]: max_token_len : 1024
+[I][ Init][ 202]: kv_cache_size : 256, kv_cache_num: 1024
+[I][ Init][ 205]: prefill_token_num : 128
+[I][ Init][ 209]: grp: 1, prefill_max_kv_cache_num : 1
+[I][ Init][ 209]: grp: 2, prefill_max_kv_cache_num : 128
+[I][ Init][ 209]: grp: 3, prefill_max_kv_cache_num : 256
+[I][ Init][ 209]: grp: 4, prefill_max_kv_cache_num : 512
+[I][ Init][ 209]: grp: 5, prefill_max_kv_cache_num : 640
+[I][ Init][ 214]: prefill_max_token_num : 640
 [I][ Init][ 27]: LLaMaEmbedSelector use mmap
+100% | ████████████████████████████████ | 31 / 31 [2.72s<2.72s, 11.38 count/s] embed_selector init ok
+[W][ Init][ 480]: classic vision size override: cfg=448x448 -> model=1024x1024 (from input shape)
+[I][ Init][ 666]: VisionModule init ok: type=FastVLM, tokens_per_block=256, embed_size=1536, out_dtype=fp32
 [I][ load_config][ 282]: load config:
 {
 "enable_repetition_penalty": false,

 "top_p": 0.8
 }

+[I][ Init][ 272]: LLM init ok
+Starting server on port 8000 with model 'AXERA-TECH/FastVLM-1.5B-GPTQ-Int4'...
 OpenAI API Server starting on http://0.0.0.0:8000
 Max concurrency: 1
+Models: AXERA-TECH/FastVLM-1.5B-GPTQ-Int4
 ```

 ### OpenAI Client Example

 from openai import OpenAI

 API_URL = "http://127.0.0.1:8000/v1"
+MODEL = "AXERA-TECH/FastVLM-1.5B-GPTQ-Int4"

 messages = [
     {"role": "system", "content": [{"type": "text", "text": "you are a helpful assistant."}]},

 from openai import OpenAI

 API_URL = "http://127.0.0.1:8000/v1"
+MODEL = "AXERA-TECH/FastVLM-1.5B-GPTQ-Int4"

 messages = [
     {"role": "system", "content": [{"type": "text", "text": "you are a helpful assistant."}]},
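The client snippets in the diff are truncated at the hunk boundary, so here is a complete self-contained sketch of building such a request. The endpoint and model name come from the serve log above; sending the image as a base64 `image_url` content part follows the common OpenAI vision schema and is an assumption about what this server accepts, not something the diff confirms.

```python
# Build an OpenAI-style chat.completions payload for the local axllm server.
# API_URL and MODEL come from the serve log; the base64 image_url part is an
# assumption based on the common OpenAI vision schema.
import base64
import json

API_URL = "http://127.0.0.1:8000/v1"
MODEL = "AXERA-TECH/FastVLM-1.5B-GPTQ-Int4"

def build_request(prompt, image_bytes=None):
    """Return the request body; pass PNG bytes to attach an image."""
    user_content = [{"type": "text", "text": prompt}]
    if image_bytes is not None:
        data_uri = "data:image/png;base64," + base64.b64encode(image_bytes).decode("ascii")
        user_content.append({"type": "image_url", "image_url": {"url": data_uri}})
    return {
        "model": MODEL,
        "messages": [
            {"role": "system", "content": [{"type": "text", "text": "you are a helpful assistant."}]},
            {"role": "user", "content": user_content},
        ],
    }

# With the openai package, this payload maps onto (hypothetical usage):
#   client = OpenAI(base_url=API_URL, api_key="none")
#   completion = client.chat.completions.create(**build_request("describe the image", png_bytes))
#   print(completion.choices[0].message.content)
print(json.dumps(build_request("who are you"))[:60])
```

The `api_key` value is a placeholder; an OpenAI-compatible local server typically ignores it but the client library requires one.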