tags:
- vlm
- en
---

# FastVLM-1.5B-GPTQ-Int4

This version of FastVLM-1.5B-GPTQ-Int4 has been converted to run on the Axera NPU using **w4a16** quantization.

```shell
hf download AXERA-TECH/FastVLM-1.5B-GPTQ-Int4 --local-dir .

# structure of the downloaded files
tree -L 3
.
`-- AXERA-TECH
    `-- FastVLM-1.5B-GPTQ-Int4
        |-- FastVLM_tokenizer.txt
        |-- README.md
        |-- config.json
        |-- image.png
        |-- image_encoder_1024x1024.axmodel
        |-- image_encoder_512x512.axmodel
        |-- llava_qwen2_p128_l0_together.axmodel
        ...
        |-- llava_qwen2_p128_l9_together.axmodel
        |-- llava_qwen2_post.axmodel
        |-- model.embed_tokens.weight.bfloat16.bin
        |-- post_config.json
        `-- vision_cache

3 directories, 37 files
```

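Before launching the runtime, it can help to confirm the download is complete. The sketch below is a hypothetical pre-flight check, not part of the axllm toolkit; the file names come from the `tree -L 3` listing above.

```python
import os

# Key files from the repository listing above; extend as needed.
REQUIRED = [
    "config.json",
    "post_config.json",
    "FastVLM_tokenizer.txt",
    "model.embed_tokens.weight.bfloat16.bin",
    "llava_qwen2_post.axmodel",
    "image_encoder_512x512.axmodel",
    "image_encoder_1024x1024.axmodel",
]

def missing_files(model_dir):
    """Return the required files that are absent from model_dir."""
    return [name for name in REQUIRED
            if not os.path.isfile(os.path.join(model_dir, name))]
```

An empty return value from `missing_files("AXERA-TECH/FastVLM-1.5B-GPTQ-Int4")` means the layout above is in place.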
## Inference with AX650 Host, such as M4N-Dock(็ฑ่ฏๆดพPro) or AX650N DEMO Board

### Run (CLI)

```shell
root@ax650:~# axllm run AXERA-TECH/FastVLM-1.5B-GPTQ-Int4/
[I][ Init][ 138]: LLM init start
tokenizer_type = 3
96% | ███████████████████████████████ | 30 / 31 [3.66s<3.78s, 8.20 count/s] init post axmodel ok,remain_cmm(10593 MB)
[I][ Init][ 199]: max_token_len : 1024
[I][ Init][ 202]: kv_cache_size : 256, kv_cache_num: 1024
[I][ Init][ 205]: prefill_token_num : 128
[I][ Init][ 209]: grp: 1, prefill_max_kv_cache_num : 1
[I][ Init][ 209]: grp: 2, prefill_max_kv_cache_num : 128
[I][ Init][ 209]: grp: 3, prefill_max_kv_cache_num : 256
[I][ Init][ 209]: grp: 4, prefill_max_kv_cache_num : 512
[I][ Init][ 209]: grp: 5, prefill_max_kv_cache_num : 640
[I][ Init][ 214]: prefill_max_token_num : 640
[I][ Init][ 27]: LLaMaEmbedSelector use mmap
100% | ████████████████████████████████ | 31 / 31 [3.66s<3.66s, 8.47 count/s] embed_selector init ok
[W][ Init][ 480]: classic vision size override: cfg=448x448 -> model=1024x1024 (from input shape)
[I][ Init][ 666]: VisionModule init ok: type=FastVLM, tokens_per_block=256, embed_size=1536, out_dtype=fp32
[I][ load_config][ 282]: load config:
{
  "enable_repetition_penalty": false,
  ...
  "top_p": 0.8
}

[I][ Init][ 272]: LLM init ok
Type "q" to exit
Ctrl+c to stop current running
"reset" to reset kvcache
"dd" to remove last conversation.
"pp" to print history.
VLM enabled: after each prompt, input image path (empty = text-only). Use "video:<frames_dir>" for video.
----------------------------------------
prompt >> who are you
image >>
[I][ SetKVCache][ 406]: prefill_grpid:2 kv_cache_num:128 precompute_len:0 input_num_token:22
[I][ SetKVCache][ 408]: current prefill_max_token_num:640
[I][ SetKVCache][ 409]: first run
[I][ Run][ 457]: input token num : 22, prefill_split_num : 1
[I][ Run][ 497]: prefill chunk p=0 history_len=0 grpid=1 kv_cache_num=0 input_tokens=22
[I][ Run][ 519]: prefill indices shape: p=0 idx_elems=128 idx_rows=1 pos_rows=0
[I][ Run][ 627]: ttft: 137.01 ms
I am an AI language model, I am here to help answer any questions you may have. How can I assist you today?

[N][ Run][ 709]: hit eos,avg 14.77 token/s

[I][ GetKVCache][ 380]: precompute_len:48, remaining:592
prompt >> describe the image
image >> ./AXERA-TECH/FastVLM-1.5B-GPTQ-Int4/image.png
[I][ EncodeForContent][ 994]: vision cache store: ./AXERA-TECH/FastVLM-1.5B-GPTQ-Int4/image.png
[I][ SetKVCache][ 406]: prefill_grpid:4 kv_cache_num:512 precompute_len:48 input_num_token:271
[I][ SetKVCache][ 408]: current prefill_max_token_num:512
[I][ Run][ 457]: input token num : 271, prefill_split_num : 3
[I][ Run][ 497]: prefill chunk p=0 history_len=48 grpid=2 kv_cache_num=128 input_tokens=128
[I][ Run][ 519]: prefill indices shape: p=0 idx_elems=128 idx_rows=1 pos_rows=0
[I][ Run][ 497]: prefill chunk p=1 history_len=176 grpid=3 kv_cache_num=256 input_tokens=128
[I][ Run][ 519]: prefill indices shape: p=1 idx_elems=128 idx_rows=1 pos_rows=0
[I][ Run][ 497]: prefill chunk p=2 history_len=304 grpid=4 kv_cache_num=512 input_tokens=15
[I][ Run][ 519]: prefill indices shape: p=2 idx_elems=128 idx_rows=1 pos_rows=0
[I][ Run][ 627]: ttft: 403.77 ms
The image depicts three astronauts standing in a forest, wearing full space suits with helmets. The scene is surreal and otherworldly, as the astronauts are dressed in space suits and are surrounded by a natural environment. The image is in black and white, which adds to the surreal and dreamlike quality of the scene. The astronauts appear to be exploring the forest, and the contrast between the natural environment and the space suits creates a striking and thought-provoking image.

[N][ Run][ 709]: hit eos,avg 14.79 token/s

[I][ GetKVCache][ 380]: precompute_len:412, remaining:228
prompt >> how many people in the image?
image >>
[I][ EncodeForContent][ 926]: vision cache hit (mem): ./AXERA-TECH/FastVLM-1.5B-GPTQ-Int4/image.png
[I][ SetKVCache][ 406]: prefill_grpid:4 kv_cache_num:512 precompute_len:412 input_num_token:17
[I][ SetKVCache][ 408]: current prefill_max_token_num:128
[I][ Run][ 457]: input token num : 17, prefill_split_num : 1
[I][ Run][ 497]: prefill chunk p=0 history_len=412 grpid=4 kv_cache_num=512 input_tokens=17
[I][ Run][ 519]: prefill indices shape: p=0 idx_elems=128 idx_rows=1 pos_rows=0
[I][ Run][ 627]: ttft: 168.52 ms
There are three people in the image.

[N][ Run][ 709]: hit eos,avg 14.69 token/s

[I][ GetKVCache][ 380]: precompute_len:437, remaining:203
prompt >> q
```
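In the `describe the image` turn, 271 input tokens on top of 48 tokens of history are prefilled in `prefill_split_num : 3` chunks of at most `prefill_token_num` (128) tokens, each routed to a prefill group whose kv-cache capacity covers the history accumulated so far. The sketch below models that bookkeeping; the constants come from the `Init` lines above, while the group-selection rule is inferred from the log, not documented.

```python
import math

# Values printed during Init: chunk size and per-group kv-cache capacity.
PREFILL_TOKEN_NUM = 128
GROUP_MAX_KV = {1: 1, 2: 128, 3: 256, 4: 512, 5: 640}

def plan_prefill(history_len, input_tokens):
    """Return one (p, history_len, grpid, input_tokens) tuple per prefill chunk."""
    splits = math.ceil(input_tokens / PREFILL_TOKEN_NUM)
    plan = []
    for p in range(splits):
        chunk = min(PREFILL_TOKEN_NUM, input_tokens - p * PREFILL_TOKEN_NUM)
        # Smallest group whose capacity covers the history so far (inferred rule).
        grpid = min(g for g, cap in GROUP_MAX_KV.items() if cap >= history_len)
        plan.append((p, history_len, grpid, chunk))
        history_len += chunk
    return plan
```

`plan_prefill(48, 271)` reproduces the three `prefill chunk` lines of that turn: chunks of 128, 128, and 15 tokens in groups 2, 3, and 4.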

### Start the Service (OpenAI-compatible)

```shell
root@ax650:~# axllm serve AXERA-TECH/FastVLM-1.5B-GPTQ-Int4/
[I][ Init][ 138]: LLM init start
tokenizer_type = 3
96% | ███████████████████████████████ | 30 / 31 [2.72s<2.81s, 11.02 count/s] init post axmodel ok,remain_cmm(10593 MB)
[I][ Init][ 199]: max_token_len : 1024
[I][ Init][ 202]: kv_cache_size : 256, kv_cache_num: 1024
[I][ Init][ 205]: prefill_token_num : 128
[I][ Init][ 209]: grp: 1, prefill_max_kv_cache_num : 1
[I][ Init][ 209]: grp: 2, prefill_max_kv_cache_num : 128
[I][ Init][ 209]: grp: 3, prefill_max_kv_cache_num : 256
[I][ Init][ 209]: grp: 4, prefill_max_kv_cache_num : 512
[I][ Init][ 209]: grp: 5, prefill_max_kv_cache_num : 640
[I][ Init][ 214]: prefill_max_token_num : 640
[I][ Init][ 27]: LLaMaEmbedSelector use mmap
100% | ████████████████████████████████ | 31 / 31 [2.72s<2.72s, 11.38 count/s] embed_selector init ok
[W][ Init][ 480]: classic vision size override: cfg=448x448 -> model=1024x1024 (from input shape)
[I][ Init][ 666]: VisionModule init ok: type=FastVLM, tokens_per_block=256, embed_size=1536, out_dtype=fp32
[I][ load_config][ 282]: load config:
{
  "enable_repetition_penalty": false,
  ...
  "top_p": 0.8
}

[I][ Init][ 272]: LLM init ok
Starting server on port 8000 with model 'AXERA-TECH/FastVLM-1.5B-GPTQ-Int4'...
OpenAI API Server starting on http://0.0.0.0:8000
Max concurrency: 1
Models: AXERA-TECH/FastVLM-1.5B-GPTQ-Int4
```
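Once the server is up, a quick smoke test confirms it is reachable before wiring a full client. This sketch assumes the axllm server exposes the standard OpenAI `GET /v1/models` route; the base URL matches the log above, so adjust it if you serve on a different address or port.

```python
import json
import urllib.request

def list_models(base_url="http://127.0.0.1:8000/v1"):
    """Return the model ids advertised by an OpenAI-compatible server."""
    with urllib.request.urlopen(base_url + "/models", timeout=5) as resp:
        payload = json.load(resp)
    return [entry["id"] for entry in payload.get("data", [])]
```

The returned list should contain `AXERA-TECH/FastVLM-1.5B-GPTQ-Int4`, matching the `Models:` line in the log.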

### OpenAI Client Examples

A minimal text-only request (the user turn and client call below are illustrative completions of the snippet):

```python
from openai import OpenAI

API_URL = "http://127.0.0.1:8000/v1"
MODEL = "AXERA-TECH/FastVLM-1.5B-GPTQ-Int4"

messages = [
    {"role": "system", "content": [{"type": "text", "text": "you are a helpful assistant."}]},
    # Illustrative user turn; any chat message works here.
    {"role": "user", "content": [{"type": "text", "text": "who are you?"}]},
]

client = OpenAI(base_url=API_URL, api_key="none")  # dummy key for the local server
completion = client.chat.completions.create(model=MODEL, messages=messages)
print(completion.choices[0].message.content)
```

For an image request, the image can go in as an `image_url` content part; the base64 data-URL shape below follows the common OpenAI-compatible convention and is an assumption about this server:

```python
import base64

from openai import OpenAI

API_URL = "http://127.0.0.1:8000/v1"
MODEL = "AXERA-TECH/FastVLM-1.5B-GPTQ-Int4"

# Path is illustrative; image.png ships with the downloaded repository.
with open("AXERA-TECH/FastVLM-1.5B-GPTQ-Int4/image.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

messages = [
    {"role": "system", "content": [{"type": "text", "text": "you are a helpful assistant."}]},
    {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            {"type": "text", "text": "describe the image"},
        ],
    },
]

client = OpenAI(base_url=API_URL, api_key="none")
completion = client.chat.completions.create(model=MODEL, messages=messages)
print(completion.choices[0].message.content)
```