---
license: apache-2.0
language:
- en
base_model:
- apple/FastVLM-1.5B
pipeline_tag: image-text-to-text
tags:
- vlm
- en
---


# FastVLM-1.5B-GPTQ-Int4

This version of FastVLM-1.5B has been converted to run on the Axera NPU using **w4a16** quantization.

Compatible with Pulsar2 version: 5.1-patch1.

Please note that the model's context length is 1024 tokens and the maximum prefill length is 640 tokens.
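The CLI logs below report a `precompute_len` (cached history) and a `remaining` prefill budget after each turn. A minimal sketch of that bookkeeping, under the assumption (inferred from the log output, not from the ax-llm source) that the remaining budget is simply the 640-token prefill limit minus the cached history:

```python
# Assumed bookkeeping behind the "precompute_len / remaining" log lines.
PREFILL_MAX_TOKEN_NUM = 640   # maximum prefill length
MAX_TOKEN_LEN = 1024          # total context length

def remaining_prefill(precompute_len: int) -> int:
    """Tokens still available for prefill after `precompute_len`
    tokens of history have been cached."""
    return max(0, PREFILL_MAX_TOKEN_NUM - precompute_len)

print(remaining_prefill(48))   # matches the log line "precompute_len:48, remaining:592"
print(remaining_prefill(412))  # matches the log line "precompute_len:412, remaining:228"
```

Once the history approaches 640 tokens, use `"reset"` or `"dd"` in the CLI to free prefill budget.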

## Conversion tool links

If you are interested in model conversion, you can quantize and export the axmodel yourself from the original repo:

https://huggingface.co/apple/FastVLM-1.5B

How to convert an LLM from Hugging Face to axmodel: [TODO]

## Support Platform

- AX650
  - AX650N DEMO Board
  - [M4N-Dock(爱芯派Pro)](https://wiki.sipeed.com/hardware/zh/maixIV/m4ndock/m4ndock.html)
  - [M.2 Accelerator card](https://docs.m5stack.com/zh_CN/ai_hardware/LLM-8850_Card)
 
|Chip|Image encoder|TTFT|Decode (w4a16)|CMM (GiB)|
|--|--|--|--|--|
|AX650|237.49 ms (1024x1024)|418.43 ms (291 tokens)|19.87 tokens/s|1.4|
|AXCL x86|233.93 ms (1024x1024)|779.51 ms (286 tokens)|12.47 tokens/s|1.4|
|AX650|58.33 ms (512x512)|128.92 ms (100 tokens)|19.87 tokens/s|1.4|
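A rough end-to-end latency estimate can be derived from the table: image encoding, then time-to-first-token, then the remaining output tokens at the steady-state decode rate. This is a back-of-the-envelope sketch using the table's numbers, not a measured benchmark:

```python
def estimated_total_ms(image_encoder_ms: float, ttft_ms: float,
                       output_tokens: int, tokens_per_sec: float) -> float:
    """Rough end-to-end latency: image encoding + time-to-first-token
    + decode time for the remaining tokens at the steady-state rate."""
    decode_ms = (output_tokens - 1) / tokens_per_sec * 1000.0
    return image_encoder_ms + ttft_ms + decode_ms

# AX650, 1024x1024 image, 100 output tokens (numbers from the table above)
print(round(estimated_total_ms(237.49, 418.43, 100, 19.87)), "ms")
```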

## How to use

## Install axllm
Option 1: clone the repository and run the install script:

```shell
git clone -b axllm https://github.com/AXERA-TECH/ax-llm.git
cd ax-llm
./install.sh
```

Option 2: one-line install (default branch `axllm`):

```shell
curl -fsSL https://raw.githubusercontent.com/AXERA-TECH/ax-llm/axllm/install.sh | bash
```

Option 3: download the executable exported by GitHub Actions CI (for users without a build environment):

Go to
`https://github.com/AXERA-TECH/ax-llm/actions?query=branch%3Aaxllm`,
download the **latest CI-exported executable** (`axllm`), then:

```shell
chmod +x axllm
sudo mv axllm /usr/bin/axllm
```

## Model download (Hugging Face)
Create the model directory, then download the model into it:

```shell
mkdir -p AXERA-TECH/FastVLM-1.5B-GPTQ-Int4
hf download AXERA-TECH/FastVLM-1.5B-GPTQ-Int4 --local-dir AXERA-TECH/FastVLM-1.5B-GPTQ-Int4

# structure of the downloaded files
tree -L 3
.
`-- AXERA-TECH
    `-- FastVLM-1.5B-GPTQ-Int4
        |-- FastVLM_tokenizer.txt
        |-- README.md
        |-- config.json
        |-- image.png
        |-- image_encoder_1024x1024.axmodel
        |-- image_encoder_512x512.axmodel
        |-- llava_qwen2_p128_l0_together.axmodel
        ...
        |-- llava_qwen2_p128_l9_together.axmodel
        |-- llava_qwen2_post.axmodel
        |-- model.embed_tokens.weight.bfloat16.bin
        |-- post_config.json
        `-- vision_cache

3 directories, 37 files
```
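An interrupted download can leave the directory incomplete. A hypothetical sanity check (not part of axllm) that verifies the key files from the listing above are present before running inference:

```python
from pathlib import Path

# Key files from the directory listing above; the per-layer
# llava_qwen2_p128_l*_together.axmodel files are checked by pattern.
REQUIRED = [
    "config.json",
    "FastVLM_tokenizer.txt",
    "image_encoder_1024x1024.axmodel",
    "image_encoder_512x512.axmodel",
    "llava_qwen2_post.axmodel",
    "model.embed_tokens.weight.bfloat16.bin",
    "post_config.json",
]

def missing_files(model_dir: str) -> list[str]:
    """Return the names of required files missing from model_dir."""
    root = Path(model_dir)
    return [name for name in REQUIRED if not (root / name).exists()]

missing = missing_files("AXERA-TECH/FastVLM-1.5B-GPTQ-Int4")
if missing:
    print("incomplete download, missing:", missing)
```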

## Inference on an AX650 host, such as the M4N-Dock (爱芯派Pro) or the AX650N DEMO Board

### Run (CLI)

```shell
root@ax650:~# axllm run AXERA-TECH/FastVLM-1.5B-GPTQ-Int4/
[I][                            Init][ 138]: LLM init start
tokenizer_type = 3
 96% | ███████████████████████████████   |  30 /  31 [3.66s<3.78s, 8.20 count/s] init post axmodel ok,remain_cmm(10593 MB)
[I][                            Init][ 199]: max_token_len : 1024
[I][                            Init][ 202]: kv_cache_size : 256, kv_cache_num: 1024
[I][                            Init][ 205]: prefill_token_num : 128
[I][                            Init][ 209]: grp: 1, prefill_max_kv_cache_num : 1
[I][                            Init][ 209]: grp: 2, prefill_max_kv_cache_num : 128
[I][                            Init][ 209]: grp: 3, prefill_max_kv_cache_num : 256
[I][                            Init][ 209]: grp: 4, prefill_max_kv_cache_num : 512
[I][                            Init][ 209]: grp: 5, prefill_max_kv_cache_num : 640
[I][                            Init][ 214]: prefill_max_token_num : 640
[I][                            Init][  27]: LLaMaEmbedSelector use mmap
100% | ████████████████████████████████ |  31 /  31 [3.66s<3.66s, 8.47 count/s] embed_selector init ok
[W][                            Init][ 480]: classic vision size override: cfg=448x448 -> model=1024x1024 (from input shape)
[I][                            Init][ 666]: VisionModule init ok: type=FastVLM, tokens_per_block=256, embed_size=1536, out_dtype=fp32
[I][                     load_config][ 282]: load config:
{
    "enable_repetition_penalty": false,
    "enable_temperature": false,
    "enable_top_k_sampling": false,
    "enable_top_p_sampling": false,
    "penalty_window": 20,
    "repetition_penalty": 1.2,
    "temperature": 0.9,
    "top_k": 10,
    "top_p": 0.8
}

[I][                            Init][ 272]: LLM init ok
Type "q" to exit
Ctrl+c to stop current running
"reset" to reset kvcache
"dd" to remove last conversation.
"pp" to print history.
VLM enabled: after each prompt, input image path (empty = text-only). Use "video:<frames_dir>" for video.
----------------------------------------
prompt >> who are you
image >>
[I][                      SetKVCache][ 406]: prefill_grpid:2 kv_cache_num:128 precompute_len:0 input_num_token:22
[I][                      SetKVCache][ 408]: current prefill_max_token_num:640
[I][                      SetKVCache][ 409]: first run
[I][                             Run][ 457]: input token num : 22, prefill_split_num : 1
[I][                             Run][ 497]: prefill chunk p=0 history_len=0 grpid=1 kv_cache_num=0 input_tokens=22
[I][                             Run][ 519]: prefill indices shape: p=0 idx_elems=128 idx_rows=1 pos_rows=0
[I][                             Run][ 627]: ttft: 137.01 ms
I am an AI language model, I am here to help answer any questions you may have. How can I assist you today?

[N][                             Run][ 709]: hit eos,avg 14.77 token/s

[I][                      GetKVCache][ 380]: precompute_len:48, remaining:592
prompt >> describe the image
image >> ./AXERA-TECH/FastVLM-1.5B-GPTQ-Int4/image.png
[I][                EncodeForContent][ 994]: vision cache store: ./AXERA-TECH/FastVLM-1.5B-GPTQ-Int4/image.png
[I][                      SetKVCache][ 406]: prefill_grpid:4 kv_cache_num:512 precompute_len:48 input_num_token:271
[I][                      SetKVCache][ 408]: current prefill_max_token_num:512
[I][                             Run][ 457]: input token num : 271, prefill_split_num : 3
[I][                             Run][ 497]: prefill chunk p=0 history_len=48 grpid=2 kv_cache_num=128 input_tokens=128
[I][                             Run][ 519]: prefill indices shape: p=0 idx_elems=128 idx_rows=1 pos_rows=0
[I][                             Run][ 497]: prefill chunk p=1 history_len=176 grpid=3 kv_cache_num=256 input_tokens=128
[I][                             Run][ 519]: prefill indices shape: p=1 idx_elems=128 idx_rows=1 pos_rows=0
[I][                             Run][ 497]: prefill chunk p=2 history_len=304 grpid=4 kv_cache_num=512 input_tokens=15
[I][                             Run][ 519]: prefill indices shape: p=2 idx_elems=128 idx_rows=1 pos_rows=0
[I][                             Run][ 627]: ttft: 403.77 ms
The image depicts three astronauts standing in a forest, wearing full space suits with helmets. The scene is surreal and otherworldly, as the astronauts are dressed in space suits and are surrounded by a natural environment. The image is in black and white, which adds to the surreal and dreamlike quality of the scene. The astronauts appear to be exploring the forest, and the contrast between the natural environment and the space suits creates a striking and thought-provoking image.

[N][                             Run][ 709]: hit eos,avg 14.79 token/s

[I][                      GetKVCache][ 380]: precompute_len:412, remaining:228
prompt >> how many people in the image?
image >>
[I][                EncodeForContent][ 926]: vision cache hit (mem): ./AXERA-TECH/FastVLM-1.5B-GPTQ-Int4/image.png
[I][                      SetKVCache][ 406]: prefill_grpid:4 kv_cache_num:512 precompute_len:412 input_num_token:17
[I][                      SetKVCache][ 408]: current prefill_max_token_num:128
[I][                             Run][ 457]: input token num : 17, prefill_split_num : 1
[I][                             Run][ 497]: prefill chunk p=0 history_len=412 grpid=4 kv_cache_num=512 input_tokens=17
[I][                             Run][ 519]: prefill indices shape: p=0 idx_elems=128 idx_rows=1 pos_rows=0
[I][                             Run][ 627]: ttft: 168.52 ms
There are three people in the image.

[N][                             Run][ 709]: hit eos,avg 14.69 token/s

[I][                      GetKVCache][ 380]: precompute_len:437, remaining:203
prompt >> q
```
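The logs above show the prefill being split into chunks of `prefill_token_num = 128`: 22 input tokens fit in one chunk, while 271 tokens are split into three (128 + 128 + 15). A sketch of that chunking arithmetic, inferred from the log output rather than taken from the ax-llm source:

```python
# Assumed chunking behind "prefill_split_num" in the logs above.
PREFILL_TOKEN_NUM = 128

def prefill_chunks(input_tokens: int) -> list[int]:
    """Split an input of `input_tokens` tokens into prefill chunks
    of at most PREFILL_TOKEN_NUM tokens each."""
    chunks = []
    while input_tokens > 0:
        step = min(PREFILL_TOKEN_NUM, input_tokens)
        chunks.append(step)
        input_tokens -= step
    return chunks

print(prefill_chunks(22))   # one chunk, matching "prefill_split_num : 1"
print(prefill_chunks(271))  # three chunks, matching "prefill_split_num : 3"
```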

### Start the server (OpenAI-compatible)

```shell
root@ax650:~# axllm serve AXERA-TECH/FastVLM-1.5B-GPTQ-Int4/
[I][                            Init][ 138]: LLM init start
tokenizer_type = 3
 96% | ███████████████████████████████   |  30 /  31 [2.72s<2.81s, 11.02 count/s] init post axmodel ok,remain_cmm(10593 MB)
[I][                            Init][ 199]: max_token_len : 1024
[I][                            Init][ 202]: kv_cache_size : 256, kv_cache_num: 1024
[I][                            Init][ 205]: prefill_token_num : 128
[I][                            Init][ 209]: grp: 1, prefill_max_kv_cache_num : 1
[I][                            Init][ 209]: grp: 2, prefill_max_kv_cache_num : 128
[I][                            Init][ 209]: grp: 3, prefill_max_kv_cache_num : 256
[I][                            Init][ 209]: grp: 4, prefill_max_kv_cache_num : 512
[I][                            Init][ 209]: grp: 5, prefill_max_kv_cache_num : 640
[I][                            Init][ 214]: prefill_max_token_num : 640
[I][                            Init][  27]: LLaMaEmbedSelector use mmap
100% | ████████████████████████████████ |  31 /  31 [2.72s<2.72s, 11.38 count/s] embed_selector init ok
[W][                            Init][ 480]: classic vision size override: cfg=448x448 -> model=1024x1024 (from input shape)
[I][                            Init][ 666]: VisionModule init ok: type=FastVLM, tokens_per_block=256, embed_size=1536, out_dtype=fp32
[I][                     load_config][ 282]: load config:
{
    "enable_repetition_penalty": false,
    "enable_temperature": false,
    "enable_top_k_sampling": false,
    "enable_top_p_sampling": false,
    "penalty_window": 20,
    "repetition_penalty": 1.2,
    "temperature": 0.9,
    "top_k": 10,
    "top_p": 0.8
}

[I][                            Init][ 272]: LLM init ok
Starting server on port 8000 with model 'AXERA-TECH/FastVLM-1.5B-GPTQ-Int4'...
OpenAI API Server starting on http://0.0.0.0:8000
Max concurrency: 1
Models: AXERA-TECH/FastVLM-1.5B-GPTQ-Int4
```

### OpenAI client example

```python
from openai import OpenAI

API_URL = "http://127.0.0.1:8000/v1"
MODEL = "AXERA-TECH/FastVLM-1.5B-GPTQ-Int4"

messages = [
    {"role": "system", "content": [{"type": "text", "text": "you are a helpful assistant."}]},
    {"role": "user", "content": "hello"},
]

client = OpenAI(api_key="not-needed", base_url=API_URL)
completion = client.chat.completions.create(
    model=MODEL,
    messages=messages,
)

print(completion.choices[0].message.content)
```
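The CLI session above supplies images by path; over the OpenAI-compatible API, vision-capable servers commonly accept an `image_url` content part carrying a base64 data URL. Whether this server version supports that exact shape is an assumption to verify against your build; a sketch of constructing such a message:

```python
import base64

# Assumption: the server accepts OpenAI-style "image_url" content parts
# with a base64 data URL. Verify against your axllm server version.
def image_message(prompt: str, image_path: str) -> dict:
    """Build a user message pairing a text prompt with a PNG image."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }
```

Pass e.g. `messages=[image_message("describe the image", "image.png")]` to `client.chat.completions.create` as in the example above.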


### OpenAI streaming example

```python
from openai import OpenAI

API_URL = "http://127.0.0.1:8000/v1"
MODEL = "AXERA-TECH/FastVLM-1.5B-GPTQ-Int4"

messages = [
    {"role": "system", "content": [{"type": "text", "text": "you are a helpful assistant."}]},
    {"role": "user", "content": "hello"},
]

client = OpenAI(api_key="not-needed", base_url=API_URL)
stream = client.chat.completions.create(
    model=MODEL,
    messages=messages,
    stream=True,
)

print("assistant:")
for ev in stream:
    delta = getattr(ev.choices[0], "delta", None)
    if delta and getattr(delta, "content", None):
        print(delta.content, end="", flush=True)
print()
```