---
license: apache-2.0
language:
- en
base_model:
- apple/FastVLM-0.5B
pipeline_tag: image-text-to-text
tags:
- vlm
- en
---
# FastVLM-0.5B
This version of FastVLM-0.5B has been converted to run on the Axera NPU using **w8a16** quantization.
This model has been optimized with the following LoRA:
Compatible with Pulsar2 version: 5.1-patch1.
Note that the model's context length is 1k (1024) tokens and the maximum prefill length is 640 tokens.
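As a rough illustration of these limits, the sketch below checks whether a prompt fits the prefill window and how many tokens remain for the answer. The group sizes are copied from the `prefill_max_token_num` lines in the sample AX650 log further down. The selection rule (smallest group that fits) and the helper names are assumptions for illustration, not documented runtime behavior.

```python
from bisect import bisect_left

CONTEXT_LEN = 1024  # total context (max_token_len in the sample log)
MAX_PREFILL = 640   # maximum prefill length
# Prefill group capacities as printed in the sample log (grp 1..5).
PREFILL_GROUPS = [1, 128, 256, 512, 640]


def pick_prefill_group(num_prompt_tokens: int) -> int:
    """Return the 1-based id of the smallest group that fits the prompt (assumed rule)."""
    if not 1 <= num_prompt_tokens <= MAX_PREFILL:
        raise ValueError(
            f"prompt must be 1..{MAX_PREFILL} tokens, got {num_prompt_tokens}"
        )
    return bisect_left(PREFILL_GROUPS, num_prompt_tokens) + 1


def generation_budget(num_prompt_tokens: int) -> int:
    """Tokens left for the answer once the prompt occupies part of the context."""
    return CONTEXT_LEN - num_prompt_tokens


if __name__ == "__main__":
    for n in (33, 100, 300):
        print(n, pick_prefill_group(n), generation_budget(n))
```

For the 33-token and 100-token prompts in the sample log, this rule selects group 2 in both cases, which matches the `prefill grpid 2` lines there.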
## Conversion tool links
If you are interested in model conversion, you can quantize and export the axmodel yourself, starting from the original repository:
https://huggingface.co/apple/FastVLM-0.5B
How to Convert LLM from Huggingface to axmodel[TODO]
## Supported Platforms
- AX650
- AX650N DEMO Board
- [M4N-Dock(η±θ―ζ΄ΎPro)](https://wiki.sipeed.com/hardware/zh/maixIV/m4ndock/m4ndock.html)
- [M.2 Accelerator card](https://axcl-docs.readthedocs.io/zh-cn/latest/doc_guide_hardware.html)
- AX630C
|Chip|Image encoder|TTFT|Decode speed (w4a16)|
|--|--|--|--|
|AX650N| 59.83 ms (512x512)| 76.36 ms (100tokens)| 34.81 tokens/sec|
|AXCL x86| 51.80 ms (512x512)| 145.05 ms (93tokens)| 17.40 tokens/sec|
|AX630C| 205.961 ms (512x512)| 489.013 ms (99tokens)| 11.67 tokens/sec|
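To read the table as an end-to-end figure, the three columns can be combined into a rough latency estimate: image encoding time, plus TTFT, plus decode time for the generated tokens. This is a back-of-the-envelope sketch. The function name is hypothetical, and it assumes the stages run strictly one after another and that the last column is the decode speed.

```python
def estimate_latency_s(encode_ms: float, ttft_ms: float,
                       decode_tokens_per_s: float, new_tokens: int) -> float:
    """Rough seconds from request to last token, assuming sequential stages."""
    if decode_tokens_per_s <= 0:
        raise ValueError("decode speed must be positive")
    return (encode_ms + ttft_ms) / 1000.0 + new_tokens / decode_tokens_per_s


# Figures copied from the table above: (encode ms, TTFT ms, tokens/sec).
BENCHMARKS = {
    "AX650N": (59.83, 76.36, 34.81),
    "AXCL x86": (51.80, 145.05, 17.40),
    "AX630C": (205.961, 489.013, 11.67),
}

if __name__ == "__main__":
    for chip, (enc, ttft, speed) in BENCHMARKS.items():
        total = estimate_latency_s(enc, ttft, speed, new_tokens=100)
        print(f"{chip}: ~{total:.2f} s for 100 new tokens")
```

For AX650N this comes to roughly 3 seconds for a 100-token answer, almost all of it decode time.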
## How to use
Download all files from this repository to the device. The resulting directory should look like this:
```
$ tree -L 1
.
├── config.json
├── embeds
├── fastvlm_C128_CTX1024_P640_ax620e
├── fastvlm_C128_CTX1024_P640_ax650
├── fastvlm_tokenizer
├── FastVLM_tokenizer.txt
├── images
├── infer_axmodel_620e.py
├── infer_axmodel_650.py
├── main_ax650
├── main_ax650_api
├── main_axcl_x86
├── main_axcl_x86_api
├── post_config.json
├── README.md
├── requirements.txt
├── run_ax650_512.sh
├── run_ax650_api.sh
├── run_axcl_x86_api.sh
├── run_axcl_x86.sh
└── utils

7 directories, 15 files
```
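Before running anything, it can help to confirm that the download is complete. The sketch below checks a directory against a subset of the listing above. Which entries a given runner actually needs is an assumption here, so adjust `REQUIRED` to your target; `missing_entries` is a hypothetical helper name.

```python
from pathlib import Path

# Entries taken from the directory listing above. Treating this subset as
# "required for the AX650 demo" is an assumption, not a documented fact.
REQUIRED = [
    "config.json",
    "post_config.json",
    "embeds",
    "fastvlm_tokenizer",
    "fastvlm_C128_CTX1024_P640_ax650",
    "main_ax650",
    "run_ax650_512.sh",
]


def missing_entries(root, required=REQUIRED):
    """Return the names from `required` that do not exist under `root`."""
    base = Path(root)
    return [name for name in required if not (base / name).exists()]


if __name__ == "__main__":
    missing = missing_entries(".")
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All expected entries are present.")
```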
#### Install Python dependencies
```
pip install -r requirements.txt
```
#### Inference with AX650 Host, such as M4N-Dock(η±θ―ζ΄ΎPro) or AX650 DEMO Board
Run the following command on the Axera board to start a chat conversation:
```
root@ax650:~/FastVLM-0.5B# ./run_ax650_512.sh
[I][ Init][ 134]: LLM init start
tokenizer_type = 3
stop_tokens size: 2
151645
151645
7% | ███ | 2 / 27 [1.06s<14.26s, 1.89 count/s] embed_selector init ok
100% | ████████████████████████████████ | 27 / 27 [2.35s<2.35s, 11.51 count/s] init post axmodel ok,remain_cmm(9222 MB)
[I][ Init][ 252]: IMAGE_CONTEXT_TOKEN: 151646
[I][ Init][ 284]: image encoder input nhwc@uint8
[I][ Init][ 308]: image encoder output float32
[I][ Init][ 318]: image_encoder_height : 512, image_encoder_width: 512
[I][ Init][ 320]: max_token_len : 1024
[I][ Init][ 323]: kv_cache_size : 128, kv_cache_num: 1024
[I][ Init][ 331]: prefill_token_num : 128
[I][ Init][ 335]: grp: 1, prefill_max_token_num : 1
[I][ Init][ 335]: grp: 2, prefill_max_token_num : 128
[I][ Init][ 335]: grp: 3, prefill_max_token_num : 256
[I][ Init][ 335]: grp: 4, prefill_max_token_num : 512
[I][ Init][ 335]: grp: 5, prefill_max_token_num : 640
[I][ Init][ 339]: prefill_max_token_num : 640
[I][ load_config][ 282]: load config:
{
"enable_repetition_penalty": false,
"enable_temperature": true,
"enable_top_k_sampling": true,
"enable_top_p_sampling": false,
"penalty_window": 30,
"repetition_penalty": 2,
"temperature": 0.1,
"top_k": 10,
"top_p": 0.8
}
[I][ Init][ 348]: LLM init ok
Type "q" to exit, Ctrl+c to stop current running
prompt >> who are you
image >>
[I][ Encode][ 470]: input_ids size: 33
[I][ Run][ 604]: input token num : 33, prefill_split_num : 1
[I][ Run][ 619]: prefill grpid 2
[I][ Run][ 646]: input_num_token:33
[I][ Run][ 770]: ttft: 76.40 ms
I am a language model created by Apple Inc. I am designed to assist users in generating human-like text based on the input they provide. I can understand and generate text based on the context and the input provided by the user. I am not capable of generating human-like text, but I can generate text based on the context and the input provided by the user.
[N][ Run][ 879]: hit eos,avg 31.22 token/s
prompt >> describe the image.
image >> ./images/image_1.jpg
[I][ Encode][ 442]: image encode time : 59.83 ms, size : 57344
[I][ Encode][ 496]: imgs_embed.size() : 1, media token size : 64
[I][ Run][ 604]: input token num : 100, prefill_split_num : 1
[I][ Run][ 619]: prefill grpid 2
[I][ Run][ 646]: input_num_token:100
[I][ Run][ 770]: ttft: 76.36 ms
The image depicts a panda bear in a naturalistic enclosure, likely within a zoo or wildlife sanctuary. The panda is lying on its stomach with its head resting on a log, appearing relaxed and content. The panda's distinctive black and white fur is clearly visible, with its black ears, eyes, and nose contrasting against its white face and body. The enclosure is surrounded by greenery, including bamboo and other plants, which adds to the natural habitat feel of the scene. The panda appears to be in a comfortable and secure environment, with ample space to move around and interact with its surroundings.
[N][ Run][ 879]: hit eos,avg 31.30 token/s
prompt >> q
```
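The `load config` section of the log above shows that temperature and top-k sampling are enabled (`temperature` 0.1, `top_k` 10), while top-p sampling and the repetition penalty are disabled. The sketch below shows what these two settings conventionally mean. It is a generic plain-Python illustration, not the runner's actual implementation, and `sample_next_token` is a hypothetical name.

```python
import math
import random


def sample_next_token(logits, temperature=0.1, top_k=10, rng=random):
    """Pick a token id using top-k filtering and temperature-scaled softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if top_k < 1:
        raise ValueError("top_k must be at least 1")
    # Keep only the top_k highest-scoring token ids.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Temperature-scaled softmax weights, shifted by the max for numerical stability.
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


if __name__ == "__main__":
    fake_logits = [0.2, 3.1, 2.9, -1.0, 0.5]  # made-up values for illustration
    print(sample_next_token(fake_logits))
```

With a temperature as low as 0.1, differences between logits are magnified tenfold before the softmax, so the highest-scoring token dominates and the output stays close to greedy decoding.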
The same model can also be run on the AX650 through the Python script:
```sh
$ python3 infer_axmodel_650.py -v ./fastvlm_C128_CTX1024_P640_ax650/image_encoder_512x512_0.5b_ax650.axmodel -m ./fastvlm_C128_CTX1024_P640_ax650 -t fastvlm_tokenizer -i 512
```
output:
```bash
[INFO] Available providers: ['AxEngineExecutionProvider', 'AXCLRTExecutionProvider']
Loading config, tokenizer and init model.
[INFO] Using provider: AxEngineExecutionProvider
[INFO] Chip type: ChipType.MC50
[INFO] VNPU type: VNPUType.DISABLED
[INFO] Engine version: 2.12.0s
[INFO] Model type: 2 (triple core)
[INFO] Compiler version: 5.1-patch1-dirty 0a5b164f-dirty
Detected prefixes: ['llava_qwen2'], chosen: llava_qwen2, layers: 24
Init InferenceSession: 0%| | 0/24 [00:00<?, ?it/s][INFO] Using provider: AxEngineExecutionProvider
[INFO] Model type: 2 (triple core)
[INFO] Compiler version: 5.1-patch1-dirty 0a5b164f-dirty
[INFO] Using provider: AxEngineExecutionProvider
[INFO] Model type: 2 (triple core)
[INFO] Compiler version: 5.1-patch1-dirty 0a5b164f-dirty
Init InferenceSession: 8%|██████████ | 2/24 [00:00<00:01, 17.39it/s][INFO] Using provider: AxEngineExecutionProvider
[INFO] Model type: 2 (triple core)
[INFO] Compiler version: 5.1-patch1-dirty 0a5b164f-dirty
[INFO] Using provider: AxEngineExecutionProvider
...
Init InferenceSession: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 24/24 [00:00<00:00, 24.30it/s]
[INFO] Using provider: AxEngineExecutionProvider
[INFO] Model type: 2 (triple core)
[INFO] Compiler version: 5.1-patch1-dirty 0a5b164f-dirty
Model loaded successfully!
[INFO]: Enter text to chat, enter an image path for image understanding, or enter q to quit.
prompt<<who are you
slice_indices: [0]
Slice prefill done: 0
answer >> I'm an AI language model, I don't have personal identity or a physical body. I exist solely as a digital entity designed to assist and provide information to users. I don't have a name or a personal identity, but I can provide information and answer questions based on my training data and algorithms. Is there something specific you would like to know about me?
prompt<<./images/ssd_horse.jpg
slice_indices: [0]
Slice prefill done: 0
answer >> The image depicts a person riding a brown horse with a white blaze on its face. The rider is wearing a gray hoodie and blue jeans, and is holding the reins of the horse. The horse is standing on a dirt ground with some grass and trees in the background.
To the left of the horse, there is a brown dog sitting on the ground. The dog is looking up at the rider with its mouth open, as if it is begging or reacting to something.
In the background, there is a gray pickup truck parked on the grass, and a person wearing a red shirt and blue jeans is standing near the truck. There is also a wooden fence and some trees in the background.
The overall scene appears to be taking place in a rural or outdoor setting, possibly a farm or ranch.
prompt<<./images/image_1.jpg
slice_indices: [0]
Slice prefill done: 0
answer >> The image depicts a panda bear in a naturalistic enclosure, likely within a zoo or wildlife sanctuary. The panda is lying on its stomach with its head resting on its front paws, appearing relaxed and content. The panda's distinctive black and white fur is clearly visible, with the black fur covering its ears, eyes, and limbs, while the white fur covers its face, neck, and the underside of its body. The panda's black nose and mouth are also visible.
The panda is surrounded by green foliage, including bamboo shoots and other plants, which are typical of a panda's natural habitat. In the background, there is a wooden structure that resembles a tree stump or a small tree, adding to the naturalistic setting. The ground is covered with dirt and leaves, further emphasizing the natural environment.
The lighting in the image is natural, suggesting that the photo was taken during the day. The overall scene conveys a sense of tranquility and the panda's comfort in its environment.
prompt<<q
[INFO]: Conversation ended. Goodbye.
```
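In the session above, each `prompt<<` line is either a text question, a path to an image, or `q` to quit. The sketch below shows one way such input could be told apart. It is a guess at the behavior for illustration only, not the logic in `infer_axmodel_650.py`; `classify_input` and the list of accepted extensions are assumptions.

```python
from pathlib import Path

# Assumed set of image extensions; the real script may accept others.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp"}


def classify_input(line: str) -> str:
    """Return 'quit', 'image', or 'text' for one line typed at the prompt."""
    text = line.strip()
    if text.lower() == "q":
        return "quit"
    if Path(text).suffix.lower() in IMAGE_SUFFIXES:
        return "image"
    return "text"


if __name__ == "__main__":
    for sample in ("who are you", "./images/ssd_horse.jpg", "q"):
        print(repr(sample), "->", classify_input(sample))
```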
#### Inference with AX630C Host
Run the following command on the Axera board to start a chat conversation:
```sh
python3 infer_axmodel_620e.py -v ./fastvlm_C128_CTX1024_P640_ax620e/image_encoder_512x512_ax620e.axmodel -m ./fastvlm_C128_CTX1024_P640_ax620e -t fastvlm_tokenizer -i 512
```
output:
```
[INFO] Available providers: ['AxEngineExecutionProvider']
Loading config, tokenizer and init model.
[INFO] Using provider: AxEngineExecutionProvider
[INFO] Chip type: ChipType.MC20E
[INFO] VNPU type: VNPUType.DISABLED
[INFO] Engine version: 2.7.2a
[INFO] Model type: 1 (full core)
[INFO] Compiler version: 5.1-patch1-dirty 0a5b164f-dirty
Detected prefixes: ['llava_qwen2'], chosen: llava_qwen2, layers: 24
Init InferenceSession: 0%| | 0/24 [00:00<?, ?it/s][INFO] Using provider: AxEngineExecutionProvider
[INFO] Model type: 1 (full core)
[INFO] Compiler version: 5.1-patch1-dirty 0a5b164f-dirty
Init InferenceSession: 4%|█████ | 1/24 [00:02<00:00, 9.25it/s]
...
[INFO] Using provider: AxEngineExecutionProvider
[INFO] Model type: 1 (full core)
[INFO] Compiler version: 5.1-patch1-dirty 0a5b164f-dirty
Init InferenceSession: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 24/24 [00:02<00:00, 9.12it/s]
[INFO] Using provider: AxEngineExecutionProvider
[INFO] Model type: 1 (full core)
[INFO] Compiler version: 5.1-patch1-dirty 0a5b164f-dirty
Model loaded successfully!
[INFO]: Enter text to chat, enter an image path for image understanding, or enter q to quit.
prompt<<who are you
slice_indices: [0]
Slice prefill done: 0
answer >> I'm an AI language model, I don't have personal identity or a physical body. I exist solely as a digital creation created by Apple. I don't have a name or a personal identity. I'm designed to assist and provide information to users. Is there anything else I can help you with?
prompt<<./images/ssd_horse.jpg
slice_indices: [0]
Slice prefill done: 0
answer >> The image depicts a person riding a brown horse with a white blaze on its face. The rider is wearing a blue hoodie and blue jeans, and is holding the reins of the horse. The horse is standing on a dirt ground with some grass and trees in the background. The rider is also holding a rope in their left hand, which is attached to the horse's harness.
To the left of the horse, there is a brown dog standing on the ground, looking up at the rider. The dog appears to be in a begging or pleading position, with its front paws raised and its mouth open.
In the background, there is a gray pickup truck parked on the grass, and a wooden fence can be seen behind the horse and rider. There are also some people visible in the background, including a person in a red shirt and another person in a blue shirt. The overall scene appears to be taking place in an outdoor setting, possibly a ranch or a farm.
prompt<<./images/image_1.jpg
slice_indices: [0]
Slice prefill done: 0
answer >> The image depicts a panda bear in a naturalistic enclosure, likely within a zoo or wildlife sanctuary. The panda is sitting on its hind legs, with its front paws resting on a wooden structure that resembles a tree stump. The panda's distinctive black and white fur is clearly visible, with the black fur covering its ears, eyes, and the area around its nose and mouth, while the white fur covers the rest of its body. The panda's black nose and the black fur around its mouth are also visible.
The panda is surrounded by green foliage, including bamboo shoots and other plants, which are typical of a panda's natural habitat. The ground appears to be covered with dirt and small rocks, and there are some larger rocks and a tree stump in the background. The lighting in the image suggests that it was taken during the daytime, with natural light illuminating the scene. The overall setting appears to be a well-maintained and naturalistic enclosure designed to mimic the panda's natural environment.
prompt<<q
[INFO]: Conversation ended. Goodbye.
```