---
license: apache-2.0
language:
- en
base_model:
- apple/FastVLM-1.5B
pipeline_tag: image-text-to-text
tags:
- vlm
- en
---
# FastVLM-1.5B
This version of FastVLM-1.5B has been converted to run on the Axera NPU using **w8a16** quantization.

Compatible with Pulsar2 version: 5.1-patch1.

Please note that this build has a context length of 1k tokens and a maximum prefill length of 640 tokens.
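To make the limits above concrete, here is a minimal sketch of a request-budget check. The constants come from this model card (1k context, 640-token prefill); the function name and signature are hypothetical, not part of the runtime API.

```python
# Hypothetical helper illustrating this build's limits (not part of the runtime API).
CONTEXT_LEN = 1024   # total KV-cache / context length (1k tokens)
PREFILL_MAX = 640    # maximum number of prompt tokens that can be prefilled

def fits_budget(prompt_tokens: int, max_new_tokens: int) -> bool:
    """Return True if a request fits within this build's limits."""
    if prompt_tokens > PREFILL_MAX:
        return False  # prompt exceeds the prefill window
    # prompt plus generated tokens must fit in the context
    return prompt_tokens + max_new_tokens <= CONTEXT_LEN

# e.g. a 291-token image+text prompt leaves room for up to 733 new tokens
```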
## Conversion tools
If you are interested in model conversion, you can try exporting the axmodel from the original repo:

https://huggingface.co/apple/FastVLM-1.5B

How to convert an LLM from Hugging Face to axmodel: [TODO]
## Support Platform
- AX650
- AX650N DEMO Board
- [M4N-Dock(爱芯派Pro)](https://wiki.sipeed.com/hardware/zh/maixIV/m4ndock/m4ndock.html)
- [M.2 Accelerator card](https://axcl-docs.readthedocs.io/zh-cn/latest/doc_guide_hardware.html)
|Chip|Image Encoder|TTFT|Decode (w8a16)|
|--|--|--|--|
|AX650|231.07 ms (1024x1024)|567.97 ms (291 tokens)|11.53 tokens/sec|
|AXCL x86|233.84 ms (1024x1024)|904.36 ms (285 tokens)|8.65 tokens/sec|
|AX650|58.56 ms (512x512)|179.66 ms (100 tokens)|11.53 tokens/sec|
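A rough end-to-end latency estimate can be read off the table, assuming the three columns are additive stages (image encoding, then prefill up to first token, then autoregressive decode). The helper below is an illustrative sketch, not a measured benchmark.

```python
def estimated_latency_ms(image_encode_ms: float, ttft_ms: float,
                         new_tokens: int, decode_tps: float) -> float:
    """Rough total latency: image encode + time-to-first-token + decode time.

    Assumes the benchmark columns are additive stages (an approximation).
    """
    return image_encode_ms + ttft_ms + new_tokens / decode_tps * 1000.0

# AX650, 1024x1024 image, generating 100 tokens at 11.53 tokens/sec:
total = estimated_latency_ms(231.07, 567.97, 100, 11.53)  # ~9.5 seconds
```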
## How to use
Download all files from this repository to the device
```
$ tree -L 1
.
├── config.json
├── fastvlm_ax650_context_1k_prefill_640
├── fastvlm_tokenizer
├── FastVLM_tokenizer.txt
├── images
├── infer_axmodel.py
├── main_ax650
├── main_ax650_api
├── main_axcl_x86
├── main_axcl_x86_api
├── post_config.json
├── README.md
├── requirements.txt
├── run_ax650_1024.sh
├── run_ax650_512.sh
├── run_ax650_api.sh
├── run_axcl_x86_api.sh
├── run_axcl_x86.sh
└── utils
5 directories, 15 files
```
#### Install dependencies
```
pip install -r requirements.txt
```
#### Inference on an AX650 host, such as M4N-Dock(爱芯派Pro) or the AX650N DEMO Board
```
root@ax650:~/FastVLM-1.5B# ./run_ax650_1024.sh
[I][ Init][ 134]: LLM init start
tokenizer_type = 3
stop_tokens size: 2
151645
151645
  6% | ██                              |   2 / 31 [1.16s<18.04s, 1.72 count/s] embed_selector init ok
100% | ████████████████████████████████ |  31 / 31 [4.29s<4.29s, 7.22 count/s] init post axmodel ok,remain_cmm(7954 MB)
[I][                            Init][ 252]: IMAGE_CONTEXT_TOKEN: 151646
[I][ Init][ 284]: image encoder input nhwc@uint8
[I][ Init][ 308]: image encoder output float32
[I][ Init][ 318]: image_encoder_height : 1024, image_encoder_width: 1024
[I][ Init][ 320]: max_token_len : 1024
[I][ Init][ 323]: kv_cache_size : 256, kv_cache_num: 1024
[I][ Init][ 331]: prefill_token_num : 128
[I][ Init][ 335]: grp: 1, prefill_max_token_num : 1
[I][ Init][ 335]: grp: 2, prefill_max_token_num : 128
[I][ Init][ 335]: grp: 3, prefill_max_token_num : 256
[I][ Init][ 335]: grp: 4, prefill_max_token_num : 512
[I][ Init][ 335]: grp: 5, prefill_max_token_num : 640
[I][ Init][ 339]: prefill_max_token_num : 640
[I][ load_config][ 282]: load config:
{
"enable_repetition_penalty": false,
"enable_temperature": true,
"enable_top_k_sampling": true,
"enable_top_p_sampling": false,
"penalty_window": 50,
"repetition_penalty": 1.15,
"temperature": 0.8,
"top_k": 10,
"top_p": 0.9
}
[I][ Init][ 348]: LLM init ok
Type "q" to exit, Ctrl+c to stop current running
prompt >> who are you
image >>
[I][ Encode][ 470]: input_ids size: 33
[I][ Run][ 604]: input token num : 33, prefill_split_num : 1
[I][ Run][ 619]: prefill grpid 2
[I][ Run][ 646]: input_num_token:33
[I][ Run][ 770]: ttft: 181.57 ms
I am FastVLM, an AI language model developed by Apple Inc.
[N][ Run][ 879]: hit eos,avg 11.49 token/s
prompt >> describe the image
image >> ./images/image_1.jpg
[I][ Encode][ 442]: image encode time : 235.81 ms, size : 393216
[I][ Encode][ 496]: imgs_embed.size() : 1, media token size : 256
[I][ Run][ 604]: input token num : 291, prefill_split_num : 3
[I][ Run][ 619]: prefill grpid 4
[I][ Run][ 646]: input_num_token:128
[I][ Run][ 646]: input_num_token:128
[I][ Run][ 646]: input_num_token:35
[I][ Run][ 770]: ttft: 567.97 ms
The image shows a giant panda sitting in a bamboo forest. The panda is predominantly black and white, with a black body and white face. Its eyes are black, and it has black ears and a black nose. The panda is holding some bamboo leaves in its paws, suggesting that it is eating. The panda is sitting on a bed of bamboo, which is a key food source for the species. The background of the image includes more bamboo trees and some fallen branches. The panda's posture and the surrounding environment give the impression of a peaceful, natural setting. The panda is looking directly at the camera, and its expression appears calm and content. Overall, the image captures the beauty and tranquility of the panda in its natural habitat.
[N][ Run][ 879]: hit eos,avg 11.51 token/s
prompt >> q
```
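The init and run logs above show how prompts are prefilled: the runtime picks the smallest prefill group (grp 1–5, capacities 1/128/256/512/640 tokens) that holds the whole prompt, then feeds it in fixed 128-token chunks plus a remainder. The sketch below reproduces that scheduling as seen in the logs; the function names are hypothetical, not the runtime's actual API.

```python
# Sketch of the prefill scheduling visible in the logs (names are hypothetical).
PREFILL_GROUPS = [1, 128, 256, 512, 640]  # grp 1..5 capacities from the init log
PREFILL_CHUNK = 128                       # prefill_token_num from the init log

def pick_group(n_tokens: int) -> int:
    """Smallest prefill group that can hold the whole prompt (1-based grpid)."""
    for grpid, capacity in enumerate(PREFILL_GROUPS, start=1):
        if n_tokens <= capacity:
            return grpid
    raise ValueError("prompt exceeds prefill_max_token_num")

def split_chunks(n_tokens: int) -> list:
    """Prefill runs as fixed 128-token chunks plus one remainder chunk."""
    chunks, remaining = [], n_tokens
    while remaining > 0:
        chunks.append(min(PREFILL_CHUNK, remaining))
        remaining -= chunks[-1]
    return chunks

# 33-token prompt  -> grpid 2, one chunk of 33          (matches the first run)
# 291-token prompt -> grpid 4, chunks [128, 128, 35]    (matches the image run)
```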
#### Inference with the Python API
Run the following command to start a chat session (the sample output below was captured on an x86 host with an AXCL M.2 accelerator card):
```sh
$ python infer_axmodel.py -v ./fastvlm_ax650_context_1k_prefill_640/image_encoder_1024x1024.axmodel -m ./fastvlm_ax650_context_1k_prefill_640 -t ./fastvlm_tokenizer/ -i 1024
```
output:
```bash
[INFO] Available providers: ['AXCLRTExecutionProvider']
Loading config, tokenizer and init model.
Detected prefixes: ['llava_qwen2'], chosen: llava_qwen2, layers: 28
Init InferenceSession: 0%| | 0/28 [00:00<?, ?it/s][INFO] Using provider: AXCLRTExecutionProvider
[INFO] SOC Name: AX650N
[INFO] VNPU type: VNPUType.DISABLED
[INFO] Compiler version: 5.1-patch1-dirty 140e8d4a-dirty
Init InferenceSession:   4%|████                                                | 1/28 [00:01<00:28,  1.05s/it][INFO] Using provider: AXCLRTExecutionProvider
[INFO] SOC Name: AX650N
[INFO] VNPU type: VNPUType.DISABLED
[INFO] Compiler version: 5.1-patch1-dirty 140e8d4a-dirty
Init InferenceSession:   7%|█████████                                           | 2/28 [00:01<00:21,  1.20it/s][INFO] Using provider: AXCLRTExecutionProvider
...
[INFO] SOC Name: AX650N
[INFO] VNPU type: VNPUType.DISABLED
[INFO] Compiler version: 5.1-patch1-dirty 140e8d4a-dirty
Init InferenceSession: 100%|████████████████████████████████████████████████████| 28/28 [00:19<00:00,  1.43it/s]
[INFO] Using provider: AXCLRTExecutionProvider
[INFO] SOC Name: AX650N
[INFO] VNPU type: VNPUType.DISABLED
[INFO] Compiler version: 5.1-patch1-dirty 140e8d4a-dirty
Model loaded successfully!
[INFO] Using provider: AXCLRTExecutionProvider
[INFO] SOC Name: AX650N
[INFO] VNPU type: VNPUType.DISABLED
[INFO] Compiler version: 5.1-patch1-dirty 140e8d4a-dirty
[INFO]: Enter text to chat, enter an image path for image understanding, or enter q to quit.
prompt<<who are you
slice_indices: [0]
Slice prefill done: 0
answer >> I am an artificial intelligence designed and developed by Apple Inc. I am a natural language processing model that can understand and respond to user input in a conversational manner. I can answer questions, provide information, and engage in discussions on a wide range of topics. I am designed to be helpful, informative, and friendly, and I am constantly learning and improving to provide the best possible experience for users.
prompt<<./images/ssd_horse.jpg
slice_indices: [0, 1, 2]
Slice prefill done: 0
Slice prefill done: 1
Slice prefill done: 2
answer >> The image depicts a serene outdoor scene featuring a person riding a brown horse with a white blaze on its face. The rider, who has short brown hair, is wearing a blue hoodie, blue jeans, and black boots. The horse is equipped with a saddle and a bridle, and it stands on a dirt ground.
In the foreground, a brown dog with a pink collar is sitting on the ground, looking up at the rider with its mouth open, possibly in anticipation or excitement.
In the background, there is a silver pickup truck parked near a fence, and beyond the fence, there are trees and a few people sitting on a bench. The sky is overcast, suggesting a cloudy day. The overall atmosphere of the image is calm and peaceful, capturing a moment of connection between the rider, the horse, and the dog.
prompt<<./images/image_1.jpg
slice_indices: [0, 1, 2]
Slice prefill done: 0
Slice prefill done: 1
Slice prefill done: 2
answer >> The image depicts a panda bear in a natural setting. The panda is sitting on the ground, surrounded by green bamboo leaves and plants. The panda has a distinctive black and white fur pattern, with black patches around its eyes, ears, and limbs, and a white face and body. The panda appears to be holding a bamboo leaf in its mouth, which is a common food source for pandas. The background includes a wooden structure, possibly a part of a bamboo enclosure, and some rocks. The overall scene suggests that the panda is in a zoo or a wildlife sanctuary.
prompt<<q
[INFO]: Conversation finished, goodbye.
``` |