---
license: apache-2.0
language:
- en
base_model:
- apple/FastVLM-0.5B
pipeline_tag: image-text-to-text
tags:
- vlm
- en
---
# FastVLM-0.5B

This version of FastVLM-0.5B has been converted to run on the Axera NPU using **w8a16** quantization.

Compatible with Pulsar2 version: 5.1-patch1.

Note that the model's context length is 1k (1,024) tokens and the maximum prefill length is 640 tokens.
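
As a rough illustration of these limits, the sketch below checks whether a prompt fits the prefill window and how many tokens would remain for generation. The constants come from this card and from the runtime log further down, which reports 64 media tokens for one 512x512 image. The helper name `prompt_budget` is hypothetical and not part of this repository, and the code assumes that prompt and generated tokens share the same context window.

```python
CONTEXT_LEN = 1024   # total context length stated in this card
MAX_PREFILL = 640    # maximum prefill length stated in this card
IMAGE_TOKENS = 64    # media tokens per 512x512 image, as reported in the runtime log


def prompt_budget(text_tokens: int, num_images: int = 0) -> dict:
    """Return the prompt size and whether it fits (hypothetical helper)."""
    prompt_tokens = text_tokens + num_images * IMAGE_TOKENS
    fits = prompt_tokens <= MAX_PREFILL
    # Assumption: prompt and generated tokens share the same context window.
    remaining = max(CONTEXT_LEN - prompt_tokens, 0) if fits else 0
    return {
        "prompt_tokens": prompt_tokens,
        "fits_prefill": fits,
        "max_new_tokens": remaining,
    }


if __name__ == "__main__":
    # 36 text tokens plus one image gives the 100-token prompt seen in the log.
    print(prompt_budget(text_tokens=36, num_images=1))
```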

## Conversion tool links

If you are interested in model conversion, you can quantize the model and export an axmodel starting from the original repository:

https://huggingface.co/apple/FastVLM-0.5B

How to convert an LLM from Hugging Face to axmodel [TODO]

## Supported platforms

- AX650
  - AX650N DEMO Board
  - [M4N-Dock(爱芯派Pro)](https://wiki.sipeed.com/hardware/zh/maixIV/m4ndock/m4ndock.html)
  - [M.2 Accelerator card](https://axcl-docs.readthedocs.io/zh-cn/latest/doc_guide_hardware.html)

- AX630C

|Chip|Image encoder|TTFT|Decode speed (w4a16)|
|--|--|--|--|
|AX650N| 59.83 ms (512x512)| 76.36 ms (100 tokens)| 34.81 tokens/sec|
|AXCL x86| 51.80 ms (512x512)| 145.05 ms (93 tokens)| 17.40 tokens/sec|
|AX630C| 205.961 ms (512x512)| 489.013 ms (99 tokens)| 11.67 tokens/sec|
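
To put the table in perspective, the following sketch estimates an end-to-end response time as image encoding plus TTFT plus decode time. It is simple arithmetic over the figures above. The 100-token answer length is an arbitrary assumption, `estimate_latency_ms` is a hypothetical helper, and the TTFT values were measured at slightly different prompt lengths, so treat the result as a ballpark only.

```python
# Figures copied from the table above:
# (image encoder ms, TTFT ms, decode tokens/sec)
BENCHMARKS = {
    "AX650N": (59.83, 76.36, 34.81),
    "AXCL x86": (51.80, 145.05, 17.40),
    "AX630C": (205.961, 489.013, 11.67),
}


def estimate_latency_ms(chip: str, new_tokens: int) -> float:
    """Estimate total time for one image prompt and `new_tokens` generated tokens."""
    encoder_ms, ttft_ms, tokens_per_sec = BENCHMARKS[chip]
    decode_ms = new_tokens / tokens_per_sec * 1000.0
    return encoder_ms + ttft_ms + decode_ms


if __name__ == "__main__":
    for chip in BENCHMARKS:
        seconds = estimate_latency_ms(chip, 100) / 1000.0
        print(f"{chip}: about {seconds:.2f} s for a 100-token answer")
```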


## How to use

Download all files from this repository to the device.

```
$ tree -L 1
.
├── config.json
├── embeds
├── fastvlm_C128_CTX1024_P640_ax620e
├── fastvlm_C128_CTX1024_P640_ax650
├── fastvlm_tokenizer
├── FastVLM_tokenizer.txt
├── images
├── infer_axmodel_620e.py
├── infer_axmodel_650.py
├── main_ax650
├── main_ax650_api
├── main_axcl_x86
├── main_axcl_x86_api
├── post_config.json
├── README.md
├── requirements.txt
├── run_ax650_512.sh
├── run_ax650_api.sh
├── run_axcl_x86_api.sh
├── run_axcl_x86.sh
└── utils

7 directories, 15 files
```
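
Before running anything, it can help to confirm that the download is complete. The sketch below checks for a few of the entries listed above. The selection in `REQUIRED` and the helper name `missing_entries` are assumptions made for illustration, so adjust them to the target you actually use.

```python
from pathlib import Path

# A subset of the entries from the listing above (chosen for illustration).
REQUIRED = [
    "config.json",
    "post_config.json",
    "fastvlm_tokenizer",
    "fastvlm_C128_CTX1024_P640_ax650",
    "main_ax650",
    "run_ax650_512.sh",
]


def missing_entries(repo_dir: str, required=REQUIRED) -> list:
    """Return the required files or directories that are absent from repo_dir."""
    root = Path(repo_dir)
    return [name for name in required if not (root / name).exists()]


if __name__ == "__main__":
    missing = missing_entries(".")
    if missing:
        print(f"Missing: {missing}")
    else:
        print("All expected entries found.")
```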

#### Install transformers

```
pip install -r requirements.txt
```

#### Inference with AX650 Host, such as M4N-Dock(爱芯派Pro) or AX650 DEMO Board

Run the following command on the Axera board to start a chat conversation:

```
root@ax650:~/FastVLM-0.5B# ./run_ax650_512.sh
[I][                            Init][ 134]: LLM init start
tokenizer_type = 3
stop_tokens size: 2
151645
151645
  7% | ███                               |   2 /  27 [1.06s<14.26s, 1.89 count/s] embed_selector init ok
100% | ████████████████████████████████ |  27 /  27 [2.35s<2.35s, 11.51 count/s] init post axmodel ok,remain_cmm(9222 MB)[I][                            Init][ 252]: IMAGE_CONTEXT_TOKEN: 151646
[I][                            Init][ 284]: image encoder input nhwc@uint8
[I][                            Init][ 308]: image encoder output float32

[I][                            Init][ 318]: image_encoder_height : 512, image_encoder_width: 512
[I][                            Init][ 320]: max_token_len : 1024
[I][                            Init][ 323]: kv_cache_size : 128, kv_cache_num: 1024
[I][                            Init][ 331]: prefill_token_num : 128
[I][                            Init][ 335]: grp: 1, prefill_max_token_num : 1
[I][                            Init][ 335]: grp: 2, prefill_max_token_num : 128
[I][                            Init][ 335]: grp: 3, prefill_max_token_num : 256
[I][                            Init][ 335]: grp: 4, prefill_max_token_num : 512
[I][                            Init][ 335]: grp: 5, prefill_max_token_num : 640
[I][                            Init][ 339]: prefill_max_token_num : 640
[I][                     load_config][ 282]: load config:
{
    "enable_repetition_penalty": false,
    "enable_temperature": true,
    "enable_top_k_sampling": true,
    "enable_top_p_sampling": false,
    "penalty_window": 30,
    "repetition_penalty": 2,
    "temperature": 0.1,
    "top_k": 10,
    "top_p": 0.8
}

[I][                            Init][ 348]: LLM init ok
Type "q" to exit, Ctrl+c to stop current running
prompt >> who are you
image >>
[I][                          Encode][ 470]: input_ids size: 33
[I][                             Run][ 604]: input token num : 33, prefill_split_num : 1
[I][                             Run][ 619]: prefill grpid 2
[I][                             Run][ 646]: input_num_token:33
[I][                             Run][ 770]: ttft: 76.40 ms
I am a language model created by Apple Inc. I am designed to assist users in generating human-like text based on the input they provide. I can understand and generate text based on the context and the input provided by the user. I am not capable of generating human-like text, but I can generate text based on the context and the input provided by the user.

[N][                             Run][ 879]: hit eos,avg 31.22 token/s

prompt >> describe the image.
image >> ./images/image_1.jpg
[I][                          Encode][ 442]: image encode time : 59.83 ms, size : 57344
[I][                          Encode][ 496]: imgs_embed.size() : 1, media token size : 64
[I][                             Run][ 604]: input token num : 100, prefill_split_num : 1
[I][                             Run][ 619]: prefill grpid 2
[I][                             Run][ 646]: input_num_token:100
[I][                             Run][ 770]: ttft: 76.36 ms
The image depicts a panda bear in a naturalistic enclosure, likely within a zoo or wildlife sanctuary. The panda is lying on its stomach with its head resting on a log, appearing relaxed and content. The panda's distinctive black and white fur is clearly visible, with its black ears, eyes, and nose contrasting against its white face and body. The enclosure is surrounded by greenery, including bamboo and other plants, which adds to the natural habitat feel of the scene. The panda appears to be in a comfortable and secure environment, with ample space to move around and interact with its surroundings.

[N][                             Run][ 879]: hit eos,avg 31.30 token/s

prompt >> q
```
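
The configuration loaded in the log above enables temperature scaling and top-k sampling (`temperature` 0.1, `top_k` 10), while top-p sampling and the repetition penalty are disabled. As a rough illustration of what those two settings do, here is a minimal pure-Python sketch. It is not the runtime's actual implementation, and `sample_top_k` is a hypothetical name.

```python
import math
import random


def sample_top_k(logits, temperature=0.1, top_k=10, rng=random):
    """Sample a token id from the top_k highest logits after temperature scaling."""
    # Keep only the top_k candidates, highest logit first.
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    candidates = order[:top_k]
    scaled = [logits[i] / temperature for i in candidates]
    # Softmax weights, with the maximum subtracted for numerical stability.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(candidates, weights=weights, k=1)[0]


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.5, -1.0]
    picks = [sample_top_k(logits) for _ in range(1000)]
    print("share of token 0:", picks.count(0) / len(picks))
```

With a temperature as low as 0.1, the highest-scoring token dominates the distribution, so the output stays close to greedy decoding.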

Alternatively, run the Python inference script on the board:

```sh
$ python3 infer_axmodel_650.py -v ./fastvlm_C128_CTX1024_P640_ax650/image_encoder_512x512_0.5b_ax650.axmodel -m ./fastvlm_C128_CTX1024_P640_ax650 -t fastvlm_tokenizer -i 512
```
Output:

```
[INFO] Available providers:  ['AxEngineExecutionProvider', 'AXCLRTExecutionProvider']
Loading config, tokenizer and init model.
[INFO] Using provider: AxEngineExecutionProvider
[INFO] Chip type: ChipType.MC50
[INFO] VNPU type: VNPUType.DISABLED
[INFO] Engine version: 2.12.0s
[INFO] Model type: 2 (triple core)
[INFO] Compiler version: 5.1-patch1-dirty 0a5b164f-dirty
Detected prefixes: ['llava_qwen2'], chosen: llava_qwen2, layers: 24
Init InferenceSession:   0%|                                                                                                                          | 0/24 [00:00<?, ?it/s][INFO] Using provider: AxEngineExecutionProvider
[INFO] Model type: 2 (triple core)
[INFO] Compiler version: 5.1-patch1-dirty 0a5b164f-dirty
[INFO] Using provider: AxEngineExecutionProvider
[INFO] Model type: 2 (triple core)
[INFO] Compiler version: 5.1-patch1-dirty 0a5b164f-dirty
Init InferenceSession:   8%|█████████▌                                                                                                        | 2/24 [00:00<00:01, 17.39it/s][INFO] Using provider: AxEngineExecutionProvider
[INFO] Model type: 2 (triple core)
[INFO] Compiler version: 5.1-patch1-dirty 0a5b164f-dirty
[INFO] Using provider: AxEngineExecutionProvider
...
Init InferenceSession: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 24/24 [00:00<00:00, 24.30it/s]
[INFO] Using provider: AxEngineExecutionProvider
[INFO] Model type: 2 (triple core)
[INFO] Compiler version: 5.1-patch1-dirty 0a5b164f-dirty
Model loaded successfully!
[INFO]: Enter text to chat, enter an image path for image understanding, or enter q to quit.
prompt<<who are you
slice_indices: [0]
Slice prefill done: 0
answer >> I'm an AI language model, I don't have personal identity or a physical body. I exist solely as a digital entity designed to assist and provide information to users. I don't have a name or a personal identity, but I can provide information and answer questions based on my training data and algorithms. Is there something specific you would like to know about me?

prompt<<./images/ssd_horse.jpg
slice_indices: [0]
Slice prefill done: 0
answer >> The image depicts a person riding a brown horse with a white blaze on its face. The rider is wearing a gray hoodie and blue jeans, and is holding the reins of the horse. The horse is standing on a dirt ground with some grass and trees in the background.

To the left of the horse, there is a brown dog sitting on the ground. The dog is looking up at the rider with its mouth open, as if it is begging or reacting to something.

In the background, there is a gray pickup truck parked on the grass, and a person wearing a red shirt and blue jeans is standing near the truck. There is also a wooden fence and some trees in the background.

The overall scene appears to be taking place in a rural or outdoor setting, possibly a farm or ranch.

prompt<<./images/image_1.jpg
slice_indices: [0]
Slice prefill done: 0
answer >> The image depicts a panda bear in a naturalistic enclosure, likely within a zoo or wildlife sanctuary. The panda is lying on its stomach with its head resting on its front paws, appearing relaxed and content. The panda's distinctive black and white fur is clearly visible, with the black fur covering its ears, eyes, and limbs, while the white fur covers its face, neck, and the underside of its body. The panda's black nose and mouth are also visible.

The panda is surrounded by green foliage, including bamboo shoots and other plants, which are typical of a panda's natural habitat. In the background, there is a wooden structure that resembles a tree stump or a small tree, adding to the naturalistic setting. The ground is covered with dirt and leaves, further emphasizing the natural environment.

The lighting in the image is natural, suggesting that the photo was taken during the day. The overall scene conveys a sense of tranquility and the panda's comfort in its environment.

prompt<<q
[INFO]: Conversation ended, goodbye.
```
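
The session above accepts three kinds of input at the prompt: free text, an image path, or `q` to quit. The sketch below shows one way such input could be told apart. It is an assumption for illustration, not the logic of `infer_axmodel_650.py`, and both `classify_input` and the extension set are hypothetical.

```python
from pathlib import Path

# Assumed set of image extensions; the real script may use a different rule.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp"}


def classify_input(line: str) -> str:
    """Classify one line of user input as 'quit', 'image', or 'text'."""
    stripped = line.strip()
    if stripped.lower() == "q":
        return "quit"
    if Path(stripped).suffix.lower() in IMAGE_SUFFIXES:
        return "image"
    return "text"


if __name__ == "__main__":
    for sample in ["who are you", "./images/ssd_horse.jpg", "q"]:
        print(f"{sample!r} -> {classify_input(sample)}")
```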

#### Inference with AX630C Host

Run the following command on the Axera board to start a chat conversation:

```sh
python3 infer_axmodel_620e.py -v ./fastvlm_C128_CTX1024_P640_ax620e/image_encoder_512x512_ax620e.axmodel -m ./fastvlm_C128_CTX1024_P640_ax620e -t fastvlm_tokenizer -i 512
```

Output:

```
[INFO] Available providers:  ['AxEngineExecutionProvider']
Loading config, tokenizer and init model.
[INFO] Using provider: AxEngineExecutionProvider
[INFO] Chip type: ChipType.MC20E
[INFO] VNPU type: VNPUType.DISABLED
[INFO] Engine version: 2.7.2a
[INFO] Model type: 1 (full core)
[INFO] Compiler version: 5.1-patch1-dirty 0a5b164f-dirty
Detected prefixes: ['llava_qwen2'], chosen: llava_qwen2, layers: 24
Init InferenceSession:   0%|                                                                                                                          | 0/24 [00:00<?, ?it/s][INFO] Using provider: AxEngineExecutionProvider
[INFO] Model type: 1 (full core)
[INFO] Compiler version: 5.1-patch1-dirty 0a5b164f-dirty
Init InferenceSession:   4%|████▊                                                                                                             | 1/24 [00:02<00:00,  9.25it/s]
...
[INFO] Using provider: AxEngineExecutionProvider
[INFO] Model type: 1 (full core)
[INFO] Compiler version: 5.1-patch1-dirty 0a5b164f-dirty
Init InferenceSession: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 24/24 [00:02<00:00,  9.12it/s]
[INFO] Using provider: AxEngineExecutionProvider
[INFO] Model type: 1 (full core)
[INFO] Compiler version: 5.1-patch1-dirty 0a5b164f-dirty
Model loaded successfully!
[INFO]: Enter text to chat, enter an image path for image understanding, or enter q to quit.
prompt<<who are you
slice_indices: [0]
Slice prefill done: 0
answer >> I'm an AI language model, I don't have personal identity or a physical body. I exist solely as a digital creation created by Apple. I don't have a name or a personal identity. I'm designed to assist and provide information to users. Is there anything else I can help you with?

prompt<<./images/ssd_horse.jpg
slice_indices: [0]
Slice prefill done: 0
answer >> The image depicts a person riding a brown horse with a white blaze on its face. The rider is wearing a blue hoodie and blue jeans, and is holding the reins of the horse. The horse is standing on a dirt ground with some grass and trees in the background. The rider is also holding a rope in their left hand, which is attached to the horse's harness.

To the left of the horse, there is a brown dog standing on the ground, looking up at the rider. The dog appears to be in a begging or pleading position, with its front paws raised and its mouth open.

In the background, there is a gray pickup truck parked on the grass, and a wooden fence can be seen behind the horse and rider. There are also some people visible in the background, including a person in a red shirt and another person in a blue shirt. The overall scene appears to be taking place in an outdoor setting, possibly a ranch or a farm.

prompt<<./images/image_1.jpg
slice_indices: [0]
Slice prefill done: 0
answer >> The image depicts a panda bear in a naturalistic enclosure, likely within a zoo or wildlife sanctuary. The panda is sitting on its hind legs, with its front paws resting on a wooden structure that resembles a tree stump. The panda's distinctive black and white fur is clearly visible, with the black fur covering its ears, eyes, and the area around its nose and mouth, while the white fur covers the rest of its body. The panda's black nose and the black fur around its mouth are also visible.

The panda is surrounded by green foliage, including bamboo shoots and other plants, which are typical of a panda's natural habitat. The ground appears to be covered with dirt and small rocks, and there are some larger rocks and a tree stump in the background. The lighting in the image suggests that it was taken during the daytime, with natural light illuminating the scene. The overall setting appears to be a well-maintained and naturalistic enclosure designed to mimic the panda's natural environment.

prompt<<q
[INFO]: Conversation ended, goodbye.
```