---
license: apache-2.0
language:
- en
base_model:
- apple/FastVLM-1.5B
pipeline_tag: image-text-to-text
tags:
- vlm
- en
---
# FastVLM-1.5B

This version of FastVLM-1.5B has been converted to run on the Axera NPU using **w8a16** quantization.

Compatible with Pulsar2 version: 5.1-patch1.

Note that the model's context length is 1k (1024) tokens and the maximum prefill length is 640 tokens.
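
To make these two limits concrete, here is a minimal sketch of a token budget check. It assumes that the prompt (text plus image tokens) must fit within the prefill limit and that prompt and generated tokens share the 1024-token context; the function name `token_budget` is hypothetical and is not part of this repository.

```python
CONTEXT_LEN = 1024   # total context length (1k), from the note above
MAX_PREFILL = 640    # maximum prefill length, from the note above


def token_budget(text_tokens: int, image_tokens: int = 0) -> int:
    """Return the tokens left for generation, or raise if the prompt is too long.

    Assumption: prompt tokens and generated tokens share one context window.
    """
    prompt_tokens = text_tokens + image_tokens
    if prompt_tokens > MAX_PREFILL:
        raise ValueError(
            f"prompt has {prompt_tokens} tokens, prefill limit is {MAX_PREFILL}"
        )
    return CONTEXT_LEN - prompt_tokens


# Example: a 291-token prompt made of 256 image tokens and 35 text tokens.
print(token_budget(text_tokens=35, image_tokens=256))  # 733
```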

## Conversion tool links

If you are interested in model conversion, you can try exporting the axmodel yourself, starting from the original repository:

https://huggingface.co/apple/FastVLM-1.5B

How to convert an LLM from Hugging Face to axmodel [TODO]

## Supported Platforms

- AX650
  - AX650N DEMO Board
  - [M4N-Dock(爱芯派Pro)](https://wiki.sipeed.com/hardware/zh/maixIV/m4ndock/m4ndock.html)
  - [M.2 Accelerator card](https://axcl-docs.readthedocs.io/zh-cn/latest/doc_guide_hardware.html)

|Platform|Image encoder|TTFT|Decode speed (w8a16)|
|--|--|--|--|
|AX650| 231.07 ms (1024x1024)| 567.97 ms (291 tokens)| 11.53 tokens/sec|
|AXCL x86| 233.84 ms (1024x1024)| 904.36 ms (285 tokens)| 8.65 tokens/sec|
|AX650| 58.56 ms (512x512)| 179.66 ms (100 tokens)| 11.53 tokens/sec|
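
The figures in the table can be combined into a rough end-to-end latency estimate. The sketch below assumes that image encoding, time to first token, and decoding are simply additive, which the table does not state; the helper name `estimate_latency_ms` is hypothetical.

```python
def estimate_latency_ms(image_encode_ms: float, ttft_ms: float,
                        new_tokens: int, tokens_per_sec: float) -> float:
    """Rough total latency: image encode + time to first token + decode time.

    Assumption: the three stages do not overlap and ttft excludes image encoding.
    """
    decode_ms = new_tokens / tokens_per_sec * 1000.0
    return image_encode_ms + ttft_ms + decode_ms


# AX650, 1024x1024 image, 100 generated tokens (values taken from the table).
total = estimate_latency_ms(231.07, 567.97, new_tokens=100, tokens_per_sec=11.53)
print(f"{total / 1000:.1f} s")  # roughly 9.5 s
```

Under these assumptions, decoding dominates for anything longer than a few dozen tokens, so the decode speed column matters more than the encoder time for long answers.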

## How to use

Download all files from this repository to the device. The directory should then look like this:

```
$ tree -L 1
.
├── config.json
├── fastvlm_ax650_context_1k_prefill_640
├── fastvlm_tokenizer
├── FastVLM_tokenizer.txt
├── images
├── infer_axmodel.py
├── main_ax650
├── main_ax650_api
├── main_axcl_x86
├── main_axcl_x86_api
├── post_config.json
├── README.md
├── requirements.txt
├── run_ax650_1024.sh
├── run_ax650_512.sh
├── run_ax650_api.sh
├── run_axcl_x86_api.sh
├── run_axcl_x86.sh
└── utils

5 directories, 15 files
```
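
Before running anything, it can help to confirm that the download is complete. The following is a minimal sketch that checks for the entries shown in the listing above; the helper name `missing_entries` is hypothetical, and the script only tests for existence, not file integrity.

```python
from pathlib import Path
from typing import List

# Entries taken from the directory listing above (README.md omitted).
EXPECTED = [
    "config.json", "post_config.json", "requirements.txt",
    "fastvlm_ax650_context_1k_prefill_640", "fastvlm_tokenizer",
    "FastVLM_tokenizer.txt", "images", "utils", "infer_axmodel.py",
    "main_ax650", "main_ax650_api", "main_axcl_x86", "main_axcl_x86_api",
    "run_ax650_1024.sh", "run_ax650_512.sh", "run_ax650_api.sh",
    "run_axcl_x86.sh", "run_axcl_x86_api.sh",
]


def missing_entries(root: str = ".") -> List[str]:
    """Return the expected files or directories that are absent under root."""
    base = Path(root)
    return [name for name in EXPECTED if not (base / name).exists()]


if __name__ == "__main__":
    missing = missing_entries(".")
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All expected files are present.")
```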

#### Install Python dependencies (transformers)

```
pip install -r requirements.txt
```

#### Inference on an AX650 host, such as the M4N-Dock (爱芯派Pro) or the AX650 DEMO Board

```
root@ax650:~/FastVLM-1.5B# ./run_ax650_1024.sh
[I][                            Init][ 134]: LLM init start
tokenizer_type = 3
stop_tokens size: 2
151645
151645
  6% | ███                               |   2 /  31 [1.16s<18.04s, 1.72 count/s] embed_selector init ok
100% | ████████████████████████████████ |  31 /  31 [4.29s<4.29s, 7.22 count/s] init post axmodel ok,remain_cmm(7954 MB)[I][                            Init][ 252]: IMAGE_CONTEXT_TOKEN: 151646
[I][                            Init][ 284]: image encoder input nhwc@uint8
[I][                            Init][ 308]: image encoder output float32

[I][                            Init][ 318]: image_encoder_height : 1024, image_encoder_width: 1024
[I][                            Init][ 320]: max_token_len : 1024
[I][                            Init][ 323]: kv_cache_size : 256, kv_cache_num: 1024
[I][                            Init][ 331]: prefill_token_num : 128
[I][                            Init][ 335]: grp: 1, prefill_max_token_num : 1
[I][                            Init][ 335]: grp: 2, prefill_max_token_num : 128
[I][                            Init][ 335]: grp: 3, prefill_max_token_num : 256
[I][                            Init][ 335]: grp: 4, prefill_max_token_num : 512
[I][                            Init][ 335]: grp: 5, prefill_max_token_num : 640
[I][                            Init][ 339]: prefill_max_token_num : 640
[I][                     load_config][ 282]: load config:
{
    "enable_repetition_penalty": false,
    "enable_temperature": true,
    "enable_top_k_sampling": true,
    "enable_top_p_sampling": false,
    "penalty_window": 50,
    "repetition_penalty": 1.15,
    "temperature": 0.8,
    "top_k": 10,
    "top_p": 0.9
}

[I][                            Init][ 348]: LLM init ok
Type "q" to exit, Ctrl+c to stop current running
prompt >> who are you
image >>
[I][                          Encode][ 470]: input_ids size: 33
[I][                             Run][ 604]: input token num : 33, prefill_split_num : 1
[I][                             Run][ 619]: prefill grpid 2
[I][                             Run][ 646]: input_num_token:33
[I][                             Run][ 770]: ttft: 181.57 ms
I am FastVLM, an AI language model developed by Apple Inc.

[N][                             Run][ 879]: hit eos,avg 11.49 token/s

prompt >> describe the image
image >> ./images/image_1.jpg
[I][                          Encode][ 442]: image encode time : 235.81 ms, size : 393216
[I][                          Encode][ 496]: imgs_embed.size() : 1, media token size : 256
[I][                             Run][ 604]: input token num : 291, prefill_split_num : 3
[I][                             Run][ 619]: prefill grpid 4
[I][                             Run][ 646]: input_num_token:128
[I][                             Run][ 646]: input_num_token:128
[I][                             Run][ 646]: input_num_token:35
[I][                             Run][ 770]: ttft: 567.97 ms
The image shows a giant panda sitting in a bamboo forest. The panda is predominantly black and white, with a black body and white face. Its eyes are black, and it has black ears and a black nose. The panda is holding some bamboo leaves in its paws, suggesting that it is eating. The panda is sitting on a bed of bamboo, which is a key food source for the species. The background of the image includes more bamboo trees and some fallen branches. The panda's posture and the surrounding environment give the impression of a peaceful, natural setting. The panda is looking directly at the camera, and its expression appears calm and content. Overall, the image captures the beauty and tranquility of the panda in its natural habitat.

[N][                             Run][ 879]: hit eos,avg 11.51 token/s

prompt >> q
```
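
The config printed in the log above enables temperature (0.8) and top-k sampling (k = 10), while top-p and the repetition penalty are disabled. As an illustration of what those two settings mean, here is a minimal pure-Python sketch of top-k sampling with temperature. It is not the runtime's implementation; the function name `sample_top_k` and the order of operations (top-k filter first, then temperature scaling) are assumptions.

```python
import math
import random
from typing import List, Optional


def sample_top_k(logits: List[float], top_k: int = 10, temperature: float = 0.8,
                 rng: Optional[random.Random] = None) -> int:
    """Sample a token id from the top_k highest logits after temperature scaling."""
    rng = rng or random.Random()
    # Keep only the indices of the top_k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Temperature below 1 sharpens the distribution; above 1 flattens it.
    scaled = [logits[i] / temperature for i in top]
    # Softmax weights, shifted by the maximum for numerical stability.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


logits = [0.1, 2.5, -1.0, 1.7, 0.3]
print(sample_top_k(logits, top_k=3, rng=random.Random(0)))  # one of 1, 3, or 4
```

With `top_k=1` the function always returns the highest-scoring token, which corresponds to greedy decoding.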

To chat through the Python demo (`infer_axmodel.py`), run the following command on the Axera board:

```sh
$ python infer_axmodel.py -v ./fastvlm_ax650_context_1k_prefill_640/image_encoder_1024x1024.axmodel -m ./fastvlm_ax650_context_1k_prefill_640 -t ./fastvlm_tokenizer/ -i 1024
```
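Output (the two Chinese log lines read "Enter text to chat, enter an image path for image understanding, or enter q to quit" and "Conversation ended, goodbye"):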

```bash
[INFO] Available providers:  ['AXCLRTExecutionProvider']
Loading config, tokenizer and init model.
Detected prefixes: ['llava_qwen2'], chosen: llava_qwen2, layers: 28
Init InferenceSession:   0%|                                                                                                                          | 0/28 [00:00<?, ?it/s][INFO] Using provider: AXCLRTExecutionProvider
[INFO] SOC Name: AX650N
[INFO] VNPU type: VNPUType.DISABLED
[INFO] Compiler version: 5.1-patch1-dirty 140e8d4a-dirty
Init InferenceSession:   4%|████                                                                                                              | 1/28 [00:01<00:28,  1.05s/it][INFO] Using provider: AXCLRTExecutionProvider
[INFO] SOC Name: AX650N
[INFO] VNPU type: VNPUType.DISABLED
[INFO] Compiler version: 5.1-patch1-dirty 140e8d4a-dirty
Init InferenceSession:   7%|████████▏                                                                                                         | 2/28 [00:01<00:21,  1.20it/s][INFO] Using provider: AXCLRTExecutionProvider
...
[INFO] SOC Name: AX650N
[INFO] VNPU type: VNPUType.DISABLED
[INFO] Compiler version: 5.1-patch1-dirty 140e8d4a-dirty
Init InferenceSession: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 28/28 [00:19<00:00,  1.43it/s]
[INFO] Using provider: AXCLRTExecutionProvider
[INFO] SOC Name: AX650N
[INFO] VNPU type: VNPUType.DISABLED
[INFO] Compiler version: 5.1-patch1-dirty 140e8d4a-dirty
Model loaded successfully!
[INFO] Using provider: AXCLRTExecutionProvider
[INFO] SOC Name: AX650N
[INFO] VNPU type: VNPUType.DISABLED
[INFO] Compiler version: 5.1-patch1-dirty 140e8d4a-dirty
[INFO]: 输入文本进行对话，或者输入图片路径进行图片理解, 或者输入q退出对话。
prompt<<who are you
slice_indices: [0]
Slice prefill done: 0
answer >> I am an artificial intelligence designed and developed by Apple Inc. I am a natural language processing model that can understand and respond to user input in a conversational manner. I can answer questions, provide information, and engage in discussions on a wide range of topics. I am designed to be helpful, informative, and friendly, and I am constantly learning and improving to provide the best possible experience for users.

prompt<<./images/ssd_horse.jpg
slice_indices: [0, 1, 2]
Slice prefill done: 0
Slice prefill done: 1
Slice prefill done: 2
answer >> The image depicts a serene outdoor scene featuring a person riding a brown horse with a white blaze on its face. The rider, who has short brown hair, is wearing a blue hoodie, blue jeans, and black boots. The horse is equipped with a saddle and a bridle, and it stands on a dirt ground.

In the foreground, a brown dog with a pink collar is sitting on the ground, looking up at the rider with its mouth open, possibly in anticipation or excitement.

In the background, there is a silver pickup truck parked near a fence, and beyond the fence, there are trees and a few people sitting on a bench. The sky is overcast, suggesting a cloudy day. The overall atmosphere of the image is calm and peaceful, capturing a moment of connection between the rider, the horse, and the dog.

prompt<<./images/image_1.jpg
slice_indices: [0, 1, 2]
Slice prefill done: 0
Slice prefill done: 1
Slice prefill done: 2
answer >> The image depicts a panda bear in a natural setting. The panda is sitting on the ground, surrounded by green bamboo leaves and plants. The panda has a distinctive black and white fur pattern, with black patches around its eyes, ears, and limbs, and a white face and body. The panda appears to be holding a bamboo leaf in its mouth, which is a common food source for pandas. The background includes a wooden structure, possibly a part of a bamboo enclosure, and some rocks. The overall scene suggests that the panda is in a zoo or a wildlife sanctuary.

prompt<<q
[INFO]: 对话结束，再见。
```
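
Both logs show the prompt being prefilled in slices: the AX650 log splits a 291-token input into chunks of 128, 128, and 35 tokens, and the Python demo reports `slice_indices: [0, 1, 2]` for image prompts. The sketch below reproduces that arithmetic. The chunk size and group capacities are copied from the AX650 init log; the rule for choosing a group (smallest capacity that fits the input) is inferred from only two log entries and is an assumption, as are the function names.

```python
from typing import List

PREFILL_CHUNK = 128  # "prefill_token_num : 128" in the AX650 init log
# "grp: N, prefill_max_token_num : M" lines from the AX650 init log.
GROUP_CAPACITY = {1: 1, 2: 128, 3: 256, 4: 512, 5: 640}


def split_prefill(num_tokens: int, chunk: int = PREFILL_CHUNK) -> List[int]:
    """Split a prompt into consecutive prefill slices of at most `chunk` tokens."""
    return [min(chunk, num_tokens - start) for start in range(0, num_tokens, chunk)]


def pick_group(num_tokens: int) -> int:
    """Pick the smallest group whose capacity covers the prompt (inferred rule)."""
    for grp in sorted(GROUP_CAPACITY):
        if GROUP_CAPACITY[grp] >= num_tokens:
            return grp
    raise ValueError(f"{num_tokens} tokens exceed the maximum prefill length")


print(split_prefill(291), pick_group(291))  # [128, 128, 35] 4
print(split_prefill(33), pick_group(33))    # [33] 2
```

Both printed results match the `prefill_split_num`, `input_num_token`, and `prefill grpid` values in the AX650 log.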