---
license: apache-2.0
language:
- en
base_model:
- apple/FastVLM-1.5B
pipeline_tag: image-text-to-text
tags:
- vlm
- en
---

# FastVLM-1.5B

This version of FastVLM-1.5B has been converted to run on the Axera NPU using **w8a16** quantization.

This model has been optimized with the following LoRA:

Compatible with Pulsar2 version: 5.1-patch1.

Note that the model's context length is 1k tokens and the maximum prefill length is 640 tokens.

## Conversion tool links

If you are interested in model conversion, you can try exporting the axmodel from the original repo:
https://huggingface.co/apple/FastVLM-1.5B

How to convert an LLM from Hugging Face to axmodel [TODO]

## Supported platforms

- AX650
  - AX650N DEMO Board
  - [M4N-Dock(爱芯派Pro)](https://wiki.sipeed.com/hardware/zh/maixIV/m4ndock/m4ndock.html)
  - [M.2 Accelerator card](https://axcl-docs.readthedocs.io/zh-cn/latest/doc_guide_hardware.html)

|Chips|Image encoder|TTFT|Decode speed (w8a16)|
|--|--|--|--|
|AX650| 231.07 ms (1024x1024)| 567.97 ms (291 tokens)| 11.53 tokens/sec|
|AXCL x86| 233.84 ms (1024x1024)| 904.36 ms (285 tokens)| 8.65 tokens/sec|
|AX650| 58.56 ms (512x512)| 179.66 ms (100 tokens)| 11.53 tokens/sec|

## How to use

Download all files from this repository to the device.

```
$ tree -L 1
.
├── config.json
├── fastvlm_ax650_context_1k_prefill_640
├── fastvlm_tokenizer
├── FastVLM_tokenizer.txt
├── images
├── infer_axmodel.py
├── main_ax650
├── main_ax650_api
├── main_axcl_x86
├── main_axcl_x86_api
├── post_config.json
├── README.md
├── requirements.txt
├── run_ax650_1024.sh
├── run_ax650_512.sh
├── run_ax650_api.sh
├── run_axcl_x86_api.sh
├── run_axcl_x86.sh
└── utils

5 directories, 15 files
```

#### Install dependencies

```
pip install -r requirements.txt
```

#### Inference with AX650 Host, such as M4N-Dock(爱芯派Pro) or AX650 DEMO Board

```
root@ax650:~/FastVLM-1.5B# ./run_ax650_1024.sh
[I][ Init][ 134]: LLM init start
tokenizer_type = 3
stop_tokens size: 2
151645
151645
6% | ███ | 2 / 31 [1.16s<18.04s, 1.72 count/s] embed_selector init ok
100% | ████████████████████████████████ | 31 / 31 [4.29s<4.29s, 7.22 count/s] init post axmodel ok,remain_cmm(7954 MB)
[I][ Init][ 252]: IMAGE_CONTEXT_TOKEN: 151646
[I][ Init][ 284]: image encoder input nhwc@uint8
[I][ Init][ 308]: image encoder output float32
[I][ Init][ 318]: image_encoder_height : 1024, image_encoder_width: 1024
[I][ Init][ 320]: max_token_len : 1024
[I][ Init][ 323]: kv_cache_size : 256, kv_cache_num: 1024
[I][ Init][ 331]: prefill_token_num : 128
[I][ Init][ 335]: grp: 1, prefill_max_token_num : 1
[I][ Init][ 335]: grp: 2, prefill_max_token_num : 128
[I][ Init][ 335]: grp: 3, prefill_max_token_num : 256
[I][ Init][ 335]: grp: 4, prefill_max_token_num : 512
[I][ Init][ 335]: grp: 5, prefill_max_token_num : 640
[I][ Init][ 339]: prefill_max_token_num : 640
[I][ load_config][ 282]: load config:
{
    "enable_repetition_penalty": false,
    "enable_temperature": true,
    "enable_top_k_sampling": true,
    "enable_top_p_sampling": false,
    "penalty_window": 50,
    "repetition_penalty": 1.15,
    "temperature": 0.8,
    "top_k": 10,
    "top_p": 0.9
}

[I][ Init][ 348]: LLM init ok
Type "q" to exit, Ctrl+c to stop current running
prompt >> who are you
image >>
[I][ Encode][ 470]: input_ids size: 33
[I][ Run][ 604]: input token num : 33, prefill_split_num : 1
[I][ Run][ 619]: prefill grpid 2
[I][ Run][ 646]: input_num_token:33
[I][ Run][ 770]: ttft: 181.57 ms
I am FastVLM, an AI language model developed by Apple Inc.

[N][ Run][ 879]: hit eos,avg 11.49 token/s

prompt >> describe the image
image >> ./images/image_1.jpg
[I][ Encode][ 442]: image encode time : 235.81 ms, size : 393216
[I][ Encode][ 496]: imgs_embed.size() : 1, media token size : 256
[I][ Run][ 604]: input token num : 291, prefill_split_num : 3
[I][ Run][ 619]: prefill grpid 4
[I][ Run][ 646]: input_num_token:128
[I][ Run][ 646]: input_num_token:128
[I][ Run][ 646]: input_num_token:35
[I][ Run][ 770]: ttft: 567.97 ms
The image shows a giant panda sitting in a bamboo forest. The panda is predominantly black and white, with a black body and white face. Its eyes are black, and it has black ears and a black nose. The panda is holding some bamboo leaves in its paws, suggesting that it is eating. The panda is sitting on a bed of bamboo, which is a key food source for the species. The background of the image includes more bamboo trees and some fallen branches. The panda's posture and the surrounding environment give the impression of a peaceful, natural setting. The panda is looking directly at the camera, and its expression appears calm and content. Overall, the image captures the beauty and tranquility of the panda in its natural habitat.

[N][ Run][ 879]: hit eos,avg 11.51 token/s

prompt >> q
```

Run the following command on the Axera board to start a chat conversation:

```sh
$ python infer_axmodel.py -v ./fastvlm_ax650_context_1k_prefill_640/image_encoder_1024x1024.axmodel -m ./fastvlm_ax650_context_1k_prefill_640 -t ./fastvlm_tokenizer/ -i 1024
```

Output:

```bash
[INFO] Available providers: ['AXCLRTExecutionProvider']
Loading config, tokenizer and init model.
Detected prefixes: ['llava_qwen2'], chosen: llava_qwen2, layers: 28
Init InferenceSession: 0%| | 0/28 [00:00> I am an artificial intelligence designed and developed by Apple Inc.
I am a natural language processing model that can understand and respond to user input in a conversational manner. I can answer questions, provide information, and engage in discussions on a wide range of topics. I am designed to be helpful, informative, and friendly, and I am constantly learning and improving to provide the best possible experience for users.
prompt<<./images/ssd_horse.jpg
slice_indices: [0, 1, 2]
Slice prefill done: 0
Slice prefill done: 1
Slice prefill done: 2
answer >> The image depicts a serene outdoor scene featuring a person riding a brown horse with a white blaze on its face. The rider, who has short brown hair, is wearing a blue hoodie, blue jeans, and black boots. The horse is equipped with a saddle and a bridle, and it stands on a dirt ground. In the foreground, a brown dog with a pink collar is sitting on the ground, looking up at the rider with its mouth open, possibly in anticipation or excitement. In the background, there is a silver pickup truck parked near a fence, and beyond the fence, there are trees and a few people sitting on a bench. The sky is overcast, suggesting a cloudy day. The overall atmosphere of the image is calm and peaceful, capturing a moment of connection between the rider, the horse, and the dog.
prompt<<./images/image_1.jpg
slice_indices: [0, 1, 2]
Slice prefill done: 0
Slice prefill done: 1
Slice prefill done: 2
answer >> The image depicts a panda bear in a natural setting. The panda is sitting on the ground, surrounded by green bamboo leaves and plants. The panda has a distinctive black and white fur pattern, with black patches around its eyes, ears, and limbs, and a white face and body. The panda appears to be holding a bamboo leaf in its mouth, which is a common food source for pandas. The background includes a wooden structure, possibly a part of a bamboo enclosure, and some rocks. The overall scene suggests that the panda is in a zoo or a wildlife sanctuary.
prompt<