|
|
--- |
|
|
license: apple-amlr |
|
|
library_name: ml-fastvlm |
|
|
tags: |
|
|
- transformers |
|
|
--- |
|
|
# FastVLM: Efficient Vision Encoding for Vision Language Models |
|
|
|
|
|
FastVLM was introduced in |
|
|
**[FastVLM: Efficient Vision Encoding for Vision Language Models](https://www.arxiv.org/abs/2412.13303)** (CVPR 2025).
|
|
|
|
|
|
|
<p align="center"> |
|
|
<img src="acc_vs_latency_qwen-2.png" alt="Accuracy vs latency figure." width="400"/> |
|
|
</p> |
|
|
|
|
|
### Highlights |
|
|
* We introduce FastViTHD, a novel hybrid vision encoder designed to output fewer tokens and significantly reduce encoding time for high-resolution images. |
|
|
* Our smallest variant outperforms LLaVA-OneVision-0.5B with 85x faster Time-to-First-Token (TTFT) and a 3.4x smaller vision encoder.
|
|
* Our larger variants, using the Qwen2-7B LLM, outperform recent works like Cambrian-1-8B while using a single image encoder and achieving a 7.9x faster TTFT.
|
|
|
|
|
|
|
|
### Evaluations |
|
|
| Benchmark     | FastVLM-0.5B | FastVLM-1.5B | FastVLM-7B |
|:--------------|:------------:|:------------:|:----------:|
| Ai2D          | 68.0         | 77.4         | 83.6       |
| ScienceQA     | 85.2         | 94.4         | 96.7       |
| MMMU          | 33.9         | 37.8         | 45.4       |
| VQAv2         | 76.3         | 79.1         | 80.8       |
| ChartQA       | 76.0         | 80.1         | 85.0       |
| TextVQA       | 64.5         | 70.4         | 74.9       |
| InfoVQA       | 46.4         | 59.7         | 75.8       |
| DocVQA        | 82.5         | 88.3         | 93.2       |
| OCRBench      | 63.9         | 70.2         | 73.1       |
| RealWorldQA   | 56.1         | 61.2         | 67.2       |
| SeedBench-Img | 71.0         | 74.2         | 75.4       |
|
|
|
|
|
|
|
|
### Usage Example |
|
|
To run inference with the PyTorch checkpoint, follow the instructions in the official repo:
|
|
|
|
|
Download the model:
|
|
```bash
huggingface-cli download apple/FastVLM-0.5B
```
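
If you prefer Python, the same download can be done with the `huggingface_hub` API; the `local_dir` below is just an illustrative destination:

```python
from huggingface_hub import snapshot_download

# Download the full checkpoint into a local directory (path is illustrative).
checkpoint_dir = snapshot_download(repo_id="apple/FastVLM-0.5B", local_dir="FastVLM-0.5B")
print(checkpoint_dir)
```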
|
|
|
|
|
Run inference using `predict.py` from the official repo. |
|
|
```bash
python predict.py --model-path /path/to/checkpoint-dir \
                  --image-file /path/to/image.png \
                  --prompt "Describe the image."
```
|
|
|
|
|
### Run inference with Transformers (Remote Code) |
|
|
To run inference with Transformers, pass `trust_remote_code=True` and use the following snippet:
|
|
|
|
|
```python
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "apple/FastVLM-0.5B"

# Load the processor and model. trust_remote_code is required because the
# modeling code ships with the checkpoint rather than with transformers.
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
)

# Build a chat-style prompt that pairs an image with a text instruction.
image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_url},
            {"type": "text", "text": "Describe this image in detail."},
        ]
    }
]

# Tokenize the conversation and preprocess the image in a single call.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
)

# Greedy decoding, capped at 150 new tokens.
out = model.generate(
    **inputs,
    do_sample=False,
    max_new_tokens=150,
)

print(processor.tokenizer.decode(out[0], skip_special_tokens=False))
```
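
The call above prints the full sequence, including the prompt and special tokens. To print only the model's answer, one option is to slice off the prompt tokens before decoding (this reuses `inputs` and `out` from the snippet above):

```python
# Decode only the tokens generated after the prompt.
prompt_len = inputs["input_ids"].shape[1]
print(processor.tokenizer.decode(out[0][prompt_len:], skip_special_tokens=True))
```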
|
|
|
|
|
## Citation |
|
|
If you find this model useful, please cite the following paper:
|
|
```bibtex
@InProceedings{fastvlm2025,
  author    = {Pavan Kumar Anasosalu Vasu and Fartash Faghri and Chun-Liang Li and Cem Koc and Nate True and Albert Antony and Gokul Santhanam and James Gabriel and Peter Grasch and Oncel Tuzel and Hadi Pouransari},
  title     = {FastVLM: Efficient Vision Encoding for Vision Language Models},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2025},
}
```
|
|
|