|
|
--- |
|
|
license: apple-amlr |
|
|
library_name: ml-fastvlm |
|
|
tags: |
|
|
- transformers |
|
|
--- |
|
|
# FastVLM: Efficient Vision Encoding for Vision Language Models |
|
|
|
|
|
FastVLM was introduced in |
|
|
**[FastVLM: Efficient Vision Encoding for Vision Language Models](https://www.arxiv.org/abs/2412.13303)** (CVPR 2025).
|
|
|
|
|
|
|
<p align="center"> |
|
|
<img src="acc_vs_latency_qwen-2.png" alt="Accuracy vs latency figure." width="400"/> |
|
|
</p> |
|
|
|
|
|
### Highlights |
|
|
* We introduce FastViTHD, a novel hybrid vision encoder designed to output fewer tokens and significantly reduce encoding time for high-resolution images. |
|
|
* Our smallest variant outperforms LLaVA-OneVision-0.5B with 85x faster Time-to-First-Token (TTFT) and a 3.4x smaller vision encoder.
|
|
* Our larger variants, using the Qwen2-7B LLM, outperform recent works like Cambrian-1-8B while using a single image encoder and achieving a 7.9x faster TTFT.
|
|
|
|
|
|
|
|
### Evaluations |
|
|
| Benchmark     | FastVLM-0.5B | FastVLM-1.5B | FastVLM-7B |
|:--------------|:------------:|:------------:|:----------:|
| Ai2D          | 68.0         | 77.4         | 83.6       |
| ScienceQA     | 85.2         | 94.4         | 96.7       |
| MMMU          | 33.9         | 37.8         | 45.4       |
| VQAv2         | 76.3         | 79.1         | 80.8       |
| ChartQA       | 76.0         | 80.1         | 85.0       |
| TextVQA       | 64.5         | 70.4         | 74.9       |
| InfoVQA       | 46.4         | 59.7         | 75.8       |
| DocVQA        | 82.5         | 88.3         | 93.2       |
| OCRBench      | 63.9         | 70.2         | 73.1       |
| RealWorldQA   | 56.1         | 61.2         | 67.2       |
| SeedBench-Img | 71.0         | 74.2         | 75.4       |
|
|
|
|
|
|
|
|
### Usage Example |
|
|
To run inference with the PyTorch checkpoint, follow the instructions in the official repo:
|
|
|
|
|
Download the model:
|
|
```bash
huggingface-cli download apple/FastVLM-0.5B
```
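
If you prefer Python, the same download can be done with the `huggingface_hub` API; the `local_dir` below is just an illustrative destination:

```python
from huggingface_hub import snapshot_download

# Download the full checkpoint into a local directory (path is illustrative).
checkpoint_dir = snapshot_download(repo_id="apple/FastVLM-0.5B", local_dir="FastVLM-0.5B")
print(checkpoint_dir)
```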
|
|
|
|
|
Run inference using `predict.py` from the official repo. |
|
|
```bash
python predict.py --model-path /path/to/checkpoint-dir \
                  --image-file /path/to/image.png \
                  --prompt "Describe the image."
```
|
|
|
|
|
### Run inference with Transformers (Remote Code) |
|
|
To run inference with Transformers, pass `trust_remote_code=True` and use the following snippet:
|
|
|
|
|
```python
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "apple/FastVLM-0.5B"

# Load the processor and model. trust_remote_code is required because the
# modeling code ships with the checkpoint rather than with transformers.
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
)

# Build a chat-style prompt that pairs an image with a text instruction.
image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_url},
            {"type": "text", "text": "Describe this image in detail."},
        ]
    }
]

# Tokenize the conversation and preprocess the image in a single call.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
)

# Greedy decoding, capped at 150 new tokens.
out = model.generate(
    **inputs,
    do_sample=False,
    max_new_tokens=150,
)

print(processor.tokenizer.decode(out[0], skip_special_tokens=False))
```
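
The call above prints the full sequence, including the prompt and special tokens. To print only the model's answer, one option is to slice off the prompt tokens before decoding (this reuses `inputs` and `out` from the snippet above):

```python
# Decode only the tokens generated after the prompt.
prompt_len = inputs["input_ids"].shape[1]
print(processor.tokenizer.decode(out[0][prompt_len:], skip_special_tokens=True))
```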
|
|
|
|
|
## Citation |
|
|
If you find this model useful, please cite the following paper:
|
|
```bibtex
@InProceedings{fastvlm2025,
  author    = {Pavan Kumar Anasosalu Vasu and Fartash Faghri and Chun-Liang Li and Cem Koc and Nate True and Albert Antony and Gokul Santhanam and James Gabriel and Peter Grasch and Oncel Tuzel and Hadi Pouransari},
  title     = {FastVLM: Efficient Vision Encoding for Vision Language Models},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2025},
}
```
|
|
|