---
license: other
license_name: apple
license_link: https://github.com/apple/ml-fastvlm/blob/main/LICENSE
language:
- en
pipeline_tag: image-text-to-text
tags:
- multimodal
library_name: transformers
---

# FastVLM-7B-Stage3

## Introduction

FastVLM-7B-Stage3 is a multimodal language model that can understand visual inputs, act agentically, understand long videos and capture events, and generate structured outputs.

This model is exported from the GitHub repository [apple/ml-fastvlm](https://github.com/apple/ml-fastvlm).

Model weights: [llava-fastvithd_7b_stage3.zip](https://ml-site.cdn-apple.com/datasets/fastvlm/llava-fastvithd_7b_stage3.zip).

### Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = 'FastVLM-7B-Stage3'
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype='auto', trust_remote_code=True)
```
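Once loaded, the tokenizer and model can be used for generation. As a minimal, hypothetical sketch of prompt construction only: the `<image>` placeholder token and the LLaVA-style conversation template below are assumptions for illustration; FastVLM's actual template and image-token handling are defined by the model's remote code and may differ.

```python
# Hypothetical LLaVA-style prompt template; FastVLM's real template
# lives in the model's remote code and may differ from this sketch.
IMAGE_TOKEN = "<image>"  # assumed image placeholder token

def build_prompt(question: str) -> str:
    """Format a single-turn question around an image placeholder."""
    return f"USER: {IMAGE_TOKEN}\n{question} ASSISTANT:"

prompt = build_prompt("Describe this image.")
print(prompt)
```

The resulting string would be tokenized with the tokenizer loaded above and passed, together with the preprocessed image, to `model.generate`.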
### Export to MNN
```shell
git clone https://github.com/alibaba/MNN
cd MNN/transformers/llm/export
python llmexport.py --path /path/to/FastVLM-7B-Stage3 --export mnn
```
## Citation

If you find our work helpful, please consider citing it.

```
@InProceedings{fastvlm2025,
  author    = {Pavan Kumar Anasosalu Vasu and Fartash Faghri and Chun-Liang Li and Cem Koc and Nate True and Albert Antony and Gokul Santhanam and James Gabriel and Peter Grasch and Oncel Tuzel and Hadi Pouransari},
  title     = {FastVLM: Efficient Vision Encoding for Vision Language Models},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2025},
}
```