---
license: other
license_name: apple
license_link: https://github.com/apple/ml-fastvlm/blob/main/LICENSE
language:
- en
pipeline_tag: image-text-to-text
tags:
- multimodal
library_name: transformers
---

# FastVLM-7B-Stage2

## Introduction

This is FastVLM-7B-Stage2, a multimodal language model that can understand images, act as an agent, understand long videos and capture events, and generate structured outputs.

This model is exported from the GitHub repository [apple/ml-fastvlm](https://github.com/apple/ml-fastvlm).

Model weights: [llava-fastvithd_7b_stage2.zip](https://ml-site.cdn-apple.com/datasets/fastvlm/llava-fastvithd_7b_stage2.zip).
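
The weights ship as a zip archive. A minimal standard-library sketch for fetching and unpacking it (the local paths are illustrative, not required by the model):

```python
import urllib.request
import zipfile
from pathlib import Path

WEIGHTS_URL = "https://ml-site.cdn-apple.com/datasets/fastvlm/llava-fastvithd_7b_stage2.zip"

def download(url: str, dest: Path) -> Path:
    """Download url to dest, skipping the transfer if the file already exists."""
    if not dest.exists():
        urllib.request.urlretrieve(url, dest)
    return dest

def extract(archive: Path, out_dir: Path) -> Path:
    """Unpack a zip archive into out_dir and return that directory."""
    out_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(out_dir)
    return out_dir

if __name__ == "__main__":
    # Note: the archive is several GB, so this takes a while.
    archive = download(WEIGHTS_URL, Path("llava-fastvithd_7b_stage2.zip"))
    extract(archive, Path("FastVLM-7B-Stage2"))
```

The extracted directory can then be passed as `model_id` in the usage snippet below.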



### Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = 'FastVLM-7B-Stage2'
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype='auto', trust_remote_code=True)
```


### Export to MNN
```bash
git clone https://github.com/alibaba/MNN
cd MNN/transformers/llm/export
python llmexport.py --path /path/to/FastVLM-7B-Stage2 --export mnn
```



## Citation

If you find our work helpful, please consider citing it.

```bibtex
@InProceedings{fastvlm2025,
  author    = {Pavan Kumar Anasosalu Vasu and Fartash Faghri and Chun-Liang Li and Cem Koc and Nate True and Albert Antony and Gokul Santhanam and James Gabriel and Peter Grasch and Oncel Tuzel and Hadi Pouransari},
  title     = {FastVLM: Efficient Vision Encoding for Vision Language Models},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2025},
}
```