---
library_name: pytorch
---

SmolVLM is a family of compact vision–language models that pair lightweight visual encoders with small language models for efficient multimodal understanding, targeting edge deployment and low-latency inference.

Original paper: [SmolVLM: Redefining small and efficient multimodal models](https://arxiv.org/abs/2504.05299)

# SmolVLM2-500M-Video-Instruct

SmolVLM2-500M-Video-Instruct is a ~500M-parameter variant optimized for a low memory footprint and fast multimodal inference. It is well suited to visual question answering, image captioning, document understanding, and real-time multimodal assistants on edge devices or in other resource-constrained environments.

Model Configuration:

- Reference implementation: [smollm](https://github.com/huggingface/smollm)
- Original weights: [SmolVLM2-500M-Video-Instruct](https://huggingface.co/HuggingFaceTB/SmolVLM2-500M-Video-Instruct)
- Resolution: 3x512x512
- Supported Cooper versions:
  - Cooper SDK: [2.5.4]
  - Cooper Foundry: [2.3]

| Model | Device | Model Link |
| :-----: | :-----: | :-----: |
| SmolVLM2-500M-Video-Instruct | CV7 | [Model_Link](https://huggingface.co/Ambarella/SmolVLM2/blob/main/cv7_smolvlm2_video_instruct_500M.tar) |
| SmolVLM2-500M-Video-Instruct | CV72 | [Model_Link](https://huggingface.co/Ambarella/SmolVLM2/blob/main/cv72_smolvlm2_video_instruct_500M.tar) |
| SmolVLM2-500M-Video-Instruct | CV75 | [Model_Link](https://huggingface.co/Ambarella/SmolVLM2/blob/main/cv75_smolvlm2_video_instruct_500M.tar) |
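
The `3x512x512` resolution listed above can be illustrated with a small shape-conversion sketch. It assumes the figure denotes a channels-first (C, H, W) layout and that frames arrive as H x W x 3 arrays. `to_chw` is a hypothetical helper name, and the nearest-neighbour resize is only a stand-in for whatever resampling a real pipeline uses. The color order, dtype, and any normalization expected by the compiled model are not stated here, so treat those as open questions to check against the toolchain documentation.

```python
import numpy as np


def to_chw(image_hwc: np.ndarray, size: int = 512) -> np.ndarray:
    """Resize an H x W x 3 image to size x size and reorder it to 3 x size x size.

    Assumption: nearest-neighbour sampling, no normalization, dtype preserved.
    """
    if image_hwc.ndim != 3 or image_hwc.shape[2] != 3:
        raise ValueError("expected an H x W x 3 array")
    height, width, _ = image_hwc.shape
    # Map each output row/column back to a source row/column index.
    rows = np.arange(size) * height // size
    cols = np.arange(size) * width // size
    resized = image_hwc[rows[:, None], cols[None, :]]  # (size, size, 3)
    # Move the channel axis to the front: (3, size, size).
    return np.ascontiguousarray(resized.transpose(2, 0, 1))


# A dummy 480 x 640 frame stands in for a decoded video frame.
frame = np.zeros((480, 640, 3), dtype=np.uint8)
tensor = to_chw(frame)
print(tensor.shape)
```
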
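
The links in the table point at Hugging Face `/blob/` pages, which open the file viewer in a browser. For scripted downloads, the same path under `/resolve/` serves the file itself. The following is a minimal sketch using only the Python standard library. `blob_to_resolve` and `fetch_and_extract` are hypothetical helper names, and the sketch assumes the repository is publicly readable and the archive is a plain tar file; the layout of the extracted contents is not described here.

```python
import tarfile
import urllib.request
from pathlib import Path


def blob_to_resolve(url: str) -> str:
    """Turn a Hugging Face /blob/ page URL into the matching /resolve/ file URL."""
    return url.replace("/blob/", "/resolve/", 1)


def fetch_and_extract(url: str, dest: Path) -> Path:
    """Download the archive into dest (unless already present) and unpack it there."""
    dest.mkdir(parents=True, exist_ok=True)
    archive = dest / url.rsplit("/", 1)[-1]
    if not archive.exists():
        # Assumption: no authentication token is required for this repository.
        urllib.request.urlretrieve(blob_to_resolve(url), archive)
    with tarfile.open(archive) as tar:
        tar.extractall(dest)
    return dest


CV72_URL = (
    "https://huggingface.co/Ambarella/SmolVLM2/blob/main/"
    "cv72_smolvlm2_video_instruct_500M.tar"
)

# Example call (performs a network download, so it is left commented out):
# fetch_and_extract(CV72_URL, Path("models/cv72"))
```

Swap in the CV7 or CV75 link from the table to fetch the build for a different device. Skipping the download when the archive already exists keeps repeated runs from fetching the file again.
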