---
library_name: pytorch
---

![smolvlm_logo](resource/SmolVLM.png)

SmolVLM is a family of compact vision–language models designed for efficient multimodal understanding. It pairs lightweight visual encoders with small language models, with a focus on edge deployment and low-latency multimodal AI.

Original paper: [SmolVLM: Redefining small and efficient multimodal models](https://arxiv.org/abs/2504.05299)

# SmolVLM2-500M-Video-Instruct

SmolVLM2-500M-Video-Instruct is a highly efficient ~500M-parameter variant optimized for a low memory footprint and fast multimodal inference. It is well suited for applications such as visual question answering, image captioning, document understanding, and real-time multimodal assistants on edge devices or in resource-constrained environments.

Model Configuration:

- Reference implementation: [smollm](https://github.com/huggingface/smollm)
- Original weights: [SmolVLM2-500M-Video-Instruct](https://huggingface.co/HuggingFaceTB/SmolVLM2-500M-Video-Instruct)
- Resolution: 3x512x512
- Supported Cooper versions:
  - Cooper SDK: [2.5.4]
  - Cooper Foundry: [2.3]

| Model | Device | Model Link |
| :-----: | :-----: | :-----: |
| SmolVLM2-500M-Video-Instruct | CV7 | [Model_Link](https://huggingface.co/Ambarella/SmolVLM2/blob/main/cv7_smolvlm2_video_instruct_500M.tar) |
| SmolVLM2-500M-Video-Instruct | CV72 | [Model_Link](https://huggingface.co/Ambarella/SmolVLM2/blob/main/cv72_smolvlm2_video_instruct_500M.tar) |
| SmolVLM2-500M-Video-Instruct | CV75 | [Model_Link](https://huggingface.co/Ambarella/SmolVLM2/blob/main/cv75_smolvlm2_video_instruct_500M.tar) |
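
The archives in the table can also be fetched programmatically. The following is a minimal sketch, not an official workflow: it assumes the `huggingface_hub` package is installed, and the helper names `archive_name` and `fetch_and_extract` as well as the output directory are hypothetical. The repository ID and file names are taken from the links above. The layout of the extracted archive is not described in this card, so the sketch only unpacks it.

```python
import tarfile
from pathlib import Path

# Repository and devices as they appear in the model table above.
REPO_ID = "Ambarella/SmolVLM2"
DEVICES = ("CV7", "CV72", "CV75")


def archive_name(device: str) -> str:
    """Return the archive file name for a device listed in the table (hypothetical helper)."""
    key = device.strip().upper()
    if key not in DEVICES:
        raise ValueError(f"unsupported device {device!r}; expected one of {DEVICES}")
    return f"{key.lower()}_smolvlm2_video_instruct_500M.tar"


def fetch_and_extract(device: str, out_dir: str = "smolvlm2_model") -> Path:
    """Download the archive for `device` from the Hub and unpack it (hypothetical helper)."""
    # Imported lazily so archive_name() works without the package installed.
    from huggingface_hub import hf_hub_download  # pip install huggingface_hub

    tar_path = hf_hub_download(repo_id=REPO_ID, filename=archive_name(device))
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    # Only unpack archives from a source you trust.
    with tarfile.open(tar_path) as tar:
        tar.extractall(target)
    return target
```

For example, `fetch_and_extract("CV72")` would download `cv72_smolvlm2_video_instruct_500M.tar` into the local Hub cache and unpack it under `smolvlm2_model/`. Deploying the unpacked model on the device is handled by the Cooper SDK and is outside the scope of this sketch.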