	---
	library_name: pytorch
	---

	![smolvlm_logo](resource/SmolVLM.png)

	SmolVLM is a family of compact vision–language models that pair lightweight visual encoders with small language models. The family targets efficient multimodal understanding, with a focus on edge deployment and low-latency inference.

	Original paper: [SmolVLM: Redefining small and efficient multimodal models](https://arxiv.org/abs/2504.05299)

	# SmolVLM2-500M-Video-Instruct

	SmolVLM2-500M-Video-Instruct is a ~500M-parameter variant optimized for a low memory footprint and fast multimodal inference. It is well suited to visual question answering, image captioning, document understanding, and real-time multimodal assistants on edge devices and in other resource-constrained environments.

	Model Configuration:
	- Reference implementation: [smollm](https://github.com/huggingface/smollm)
	- Original weights: [SmolVLM2-500M-Video-Instruct](https://huggingface.co/HuggingFaceTB/SmolVLM2-500M-Video-Instruct)
	- Resolution: 3x512x512
	- Supported Cooper versions:
	  - Cooper SDK: 2.5.4
	  - Cooper Foundry: 2.3

	| Model | Device | Model Link |
	| :-----: | :-----: | :-----: |
	| SmolVLM2-500M-Video-Instruct | CV7 | [Model_Link](https://huggingface.co/Ambarella/SmolVLM2/blob/main/cv7_smolvlm2_video_instruct_500M.tar) |
	| SmolVLM2-500M-Video-Instruct | CV72 | [Model_Link](https://huggingface.co/Ambarella/SmolVLM2/blob/main/cv72_smolvlm2_video_instruct_500M.tar) |
	| SmolVLM2-500M-Video-Instruct | CV75 | [Model_Link](https://huggingface.co/Ambarella/SmolVLM2/blob/main/cv75_smolvlm2_video_instruct_500M.tar) |
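
The archives in the table can also be fetched programmatically. The following is a minimal sketch, not an official Ambarella workflow: it assumes the `huggingface_hub` package is installed, and the helper names `archive_filename` and `download_archive`, as well as the `models` target directory, are hypothetical. The repository ID and file names are taken from the links in the table.

```python
from pathlib import Path

# Repository and devices as they appear in the table's links.
REPO_ID = "Ambarella/SmolVLM2"
SUPPORTED_DEVICES = ("CV7", "CV72", "CV75")


def archive_filename(device: str) -> str:
    """Return the archive name for a device, following the table's naming pattern."""
    key = device.strip().upper()
    if key not in SUPPORTED_DEVICES:
        raise ValueError(
            f"unsupported device {device!r}; expected one of {SUPPORTED_DEVICES}"
        )
    return f"{key.lower()}_smolvlm2_video_instruct_500M.tar"


def download_archive(device: str, local_dir: str = "models") -> Path:
    """Download the archive for `device` into `local_dir` (needs network access)."""
    # Imported lazily so archive_filename works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    path = hf_hub_download(
        repo_id=REPO_ID,
        filename=archive_filename(device),
        local_dir=local_dir,
    )
    return Path(path)
```

For example, `download_archive("CV72")` would place `cv72_smolvlm2_video_instruct_500M.tar` under `models/`. The internal layout of the archive is not described in this card, so unpacking and deployment should follow the Cooper SDK documentation.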