Set correct pipeline tag, add link to paper
#17 by nielsr (HF Staff) - opened

README.md CHANGED
````diff
@@ -1,6 +1,6 @@
 ---
-
-
+base_model:
+- HuggingFaceTB/SmolVLM-500M-Instruct
 datasets:
 - HuggingFaceM4/the_cauldron
 - HuggingFaceM4/Docmatix
@@ -14,17 +14,20 @@ datasets:
 - TIGER-Lab/VISTA-400K
 - Enxin/MovieChat-1K_train
 - ShareGPT4Video/ShareGPT4Video
-pipeline_tag: image-text-to-text
 language:
 - en
-
-
+library_name: transformers
+license: apache-2.0
+pipeline_tag: video-text-to-text
 ---
 
+```markdown
 <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/SmolVLM2_banner.png" width="800" height="auto" alt="Image description">
 
 # SmolVLM2-500M-Video
 
+This repository contains the model of the paper [SmolVLM: Redefining small and efficient multimodal models](https://huggingface.co/papers/2504.05299).
+
 SmolVLM2-500M-Video is a lightweight multimodal model designed to analyze video content. The model processes videos, images, and text inputs to generate text outputs - whether answering questions about media files, comparing visual content, or transcribing text from images. Despite its compact size, requiring only 1.8GB of GPU RAM for video inference, it delivers robust performance on complex multimodal tasks. This efficiency makes it particularly well-suited for on-device applications where computational resources may be limited.
 
 ## Model Summary
 
@@ -268,3 +271,4 @@ In the following plots we give a general overview of the samples across modaliti
 | vista-400k/combined | 2.2% |
 | vript/long | 1.0% |
 | ShareGPT4Video/all | 0.8% |
+```
````