Set correct pipeline tag, add link to paper

#17
by nielsr (HF Staff) - opened
Files changed (1)
  1. README.md +9 -5
README.md CHANGED
@@ -1,6 +1,6 @@
 ---
-library_name: transformers
-license: apache-2.0
+base_model:
+- HuggingFaceTB/SmolVLM-500M-Instruct
 datasets:
 - HuggingFaceM4/the_cauldron
 - HuggingFaceM4/Docmatix
@@ -14,17 +14,20 @@ datasets:
 - TIGER-Lab/VISTA-400K
 - Enxin/MovieChat-1K_train
 - ShareGPT4Video/ShareGPT4Video
-pipeline_tag: image-text-to-text
 language:
 - en
-base_model:
-- HuggingFaceTB/SmolVLM-500M-Instruct
+library_name: transformers
+license: apache-2.0
+pipeline_tag: video-text-to-text
 ---
 
+```markdown
 <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/SmolVLM2_banner.png" width="800" height="auto" alt="Image description">
 
 # SmolVLM2-500M-Video
 
+This repository contains the model of the paper [SmolVLM: Redefining small and efficient multimodal models](https://huggingface.co/papers/2504.05299).
+
 SmolVLM2-500M-Video is a lightweight multimodal model designed to analyze video content. The model processes videos, images, and text inputs to generate text outputs - whether answering questions about media files, comparing visual content, or transcribing text from images. Despite its compact size, requiring only 1.8GB of GPU RAM for video inference, it delivers robust performance on complex multimodal tasks. This efficiency makes it particularly well-suited for on-device applications where computational resources may be limited.
 ## Model Summary
 
@@ -268,3 +271,4 @@ In the following plots we give a general overview of the samples across modaliti
 | vista-400k/combined | 2.2% |
 | vript/long | 1.0% |
 | ShareGPT4Video/all | 0.8% |
+```
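
The metadata change above can be sanity-checked locally before merging. A minimal sketch, assuming PyYAML is available and using an inline copy of the proposed front matter rather than reading the real README.md:

```python
import yaml  # PyYAML; assumed installed (pip install pyyaml)

# Inline copy of the YAML front matter as proposed in this PR (illustration only).
front_matter = """
base_model:
- HuggingFaceTB/SmolVLM-500M-Instruct
datasets:
- HuggingFaceM4/the_cauldron
- HuggingFaceM4/Docmatix
language:
- en
library_name: transformers
license: apache-2.0
pipeline_tag: video-text-to-text
"""

meta = yaml.safe_load(front_matter)

# The Hub uses pipeline_tag to place the model under the right task filter;
# for a video-capable VLM the tag should be video-text-to-text, not image-text-to-text.
assert meta["pipeline_tag"] == "video-text-to-text"
assert meta["base_model"] == ["HuggingFaceTB/SmolVLM-500M-Instruct"]
print(meta["pipeline_tag"])
```

The same check can be pointed at the actual file by splitting README.md on the `---` delimiters and feeding the first block to `yaml.safe_load`.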