## Training Data
SmolVLM2 was trained on 3.3M samples drawn from ten datasets: [LLaVA-OneVision](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data), [M4-Instruct](https://huggingface.co/datasets/lmms-lab/M4-Instruct-Data), [MAmmoTH-VL](https://huggingface.co/datasets/MAmmoTH-VL/MAmmoTH-VL-Instruct-12M), [LLaVA-Video-178K](https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K), [FineVideo](https://huggingface.co/datasets/HuggingFaceFV/finevideo), [Video-STaR](https://huggingface.co/datasets/orrzohar/Video-STaR), [Vript](https://huggingface.co/datasets/Mutonix/Vript), [Vista-400K](https://huggingface.co/datasets/TIGER-Lab/VISTA-400K), [MovieChat](https://huggingface.co/datasets/Enxin/MovieChat-1K_train), and [ShareGPT4Video](https://huggingface.co/datasets/ShareGPT4Video/ShareGPT4Video).
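All ten sources are hosted on the Hugging Face Hub, so individual records can be inspected directly with the `datasets` library. Below is a minimal sketch using streaming mode; the choice of FineVideo and the split name are illustrative, and some of these repositories are gated or require an explicit config name.

```python
from datasets import load_dataset

# Minimal sketch: stream a few records from one of the ten source
# datasets without downloading it in full. Split/config names vary
# per repository, and gated repos need `huggingface-cli login` first.
ds = load_dataset(
    "HuggingFaceFV/finevideo",  # any of the ten sources works similarly
    split="train",
    streaming=True,
)

for i, sample in enumerate(ds):
    print(sample.keys())  # inspect the record schema
    if i == 2:
        break
```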
The tables below give a general overview of the samples across modalities and the sources of those samples.
<!--
<center><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/smolvlm2_data_split.png" width="auto" height="auto" alt="Image description"></center>

### Details
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/smolvlm2_datadetails.png" width="auto" height="auto" alt="Image description">
-->
## Data Split per Modality

| Data Type   | Percentage |
|-------------|------------|
| Image       | 34.4%      |
| Text        | 20.2%      |
| Video       | 33.0%      |
| Multi-image | 12.3%      |
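Applied to the 3.3M-sample total, these shares give rough per-modality counts. A quick sanity check (the percentages come from the table above; the arithmetic is a back-of-the-envelope estimate):

```python
# Back-of-the-envelope per-modality counts, assuming the table's
# rounded percentages apply to the stated 3.3M-sample total.
TOTAL_SAMPLES = 3_300_000

modality_share = {
    "Image": 0.344,
    "Text": 0.202,
    "Video": 0.330,
    "Multi-image": 0.123,
}

for modality, share in modality_share.items():
    print(f"{modality:>12}: ~{TOTAL_SAMPLES * share:,.0f} samples")

# The shares sum to 99.9% rather than 100% because of rounding.
print(f"Total: {sum(modality_share.values()):.1%}")
```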
## Granular Dataset Slices per Modality

### Text Datasets

| Dataset                                 | Percentage |
|-----------------------------------------|------------|
| llava-onevision/magpie_pro_ft3_80b_mt   | 6.8%       |
| llava-onevision/magpie_pro_ft3_80b_tt   | 6.8%       |
| llava-onevision/magpie_pro_qwen2_72b_tt | 5.8%       |
| llava-onevision/mathqa                  | 0.9%       |

### Multi-image Datasets

| Dataset                                 | Percentage |
|-----------------------------------------|------------|
| m4-instruct-data/m4_instruct_multiimage | 10.4%      |
| mammoth/multiimage-cap6                 | 1.9%       |

### Image Datasets

| Dataset                                 | Percentage |
|-----------------------------------------|------------|
| llava-onevision/other                   | 17.4%      |
| llava-onevision/vision_flan             | 3.9%       |
| llava-onevision/mavis_math_metagen      | 2.6%       |
| llava-onevision/mavis_math_rule_geo     | 2.5%       |
| llava-onevision/sharegpt4o              | 1.7%       |
| llava-onevision/sharegpt4v_coco         | 1.5%       |
| llava-onevision/image_textualization    | 1.3%       |
| llava-onevision/sharegpt4v_llava        | 0.9%       |
| llava-onevision/mapqa                   | 0.9%       |
| llava-onevision/qa                      | 0.8%       |
| llava-onevision/textocr                 | 0.8%       |

### Video Datasets

| Dataset                                 | Percentage |
|-----------------------------------------|------------|
| llava-video-178k/1-2m                   | 7.3%       |
| llava-video-178k/2-3m                   | 7.0%       |
| other-video/combined                    | 5.7%       |
| llava-video-178k/hound                  | 4.4%       |
| llava-video-178k/0-30s                  | 2.4%       |
| video-star/starb                        | 2.2%       |
| vista-400k/combined                     | 2.2%       |
| vript/long                              | 1.0%       |
| ShareGPT4Video/all                      | 0.8%       |
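One practical way to approximate a mixture like this is weighted interleaving of the source datasets. The sketch below is a hypothetical illustration rather than the actual SmolVLM2 training pipeline: the repository IDs are real, but the three-source subset, the splits, and the weights are assumptions made for the example.

```python
from datasets import load_dataset, interleave_datasets

# Hypothetical sketch of a weighted data mixture. NOT the SmolVLM2
# training code: sources, splits, and weights are assumptions, and
# some repos require an explicit config name as a second argument.
sources = {
    "lmms-lab/LLaVA-OneVision-Data": 0.5,
    "lmms-lab/LLaVA-Video-178K": 0.3,
    "Mutonix/Vript": 0.2,
}

streams = [
    load_dataset(repo, split="train", streaming=True)
    for repo in sources
]

# Draw from each stream with probability proportional to its weight,
# which is how per-dataset percentages like those above are realized.
mixture = interleave_datasets(streams, probabilities=list(sources.values()), seed=42)

for sample in mixture.take(5):
    ...  # feed into preprocessing / training
```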