HuggingFaceTB/SmolVLM-500M-Instruct

Image-Text-to-Text

andito (HF Staff) committed on Jan 23, 2025

Commit 7a8ac21 · verified · 1 Parent(s): 401eccf

Update README.md

Files changed (1)

README.md +6 -12

@@ -37,6 +37,12 @@ SmolVLM can be used for inference on multimodal (image + text) tasks where the i
 To fine-tune SmolVLM on a specific task, you can follow [the fine-tuning tutorial](https://github.com/huggingface/smollm/blob/main/vision/finetuning/Smol_VLM_FT.ipynb).
+## Evaluation
+<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/smoller_vlm_benchmarks.png" alt="Benchmarks" style="width:90%;" />
 ### Technical Summary
 SmolVLM leverages the lightweight SmolLM2 language model to provide a compact yet powerful multimodal experience. It introduces several changes compared to the larger SmolVLM 2.2B model:
@@ -167,15 +173,3 @@ The training data comes from [The Cauldron](https://huggingface.co/datasets/Hugg
-## Evaluation
-<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/smoller_vlm_benchmarks.png" alt="Example Image" style="width:90%;" />
-| Size  | Mathvista | MMMU | OCRBench | MMStar | AI2D  | ChartQA_Test | Science_QA | TextVQA Val | DocVQA Val |
-|-------|-----------|------|----------|--------|-------|--------------|------------|-------------|------------|
-| 256M  | 35.9      | 28.3 | 52.6     | 34.6   | 47    | 55.8         | 73.6       | 49.9        | 58.3       |
-| 500M  | 40.1      | 33.7 | 61       | 38.3   | 59.5  | 63.2         | 79.7       | 60.5        | 70.5       |
-| 2.2B  | 43.9      | 38.3 | 65.5     | 41.8   | 64    | 71.6         | 84.5       | 72.1        | 79.7       |