Update README.md
README.md
CHANGED
@@ -37,6 +37,12 @@ SmolVLM can be used for inference on multimodal (image + text) tasks where the i
 
 To fine-tune SmolVLM on a specific task, you can follow [the fine-tuning tutorial](https://github.com/huggingface/smollm/blob/main/vision/finetuning/Smol_VLM_FT.ipynb).
 
+## Evaluation
+
+
+<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/smoller_vlm_benchmarks.png" alt="Benchmarks" style="width:90%;" />
+
+
 ### Technical Summary
 
 SmolVLM leverages the lightweight SmolLM2 language model to provide a compact yet powerful multimodal experience. It introduces several changes compared to the larger SmolVLM 2.2B model:
@@ -167,15 +173,3 @@ The training data comes from [The Cauldron](https://huggingface.co/datasets/Hugg
 
 
 
-
-## Evaluation
-
-
-<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/smoller_vlm_benchmarks.png" alt="Example Image" style="width:90%;" />
-
-
-| Size | Mathvista | MMMU | OCRBench | MMStar | AI2D | ChartQA_Test | Science_QA | TextVQA Val | DocVQA Val |
-|-------|-----------|------|----------|--------|-------|--------------|------------|-------------|------------|
-| 256M | 35.9 | 28.3 | 52.6 | 34.6 | 47 | 55.8 | 73.6 | 49.9 | 58.3 |
-| 500M | 40.1 | 33.7 | 61 | 38.3 | 59.5 | 63.2 | 79.7 | 60.5 | 70.5 |
-| 2.2B | 43.9 | 38.3 | 65.5 | 41.8 | 64 | 71.6 | 84.5 | 72.1 | 79.7 |