- **Composited Structure**: Following SOTA VLMs such as `LLaVA`, `Qwen-VL`, `GLM-V`, and `SmolVLM`, `SmolVLM-0.6B-Cap` adopts a sandwich structure consisting of a vision encoder, a text encoder, and an MLP connector (sketched after this list). The vision encoder is taken from the HuggingFaceTB `SmolVLM-0.5B-Instruct` model, and the text encoder is the `Qwen2.5-0.5B-Instruct` model.
- **Bilingual support**: Unlike the English-only original `SmolVLM` models, `SmolVLM-0.6B-Cap` supports both Chinese and English captions.
- **Optimized Capacity**: Despite its moderate size, `SmolVLM-0.6B-Cap` is trained on massive image-caption pairs in both Chinese and English, demonstrating significant advantages over generalist VLMs with several times its parameter count.
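
The sandwich structure can be summarized with a minimal PyTorch sketch. The module names, dimensions, and fusion scheme below are illustrative assumptions, not this repository's actual implementation:

```python
import torch
import torch.nn as nn

class SandwichVLM(nn.Module):
    """Illustrative sandwich: vision encoder -> MLP connector -> language model."""

    def __init__(self, vision_encoder, language_model, vision_dim, text_dim):
        super().__init__()
        self.vision_encoder = vision_encoder    # e.g., the SmolVLM vision tower
        self.language_model = language_model    # e.g., Qwen2.5-0.5B-Instruct
        # The MLP connector projects vision features into the LM embedding space.
        self.connector = nn.Sequential(
            nn.Linear(vision_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, pixel_values, text_embeds):
        vision_feats = self.vision_encoder(pixel_values)   # [B, N, vision_dim]
        vision_tokens = self.connector(vision_feats)       # [B, N, text_dim]
        # Prepend projected vision tokens to the text embeddings, then decode.
        fused = torch.cat([vision_tokens, text_embeds], dim=1)
        return self.language_model(inputs_embeds=fused)
```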
## Comparisons

| Model | Params | BLEU | BERTScore | ROUGE-1 | ROUGE-2 | ROUGE-L |
| --- | --- | --- | --- | --- | --- | --- |
| Ovis2.5 | 2B | 0.3402 | 0.8393 | 0.4877 | 0.2000 | 0.3320 |
| Qwen2.5-VL | 3B | 0.4394 | 0.8623 | 0.5208 | 0.2260 | 0.3555 |
| GLM-4.1V | 9B | 0.4016 | 0.8731 | 0.4847 | 0.1745 | 0.3158 |
| SmolVLM-Cap | 0.6B | **0.4927** | **0.8706** | **0.5519** | **0.3045** | **0.4133** |

> **Notes**: 500 images randomly sampled from the SA1B dataset are used for evaluation.
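
These metrics can be computed with off-the-shelf implementations; below is a minimal sketch using the Hugging Face `evaluate` library. The caption lists are placeholders, and the exact evaluation configuration used for the table above is not specified here:

```python
import evaluate

# Placeholder captions; in the actual evaluation, predictions are model outputs
# for the 500 SA1B images and references are the ground-truth captions.
predictions = ["a dog running along the beach"]
references = ["a dog runs on the beach"]

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")          # reports rouge1, rouge2, rougeL
bertscore = evaluate.load("bertscore")

print(bleu.compute(predictions=predictions, references=[[r] for r in references]))
print(rouge.compute(predictions=predictions, references=references))
print(bertscore.compute(predictions=predictions, references=references, lang="en"))
```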
## Usage
```python
from transformers import AutoProcessor, AutoModelForImageTextToText
```
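
A fuller end-to-end sketch, assuming the checkpoint follows the standard `transformers` image-text-to-text interface; the `model_id`, prompt, and generation settings below are illustrative placeholders, not this repository's confirmed API:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

# Hypothetical repo id; replace with the actual checkpoint path.
model_id = "SmolVLM-0.6B-Cap"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
).eval()

image = Image.open("example.jpg")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")

with torch.no_grad():
    generated = model.generate(**inputs, max_new_tokens=256)
caption = processor.batch_decode(generated, skip_special_tokens=True)[0]
print(caption)
```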