- **Composited Structure**: Following SOTA VLMs such as `LLaVA`, `Qwen-VL`, `GLM-V`, and `SmolVLM`, `SmolVLM-0.6B-Cap` adopts a sandwich structure consisting of a vision encoder, a text encoder, and an MLP connector (sketched after this list). The vision encoder is taken from the HuggingFaceTB `SmolVLM-0.5B-Instruct` model, and the text encoder is the `Qwen2.5-0.5B-Instruct` model.
- **Bilingual support**: Unlike the English-only original `SmolVLM` models, `SmolVLM-0.6B-Cap` supports both Chinese and English captions.
- **Optimized Capacity**: Despite its moderate size, `SmolVLM-0.6B-Cap` is trained on massive image-caption pairs in both Chinese and English, demonstrating significant advantages over generalist VLMs with several times its parameter count.
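
The sandwich structure can be summarized with a minimal PyTorch sketch. The module names, dimensions, and fusion scheme below are illustrative assumptions, not this repository's actual implementation:

```python
import torch
import torch.nn as nn

class SandwichVLM(nn.Module):
    """Illustrative sandwich: vision encoder -> MLP connector -> language model."""

    def __init__(self, vision_encoder, language_model, vision_dim, text_dim):
        super().__init__()
        self.vision_encoder = vision_encoder    # e.g., the SmolVLM vision tower
        self.language_model = language_model    # e.g., Qwen2.5-0.5B-Instruct
        # The MLP connector projects vision features into the LM embedding space.
        self.connector = nn.Sequential(
            nn.Linear(vision_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, pixel_values, text_embeds):
        vision_feats = self.vision_encoder(pixel_values)   # [B, N, vision_dim]
        vision_tokens = self.connector(vision_feats)       # [B, N, text_dim]
        # Prepend projected vision tokens to the text embeddings, then decode.
        fused = torch.cat([vision_tokens, text_embeds], dim=1)
        return self.language_model(inputs_embeds=fused)
```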
## Comparisons

| Model | Params | BLEU | BERTScore | ROUGE-1 | ROUGE-2 | ROUGE-L |
| --- | --- | --- | --- | --- | --- | --- |
| Ovis2.5 | 2B | 0.3402 | 0.8393 | 0.4877 | 0.2000 | 0.3320 |
| Qwen2.5-VL | 3B | 0.4394 | 0.8623 | 0.5208 | 0.2260 | 0.3555 |
| GLM-4.1V | 9B | 0.4016 | 0.8731 | 0.4847 | 0.1745 | 0.3158 |
| SmolVLM-Cap | 0.6B | **0.4927** | **0.8706** | **0.5519** | **0.3045** | **0.4133** |

> **Notes**: 500 images randomly sampled from the SA1B dataset are used for evaluation.
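
These metrics can be computed with off-the-shelf implementations; below is a minimal sketch using the Hugging Face `evaluate` library. The caption lists are placeholders, and the exact evaluation configuration used for the table above is not specified here:

```python
import evaluate

# Placeholder captions; in the actual evaluation, predictions are model outputs
# for the 500 SA1B images and references are the ground-truth captions.
predictions = ["a dog running along the beach"]
references = ["a dog runs on the beach"]

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")          # reports rouge1, rouge2, rougeL
bertscore = evaluate.load("bertscore")

print(bleu.compute(predictions=predictions, references=[[r] for r in references]))
print(rouge.compute(predictions=predictions, references=references))
print(bertscore.compute(predictions=predictions, references=references, lang="en"))
```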
## Usage
```python
from transformers import AutoProcessor, AutoModelForImageTextToText
```
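
A fuller end-to-end sketch, assuming the checkpoint follows the standard `transformers` image-text-to-text interface; the `model_id`, prompt, and generation settings below are illustrative placeholders, not this repository's confirmed API:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

# Hypothetical repo id; replace with the actual checkpoint path.
model_id = "SmolVLM-0.6B-Cap"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
).eval()

image = Image.open("example.jpg")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")

with torch.no_grad():
    generated = model.generate(**inputs, max_new_tokens=256)
caption = processor.batch_decode(generated, skip_special_tokens=True)[0]
print(caption)
```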