Jax-dan committed
Commit b50bb10 · verified · 1 Parent(s): b0aee7d

Update README.md

Files changed (1)
  1. README.md +12 -1
README.md CHANGED
@@ -32,7 +32,18 @@ In this repository, we propose `SmolVLM-0.6B-Cap`, which is built upon the `Qwen
  - **Composited Structure**: Following SOTA VLMs such as `LLaVA`, `Qwen-VL`, `GLM-V`, and `SmolVLM`, `SmolVLM-0.6B-Cap` adopts a sandwich structure consisting of a vision encoder, a text encoder, and an MLP connector. Notably, the vision encoder is drawn from the HuggingFaceTB `SmolVLM-0.5B-Instruct` model, and the text encoder is the `Qwen2.5-0.5B-Instruct` model.
  - **Bilingual support**: Unlike the English-only original `SmolVLM` models, `SmolVLM-0.6B-Cap` supports both Chinese and English captions.
  - **Optimized Capacity**: Despite its moderate size, `SmolVLM-0.6B-Cap` is optimized on massive image-caption pairs in both Chinese and English settings, demonstrating significant advantages over generalist VLMs with several times its parameter count.
-
+
+ ## Comparisons
+
+ | Model | Params | BLEU | BERTScore | ROUGE-1 | ROUGE-2 | ROUGE-L |
+ | --- | --- | --- | --- | --- | --- | --- |
+ | Ovis2.5 | 2B | 0.3402 | 0.8393 | 0.4877 | 0.2000 | 0.3320 |
+ | Qwen2.5-VL | 3B | 0.4394 | 0.8623 | 0.5208 | 0.2260 | 0.3555 |
+ | GLM-4.1V | 9B | 0.4016 | 0.8731 | 0.4847 | 0.1745 | 0.3158 |
+ | SmolVLM-Cap | 0.6B | **0.4927** | **0.8706** | **0.5519** | **0.3045** | **0.4133** |
+
+ > **Notes**: 500 images randomly sampled from the SA1B dataset are used for evaluation.
+
  ## Usage
  ```python
  from transformers import AutoProcessor, AutoModelForImageTextToText
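
The "sandwich" structure named in the **Composited Structure** bullet (vision encoder → MLP connector → language model) can be summarized in a short skeleton. This is an illustrative sketch only: the hidden sizes, module names, and prepend-image-tokens fusion below are assumptions, not the repository's actual modeling code.

```python
# Illustrative skeleton of the sandwich structure described in the diff above.
# Hidden sizes and module names are hypothetical, not the repo's real code.
import torch
import torch.nn as nn

class SandwichVLM(nn.Module):
    def __init__(self, vision_encoder, language_model,
                 vision_dim=768, text_dim=896):
        super().__init__()
        self.vision_encoder = vision_encoder    # e.g., from SmolVLM-0.5B-Instruct
        self.language_model = language_model    # e.g., Qwen2.5-0.5B-Instruct
        # MLP connector projecting vision features into the LM embedding space
        self.connector = nn.Sequential(
            nn.Linear(vision_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, pixel_values, text_embeds):
        # vision features: [batch, num_patches, vision_dim]
        vision_tokens = self.connector(self.vision_encoder(pixel_values))
        # Prepend projected image tokens to the text embeddings (one common
        # fusion scheme; the repo's actual scheme is not shown in this commit)
        fused = torch.cat([vision_tokens, text_embeds], dim=1)
        return self.language_model(inputs_embeds=fused)
```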
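The metrics in the comparison table can be approximated with the Hugging Face `evaluate` library. The snippet below is a minimal sketch assuming plain-string captions and default metric settings; the exact evaluation script, prompts, and BERTScore backbone behind the table are not part of this commit.

```python
# Minimal sketch of the table's metrics via the `evaluate` library.
# Requires: pip install evaluate rouge_score bert_score
import evaluate

predictions = ["a dog running along a sandy beach"]  # model captions (placeholder)
references = ["a dog runs along the beach"]          # ground-truth captions (placeholder)

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

print(bleu.compute(predictions=predictions, references=references)["bleu"])
scores = rouge.compute(predictions=predictions, references=references)
print(scores["rouge1"], scores["rouge2"], scores["rougeL"])
bert = bertscore.compute(predictions=predictions, references=references, lang="en")
print(sum(bert["f1"]) / len(bert["f1"]))  # average BERTScore F1
```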
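The Usage snippet is cut off by the diff context after its first import. A minimal loading sketch consistent with that import follows; the repository id `Jax-dan/SmolVLM-0.6B-Cap`, the dtype, and the chat-message format are assumptions, not confirmed by this commit.

```python
# Minimal sketch, assuming the model loads through the standard
# image-text-to-text auto classes (a recent transformers version is needed).
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "Jax-dan/SmolVLM-0.6B-Cap"  # hypothetical repo id
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, torch_dtype=torch.float32)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/cat.jpg"},  # placeholder image
        {"type": "text", "text": "Describe this image."},
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
)
generated = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```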