lkeab committed · verified
Commit f8ecf9a · 1 parent: 6320f7c

Update README.md

Files changed (1):
  1. README.md +18 -20
README.md CHANGED
@@ -1,13 +1,17 @@
 ---
+license: apache-2.0
 language:
 - en
-license: apache-2.0
-pipeline_tag: image-text-to-text
+metrics:
+- accuracy
+base_model:
+- Qwen/Qwen3-0.6B
+library_name: transformers
 tags:
+- multi-modal
+- large-language-model
 - vision-language-model
-- multimodal
-- custom_code
-library_name: transformers
+- vision-encoder
 ---
 
 <p align="center">
@@ -31,9 +35,9 @@ Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders
 
 ## 🌟 Model Overview
 
-PenguinVL is a compact Vision-Language Model designed to explore the efficiency limits of small-scale VLMs. Rather than being only an instruction-tuned model, PenguinVL is built from the ground up through **LLM-based vision encoder construction, multimodal pretraining, and subsequent instruction tuning**.
+PenguinVL is a compact Vision-Language Model designed to explore the efficiency limits of small-scale VLMs.
 
-Unlike most existing VLMs that rely on contrastive-pretrained vision encoders (e.g., CLIP/SigLIP), PenguinVL initializes its vision encoder directly from a **text-only LLM**. This design avoids the objective mismatch between contrastive learning and autoregressive language modeling, enabling tighter alignment between visual representations and the language backbone.
+Unlike most existing VLMs that rely on contrastive-pretrained vision encoders (e.g., CLIP/SigLIP), Penguin-VL initializes its vision encoder directly from a **text-only LLM**. This design avoids the objective mismatch between contrastive learning and autoregressive language modeling, enabling tighter alignment between visual representations and the language backbone.
 
 ### Key Characteristics
 
@@ -41,18 +45,6 @@ Unlike most existing VLMs that rely on contrastive-pretrained vision encoders (e
 The vision encoder is adapted from a pretrained text LLM (Qwen3-0.6B), modified with bidirectional attention and 2D-RoPE for spatial modeling.
 This provides strong semantic priors and native compatibility with the downstream LLM.
 
-- 🎥 **Efficient Video Understanding**
-A Temporal Redundancy-Aware (TRA) token compression strategy dynamically allocates token budgets across frames, enabling long-video reasoning within a limited context window.
-
-- 🏗 Unified Architecture
-The model consists of:
-1. LLM-initialized vision encoder
-2. Lightweight MLP projector
-3. Qwen3 language backbone
-
-- 📊 Compact but Strong
-At 2B scale, PG-VL achieves competitive performance across image, document, OCR, math, and video benchmarks while remaining deployment-friendly.
-
 ---
 
 ## 🧪 Quick Start — Transformers Inference
@@ -63,7 +55,7 @@ from transformers import AutoModel, AutoImageProcessor
 from transformers.image_utils import load_image
 
 model_name = "tencent/Penguin-Encoder"
-image_path = "assets/xxxx.jpg"
+image_path = "your_img.jpg"
 images = load_image(image_path)
 
 model = AutoModel.from_pretrained(
@@ -89,6 +81,12 @@ image_features = model(**inputs)
 | PenguinVL-2B | Qwen3-1.7B | [tencent/Penguin-VL-2B](https://huggingface.co/tencent/Penguin-VL-2B) |
 | PenguinVL-Encoder | Qwen3-0.6B | [tencent/Penguin-Encoder](https://huggingface.co/tencent/Penguin-Encoder) |
 
+## 🚀 Main Results
+Ablation Study:
+
+![image](https://cdn-uploads.huggingface.co/production/uploads/626938b16f8f86ad21deb989/JOSRpV_qEbTqdbYwH-hJr.png)
+
+Main results can be found in the ablation section of our paper.
 
 ## Citation
 
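The README's central idea is turning a causal text LLM into a vision encoder by switching to bidirectional attention and replacing 1-D rotary positions with 2-D ones over the patch grid. The sketch below is purely illustrative (it is not the released Penguin implementation, and `rope_2d_positions`, `bidirectional_mask`, and `causal_mask` are hypothetical helper names): it shows how patch indices map to (row, col) position ids for 2D-RoPE, and how the mask differs from a text LLM's causal mask.

```python
# Illustrative sketch only -- assumed helpers, not the Penguin-Encoder code.

def rope_2d_positions(h, w):
    """(row, col) position ids for an h x w grid of image patches.

    In 2D-RoPE, one half of the rotary frequency channels would rotate by
    the row index and the other half by the column index, instead of a
    single 1-D token index.
    """
    return [(i // w, i % w) for i in range(h * w)]

def bidirectional_mask(n):
    """All-ones mask: every patch attends to every other patch."""
    return [[1] * n for _ in range(n)]

def causal_mask(n):
    """Lower-triangular mask of the original text LLM, for contrast:
    token i attends only to tokens j <= i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

grid = rope_2d_positions(2, 3)      # a 2x3 patch grid -> 6 vision tokens
mask = bidirectional_mask(len(grid))
```

In a real adaptation these position ids would feed the rotary embedding of each attention layer and the bidirectional mask would replace the causal one, while the pretrained weights are kept; the sketch only fixes the indexing scheme.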