---
license: apache-2.0
language:
- en
metrics:
- accuracy
base_model:
- Qwen/Qwen3-0.6B
library_name: transformers
tags:
- multi-modal
- large-language-model
- vision-language-model
- vision-encoder
---

<p align="center">

## Model Overview

PenguinVL is a compact Vision-Language Model designed to explore the efficiency limits of small-scale VLMs.

Unlike most existing VLMs, which rely on contrastive-pretrained vision encoders (e.g., CLIP/SigLIP), PenguinVL initializes its vision encoder directly from a **text-only LLM**. This design avoids the objective mismatch between contrastive learning and autoregressive language modeling, enabling tighter alignment between visual representations and the language backbone.

### Key Characteristics

The vision encoder is adapted from a pretrained text LLM (Qwen3-0.6B), modified with bidirectional attention and 2D-RoPE for spatial modeling. This provides strong semantic priors and native compatibility with the downstream LLM.
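The 2D-RoPE idea can be illustrated with a small NumPy sketch. This is an illustrative re-implementation of the general technique, not the model's actual code: each patch embedding's channels are split into two groups, and standard 1D rotary embeddings are applied using the patch's row index on one group and its column index on the other.

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Standard 1D RoPE: rotate channel pairs of x (..., d) by position-dependent angles."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)   # per-pair rotation frequency
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def rope_2d(x, row, col):
    """2D-RoPE for image patches: encode the row index in the first half of the
    channels and the column index in the second half."""
    half = x.shape[-1] // 2
    return np.concatenate([rope_1d(x[..., :half], row),
                           rope_1d(x[..., half:], col)], axis=-1)
```

Because the rotations are orthogonal, the attention score between two patches depends only on their relative (row, column) offset, which is what gives the encoder translation-aware spatial modeling.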

---

## Quick Start: Transformers Inference

```python
from transformers import AutoModel, AutoImageProcessor
from transformers.image_utils import load_image

model_name = "tencent/Penguin-Encoder"
image_path = "your_img.jpg"
images = load_image(image_path)

# trust_remote_code is needed if the checkpoint ships custom modeling code.
model = AutoModel.from_pretrained(model_name, trust_remote_code=True).eval()
processor = AutoImageProcessor.from_pretrained(model_name, trust_remote_code=True)

inputs = processor(images=images, return_tensors="pt")
image_features = model(**inputs)
```
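The extracted `image_features` are typically a grid of per-patch embeddings. A common way to turn them into a single image-level vector is mean pooling followed by L2 normalization; the sketch below shows this on dummy arrays (hypothetical post-processing, not an API documented by this model):

```python
import numpy as np

def pool_and_normalize(patch_features):
    """Mean-pool patch embeddings of shape (n_patches, dim) into one unit-norm vector."""
    v = patch_features.mean(axis=0)
    return v / np.linalg.norm(v)

# Dummy stand-ins for two images' patch features (16 patches, 32 channels each).
rng = np.random.default_rng(0)
feats_a = rng.standard_normal((16, 32))
feats_b = rng.standard_normal((16, 32))

emb_a = pool_and_normalize(feats_a)
emb_b = pool_and_normalize(feats_b)
similarity = float(emb_a @ emb_b)  # cosine similarity, since both vectors are unit-norm
```

Unit-norm embeddings make the dot product a cosine similarity, which is convenient for retrieval-style comparisons between images.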

| Model | Language Backbone | Checkpoint |
| --- | --- | --- |
| PenguinVL-2B | Qwen3-1.7B | [tencent/Penguin-VL-2B](https://huggingface.co/tencent/Penguin-VL-2B) |
| PenguinVL-Encoder | Qwen3-0.6B | [tencent/Penguin-Encoder](https://huggingface.co/tencent/Penguin-Encoder) |

## Main Results

Ablation study:

![Ablation results](figures/ablation.png)

For the complete main results, please see our paper.

## Citation