---
license: apache-2.0
language:
- en
metrics:
- accuracy
base_model:
- Qwen/Qwen3-0.6B
library_name: transformers
tags:
- multi-modal
- large-language-model
- vision-language-model
- vision-encoder
---

<p align="center">

## Model Overview

PenguinVL is a compact Vision-Language Model designed to explore the efficiency limits of small-scale VLMs.

Unlike most existing VLMs, which rely on contrastive-pretrained vision encoders (e.g., CLIP/SigLIP), PenguinVL initializes its vision encoder directly from a **text-only LLM**. This design avoids the objective mismatch between contrastive learning and autoregressive language modeling, enabling tighter alignment between visual representations and the language backbone.

### Key Characteristics

The vision encoder is adapted from a pretrained text LLM (Qwen3-0.6B), modified with bidirectional attention and 2D-RoPE for spatial modeling. This provides strong semantic priors and native compatibility with the downstream LLM.
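The 2D-RoPE idea can be illustrated with a small NumPy sketch. This is an illustrative re-implementation of the general technique, not the model's actual code: each patch embedding's channels are split into two groups, and standard 1D rotary embeddings are applied using the patch's row index on one group and its column index on the other.

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Standard 1D RoPE: rotate channel pairs of x (..., d) by position-dependent angles."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)   # per-pair rotation frequency
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def rope_2d(x, row, col):
    """2D-RoPE for image patches: encode the row index in the first half of the
    channels and the column index in the second half."""
    half = x.shape[-1] // 2
    return np.concatenate([rope_1d(x[..., :half], row),
                           rope_1d(x[..., half:], col)], axis=-1)
```

Because the rotations are orthogonal, the attention score between two patches depends only on their relative (row, column) offset, which is what gives the encoder translation-aware spatial modeling.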

---

## Quick Start: Transformers Inference

```python
from transformers import AutoModel, AutoImageProcessor
from transformers.image_utils import load_image

model_name = "tencent/Penguin-Encoder"
image_path = "your_img.jpg"
images = load_image(image_path)

# trust_remote_code is needed if the checkpoint ships custom modeling code.
model = AutoModel.from_pretrained(model_name, trust_remote_code=True).eval()
processor = AutoImageProcessor.from_pretrained(model_name, trust_remote_code=True)

inputs = processor(images=images, return_tensors="pt")
image_features = model(**inputs)
```
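The extracted `image_features` are typically a grid of per-patch embeddings. A common way to turn them into a single image-level vector is mean pooling followed by L2 normalization; the sketch below shows this on dummy arrays (hypothetical post-processing, not an API documented by this model):

```python
import numpy as np

def pool_and_normalize(patch_features):
    """Mean-pool patch embeddings of shape (n_patches, dim) into one unit-norm vector."""
    v = patch_features.mean(axis=0)
    return v / np.linalg.norm(v)

# Dummy stand-ins for two images' patch features (16 patches, 32 channels each).
rng = np.random.default_rng(0)
feats_a = rng.standard_normal((16, 32))
feats_b = rng.standard_normal((16, 32))

emb_a = pool_and_normalize(feats_a)
emb_b = pool_and_normalize(feats_b)
similarity = float(emb_a @ emb_b)  # cosine similarity, since both vectors are unit-norm
```

Unit-norm embeddings make the dot product a cosine similarity, which is convenient for retrieval-style comparisons between images.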

| Model | Language Backbone | Checkpoint |
| --- | --- | --- |
| PenguinVL-2B | Qwen3-1.7B | [tencent/Penguin-VL-2B](https://huggingface.co/tencent/Penguin-VL-2B) |
| PenguinVL-Encoder | Qwen3-0.6B | [tencent/Penguin-Encoder](https://huggingface.co/tencent/Penguin-Encoder) |

## Main Results

Ablation study:

![Ablation results](figures/ablation.png)

For the complete main results, please see our paper.

## Citation