Update README.md
README.md CHANGED
@@ -1,7 +1,20 @@
+---
+language:
+- en
+license: apache-2.0
+pipeline_tag: image-text-to-text
+tags:
+- vision-language-model
+- multimodal
+- custom_code
+library_name: transformers
+---
+
 <p align="center">
-<img src="
+<img src="https://cdn-uploads.huggingface.co/production/uploads/6258a6455ea3a0a9b6de3f22/mIMYeUFquGSbm89lT61TG.png" width="160" />
 </p>
 
+
 <h2 align="center">Vision Encoder of PenguinVL</h2>
 <h4 align="center">
 Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders
@@ -18,9 +31,9 @@ Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders
 
 ## 🐧 Model Overview
 
-PenguinVL is a compact Vision-Language Model
+PenguinVL is a compact Vision-Language Model designed to explore the efficiency limits of small-scale VLMs. Rather than being only an instruction-tuned model, PenguinVL is built from the ground up through **LLM-based vision encoder construction, multimodal pretraining, and subsequent instruction tuning**.
 
-Unlike most existing VLMs that rely on contrastive-pretrained vision encoders (e.g., CLIP/SigLIP),
+Unlike most existing VLMs that rely on contrastive-pretrained vision encoders (e.g., CLIP/SigLIP), PenguinVL initializes its vision encoder directly from a **text-only LLM**. This design avoids the objective mismatch between contrastive learning and autoregressive language modeling, enabling tighter alignment between visual representations and the language backbone.
 
 ### Key Characteristics
 
@@ -49,8 +62,8 @@ import torch
 from transformers import AutoModel, AutoImageProcessor
 from transformers.image_utils import load_image
 
-model_name = "
-image_path = "
+model_name = "tencent/Penguin-Encoder"
+image_path = "assets/xxxx.jpg"
 images = load_image(image_path)
 
 model = AutoModel.from_pretrained(
@@ -72,9 +85,10 @@ image_features = model(**inputs)
 ## 🐧 Model Zoo
 | Model                | Base Model   | HF Link                                                                   |
 | -------------------- | ------------ | ------------------------------------------------------------------------- |
-| PenguinVL-8B | Qwen3-8B | [
-| PenguinVL-2B | Qwen3-1.7B | [
-| PenguinVL-Encoder | Qwen3-0.6B | [
+| PenguinVL-8B         | Qwen3-8B     | [tencent/Penguin-VL-8B](https://huggingface.co/tencent/Penguin-VL-8B)     |
+| PenguinVL-2B         | Qwen3-1.7B   | [tencent/Penguin-VL-2B](https://huggingface.co/tencent/Penguin-VL-2B)     |
+| PenguinVL-Encoder    | Qwen3-0.6B   | [tencent/Penguin-Encoder](https://huggingface.co/tencent/Penguin-Encoder) |
+
 
 ## 🐧 Main Results
 xxx