Update README.md
README.md CHANGED
@@ -1,7 +1,20 @@
+---
+language:
+- en
+license: apache-2.0
+pipeline_tag: image-text-to-text
+tags:
+- vision-language-model
+- multimodal
+- custom_code
+library_name: transformers
+---
+
 <p align="center">
-<img src="
+<img src="https://cdn-uploads.huggingface.co/production/uploads/6258a6455ea3a0a9b6de3f22/mIMYeUFquGSbm89lT61TG.png" width="160" />
 </p>
 
+
 <h2 align="center">Vision Encoder of PenguinVL</h2>
 <h4 align="center">
 Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders
@@ -18,9 +31,9 @@ Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders
 
 ## 🐧 Model Overview
 
-PenguinVL is a compact Vision-Language Model
+PenguinVL is a compact Vision-Language Model designed to explore the efficiency limits of small-scale VLMs. Rather than being only an instruction-tuned model, PenguinVL is built from the ground up through **LLM-based vision encoder construction, multimodal pretraining, and subsequent instruction tuning**.
 
-Unlike most existing VLMs that rely on contrastive-pretrained vision encoders (e.g., CLIP/SigLIP),
+Unlike most existing VLMs that rely on contrastive-pretrained vision encoders (e.g., CLIP/SigLIP), PenguinVL initializes its vision encoder directly from a **text-only LLM**. This design avoids the objective mismatch between contrastive learning and autoregressive language modeling, enabling tighter alignment between visual representations and the language backbone.
 
 ### Key Characteristics
 
@@ -49,8 +62,8 @@ import torch
 from transformers import AutoModel, AutoImageProcessor
 from transformers.image_utils import load_image
 
-model_name = "
-image_path = "
+model_name = "tencent/Penguin-Encoder"
+image_path = "assets/xxxx.jpg"
 images = load_image(image_path)
 
 model = AutoModel.from_pretrained(
@@ -72,9 +85,10 @@ image_features = model(**inputs)
 ## 🐧 Model Zoo
 | Model                | Base Model   | HF Link                                                                   |
 | -------------------- | ------------ | ------------------------------------------------------------------------- |
-| PenguinVL-8B | Qwen3-8B | [
-| PenguinVL-2B | Qwen3-1.7B | [
-| PenguinVL-Encoder | Qwen3-0.6B | [
+| PenguinVL-8B         | Qwen3-8B     | [tencent/Penguin-VL-8B](https://huggingface.co/tencent/Penguin-VL-8B)     |
+| PenguinVL-2B         | Qwen3-1.7B   | [tencent/Penguin-VL-2B](https://huggingface.co/tencent/Penguin-VL-2B)     |
+| PenguinVL-Encoder    | Qwen3-0.6B   | [tencent/Penguin-Encoder](https://huggingface.co/tencent/Penguin-Encoder) |
+
 
 ## 🐧 Main Results
 xxx