---
license: apache-2.0
language:
- en
metrics:
- accuracy
base_model:
- Qwen/Qwen3-0.6B
library_name: transformers
tags:
- multi-modal
- large-language-model
- vision-language-model
- vision-encoder
---

<p align="center">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/6258a6455ea3a0a9b6de3f22/mIMYeUFquGSbm89lT61TG.png" width="160" />
</p>


<h2 align="center">Vision Encoder of Penguin-VL</h2>
<h4 align="center">
Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders
</h4>

<h4 align="center">
  <b>Project Page:</b> <a href="https://penguin-vl.github.io">penguin-vl.github.io</a> | 
  <b>GitHub:</b> <a href="https://github.com/tencent-ailab/Penguin-VL">tencent-ailab/Penguin-VL</a> | 
  <b>arXiv:</b> <a href="https://arxiv.org/abs/2603.06569">2603.06569</a>
  <br><br>
  <a href="https://penguin-vl.github.io"><img src="https://img.shields.io/badge/Project-Page-green?logo=github" alt="Project Page"></a>
  <a href="https://github.com/tencent-ailab/Penguin-VL"><img src="https://img.shields.io/badge/GitHub-Repo-blue?logo=github" alt="GitHub Badge"></a>
  <a href="https://huggingface.co/spaces/tencent/Penguin-VL"><img src="https://img.shields.io/badge/HuggingFace-Spaces-yellow?logo=huggingface" alt="Hugging Face Spaces"></a>
  <a href="https://arxiv.org/abs/2603.06569"><img src="https://img.shields.io/badge/arXiv-2603.06569-b31b1b.svg?logo=arxiv" alt="arXiv"></a>
</h4>

---

## 📰 News

* **2026.03** — PenguinVL-Encoder is now available for general use.
* **2026.03** — Released PenguinVL-2B and PenguinVL-8B.

---

## 🌟 Model Overview

PenguinVL is a compact vision-language model designed to explore the efficiency limits of small-scale VLMs.

Unlike most existing VLMs that rely on contrastive-pretrained vision encoders (e.g., CLIP/SigLIP), Penguin-VL initializes its vision encoder directly from a **text-only LLM**. This design avoids the objective mismatch between contrastive learning and autoregressive language modeling, enabling tighter alignment between visual representations and the language backbone.

### Key Characteristics

- 🧠 **LLM-based Vision Encoder**  
  The vision encoder is adapted from a pretrained text LLM (Qwen3-0.6B), modified with bidirectional attention and 2D-RoPE for spatial modeling.  
  This provides strong semantic priors and native compatibility with the downstream LLM.
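
To make the spatial-modeling change concrete, here is a minimal NumPy sketch of 2D rotary position embedding (2D-RoPE): half of each head's channels are rotated by a patch's row index and the other half by its column index. The function names and the exact channel split are illustrative assumptions, not Penguin-VL's actual implementation.

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0):
    """Standard RoPE angles for a 1-D sequence of positions."""
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    return np.outer(positions, inv_freq)  # (num_positions, dim/2)

def apply_rope(x, angles):
    """Rotate consecutive channel pairs of x by the given angles."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_2d(x, grid_h, grid_w):
    """2D-RoPE: first half of channels encodes the row, second half the column."""
    d = x.shape[-1]
    rows = np.repeat(np.arange(grid_h), grid_w)  # row index of each patch
    cols = np.tile(np.arange(grid_w), grid_h)    # column index of each patch
    x_row = apply_rope(x[..., : d // 2], rope_angles(rows, d // 2))
    x_col = apply_rope(x[..., d // 2 :], rope_angles(cols, d // 2))
    return np.concatenate([x_row, x_col], axis=-1)

# Toy query tensor: a 2x3 patch grid with 8 channels per patch.
q = np.ones((6, 8))
q_rot = rope_2d(q, grid_h=2, grid_w=3)
```

Because each rotation is orthogonal, 2D-RoPE preserves vector norms and leaves the patch at position (0, 0) unchanged, injecting relative spatial offsets purely through query-key dot products.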

---

## 🧪 Quick Start — Transformers Inference

```python
import torch
from transformers import AutoModel, AutoImageProcessor
from transformers.image_utils import load_image

model_name = "tencent/Penguin-Encoder"
image_path = "your_img.jpg"
images = load_image(image_path)

# flash_attention_2 requires the flash-attn package; drop the argument to
# fall back to the default attention implementation.
model = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
processor = AutoImageProcessor.from_pretrained(model_name, trust_remote_code=True)

# Preprocess, move tensors to the GPU, and match the model's bfloat16 dtype.
inputs = processor(images=images, merge_size=1)
inputs = {k: torch.tensor(v).cuda() for k, v in inputs.items()}
if "pixel_values" in inputs:
    inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)
image_features = model(**inputs)
```
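
A common downstream use of the extracted features is to pool the per-patch vectors into a single image embedding, e.g. for image-image similarity. The shapes and the mean-pooling choice below are illustrative assumptions (the exact output format of `image_features` is defined by the model's remote code); random arrays stand in for features from two images.

```python
import numpy as np

# Assumed shape: one feature vector per visual patch, (num_patches, hidden_dim).
feat_a = np.random.default_rng(0).normal(size=(256, 1024))
feat_b = np.random.default_rng(1).normal(size=(256, 1024))

def pool_and_normalize(patch_features):
    """Mean-pool patch features into one image embedding, then L2-normalize it."""
    v = patch_features.mean(axis=0)
    return v / np.linalg.norm(v)

# Cosine similarity between the two pooled image embeddings.
sim = float(pool_and_normalize(feat_a) @ pool_and_normalize(feat_b))
```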

## 🌎 Model Zoo
| Model                | Base Model   | HF Link                                                      |
| -------------------- | ------------ | ------------------------------------------------------------ |
| PenguinVL-8B         | Qwen3-8B     | [tencent/Penguin-VL-8B](https://huggingface.co/tencent/Penguin-VL-8B) |
| PenguinVL-2B         | Qwen3-1.7B   | [tencent/Penguin-VL-2B](https://huggingface.co/tencent/Penguin-VL-2B) |
| PenguinVL-Encoder    | Qwen3-0.6B   | [tencent/Penguin-Encoder](https://huggingface.co/tencent/Penguin-Encoder) |

## 🚀 Main Results
Ablation Study:

![image](https://cdn-uploads.huggingface.co/production/uploads/626938b16f8f86ad21deb989/JOSRpV_qEbTqdbYwH-hJr.png)

For the complete main results, please refer to our paper.

## Citation

If you find Penguin-VL useful for your research and applications, please cite using this BibTeX:
```bibtex
@article{Penguin-VL,
  title={Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders},
  author={Boqiang Zhang and Lei Ke and Ruihan Yang and Qi Gao and Tianyuan Qu and Rossell Chen and Dong Yu and Leoweiliang},
  journal={arXiv preprint arXiv:2603.06569},
  year={2026}
}
```