---
license: apache-2.0
language:
- en
metrics:
- accuracy
base_model:
- Qwen/Qwen3-0.6B
library_name: transformers
tags:
- multi-modal
- large-language-model
- vision-language-model
- vision-encoder
---
<p align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/6258a6455ea3a0a9b6de3f22/mIMYeUFquGSbm89lT61TG.png" width="160" />
</p>
<h2 align="center">Vision Encoder of Penguin-VL</h2>
<h4 align="center">
Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders
</h4>
<h4 align="center">
<b>Project Page:</b> <a href="https://penguin-vl.github.io">penguin-vl.github.io</a> |
<b>GitHub:</b> <a href="https://github.com/tencent-ailab/Penguin-VL">tencent-ailab/Penguin-VL</a> |
<b>arXiv:</b> <a href="https://arxiv.org/abs/2603.06569">2603.06569</a>
<br><br>
<a href="https://penguin-vl.github.io"><img src="https://img.shields.io/badge/Project-Page-green?logo=github" alt="Project Page"></a>
<a href="https://github.com/tencent-ailab/Penguin-VL"><img src="https://img.shields.io/badge/GitHub-Repo-blue?logo=github" alt="GitHub Badge"></a>
<a href="https://huggingface.co/spaces/tencent/Penguin-VL"><img src="https://img.shields.io/badge/HuggingFace-Spaces-yellow?logo=huggingface" alt="Hugging Face Spaces"></a>
<a href="https://arxiv.org/abs/2603.06569"><img src="https://img.shields.io/badge/arXiv-2603.06569-b31b1b.svg?logo=arxiv" alt="arXiv"></a>
</h4>
---
## 📰 News
* **2026.03** – PenguinVL-Encoder is now available for general use.
* **2026.03** – Released PenguinVL-2B and PenguinVL-8B.
---
## 📌 Model Overview
PenguinVL is a compact vision-language model designed to explore the efficiency limits of small-scale VLMs.
Unlike most existing VLMs that rely on contrastive-pretrained vision encoders (e.g., CLIP/SigLIP), Penguin-VL initializes its vision encoder directly from a **text-only LLM**. This design avoids the objective mismatch between contrastive learning and autoregressive language modeling, enabling tighter alignment between visual representations and the language backbone.
### Key Characteristics
- 🧠 **LLM-based Vision Encoder**
  The vision encoder is adapted from a pretrained text LLM (Qwen3-0.6B), modified with bidirectional attention and 2D-RoPE for spatial modeling.
  This provides strong semantic priors and native compatibility with the downstream LLM.
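
The 2D-RoPE idea above can be sketched in a few lines. The following is an illustrative toy (NumPy, random features, a 4×4 patch grid), not the model's actual implementation: the first half of each patch's feature vector is rotated by its row index and the second half by its column index, so attention scores become sensitive to relative 2D position.

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Standard 1D RoPE: rotate feature pairs of x by angles derived from pos."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)      # per-pair rotation frequencies
    angles = pos[:, None] * freqs[None, :]         # (n_tokens, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def rope_2d(x, rows, cols):
    """2D RoPE sketch: first half of the channels encodes the row coordinate,
    the second half encodes the column coordinate."""
    half = x.shape[-1] // 2
    return np.concatenate(
        [rope_1d(x[..., :half], rows), rope_1d(x[..., half:], cols)], axis=-1
    )

# A 4x4 grid of patch tokens with 8-dim features (toy sizes).
h = w = 4
tokens = np.random.randn(h * w, 8)
rows = np.repeat(np.arange(h), w).astype(float)   # row index of each patch
cols = np.tile(np.arange(w), h).astype(float)     # column index of each patch
rotated = rope_2d(tokens, rows, cols)
print(rotated.shape)  # (16, 8)
```

Because each step is a pure rotation, the transform preserves feature norms; only the relative phase between tokens changes, which is what makes RoPE-style encodings position-aware inside attention.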
---
## 🧪 Quick Start – Transformers Inference
```python
import torch
from transformers import AutoModel, AutoImageProcessor
from transformers.image_utils import load_image

model_name = "tencent/Penguin-Encoder"
image_path = "your_img.jpg"
images = load_image(image_path)

# Load the encoder (custom model class, so trust_remote_code is required).
model = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
processor = AutoImageProcessor.from_pretrained(model_name, trust_remote_code=True)

# Preprocess the image and move the tensors to the GPU.
inputs = processor(images=images, merge_size=1)
inputs = {k: torch.tensor(v).cuda() for k, v in inputs.items()}
if "pixel_values" in inputs:
    inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)

# Forward pass: returns the visual features for the image.
image_features = model(**inputs)
```
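
The encoder returns a sequence of patch-token features. How you consume them depends on the downstream task; as a hedged illustration (the shape below is a stand-in, not the model's documented output), one common pattern is to mean-pool the token dimension and L2-normalize to get a single embedding per image:

```python
import torch

# Stand-in for the encoder output: 256 patch tokens with 1024-dim features.
# (The real token count depends on the image resolution and merge_size.)
image_features = torch.randn(1, 256, 1024)

# Mean-pool over the token dimension and L2-normalize, e.g. for
# retrieval or nearest-neighbor search over a gallery of images.
pooled = image_features.mean(dim=1)
embedding = torch.nn.functional.normalize(pooled, dim=-1)
print(embedding.shape)  # torch.Size([1, 1024])
```

For use inside a full VLM, the per-token features would instead be projected and passed to the language backbone; see the PenguinVL-2B/8B checkpoints in the Model Zoo below.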
## 🐧 Model Zoo
| Model | Base Model | HF Link |
| -------------------- | ------------ | ------------------------------------------------------------ |
| PenguinVL-8B | Qwen3-8B | [tencent/Penguin-VL-8B](https://huggingface.co/tencent/Penguin-VL-8B) |
| PenguinVL-2B | Qwen3-1.7B | [tencent/Penguin-VL-2B](https://huggingface.co/tencent/Penguin-VL-2B) |
| PenguinVL-Encoder | Qwen3-0.6B | [tencent/Penguin-Encoder](https://huggingface.co/tencent/Penguin-Encoder) |
## 📊 Main Results
For the full main results and the ablation study (including the ablation figure), please refer to our paper.
## Citation
If you find Penguin-VL useful for your research and applications, please cite using this BibTeX:
```bibtex
@article{Penguin-VL,
title={Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders},
author={Boqiang Zhang and Lei Ke and Ruihan Yang and Qi Gao and Tianyuan Qu and Rossell Chen and Dong Yu and Leoweiliang},
journal={arXiv preprint arXiv:2603.06569},
year={2026}
}
```