---
license: apache-2.0
language:
- en
metrics:
- accuracy
base_model:
- Qwen/Qwen3-1.7B
library_name: transformers
tags:
- multi-modal
- large-language-model
- vision-language-model
- vision-encoder
---
<p align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/6258a6455ea3a0a9b6de3f22/mIMYeUFquGSbm89lT61TG.png" width="160" />
</p>
<h2 align="center">Penguin-VL</h2>
<h4 align="center">
Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders
</h4>
<h4 align="center">
<b>Project Page:</b> <a href="https://penguin-vl.github.io">penguin-vl.github.io</a> |
<b>GitHub:</b> <a href="https://github.com/tencent-ailab/Penguin-VL">tencent-ailab/Penguin-VL</a> |
<b>arXiv:</b> <a href="https://arxiv.org/abs/2603.06569">2603.06569</a>
<br><br>
<a href="https://penguin-vl.github.io"><img src="https://img.shields.io/badge/Project-Page-green?logo=github" alt="Project Page"></a>
<a href="https://github.com/tencent-ailab/Penguin-VL"><img src="https://img.shields.io/badge/GitHub-Repo-blue?logo=github" alt="GitHub Badge"></a>
<a href="https://huggingface.co/spaces/tencent/Penguin-VL"><img src="https://img.shields.io/badge/HuggingFace-Spaces-yellow?logo=huggingface" alt="Hugging Face Spaces"></a>
<a href="https://arxiv.org/abs/2603.06569"><img src="https://img.shields.io/badge/arXiv-2603.06569-b31b1b.svg?logo=arxiv" alt="arXiv"></a>
</h4>
---
## News
* **2026.03**: PenguinVL-Encoder is now available for general use.
* **2026.03**: Released PenguinVL-2B and PenguinVL-8B.
---
## Model Overview
PenguinVL is a compact Vision-Language Model designed to explore the efficiency limits of small-scale VLMs. Rather than being only an instruction-tuned model, PenguinVL is built from the ground up through **LLM-based vision encoder construction, multimodal pretraining, and subsequent instruction tuning**.
Unlike most existing VLMs that rely on contrastive-pretrained vision encoders (e.g., CLIP/SigLIP), PenguinVL initializes its vision encoder directly from a **text-only LLM**. This design avoids the objective mismatch between contrastive learning and autoregressive language modeling, enabling tighter alignment between visual representations and the language backbone.
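To make the adaptation concrete, the following is an illustrative sketch (not the released implementation) of one common 2D-RoPE construction: the head dimension is split in half, and a standard 1D rotary embedding is applied to one half using the patch's row index and to the other half using its column index. The function names `rope_1d` and `rope_2d` are hypothetical.

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Standard 1D rotary embedding: rotate pairs (x[i], x[i + d/2])
    by angle pos * base**(-i / (d/2)). x has shape (..., d), d even."""
    d = x.shape[-1]
    half = d // 2
    freqs = base ** (-np.arange(half) / half)   # per-pair rotation frequencies
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def rope_2d(x, row, col):
    """2D-RoPE for image patches: the first half of the head dim encodes
    the row position, the second half the column position."""
    h = x.shape[-1] // 2
    return np.concatenate([rope_1d(x[..., :h], row),
                           rope_1d(x[..., h:], col)], axis=-1)
```

Because each half is an orthogonal rotation, attention scores between rotated queries and keys depend only on the *relative* (row, col) offset between patches, which is what makes the scheme a natural 2D generalization of the 1D positions the text LLM was pretrained with.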
### Key Characteristics
- **LLM-based Vision Encoder**
  The vision encoder is adapted from a pretrained text LLM (Qwen3-0.6B), modified with bidirectional attention and 2D-RoPE for spatial modeling. This provides strong semantic priors and native compatibility with the downstream LLM.
- **Efficient Video Understanding**
  A Temporal Redundancy-Aware (TRA) token-compression strategy dynamically allocates token budgets across frames, enabling long-video reasoning within a limited context window.
- **Unified Architecture**
  The model consists of:
  1. an LLM-initialized vision encoder
  2. a lightweight MLP projector
  3. a Qwen3 language backbone
- **Compact but Strong**
  At the 2B scale, Penguin-VL achieves competitive performance across image, document, OCR, math, and video benchmarks while remaining deployment-friendly.
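The idea behind redundancy-aware token allocation can be sketched in a few lines. This is a toy illustration under assumed behavior, not the released TRA algorithm: frames that differ more from their predecessor (lower cosine similarity) receive a larger share of a fixed visual-token budget. The function name `allocate_token_budget` is hypothetical.

```python
import numpy as np

def allocate_token_budget(frame_feats, total_budget, min_tokens=1):
    """Toy temporal-redundancy-aware allocation.

    frame_feats: (T, D) array of per-frame feature vectors.
    Returns an integer token budget per frame summing to total_budget.
    """
    feats = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    sim = np.sum(feats[1:] * feats[:-1], axis=1)    # cosine sim to previous frame
    novelty = np.concatenate([[1.0], 1.0 - sim])    # first frame is fully novel
    novelty = np.clip(novelty, 1e-6, None)
    weights = novelty / novelty.sum()
    budgets = np.maximum(min_tokens,
                         np.floor(weights * total_budget).astype(int))
    # Fix up the rounding remainder so budgets sum exactly to total_budget.
    remainder = total_budget - budgets.sum()
    if remainder > 0:
        order = np.argsort(-novelty)                # give extras to novel frames
        budgets[order[:remainder]] += 1
    elif remainder < 0:
        for i in np.argsort(novelty):               # take from redundant frames
            take = min(budgets[i] - min_tokens, -remainder)
            budgets[i] -= take
            remainder += take
            if remainder == 0:
                break
    return budgets
```

With near-duplicate frames, most of the budget flows to the frames where the scene actually changes, which is how a fixed context window can stretch to cover long videos.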
---
## Quick Start: Transformers Inference
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor
model_name = "tencent/Penguin-VL-2B"
model = AutoModelForCausalLM.from_pretrained(
model_name,
trust_remote_code=True,
device_map="auto",
torch_dtype=torch.bfloat16,
)
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
# Example: Image + Text
inputs = processor(
conversation=[
{"role": "system", "content": "You are a helpful assistant."},
{
"role": "user",
"content": [
{"type": "image", "image": {"image_path": "assets/example.jpg"}},
{"type": "text", "text": "Describe this image."}
],
},
],
return_tensors="pt",
)
# Move tensors to the model's device and match the model's compute dtype
inputs = {k: v.to(model.device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
if "pixel_values" in inputs:
    inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)
output_ids = model.generate(**inputs, max_new_tokens=128)
response = processor.decode(output_ids[0], skip_special_tokens=True)
print(response)
```
## Model Zoo
| Model | Base Model | HF Link |
| -------------------- | ------------ | ------------------------------------------------------------ |
| PenguinVL-8B | Qwen3-8B | [tencent/Penguin-VL-8B](https://huggingface.co/tencent/Penguin-VL-8B) |
| PenguinVL-2B | Qwen3-1.7B | [tencent/Penguin-VL-2B](https://huggingface.co/tencent/Penguin-VL-2B) |
| PenguinVL-Encoder | Qwen3-0.6B | [tencent/Penguin-Encoder](https://huggingface.co/tencent/Penguin-Encoder) |
## Main Results
### Chart / OCR / Document Understanding
| Benchmark | **Penguin-VL 2B** | Qwen3-VL 2B | InternVL3.5 2B | Gemma3n E2B-it | SmolVLM2 2.2B |
|---|---:|---:|---:|---:|---:|
| InfoVQA | **77.8** | 72.4 | 70.8 | 51.9 | 43.0 |
| ChartQA | **86.6** | 76.9 | 80.7 | 65.8 | 68.7 |
| DocVQA | **94.1** | 93.3 | 89.4 | 78.4 | 80.0 |
| CharXiv (DQ / RQ) | **66.4 / 35.8** | 62.3 / 26.8 | 65.0 / 31.6 | 60.1 / 27.0 | 36.9 / 15.5 |
| OCRBench | 810 | **858** | 836 | 700 | 729 |
### General Knowledge / Multi-Image / Math Reasoning
| Benchmark | **Penguin-VL 2B** | Qwen3-VL 2B | InternVL3.5 2B | Gemma3n E2B-it | SmolVLM2 2.2B |
|---|---:|---:|---:|---:|---:|
| AI2D | **80.7** | 76.9 | 78.8 | 74.6 | 70.0 |
| RealWorldQA | **70.2** | 63.9 | 62.0 | 59.9 | 58.3 |
| V-star | **83.8** | 74.9 | 69.1 | 46.0 | 51.8 |
| MMMU-Pro | 31.4 | **36.5** | 31.6 | 28.0 | 20.1 |
| BLINK | 51.7 | **53.8** | 36.6 | 44.1 | 44.0 |
| MathVista | **67.3** | 61.3 | 60.8 | 50.4 | 51.5 |
| MathVerse | 35.9 | **52.1** | 39.6 | 22.5 | 21.5 |
| LogicVista | 41.3 | 35.8 | **47.7** | 33.9 | 24.8 |
### Video Understanding
| Benchmark | **Penguin-VL 2B** | Qwen3-VL 2B | InternVL3.5 2B | Gemma3n E2B-it | SmolVLM2 2.2B |
|---|---:|---:|---:|---:|---:|
| MVBench | 65.5 | 61.7 | **65.9** | 46.8 | 46.3 |
| LongVideoBench | **59.5** | 52.1 | 57.4 | 43.0 | 49.7 |
| VideoMME | 57.4 | **61.9** | 58.4 | 47.0 | 52.1 |
| EgoSchema | **57.6** | 55.7 | 50.5 | 48.0 | 34.0 |
| MMVU | **42.7** | 41.7 | **42.7** | 34.5 | 33.5 |
| CharadesSTA | **56.2** | 54.5 | 21.9 | 5.5 | 9.5 |
| NextQA | **79.9** | 76.9 | 76.1 | 65.4 | 62.4 |
| ActivityNetQA | **61.5** | 59.7 | 58.3 | 51.5 | 52.6 |
| Perception Test | **70.4** | 64.5 | 64.7 | 48.6 | 51.6 |
> **Bold** indicates the best score among the compared models.
> More details can be found in our paper.
## Citation
If you find Penguin-VL useful for your research and applications, please cite using this BibTeX:
```bibtex
@article{Penguin-VL,
title={Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders},
author={Boqiang Zhang and Lei Ke and Ruihan Yang and Qi Gao and Tianyuan Qu and Rossell Chen and Dong Yu and Leoweiliang},
journal={arXiv preprint arXiv:2603.06569},
year={2026}
}
``` |