---
license: apache-2.0
language:
- en
metrics:
- accuracy
base_model:
- Qwen/Qwen3-1.7B
library_name: transformers
tags:
- multi-modal
- large-language-model
- vision-language-model
- vision-encoder
---
<p align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/6258a6455ea3a0a9b6de3f22/mIMYeUFquGSbm89lT61TG.png" width="160" />
</p>
<h2 align="center">Penguin-VL</h2>
<h4 align="center">
Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders
</h4>
<h4 align="center">
<b>Project Page:</b> <a href="https://penguin-vl.github.io">penguin-vl.github.io</a> |
<b>GitHub:</b> <a href="https://github.com/tencent-ailab/Penguin-VL">tencent-ailab/Penguin-VL</a> |
<b>arXiv:</b> <a href="https://arxiv.org/abs/2603.06569">2603.06569</a>
<br><br>
<a href="https://penguin-vl.github.io"><img src="https://img.shields.io/badge/Project-Page-green?logo=github" alt="Project Page"></a>
<a href="https://github.com/tencent-ailab/Penguin-VL"><img src="https://img.shields.io/badge/GitHub-Repo-blue?logo=github" alt="GitHub Badge"></a>
<a href="https://huggingface.co/spaces/tencent/Penguin-VL"><img src="https://img.shields.io/badge/HuggingFace-Spaces-yellow?logo=huggingface" alt="Hugging Face Spaces"></a>
<a href="https://arxiv.org/abs/2603.06569"><img src="https://img.shields.io/badge/arXiv-2603.06569-b31b1b.svg?logo=arxiv" alt="arXiv"></a>
</h4>
---
## News
* **2026.03**: PenguinVL-Encoder is now available for general use.
* **2026.03**: Released PenguinVL-2B and PenguinVL-8B.
---
## Model Overview
PenguinVL is a compact Vision-Language Model designed to explore the efficiency limits of small-scale VLMs. Rather than being only an instruction-tuned model, PenguinVL is built from the ground up through **LLM-based vision encoder construction, multimodal pretraining, and subsequent instruction tuning**.
Unlike most existing VLMs that rely on contrastive-pretrained vision encoders (e.g., CLIP/SigLIP), PenguinVL initializes its vision encoder directly from a **text-only LLM**. This design avoids the objective mismatch between contrastive learning and autoregressive language modeling, enabling tighter alignment between visual representations and the language backbone.
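To make the adaptation concrete, the following is an illustrative sketch (not the released implementation) of one common 2D-RoPE construction: the head dimension is split in half, and a standard 1D rotary embedding is applied to one half using the patch's row index and to the other half using its column index. The function names `rope_1d` and `rope_2d` are hypothetical.

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Standard 1D rotary embedding: rotate pairs (x[i], x[i + d/2])
    by angle pos * base**(-i / (d/2)). x has shape (..., d), d even."""
    d = x.shape[-1]
    half = d // 2
    freqs = base ** (-np.arange(half) / half)   # per-pair rotation frequencies
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def rope_2d(x, row, col):
    """2D-RoPE for image patches: the first half of the head dim encodes
    the row position, the second half the column position."""
    h = x.shape[-1] // 2
    return np.concatenate([rope_1d(x[..., :h], row),
                           rope_1d(x[..., h:], col)], axis=-1)
```

Because each half is an orthogonal rotation, attention scores between rotated queries and keys depend only on the *relative* (row, col) offset between patches, which is what makes the scheme a natural 2D generalization of the 1D positions the text LLM was pretrained with.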
### Key Characteristics
- **LLM-based Vision Encoder**
  The vision encoder is adapted from a pretrained text LLM (Qwen3-0.6B), modified with bidirectional attention and 2D-RoPE for spatial modeling. This provides strong semantic priors and native compatibility with the downstream LLM.
- **Efficient Video Understanding**
  A Temporal Redundancy-Aware (TRA) token-compression strategy dynamically allocates token budgets across frames, enabling long-video reasoning within a limited context window.
- **Unified Architecture**
  The model consists of:
  1. an LLM-initialized vision encoder
  2. a lightweight MLP projector
  3. a Qwen3 language backbone
- **Compact but Strong**
  At the 2B scale, Penguin-VL achieves competitive performance across image, document, OCR, math, and video benchmarks while remaining deployment-friendly.
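The idea behind redundancy-aware token allocation can be sketched in a few lines. This is a toy illustration under assumed behavior, not the released TRA algorithm: frames that differ more from their predecessor (lower cosine similarity) receive a larger share of a fixed visual-token budget. The function name `allocate_token_budget` is hypothetical.

```python
import numpy as np

def allocate_token_budget(frame_feats, total_budget, min_tokens=1):
    """Toy temporal-redundancy-aware allocation.

    frame_feats: (T, D) array of per-frame feature vectors.
    Returns an integer token budget per frame summing to total_budget.
    """
    feats = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    sim = np.sum(feats[1:] * feats[:-1], axis=1)    # cosine sim to previous frame
    novelty = np.concatenate([[1.0], 1.0 - sim])    # first frame is fully novel
    novelty = np.clip(novelty, 1e-6, None)
    weights = novelty / novelty.sum()
    budgets = np.maximum(min_tokens,
                         np.floor(weights * total_budget).astype(int))
    # Fix up the rounding remainder so budgets sum exactly to total_budget.
    remainder = total_budget - budgets.sum()
    if remainder > 0:
        order = np.argsort(-novelty)                # give extras to novel frames
        budgets[order[:remainder]] += 1
    elif remainder < 0:
        for i in np.argsort(novelty):               # take from redundant frames
            take = min(budgets[i] - min_tokens, -remainder)
            budgets[i] -= take
            remainder += take
            if remainder == 0:
                break
    return budgets
```

With near-duplicate frames, most of the budget flows to the frames where the scene actually changes, which is how a fixed context window can stretch to cover long videos.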
---
## Quick Start: Transformers Inference
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor
model_name = "tencent/Penguin-VL-2B"
model = AutoModelForCausalLM.from_pretrained(
model_name,
trust_remote_code=True,
device_map="auto",
torch_dtype=torch.bfloat16,
)
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
# Example: Image + Text
inputs = processor(
conversation=[
{"role": "system", "content": "You are a helpful assistant."},
{
"role": "user",
"content": [
{"type": "image", "image": {"image_path": "assets/example.jpg"}},
{"type": "text", "text": "Describe this image."}
],
},
],
return_tensors="pt",
)
# Move tensors to the model's device and match the model's compute dtype
inputs = {k: v.to(model.device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
if "pixel_values" in inputs:
    inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)
output_ids = model.generate(**inputs, max_new_tokens=128)
response = processor.decode(output_ids[0], skip_special_tokens=True)
print(response)
```
## Model Zoo
| Model | Base Model | HF Link |
| -------------------- | ------------ | ------------------------------------------------------------ |
| PenguinVL-8B | Qwen3-8B | [tencent/Penguin-VL-8B](https://huggingface.co/tencent/Penguin-VL-8B) |
| PenguinVL-2B | Qwen3-1.7B | [tencent/Penguin-VL-2B](https://huggingface.co/tencent/Penguin-VL-2B) |
| PenguinVL-Encoder | Qwen3-0.6B | [tencent/Penguin-Encoder](https://huggingface.co/tencent/Penguin-Encoder) |
## Main Results
### Chart / OCR / Document Understanding
| Benchmark | **Penguin-VL 2B** | Qwen3-VL 2B | InternVL3.5 2B | Gemma3n E2B-it | SmolVLM2 2.2B |
|---|---:|---:|---:|---:|---:|
| InfoVQA | **77.8** | 72.4 | 70.8 | 51.9 | 43.0 |
| ChartQA | **86.6** | 76.9 | 80.7 | 65.8 | 68.7 |
| DocVQA | **94.1** | 93.3 | 89.4 | 78.4 | 80.0 |
| CharXiv (DQ / RQ) | **66.4 / 35.8** | 62.3 / 26.8 | 65.0 / 31.6 | 60.1 / 27.0 | 36.9 / 15.5 |
| OCRBench | 810 | **858** | 836 | 700 | 729 |
### General Knowledge / Multi-Image / Math Reasoning
| Benchmark | **Penguin-VL 2B** | Qwen3-VL 2B | InternVL3.5 2B | Gemma3n E2B-it | SmolVLM2 2.2B |
|---|---:|---:|---:|---:|---:|
| AI2D | **80.7** | 76.9 | 78.8 | 74.6 | 70.0 |
| RealWorldQA | **70.2** | 63.9 | 62.0 | 59.9 | 58.3 |
| V-star | **83.8** | 74.9 | 69.1 | 46.0 | 51.8 |
| MMMU-Pro | 31.4 | **36.5** | 31.6 | 28.0 | 20.1 |
| BLINK | 51.7 | **53.8** | 36.6 | 44.1 | 44.0 |
| MathVista | **67.3** | 61.3 | 60.8 | 50.4 | 51.5 |
| MathVerse | 35.9 | **52.1** | 39.6 | 22.5 | 21.5 |
| LogicVista | 41.3 | 35.8 | **47.7** | 33.9 | 24.8 |
### Video Understanding
| Benchmark | **Penguin-VL 2B** | Qwen3-VL 2B | InternVL3.5 2B | Gemma3n E2B-it | SmolVLM2 2.2B |
|---|---:|---:|---:|---:|---:|
| MVBench | 65.5 | 61.7 | **65.9** | 46.8 | 46.3 |
| LongVideoBench | **59.5** | 52.1 | 57.4 | 43.0 | 49.7 |
| VideoMME | 57.4 | **61.9** | 58.4 | 47.0 | 52.1 |
| EgoSchema | **57.6** | 55.7 | 50.5 | 48.0 | 34.0 |
| MMVU | **42.7** | 41.7 | **42.7** | 34.5 | 33.5 |
| CharadesSTA | **56.2** | 54.5 | 21.9 | 5.5 | 9.5 |
| NextQA | **79.9** | 76.9 | 76.1 | 65.4 | 62.4 |
| ActivityNetQA | **61.5** | 59.7 | 58.3 | 51.5 | 52.6 |
| Perception Test | **70.4** | 64.5 | 64.7 | 48.6 | 51.6 |
> **Bold** indicates the best score among the compared models.
> More details can be found in our paper.
## Citation
If you find Penguin-VL useful for your research and applications, please cite using this BibTeX:
```bibtex
@article{Penguin-VL,
title={Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders},
author={Boqiang Zhang and Lei Ke and Ruihan Yang and Qi Gao and Tianyuan Qu and Rossell Chen and Dong Yu and Leoweiliang},
journal={arXiv preprint arXiv:2603.06569},
year={2026}
}
``` |