Penguin-VL
Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders
GitHub with detailed usage: tencent-ailab/Penguin-VL
News
- 2025.03: PenguinVL-Encoder is now available for general use.
- 2025.03: Released PenguinVL-2B and PenguinVL-8B.
Model Overview
PenguinVL is a compact Vision-Language Model designed to explore the efficiency limits of small-scale VLMs. Rather than being only an instruction-tuned model, PenguinVL is built from the ground up through LLM-based vision encoder construction, multimodal pretraining, and subsequent instruction tuning.
Unlike most existing VLMs that rely on contrastive-pretrained vision encoders (e.g., CLIP/SigLIP), PenguinVL initializes its vision encoder directly from a text-only LLM. This design avoids the objective mismatch between contrastive learning and autoregressive language modeling, enabling tighter alignment between visual representations and the language backbone.
Key Characteristics
LLM-based Vision Encoder
The vision encoder is adapted from a pretrained text LLM (Qwen3-0.6B), modified with bidirectional attention and 2D-RoPE for spatial modeling.
This provides strong semantic priors and native compatibility with the downstream LLM.
Efficient Video Understanding
A Temporal Redundancy-Aware (TRA) token compression strategy dynamically allocates token budgets across frames, enabling long-video reasoning within a limited context window.
Unified Architecture
The model consists of:
- LLM-initialized vision encoder
- Lightweight MLP projector
- Qwen3 language backbone
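To make the encoder modifications more concrete, here is a minimal pure-Python sketch of one common 2D-RoPE formulation, together with the difference between a causal and a bidirectional attention mask. It assumes the head dimension is split in half, with the first half rotated by the patch row index and the second half by the column index. The function names, the channel split, and `base=10000.0` are illustrative assumptions and are not confirmed details of the Penguin-VL implementation.

```python
import math


def rope_angles(pos, dim, base=10000.0):
    """Rotation angles for one position over dim // 2 frequency pairs."""
    return [pos / (base ** (2 * i / dim)) for i in range(dim // 2)]


def rotate_pairs(vec, angles):
    """Rotate consecutive (even, odd) channel pairs of vec by the given angles."""
    out = []
    for i, theta in enumerate(angles):
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out


def rope_2d(vec, row, col, base=10000.0):
    """Assumed 2D layout: first half of channels encodes row, second half column."""
    half = len(vec) // 2
    if len(vec) % 4 != 0:
        raise ValueError("head dimension must be divisible by 4")
    return (rotate_pairs(vec[:half], rope_angles(row, half, base))
            + rotate_pairs(vec[half:], rope_angles(col, half, base)))


def attention_mask(n, bidirectional):
    """True where query i may attend to key j; causal masks hide j > i."""
    return [[bidirectional or j <= i for j in range(n)] for i in range(n)]


def dot(a, b):
    """Plain dot product, used to inspect attention scores."""
    return sum(x * y for x, y in zip(a, b))
```

With this layout, the dot product between a rotated query and a rotated key depends only on their row and column offsets, which is the property that lets attention reason about relative patch positions. A text LLM would normally use the causal mask; an image encoder has no natural left-to-right order, so every patch is allowed to attend to every other patch.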
Compact but Strong
At 2B scale, Penguin-VL achieves competitive performance across image, document, OCR, math, and video benchmarks while remaining deployment-friendly.
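The description of TRA token compression above does not say how redundancy is scored or how the budget is split, so the following is only an illustrative sketch of the general idea of allocating a fixed token budget across frames. It assumes each frame comes with a non-negative "change" score (for example, how much it differs from the previous frame) and gives more tokens to frames that change more. The function `allocate_token_budget`, its parameters, and the proportional rule are all hypothetical and should not be read as the paper's algorithm.

```python
import math
from fractions import Fraction


def allocate_token_budget(frame_change, total_budget, min_tokens=1):
    """Split total_budget tokens across frames in proportion to frame_change.

    Every frame receives at least min_tokens; the remainder is shared
    proportionally, using largest-remainder rounding so the result sums
    exactly to total_budget. Hypothetical helper for illustration only.
    """
    n = len(frame_change)
    if n == 0:
        return []
    if any(c < 0 for c in frame_change):
        raise ValueError("change scores must be non-negative")
    spare = total_budget - n * min_tokens
    if spare < 0:
        raise ValueError("budget is smaller than the per-frame minimum")

    # Exact rational arithmetic avoids float rounding breaking the total.
    total = sum(Fraction(c) for c in frame_change)
    if total == 0:
        shares = [Fraction(spare, n)] * n  # static video: split evenly
    else:
        shares = [Fraction(c) * spare / total for c in frame_change]

    extra = [math.floor(s) for s in shares]
    leftover = spare - sum(extra)
    # Hand the remaining tokens to the frames with the largest fractional part.
    by_remainder = sorted(range(n), key=lambda i: shares[i] - extra[i], reverse=True)
    for i in by_remainder[:leftover]:
        extra[i] += 1
    return [min_tokens + e for e in extra]


# A nearly static first frame gets the minimum; the busiest frame gets the most.
print(allocate_token_budget([0.0, 0.5, 0.5, 1.0], total_budget=20, min_tokens=2))
# -> [2, 5, 5, 8]
```

The per-frame minimum keeps redundant frames visible to the model instead of dropping them entirely, while the fixed total keeps long videos inside the context window.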
Quick Start: Transformers Inference
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_name = "tencent/Penguin-VL-2B"

# trust_remote_code=True loads the custom model and processor code shipped with the checkpoint.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)

# Example: Image + Text
inputs = processor(
    conversation=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {"type": "image", "image": {"image_path": "assets/example.jpg"}},
                {"type": "text", "text": "Describe this image."},
            ],
        },
    ],
    return_tensors="pt",
)

# Keep only tensor inputs and move them to the device the model was loaded on.
inputs = {k: v.to(model.device) for k, v in inputs.items() if isinstance(v, torch.Tensor)}

# Generate up to 128 new tokens and decode the first (only) sequence.
output_ids = model.generate(**inputs, max_new_tokens=128)
response = processor.decode(output_ids[0], skip_special_tokens=True)
print(response)
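When calling the model repeatedly, it can help to build the `conversation` argument with a small helper instead of writing the nested dictionaries by hand. The sketch below reuses exactly the message schema from the Quick Start example; the helper name `build_conversation` is hypothetical, and passing more than one image in a single user turn is an assumption that should be checked against the GitHub usage guide.

```python
def build_conversation(image_paths, prompt,
                       system_prompt="You are a helpful assistant."):
    """Build a conversation list in the schema used by the Quick Start example.

    image_paths: list of local image paths (one entry per image).
    prompt: the user's text instruction, placed after the images.
    """
    content = [
        {"type": "image", "image": {"image_path": path}}
        for path in image_paths
    ]
    content.append({"type": "text", "text": prompt})
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": content},
    ]


conversation = build_conversation(["assets/example.jpg"], "Describe this image.")
# The result can then be passed on as in the Quick Start:
# inputs = processor(conversation=conversation, return_tensors="pt")
```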
Model Zoo
| Model | Base Model | HF Link |
|---|---|---|
| PenguinVL-8B | Qwen3-8B | tencent/Penguin-VL-8B |
| PenguinVL-2B | Qwen3-1.7B | tencent/Penguin-VL-2B |
| PenguinVL-Encoder | Qwen3-0.6B | tencent/Penguin-Encoder |
Main Results
Chart / OCR / Document Understanding
| Benchmark | Penguin-VL 2B | Qwen3-VL 2B | InternVL3.5 2B | Gemma3n E2B-it | SmolVLM2 2.2B |
|---|---|---|---|---|---|
| InfoVQA | **77.8** | 72.4 | 70.8 | 51.9 | 43.0 |
| ChartQA | **86.6** | 76.9 | 80.7 | 65.8 | 68.7 |
| DocVQA | **94.1** | 93.3 | 89.4 | 78.4 | 80.0 |
| CharXiv (DQ / RQ) | **66.4 / 35.8** | 62.3 / 26.8 | 65.0 / 31.6 | 60.1 / 27.0 | 36.9 / 15.5 |
| OCRBench | 810 | **858** | 836 | 700 | 729 |
General Knowledge / Multi-Image / Math Reasoning
| Benchmark | Penguin-VL 2B | Qwen3-VL 2B | InternVL3.5 2B | Gemma3n E2B-it | SmolVLM2 2.2B |
|---|---|---|---|---|---|
| AI2D | **80.7** | 76.9 | 78.8 | 74.6 | 70.0 |
| RealWorldQA | **70.2** | 63.9 | 62.0 | 59.9 | 58.3 |
| V-star | **83.8** | 74.9 | 69.1 | 46.0 | 51.8 |
| MMMU-Pro | 31.4 | **36.5** | 31.6 | 28.0 | 20.1 |
| BLINK | 51.7 | **53.8** | 36.6 | 44.1 | 44.0 |
| MathVista | **67.3** | 61.3 | 60.8 | 50.4 | 51.5 |
| MathVerse | 35.9 | **52.1** | 39.6 | 22.5 | 21.5 |
| LogicVista | 41.3 | 35.8 | **47.7** | 33.9 | 24.8 |
Video Understanding
| Benchmark | Penguin-VL 2B | Qwen3-VL 2B | InternVL3.5 2B | Gemma3n E2B-it | SmolVLM2 2.2B |
|---|---|---|---|---|---|
| MVBench | 65.5 | 61.7 | **65.9** | 46.8 | 46.3 |
| LongVideoBench | **59.5** | 52.1 | 57.4 | 43.0 | 49.7 |
| VideoMME | 57.4 | **61.9** | 58.4 | 47.0 | 52.1 |
| EgoSchema | **57.6** | 55.7 | 50.5 | 48.0 | 34.0 |
| MMVU | **42.7** | 41.7 | **42.7** | 34.5 | 33.5 |
| CharadesSTA | **56.2** | 54.5 | 21.9 | 5.5 | 9.5 |
| NextQA | **79.9** | 76.9 | 76.1 | 65.4 | 62.4 |
| ActivityNetQA | **61.5** | 59.7 | 58.3 | 51.5 | 52.6 |
| Perception Test | **70.4** | 64.5 | 64.7 | 48.6 | 51.6 |
Bold indicates the best score among compared models. See our paper for more details.
Citation
If you find Penguin-VL useful for your research and applications, please cite using this BibTeX:
...