	---
	license: apache-2.0
	language:
	- en
	metrics:
	- accuracy
	base_model:
	- Qwen/Qwen3-1.7B
	library_name: transformers
	tags:
	- multi-modal
	- large-language-model
	- vision-language-model
	- vision-encoder
	---

	<p align="center">
	<img src="https://cdn-uploads.huggingface.co/production/uploads/6258a6455ea3a0a9b6de3f22/mIMYeUFquGSbm89lT61TG.png" width="160" />
	</p>

	<h2 align="center">Penguin-VL</h2>
	<h4 align="center">
	Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders
	</h4>

	<h4 align="center">
	<b>Project Page:</b> <a href="https://penguin-vl.github.io">penguin-vl.github.io</a> |
	<b>GitHub:</b> <a href="https://github.com/tencent-ailab/Penguin-VL">tencent-ailab/Penguin-VL</a> |
	<b>arXiv:</b> <a href="https://arxiv.org/abs/2603.06569">2603.06569</a>
	<br><br>
	<a href="https://penguin-vl.github.io"><img src="https://img.shields.io/badge/Project-Page-green?logo=github" alt="Project Page"></a>
	<a href="https://github.com/tencent-ailab/Penguin-VL"><img src="https://img.shields.io/badge/GitHub-Repo-blue?logo=github" alt="GitHub Badge"></a>
	<a href="https://huggingface.co/spaces/tencent/Penguin-VL"><img src="https://img.shields.io/badge/HuggingFace-Spaces-yellow?logo=huggingface" alt="Hugging Face Spaces"></a>
	<a href="https://arxiv.org/abs/2603.06569"><img src="https://img.shields.io/badge/arXiv-2603.06569-b31b1b.svg?logo=arxiv" alt="arXiv"></a>
	</h4>

	---

	## 📰 News

	* 2026.03 — PenguinVL-Encoder now available for general use.
	* 2026.03 — Released PenguinVL-2B, PenguinVL-8B.

	---

	## 🌟 Model Overview

	Penguin-VL is a compact vision-language model (VLM) designed to explore the efficiency limits of small-scale VLMs. Rather than being only an instruction-tuned model, Penguin-VL is built from the ground up in three stages: LLM-based vision encoder construction, multimodal pretraining, and instruction tuning.

	Unlike most existing VLMs, which rely on contrastively pretrained vision encoders (e.g., CLIP or SigLIP), Penguin-VL initializes its vision encoder directly from a text-only LLM. This design avoids the objective mismatch between contrastive learning and autoregressive language modeling and enables tighter alignment between visual representations and the language backbone.

	### Key Characteristics

	- 🧠 LLM-based Vision Encoder
	  The vision encoder is adapted from a pretrained text LLM (Qwen3-0.6B), modified with bidirectional attention and 2D-RoPE for spatial modeling.
	  This provides strong semantic priors and native compatibility with the downstream LLM.

	- 🎥 Efficient Video Understanding
	  A Temporal Redundancy-Aware (TRA) token compression strategy dynamically allocates token budgets across frames, enabling long-video reasoning within a limited context window.

	- 🏗 Unified Architecture
	  The model consists of:
	  1. LLM-initialized vision encoder
	  2. Lightweight MLP projector
	  3. Qwen3 language backbone

	- 📊 Compact but Strong
	  At 2B scale, Penguin-VL achieves competitive performance across image, document, OCR, math, and video benchmarks while remaining deployment-friendly.
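
To make the 2D-RoPE idea from the first bullet more concrete, here is a minimal sketch of one common formulation: half of each feature vector is rotated according to the patch's row index and the other half according to its column index, so that attention scores depend only on relative 2D offsets. This is an illustration under that assumption, not Penguin-VL's actual implementation; the names `rope_1d` and `rope_2d`, the base of 10000, and the half/half split are choices made for this example.

```python
import math


def rope_1d(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles proportional to `pos`."""
    dim = len(vec)
    assert dim % 2 == 0, "rope_1d needs an even number of features"
    out = []
    for i in range(0, dim, 2):
        # Each pair gets its own frequency, as in standard 1D RoPE.
        theta = pos * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def rope_2d(vec, row, col, base=10000.0):
    """Assumed 2D variant: first half encodes the row, second half the column."""
    half = len(vec) // 2
    return rope_1d(vec[:half], row, base) + rope_1d(vec[half:], col, base)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.5, -1.0, 2.0, 0.25, 1.5, -0.5, 0.75, 1.0]
k = [1.0, 0.5, -0.25, 2.0, -1.0, 0.5, 1.25, -0.75]

# Same relative offset (+1 row, +2 columns) at two different absolute positions.
score_a = dot(rope_2d(q, 3, 5), rope_2d(k, 2, 3))
score_b = dot(rope_2d(q, 10, 7), rope_2d(k, 9, 5))
print(abs(score_a - score_b) < 1e-9)
```

The two scores agree because each rotation only changes the angle between a query pair and a key pair by the difference of their positions, which is the property that makes rotary embeddings suitable for spatial grids.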

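This README does not spell out how TRA assigns tokens, so the following is only a toy illustration of the general idea of redundancy-aware budgeting: frames that differ more from their predecessor receive a larger share of a fixed token budget. The function `allocate_token_budget`, the per-frame change scores, and the `min_tokens` floor are hypothetical and are not taken from the Penguin-VL code.

```python
def allocate_token_budget(frame_diffs, total_tokens, min_tokens=1):
    """Split `total_tokens` across frames in proportion to how much each changed.

    frame_diffs: non-negative change score per frame (hypothetical input, e.g. a
                 mean absolute pixel difference against the previous frame).
    """
    n = len(frame_diffs)
    if n == 0:
        return []
    if total_tokens < n * min_tokens:
        raise ValueError("budget too small to give every frame min_tokens")

    # Every frame keeps a small floor so static frames are not dropped entirely.
    spare = total_tokens - n * min_tokens
    total_diff = sum(frame_diffs)
    if total_diff == 0:
        weights = [1.0 / n] * n  # fully static clip: share evenly
    else:
        weights = [d / total_diff for d in frame_diffs]

    # Largest-remainder rounding keeps the sum exactly equal to total_tokens.
    exact = [w * spare for w in weights]
    extra = [int(e) for e in exact]
    leftover = spare - sum(extra)
    order = sorted(range(n), key=lambda i: exact[i] - extra[i], reverse=True)
    for i in order[:leftover]:
        extra[i] += 1
    return [min_tokens + e for e in extra]


# Frame 0 opens a scene, frames 1-2 barely change, frame 3 is a scene cut.
budgets = allocate_token_budget([10, 1, 1, 8], total_tokens=64, min_tokens=4)
print(budgets, sum(budgets))
```
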
	---

	## 🧪 Quick Start — Transformers Inference

	```python
	import torch
	from transformers import AutoModelForCausalLM, AutoProcessor

	model_name = "tencent/Penguin-VL-2B"

	# trust_remote_code=True lets transformers load the custom model code from the model repository.
	model = AutoModelForCausalLM.from_pretrained(
	    model_name,
	    trust_remote_code=True,
	    device_map="auto",
	    torch_dtype=torch.bfloat16,
	)

	processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)

	# Example: Image + Text
	inputs = processor(
	    conversation=[
	        {"role": "system", "content": "You are a helpful assistant."},
	        {
	            "role": "user",
	            "content": [
	                {"type": "image", "image": {"image_path": "assets/example.jpg"}},
	                {"type": "text", "text": "Describe this image."},
	            ],
	        },
	    ],
	    return_tensors="pt",
	)

	# Move tensor inputs to the device the model was placed on by device_map="auto".
	inputs = {k: v.to(model.device) for k, v in inputs.items() if isinstance(v, torch.Tensor)}

	output_ids = model.generate(**inputs, max_new_tokens=128)
	response = processor.decode(output_ids[0], skip_special_tokens=True)

	print(response)
	```
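
For prompts with several images, it can help to assemble the `conversation` list programmatically. The helper below is a small sketch that mirrors the message layout used in the example above. `build_conversation` is a hypothetical name, and whether the processor accepts more than one image entry per message is an assumption that should be checked against the model's remote code.

```python
def build_conversation(image_paths, question, system_prompt="You are a helpful assistant."):
    """Return a conversation list in the same layout as the Quick Start example."""
    # One image entry per path, followed by a single text entry with the question.
    content = [
        {"type": "image", "image": {"image_path": path}}
        for path in image_paths
    ]
    content.append({"type": "text", "text": question})
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": content},
    ]


conversation = build_conversation(["assets/example.jpg"], "Describe this image.")
print(conversation[1]["content"][-1]["text"])
```

The result can be passed as `processor(conversation=conversation, return_tensors="pt")` in place of the inline list.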

	## 🌎 Model Zoo
	| Model | Base Model | HF Link |
	| -------------------- | ------------ | ------------------------------------------------------------ |
	| PenguinVL-8B | Qwen3-8B | [tencent/Penguin-VL-8B](https://huggingface.co/tencent/Penguin-VL-8B) |
	| PenguinVL-2B | Qwen3-1.7B | [tencent/Penguin-VL-2B](https://huggingface.co/tencent/Penguin-VL-2B) |
	| PenguinVL-Encoder | Qwen3-0.6B | [tencent/Penguin-Encoder](https://huggingface.co/tencent/Penguin-Encoder) |

	## 🚀 Main Results

	### Chart / OCR / Document Understanding

	| Benchmark | Penguin-VL 2B | Qwen3-VL 2B | InternVL3.5 2B | Gemma3n E2B-it | SmolVLM2 2.2B |
	|---|---:|---:|---:|---:|---:|
	| InfoVQA | **77.8** | 72.4 | 70.8 | 51.9 | 43.0 |
	| ChartQA | **86.6** | 76.9 | 80.7 | 65.8 | 68.7 |
	| DocVQA | **94.1** | 93.3 | 89.4 | 78.4 | 80.0 |
	| CharXiv (DQ / RQ) | **66.4** / **35.8** | 62.3 / 26.8 | 65.0 / 31.6 | 60.1 / 27.0 | 36.9 / 15.5 |
	| OCRBench | 810 | **858** | 836 | 700 | 729 |

	### General Knowledge / Multi-Image / Math Reasoning

	| Benchmark | Penguin-VL 2B | Qwen3-VL 2B | InternVL3.5 2B | Gemma3n E2B-it | SmolVLM2 2.2B |
	|---|---:|---:|---:|---:|---:|
	| AI2D | **80.7** | 76.9 | 78.8 | 74.6 | 70.0 |
	| RealWorldQA | **70.2** | 63.9 | 62.0 | 59.9 | 58.3 |
	| V-star | **83.8** | 74.9 | 69.1 | 46.0 | 51.8 |
	| MMMU-Pro | 31.4 | **36.5** | 31.6 | 28.0 | 20.1 |
	| BLINK | 51.7 | **53.8** | 36.6 | 44.1 | 44.0 |
	| MathVista | **67.3** | 61.3 | 60.8 | 50.4 | 51.5 |
	| MathVerse | 35.9 | **52.1** | 39.6 | 22.5 | 21.5 |
	| LogicVista | 41.3 | 35.8 | **47.7** | 33.9 | 24.8 |

	### Video Understanding

	| Benchmark | Penguin-VL 2B | Qwen3-VL 2B | InternVL3.5 2B | Gemma3n E2B-it | SmolVLM2 2.2B |
	|---|---:|---:|---:|---:|---:|
	| MVBench | 65.5 | 61.7 | **65.9** | 46.8 | 46.3 |
	| LongVideoBench | **59.5** | 52.1 | 57.4 | 43.0 | 49.7 |
	| VideoMME | 57.4 | **61.9** | 58.4 | 47.0 | 52.1 |
	| EgoSchema | **57.6** | 55.7 | 50.5 | 48.0 | 34.0 |
	| MMVU | **42.7** | 41.7 | **42.7** | 34.5 | 33.5 |
	| CharadesSTA | **56.2** | 54.5 | 21.9 | 5.5 | 9.5 |
	| NextQA | **79.9** | 76.9 | 76.1 | 65.4 | 62.4 |
	| ActivityNetQA | **61.5** | 59.7 | 58.3 | 51.5 | 52.6 |
	| Perception Test | **70.4** | 64.5 | 64.7 | 48.6 | 51.6 |

	> Bold indicates the best score among compared models.
	> See [our paper](https://arxiv.org/abs/2603.06569) for more details.
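
As a quick way to read the video results, the snippet below counts on how many video benchmarks each model reaches the top score, with ties counted for every tied model. The numbers are copied from the Video Understanding table above; the helper name `count_wins` is only for this example, and the count is a rough summary rather than a substitute for the per-benchmark scores.

```python
MODELS = ["Penguin-VL 2B", "Qwen3-VL 2B", "InternVL3.5 2B", "Gemma3n E2B-it", "SmolVLM2 2.2B"]

# Scores copied from the Video Understanding table, in MODELS order.
VIDEO_SCORES = {
    "MVBench": [65.5, 61.7, 65.9, 46.8, 46.3],
    "LongVideoBench": [59.5, 52.1, 57.4, 43.0, 49.7],
    "VideoMME": [57.4, 61.9, 58.4, 47.0, 52.1],
    "EgoSchema": [57.6, 55.7, 50.5, 48.0, 34.0],
    "MMVU": [42.7, 41.7, 42.7, 34.5, 33.5],
    "CharadesSTA": [56.2, 54.5, 21.9, 5.5, 9.5],
    "NextQA": [79.9, 76.9, 76.1, 65.4, 62.4],
    "ActivityNetQA": [61.5, 59.7, 58.3, 51.5, 52.6],
    "Perception Test": [70.4, 64.5, 64.7, 48.6, 51.6],
}


def count_wins(scores, models):
    """Count, per model, the benchmarks where it has the highest score."""
    wins = {m: 0 for m in models}
    for row in scores.values():
        best = max(row)
        for m, s in zip(models, row):
            if s == best:
                wins[m] += 1
    return wins


wins = count_wins(VIDEO_SCORES, MODELS)
for model, n in wins.items():
    print(f"{model}: {n}/{len(VIDEO_SCORES)}")
```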


	## Citation

	If you find Penguin-VL useful for your research and applications, please cite using this BibTeX:
	```bibtex
	@article{Penguin-VL,
	  title={Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders},
	  author={Boqiang Zhang and Lei Ke and Ruihan Yang and Qi Gao and Tianyuan Qu and Rossell Chen and Dong Yu and Leoweiliang},
	  journal={arXiv preprint arXiv:2603.06569},
	  year={2026}
	}
	```