---
license: apache-2.0
language:
- en
metrics:
- accuracy
base_model:
- Qwen/Qwen3-1.7B
library_name: transformers
tags:
- multi-modal
- large-language-model
- vision-language-model
- vision-encoder
---

<p align="center">
    <img src="https://cdn-uploads.huggingface.co/production/uploads/6258a6455ea3a0a9b6de3f22/mIMYeUFquGSbm89lT61TG.png" width="160" />
</p>

<h2 align="center">Penguin-VL</h2>
<h4 align="center">
Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders
</h4>

<h4 align="center">
<b>Project Page:</b> <a href="https://penguin-vl.github.io">penguin-vl.github.io</a> |
<b>GitHub:</b> <a href="https://github.com/tencent-ailab/Penguin-VL">tencent-ailab/Penguin-VL</a> |
<b>arXiv:</b> <a href="https://arxiv.org/abs/2603.06569">2603.06569</a>
<br><br>
<a href="https://penguin-vl.github.io"><img src="https://img.shields.io/badge/Project-Page-green?logo=github" alt="Project Page"></a>
<a href="https://github.com/tencent-ailab/Penguin-VL"><img src="https://img.shields.io/badge/GitHub-Repo-blue?logo=github" alt="GitHub Badge"></a>
<a href="https://huggingface.co/spaces/tencent/Penguin-VL"><img src="https://img.shields.io/badge/HuggingFace-Spaces-yellow?logo=huggingface" alt="Hugging Face Spaces"></a>
<a href="https://arxiv.org/abs/2603.06569"><img src="https://img.shields.io/badge/arXiv-2603.06569-b31b1b.svg?logo=arxiv" alt="arXiv"></a>
</h4>

---

## 📰 News

* **2026.03**: PenguinVL-Encoder is now available for general use.
* **2026.03**: Released PenguinVL-2B and PenguinVL-8B.

---

## Model Overview

Penguin-VL is a compact Vision-Language Model designed to explore the efficiency limits of small-scale VLMs. Rather than being only an instruction-tuned model, Penguin-VL is built from the ground up through **LLM-based vision encoder construction, multimodal pretraining, and subsequent instruction tuning**.

Unlike most existing VLMs, which rely on contrastively pretrained vision encoders (e.g., CLIP/SigLIP), Penguin-VL initializes its vision encoder directly from a **text-only LLM**. This design avoids the objective mismatch between contrastive learning and autoregressive language modeling, enabling tighter alignment between visual representations and the language backbone.

### Key Characteristics

- 🧠 **LLM-based Vision Encoder**
  The vision encoder is adapted from a pretrained text LLM (Qwen3-0.6B), modified with bidirectional attention and 2D-RoPE for spatial modeling.
  This provides strong semantic priors and native compatibility with the downstream LLM.

- 🎥 **Efficient Video Understanding**
  A Temporal Redundancy-Aware (TRA) token compression strategy dynamically allocates token budgets across frames, enabling long-video reasoning within a limited context window.

- **Unified Architecture**
  The model consists of:
  1. LLM-initialized vision encoder
  2. Lightweight MLP projector
  3. Qwen3 language backbone

- **Compact but Strong**
  At 2B scale, Penguin-VL achieves competitive performance across image, document, OCR, math, and video benchmarks while remaining deployment-friendly.
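
To make the 2D-RoPE idea from the list above concrete, here is a minimal pure-Python sketch of one common formulation: half of the channels are rotated by the patch row index and the other half by the column index. The card does not state how Penguin-VL splits channels between axes, which pairing convention it uses, or which frequency base it uses, so `rope_rotate`, `rope_2d`, the adjacent-pair layout, and `base=10000.0` are all assumptions made for illustration.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Apply 1D rotary position embedding to an even-length vector.

    Channels are rotated in adjacent pairs (2i, 2i+1); pair i uses the
    angle position * base ** (-i / num_pairs).
    """
    assert len(vec) % 2 == 0
    num_pairs = len(vec) // 2
    out = []
    for i in range(num_pairs):
        theta = position * base ** (-i / num_pairs)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out


def rope_2d(vec, row, col):
    """Rotate the first half of the channels by `row`, the second half by `col`."""
    assert len(vec) % 4 == 0
    half = len(vec) // 2
    return rope_rotate(vec[:half], row) + rope_rotate(vec[half:], col)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Toy query/key vectors for two image patches.
q = [0.3, -1.2, 0.7, 0.5, 1.1, -0.4, 0.2, 0.9]
k = [1.0, 0.1, -0.6, 0.8, -0.3, 0.5, 0.4, -1.0]

# Same (row, col) offset of (1, 2) at two different absolute positions.
score_a = dot(rope_2d(q, 2, 3), rope_2d(k, 1, 1))
score_b = dot(rope_2d(q, 7, 9), rope_2d(k, 6, 7))
# Different row offset: (0, 2).
score_c = dot(rope_2d(q, 2, 3), rope_2d(k, 2, 1))

print(abs(score_a - score_b) < 1e-9)  # attention score depends only on the offset
print(abs(score_a - score_c) > 1e-6)  # changing the row offset changes the score
```

The property checked at the end is what makes rotary embeddings useful for spatial modeling: the query-key score depends on the relative row and column offset between two patches, not on their absolute positions in the image.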
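
The card does not describe how TRA scores redundancy or assigns budgets, so the following is only a toy illustration of the general idea of redundancy-aware token budgeting, not the method from the paper. It assumes each frame is summarized by a small feature vector and gives more tokens to frames that differ more from the previous frame. The function `allocate_frame_tokens`, its `min_tokens` and `max_tokens` parameters, and the mean-absolute-difference novelty score are all hypothetical.

```python
def allocate_frame_tokens(frames, total_budget, min_tokens=1, max_tokens=16):
    """Split `total_budget` tokens across frames by how much each frame changed.

    frames: list of equal-length lists of numbers (toy per-frame features).
    Hypothetical helper for illustration; not the TRA algorithm itself.
    """
    n = len(frames)
    if min_tokens < 1:
        raise ValueError("min_tokens must be at least 1")
    if total_budget < n * min_tokens:
        raise ValueError("budget too small to give every frame min_tokens")

    # Novelty = mean absolute difference from the previous frame.
    # The first frame has no predecessor, so treat it as fully novel.
    novelty = [1.0]
    for prev, cur in zip(frames, frames[1:]):
        novelty.append(sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur))

    alloc = [min_tokens] * n
    remaining = total_budget - n * min_tokens
    while remaining > 0:
        # Frames that still have room under the per-frame cap.
        open_frames = [i for i in range(n) if alloc[i] < max_tokens]
        if not open_frames:
            break  # every frame is capped; the rest of the budget stays unused
        # Give the next token to the frame with the most novelty per token held.
        best = max(open_frames, key=lambda i: novelty[i] / alloc[i])
        alloc[best] += 1
        remaining -= 1
    return alloc


# Four toy frames: the second repeats the first, the third is a scene change,
# and the fourth repeats the third.
frames = [[0, 0, 0, 0], [0, 0, 0, 0], [1, 1, 1, 1], [1, 1, 1, 1]]
budget = allocate_frame_tokens(frames, total_budget=12)
print(budget)  # [5, 1, 5, 1]
```

Redundant frames keep only the minimum number of tokens, so the saved budget can be spent on frames that carry new information. That is the intuition behind fitting long videos into a limited context window.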
---

## 🧪 Quick Start: Transformers Inference

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_name = "tencent/Penguin-VL-2B"

# trust_remote_code=True lets transformers load the model and processor
# code shipped in the checkpoint repository.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)

# Example: Image + Text
inputs = processor(
    conversation=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {"type": "image", "image": {"image_path": "assets/example.jpg"}},
                {"type": "text", "text": "Describe this image."},
            ],
        },
    ],
    return_tensors="pt",
)

# Keep only tensor inputs and move them to the device the model was placed on,
# instead of hard-coding "cuda".
inputs = {k: v.to(model.device) for k, v in inputs.items() if isinstance(v, torch.Tensor)}

output_ids = model.generate(**inputs, max_new_tokens=128)
response = processor.decode(output_ids[0], skip_special_tokens=True)

print(response)
```
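
If you build prompts programmatically, a small helper can assemble the same conversation structure used in the Quick Start. This is a convenience sketch, not part of the released API: `build_conversation` is a hypothetical name, and passing more than one image entry in a single user turn is an assumption that this card does not confirm.

```python
def build_conversation(image_paths, question, system_prompt="You are a helpful assistant."):
    """Build a conversation list in the schema shown in the Quick Start.

    Hypothetical helper; each path becomes one image entry, followed by the question.
    """
    content = [{"type": "image", "image": {"image_path": path}} for path in image_paths]
    content.append({"type": "text", "text": question})
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": content},
    ]


# Reproduces the conversation from the Quick Start example.
conversation = build_conversation(["assets/example.jpg"], "Describe this image.")
print(conversation)
```

The resulting list can be passed as the `conversation` argument of the `processor(...)` call shown above.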

## Model Zoo

| Model | Base Model | HF Link |
| -------------------- | ------------ | ------------------------------------------------------------ |
| PenguinVL-8B | Qwen3-8B | [tencent/Penguin-VL-8B](https://huggingface.co/tencent/Penguin-VL-8B) |
| PenguinVL-2B | Qwen3-1.7B | [tencent/Penguin-VL-2B](https://huggingface.co/tencent/Penguin-VL-2B) |
| PenguinVL-Encoder | Qwen3-0.6B | [tencent/Penguin-Encoder](https://huggingface.co/tencent/Penguin-Encoder) |

## Main Results

### Chart / OCR / Document Understanding

| Benchmark | **Penguin-VL 2B** | Qwen3-VL 2B | InternVL3.5 2B | Gemma3n E2B-it | SmolVLM2 2.2B |
|---|---:|---:|---:|---:|---:|
| InfoVQA | **77.8** | 72.4 | 70.8 | 51.9 | 43.0 |
| ChartQA | **86.6** | 76.9 | 80.7 | 65.8 | 68.7 |
| DocVQA | **94.1** | 93.3 | 89.4 | 78.4 | 80.0 |
| CharXiv (DQ / RQ) | **66.4 / 35.8** | 62.3 / 26.8 | 65.0 / 31.6 | 60.1 / 27.0 | 36.9 / 15.5 |
| OCRBench | 810 | **858** | 836 | 700 | 729 |

### General Knowledge / Multi-Image / Math Reasoning

| Benchmark | **Penguin-VL 2B** | Qwen3-VL 2B | InternVL3.5 2B | Gemma3n E2B-it | SmolVLM2 2.2B |
|---|---:|---:|---:|---:|---:|
| AI2D | **80.7** | 76.9 | 78.8 | 74.6 | 70.0 |
| RealWorldQA | **70.2** | 63.9 | 62.0 | 59.9 | 58.3 |
| V-star | **83.8** | 74.9 | 69.1 | 46.0 | 51.8 |
| MMMU-Pro | 31.4 | **36.5** | 31.6 | 28.0 | 20.1 |
| BLINK | 51.7 | **53.8** | 36.6 | 44.1 | 44.0 |
| MathVista | **67.3** | 61.3 | 60.8 | 50.4 | 51.5 |
| MathVerse | 35.9 | **52.1** | 39.6 | 22.5 | 21.5 |
| LogicVista | 41.3 | 35.8 | **47.7** | 33.9 | 24.8 |

### Video Understanding

| Benchmark | **Penguin-VL 2B** | Qwen3-VL 2B | InternVL3.5 2B | Gemma3n E2B-it | SmolVLM2 2.2B |
|---|---:|---:|---:|---:|---:|
| MVBench | 65.5 | 61.7 | **65.9** | 46.8 | 46.3 |
| LongVideoBench | **59.5** | 52.1 | 57.4 | 43.0 | 49.7 |
| VideoMME | 57.4 | **61.9** | 58.4 | 47.0 | 52.1 |
| EgoSchema | **57.6** | 55.7 | 50.5 | 48.0 | 34.0 |
| MMVU | **42.7** | 41.7 | **42.7** | 34.5 | 33.5 |
| CharadesSTA | **56.2** | 54.5 | 21.9 | 5.5 | 9.5 |
| NextQA | **79.9** | 76.9 | 76.1 | 65.4 | 62.4 |
| ActivityNetQA | **61.5** | 59.7 | 58.3 | 51.5 | 52.6 |
| Perception Test | **70.4** | 64.5 | 64.7 | 48.6 | 51.6 |

> **Bold** indicates the best score among the compared models.
> See our paper for more details.

## Citation

If you find Penguin-VL useful for your research and applications, please cite it using this BibTeX entry:

```bibtex
@article{Penguin-VL,
  title={Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders},
  author={Boqiang Zhang and Lei Ke and Ruihan Yang and Qi Gao and Tianyuan Qu and Rossell Chen and Dong Yu and Leoweiliang},
  journal={arXiv preprint arXiv:2603.06569},
  year={2026}
}
```