	---
	license: apache-2.0
	language:
	- en
	metrics:
	- accuracy
	base_model:
	- Qwen/Qwen3-0.6B
	library_name: transformers
	tags:
	- multi-modal
	- large-language-model
	- vision-language-model
	- vision-encoder
	---

	<p align="center">
	<img src="https://cdn-uploads.huggingface.co/production/uploads/6258a6455ea3a0a9b6de3f22/mIMYeUFquGSbm89lT61TG.png" width="160" />
	</p>


	<h2 align="center">Vision Encoder of Penguin-VL</h2>
	<h4 align="center">
	Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders
	</h4>

	<h4 align="center">
	<b>Project Page:</b> <a href="https://penguin-vl.github.io">penguin-vl.github.io</a> |
	<b>GitHub:</b> <a href="https://github.com/tencent-ailab/Penguin-VL">tencent-ailab/Penguin-VL</a> |
	<b>arXiv:</b> <a href="https://arxiv.org/abs/2603.06569">2603.06569</a>
	<br><br>
	<a href="https://penguin-vl.github.io"><img src="https://img.shields.io/badge/Project-Page-green?logo=github" alt="Project Page"></a>
	<a href="https://github.com/tencent-ailab/Penguin-VL"><img src="https://img.shields.io/badge/GitHub-Repo-blue?logo=github" alt="GitHub Badge"></a>
	<a href="https://huggingface.co/spaces/tencent/Penguin-VL"><img src="https://img.shields.io/badge/HuggingFace-Spaces-yellow?logo=huggingface" alt="Hugging Face Spaces"></a>
	<a href="https://arxiv.org/abs/2603.06569"><img src="https://img.shields.io/badge/arXiv-2603.06569-b31b1b.svg?logo=arxiv" alt="arXiv"></a>
	</h4>

	---

	## 📰 News

	* 2026.03: PenguinVL-Encoder is now available for general use.
	* 2026.03: Released PenguinVL-2B and PenguinVL-8B.

	---

	## 🌟 Model Overview

	Penguin-VL is a compact vision-language model (VLM) designed to explore the efficiency limits of small-scale VLMs.

	Unlike most existing VLMs that rely on contrastive-pretrained vision encoders (e.g., CLIP/SigLIP), Penguin-VL initializes its vision encoder directly from a text-only LLM. This design avoids the objective mismatch between contrastive learning and autoregressive language modeling, enabling tighter alignment between visual representations and the language backbone.

	### Key Characteristics

	- 🧠 LLM-based Vision Encoder
	  The vision encoder is adapted from a pretrained text-only LLM (Qwen3-0.6B), modified with bidirectional attention and 2D-RoPE for spatial modeling.
	  This provides strong semantic priors and native compatibility with the downstream LLM.
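
To make the two modifications above concrete, here is a minimal pure-Python sketch of one common 2D-RoPE formulation, together with the difference between a causal and a bidirectional attention mask. It is an illustration under assumptions, not Penguin-VL's actual implementation: the function names, the split of the head dimension into a row half and a column half, and the base of 10000 are hypothetical choices made for this example.

```python
import math

def attention_mask(n, causal):
    """mask[i][j] is True when token i may attend to token j."""
    return [[(j <= i) if causal else True for j in range(n)] for i in range(n)]

def rope_angles(pos, dim, base=10000.0):
    """Rotation angles for one scalar position over dim // 2 frequency pairs."""
    return [pos / (base ** (2 * i / dim)) for i in range(dim // 2)]

def rotate_pairs(vec, angles):
    """Rotate consecutive (even, odd) pairs of vec by the given angles."""
    out = []
    for i, theta in enumerate(angles):
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out

def apply_rope_2d(vec, row, col, base=10000.0):
    """Assumed layout: first half of vec encodes the row, second half the column."""
    half = len(vec) // 2
    assert half % 2 == 0, "head dim must be divisible by 4"
    return (rotate_pairs(vec[:half], rope_angles(row, half, base))
            + rotate_pairs(vec[half:], rope_angles(col, half, base)))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy query/key vectors with head dim 8 (values are arbitrary).
q = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
k = [0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]

# The score depends only on the (row, col) offset between query and key:
# both pairs below are separated by an offset of (2, 3).
s1 = dot(apply_rope_2d(q, 1, 2), apply_rope_2d(k, 3, 5))
s2 = dot(apply_rope_2d(q, 0, 0), apply_rope_2d(k, 2, 3))
assert abs(s1 - s2) < 1e-9

# A causal text LLM hides future tokens; a bidirectional encoder sees all patches.
causal = attention_mask(3, causal=True)
bidirectional = attention_mask(3, causal=False)
```

With a bidirectional mask every image patch can attend to every other patch, and the 2D rotation makes attention scores a function of relative row and column offsets rather than of a flattened 1D index.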

	---

	## 🧪 Quick Start — Transformers Inference

	```python
	import torch
	from transformers import AutoModel, AutoImageProcessor
	from transformers.image_utils import load_image

	model_name = "tencent/Penguin-Encoder"
	image_path = "your_img.jpg"
	images = load_image(image_path)

	# Load the encoder; trust_remote_code is needed for the custom model class.
	model = AutoModel.from_pretrained(
	    model_name,
	    trust_remote_code=True,
	    device_map="auto",
	    torch_dtype=torch.bfloat16,
	    attn_implementation="flash_attention_2",
	)
	processor = AutoImageProcessor.from_pretrained(model_name, trust_remote_code=True)

	# Preprocess the image and move every tensor to the GPU.
	inputs = processor(images=images, merge_size=1)
	inputs = {k: torch.tensor(v).cuda() for k, v in inputs.items()}
	# Cast pixel values to bfloat16 to match the model weights.
	if "pixel_values" in inputs:
	    inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)
	image_features = model(**inputs)
	```
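
The Quick Start hard-codes `attn_implementation="flash_attention_2"`, which normally requires the separate `flash-attn` package. A small helper like the one below can pick a fallback when that package is not installed. This is a sketch under assumptions: `pick_attn_implementation` is a hypothetical helper name, and whether the remote code for `tencent/Penguin-Encoder` supports the `"sdpa"` backend is not stated in this model card, so verify it before relying on it.

```python
import importlib.util

def pick_attn_implementation():
    """Return "flash_attention_2" if flash_attn is importable, else "sdpa".

    Assumption: the model's remote code accepts "sdpa" as a fallback backend.
    """
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"

attn_impl = pick_attn_implementation()
print(f"Using attention backend: {attn_impl}")
```

The returned string can then be passed as `attn_implementation=attn_impl` in the `AutoModel.from_pretrained` call.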

	## 🌎 Model Zoo
	| Model | Base Model | HF Link |
	| -------------------- | ------------ | ------------------------------------------------------------ |
	| PenguinVL-8B | Qwen3-8B | [tencent/Penguin-VL-8B](https://huggingface.co/tencent/Penguin-VL-8B) |
	| PenguinVL-2B | Qwen3-1.7B | [tencent/Penguin-VL-2B](https://huggingface.co/tencent/Penguin-VL-2B) |
	| PenguinVL-Encoder | Qwen3-0.6B | [tencent/Penguin-Encoder](https://huggingface.co/tencent/Penguin-Encoder) |

	## 🚀 Main Results
	Ablation Study:

	![image](https://cdn-uploads.huggingface.co/production/uploads/626938b16f8f86ad21deb989/JOSRpV_qEbTqdbYwH-hJr.png)

	For the main results, see the ablation section of our paper.

	## Citation

	If you find Penguin-VL useful for your research and applications, please cite using this BibTeX:
	```bibtex
	@article{Penguin-VL,
	  title={Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders},
	  author={Boqiang Zhang and Lei Ke and Ruihan Yang and Qi Gao and Tianyuan Qu and Rossell Chen and Dong Yu and Leoweiliang},
	  journal={arXiv preprint arXiv:2603.06569},
	  year={2026}
	}
	```