Instructions to use tencent/Penguin-VL-2B with libraries, inference providers, notebooks, and local apps.

Libraries

How to use tencent/Penguin-VL-2B with Transformers:

```
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="tencent/Penguin-VL-2B", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("tencent/Penguin-VL-2B", trust_remote_code=True, dtype="auto")
```

Local Apps

vLLM

How to use tencent/Penguin-VL-2B with vLLM:

Install from pip and serve model

```
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "tencent/Penguin-VL-2B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "tencent/Penguin-VL-2B",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```
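
The same request can also be sent from Python. The sketch below uses only the standard library and assumes the server started above is reachable at `http://localhost:8000`. The helper names `build_chat_payload` and `chat` are illustrative and are not part of vLLM.

```python
import json
import urllib.request


def build_chat_payload(prompt, model="tencent/Penguin-VL-2B"):
    """Build the JSON body for an OpenAI-compatible chat completion request."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }


def chat(prompt, base_url="http://localhost:8000"):
    """POST the prompt to /v1/chat/completions and return the reply text.

    Assumes a server is already running at base_url (an assumed default).
    """
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # OpenAI-compatible servers return the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]


# Example call (requires the server to be running):
# print(chat("What is the capital of France?"))
```

Passing `base_url="http://localhost:30000"` should target a server listening on port 30000 instead, since the request format is the same.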

Use Docker

```
docker model run hf.co/tencent/Penguin-VL-2B
```

SGLang

How to use tencent/Penguin-VL-2B with SGLang:

Install from pip and serve model

```
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "tencent/Penguin-VL-2B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "tencent/Penguin-VL-2B",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```

Use Docker images

```
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "tencent/Penguin-VL-2B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "tencent/Penguin-VL-2B",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```
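
The curl examples above send text only. OpenAI-compatible chat APIs generally accept images as `image_url` content parts, often as base64 data URLs. The sketch below builds such a request body. Whether a given server accepts image input for this model is an assumption the source does not confirm, and the helper names `image_message` and `build_image_payload` are illustrative.

```python
import base64
import json


def image_message(image_bytes, question, mime_type="image/jpeg"):
    """Build a user message holding one inline image and one text question."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            # The image travels as a data URL, so no file hosting is required.
            {"type": "image_url",
             "image_url": {"url": f"data:{mime_type};base64,{encoded}"}},
            {"type": "text", "text": question},
        ],
    }


def build_image_payload(image_bytes, question, model="tencent/Penguin-VL-2B"):
    """Wrap the message in a chat completion request body."""
    return {"model": model, "messages": [image_message(image_bytes, question)]}


# Example: serialize a body that could be POSTed to /v1/chat/completions.
# with open("assets/example.jpg", "rb") as f:
#     body = json.dumps(build_image_payload(f.read(), "Describe this image."))
```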

Docker Model Runner
How to use tencent/Penguin-VL-2B with Docker Model Runner:
```
docker model run hf.co/tencent/Penguin-VL-2B
```

Penguin-VL

Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders

Project Page: penguin-vl.github.io | GitHub: tencent-ailab/Penguin-VL | arXiv: 2603.06569

📰 News

2026.03 — PenguinVL-Encoder now available for general use.
2026.03 — Released PenguinVL-2B, PenguinVL-8B.

🌟 Model Overview

PenguinVL is a compact Vision-Language Model designed to explore the efficiency limits of small-scale VLMs. Rather than being only an instruction-tuned model, PenguinVL is built from the ground up through LLM-based vision encoder construction, multimodal pretraining, and subsequent instruction tuning.

Unlike most existing VLMs that rely on contrastive-pretrained vision encoders (e.g., CLIP/SigLIP), PenguinVL initializes its vision encoder directly from a text-only LLM. This design avoids the objective mismatch between contrastive learning and autoregressive language modeling, enabling tighter alignment between visual representations and the language backbone.

Key Characteristics

🧠 LLM-based Vision Encoder
The vision encoder is adapted from a pretrained text LLM (Qwen3-0.6B), modified with bidirectional attention and 2D-RoPE for spatial modeling.
This provides strong semantic priors and native compatibility with the downstream LLM.
🎥 Efficient Video Understanding
A Temporal Redundancy-Aware (TRA) token compression strategy dynamically allocates token budgets across frames, enabling long-video reasoning within a limited context window.
🏗 Unified Architecture
The model consists of:
1. LLM-initialized vision encoder
2. Lightweight MLP projector
3. Qwen3 language backbone
📊 Compact but Strong
At 2B scale, Penguin-VL achieves competitive performance across image, document, OCR, math, and video benchmarks while remaining deployment-friendly.
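
To make the 2D-RoPE idea concrete, here is a minimal sketch of one common formulation: half of the feature dimensions are rotated by the patch's row index and the other half by its column index. The source does not spell out Penguin-VL's exact variant, so the split, the base frequency of 10000, and the function names are assumptions for illustration only. The feature dimension must be divisible by 4.

```python
import math


def rope_1d(vec, position, base=10000.0):
    """Rotate consecutive feature pairs of vec by position-dependent angles."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        # Each pair gets its own frequency, as in standard RoPE.
        angle = position * base ** (-i / dim)
        c, s = math.cos(angle), math.sin(angle)
        out.append(vec[i] * c - vec[i + 1] * s)
        out.append(vec[i] * s + vec[i + 1] * c)
    return out


def rope_2d(vec, row, col):
    """Encode the row in the first half of the features, the column in the second."""
    half = len(vec) // 2
    return rope_1d(vec[:half], row) + rope_1d(vec[half:], col)


def dot(a, b):
    """Plain dot product, standing in for an attention score."""
    return sum(x * y for x, y in zip(a, b))


q = [0.1 * (i + 1) for i in range(8)]
k = [0.05 * (8 - i) for i in range(8)]

# Attention scores depend only on the relative (row, col) offset between patches.
score_a = dot(rope_2d(q, 2, 3), rope_2d(k, 1, 1))
score_b = dot(rope_2d(q, 5, 7), rope_2d(k, 4, 5))
print(abs(score_a - score_b) < 1e-9)
```

Because each rotation preserves vector norms and the scores depend only on relative offsets, the same weights can in principle be applied to image grids of different sizes.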

🧪 Quick Start — Transformers Inference

```
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_name = "tencent/Penguin-VL-2B"

# Load the model and its custom code from the Hub in bfloat16.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)

# Example: Image + Text
inputs = processor(
    conversation=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {"type": "image", "image": {"image_path": "assets/example.jpg"}},
                {"type": "text", "text": "Describe this image."}
            ],
        },
    ],
    return_tensors="pt",
)

# Move tensors to the device the model was loaded on.
inputs = {k: v.to(model.device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
# Match the image tensor dtype to the bfloat16 model weights.
if "pixel_values" in inputs:
    inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)

output_ids = model.generate(**inputs, max_new_tokens=128)
response = processor.decode(output_ids[0], skip_special_tokens=True)

print(response)
```
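
When running many prompts, it can help to build the conversation structure with a small helper. The sketch below only mirrors the message schema shown in the Quick Start. The function name `build_conversation` and its default system prompt argument are illustrative and are not part of the Penguin-VL code.

```python
def build_conversation(image_path, prompt, system_prompt="You are a helpful assistant."):
    """Return a single-image conversation in the schema used by the Quick Start."""
    return [
        {"role": "system", "content": system_prompt},
        {
            "role": "user",
            "content": [
                {"type": "image", "image": {"image_path": image_path}},
                {"type": "text", "text": prompt},
            ],
        },
    ]


conversation = build_conversation("assets/example.jpg", "Describe this image.")
```

The result can be passed as `processor(conversation=conversation, return_tensors="pt")`, exactly as in the Quick Start.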

🌎 Model Zoo

Model	Base Model	HF Link
PenguinVL-8B	Qwen3-8B	tencent/Penguin-VL-8B
PenguinVL-2B	Qwen3-1.7B	tencent/Penguin-VL-2B
PenguinVL-Encoder	Qwen3-0.6B	tencent/Penguin-Encoder

🚀 Main Results

Chart / OCR / Document Understanding

Benchmark	Penguin-VL 2B	Qwen3-VL 2B	InternVL3.5 2B	Gemma3n E2B-it	SmolVLM2 2.2B
InfoVQA	77.8	72.4	70.8	51.9	43.0
ChartQA	86.6	76.9	80.7	65.8	68.7
DocVQA	94.1	93.3	89.4	78.4	80.0
CharXiv (DQ / RQ)	66.4 / 35.8	62.3 / 26.8	65.0 / 31.6	60.1 / 27.0	36.9 / 15.5
OCRBench	810	858	836	700	729

General Knowledge / Multi-Image / Math Reasoning

Benchmark	Penguin-VL 2B	Qwen3-VL 2B	InternVL3.5 2B	Gemma3n E2B-it	SmolVLM2 2.2B
AI2D	80.7	76.9	78.8	74.6	70.0
RealWorldQA	70.2	63.9	62.0	59.9	58.3
V-star	83.8	74.9	69.1	46.0	51.8
MMMU-Pro	31.4	36.5	31.6	28.0	20.1
BLINK	51.7	53.8	36.6	44.1	44.0
MathVista	67.3	61.3	60.8	50.4	51.5
MathVerse	35.9	52.1	39.6	22.5	21.5
LogicVista	41.3	35.8	47.7	33.9	24.8

Video Understanding

Benchmark	Penguin-VL 2B	Qwen3-VL 2B	InternVL3.5 2B	Gemma3n E2B-it	SmolVLM2 2.2B
MVBench	65.5	61.7	65.9	46.8	46.3
LongVideoBench	59.5	52.1	57.4	43.0	49.7
VideoMME	57.4	61.9	58.4	47.0	52.1
EgoSchema	57.6	55.7	50.5	48.0	34.0
MMVU	42.7	41.7	42.7	34.5	33.5
CharadesSTA	56.2	54.5	21.9	5.5	9.5
NextQA	79.9	76.9	76.1	65.4	62.4
ActivityNetQA	61.5	59.7	58.3	51.5	52.6
Perception Test	70.4	64.5	64.7	48.6	51.6

Bold indicates the best score among the compared models. See our paper for more details.
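
In renderings where bold formatting is not visible, the best score per benchmark can be recomputed from the tables. The sketch below does this for the General Knowledge / Multi-Image / Math Reasoning table, with the scores copied from above and higher treated as better. The helper names are illustrative.

```python
# Scores copied from the "General Knowledge / Multi-Image / Math Reasoning" table.
MODELS = ["Penguin-VL 2B", "Qwen3-VL 2B", "InternVL3.5 2B", "Gemma3n E2B-it", "SmolVLM2 2.2B"]
SCORES = {
    "AI2D": [80.7, 76.9, 78.8, 74.6, 70.0],
    "RealWorldQA": [70.2, 63.9, 62.0, 59.9, 58.3],
    "V-star": [83.8, 74.9, 69.1, 46.0, 51.8],
    "MMMU-Pro": [31.4, 36.5, 31.6, 28.0, 20.1],
    "BLINK": [51.7, 53.8, 36.6, 44.1, 44.0],
    "MathVista": [67.3, 61.3, 60.8, 50.4, 51.5],
    "MathVerse": [35.9, 52.1, 39.6, 22.5, 21.5],
    "LogicVista": [41.3, 35.8, 47.7, 33.9, 24.8],
}


def best_models(scores, models):
    """Map each benchmark to the model(s) holding its top score."""
    result = {}
    for benchmark, row in scores.items():
        top = max(row)
        result[benchmark] = [m for m, s in zip(models, row) if s == top]
    return result


def win_counts(best):
    """Count how many benchmarks each model leads (ties count for every leader)."""
    counts = {}
    for winners in best.values():
        for model in winners:
            counts[model] = counts.get(model, 0) + 1
    return counts


best = best_models(SCORES, MODELS)
for benchmark, winners in best.items():
    print(f"{benchmark}: {', '.join(winners)}")
print(win_counts(best))
```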

Citation

If you find Penguin-VL useful for your research and applications, please cite using this BibTeX:

```
@article{Penguin-VL,
  title={Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders},
  author={Boqiang Zhang and Lei Ke and Ruihan Yang and Qi Gao and Tianyuan Qu and Rossell Chen and Dong Yu and Leoweiliang},
  journal={arXiv preprint arXiv:2603.06569},
  year={2026}
}
```