Penguin-VL

Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders

GitHub with detailed usage: tencent-ailab/Penguin-VL



📰 News

  • 2025.03 – PenguinVL-Encoder now available for general use.
  • 2025.03 – Released PenguinVL-2B, PenguinVL-8B.

🌟 Model Overview

PenguinVL is a compact Vision-Language Model designed to explore the efficiency limits of small-scale VLMs. Rather than being only an instruction-tuned model, PenguinVL is built from the ground up through LLM-based vision encoder construction, multimodal pretraining, and subsequent instruction tuning.

Unlike most existing VLMs that rely on contrastive-pretrained vision encoders (e.g., CLIP/SigLIP), PenguinVL initializes its vision encoder directly from a text-only LLM. This design avoids the objective mismatch between contrastive learning and autoregressive language modeling, enabling tighter alignment between visual representations and the language backbone.

Key Characteristics

  • 🧠 LLM-based Vision Encoder
    The vision encoder is adapted from a pretrained text LLM (Qwen3-0.6B), modified with bidirectional attention and 2D-RoPE for spatial modeling.
    This provides strong semantic priors and native compatibility with the downstream LLM.

  • 🎥 Efficient Video Understanding
    A Temporal Redundancy-Aware (TRA) token compression strategy dynamically allocates token budgets across frames, enabling long-video reasoning within a limited context window.

  • 🏗 Unified Architecture
    The model consists of:

    1. LLM-initialized vision encoder
    2. Lightweight MLP projector
    3. Qwen3 language backbone
  • 📊 Compact but Strong
    At 8B scale, Penguin-VL achieves competitive performance across image, document, OCR, math, and video benchmarks while remaining deployment-friendly.
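
The 2D-RoPE modification mentioned above can be illustrated with a minimal sketch. This is a hypothetical, simplified version in plain Python (the real encoder operates on batched tensors): one common way to extend 1-D RoPE to two dimensions, assumed here, is to split each head's dimension in half and rotate one half by the patch's row index and the other half by its column index.

```python
import math

def rope_1d(vec, pos, base=10000.0):
    """Apply standard 1-D RoPE to an even-length vector at position `pos`:
    rotate each consecutive pair of dims by a frequency-scaled angle."""
    d = len(vec)
    out = [0.0] * d
    for i in range(0, d, 2):
        theta = pos / (base ** (i / d))
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out

def rope_2d(vec, row, col):
    """2D-RoPE sketch: the first half of the head dim encodes the patch's
    row position, the second half its column position."""
    half = len(vec) // 2
    return rope_1d(vec[:half], row) + rope_1d(vec[half:], col)
```

As with 1-D RoPE, the rotation preserves vector norms, and dot products between rotated queries and keys depend only on the relative (row, column) offset between patches, which is what makes the encoding translation-aware.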

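
The TRA idea can likewise be sketched as a simple budget allocator. This is a hypothetical illustration, not the released implementation: given a per-frame "novelty" score (e.g. one minus the cosine similarity between consecutive frame features), the total token budget is split proportionally, so near-static frames are compressed aggressively while novel frames keep more tokens.

```python
def allocate_token_budget(novelty, total_budget, min_tokens=1):
    """Split `total_budget` tokens across frames in proportion to each
    frame's novelty score, guaranteeing every frame at least `min_tokens`.
    Assumes total_budget >= len(novelty) * min_tokens."""
    n = len(novelty)
    spare = max(total_budget - n * min_tokens, 0)
    total = sum(novelty) or 1.0  # avoid division by zero for all-static clips
    budgets = [min_tokens + int(spare * s / total) for s in novelty]
    # hand rounding leftovers to the most novel frames
    leftover = total_budget - sum(budgets)
    for i in sorted(range(n), key=lambda i: novelty[i], reverse=True)[:leftover]:
        budgets[i] += 1
    return budgets
```

For example, with novelty scores [0.9, 0.1, 0.5] and a budget of 10 tokens, the mostly-redundant second frame keeps only the minimum while the first frame receives the largest share.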

🧪 Quick Start – Transformers Inference

import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_name = "tencent/Penguin-VL-8B"

# Load the model in bfloat16 with its custom modeling code
# (trust_remote_code) and let Accelerate place the weights.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)

# Example: image + text in a chat-style conversation
inputs = processor(
    conversation=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {"type": "image", "image": {"image_path": "assets/example.jpg"}},
                {"type": "text", "text": "Describe this image."},
            ],
        },
    ],
    return_tensors="pt",
)

# Move all tensor inputs to the GPU.
inputs = {k: v.to("cuda") for k, v in inputs.items() if isinstance(v, torch.Tensor)}

output_ids = model.generate(**inputs, max_new_tokens=128)
response = processor.decode(output_ids[0], skip_special_tokens=True)

print(response)

🌎 Model Zoo

| Model             | Base Model | HF Link                |
| ----------------- | ---------- | ---------------------- |
| PenguinVL-8B      | Qwen3-8B   | tencent/Penguin-VL-8B  |
| PenguinVL-2B      | Qwen3-1.7B | tencent/Penguin-VL-2B  |
| PenguinVL-Encoder | Qwen3-0.6B | tencent/Penguin-Encoder |

🚀 Main Results

Chart / OCR / Document Understanding

| Benchmark         | Penguin-VL 8B | Qwen3-VL 8B     | InternVL3.5 8B | OpenAI GPT-5 nano |
| ----------------- | ------------- | --------------- | -------------- | ----------------- |
| InfoVQA           | **86.8**      | 83.1            | 79.1           | 49.2              |
| ChartQA           | **90.5**      | 89.6            | 86.7           | 48.6              |
| DocVQA            | **96.2**      | 96.1            | 92.3           | 78.3              |
| CharXiv (DQ / RQ) | 75.7 / 40.0   | **83.0 / 46.4** | 72.2 / 44.4    | 64.4 / 31.7       |
| OCRBench          | 852           | **896**         | 840            | 701               |

General Knowledge / Multi-Image / Math Reasoning

| Benchmark   | Penguin-VL 8B | Qwen3-VL 8B | InternVL3.5 8B | OpenAI GPT-5 nano |
| ----------- | ------------- | ----------- | -------------- | ----------------- |
| AI2D        | **86.1**      | 85.7        | 84.0           | 65.7              |
| RealWorldQA | **75.8**      | 71.5        | 67.5           | 60.7              |
| V-star      | **90.2**      | 90.1        | 70.7           | 63.4              |
| MMMU-Pro    | 40.2          | **55.9**    | 39.7           | 36.5              |
| BLINK       | 58.2          | **69.1**    | 59.5           | 42.2              |
| MathVista   | **77.4**      | 77.2        | 74.2           | 40.9              |
| MathVerse   | 50.8          | **62.1**    | 55.8           | 27.0              |
| LogicVista  | 53.8          | 55.3        | **57.3**       | 40.5              |

Video Understanding

| Benchmark       | Penguin-VL 8B | Qwen3-VL 8B | InternVL3.5 8B | OpenAI GPT-5 nano |
| --------------- | ------------- | ----------- | -------------- | ----------------- |
| MVBench         | 71.7          | 68.7        | **72.1**       | 52.9              |
| LongVideoBench  | **67.0**      | 62.6        | 62.1           | 38.1              |
| VideoMME        | 66.2          | **71.4**    | 66.0           | 49.4              |
| EgoSchema       | 67.0          | **70.2**    | 61.0           | 34.8              |
| MMVU            | 53.9          | **58.7**    | 51.5           | 51.0              |
| CharadesSTA     | **61.4**      | 56.0        | 32.8           | 5.0               |
| NextQA          | **85.4**      | 82.3        | 81.3           | 59.3              |
| ActivityNetQA   | **65.2**      | 63.7        | 60.1           | –                 |
| Perception Test | **78.0**      | 72.7        | 72.7           | –                 |

Bold indicates the best result among the compared models. More details can be found in our paper.

Citation

If you find Penguin-VL useful for your research and applications, please cite using this BibTeX:

...