Penguin-VL

Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders

GitHub with detailed usage: tencent-ailab/Penguin-VL



📰 News

  • 2025.03 – PenguinVL-Encoder now available for general use.
  • 2025.03 – Released PenguinVL-2B, PenguinVL-8B.

🌟 Model Overview

PenguinVL is a compact Vision-Language Model designed to explore the efficiency limits of small-scale VLMs. Rather than being only an instruction-tuned model, PenguinVL is built from the ground up through LLM-based vision encoder construction, multimodal pretraining, and subsequent instruction tuning.

Unlike most existing VLMs that rely on contrastive-pretrained vision encoders (e.g., CLIP/SigLIP), PenguinVL initializes its vision encoder directly from a text-only LLM. This design avoids the objective mismatch between contrastive learning and autoregressive language modeling, enabling tighter alignment between visual representations and the language backbone.

Key Characteristics

  • 🧠 LLM-based Vision Encoder
    The vision encoder is adapted from a pretrained text LLM (Qwen3-0.6B), modified with bidirectional attention and 2D-RoPE for spatial modeling.
    This provides strong semantic priors and native compatibility with the downstream LLM (a minimal 2D-RoPE sketch appears after this list).

  • 🎥 Efficient Video Understanding
    A Temporal Redundancy-Aware (TRA) token compression strategy dynamically allocates token budgets across frames, enabling long-video reasoning within a limited context window (a toy allocation sketch appears after this list).

  • πŸ— Unified Architecture
    The model consists of:

    1. LLM-initialized vision encoder
    2. Lightweight MLP projector
    3. Qwen3 language backbone
  • 📊 Compact but Strong
    At 2B scale, Penguin-VL achieves competitive performance across image, document, OCR, math, and video benchmarks while remaining deployment-friendly.
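
The exact encoder modifications are described in the paper; the snippet below is only a rough sketch of the 2D-RoPE idea, using one common layout in which half of each attention head's channels are rotated by the patch row index and the other half by the column index. Function names and the half/half split are illustrative assumptions, not the released implementation.

```python
import torch

def rope_angles(positions: torch.Tensor, dim: int, base: float = 10000.0) -> torch.Tensor:
    """Per-position rotation angles for standard RoPE; `dim` must be even."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    return positions.float()[:, None] * inv_freq[None, :]        # (num_tokens, dim // 2)

def rotate(x: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
    """Rotate consecutive channel pairs of x (num_tokens, dim) by `angles`."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def apply_2d_rope(qk: torch.Tensor, grid_h: int, grid_w: int) -> torch.Tensor:
    """Apply 2D-RoPE to per-patch queries/keys of shape (grid_h * grid_w, head_dim).

    Assumed layout: the first half of the channels encodes the row position,
    the second half encodes the column position."""
    half = qk.shape[-1] // 2
    rows = torch.arange(grid_h).repeat_interleave(grid_w)   # row index of each patch
    cols = torch.arange(grid_w).repeat(grid_h)              # column index of each patch
    return torch.cat(
        [rotate(qk[..., :half], rope_angles(rows, half)),
         rotate(qk[..., half:], rope_angles(cols, half))],
        dim=-1,
    )

# Bidirectional attention over patches simply means attending without a causal mask,
# unlike the autoregressive text LLM the encoder was initialized from.
q = torch.randn(14 * 14, 64)                 # 14x14 patch grid, head_dim = 64
q = apply_2d_rope(q, grid_h=14, grid_w=14)
```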

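How TRA measures redundancy is defined in the paper; as a toy illustration of the general idea, the sketch below splits a fixed visual-token budget across frames in proportion to how much each frame differs from its predecessor, so near-duplicate frames receive fewer tokens. The cosine-similarity heuristic, the per-frame floor, and all values are assumptions.

```python
import torch
import torch.nn.functional as F

def allocate_frame_budgets(frame_feats: torch.Tensor,
                           total_budget: int,
                           min_tokens: int = 4) -> list[int]:
    """Toy redundancy-aware split of `total_budget` visual tokens across frames.

    frame_feats: (num_frames, dim) pooled feature vector per frame.
    Frames that differ more from the previous frame get a larger share."""
    num_frames = frame_feats.shape[0]
    assert total_budget >= min_tokens * num_frames
    f = F.normalize(frame_feats, dim=-1)
    # Novelty = 1 - cosine similarity with the previous frame; frame 0 is fully novel.
    sim = (f[1:] * f[:-1]).sum(dim=-1)
    novelty = torch.cat([torch.ones(1), 1.0 - sim]).clamp(min=1e-3)
    # Guarantee a small floor per frame, split the remainder proportionally to novelty.
    remaining = total_budget - min_tokens * num_frames
    extra = (novelty / novelty.sum() * remaining).floor().long()
    budgets = (min_tokens + extra).tolist()
    # Give any rounding leftovers to the most novel frames.
    leftover = total_budget - sum(budgets)
    for i in novelty.argsort(descending=True)[:leftover]:
        budgets[int(i)] += 1
    return budgets

# Example: 16 sampled frames sharing a budget of 1024 visual tokens.
frame_feats = torch.randn(16, 256)
print(allocate_frame_budgets(frame_feats, total_budget=1024))
```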

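A minimal wiring sketch of this three-stage pipeline (placeholder modules, assumed hidden sizes and a two-layer projector; not the released code):

```python
import torch
import torch.nn as nn

class PenguinVLWiring(nn.Module):
    """Composition sketch only; the real encoder and backbone are Qwen3-based transformers."""

    def __init__(self, vision_dim: int = 1024, lm_dim: int = 2048):
        super().__init__()
        self.vision_encoder = nn.Identity()       # placeholder for the LLM-initialized encoder
        self.projector = nn.Sequential(           # lightweight MLP projector (assumed two layers)
            nn.Linear(vision_dim, lm_dim), nn.GELU(), nn.Linear(lm_dim, lm_dim),
        )
        self.language_model = nn.Identity()       # placeholder for the Qwen3 backbone

    def forward(self, patch_feats: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        visual_tokens = self.projector(self.vision_encoder(patch_feats))   # (B, P, lm_dim)
        sequence = torch.cat([visual_tokens, text_embeds], dim=1)          # visual tokens before text
        return self.language_model(sequence)
```
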
🧪 Quick Start – Transformers Inference

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_name = "tencent/Penguin-VL-2B"

# Load the model in bfloat16 and let Accelerate place it on available devices.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)

# Example: Image + Text
inputs = processor(
    conversation=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {"type": "image", "image": {"image_path": "assets/example.jpg"}},
                {"type": "text", "text": "Describe this image."}
            ],
        },
    ],
    return_tensors="pt",
)

# Move tensor inputs to the GPU; non-tensor entries are dropped.
inputs = {k: v.to("cuda") for k, v in inputs.items() if isinstance(v, torch.Tensor)}

output_ids = model.generate(**inputs, max_new_tokens=128)
response = processor.decode(output_ids[0], skip_special_tokens=True)

print(response)
```
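
Note that `output_ids` also contains the prompt tokens. If you only want the newly generated reply, a common Transformers pattern is to slice them off before decoding (this assumes the processor returns an `input_ids` tensor for the templated conversation):

```python
# Keep only the tokens generated after the prompt before decoding.
prompt_len = inputs["input_ids"].shape[1]
reply = processor.decode(output_ids[0][prompt_len:], skip_special_tokens=True)
print(reply)
```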

🌎 Model Zoo

| Model | Base Model | HF Link |
|---|---|---|
| PenguinVL-8B | Qwen3-8B | tencent/Penguin-VL-8B |
| PenguinVL-2B | Qwen3-1.7B | tencent/Penguin-VL-2B |
| PenguinVL-Encoder | Qwen3-0.6B | tencent/Penguin-Encoder |

🚀 Main Results

Chart / OCR / Document Understanding

| Benchmark | Penguin-VL 2B | Qwen3-VL 2B | InternVL3.5 2B | Gemma3n E2B-it | SmolVLM2 2.2B |
|---|---|---|---|---|---|
| InfoVQA | **77.8** | 72.4 | 70.8 | 51.9 | 43.0 |
| ChartQA | **86.6** | 76.9 | 80.7 | 65.8 | 68.7 |
| DocVQA | **94.1** | 93.3 | 89.4 | 78.4 | 80.0 |
| CharXiv (DQ / RQ) | **66.4 / 35.8** | 62.3 / 26.8 | 65.0 / 31.6 | 60.1 / 27.0 | 36.9 / 15.5 |
| OCRBench | 810 | **858** | 836 | 700 | 729 |

General Knowledge / Multi-Image / Math Reasoning

| Benchmark | Penguin-VL 2B | Qwen3-VL 2B | InternVL3.5 2B | Gemma3n E2B-it | SmolVLM2 2.2B |
|---|---|---|---|---|---|
| AI2D | **80.7** | 76.9 | 78.8 | 74.6 | 70.0 |
| RealWorldQA | **70.2** | 63.9 | 62.0 | 59.9 | 58.3 |
| V-star | **83.8** | 74.9 | 69.1 | 46.0 | 51.8 |
| MMMU-Pro | 31.4 | **36.5** | 31.6 | 28.0 | 20.1 |
| BLINK | 51.7 | **53.8** | 36.6 | 44.1 | 44.0 |
| MathVista | **67.3** | 61.3 | 60.8 | 50.4 | 51.5 |
| MathVerse | 35.9 | **52.1** | 39.6 | 22.5 | 21.5 |
| LogicVista | 41.3 | 35.8 | **47.7** | 33.9 | 24.8 |

Video Understanding

| Benchmark | Penguin-VL 2B | Qwen3-VL 2B | InternVL3.5 2B | Gemma3n E2B-it | SmolVLM2 2.2B |
|---|---|---|---|---|---|
| MVBench | 65.5 | 61.7 | **65.9** | 46.8 | 46.3 |
| LongVideoBench | **59.5** | 52.1 | 57.4 | 43.0 | 49.7 |
| VideoMME | 57.4 | **61.9** | 58.4 | 47.0 | 52.1 |
| EgoSchema | **57.6** | 55.7 | 50.5 | 48.0 | 34.0 |
| MMVU | **42.7** | 41.7 | **42.7** | 34.5 | 33.5 |
| CharadesSTA | **56.2** | 54.5 | 21.9 | 5.5 | 9.5 |
| NextQA | **79.9** | 76.9 | 76.1 | 65.4 | 62.4 |
| ActivityNetQA | **61.5** | 59.7 | 58.3 | 51.5 | 52.6 |
| Perception Test | **70.4** | 64.5 | 64.7 | 48.6 | 51.6 |

Bold indicates the best score among the compared models. See our paper for more details.

Citation

If you find Penguin-VL useful for your research and applications, please cite using this BibTeX:

...