Penguin-VL

Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders

GitHub with detailed usage: tencent-ailab/Penguin-VL



📰 News

  • 2025.03 – PenguinVL-Encoder now available for general use.
  • 2025.03 – Released PenguinVL-2B, PenguinVL-8B.

🌟 Model Overview

PenguinVL is a compact Vision-Language Model designed to explore the efficiency limits of small-scale VLMs. Rather than being only an instruction-tuned model, PenguinVL is built from the ground up through LLM-based vision encoder construction, multimodal pretraining, and subsequent instruction tuning.

Unlike most existing VLMs that rely on contrastive-pretrained vision encoders (e.g., CLIP/SigLIP), PenguinVL initializes its vision encoder directly from a text-only LLM. This design avoids the objective mismatch between contrastive learning and autoregressive language modeling, enabling tighter alignment between visual representations and the language backbone.

Key Characteristics

  • 🧠 LLM-based Vision Encoder
    The vision encoder is adapted from a pretrained text LLM (Qwen3-0.6B), modified with bidirectional attention and 2D-RoPE for spatial modeling.
    This provides strong semantic priors and native compatibility with the downstream LLM.

  • 🎥 Efficient Video Understanding
    A Temporal Redundancy-Aware (TRA) token compression strategy dynamically allocates token budgets across frames, enabling long-video reasoning within a limited context window.

  • 🏗 Unified Architecture
    The model consists of:

    1. LLM-initialized vision encoder
    2. Lightweight MLP projector
    3. Qwen3 language backbone
  • 📊 Compact but Strong
    At 8B scale, Penguin-VL achieves competitive performance across image, document, OCR, math, and video benchmarks while remaining deployment-friendly.
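
The 2D-RoPE modification mentioned above can be illustrated with a minimal sketch. This is a hypothetical, simplified version in plain Python (the real encoder operates on batched tensors): one common way to extend 1-D RoPE to two dimensions, assumed here, is to split each head's dimension in half and rotate one half by the patch's row index and the other half by its column index.

```python
import math

def rope_1d(vec, pos, base=10000.0):
    """Apply standard 1-D RoPE to an even-length vector at position `pos`:
    rotate each consecutive pair of dims by a frequency-scaled angle."""
    d = len(vec)
    out = [0.0] * d
    for i in range(0, d, 2):
        theta = pos / (base ** (i / d))
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out

def rope_2d(vec, row, col):
    """2D-RoPE sketch: the first half of the head dim encodes the patch's
    row position, the second half its column position."""
    half = len(vec) // 2
    return rope_1d(vec[:half], row) + rope_1d(vec[half:], col)
```

As with 1-D RoPE, the rotation preserves vector norms, and dot products between rotated queries and keys depend only on the relative (row, column) offset between patches, which is what makes the encoding translation-aware.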

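
The TRA idea can likewise be sketched as a simple budget allocator. This is a hypothetical illustration, not the released implementation: given a per-frame "novelty" score (e.g. one minus the cosine similarity between consecutive frame features), the total token budget is split proportionally, so near-static frames are compressed aggressively while novel frames keep more tokens.

```python
def allocate_token_budget(novelty, total_budget, min_tokens=1):
    """Split `total_budget` tokens across frames in proportion to each
    frame's novelty score, guaranteeing every frame at least `min_tokens`.
    Assumes total_budget >= len(novelty) * min_tokens."""
    n = len(novelty)
    spare = max(total_budget - n * min_tokens, 0)
    total = sum(novelty) or 1.0  # avoid division by zero for all-static clips
    budgets = [min_tokens + int(spare * s / total) for s in novelty]
    # hand rounding leftovers to the most novel frames
    leftover = total_budget - sum(budgets)
    for i in sorted(range(n), key=lambda i: novelty[i], reverse=True)[:leftover]:
        budgets[i] += 1
    return budgets
```

For example, with novelty scores [0.9, 0.1, 0.5] and a budget of 10 tokens, the mostly-redundant second frame keeps only the minimum while the first frame receives the largest share.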

🧪 Quick Start – Transformers Inference

import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_name = "tencent/Penguin-VL-8B"

# Load the model in bfloat16 with its custom modeling code
# (trust_remote_code) and let Accelerate place the weights.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)

# Example: image + text in a chat-style conversation
inputs = processor(
    conversation=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {"type": "image", "image": {"image_path": "assets/example.jpg"}},
                {"type": "text", "text": "Describe this image."},
            ],
        },
    ],
    return_tensors="pt",
)

# Move all tensor inputs to the GPU.
inputs = {k: v.to("cuda") for k, v in inputs.items() if isinstance(v, torch.Tensor)}

output_ids = model.generate(**inputs, max_new_tokens=128)
response = processor.decode(output_ids[0], skip_special_tokens=True)

print(response)

🌎 Model Zoo

| Model             | Base Model | HF Link                |
| ----------------- | ---------- | ---------------------- |
| PenguinVL-8B      | Qwen3-8B   | tencent/Penguin-VL-8B  |
| PenguinVL-2B      | Qwen3-1.7B | tencent/Penguin-VL-2B  |
| PenguinVL-Encoder | Qwen3-0.6B | tencent/Penguin-Encoder |

🚀 Main Results

Chart / OCR / Document Understanding

| Benchmark         | Penguin-VL 8B | Qwen3-VL 8B     | InternVL3.5 8B | OpenAI GPT-5 nano |
| ----------------- | ------------- | --------------- | -------------- | ----------------- |
| InfoVQA           | **86.8**      | 83.1            | 79.1           | 49.2              |
| ChartQA           | **90.5**      | 89.6            | 86.7           | 48.6              |
| DocVQA            | **96.2**      | 96.1            | 92.3           | 78.3              |
| CharXiv (DQ / RQ) | 75.7 / 40.0   | **83.0 / 46.4** | 72.2 / 44.4    | 64.4 / 31.7       |
| OCRBench          | 852           | **896**         | 840            | 701               |

General Knowledge / Multi-Image / Math Reasoning

| Benchmark   | Penguin-VL 8B | Qwen3-VL 8B | InternVL3.5 8B | OpenAI GPT-5 nano |
| ----------- | ------------- | ----------- | -------------- | ----------------- |
| AI2D        | **86.1**      | 85.7        | 84.0           | 65.7              |
| RealWorldQA | **75.8**      | 71.5        | 67.5           | 60.7              |
| V-star      | **90.2**      | 90.1        | 70.7           | 63.4              |
| MMMU-Pro    | 40.2          | **55.9**    | 39.7           | 36.5              |
| BLINK       | 58.2          | **69.1**    | 59.5           | 42.2              |
| MathVista   | **77.4**      | 77.2        | 74.2           | 40.9              |
| MathVerse   | 50.8          | **62.1**    | 55.8           | 27.0              |
| LogicVista  | 53.8          | 55.3        | **57.3**       | 40.5              |

Video Understanding

| Benchmark       | Penguin-VL 8B | Qwen3-VL 8B | InternVL3.5 8B | OpenAI GPT-5 nano |
| --------------- | ------------- | ----------- | -------------- | ----------------- |
| MVBench         | 71.7          | 68.7        | **72.1**       | 52.9              |
| LongVideoBench  | **67.0**      | 62.6        | 62.1           | 38.1              |
| VideoMME        | 66.2          | **71.4**    | 66.0           | 49.4              |
| EgoSchema       | 67.0          | **70.2**    | 61.0           | 34.8              |
| MMVU            | 53.9          | **58.7**    | 51.5           | 51.0              |
| CharadesSTA     | **61.4**      | 56.0        | 32.8           | 5.0               |
| NextQA          | **85.4**      | 82.3        | 81.3           | 59.3              |
| ActivityNetQA   | **65.2**      | 63.7        | 60.1           | –                 |
| Perception Test | **78.0**      | 72.7        | 72.7           | –                 |

Bold indicates the best result among the compared models. More details can be found in our paper.

Citation

If you find Penguin-VL useful for your research and applications, please cite using this BibTeX:

...