---
language:
  - en
license: apache-2.0
pipeline_tag: image-text-to-text
tags:
  - vision-language-model
  - multimodal
  - custom_code
library_name: transformers
---

# Vision Encoder of PenguinVL

*Exploring the Efficiency Limits of VLMs with LLM-based Vision Encoders*


## 📰 News

- **2025.03**: PenguinVL-Encoder is now available for general use.
- **2025.03**: Released PenguinVL-2B and PenguinVL-8B.

## 🌟 Model Overview

PenguinVL is a compact Vision-Language Model designed to explore the efficiency limits of small-scale VLMs. Rather than being merely an instruction-tuned model, PenguinVL is built from the ground up: LLM-based vision-encoder construction, then multimodal pretraining, then instruction tuning.

Unlike most existing VLMs that rely on contrastive-pretrained vision encoders (e.g., CLIP/SigLIP), PenguinVL initializes its vision encoder directly from a text-only LLM. This design avoids the objective mismatch between contrastive learning and autoregressive language modeling, enabling tighter alignment between visual representations and the language backbone.
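
To make the idea concrete, here is a toy sketch of what adapting a text LLM into a vision encoder looks like. The layers below are stand-ins (not the released code): in PenguinVL the transformer blocks would be initialized from Qwen3-0.6B and use 2D-RoPE.

```python
import torch
import torch.nn as nn

class LLMVisionEncoder(nn.Module):
    """Toy sketch of an LLM-initialized vision encoder. Stand-in layers
    are used so the sketch runs on its own; the real model reuses the
    pretrained text LLM's blocks."""

    def __init__(self, hidden_size=1024, n_layers=4, patch_size=14):
        super().__init__()
        # Swap the token-embedding table for a patch embedding.
        self.patch_embed = nn.Conv2d(3, hidden_size,
                                     kernel_size=patch_size, stride=patch_size)
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model=hidden_size, nhead=8,
                                       batch_first=True)
            for _ in range(n_layers)
        ])

    def forward(self, pixel_values):
        # (B, 3, H, W) -> (B, num_patches, hidden_size)
        x = self.patch_embed(pixel_values).flatten(2).transpose(1, 2)
        for block in self.blocks:
            # Key change vs. the text LLM: no causal mask, so every patch
            # attends to every other patch (bidirectional attention).
            x = block(x)
        return x

features = LLMVisionEncoder()(torch.randn(1, 3, 224, 224))  # (1, 256, 1024)
```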

### Key Characteristics

- 🧠 **LLM-based Vision Encoder**
  The vision encoder is adapted from a pretrained text LLM (Qwen3-0.6B), modified with bidirectional attention and 2D-RoPE for spatial modeling (an illustrative 2D-RoPE sketch follows this list). This provides strong semantic priors and native compatibility with the downstream LLM.

- 🎥 **Efficient Video Understanding**
  A Temporal Redundancy-Aware (TRA) token compression strategy dynamically allocates token budgets across frames, enabling long-video reasoning within a limited context window (a token-budget sketch follows this list).

  • πŸ— Unified Architecture
    The model consists of:

    1. LLM-initialized vision encoder
    2. Lightweight MLP projector
    3. Qwen3 language backbone
- 📊 **Compact but Strong**
  At the 2B scale, PenguinVL achieves competitive performance across image, document, OCR, math, and video benchmarks while remaining deployment-friendly.
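
For 2D-RoPE, the following is an illustrative sketch rather than the model's exact implementation: half of each head's channels are rotated by the patch's row index and the other half by its column index (the half-split and interleaved channel layout are assumptions made for illustration).

```python
import torch

def rope_1d(x, pos, base=10000.0):
    """Standard 1D RoPE over the last dimension of x.
    x: (..., dim); pos: (...,) integer positions broadcastable to x."""
    dim = x.shape[-1]
    inv_freq = 1.0 / base ** (torch.arange(0, dim, 2) / dim)
    angles = pos[..., None] * inv_freq          # (..., dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_2d(x, rows, cols):
    """2D-RoPE sketch: rotate half the channels by the patch's row index
    and the other half by its column index."""
    half = x.shape[-1] // 2
    return torch.cat([rope_1d(x[..., :half], rows),
                      rope_1d(x[..., half:], cols)], dim=-1)

# One 16x16 grid of patch queries with head dimension 64.
q = torch.randn(16 * 16, 64)
rows = torch.arange(16).repeat_interleave(16)  # row index of each patch
cols = torch.arange(16).repeat(16)             # column index of each patch
q_rot = rope_2d(q, rows, cols)
```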

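The TRA strategy itself is not specified in detail here; the sketch below is a hypothetical rendering of the general idea, under the assumption that temporal redundancy is proxied by feature similarity between consecutive frames.

```python
import torch
import torch.nn.functional as F

def allocate_token_budget(frame_feats, total_budget, min_tokens=4):
    """Hypothetical TRA-style allocation: frames that differ more from
    their predecessor receive a larger share of the token budget.
    frame_feats: (T, D), one pooled feature vector per frame."""
    T = frame_feats.shape[0]
    novelty = torch.ones(T)  # the first frame always counts as fully novel
    # Low similarity to the previous frame => high novelty.
    novelty[1:] = 1.0 - F.cosine_similarity(frame_feats[1:],
                                            frame_feats[:-1], dim=-1)
    weights = novelty / novelty.sum()
    # Every frame keeps at least `min_tokens`; the remainder follows
    # novelty (integer truncation may leave a few tokens unassigned).
    extra = total_budget - min_tokens * T
    return min_tokens + (weights * extra).long()

feats = torch.randn(8, 256)  # 8 frames of pooled features
print(allocate_token_budget(feats, total_budget=512))
```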

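And a structural sketch of how the three components are wired together; the component classes and dimensions are illustrative stand-ins, not the released modules.

```python
import torch
import torch.nn as nn

class PenguinVLPipeline(nn.Module):
    """Structural sketch of the three-part design."""

    def __init__(self, vision_encoder, llm, vision_dim, llm_dim):
        super().__init__()
        self.vision_encoder = vision_encoder   # 1. LLM-initialized encoder
        self.projector = nn.Sequential(        # 2. lightweight MLP projector
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.llm = llm                         # 3. Qwen3 language backbone

    def forward(self, pixel_values, text_embeds):
        visual_tokens = self.projector(self.vision_encoder(pixel_values))
        # Visual tokens are placed alongside the text embeddings and the
        # combined sequence is processed by the language backbone.
        return self.llm(torch.cat([visual_tokens, text_embeds], dim=1))

# Wiring the toy encoder from the earlier sketch to a stand-in backbone:
model = PenguinVLPipeline(LLMVisionEncoder(), nn.Identity(), 1024, 2048)
out = model(torch.randn(1, 3, 224, 224), torch.randn(1, 12, 2048))
```
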
## 🧪 Quick Start – Transformers Inference

```python
import torch
from transformers import AutoModel, AutoImageProcessor
from transformers.image_utils import load_image

model_name = "tencent/Penguin-Encoder"
image_path = "assets/xxxx.jpg"
images = load_image(image_path)

# Load the encoder with its custom modeling code (hence trust_remote_code).
model = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
processor = AutoImageProcessor.from_pretrained(model_name, trust_remote_code=True)

# Preprocess the image; merge_size is a parameter of this model's
# custom image processor.
inputs = processor(images=images, merge_size=1)
# Move the inputs to the GPU as tensors.
inputs = {k: torch.tensor(v).cuda() for k, v in inputs.items()}
# Match the model's bfloat16 weights.
if "pixel_values" in inputs:
    inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)
image_features = model(**inputs)
```
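
Note: `attn_implementation="flash_attention_2"` requires the `flash-attn` package and a compatible GPU; if it is not installed, drop the argument or pass `attn_implementation="sdpa"` instead.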

## 🌎 Model Zoo

| Model | Base Model | HF Link |
|-------|------------|---------|
| PenguinVL-8B | Qwen3-8B | [tencent/Penguin-VL-8B](https://huggingface.co/tencent/Penguin-VL-8B) |
| PenguinVL-2B | Qwen3-1.7B | [tencent/Penguin-VL-2B](https://huggingface.co/tencent/Penguin-VL-2B) |
| PenguinVL-Encoder | Qwen3-0.6B | [tencent/Penguin-Encoder](https://huggingface.co/tencent/Penguin-Encoder) |

## 🚀 Main Results

xxx

## Citation

If you find PenguinVL useful for your research and applications, please cite using this BibTeX:

...