---
license: apache-2.0
library_name: transformers
pipeline_tag: image-feature-extraction
---

# OneVision-Encoder

OneVision-Encoder (OV-Encoder) is a vision foundation model that introduces Codec-Aligned Sparsity as a foundational principle for multimodal intelligence. By focusing computation on regions rich in signal entropy (inspired by video codecs), the model achieves high efficiency and accuracy across image, video, and document understanding tasks.

This specific repository contains an intermediate stage of pretraining for the OneVision-Encoder model, where only images have been trained so far.

## Usage

You can use the model with the Hugging Face `transformers` library (version 4.57.3 recommended). Note that `trust_remote_code=True` is required because the model architecture is defined in this repository.

```python
from transformers import AutoModel, AutoImageProcessor
from PIL import Image
import torch

# Load model and preprocessor
model = AutoModel.from_pretrained(
    "lmms-lab-encoder/onevision-encoder-large",
    trust_remote_code=True,
    attn_implementation="flash_attention_2"  # requires the flash-attn package
).to("cuda").eval()

preprocessor = AutoImageProcessor.from_pretrained(
    "lmms-lab-encoder/onevision-encoder-large",
    trust_remote_code=True
)

# Image inference: pixel_values has shape [B, C, H, W]
image = Image.open("path/to/your/image.jpg")  # Replace with your image path
pixel_values = preprocessor(images=image, return_tensors="pt")["pixel_values"].to("cuda")

with torch.no_grad():
    outputs = model(pixel_values)
    # outputs.last_hidden_state: [B, num_patches, hidden_size]
    # outputs.pooler_output: [B, hidden_size]
```
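A common downstream use of the pooled features is comparing images by cosine similarity. The sketch below uses random tensors as stand-ins for `pooler_output` (so it runs without a GPU or the model weights); the hidden size of 1024 is an illustrative assumption, not a documented property of this checkpoint.

```python
import torch
import torch.nn.functional as F

# Stand-ins for two pooler_output tensors of shape [B, hidden_size].
# In practice these would come from model(pixel_values).pooler_output;
# the hidden size 1024 here is an assumption for illustration.
torch.manual_seed(0)
feat_a = torch.randn(1, 1024)
feat_b = torch.randn(1, 1024)

# L2-normalize so the dot product is cosine similarity in [-1, 1].
feat_a = F.normalize(feat_a, dim=-1)
feat_b = F.normalize(feat_b, dim=-1)
similarity = feat_a @ feat_b.T  # [1, 1]
print(similarity.item())
```

The same pattern scales to batches: with `feat_a` of shape `[N, hidden_size]` and `feat_b` of shape `[M, hidden_size]`, `feat_a @ feat_b.T` yields an `[N, M]` similarity matrix.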

## Citation

```bibtex
@article{tang2026onevision_encoder,
    title   = {{OneVision-Encoder}: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence},
    author  = {Tang, Feilong and An, Xiang and Yan, Yunyao and Xie, Yin and Qin, Bin and Yang, Kaicheng and Shen, Yifei and Zhang, Yuanhan and Li, Chunyuan and Feng, Shikun and Chen, Changrui and Tan, Huajie and Hu, Ming and Zhang, Manyuan and Li, Bo and Feng, Ziyong and Liu, Ziwei and Ge, Zongyuan and Deng, Jiankang},
    journal = {arXiv preprint arXiv:2602.08683},
    year    = {2026},
    url     = {https://arxiv.org/abs/2602.08683}
}
```