---
license: apache-2.0
library_name: transformers
pipeline_tag: image-feature-extraction
---

# OneVision-Encoder

OneVision-Encoder (OV-Encoder) is a vision foundation model that introduces Codec-Aligned Sparsity as a foundational principle for multimodal intelligence. By concentrating computation on regions rich in signal entropy (an idea inspired by video codecs), the model achieves high efficiency and accuracy across image, video, and document understanding tasks.

This repository contains an intermediate pretraining stage of OneVision-Encoder, trained on images only so far.

- **Paper:** [OneVision-Encoder: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence](https://huggingface.co/papers/2602.08683)
- **Project Page:** [https://www.lmms-lab.com/onevision-encoder/index.html](https://www.lmms-lab.com/onevision-encoder/index.html)
- **Repository:** [GitHub - OneVision-Encoder](https://github.com/EvolvingLMMs-Lab/OneVision-Encoder)

## Usage

You can use the model with the Hugging Face `transformers` library (recommended version `4.57.3`). Note that `trust_remote_code=True` is required because the model architecture is defined in the repository.
```python
from transformers import AutoModel, AutoImageProcessor
from PIL import Image
import torch

# Load model and preprocessor
model = AutoModel.from_pretrained(
    "lmms-lab-encoder/onevision-encoder-large",
    trust_remote_code=True,
    attn_implementation="flash_attention_2"
).to("cuda").eval()

preprocessor = AutoImageProcessor.from_pretrained(
    "lmms-lab-encoder/onevision-encoder-large",
    trust_remote_code=True
)

# Image inference: [B, C, H, W]
image = Image.open("path/to/your/image.jpg")  # Replace with your image path
pixel_values = preprocessor(images=image, return_tensors="pt")["pixel_values"].to("cuda")

with torch.no_grad():
    outputs = model(pixel_values)

# outputs.last_hidden_state: [B, num_patches, hidden_size]
# outputs.pooler_output:     [B, hidden_size]
```

## Citation

```bibtex
@article{tang2026onevision_encoder,
  title   = {{OneVision-Encoder}: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence},
  author  = {Tang, Feilong and An, Xiang and Yan, Yunyao and Xie, Yin and Qin, Bin and Yang, Kaicheng and Shen, Yifei and Zhang, Yuanhan and Li, Chunyuan and Feng, Shikun and Chen, Changrui and Tan, Huajie and Hu, Ming and Zhang, Manyuan and Li, Bo and Feng, Ziyong and Liu, Ziwei and Ge, Zongyuan and Deng, Jiankang},
  journal = {arXiv preprint arXiv:2602.08683},
  year    = {2026},
  url     = {https://arxiv.org/abs/2602.08683}
}
```
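## Feature Similarity (Example)

Since the model is tagged for image-feature extraction, a common use of `pooler_output` is comparing images via cosine similarity of their pooled embeddings. A minimal sketch: the random tensors below stand in for real `pooler_output` features, and the hidden size of 1024 is an assumption, not confirmed by this card.

```python
import torch
import torch.nn.functional as F

# Stand-ins for pooled features of two images; in practice these would come
# from model(pixel_values).pooler_output as in the usage example above.
# hidden_size=1024 is assumed here for illustration only.
feat_a = torch.randn(1, 1024)
feat_b = torch.randn(1, 1024)

# Cosine similarity of the (implicitly L2-normalized) embeddings, in [-1, 1].
similarity = F.cosine_similarity(feat_a, feat_b, dim=-1)
print(similarity.item())
```

Higher values indicate more similar image content; for retrieval over many images, normalize the embeddings once and use a matrix product instead of pairwise calls.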