---
license: apache-2.0
library_name: transformers
pipeline_tag: image-feature-extraction
---

# OneVision-Encoder

OneVision-Encoder (OV-Encoder) is a vision foundation model that introduces Codec-Aligned Sparsity as a foundational principle for multimodal intelligence. By concentrating computation on regions rich in signal entropy (an idea inspired by video codecs), the model achieves high efficiency and accuracy across image, video, and document understanding tasks.

This repository contains an intermediate pretraining stage of OneVision-Encoder, trained on images only so far.

- **Paper:** [OneVision-Encoder: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence](https://huggingface.co/papers/2602.08683)
- **Project Page:** [https://www.lmms-lab.com/onevision-encoder/index.html](https://www.lmms-lab.com/onevision-encoder/index.html)
- **Repository:** [GitHub - OneVision-Encoder](https://github.com/EvolvingLMMs-Lab/OneVision-Encoder)

## Usage

You can use the model with the Hugging Face `transformers` library (recommended version `4.57.3`). Note that `trust_remote_code=True` is required because the model architecture is defined in the repository.
```python
from transformers import AutoModel, AutoImageProcessor
from PIL import Image
import torch

# Load model and preprocessor
model = AutoModel.from_pretrained(
    "lmms-lab-encoder/onevision-encoder-large",
    trust_remote_code=True,
    attn_implementation="flash_attention_2"
).to("cuda").eval()

preprocessor = AutoImageProcessor.from_pretrained(
    "lmms-lab-encoder/onevision-encoder-large",
    trust_remote_code=True
)

# Image inference: [B, C, H, W]
image = Image.open("path/to/your/image.jpg")  # Replace with your image path
pixel_values = preprocessor(images=image, return_tensors="pt")["pixel_values"].to("cuda")

with torch.no_grad():
    outputs = model(pixel_values)

# outputs.last_hidden_state: [B, num_patches, hidden_size]
# outputs.pooler_output:     [B, hidden_size]
```

## Citation

```bibtex
@article{tang2026onevision_encoder,
  title   = {{OneVision-Encoder}: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence},
  author  = {Tang, Feilong and An, Xiang and Yan, Yunyao and Xie, Yin and Qin, Bin and Yang, Kaicheng and Shen, Yifei and Zhang, Yuanhan and Li, Chunyuan and Feng, Shikun and Chen, Changrui and Tan, Huajie and Hu, Ming and Zhang, Manyuan and Li, Bo and Feng, Ziyong and Liu, Ziwei and Ge, Zongyuan and Deng, Jiankang},
  journal = {arXiv preprint arXiv:2602.08683},
  year    = {2026},
  url     = {https://arxiv.org/abs/2602.08683}
}
```
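## Feature Similarity (Example)

Since the model is tagged for image-feature extraction, a common use of `pooler_output` is comparing images via cosine similarity of their pooled embeddings. A minimal sketch: the random tensors below stand in for real `pooler_output` features, and the hidden size of 1024 is an assumption, not confirmed by this card.

```python
import torch
import torch.nn.functional as F

# Stand-ins for pooled features of two images; in practice these would come
# from model(pixel_values).pooler_output as in the usage example above.
# hidden_size=1024 is assumed here for illustration only.
feat_a = torch.randn(1, 1024)
feat_b = torch.randn(1, 1024)

# Cosine similarity of the (implicitly L2-normalized) embeddings, in [-1, 1].
similarity = F.cosine_similarity(feat_a, feat_b, dim=-1)
print(similarity.item())
```

Higher values indicate more similar image content; for retrieval over many images, normalize the embeddings once and use a matrix product instead of pairwise calls.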