---
license: apache-2.0
library_name: transformers
pipeline_tag: image-feature-extraction
---
# OneVision-Encoder
OneVision-Encoder (OV-Encoder) is a vision foundation model that introduces Codec-Aligned Sparsity as a foundational principle for multimodal intelligence. By focusing computation on regions rich in signal entropy (inspired by video codecs), the model achieves high efficiency and accuracy across image, video, and document understanding tasks.
This repository contains an intermediate pretraining checkpoint of OneVision-Encoder, trained on images only so far.
- **Paper:** [OneVision-Encoder: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence](https://huggingface.co/papers/2602.08683)
- **Project Page:** [https://www.lmms-lab.com/onevision-encoder/index.html](https://www.lmms-lab.com/onevision-encoder/index.html)
- **Repository:** [GitHub - OneVision-Encoder](https://github.com/EvolvingLMMs-Lab/OneVision-Encoder)
## Usage
You can use the model with the Hugging Face `transformers` library (recommended version `4.57.3`). Note that `trust_remote_code=True` is required because the model architecture is defined in the repository rather than in `transformers` itself.
```python
from transformers import AutoModel, AutoImageProcessor
from PIL import Image
import torch

# Load model and preprocessor
model = AutoModel.from_pretrained(
    "lmms-lab-encoder/onevision-encoder-large",
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
).to("cuda").eval()
preprocessor = AutoImageProcessor.from_pretrained(
    "lmms-lab-encoder/onevision-encoder-large",
    trust_remote_code=True,
)

# Image inference: pixel_values has shape [B, C, H, W]
image = Image.open("path/to/your/image.jpg")  # Replace with your image path
pixel_values = preprocessor(images=image, return_tensors="pt")["pixel_values"].to("cuda")

with torch.no_grad():
    outputs = model(pixel_values)

# outputs.last_hidden_state: [B, num_patches, hidden_size]
# outputs.pooler_output:     [B, hidden_size]
```
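Since the model is tagged for image feature extraction, a common downstream use of `pooler_output` is comparing images by embedding similarity. The sketch below illustrates this with random tensors standing in for two pooled embeddings (a hypothetical `hidden_size=1024` is assumed for illustration); in practice each embedding would come from `model(pixel_values).pooler_output` as shown above.

```python
import torch
import torch.nn.functional as F

# Stand-ins for two pooled image embeddings of shape [1, hidden_size].
# In practice these would be model(pixel_values).pooler_output.
hidden_size = 1024
emb_a = torch.randn(1, hidden_size)
emb_b = torch.randn(1, hidden_size)

# L2-normalize so the dot product equals cosine similarity, in [-1, 1]
emb_a = F.normalize(emb_a, dim=-1)
emb_b = F.normalize(emb_b, dim=-1)

similarity = (emb_a @ emb_b.T).item()
print(f"cosine similarity: {similarity:.4f}")
```

Normalizing before the dot product keeps the score scale-invariant, so embeddings from images of different content can be ranked directly.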
## Citation
```bibtex
@article{tang2026onevision_encoder,
title = {{OneVision-Encoder}: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence},
author = {Tang, Feilong and An, Xiang and Yan, Yunyao and Xie, Yin and Qin, Bin and Yang, Kaicheng and Shen, Yifei and Zhang, Yuanhan and Li, Chunyuan and Feng, Shikun and Chen, Changrui and Tan, Huajie and Hu, Ming and Zhang, Manyuan and Li, Bo and Feng, Ziyong and Liu, Ziwei and Ge, Zongyuan and Deng, Jiankang},
journal = {arXiv:2602.08683},
year = {2026},
url = {https://arxiv.org/abs/2602.08683}
}
```