Add model card and sample usage for OneVision-Encoder
#1 by nielsr (HF Staff) - opened

README.md CHANGED
@@ -1,6 +1,58 @@
|
---
license: apache-2.0
library_name: transformers
pipeline_tag: image-feature-extraction
---

# OneVision-Encoder

OneVision-Encoder (OV-Encoder) is a vision foundation model that introduces Codec-Aligned Sparsity as a foundational principle for multimodal intelligence. By concentrating computation on regions rich in signal entropy, an idea inspired by video codecs, the model achieves high efficiency and accuracy across image, video, and document understanding tasks.

This repository contains an intermediate pretraining stage of OneVision-Encoder that has so far been trained on images only.

- **Paper:** [OneVision-Encoder: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence](https://huggingface.co/papers/2602.08683)
- **Project Page:** [https://www.lmms-lab.com/onevision-encoder/index.html](https://www.lmms-lab.com/onevision-encoder/index.html)
- **Repository:** [GitHub - OneVision-Encoder](https://github.com/EvolvingLMMs-Lab/OneVision-Encoder)

## Usage

You can use the model with the Hugging Face `transformers` library (version `4.57.3` is recommended). Note that `trust_remote_code=True` is required, as the model architecture is defined in this repository.

```python
from transformers import AutoModel, AutoImageProcessor
from PIL import Image
import torch

# Load model and preprocessor
model = AutoModel.from_pretrained(
    "lmms-lab-encoder/onevision-encoder-large",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,  # FlashAttention-2 requires fp16/bf16
    attn_implementation="flash_attention_2",
).to("cuda").eval()

preprocessor = AutoImageProcessor.from_pretrained(
    "lmms-lab-encoder/onevision-encoder-large",
    trust_remote_code=True,
)

# Image inference: pixel_values has shape [B, C, H, W]
image = Image.open("path/to/your/image.jpg")  # Replace with your image path
pixel_values = preprocessor(images=image, return_tensors="pt")["pixel_values"]
pixel_values = pixel_values.to("cuda", dtype=model.dtype)  # match model dtype

with torch.no_grad():
    outputs = model(pixel_values)
# outputs.last_hidden_state: [B, num_patches, hidden_size]
# outputs.pooler_output:     [B, hidden_size]
```
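
For downstream retrieval or matching, the pooled embedding is typically compared with cosine similarity. Below is a minimal, model-free sketch of that comparison using stand-in vectors; in practice you would pass rows of `outputs.pooler_output` (e.g. `outputs.pooler_output[0].float().tolist()`), which is an assumption about the intended downstream use rather than part of the official model card.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two feature vectors,
    # e.g. two [hidden_size] rows of outputs.pooler_output.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Stand-in embeddings; real ones come from the encoder.
emb_a = [0.2, -0.5, 0.9, 0.1]
emb_b = [0.1, -0.4, 1.0, 0.0]
print(round(cosine_similarity(emb_a, emb_b), 3))  # ≈ 0.983
```

Scores near 1.0 indicate visually similar inputs; for batched comparisons you would normalize the whole `pooler_output` matrix and take a matrix product instead.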

## Citation

```bibtex
@article{tang2026onevision_encoder,
  title   = {{OneVision-Encoder}: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence},
  author  = {Tang, Feilong and An, Xiang and Yan, Yunyao and Xie, Yin and Qin, Bin and Yang, Kaicheng and Shen, Yifei and Zhang, Yuanhan and Li, Chunyuan and Feng, Shikun and Chen, Changrui and Tan, Huajie and Hu, Ming and Zhang, Manyuan and Li, Bo and Feng, Ziyong and Liu, Ziwei and Ge, Zongyuan and Deng, Jiankang},
  journal = {arXiv preprint arXiv:2602.08683},
  year    = {2026},
  url     = {https://arxiv.org/abs/2602.08683}
}
```
|