nielsr (HF Staff) committed · verified
Commit 0e00854 · 1 parent: 9908b86

Add metadata and improve model card


Hi! I'm Niels, part of the community science team at Hugging Face.

This pull request improves the model card for OneVision-Encoder-Large by:
- Adding `library_name: transformers` to the metadata (verified via `auto_map` and code snippets); a usage sketch follows below.
- Adding the `image-feature-extraction` pipeline tag.
- Providing links to the paper, project page, and official GitHub repository.
- Adding the BibTeX citation.

These changes help users discover the model and understand how to use it within the Hugging Face ecosystem.
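
Because the card now declares `library_name: transformers` together with the `image-feature-extraction` task, the checkpoint should be loadable through the generic `AutoModel` entry points. Below is a minimal sketch of that flow; the repo id is assumed (adjust it to this model's actual Hub path), `trust_remote_code=True` is assumed to be required since the config routes loading through `auto_map`, and the exact processor class and output attributes may differ from what the custom code exposes.

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

# Assumed repo id for illustration; replace with the model's actual Hub path.
repo_id = "lmms-lab/OneVision-Encoder-Large"

# auto_map in the config points to custom modeling code, so remote code
# execution must be allowed explicitly.
processor = AutoImageProcessor.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)

image = Image.open("example.jpg")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Patch-level features; pool (e.g. mean over patches) for one image embedding.
features = outputs.last_hidden_state
print(features.shape)
```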

Files changed (1): README.md (+21 -2)
README.md CHANGED

````diff
@@ -1,13 +1,20 @@
 ---
 license: apache-2.0
+library_name: transformers
+pipeline_tag: image-feature-extraction
 ---
 
+# OneVision-Encoder-Large
+
+[[Paper](https://huggingface.co/papers/2602.08683)] [[Project Page](https://www.lmms-lab.com/onevision-encoder/index.html)] [[GitHub](https://github.com/EvolvingLMMs-Lab/OneVision-Encoder)]
+
+OneVision-Encoder is a vision foundation model that introduces **Codec-Aligned Sparsity** as a foundational principle for multimodal intelligence. By adopting Codec Patchification, the model focuses computation exclusively on the regions rich in signal entropy, achieving high efficiency and accuracy across image, video, and document understanding tasks.
 
 ### Key Features
 
 - **Codec-Style Patch Selection**: Instead of sampling sparse frames densely (all patches from few frames), OneVision Encoder samples dense frames sparsely (important patches from many frames).
 - **3D Rotary Position Embedding**: Uses a 4:6:6 split for temporal, height, and width dimensions to capture spatiotemporal relationships.
-
+- **Unified Vision Foundation**: A single base model for consistent understanding of images, videos, and OCR.
 
 #### Downstream Tasks
 
@@ -119,7 +126,7 @@ Training on a mixed dataset of 740K samples from LLaVA-OneVision and 800K sample
 </picture>
 </p>
 
-### Model Card
+### Model Card Summary
 
 | Property | Value |
 | ----------------------------- | --------------------------------- |
@@ -136,3 +143,15 @@ Training on a mixed dataset of 740K samples from LLaVA-OneVision and 800K sample
 | **Normalization**             | Layer Normalization                |
 | **Activation Function**       | GELU                               |
 | **License**                   | Apache 2.0                         |
+
+### Citation
+
+```bibtex
+@article{tang2026onevision_encoder,
+  title   = {{OneVision-Encoder}: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence},
+  author  = {Tang, Feilong and An, Xiang and Yan, Yunyao and Xie, Yin and Qin, Bin and Yang, Kaicheng and Shen, Yifei and Zhang, Yuanhan and Li, Chunyuan and Feng, Shikun and Chen, Changrui and Tan, Huajie and Hu, Ming and Zhang, Manyuan and Bo Li and Ziyong Feng and Ziwei Liu and Zongyuan Ge and Jiankang Deng},
+  journal = {arXiv:2602.08683},
+  year    = {2026},
+  url     = {https://arxiv.org/abs/2602.08683}
+}
+```
````
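
A note on the first Key Features bullet above: the "important patches from many frames" idea can be pictured as ranking patches by how much they change relative to the previous frame, the way a codec spends bits on residuals, and keeping only the top scorers. The toy sketch below illustrates that reading; the residual-energy criterion, the `keep_ratio` parameter, and the function name are expository assumptions, not the paper's actual selection rule.

```python
import torch

def select_codec_patches(frames, keep_ratio=0.25):
    """Keep the top patches by inter-frame residual energy.

    frames: (T, N, D) tensor of T frames, each with N patch embeddings.
    Returns the top T*N*keep_ratio patches that changed most relative to
    the same patch in the previous frame, plus their (frame, patch) ids.
    """
    T, N, D = frames.shape
    residual = frames.clone()
    residual[1:] = frames[1:] - frames[:-1]     # frame 0 kept whole ("I-frame")
    energy = residual.pow(2).sum(-1).flatten()  # (T*N,) per-patch change score

    k = max(1, int(T * N * keep_ratio))
    top = energy.topk(k).indices
    frame_ids, patch_ids = top // N, top % N
    return frames.reshape(T * N, D)[top], frame_ids, patch_ids

# Example: 32 frames x 196 patches, keep the most informative 25%.
frames = torch.randn(32, 196, 768)
patches, f_ids, p_ids = select_codec_patches(frames)
print(patches.shape)  # torch.Size([1568, 768])
```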
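Likewise, for the 4:6:6 split in the second bullet: rotary embeddings rotate channel pairs, so a 3D variant can divide each head's frequency slots among the temporal, height, and width axes in that ratio. A minimal sketch under a simplified frequency schedule follows; the model's real rotation scheme and dimension choices may differ.

```python
import torch

def rope_3d_angles(t_idx, h_idx, w_idx, head_dim=128, base=10000.0):
    """Build rotary angles with a 4:6:6 temporal/height/width split.

    Rotary embeddings act on channel pairs, so head_dim // 2 frequency
    slots are divided in the ratio 4:6:6 among the three axes.
    """
    pairs = head_dim // 2                # e.g. 64 frequency slots
    n_t = pairs * 4 // 16                # 16 slots for time (4 of every 16)
    n_h = pairs * 6 // 16                # 24 slots for height
    n_w = pairs - n_t - n_h              # 24 slots for width

    def axis_angles(pos, n):
        # Per-axis inverse frequencies (simplified RoPE schedule).
        inv_freq = base ** (-torch.arange(n, dtype=torch.float32) / n)
        return pos[:, None] * inv_freq[None, :]  # (num_pos, n)

    # Concatenate the per-axis angles into one (num_pos, pairs) table.
    return torch.cat(
        [axis_angles(t_idx, n_t), axis_angles(h_idx, n_h), axis_angles(w_idx, n_w)],
        dim=-1,
    )

# Example: angles for 8 patches with known (t, h, w) grid coordinates.
t = torch.arange(8, dtype=torch.float32)
h = torch.zeros(8)
w = torch.arange(8, dtype=torch.float32)
angles = rope_3d_angles(t, h, w)
cos, sin = angles.cos(), angles.sin()  # rotate q/k channel pairs with these
print(cos.shape)  # torch.Size([8, 64])
```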