Add metadata and link to paper/code

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +29 -46
README.md CHANGED
@@ -1,5 +1,15 @@
+ ---
+ license: apache-2.0
+ library_name: transformers
+ pipeline_tag: image-feature-extraction
+ ---
+
  # OneVision-Encoder
  
+ OneVision-Encoder is an LLM-aligned vision transformer specifically optimized for Large Multimodal Models (LMMs). It is a core component of the [LLaVA-OneVision-2](https://huggingface.co/papers/2605.25979) series and is further detailed in the technical report [OneVision-Encoder: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence](https://arxiv.org/abs/2602.08683).
+
+ [**Project Page**](https://evolvinglmms-lab.github.io/LLaVA-OneVision-2/) | [**GitHub**](https://github.com/EvolvingLMMs-Lab/LLaVA-OneVision-2)
+
  ### Key Features
  
  - **LLM-Aligned Architecture**: Unlike standard vision backbones, this model is specifically optimized for **Large Multimodal Models (LMMs)**, ensuring seamless feature alignment and superior performance when connected to language models.
@@ -75,57 +85,12 @@ w_positions = torch.arange(grid_w, device="cuda").repeat(grid_h)
  w_positions = w_positions.repeat(num_frames) # [T * frame_tokens]
  
  patch_positions = torch.stack([t_positions, h_positions, w_positions], dim=-1).unsqueeze(0)
- # patch_positions example (256 tokens per frame, 16x16 patch grid):
- # Each row is [t, h, w].
- # First 4 patches of frame 0 (t=0):
- # patch_positions[0, 0:4, :] -> [[0, 0, 0], [0, 0, 1], [0, 0, 2], [0, 0, 3]]
- # First 4 patches of frame 1 (t=4):
- # patch_positions[0, 256:260, :] -> [[4, 0, 0], [4, 0, 1], [4, 0, 2], [4, 0, 3]]
  
  with torch.no_grad():
      outputs = model(video, patch_positions=patch_positions)
-
  ```
  
- ### Loading from Source Code
-
- ```bash
- git clone [https://github.com/EvolvingLMMs-Lab/OneVision-Encoder.git](https://github.com/EvolvingLMMs-Lab/OneVision-Encoder.git)
- cd OneVision-Encoder
- pip install -e .
-
- ```
-
- ```python
- from onevision_encoder import OneVisionEncoderModel, OneVisionEncoderConfig
- from transformers import AutoImageProcessor
- model = OneVisionEncoderModel.from_pretrained(
-     "lmms-lab-encoder/onevision-encoder-large-lang",
-     trust_remote_code=True,
-     attn_implementation="flash_attention_2"
- ).to("cuda").eval()
- preprocessor = AutoImageProcessor.from_pretrained(
-     "lmms-lab-encoder/onevision-encoder-large-lang",
-     trust_remote_code=True
- )
-
- ```
-
- ### LMM Probe Results
-
- Training on a mixed dataset of 740K samples from LLaVA-OneVision and 800K samples from LLaVA-Video SFT. The training pipeline proceeds directly to Stage 2 fine-tuning.
-
- We adopt a streamlined **native-resolution strategy** inspired by LLaVA-OneVision: when the input frame resolution matches the model's native input size, it is fed **directly**—without tiling or cropping—to evaluate the ViT's capability to handle **true native resolution** and **arbitrary frame sequences**.
-
- <p align="center">
- <picture>
- <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/probe_lmm_github_dark_fixed.png">
- <source media="(prefers-color-scheme: light)" srcset="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/probe_lmm_github_light.png">
- <img alt="LMM Probe Results" src="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/probe_lmm_github_light.png" width="800" style="max-width: 100%;">
- </picture>
- </p>
-
- ### Model Card
+ ### Model Properties
  
  | Property | Value |
  | --- | --- |
@@ -143,3 +108,21 @@ We adopt a streamlined **native-resolution strategy** inspired by LLaVA-OneVisio
  | **Normalization** | Layer Normalization |
  | **Activation Function** | GELU |
  | **License** | Apache 2.0 |
+
+ ### Citation
+
+ ```bibtex
+ @inproceedings{LLaVA-OneVision-2,
+ title={LLaVA-OneVision-2},
+ author={llava-onevision contributors},
+ booktitle={arXiv},
+ year={2026}
+ }
+
+ @article{tang2026onevisionencoder,
+ title={OneVision-Encoder: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence},
+ author={Tang, Feilong and An, Xiang and Yan, Yunyao and Xie, Yin and Qin, Bin and Yang, Kaicheng and Shen, Yifei and Zhang, Yuanhan and Li, Chunyuan and Feng, Shikun and Chen, Changrui and Tan, Huajie and Hu, Ming and Zhang, Manyuan and Li, Bo and Feng, Ziyong and Liu, Ziwei and Ge, Zongyuan and Deng, Jiankang},
+ journal={arXiv preprint arXiv:2602.08683},
+ year={2026}
+ }
+ ```
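
For reference, the `[t, h, w]` layout described by the comment lines removed in the `@@ -75,57 +85,12 @@` hunk can be reproduced without a GPU. The following is a minimal pure-Python sketch, not the README's code. `frame_indices` is a hypothetical choice of four sampled frames (stride 4, chosen to match the `t=4` in the removed comment), and the 16x16 grid comes from the same comment. It enumerates rows in the order those comments show, with the width index varying fastest, and omits the leading batch dimension that `.unsqueeze(0)` adds.

```python
# Hypothetical values for illustration only; not read from the model config.
frame_indices = [0, 4, 8, 12]  # temporal index of each sampled frame (assumed stride 4)
grid_h, grid_w = 16, 16        # 16x16 patch grid -> 256 tokens per frame

# One [t, h, w] row per patch token: frames outermost, then rows, then columns.
patch_positions = [
    [t, h, w]
    for t in frame_indices
    for h in range(grid_h)
    for w in range(grid_w)
]

print(len(patch_positions))      # 1024
print(patch_positions[0:4])      # [[0, 0, 0], [0, 0, 1], [0, 0, 2], [0, 0, 3]]
print(patch_positions[256:260])  # [[4, 0, 0], [4, 0, 1], [4, 0, 2], [4, 0, 3]]
```

Wrapping the list as `torch.tensor(patch_positions).unsqueeze(0)` would give a `[1, 1024, 3]` tensor, the same rank and layout as the `patch_positions` argument in the hunk.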