Add metadata and link to paper/code
#1
by nielsr HF Staff - opened
README.md
CHANGED
@@ -1,5 +1,15 @@
# OneVision-Encoder

### Key Features

- **LLM-Aligned Architecture**: Unlike standard vision backbones, this model is specifically optimized for **Large Multimodal Models (LMMs)**, ensuring seamless feature alignment and superior performance when connected to language models.
@@ -75,57 +85,12 @@ w_positions = torch.arange(grid_w, device="cuda").repeat(grid_h)
w_positions = w_positions.repeat(num_frames) # [T * frame_tokens]

patch_positions = torch.stack([t_positions, h_positions, w_positions], dim=-1).unsqueeze(0)
-# patch_positions example (256 tokens per frame, 16x16 patch grid):
-# Each row is [t, h, w].
-# First 4 patches of frame 0 (t=0):
-# patch_positions[0, 0:4, :] -> [[0, 0, 0], [0, 0, 1], [0, 0, 2], [0, 0, 3]]
-# First 4 patches of frame 1 (t=4):
-# patch_positions[0, 256:260, :] -> [[4, 0, 0], [4, 0, 1], [4, 0, 2], [4, 0, 3]]

with torch.no_grad():
    outputs = model(video, patch_positions=patch_positions)
-
```

-###
-
-```bash
-git clone [https://github.com/EvolvingLMMs-Lab/OneVision-Encoder.git](https://github.com/EvolvingLMMs-Lab/OneVision-Encoder.git)
-cd OneVision-Encoder
-pip install -e .
-
-```
-
-```python
-from onevision_encoder import OneVisionEncoderModel, OneVisionEncoderConfig
-from transformers import AutoImageProcessor
-model = OneVisionEncoderModel.from_pretrained(
-    "lmms-lab-encoder/onevision-encoder-large-lang",
-    trust_remote_code=True,
-    attn_implementation="flash_attention_2"
-).to("cuda").eval()
-preprocessor = AutoImageProcessor.from_pretrained(
-    "lmms-lab-encoder/onevision-encoder-large-lang",
-    trust_remote_code=True
-)
-
-```
-
-### LMM Probe Results
-
-Training on a mixed dataset of 740K samples from LLaVA-OneVision and 800K samples from LLaVA-Video SFT. The training pipeline proceeds directly to Stage 2 fine-tuning.
-
-We adopt a streamlined **native-resolution strategy** inspired by LLaVA-OneVision: when the input frame resolution matches the model's native input size, it is fed **directly**, without tiling or cropping, to evaluate the ViT's capability to handle **true native resolution** and **arbitrary frame sequences**.
-
-<p align="center">
-<picture>
-<source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/probe_lmm_github_dark_fixed.png">
-<source media="(prefers-color-scheme: light)" srcset="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/probe_lmm_github_light.png">
-<img alt="LMM Probe Results" src="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/probe_lmm_github_light.png" width="800" style="max-width: 100%;">
-</picture>
-</p>
-
-### Model Card

| Property | Value |
| --- | --- |
@@ -143,3 +108,21 @@ We adopt a streamlined **native-resolution strategy** inspired by LLaVA-OneVisio
| **Normalization** | Layer Normalization |
| **Activation Function** | GELU |
| **License** | Apache 2.0 |
@@ -1,5 +1,15 @@
+---
+license: apache-2.0
+library_name: transformers
+pipeline_tag: image-feature-extraction
+---
+
# OneVision-Encoder

+OneVision-Encoder is an LLM-aligned vision transformer specifically optimized for Large Multimodal Models (LMMs). It is a core component of the [LLaVA-OneVision-2](https://huggingface.co/papers/2605.25979) series and is further detailed in the technical report [OneVision-Encoder: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence](https://arxiv.org/abs/2602.08683).
+
+[**Project Page**](https://evolvinglmms-lab.github.io/LLaVA-OneVision-2/) | [**GitHub**](https://github.com/EvolvingLMMs-Lab/LLaVA-OneVision-2)
+
### Key Features

- **LLM-Aligned Architecture**: Unlike standard vision backbones, this model is specifically optimized for **Large Multimodal Models (LMMs)**, ensuring seamless feature alignment and superior performance when connected to language models.
@@ -75,57 +85,12 @@ w_positions = torch.arange(grid_w, device="cuda").repeat(grid_h)
w_positions = w_positions.repeat(num_frames) # [T * frame_tokens]

patch_positions = torch.stack([t_positions, h_positions, w_positions], dim=-1).unsqueeze(0)

with torch.no_grad():
    outputs = model(video, patch_positions=patch_positions)
```

+### Model Properties

| Property | Value |
| --- | --- |
@@ -143,3 +108,21 @@ We adopt a streamlined **native-resolution strategy** inspired by LLaVA-OneVisio
| **Normalization** | Layer Normalization |
| **Activation Function** | GELU |
| **License** | Apache 2.0 |
+
+### Citation
+
+```bibtex
+@inproceedings{LLaVA-OneVision-2,
+  title={LLaVA-OneVision-2},
+  author={llava-onevision contributors},
+  booktitle={arXiv},
+  year={2026}
+}
+
+@article{tang2026onevisionencoder,
+  title={OneVision-Encoder: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence},
+  author={Tang, Feilong and An, Xiang and Yan, Yunyao and Xie, Yin and Qin, Bin and Yang, Kaicheng and Shen, Yifei and Zhang, Yuanhan and Li, Chunyuan and Feng, Shikun and Chen, Changrui and Tan, Huajie and Hu, Ming and Zhang, Manyuan and Li, Bo and Feng, Ziyong and Liu, Ziwei and Ge, Zongyuan and Deng, Jiankang},
+  journal={arXiv preprint arXiv:2602.08683},
+  year={2026}
+}
+```
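Note on the `patch_positions` hunk: this change drops the README comment that showed what the `[t, h, w]` rows look like. The layout can be reproduced with a small sketch, assuming row-major patch order within each frame and temporal indices supplied by the caller. The sketch uses plain Python lists in place of the `torch` tensors, omits the leading batch dimension that `unsqueeze(0)` adds, and the names `build_patch_positions` and `frame_indices` are hypothetical, not part of the model's API.

```python
def build_patch_positions(num_frames, grid_h, grid_w, frame_indices=None):
    """Return one [t, h, w] row per patch token.

    Tokens are ordered frame by frame and row-major inside a frame (w varies
    fastest). `frame_indices` is an assumed, caller-supplied list of temporal
    indices; it defaults to 0, 1, ..., num_frames - 1.
    """
    if frame_indices is None:
        frame_indices = list(range(num_frames))
    if len(frame_indices) != num_frames:
        raise ValueError("frame_indices must contain one entry per frame")

    positions = []
    for t in frame_indices:
        for h in range(grid_h):
            for w in range(grid_w):
                positions.append([t, h, w])
    return positions


# Two frames on a 16x16 patch grid (256 tokens per frame); the second frame
# carries temporal index 4, as in the comment removed from the README.
positions = build_patch_positions(num_frames=2, grid_h=16, grid_w=16, frame_indices=[0, 4])
print(len(positions))      # 512
print(positions[0:4])      # [[0, 0, 0], [0, 0, 1], [0, 0, 2], [0, 0, 3]]
print(positions[256:260])  # [[4, 0, 0], [4, 0, 1], [4, 0, 2], [4, 0, 3]]
```

The two printed slices match the rows listed in the removed comment. How the README derives `t_positions` is not visible in this hunk, so the temporal indices here are passed in explicitly rather than computed.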