---
license: apache-2.0
---

> # `onevision-encoder-large-tf57`
>
> **transformers 5.7+ idiomatic variant of [`lmms-lab-encoder/onevision-encoder-large`](https://huggingface.co/lmms-lab-encoder/onevision-encoder-large).**
> Weights are byte-identical to the upstream model (same `safetensors` SHA-256). Only `modeling_onevision_encoder.py` and `config.json` (`transformers_version` field) differ.
>
> ## Why this variant
>
> Upstream `modeling_onevision_encoder.py` is written against the `transformers 4.x` API surface and does not load correctly under `transformers >= 5.0`. This variant addresses the following:
> 1. `_supports_flash_attn_2` was renamed to `_supports_flash_attn`.
> 2. The v5 fast-init / meta-tensor path skips re-initialization of `persistent=False` buffers, leaving `VideoRotaryEmbeddingSplit466.inv_freq_{t,h,w}` filled with uninitialized memory. RoPE then produces garbage and downstream attention diverges (max diff exceeding 700 vs upstream).
> 3. The `add_start_docstrings*` / `replace_return_docstrings` decorators were removed in v5.
> 4. The manual eager-only attention is replaced by the v5 `ALL_ATTENTION_FUNCTIONS` interface, which dispatches across `eager`, `sdpa`, `flash_attention_2`, and `flex_attention`.
>
> ## v5-only notice
>
> This variant **requires `transformers >= 5.7.0`** and will not load under `transformers 4.x`. Use the upstream model dir for v4 environments.
>
> ## Diff vs upstream
>
> | File | Change |
> |---|---|
> | `model.safetensors` | unchanged (byte-identical) |
> | `config.json` | `transformers_version: 4.57.3` -> `5.7.0` |
> | `configuration_onevision_encoder.py` | unchanged |
> | `preprocessor_config.json` | unchanged |
> | `modeling_onevision_encoder.py` | full v5-idiom rewrite: `_supports_flash_attn`/`_supports_sdpa`/`_supports_flex_attn`/`_supports_attention_backend`, `ALL_ATTENTION_FUNCTIONS.get_interface(...)` dispatch, `@auto_docstring` + `@can_return_tuple`, removed v4 docstring decorators and `use_return_dict` branches, `_init_weights` hook calls `VideoRotaryEmbeddingSplit466.reset_inv_freqs()` to fix the inv_freq init bug. |
>
> ## Usage
>
> ```python
> from transformers import AutoModel
>
> model = AutoModel.from_pretrained(
>     "path/to/onevision-encoder-large-tf57",
>     trust_remote_code=True,
> )  # default attn_implementation = "flash_attention_2" (set in config.json)
> ```
>
> Override the default if you need a different backend:
>
> ```python
> model = AutoModel.from_pretrained(..., attn_implementation="sdpa")
> # supported: "flash_attention_2" (default), "sdpa", "eager", "flex_attention"
> ```
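
Which backend to pass depends on the dtype the model runs in: per this card's dtype contract, `flash_attention_2` needs `fp16`/`bf16` inputs, while `sdpa` and `eager` also accept `fp32`. The following is a minimal sketch of that choice. `pick_attn_implementation` and the two constant sets are hypothetical names for illustration and are not part of `transformers`.

```python
# Backends that only run with half-precision inputs (per this card's dtype contract).
HALF_PRECISION_ONLY = {"flash_attention_2"}
HALF_DTYPES = {"float16", "bfloat16"}


def pick_attn_implementation(dtype_name: str, preferred: str = "flash_attention_2") -> str:
    """Return `preferred` if it can run with `dtype_name`, else fall back to "sdpa"."""
    if preferred in HALF_PRECISION_ONLY and dtype_name not in HALF_DTYPES:
        return "sdpa"
    return preferred


print(pick_attn_implementation("bfloat16"))  # flash_attention_2
print(pick_attn_implementation("float32"))   # sdpa
```

The returned string can then be passed as `attn_implementation=` to `AutoModel.from_pretrained`.
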
>
> **Dtype contract**: weights are saved in `bfloat16`. The default `flash_attention_2` backend requires `fp16`/`bf16` inputs. If you must use `fp32`, override with `attn_implementation="sdpa"` or `"eager"`.
>
> Tested with `transformers==5.7.0`, `torch>=2.4`, `flash-attn>=2.7`.
>
> ## Equivalence verification
>
> Cross-version comparison (upstream on transformers 4.57.3 vs this variant on transformers 5.7.0) across 11 input shapes (single image / multi-frame video / batched / non-square / `visible_indices`):
>
> | dtype | attn | result |
> |---|---|---|
> | fp32 | eager | bit-identical (max_diff = 0.0 across all 22 tensors) |
> | bf16 | eager | bit-identical (max_diff = 0.0 across all 22 tensors) |
>
> Plus 7 v5-only scenario tests, all PASSED:
> 1. eager vs sdpa equivalence (max=7.5e-5)
> 2. save_pretrained then from_pretrained bit-identical round-trip
> 3. cpu vs cuda equivalence (max=4.1e-5)
> 4. fp32/bf16/fp16 dtype preservation
> 5. gradient flow (389/399 params receive non-zero grad)
> 6. runtime `_attn_implementation` switch
> 7. `from_pretrained` idempotency (two loads bit-identical)
>
> Plus real-input end-to-end tests on a real JPEG (1332x725) and a real MP4 (decord, 4 frames @ 512x512), preprocessed through `AutoImageProcessor` (CLIPImageProcessor):
>
> | path | result |
> |---|---|
> | image: PIL -> processor -> model fwd | finite, lhs=(1,1024,1024), pool=(1,1024) |
> | video: decord -> 5D (1,3,4,448,448) -> model fwd | finite, lhs=(1,4096,1024), pool=(1,1024) |
> | model-only equivalence on identical pixel_values (v4 vs v5) | **bit-identical (max_diff = 0.0 on image+video)** |
>
> Note: Raw `pixel_values` from `CLIPImageProcessor` differ by ~1e-2 between transformers 4.57.3 and 5.7.0 due to upstream resize/normalize changes in `transformers` itself (independent of this variant). When the same pixel_values are fed to both versions, this model is bit-identical.
>
> Reproduce with `tools/upgrade_v5/run_all.sh` from the OneVision-Encoder repo.
>
> ## Changelog
>
> - **tf57**: full v5-idiom rewrite; weights unchanged.
>
> ---
> The original model card from upstream follows.

### Model Card

| Property | Value |
|----------|-------|
| **Model Type** | Vision Transformer (ViT) |
| **Architecture** | HEVC-Style Vision Transformer |
| **Hidden Size** | 1024 |
| **Intermediate Size** | 4096 |
| **Number of Layers** | 24 |
| **Number of Attention Heads** | 16 |
| **Patch Size** | 16 |
| **Image Resolution** | 448×448 (pre-trained) |
| **Video Resolution** | 224×224 with 256 tokens per frame |
| **Positional Encoding** | 3D RoPE (4:6:6 split for T:H:W) |
| **Normalization** | Layer Normalization |
| **Activation Function** | GELU |
| **License** | Apache 2.0 |

### Key Features

- **Codec-Style Patch Selection**: Instead of sampling sparse frames densely (all patches from few frames), OneVision Encoder samples dense frames sparsely (important patches from many frames).
- **3D Rotary Position Embedding**: Uses a 4:6:6 split for temporal, height, and width dimensions to capture spatiotemporal relationships.
- **Native Resolution Support**: Supports native resolution input without tiling or cropping.
- **Flash Attention 2**: Efficient attention implementation for improved performance and memory efficiency.
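
As a rough illustration of the 4:6:6 split, the sketch below divides the rotary channels of one attention head among the temporal, height, and width axes. It assumes the ratio is applied to the per-head dimension (hidden size 1024 / 16 heads = 64, taken from the model card table) and that each axis receives an even number of channels so they can form rotation pairs. `split_rope_dims` is a hypothetical helper, and the actual partitioning in `modeling_onevision_encoder.py` may differ.

```python
def split_rope_dims(head_dim: int, ratio=(4, 6, 6)) -> tuple[int, int, int]:
    """Split head_dim across (T, H, W) in proportion to ratio.

    Illustrative only: assumes head_dim divides evenly by sum(ratio)
    and that each share is even (rotary embeddings rotate channel pairs).
    """
    total = sum(ratio)
    if head_dim % total != 0:
        raise ValueError(f"head_dim={head_dim} is not divisible by {total}")
    unit = head_dim // total
    dims = tuple(r * unit for r in ratio)
    if any(d % 2 for d in dims):
        raise ValueError(f"each axis needs an even channel count, got {dims}")
    return dims


hidden_size, num_heads = 1024, 16      # values from the model card table
head_dim = hidden_size // num_heads    # 64
t_dim, h_dim, w_dim = split_rope_dims(head_dim)
print(t_dim, h_dim, w_dim)             # 16 24 24
```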

### Intended Use

#### Primary Use Cases

- **Video Understanding**: Action recognition, video captioning, video question answering
- **Image Understanding**: Document understanding (DocVQA), chart understanding (ChartQA), OCR tasks
- **Vision-Language Models**: As the vision encoder backbone for multimodal large language models

#### Downstream Tasks

- Video benchmarks: MVBench, VideoMME, Perception Test
- Image understanding: DocVQA, ChartQA, OCRBench
- Action recognition: SSv2, UCF101, Kinetics


### Quick Start

> **Note:** This model supports native resolution input. For optimal performance:
> - **Image**: 448×448 resolution (pre-trained)
> - **Video**: 224×224 resolution with 256 tokens per frame (pre-trained)


```python
from transformers import AutoModel, AutoImageProcessor
from PIL import Image
import torch

# Load model and preprocessor
model = AutoModel.from_pretrained(
    "lmms-lab-encoder/onevision-encoder-large",
    trust_remote_code=True,
    attn_implementation="flash_attention_2"
).to("cuda").eval()

preprocessor = AutoImageProcessor.from_pretrained(
    "lmms-lab-encoder/onevision-encoder-large",
    trust_remote_code=True
)

# Image inference: [B, C, H, W]
image = Image.open("path/to/your/image.jpg")  # Replace with your image path
pixel_values = preprocessor(images=image, return_tensors="pt")["pixel_values"].to("cuda")
with torch.no_grad():
    outputs = model(pixel_values)
    # outputs.last_hidden_state: [B, num_patches, hidden_size]
    # outputs.pooler_output: [B, hidden_size]

# Video inference: [B, C, T, H, W] with visible_indices
num_frames, frame_tokens, target_frames = 16, 256, 64
# Load video frames and preprocess each frame (replace with your video frame paths)
frames = [Image.open(f"path/to/frame_{i}.jpg") for i in range(num_frames)]
video_pixel_values = preprocessor(images=frames, return_tensors="pt")["pixel_values"]
# Reshape from [T, C, H, W] to [B, C, T, H, W]
video = video_pixel_values.unsqueeze(0).permute(0, 2, 1, 3, 4).to("cuda")

# Build visible_indices for temporal sampling
frame_pos = torch.linspace(0, target_frames - 1, num_frames).long().cuda()
visible_indices = (frame_pos.unsqueeze(-1) * frame_tokens + torch.arange(frame_tokens).cuda()).reshape(1, -1)
# visible_indices example (with 256 tokens per frame):
#   Frame 0 (pos=0):  indices [0, 1, 2, ..., 255]
#   Frame 1 (pos=4):  indices [1024, 1025, 1026, ..., 1279]
#   Frame 2 (pos=8):  indices [2048, 2049, 2050, ..., 2303]
#   ...
#   Frame 15 (pos=63): indices [16128, 16129, ..., 16383]

with torch.no_grad():
    outputs = model(video, visible_indices=visible_indices)
```
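
The index arithmetic behind `visible_indices` can be checked without a GPU. The following is a minimal pure-Python sketch of the same layout, assuming each of the `target_frames` temporal slots owns a contiguous range of `frame_tokens` indices. `build_visible_indices` is a hypothetical helper. It uses integer division in place of `torch.linspace(...).long()`, so positions could differ from the tensor version where floating-point rounding falls on a boundary.

```python
def build_visible_indices(num_frames: int, frame_tokens: int, target_frames: int) -> list[int]:
    """Flat token indices for num_frames frames spread over target_frames slots."""
    if num_frames < 1 or num_frames > target_frames:
        raise ValueError("need 1 <= num_frames <= target_frames")
    if num_frames == 1:
        positions = [0]
    else:
        # Evenly spaced temporal positions from 0 to target_frames - 1, truncated
        # toward zero (integer analogue of linspace(...).long()).
        positions = [i * (target_frames - 1) // (num_frames - 1) for i in range(num_frames)]
    # Slot p owns the index range [p * frame_tokens, (p + 1) * frame_tokens).
    return [p * frame_tokens + k for p in positions for k in range(frame_tokens)]


indices = build_visible_indices(num_frames=16, frame_tokens=256, target_frames=64)
print(len(indices))   # 4096
print(indices[256])   # 1024  (first token of frame 1, position 4)
print(indices[-1])    # 16383 (last token of frame 15, position 63)
```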


### LMM Probe Results

Training uses a mixed dataset of 740K samples from LLaVA-OneVision and 800K samples from LLaVA-Video SFT. The training pipeline proceeds directly to Stage 2 fine-tuning. We adopt a streamlined native-resolution strategy inspired by LLaVA-OneVision: when the input frame resolution matches the model's native input size, the frame is fed directly, without tiling or cropping, to evaluate the ViT's native-resolution capability.

<p align="center">
  <picture>
    <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/probe_lmm_github_dark_fixed.png">
    <source media="(prefers-color-scheme: light)" srcset="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/probe_lmm_github_light.png">
    <img alt="LMM Probe Results" src="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/probe_lmm_github_light.png" width="800" style="max-width: 100%;">
  </picture>
</p>

### Attentive Probe Results

The figure below compares vision encoders under Attentive Probe evaluation. Models are evaluated with a single clip as input and trained for 10 epochs across 8 action recognition datasets. Results show average performance and per-dataset scores for 8-frame and 16-frame configurations.

<p align="center">
  <picture>
    <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/fix_00_probe_video_github_dark.png">
    <source media="(prefers-color-scheme: light)" srcset="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/fix_00_probe_video_github_light.png">
    <img alt="Attentive Probe Results" src="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/fix_00_probe_video_github_light.png" width="900" style="max-width: 100%;">
  </picture>
</p>


### Codec Input

> **TODO:** Add codec-style input documentation for temporal saliency-based patch selection.

---