---
license: apache-2.0
---
> # `onevision-encoder-large-tf57`
>
> **transformers 5.7+ idiomatic variant of [`lmms-lab-encoder/onevision-encoder-large`](https://huggingface.co/lmms-lab-encoder/onevision-encoder-large).**
> Weights are byte-identical to the upstream model (same `safetensors` SHA-256). Only `modeling_onevision_encoder.py` and `config.json` (`transformers_version` field) differ.
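>
> A minimal sketch of how the byte-identity claim could be checked locally, assuming both `model.safetensors` files have already been downloaded. The helper names `sha256_of_file` and `same_weights` are illustrative and not part of this repo:
>
> ```python
> import hashlib
> from pathlib import Path
>
>
> def sha256_of_file(path, chunk_size=1 << 20):
>     """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
>     digest = hashlib.sha256()
>     with Path(path).open("rb") as handle:
>         for chunk in iter(lambda: handle.read(chunk_size), b""):
>             digest.update(chunk)
>     return digest.hexdigest()
>
>
> def same_weights(path_a, path_b):
>     """True when the two files hash to the same SHA-256 digest."""
>     return sha256_of_file(path_a) == sha256_of_file(path_b)
> ```
>
> Calling `same_weights(...)` on the upstream and tf57 copies of `model.safetensors` (wherever they were downloaded) should return `True` if the claim holds.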
>
> ## Why this variant
>
> Upstream `modeling_onevision_encoder.py` is written against the `transformers 4.x` API surface and does not load correctly under `transformers >= 5.0`:
> 1. `_supports_flash_attn_2` was renamed to `_supports_flash_attn`.
> 2. The v5 fast-init / meta-tensor path skips re-initialization of `persistent=False` buffers, leaving `VideoRotaryEmbeddingSplit466.inv_freq_{t,h,w}` filled with uninitialized memory. RoPE then produces garbage and downstream attention diverges (max diff up to 700+ vs upstream).
> 3. `add_start_docstrings*` / `replace_return_docstrings` decorators are removed in v5.
> 4. Upstream's manual eager-only attention bypasses the v5 `ALL_ATTENTION_FUNCTIONS` interface; this variant dispatches through it across `eager`, `sdpa`, `flash_attention_2`, and `flex_attention`.
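>
> Item 2 can be illustrated with a toy module. This is a sketch under assumptions: `TinyRotary` and `reset_inv_freq` are hypothetical stand-ins, not the actual `VideoRotaryEmbeddingSplit466` code, and the NaN fill only simulates uninitialized memory. It shows that a `persistent=False` buffer is absent from `state_dict()`, so loading weights cannot restore it and it has to be recomputed:
>
> ```python
> import torch
> from torch import nn
>
>
> class TinyRotary(nn.Module):
>     """Toy rotary module with a non-persistent inverse-frequency buffer."""
>
>     def __init__(self, dim=8, base=10000.0):
>         super().__init__()
>         self.dim = dim
>         self.base = base
>         self.register_buffer("inv_freq", self._compute_inv_freq(), persistent=False)
>
>     def _compute_inv_freq(self):
>         # Standard RoPE inverse frequencies: base ** (-2i / dim).
>         exponents = torch.arange(0, self.dim, 2, dtype=torch.float32) / self.dim
>         return 1.0 / (self.base ** exponents)
>
>     def reset_inv_freq(self):
>         # Recompute from hyperparameters, since no checkpoint carries this buffer.
>         self.inv_freq = self._compute_inv_freq()
>
>
> module = TinyRotary()
> expected = module.inv_freq.clone()
> checkpoint = module.state_dict()  # empty: the buffer is non-persistent
>
> module.inv_freq = torch.full_like(expected, float("nan"))  # simulate garbage memory
> module.load_state_dict(checkpoint)  # loads nothing, so the buffer stays garbage
> still_garbage = bool(torch.isnan(module.inv_freq).all())
>
> module.reset_inv_freq()
> restored = torch.equal(module.inv_freq, expected)
> ```
>
> Because the buffer is a pure function of the module's hyperparameters, recomputing it from an initialization hook is enough to restore it.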
>
> ## v5-only notice
>
> This variant **requires `transformers >= 5.7.0`** and will not load under `transformers 4.x`. Use the upstream model dir for v4 environments.
>
> ## Diff vs upstream
>
> | File | Change |
> |---|---|
> | `model.safetensors` | unchanged (byte-identical) |
> | `config.json` | `transformers_version: 4.57.3` -> `5.7.0` |
> | `configuration_onevision_encoder.py` | unchanged |
> | `preprocessor_config.json` | unchanged |
> | `modeling_onevision_encoder.py` | full v5-idiom rewrite: `_supports_flash_attn`/`_supports_sdpa`/`_supports_flex_attn`/`_supports_attention_backend`, `ALL_ATTENTION_FUNCTIONS.get_interface(...)` dispatch, `@auto_docstring` + `@can_return_tuple`, removed v4 docstring decorators and `use_return_dict` branches, `_init_weights` hook calls `VideoRotaryEmbeddingSplit466.reset_inv_freqs()` to fix the inv_freq init bug. |
>
> ## Usage
>
> ```python
> from transformers import AutoModel
>
> model = AutoModel.from_pretrained(
>     "path/to/onevision-encoder-large-tf57",
>     trust_remote_code=True,
> )  # default attn_implementation = "flash_attention_2" (set in config.json)
> ```
>
> Override the default if you need a different backend:
>
> ```python
> model = AutoModel.from_pretrained(..., attn_implementation="sdpa")
> # supported: "flash_attention_2" (default), "sdpa", "eager", "flex_attention"
> ```
>
> **Dtype contract**: weights are saved in `bfloat16`. The default `flash_attention_2` backend requires `fp16`/`bf16` inputs. If you must use `fp32`, override with `attn_implementation="sdpa"` or `"eager"`.
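>
> One way to honour this contract in calling code is to pick the backend from the dtype before loading. The helper below is a hypothetical convenience, not something shipped with this repo, and it falls back to `sdpa` (one of the two fp32-safe options named above):
>
> ```python
> HALF_PRECISION = {"float16", "bfloat16"}
>
>
> def choose_attn_implementation(dtype_name):
>     """Map a dtype name to an attention backend that accepts it."""
>     if dtype_name in HALF_PRECISION:
>         return "flash_attention_2"  # default backend, needs fp16/bf16 inputs
>     if dtype_name == "float32":
>         return "sdpa"  # fp32-safe; "eager" would also work
>     raise ValueError(f"unsupported dtype: {dtype_name!r}")
> ```
>
> The returned string can be passed as `attn_implementation=` to `AutoModel.from_pretrained`.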
>
> Tested with `transformers==5.7.0`, `torch>=2.4`, `flash-attn>=2.7`.
>
> ## Equivalence verification
>
> Cross-version comparison (upstream under `transformers` 4.57.3 vs this variant under `transformers` 5.7.0) on 11 input shapes (single image / multi-frame video / batched / non-square / `visible_indices`):
>
> | dtype | attn | result |
> |---|---|---|
> | fp32 | eager | bit-identical (max_diff = 0.0 across all 22 tensors) |
> | bf16 | eager | bit-identical (max_diff = 0.0 across all 22 tensors) |
>
> Plus 7 v5-only scenario tests, all PASSED:
> 1. eager vs sdpa equivalence (max=7.5e-5)
> 2. save_pretrained then from_pretrained bit-identical round-trip
> 3. cpu vs cuda equivalence (max=4.1e-5)
> 4. fp32/bf16/fp16 dtype preservation
> 5. gradient flow (389/399 params receive non-zero grad)
> 6. runtime `_attn_implementation` switch
> 7. `from_pretrained` idempotency (two loads bit-identical)
>
> Plus real-input end-to-end tests on a real JPEG (1332x725) and a real MP4 (decord, 4 frames @ 512x512), preprocessed through `AutoImageProcessor` (CLIPImageProcessor):
>
> | path | result |
> |---|---|
> | image: PIL -> processor -> model fwd | finite, last_hidden_state=(1,1024,1024), pooler_output=(1,1024) |
> | video: decord -> 5D (1,3,4,448,448) -> model fwd | finite, last_hidden_state=(1,4096,1024), pooler_output=(1,1024) |
> | model-only equivalence on identical pixel_values (v4 vs v5) | **bit-identical (max_diff = 0.0 on image+video)** |
>
> Note: Raw `pixel_values` from `CLIPImageProcessor` differ by ~1e-2 between transformers 4.57.3 and 5.7.0 due to upstream resize/normalize changes in `transformers` itself (independent of this variant). When the same pixel_values are fed to both versions, this model is bit-identical.
>
> Reproduce with `tools/upgrade_v5/run_all.sh` from the OneVision-Encoder repo.
>
> ## Changelog
>
> - **tf57**: full v5-idiom rewrite; weights unchanged.
>
> ---
> The original model card from upstream follows.
### Model Card
| Property | Value |
|----------|-------|
| **Model Type** | Vision Transformer (ViT) |
| **Architecture** | HEVC-Style Vision Transformer |
| **Hidden Size** | 1024 |
| **Intermediate Size** | 4096 |
| **Number of Layers** | 24 |
| **Number of Attention Heads** | 16 |
| **Patch Size** | 14 |
| **Image Resolution** | 448×448 (pre-trained) |
| **Video Resolution** | 224×224 with 256 tokens per frame |
| **Positional Encoding** | 3D RoPE (4:6:6 split for T:H:W) |
| **Normalization** | Layer Normalization |
| **Activation Function** | GELU |
| **License** | Apache 2.0 |
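
The resolution and token figures are linked by the patch grid: a square input of side `R` cut into patches of side `P` yields `(R / P)²` tokens per frame. A minimal sketch of that arithmetic follows; the function names are illustrative, and a square, evenly divisible patch grid is assumed:

```python
import math


def tokens_per_frame(resolution, patch_size):
    """Number of patch tokens for a square frame of the given side length."""
    if resolution % patch_size != 0:
        raise ValueError("resolution must be divisible by patch_size")
    side = resolution // patch_size
    return side * side


def implied_patch_size(resolution, tokens):
    """Patch side length implied by a resolution and a per-frame token count."""
    side = math.isqrt(tokens)
    if side * side != tokens or resolution % side != 0:
        raise ValueError("tokens must form a square grid that divides the resolution")
    return resolution // side


print(implied_patch_size(224, 256))  # 14
print(tokens_per_frame(448, 14))     # 1024
```

With 224×224 frames at 256 tokens per frame, the implied patch side is 14, which in turn gives 1024 tokens for a 448×448 image.
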
### Key Features
- **Codec-Style Patch Selection**: Instead of sampling sparse frames densely (all patches from few frames), OneVision Encoder samples dense frames sparsely (important patches from many frames).
- **3D Rotary Position Embedding**: Uses a 4:6:6 split for temporal, height, and width dimensions to capture spatiotemporal relationships.
- **Native Resolution Support**: Supports native resolution input without tiling or cropping.
- **Flash Attention 2**: Efficient attention implementation for improved performance and memory efficiency.
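
To make the 4:6:6 split concrete, the sketch below divides the rotary frequency pairs of one attention head among the T, H, and W axes in that ratio. It assumes the split applies to the `head_dim // 2` frequency pairs and that `head_dim` is the hidden size divided by the number of heads (1024 / 16 = 64). How `VideoRotaryEmbeddingSplit466` allocates dimensions internally is not shown here, so treat the function as illustrative:

```python
def split_rotary_pairs(head_dim, ratio=(4, 6, 6)):
    """Split head_dim // 2 rotary frequency pairs across (T, H, W) by ratio."""
    pairs = head_dim // 2
    total = sum(ratio)
    if pairs % total != 0:
        raise ValueError("frequency pairs must be divisible by the ratio sum")
    unit = pairs // total
    return tuple(part * unit for part in ratio)


t_pairs, h_pairs, w_pairs = split_rotary_pairs(1024 // 16)
print(t_pairs, h_pairs, w_pairs)  # 8 12 12
```
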
### Intended Use
#### Primary Use Cases
- **Video Understanding**: Action recognition, video captioning, video question answering
- **Image Understanding**: Document understanding (DocVQA), chart understanding (ChartQA), OCR tasks
- **Vision-Language Models**: As the vision encoder backbone for multimodal large language models
#### Downstream Tasks
- Video benchmarks: MVBench, VideoMME, Perception Test
- Image understanding: DocVQA, ChartQA, OCRBench
- Action recognition: SSv2, UCF101, Kinetics
### Quick Start
> **Note:** This model supports native resolution input. For optimal performance:
> - **Image**: 448×448 resolution (pre-trained)
> - **Video**: 224×224 resolution with 256 tokens per frame (pre-trained)
```python
from transformers import AutoModel, AutoImageProcessor
from PIL import Image
import torch

# Load model and preprocessor
model = AutoModel.from_pretrained(
    "lmms-lab-encoder/onevision-encoder-large",
    trust_remote_code=True,
    attn_implementation="flash_attention_2"
).to("cuda").eval()

preprocessor = AutoImageProcessor.from_pretrained(
    "lmms-lab-encoder/onevision-encoder-large",
    trust_remote_code=True
)

# Image inference: [B, C, H, W]
image = Image.open("path/to/your/image.jpg")  # Replace with your image path
pixel_values = preprocessor(images=image, return_tensors="pt")["pixel_values"].to("cuda")
with torch.no_grad():
    outputs = model(pixel_values)
# outputs.last_hidden_state: [B, num_patches, hidden_size]
# outputs.pooler_output: [B, hidden_size]

# Video inference: [B, C, T, H, W] with visible_indices
num_frames, frame_tokens, target_frames = 16, 256, 64

# Load video frames and preprocess each frame (replace with your video frame paths)
frames = [Image.open(f"path/to/frame_{i}.jpg") for i in range(num_frames)]
video_pixel_values = preprocessor(images=frames, return_tensors="pt")["pixel_values"]

# Reshape from [T, C, H, W] to [B, C, T, H, W]
video = video_pixel_values.unsqueeze(0).permute(0, 2, 1, 3, 4).to("cuda")

# Build visible_indices for temporal sampling
frame_pos = torch.linspace(0, target_frames - 1, num_frames).long().cuda()
visible_indices = (frame_pos.unsqueeze(-1) * frame_tokens + torch.arange(frame_tokens).cuda()).reshape(1, -1)
# visible_indices example (with 256 tokens per frame):
#   Frame 0 (pos=0):   indices [0, 1, 2, ..., 255]
#   Frame 1 (pos=4):   indices [1024, 1025, 1026, ..., 1279]
#   Frame 2 (pos=8):   indices [2048, 2049, 2050, ..., 2303]
#   ...
#   Frame 15 (pos=63): indices [16128, 16129, ..., 16383]

with torch.no_grad():
    outputs = model(video, visible_indices=visible_indices)
```
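
The index arithmetic in the video example can be checked without a GPU. The pure-Python sketch below mirrors it, using integer floor division as a stand-in for `torch.linspace(...).long()`. That substitution is an assumption, verified here only against the frames listed in the comments above, and `build_visible_indices` is an illustrative name:

```python
def build_visible_indices(num_frames=16, frame_tokens=256, target_frames=64):
    """Token indices for num_frames frames spread evenly over target_frames slots."""
    if num_frames < 2:
        raise ValueError("need at least two frames to space them evenly")
    indices = []
    for i in range(num_frames):
        # Evenly spaced temporal position, truncated to an integer slot.
        pos = (i * (target_frames - 1)) // (num_frames - 1)
        start = pos * frame_tokens
        indices.extend(range(start, start + frame_tokens))
    return indices


indices = build_visible_indices()
print(len(indices))                # 4096
print(indices[256], indices[511])  # 1024 1279
print(indices[-1])                 # 16383
```

Each frame contributes one contiguous run of `frame_tokens` indices, offset by its temporal slot in the 64-frame grid.
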
### LMM Probe Results
The LMM probe is trained on a mixed dataset of 740K samples from LLaVA-OneVision and 800K samples from LLaVA-Video SFT, and the training pipeline proceeds directly to Stage 2 fine-tuning. We adopt a streamlined native-resolution strategy inspired by LLaVA-OneVision: when the input frame resolution matches the model's native input size, the frame is fed directly, without tiling or cropping, to evaluate the ViT's native-resolution capability.
<p align="center">
<picture>
<source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/probe_lmm_github_dark_fixed.png">
<source media="(prefers-color-scheme: light)" srcset="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/probe_lmm_github_light.png">
<img alt="LMM Probe Results" src="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/probe_lmm_github_light.png" width="800" style="max-width: 100%;">
</picture>
</p>

### Attentive Probe Results
Performance comparison of different vision encoders under Attentive Probe evaluation. Models are evaluated with single-clip input and trained for 10 epochs on each of 8 action recognition datasets. Results show average performance and per-dataset scores for the 8-frame and 16-frame configurations.
<p align="center">
<picture>
<source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/fix_00_probe_video_github_dark.png">
<source media="(prefers-color-scheme: light)" srcset="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/fix_00_probe_video_github_light.png">
<img alt="Attentive Probe Results" src="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/fix_00_probe_video_github_light.png" width="900" style="max-width: 100%;">
</picture>
</p>

### Codec Input
> **TODO:** Add codec-style input documentation for temporal saliency-based patch selection.
---