| --- |
| license: apache-2.0 |
| --- |
| |
| > # `onevision-encoder-large-tf57` |
| > |
| > **transformers 5.7+ idiomatic variant of [`lmms-lab-encoder/onevision-encoder-large`](https://huggingface.co/lmms-lab-encoder/onevision-encoder-large).** |
| > Weights are byte-identical to the upstream model (same `safetensors` SHA-256). Only `modeling_onevision_encoder.py` and `config.json` (`transformers_version` field) differ. |
| > |
| > ## Why this variant |
| > |
| > Upstream `modeling_onevision_encoder.py` is written against the `transformers 4.x` API surface and does not load correctly under `transformers >= 5.0`: |
| > 1. `_supports_flash_attn_2` was renamed to `_supports_flash_attn`. |
| > 2. The v5 fast-init / meta-tensor path skips re-initialization of `persistent=False` buffers, leaving `VideoRotaryEmbeddingSplit466.inv_freq_{t,h,w}` filled with uninitialized memory. RoPE then produces garbage and downstream attention diverges (max diff up to 700+ vs upstream). |
| > 3. `add_start_docstrings*` / `replace_return_docstrings` decorators are removed in v5. |
| > 4. Manual eager-only attention is replaced by the v5 `ALL_ATTENTION_FUNCTIONS` interface dispatching across `eager`, `sdpa`, `flash_attention_2`, `flex_attention`. |
| > |
| > ## v5-only notice |
| > |
| > This variant **requires `transformers >= 5.7.0`** and will not load under `transformers 4.x`. Use the upstream model dir for v4 environments. |
| > |
| > ## Diff vs upstream |
| > |
| > | File | Change | |
| > |---|---| |
| > | `model.safetensors` | unchanged (byte-identical) | |
| > | `config.json` | `transformers_version: 4.57.3` -> `5.7.0` | |
| > | `configuration_onevision_encoder.py` | unchanged | |
| > | `preprocessor_config.json` | unchanged | |
| > | `modeling_onevision_encoder.py` | full v5-idiom rewrite: `_supports_flash_attn`/`_supports_sdpa`/`_supports_flex_attn`/`_supports_attention_backend`, `ALL_ATTENTION_FUNCTIONS.get_interface(...)` dispatch, `@auto_docstring` + `@can_return_tuple`, removed v4 docstring decorators and `use_return_dict` branches, `_init_weights` hook calls `VideoRotaryEmbeddingSplit466.reset_inv_freqs()` to fix the inv_freq init bug. | |
| > |
| > ## Usage |
| > |
| > ```python |
| > from transformers import AutoModel |
| > |
| > model = AutoModel.from_pretrained( |
| > "path/to/onevision-encoder-large-tf57", |
| > trust_remote_code=True, |
| > ) # default attn_implementation = "flash_attention_2" (set in config.json) |
| > ``` |
| > |
| > Override the default if you need a different backend: |
| > |
| > ```python |
| > model = AutoModel.from_pretrained(..., attn_implementation="sdpa") |
| > # supported: "flash_attention_2" (default), "sdpa", "eager", "flex_attention" |
| > ``` |
| > |
| > **Dtype contract**: weights are saved in `bfloat16`. The default `flash_attention_2` backend requires `fp16`/`bf16` inputs. If you must use `fp32`, override with `attn_implementation="sdpa"` or `"eager"`. |
| > |
| > Tested with `transformers==5.7.0`, `torch>=2.4`, `flash-attn>=2.7`. |
| > |
| > ## Equivalence verification |
| > |
| > Cross-version (upstream tf 4.57.3 vs this tf 5.7.0) on 11 input shapes (single image / multi-frame video / batched / non-square / `visible_indices`): |
| > |
| > | dtype | attn | result | |
| > |---|---|---| |
| > | fp32 | eager | bit-identical (max_diff = 0.0 across all 22 tensors) | |
| > | bf16 | eager | bit-identical (max_diff = 0.0 across all 22 tensors) | |
| > |
| > Plus 7 v5-only scenario tests, all PASSED: |
| > 1. eager vs sdpa equivalence (max=7.5e-5) |
| > 2. save_pretrained then from_pretrained bit-identical round-trip |
| > 3. cpu vs cuda equivalence (max=4.1e-5) |
| > 4. fp32/bf16/fp16 dtype preservation |
| > 5. gradient flow (389/399 params receive non-zero grad) |
| > 6. runtime `_attn_implementation` switch |
| > 7. `from_pretrained` idempotency (two loads bit-identical) |
| > |
| > Plus real-input end-to-end tests on a real JPEG (1332x725) and a real MP4 (decord, 4 frames @ 512x512), preprocessed through `AutoImageProcessor` (CLIPImageProcessor): |
| > |
| > | path | result | |
| > |---|---| |
| > | image: PIL -> processor -> model fwd | finite, lhs=(1,1024,1024), pool=(1,1024) | |
| > | video: decord -> 5D (1,3,4,448,448) -> model fwd | finite, lhs=(1,4096,1024), pool=(1,1024) | |
| > | model-only equivalence on identical pixel_values (v4 vs v5) | **bit-identical (max_diff = 0.0 on image+video)** | |
| > |
| > Note: Raw `pixel_values` from `CLIPImageProcessor` differ by ~1e-2 between transformers 4.57.3 and 5.7.0 due to upstream resize/normalize changes in `transformers` itself (independent of this variant). When the same pixel_values are fed to both versions, this model is bit-identical. |
| > |
| > Reproduce with `tools/upgrade_v5/run_all.sh` from the OneVision-Encoder repo. |
| > |
| > ## Changelog |
| > |
| > - **tf57**: full v5-idiom rewrite; weights unchanged. |
| > |
| > --- |
| > The original model card from upstream follows. |
| |
| ### Model Card |
| |
| | Property | Value | |
| |----------|-------| |
| | **Model Type** | Vision Transformer (ViT) | |
| | **Architecture** | HEVC-Style Vision Transformer | |
| | **Hidden Size** | 1024 | |
| | **Intermediate Size** | 4096 | |
| | **Number of Layers** | 24 | |
| | **Number of Attention Heads** | 16 | |
| | **Patch Size** | 16 | |
| | **Image Resolution** | 448×448 (pre-trained) | |
| | **Video Resolution** | 224×224 with 256 tokens per frame | |
| | **Positional Encoding** | 3D RoPE (4:6:6 split for T:H:W) | |
| | **Normalization** | Layer Normalization | |
| | **Activation Function** | GELU | |
| | **License** | Apache 2.0 | |
| |
| ### Key Features |
| |
| - **Codec-Style Patch Selection**: Instead of sampling sparse frames densely (all patches from few frames), OneVision Encoder samples dense frames sparsely (important patches from many frames). |
| - **3D Rotary Position Embedding**: Uses a 4:6:6 split for temporal, height, and width dimensions to capture spatiotemporal relationships. |
| - **Native Resolution Support**: Supports native resolution input without tiling or cropping. |
| - **Flash Attention 2**: Efficient attention implementation for improved performance and memory efficiency. |
| |
| ### Intended Use |
| |
| #### Primary Use Cases |
| |
| - **Video Understanding**: Action recognition, video captioning, video question answering |
| - **Image Understanding**: Document understanding (DocVQA), chart understanding (ChartQA), OCR tasks |
| - **Vision-Language Models**: As the vision encoder backbone for multimodal large language models |
| |
| #### Downstream Tasks |
| |
| - Video benchmarks: MVBench, VideoMME, Perception Test |
| - Image understanding: DocVQA, ChartQA, OCRBench |
| - Action recognition: SSv2, UCF101, Kinetics |
| |
| |
| ### Quick Start |
| |
| > **Note:** This model supports native resolution input. For optimal performance: |
| > - **Image**: 448×448 resolution (pre-trained) |
| > - **Video**: 224×224 resolution with 256 tokens per frame (pre-trained) |
| |
| |
| ```python |
| from transformers import AutoModel, AutoImageProcessor |
| from PIL import Image |
| import torch |
| |
| # Load model and preprocessor |
| model = AutoModel.from_pretrained( |
| "lmms-lab-encoder/onevision-encoder-large", |
| trust_remote_code=True, |
| attn_implementation="flash_attention_2" |
| ).to("cuda").eval() |
| |
| preprocessor = AutoImageProcessor.from_pretrained( |
| "lmms-lab-encoder/onevision-encoder-large", |
| trust_remote_code=True |
| ) |
| |
| # Image inference: [B, C, H, W] |
| image = Image.open("path/to/your/image.jpg") # Replace with your image path |
| pixel_values = preprocessor(images=image, return_tensors="pt")["pixel_values"].to("cuda") |
| with torch.no_grad(): |
| outputs = model(pixel_values) |
| # outputs.last_hidden_state: [B, num_patches, hidden_size] |
| # outputs.pooler_output: [B, hidden_size] |
| |
| # Video inference: [B, C, T, H, W] with visible_indices |
| num_frames, frame_tokens, target_frames = 16, 256, 64 |
| # Load video frames and preprocess each frame (replace with your video frame paths) |
| frames = [Image.open(f"path/to/frame_{i}.jpg") for i in range(num_frames)] |
| video_pixel_values = preprocessor(images=frames, return_tensors="pt")["pixel_values"] |
| # Reshape from [T, C, H, W] to [B, C, T, H, W] |
| video = video_pixel_values.unsqueeze(0).permute(0, 2, 1, 3, 4).to("cuda") |
| |
| # Build visible_indices for temporal sampling |
| frame_pos = torch.linspace(0, target_frames - 1, num_frames).long().cuda() |
| visible_indices = (frame_pos.unsqueeze(-1) * frame_tokens + torch.arange(frame_tokens).cuda()).reshape(1, -1) |
| # visible_indices example (with 256 tokens per frame): |
| # Frame 0 (pos=0): indices [0, 1, 2, ..., 255] |
| # Frame 1 (pos=4): indices [1024, 1025, 1026, ..., 1279] |
| # Frame 2 (pos=8): indices [2048, 2049, 2050, ..., 2303] |
| # ... |
| # Frame 15 (pos=63): indices [16128, 16129, ..., 16383] |
| |
| with torch.no_grad(): |
| outputs = model(video, visible_indices=visible_indices) |
| ``` |
| |
|
|
| ### LMM Probe Results |
|
|
| Training on a mixed dataset of 740K samples from LLaVA-OneVision and 800K samples from LLaVA-Video SFT. The training pipeline proceeds directly to Stage 2 fine-tuning. We adopt a streamlined native-resolution strategy inspired by LLaVA-OneVision: when the input frame resolution matches the model's native input size, it is fed directly—without tiling or cropping—to evaluate the ViT's native resolution capability. |
|
|
| <p align="center"> |
| <picture> |
| <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/probe_lmm_github_dark_fixed.png"> |
| <source media="(prefers-color-scheme: light)" srcset="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/probe_lmm_github_light.png"> |
| <img alt="LMM Probe Results" src="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/probe_lmm_github_light.png" width="800" style="max-width: 100%;"> |
| </picture> |
| </p> |
| |
| ### Attentive Probe Results |
|
|
| Performance comparison of different vision encoders using Attentive Probe evaluation. Models are evaluated using single clip input and trained for 10 epochs across 8 action recognition datasets. Results show average performance and per-dataset scores for 8-frame and 16-frame configurations. |
|
|
| <p align="center"> |
| <picture> |
| <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/fix_00_probe_video_github_dark.png"> |
| <source media="(prefers-color-scheme: light)" srcset="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/fix_00_probe_video_github_light.png"> |
| <img alt="LMM Probe Results" src="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/probe_lmm_github_light.png" width="900" style="max-width: 100%;"> |
| </picture> |
| </p> |
| |
|
|
| ### Codec Input |
|
|
| > **TODO:** Add codec-style input documentation for temporal saliency-based patch selection. |
|
|
| --- |
|
|