> # `onevision-encoder-large-lang-tf57`
>
> **transformers 5.7+ idiomatic variant of [`lmms-lab-encoder/onevision-encoder-large-lang`](https://huggingface.co/lmms-lab-encoder/onevision-encoder-large-lang).**
> Weights are byte-identical to the upstream model (same `safetensors` SHA-256). Only `modeling_onevision_encoder.py` and `config.json` (`transformers_version` field) differ.
>
> ## Why this variant
>
> Upstream `modeling_onevision_encoder.py` is written against the `transformers 4.x` API surface and does not load correctly under `transformers >= 5.0`:
> 1. `_supports_flash_attn_2` was renamed to `_supports_flash_attn`.
> 2. The v5 fast-init / meta-tensor path skips re-initialization of `persistent=False` buffers, leaving `VideoRotaryEmbeddingSplit466.inv_freq_{t,h,w}` filled with uninitialized memory. RoPE then produces garbage and downstream attention diverges (max diff up to 700+ vs upstream).
> 3. `add_start_docstrings*` / `replace_return_docstrings` decorators are removed in v5.
> 4. Manual eager-only attention is replaced by the v5 `ALL_ATTENTION_FUNCTIONS` interface dispatching across `eager`, `sdpa`, `flash_attention_2`, `flex_attention`.
>
> ## v5-only notice
>
> This variant **requires `transformers >= 5.7.0`** and will not load under `transformers 4.x`. Use the upstream model dir for v4 environments.
>
> ## Lang-specific surface (vs `large`)
>
> Same backbone, but the `lang` variant exposes the encoder for language-aligned multimodal callers:
> - `forward(..., patch_positions=...)` accepts an explicit `(batch_size, seq_len, 3)` `[t, h, w]` per-patch position tensor (mutually exclusive with the default grid path).
> - `VideoRotaryEmbeddingSplit466.forward_from_positions(patch_positions)` computes RoPE frequencies directly from those positions.
> - `apply_rotary_pos_emb` casts `cos/sin` to `q.dtype` immediately rather than computing in fp32, to preserve FlashAttention's dtype contract under bf16/fp16.
> - No pooling head: `pooler_output` is `None`.
>
> ## Diff vs upstream
>
> | File | Change |
> |---|---|
> | `model.safetensors` | unchanged (byte-identical) |
> | `config.json` | `transformers_version: 4.57.3` -> `5.7.0` |
> | `configuration_onevision_encoder.py` | unchanged |
> | `preprocessor_config.json` | unchanged |
> | `modeling_onevision_encoder.py` | full v5-idiom rewrite (same as `large`) plus the lang-specific surface above. |
>
> ## Usage
>
> ```python
> from transformers import AutoModel
>
> model = AutoModel.from_pretrained(
>     "path/to/onevision-encoder-large-lang-tf57",
>     trust_remote_code=True,
> )  # default attn_implementation = "flash_attention_2" (set in config.json)
>
> # default grid path
> out = model(pixel_values=images)
> # explicit per-patch positions (lang-only)
> out = model(pixel_values=images, patch_positions=patch_positions)
> ```
>
> Override the default if you need a different backend:
>
> ```python
> model = AutoModel.from_pretrained(..., attn_implementation="sdpa")
> # supported: "flash_attention_2" (default), "sdpa", "eager", "flex_attention"
> ```
>
> **Dtype contract**: weights are saved in `bfloat16`. The default `flash_attention_2` backend requires `fp16`/`bf16` inputs. If you must use `fp32`, override with `attn_implementation="sdpa"` or `"eager"`.
>
> **Numerical note (lang variant)**: Unlike the `large` variant, attention backends are NOT numerically equivalent in `bf16` for this model: `eager` and `flash_attention_2`/`sdpa` differ in `max_diff` by up to several hundred in absolute value (mean diff < 0.1, std preserved). This is because the lang variant intentionally keeps RoPE `cos`/`sin` in `q.dtype` (bf16) instead of upcasting to `fp32` like the `large` variant. The model still trains/serves correctly on any backend, but if you need strict numerical reproducibility against the upstream model, use `attn_implementation="eager"` in `bf16`, or a backend that accepts `fp32` inputs (for example `sdpa` or `eager`) in `fp32`.
>
> Tested with `transformers==5.7.0`, `torch>=2.4`, `flash-attn>=2.7`.
>
> ## Equivalence verification
>
> Cross-version (upstream tf 4.57.3 vs this tf 5.7.0) on 11 input shapes (single image / multi-frame video / batched / non-square / `patch_positions`):
>
> | dtype | attn | result |
> |---|---|---|
> | fp32 | eager | bit-identical (max_diff = 0.0 across all `__lhs` tensors; `__pool` is `None` for both) |
> | bf16 | eager | bit-identical (max_diff = 0.0 across all `__lhs` tensors; `__pool` is `None` for both) |
>
> Plus 7 v5-only scenario tests, all PASSED:
> 1. eager vs sdpa equivalence (max=9.2e-3)
> 2. save_pretrained then from_pretrained bit-identical round-trip
> 3. cpu vs cuda equivalence (max=3.5e-3)
> 4. fp32/bf16/fp16 dtype preservation
> 5. gradient flow
> 6. runtime `_attn_implementation` switch
> 7. `from_pretrained` idempotency (two loads bit-identical)
>
> Plus real-input end-to-end tests on a real JPEG (1332x725) and a real MP4 (decord, 4 frames @ 512x512), preprocessed through `AutoImageProcessor` (CLIPImageProcessor):
>
> | path | result |
> |---|---|
> | image: PIL -> processor -> model fwd | finite, lhs=(1,1024,1024) |
> | video: decord -> 5D (1,3,4,448,448) -> model fwd | finite, lhs=(1,4096,1024) |
> | model-only equivalence on identical pixel_values (v4 vs v5) | **bit-identical (max_diff = 0.0 on image+video)** |
>
> Note: Raw `pixel_values` from `CLIPImageProcessor` differ by ~1e-2 between transformers 4.57.3 and 5.7.0 due to upstream resize/normalize changes in `transformers` itself (independent of this variant). When the same pixel_values are fed to both versions, this model is bit-identical.
>
> Reproduce with `tools/upgrade_v5/run_all.sh` from the OneVision-Encoder repo.
>
> ## Changelog
>
> - **tf57**: full v5-idiom rewrite; weights unchanged.
>
> ---
> The original model card from upstream follows.

# OneVision-Encoder

### Key Features

- **LLM-Aligned Architecture**: Unlike standard vision backbones, this model is specifically optimized for **Large Multimodal Models (LMMs)**, ensuring seamless feature alignment and superior performance when connected to language models.
- **True Native Resolution**: Supports dynamic, **fully native resolution** inputs directly. It processes images and videos in their original aspect ratios without the need for tiling, cropping, padding, or resizing hacks.
- **Arbitrary Frame Support**: Capable of processing video inputs with **any number of frames** (variable length). It breaks the constraint of fixed-frame inputs, allowing for flexible long-context video understanding limited only by memory.
- **Codec-Style Input Processing**: Implements a "OneVision" mechanism that treats video like a codec stream, **sampling dense frames sparsely** (selecting important patches from many frames) rather than the traditional approach of sampling sparse frames densely.
- **3D Rotary Position Embedding**: Uses a 4:6:6 split for temporal, height, and width dimensions to capture complex spatiotemporal relationships across arbitrary sequence lengths.

#### Downstream Tasks

- **Video benchmarks**: MVBench, VideoMME, Perception Test
- **Image understanding**: DocVQA, ChartQA, OCRBench
- **Action recognition**: SSv2, UCF101, Kinetics

### Quick Start

> [!IMPORTANT]
> **Transformers Version Compatibility:**
>
> - ✅ **`transformers==4.57.3`** (Recommended): Works with `AutoModel.from_pretrained()`
> - ⚠️ **`transformers>=5.0.0`**: Not currently supported. We are actively working on a fix.

> **Note on Inputs:**
> While the model is pre-trained with the configurations below, it supports **dynamic native resolution** and **arbitrary frame counts** during inference:
>
> - **Pre-training Image Base**: 448×448
> - **Pre-training Video Base**: 224×224 (256 tokens/frame)
> - **Inference**: Supports variable resolutions and frame lengths.

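
As a rough illustration of the token counts quoted in the note above, the sketch below derives the patch grid and the sequence length from the input resolution. It assumes the patch size of 14 listed in the model card, floor division for resolutions that are not a multiple of the patch size, and that every patch of every frame is kept (no sparse patch selection). The helper names `patch_grid` and `token_count` are hypothetical and are not part of the model's API.

```python
def patch_grid(height: int, width: int, patch_size: int = 14) -> tuple[int, int]:
    """Return (rows, cols) of the patch grid.

    Assumption: a partial patch at the border is dropped (floor division).
    """
    return height // patch_size, width // patch_size


def token_count(height: int, width: int, num_frames: int = 1, patch_size: int = 14) -> int:
    """Sequence length when every patch of every frame is kept."""
    rows, cols = patch_grid(height, width, patch_size)
    return num_frames * rows * cols


# Video pre-training base: 224x224 -> 16x16 grid per frame
print(token_count(224, 224))
# Image pre-training base: 448x448 -> 32x32 grid
print(token_count(448, 448))
# A 4-frame clip at 448x448
print(token_count(448, 448, num_frames=4))
```

Since the sequence length grows linearly with the number of frames and with the pixel area, long clips at high resolution are bounded mainly by memory.
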
```python
from transformers import AutoModel, AutoImageProcessor
from PIL import Image
import torch

# Load model and preprocessor
model = AutoModel.from_pretrained(
    "lmms-lab-encoder/onevision-encoder-large-lang",
    trust_remote_code=True,
    attn_implementation="flash_attention_2"
).to("cuda").eval()

preprocessor = AutoImageProcessor.from_pretrained(
    "lmms-lab-encoder/onevision-encoder-large-lang",
    trust_remote_code=True
)

# Image inference: [B, C, H, W]
image = Image.open("path/to/your/image.jpg")  # Replace with your image path
pixel_values = preprocessor(images=image, return_tensors="pt")["pixel_values"].to("cuda")
with torch.no_grad():
    outputs = model(pixel_values)
    # outputs.last_hidden_state: [B, num_patches, hidden_size]
    # outputs.pooler_output: None (the lang variant has no pooling head)

# Video inference: [B, C, T, H, W] with patch_positions
num_frames, target_frames = 16, 64
patch_size = 14
# Load video frames and preprocess each frame (replace with your video frame paths)
frames = [Image.open(f"path/to/frame_{i}.jpg") for i in range(num_frames)]
video_pixel_values = preprocessor(images=frames, return_tensors="pt")["pixel_values"]
# Reshape from [T, C, H, W] to [B, C, T, H, W]
video = video_pixel_values.unsqueeze(0).permute(0, 2, 1, 3, 4).to("cuda")

# Build patch_positions for temporal sampling: [B, num_frames * frame_tokens, 3]
frame_pos = torch.linspace(0, target_frames - 1, num_frames).long().cuda()  # [T]
grid_h, grid_w = video.shape[-2] // patch_size, video.shape[-1] // patch_size  # patch grid
frame_tokens = grid_h * grid_w

t_positions = frame_pos[:, None].repeat(1, frame_tokens).reshape(-1)  # [T * frame_tokens]
h_positions = torch.arange(grid_h, device="cuda").repeat_interleave(grid_w)
h_positions = h_positions.repeat(num_frames)  # [T * frame_tokens]
w_positions = torch.arange(grid_w, device="cuda").repeat(grid_h)
w_positions = w_positions.repeat(num_frames)  # [T * frame_tokens]

patch_positions = torch.stack([t_positions, h_positions, w_positions], dim=-1).unsqueeze(0)
# patch_positions example (256 tokens per frame, 16x16 patch grid):
#   Each row is [t, h, w].
#   First 4 patches of frame 0 (t=0):
#     patch_positions[0, 0:4, :]     -> [[0, 0, 0], [0, 0, 1], [0, 0, 2], [0, 0, 3]]
#   First 4 patches of frame 1 (t=4):
#     patch_positions[0, 256:260, :] -> [[4, 0, 0], [4, 0, 1], [4, 0, 2], [4, 0, 3]]

with torch.no_grad():
    outputs = model(video, patch_positions=patch_positions)
```

### Loading from Source Code

```bash
git clone https://github.com/EvolvingLMMs-Lab/OneVision-Encoder.git
cd OneVision-Encoder
pip install -e .
```

```python
from onevision_encoder import OneVisionEncoderModel, OneVisionEncoderConfig
from transformers import AutoImageProcessor

model = OneVisionEncoderModel.from_pretrained(
    "lmms-lab-encoder/onevision-encoder-large-lang",
    trust_remote_code=True,
    attn_implementation="flash_attention_2"
).to("cuda").eval()

preprocessor = AutoImageProcessor.from_pretrained(
    "lmms-lab-encoder/onevision-encoder-large-lang",
    trust_remote_code=True
)
```

### LMM Probe Results

The probe is trained on a mixed dataset of 740K samples from LLaVA-OneVision and 800K samples from LLaVA-Video SFT. The training pipeline proceeds directly to Stage 2 fine-tuning. We adopt a streamlined **native-resolution strategy** inspired by LLaVA-OneVision: when the input frame resolution matches the model's native input size, it is fed **directly**, without tiling or cropping, to evaluate the ViT's capability to handle **true native resolution** and **arbitrary frame sequences**.

LMM Probe Results

### Model Card

| Property | Value |
| --- | --- |
| **Model Type** | **LLM-Aligned** Vision Transformer (ViT) |
| **Architecture** | **HEVC-Style** / Codec-Like Vision Transformer |
| **Input Paradigm** | **Codec-Style** (Sparse Patch / Dense Frame) |
| **Resolution Strategy** | **True Native Resolution** (Dynamic, No Tiling) |
| **Temporal Context** | **Arbitrary Frame Count** (Variable Length Support) |
| **Hidden Size** | 1024 |
| **Intermediate Size** | 4096 |
| **Number of Layers** | 24 |
| **Number of Attention Heads** | 16 |
| **Patch Size** | 14 |
| **Positional Encoding** | 3D RoPE (4:6:6 split for T:H:W) |
| **Normalization** | Layer Normalization |
| **Activation Function** | GELU |
| **License** | Apache 2.0 |
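
To make the 4:6:6 positional-encoding entry concrete, here is a small arithmetic sketch of how such a split could partition one attention head. It uses the hidden size and head count from the table and assumes the ratio is applied to the rotary frequency pairs of a head (half of the head dimension). That is an assumption for illustration only; the model's `VideoRotaryEmbeddingSplit466` may partition the dimensions differently. The function `rope_split` is hypothetical.

```python
def rope_split(hidden_size: int = 1024, num_heads: int = 16, ratio=(4, 6, 6)) -> dict:
    """Partition the rotary pairs of one head across (t, h, w) by the given ratio.

    Assumption: the split applies to head_dim // 2 frequency pairs.
    """
    head_dim = hidden_size // num_heads
    pairs = head_dim // 2
    unit, remainder = divmod(pairs, sum(ratio))
    if remainder:
        raise ValueError("rotary pairs are not divisible by the ratio total")
    t, h, w = (part * unit for part in ratio)
    return {"head_dim": head_dim, "pairs": pairs, "t": t, "h": h, "w": w}


split = rope_split()
print(split)
```

Under these assumptions a 64-dimensional head has 32 rotary pairs, of which 8 would encode time and 12 each would encode height and width.
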