onevision-encoder-large-tf57

transformers 5.7+ idiomatic variant of lmms-lab-encoder/onevision-encoder-large. Weights are byte-identical to the upstream model (same safetensors SHA-256). Only modeling_onevision_encoder.py and config.json (transformers_version field) differ.

Why this variant

Upstream modeling_onevision_encoder.py is written against the transformers 4.x API surface and does not load correctly under transformers >= 5.0:

  1. _supports_flash_attn_2 was renamed to _supports_flash_attn.
  2. The v5 fast-init / meta-tensor path skips re-initialization of persistent=False buffers, leaving VideoRotaryEmbeddingSplit466.inv_freq_{t,h,w} filled with uninitialized memory. RoPE then produces garbage and downstream attention diverges (max diff up to 700+ vs upstream).
  3. add_start_docstrings* / replace_return_docstrings decorators are removed in v5.
  4. Manual eager-only attention is replaced by the v5 ALL_ATTENTION_FUNCTIONS interface dispatching across eager, sdpa, flash_attention_2, flex_attention.

v5-only notice

This variant requires transformers >= 5.7.0 and will not load under transformers 4.x. Use the upstream model dir for v4 environments.

Diff vs upstream

File Change
model.safetensors unchanged (byte-identical)
config.json transformers_version: 4.57.3 -> 5.7.0
configuration_onevision_encoder.py unchanged
preprocessor_config.json unchanged
modeling_onevision_encoder.py full v5-idiom rewrite: _supports_flash_attn/_supports_sdpa/_supports_flex_attn/_supports_attention_backend, ALL_ATTENTION_FUNCTIONS.get_interface(...) dispatch, @auto_docstring + @can_return_tuple, removed v4 docstring decorators and use_return_dict branches, _init_weights hook calls VideoRotaryEmbeddingSplit466.reset_inv_freqs() to fix the inv_freq init bug.

Usage

from transformers import AutoModel

model = AutoModel.from_pretrained(
    "path/to/onevision-encoder-large-tf57",
    trust_remote_code=True,
    attn_implementation="eager",  # or "sdpa", "flash_attention_2", "flex_attention"
)

Tested with transformers==5.7.0, torch>=2.4.

Equivalence verification

Cross-version (upstream tf 4.57.3 vs this tf 5.7.0) on 11 input shapes (single image / multi-frame video / batched / non-square / visible_indices):

dtype attn result
fp32 eager bit-identical (max_diff = 0.0 across all 22 tensors)
bf16 eager bit-identical (max_diff = 0.0 across all 22 tensors)

Plus 7 v5-only scenario tests, all PASSED:

  1. eager vs sdpa equivalence (max=7.5e-5)
  2. save_pretrained then from_pretrained bit-identical round-trip
  3. cpu vs cuda equivalence (max=4.1e-5)
  4. fp32/bf16/fp16 dtype preservation
  5. gradient flow (389/399 params receive non-zero grad)
  6. runtime _attn_implementation switch
  7. from_pretrained idempotency (two loads bit-identical)

Plus real-input end-to-end tests on a real JPEG (1332x725) and a real MP4 (decord, 4 frames @ 512x512), preprocessed through AutoImageProcessor (CLIPImageProcessor):

path result
image: PIL -> processor -> model fwd finite, lhs=(1,1024,1024), pool=(1,1024)
video: decord -> 5D (1,3,4,448,448) -> model fwd finite, lhs=(1,4096,1024), pool=(1,1024)
model-only equivalence on identical pixel_values (v4 vs v5) bit-identical (max_diff = 0.0 on image+video)

Note: Raw pixel_values from CLIPImageProcessor differ by ~1e-2 between transformers 4.57.3 and 5.7.0 due to upstream resize/normalize changes in transformers itself (independent of this variant). When the same pixel_values are fed to both versions, this model is bit-identical.

Reproduce with tools/upgrade_v5/run_all.sh from the OneVision-Encoder repo.

Changelog

  • tf57: full v5-idiom rewrite; weights unchanged.

The original model card from upstream follows.

Model Card

Property Value
Model Type Vision Transformer (ViT)
Architecture HEVC-Style Vision Transformer
Hidden Size 1024
Intermediate Size 4096
Number of Layers 24
Number of Attention Heads 16
Patch Size 16
Image Resolution 448×448 (pre-trained)
Video Resolution 224×224 with 256 tokens per frame
Positional Encoding 3D RoPE (4:6:6 split for T:H:W)
Normalization Layer Normalization
Activation Function GELU
License Apache 2.0

Key Features

  • Codec-Style Patch Selection: Instead of sampling sparse frames densely (all patches from few frames), OneVision Encoder samples dense frames sparsely (important patches from many frames).
  • 3D Rotary Position Embedding: Uses a 4:6:6 split for temporal, height, and width dimensions to capture spatiotemporal relationships.
  • Native Resolution Support: Supports native resolution input without tiling or cropping.
  • Flash Attention 2: Efficient attention implementation for improved performance and memory efficiency.

Intended Use

Primary Use Cases

  • Video Understanding: Action recognition, video captioning, video question answering
  • Image Understanding: Document understanding (DocVQA), chart understanding (ChartQA), OCR tasks
  • Vision-Language Models: As the vision encoder backbone for multimodal large language models

Downstream Tasks

  • Video benchmarks: MVBench, VideoMME, Perception Test
  • Image understanding: DocVQA, ChartQA, OCRBench
  • Action recognition: SSv2, UCF101, Kinetics

Quick Start

Note: This model supports native resolution input. For optimal performance:

  • Image: 448×448 resolution (pre-trained)
  • Video: 224×224 resolution with 256 tokens per frame (pre-trained)
from transformers import AutoModel, AutoImageProcessor
from PIL import Image
import torch

# Load model and preprocessor
model = AutoModel.from_pretrained(
    "lmms-lab-encoder/onevision-encoder-large",
    trust_remote_code=True,
    attn_implementation="flash_attention_2"
).to("cuda").eval()

preprocessor = AutoImageProcessor.from_pretrained(
    "lmms-lab-encoder/onevision-encoder-large",
    trust_remote_code=True
)

# Image inference: [B, C, H, W]
image = Image.open("path/to/your/image.jpg")  # Replace with your image path
pixel_values = preprocessor(images=image, return_tensors="pt")["pixel_values"].to("cuda")
with torch.no_grad():
    outputs = model(pixel_values)
    # outputs.last_hidden_state: [B, num_patches, hidden_size]
    # outputs.pooler_output: [B, hidden_size]

# Video inference: [B, C, T, H, W] with visible_indices
num_frames, frame_tokens, target_frames = 16, 256, 64
# Load video frames and preprocess each frame (replace with your video frame paths)
frames = [Image.open(f"path/to/frame_{i}.jpg") for i in range(num_frames)]
video_pixel_values = preprocessor(images=frames, return_tensors="pt")["pixel_values"]
# Reshape from [T, C, H, W] to [B, C, T, H, W]
video = video_pixel_values.unsqueeze(0).permute(0, 2, 1, 3, 4).to("cuda")

# Build visible_indices for temporal sampling
frame_pos = torch.linspace(0, target_frames - 1, num_frames).long().cuda()
visible_indices = (frame_pos.unsqueeze(-1) * frame_tokens + torch.arange(frame_tokens).cuda()).reshape(1, -1)
# visible_indices example (with 256 tokens per frame):
#   Frame 0 (pos=0):  indices [0, 1, 2, ..., 255]
#   Frame 1 (pos=4):  indices [1024, 1025, 1026, ..., 1279]
#   Frame 2 (pos=8):  indices [2048, 2049, 2050, ..., 2303]
#   ...
#   Frame 15 (pos=63): indices [16128, 16129, ..., 16383]

with torch.no_grad():
    outputs = model(video, visible_indices=visible_indices)

LMM Probe Results

Training on a mixed dataset of 740K samples from LLaVA-OneVision and 800K samples from LLaVA-Video SFT. The training pipeline proceeds directly to Stage 2 fine-tuning. We adopt a streamlined native-resolution strategy inspired by LLaVA-OneVision: when the input frame resolution matches the model's native input size, it is fed directly—without tiling or cropping—to evaluate the ViT's native resolution capability.

LMM Probe Results

Attentive Probe Results

Performance comparison of different vision encoders using Attentive Probe evaluation. Models are evaluated using single clip input and trained for 10 epochs across 8 action recognition datasets. Results show average performance and per-dataset scores for 8-frame and 16-frame configurations.

LMM Probe Results

Codec Input

TODO: Add codec-style input documentation for temporal saliency-based patch selection.


Downloads last month
-
Safetensors
Model size
0.3B params
Tensor type
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Collection including lmms-lab-encoder/onevision-encoder-large-tf57