# onevision-encoder-large-tf57

`transformers` 5.7+ idiomatic variant of `lmms-lab-encoder/onevision-encoder-large`. Weights are byte-identical to the upstream model (same `safetensors` SHA-256). Only `modeling_onevision_encoder.py` and `config.json` (the `transformers_version` field) differ.

## Why this variant
Upstream `modeling_onevision_encoder.py` is written against the `transformers` 4.x API surface and does not load correctly under `transformers >= 5.0`:
- `_supports_flash_attn_2` was renamed to `_supports_flash_attn`.
- The v5 fast-init / meta-tensor path skips re-initialization of `persistent=False` buffers, leaving `VideoRotaryEmbeddingSplit466.inv_freq_{t,h,w}` filled with uninitialized memory. RoPE then produces garbage and downstream attention diverges (max diff up to 700+ vs upstream); see the sketch after this list.
- The `add_start_docstrings*` / `replace_return_docstrings` decorators are removed in v5.
- Manual eager-only attention is replaced by the v5 `ALL_ATTENTION_FUNCTIONS` interface dispatching across `eager`, `sdpa`, `flash_attention_2`, and `flex_attention`.
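To make the buffer issue concrete, here is a minimal illustrative sketch of why `persistent=False` rotary buffers need an explicit reset hook. The class below is a toy stand-in; its name, dims, theta, and internals are invented rather than taken from the repo's `VideoRotaryEmbeddingSplit466`.

```python
import torch
import torch.nn as nn


class RotaryBuffersSketch(nn.Module):
    """Toy stand-in for the rotary module; dims and theta here are made up."""

    def __init__(self, dim_t=16, dim_h=24, dim_w=24, theta=10000.0):
        super().__init__()
        self.theta = theta
        self.dims = {"t": dim_t, "h": dim_h, "w": dim_w}
        for axis, dim in self.dims.items():
            # persistent=False buffers are excluded from the state_dict, so
            # from_pretrained never restores them; under the v5 fast-init /
            # meta-tensor path they can be left holding uninitialized memory.
            self.register_buffer(f"inv_freq_{axis}", self._inv_freq(dim), persistent=False)

    def _inv_freq(self, dim):
        return 1.0 / (self.theta ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))

    def reset_inv_freqs(self):
        # Recompute the buffers in place; the variant's _init_weights hook calls
        # the model's own reset_inv_freqs() for the same reason.
        for axis, dim in self.dims.items():
            buf = getattr(self, f"inv_freq_{axis}")
            buf.copy_(self._inv_freq(dim).to(buf.device))
```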
## v5-only notice

This variant requires `transformers >= 5.7.0` and will not load under `transformers 4.x`. Use the upstream model dir for v4 environments.

## Diff vs upstream
| File | Change |
|---|---|
| `model.safetensors` | unchanged (byte-identical) |
| `config.json` | `transformers_version`: 4.57.3 -> 5.7.0 |
| `configuration_onevision_encoder.py` | unchanged |
| `preprocessor_config.json` | unchanged |
| `modeling_onevision_encoder.py` | full v5-idiom rewrite: `_supports_flash_attn` / `_supports_sdpa` / `_supports_flex_attn` / `_supports_attention_backend`, `ALL_ATTENTION_FUNCTIONS.get_interface(...)` dispatch, `@auto_docstring` + `@can_return_tuple`, removed v4 docstring decorators and `use_return_dict` branches, `_init_weights` hook calls `VideoRotaryEmbeddingSplit466.reset_inv_freqs()` to fix the inv_freq init bug |
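For orientation, here is a self-contained sketch of the dispatch idea behind the attention rewrite: the module looks up a backend by name instead of hard-coding eager attention. The registry and function names below are local to this example; the real lookup in the rewritten module goes through transformers' `ALL_ATTENTION_FUNCTIONS.get_interface(...)` as listed in the table above.

```python
import torch
import torch.nn.functional as F


def eager_attention(q, k, v, scaling):
    # Plain softmax attention: the "eager" backend.
    attn = torch.softmax(q @ k.transpose(-2, -1) * scaling, dim=-1)
    return attn @ v


def sdpa_attention(q, k, v, scaling):
    # PyTorch's fused scaled_dot_product_attention: the "sdpa" backend.
    return F.scaled_dot_product_attention(q, k, v, scale=scaling)


# Local stand-in for the library registry keyed by attn_implementation.
ATTENTION_REGISTRY = {"eager": eager_attention, "sdpa": sdpa_attention}


def attend(q, k, v, attn_implementation="eager"):
    # Pick the backend at call time from the configured implementation name.
    scaling = q.shape[-1] ** -0.5
    return ATTENTION_REGISTRY[attn_implementation](q, k, v, scaling)


q = k = v = torch.randn(1, 16, 64, 64)  # (batch, heads, seq, head_dim)
print(attend(q, k, v, "sdpa").shape)    # torch.Size([1, 16, 64, 64])
```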
## Usage

```python
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "path/to/onevision-encoder-large-tf57",
    trust_remote_code=True,
    attn_implementation="eager",  # or "sdpa", "flash_attention_2", "flex_attention"
)
```
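After loading, you can check which backend was actually selected; `_attn_implementation` is the same config attribute exercised by the runtime-switch test in the verification section below.

```python
print(model.config._attn_implementation)  # e.g. "eager"
```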
Tested with `transformers==5.7.0`, `torch>=2.4`.

## Equivalence verification
Cross-version (upstream tf 4.57.3 vs this tf 5.7.0) on 11 input shapes (single image / multi-frame video / batched / non-square / `visible_indices`):

| dtype | attn | result |
|---|---|---|
| fp32 | eager | bit-identical (max_diff = 0.0 across all 22 tensors) |
| bf16 | eager | bit-identical (max_diff = 0.0 across all 22 tensors) |

Plus 7 v5-only scenario tests, all PASSED:
- eager vs sdpa equivalence (max=7.5e-5)
- save_pretrained then from_pretrained bit-identical round-trip
- cpu vs cuda equivalence (max=4.1e-5)
- fp32/bf16/fp16 dtype preservation
- gradient flow (389/399 params receive non-zero grad)
- runtime `_attn_implementation` switch
- `from_pretrained` idempotency (two loads bit-identical)

Plus real-input end-to-end tests on a real JPEG (1332x725) and a real MP4 (decord, 4 frames @ 512x512), preprocessed through `AutoImageProcessor` (`CLIPImageProcessor`):
| path | result |
|---|---|
| image: PIL -> processor -> model fwd | finite, lhs=(1,1024,1024), pool=(1,1024) |
| video: decord -> 5D (1,3,4,448,448) -> model fwd | finite, lhs=(1,4096,1024), pool=(1,1024) |
| model-only equivalence on identical pixel_values (v4 vs v5) | bit-identical (max_diff = 0.0 on image+video) |
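The bit-identical rows above amount to a max-absolute-difference check over dumped output tensors. A minimal sketch of that comparison is shown below; the file names and tensor keys are placeholders, not the repo's actual test harness (the real harness is the reproduction script referenced below).

```python
import torch

# Outputs saved for identical pixel_values under transformers 4.57.3 and 5.7.0;
# the paths here are placeholders.
ref = torch.load("outputs_tf457.pt")
new = torch.load("outputs_tf570.pt")

for name, ref_tensor in ref.items():
    max_diff = (ref_tensor - new[name]).abs().max().item()
    print(f"{name}: max_diff={max_diff:.3e}")
    assert max_diff == 0.0, f"{name} diverged"
```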
Note: Raw `pixel_values` from `CLIPImageProcessor` differ by ~1e-2 between transformers 4.57.3 and 5.7.0 due to upstream resize/normalize changes in `transformers` itself (independent of this variant). When the same pixel_values are fed to both versions, this model is bit-identical.
Reproduce with `tools/upgrade_v5/run_all.sh` from the OneVision-Encoder repo.

## Changelog
- tf57: full v5-idiom rewrite; weights unchanged.
The original model card from upstream follows.
## Model Card
| Property | Value |
|---|---|
| Model Type | Vision Transformer (ViT) |
| Architecture | HEVC-Style Vision Transformer |
| Hidden Size | 1024 |
| Intermediate Size | 4096 |
| Number of Layers | 24 |
| Number of Attention Heads | 16 |
| Patch Size | 16 |
| Image Resolution | 448×448 (pre-trained) |
| Video Resolution | 224×224 with 256 tokens per frame |
| Positional Encoding | 3D RoPE (4:6:6 split for T:H:W) |
| Normalization | Layer Normalization |
| Activation Function | GELU |
| License | Apache 2.0 |
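If you want to sanity-check these numbers against the shipped config, something like the snippet below should work; the attribute names are assumed to follow the usual transformers conventions and have not been verified against `configuration_onevision_encoder.py`.

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained(
    "lmms-lab-encoder/onevision-encoder-large", trust_remote_code=True
)
# Attribute names assumed, not taken from the custom configuration class.
print(config.hidden_size, config.num_hidden_layers, config.num_attention_heads)
```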
## Key Features
- Codec-Style Patch Selection: Instead of sampling sparse frames densely (all patches from few frames), OneVision Encoder samples dense frames sparsely (important patches from many frames).
- 3D Rotary Position Embedding: Uses a 4:6:6 split for temporal, height, and width dimensions to capture spatiotemporal relationships (a worked example of the split is sketched after this list).
- Native Resolution Support: Supports native resolution input without tiling or cropping.
- Flash Attention 2: Efficient attention implementation for improved performance and memory efficiency.
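As a worked example of the 4:6:6 split mentioned above, one natural reading is that the per-head rotary dimension is divided among the temporal, height, and width axes in that ratio; the exact partition lives in `modeling_onevision_encoder.py`, so treat the arithmetic below as an assumption-laden sketch.

```python
# Sketch only: dividing the per-head rotary dimension 4:6:6 across T:H:W.
# hidden_size=1024 and 16 heads come from the table above; the mapping of the
# ratio onto dimensions is an assumption, not the repo's exact code.
hidden_size, num_heads = 1024, 16
head_dim = hidden_size // num_heads   # 64
unit = head_dim // (4 + 6 + 6)        # 4
dim_t, dim_h, dim_w = 4 * unit, 6 * unit, 6 * unit
print(dim_t, dim_h, dim_w)            # 16 24 24
```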
## Intended Use

### Primary Use Cases
- Video Understanding: Action recognition, video captioning, video question answering
- Image Understanding: Document understanding (DocVQA), chart understanding (ChartQA), OCR tasks
- Vision-Language Models: As the vision encoder backbone for multimodal large language models
### Downstream Tasks
- Video benchmarks: MVBench, VideoMME, Perception Test
- Image understanding: DocVQA, ChartQA, OCRBench
- Action recognition: SSv2, UCF101, Kinetics
## Quick Start
Note: This model supports native resolution input. For optimal performance:
- Image: 448×448 resolution (pre-trained)
- Video: 224×224 resolution with 256 tokens per frame (pre-trained)
```python
from transformers import AutoModel, AutoImageProcessor
from PIL import Image
import torch

# Load model and preprocessor
model = AutoModel.from_pretrained(
    "lmms-lab-encoder/onevision-encoder-large",
    trust_remote_code=True,
    attn_implementation="flash_attention_2"
).to("cuda").eval()

preprocessor = AutoImageProcessor.from_pretrained(
    "lmms-lab-encoder/onevision-encoder-large",
    trust_remote_code=True
)

# Image inference: [B, C, H, W]
image = Image.open("path/to/your/image.jpg")  # Replace with your image path
pixel_values = preprocessor(images=image, return_tensors="pt")["pixel_values"].to("cuda")
with torch.no_grad():
    outputs = model(pixel_values)
# outputs.last_hidden_state: [B, num_patches, hidden_size]
# outputs.pooler_output: [B, hidden_size]

# Video inference: [B, C, T, H, W] with visible_indices
num_frames, frame_tokens, target_frames = 16, 256, 64

# Load video frames and preprocess each frame (replace with your video frame paths)
frames = [Image.open(f"path/to/frame_{i}.jpg") for i in range(num_frames)]
video_pixel_values = preprocessor(images=frames, return_tensors="pt")["pixel_values"]

# Reshape from [T, C, H, W] to [B, C, T, H, W]
video = video_pixel_values.unsqueeze(0).permute(0, 2, 1, 3, 4).to("cuda")

# Build visible_indices for temporal sampling
frame_pos = torch.linspace(0, target_frames - 1, num_frames).long().cuda()
visible_indices = (frame_pos.unsqueeze(-1) * frame_tokens + torch.arange(frame_tokens).cuda()).reshape(1, -1)
# visible_indices example (with 256 tokens per frame):
# Frame 0 (pos=0): indices [0, 1, 2, ..., 255]
# Frame 1 (pos=4): indices [1024, 1025, 1026, ..., 1279]
# Frame 2 (pos=8): indices [2048, 2049, 2050, ..., 2303]
# ...
# Frame 15 (pos=63): indices [16128, 16129, ..., 16383]

with torch.no_grad():
    outputs = model(video, visible_indices=visible_indices)
```
## LMM Probe Results
We train on a mixed dataset of 740K samples from LLaVA-OneVision and 800K samples from LLaVA-Video SFT; the training pipeline proceeds directly to Stage 2 fine-tuning. We adopt a streamlined native-resolution strategy inspired by LLaVA-OneVision: when the input frame resolution matches the model's native input size, it is fed directly, without tiling or cropping, to evaluate the ViT's native-resolution capability.
## Attentive Probe Results
Performance comparison of different vision encoders using Attentive Probe evaluation. Models are evaluated using single clip input and trained for 10 epochs across 8 action recognition datasets. Results show average performance and per-dataset scores for 8-frame and 16-frame configurations.
## Codec Input
TODO: Add codec-style input documentation for temporal saliency-based patch selection.