OneVision-Encoder

OneVision-Encoder is an LLM-aligned vision transformer specifically optimized for Large Multimodal Models (LMMs). It is a core component of the LLaVA-OneVision-2 series and is further detailed in the technical report OneVision-Encoder: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence.

Project Page | GitHub

Key Features

  • LLM-Aligned Architecture: Unlike standard vision backbones, this model is optimized for use inside LMMs, so its features are aligned with the language model it is connected to.
  • True Native Resolution: Accepts dynamic, native-resolution inputs directly. Images and videos are processed at their original aspect ratios, without tiling, cropping, padding, or resizing.
  • Arbitrary Frame Support: Processes video inputs with any number of frames, so inputs are not restricted to a fixed frame count and long-context video understanding is limited only by memory.
  • Codec-Style Input Processing: Implements a "OneVision" mechanism that treats video like a codec stream: it samples dense frames sparsely (selecting important patches from many frames) instead of the traditional approach of sampling sparse frames densely.
  • 3D Rotary Position Embedding: Uses a 4:6:6 split for temporal, height, and width dimensions to capture complex spatiotemporal relationships across arbitrary sequence lengths.
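
To make the "dense frames, sparse patches" idea concrete, here is a minimal sketch in plain Python. It assumes a hypothetical per-patch importance score (for example, a frame-difference magnitude); the model card does not say how patches are actually selected, so `select_patches` and the `scores` layout are illustrative only.

```python
import heapq


def select_patches(scores, budget):
    """Pick the `budget` highest-scoring patches across all frames.

    scores[t][h][w] is a HYPOTHETICAL importance value for the patch at
    frame t, row h, column w. Returns (t, h, w) positions sorted by frame
    first, then in raster order within each frame.
    """
    candidates = [
        (score, (t, h, w))
        for t, frame in enumerate(scores)
        for h, row in enumerate(frame)
        for w, score in enumerate(row)
    ]
    top = heapq.nlargest(budget, candidates)
    return sorted(position for _, position in top)


# Toy input: 4 frames of a 2x2 patch grid, with one salient patch per frame.
scores = [
    [[0.9, 0.1], [0.0, 0.0]],
    [[0.0, 0.8], [0.0, 0.0]],
    [[0.0, 0.0], [0.7, 0.0]],
    [[0.0, 0.0], [0.0, 0.6]],
]
positions = select_patches(scores, budget=4)
```

With a budget of 4 tokens, this keeps one patch from each of the four frames rather than all four patches of a single frame, which is the trade-off the bullet describes.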

Downstream Tasks

  • Video benchmarks: MVBench, VideoMME, Perception Test
  • Image understanding: DocVQA, ChartQA, OCRBench
  • Action recognition: SSv2, UCF101, Kinetics

Quick Start

Transformers Version Compatibility:

  • transformers==4.57.3 (Recommended): Works with AutoModel.from_pretrained()
  • ⚠️ transformers>=5.0.0: Not currently supported. We are actively working on a fix.
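
A small guard like the following can flag an incompatible install before any weights are downloaded. This is a sketch: `check_transformers_version` is a hypothetical helper, and the "untested" label for versions other than 4.57.3 below 5.0.0 is an assumption, since the notes above only cover the two listed cases.

```python
def parse_version(version):
    """Parse the leading numeric 'major.minor.patch' part of a version string."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def check_transformers_version(version):
    """Classify a transformers version string against the notes above."""
    parsed = parse_version(version)
    if parsed >= (5, 0, 0):
        return "unsupported"   # transformers>=5.0.0 is not currently supported
    if parsed == (4, 57, 3):
        return "recommended"   # the version listed as recommended
    return "untested"          # assumption: not covered by the notes above


status = check_transformers_version("4.57.3")
```

In practice the argument would be `transformers.__version__`.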

Note on Inputs: While the model is pre-trained with the configurations below, it supports dynamic native resolution and arbitrary frame counts during inference:

  • Pre-training Image Base: 448×448
  • Pre-training Video Base: 224×224 (256 tokens/frame)
  • Inference: Supports variable resolutions and frame lengths.
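
The token counts follow from the patch size of 14. The sketch below assumes the patch grid is obtained by integer division, as in the video example further down; how the preprocessor treats sizes that are not multiples of 14 is not stated here.

```python
PATCH_SIZE = 14  # patch size listed under Model Properties


def patch_grid(height, width, patch_size=PATCH_SIZE):
    """Return (rows, cols) of the patch grid, assuming integer division."""
    return height // patch_size, width // patch_size


def num_tokens(height, width, num_frames=1, patch_size=PATCH_SIZE):
    """Total patch tokens for an image (num_frames=1) or a video clip."""
    grid_h, grid_w = patch_grid(height, width, patch_size)
    return num_frames * grid_h * grid_w


video_tokens_per_frame = num_tokens(224, 224)        # 16 * 16 = 256
image_tokens = num_tokens(448, 448)                  # 32 * 32 = 1024
clip_tokens = num_tokens(224, 224, num_frames=16)    # 16 * 256 = 4096
```
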
from transformers import AutoModel, AutoImageProcessor
from PIL import Image
import torch

# Load model and preprocessor
model = AutoModel.from_pretrained(
    "lmms-lab-encoder/onevision-encoder-large-lang",
    trust_remote_code=True,
    attn_implementation="flash_attention_2",  # requires the flash-attn package
).to("cuda").eval()

preprocessor = AutoImageProcessor.from_pretrained(
    "lmms-lab-encoder/onevision-encoder-large-lang",
    trust_remote_code=True
)

# Image inference: [B, C, H, W]
image = Image.open("path/to/your/image.jpg")  # Replace with your image path
pixel_values = preprocessor(images=image, return_tensors="pt")["pixel_values"].to("cuda")
with torch.no_grad():
    outputs = model(pixel_values)
    # outputs.last_hidden_state: [B, num_patches, hidden_size]
    # outputs.pooler_output: [B, hidden_size]

# Video inference: [B, C, T, H, W] with patch_positions
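# num_frames: frames actually loaded; target_frames: length of the temporal
# position range that the loaded frames are spread across (see frame_pos below)
num_frames, target_frames = 16, 64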
patch_size = 14
# Load video frames and preprocess each frame (replace with your video frame paths)
frames = [Image.open(f"path/to/frame_{i}.jpg") for i in range(num_frames)]
video_pixel_values = preprocessor(images=frames, return_tensors="pt")["pixel_values"]
# Reshape from [T, C, H, W] to [B, C, T, H, W]
video = video_pixel_values.unsqueeze(0).permute(0, 2, 1, 3, 4).to("cuda")

# Build patch_positions for temporal sampling: [B, num_frames * frame_tokens, 3]
frame_pos = torch.linspace(0, target_frames - 1, num_frames).long().cuda()  # [T]
grid_h, grid_w = video.shape[-2] // patch_size, video.shape[-1] // patch_size  # patch grid
frame_tokens = grid_h * grid_w

t_positions = frame_pos[:, None].repeat(1, frame_tokens).reshape(-1)  # [T * frame_tokens]
h_positions = torch.arange(grid_h, device="cuda").repeat_interleave(grid_w)
h_positions = h_positions.repeat(num_frames)  # [T * frame_tokens]
w_positions = torch.arange(grid_w, device="cuda").repeat(grid_h)
w_positions = w_positions.repeat(num_frames)  # [T * frame_tokens]

patch_positions = torch.stack([t_positions, h_positions, w_positions], dim=-1).unsqueeze(0)

with torch.no_grad():
    outputs = model(video, patch_positions=patch_positions)
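
To see what the `patch_positions` tensor contains, the same layout can be written in plain Python on a tiny example. This is a sketch rather than the library's own helper: `frame_indices` and `build_patch_positions` are hypothetical names, and the evenly spaced frame indices are computed with integer arithmetic as a stand-in for `torch.linspace(...).long()`.

```python
def frame_indices(num_frames, target_frames):
    """Evenly spaced temporal indices in [0, target_frames - 1]."""
    if num_frames == 1:
        return [0]
    return [i * (target_frames - 1) // (num_frames - 1) for i in range(num_frames)]


def build_patch_positions(num_frames, target_frames, grid_h, grid_w):
    """List of [t, h, w] triples: frame by frame, row-major within each frame."""
    return [
        [t, h, w]
        for t in frame_indices(num_frames, target_frames)
        for h in range(grid_h)
        for w in range(grid_w)
    ]


# Tiny example: 2 loaded frames placed on a 4-frame timeline, 2x2 patch grid.
positions = build_patch_positions(num_frames=2, target_frames=4, grid_h=2, grid_w=2)
# positions[:4] belong to temporal index 0, positions[4:] to temporal index 3.
```

Wrapping the result as `torch.tensor(positions).unsqueeze(0)` gives the `[B, num_frames * frame_tokens, 3]` shape used above, with `B = 1`.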

Model Properties

| Property | Value |
|---|---|
| Model Type | LLM-Aligned Vision Transformer (ViT) |
| Architecture | HEVC-Style / Codec-Like Vision Transformer |
| Input Paradigm | Codec-Style (Sparse Patch / Dense Frame) |
| Resolution Strategy | True Native Resolution (Dynamic, No Tiling) |
| Temporal Context | Arbitrary Frame Count (Variable Length Support) |
| Hidden Size | 1024 |
| Intermediate Size | 4096 |
| Number of Layers | 24 |
| Number of Attention Heads | 16 |
| Patch Size | 14 |
| Positional Encoding | 3D RoPE (4:6:6 split for T:H:W) |
| Normalization | Layer Normalization |
| Activation Function | GELU |
| License | Apache 2.0 |
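
The per-head dimension follows directly from the values above. The sketch below also shows one way to read the 4:6:6 RoPE split, under the assumption that the ratio divides each head's dimension proportionally; the actual allocation in the model code may differ (for example, it may be applied to rotary frequency pairs instead).

```python
HIDDEN_SIZE = 1024
NUM_HEADS = 16
ROPE_RATIO = (4, 6, 6)  # temporal : height : width


def rope_split(head_dim, ratio=ROPE_RATIO):
    """Split head_dim proportionally to `ratio` (ASSUMED interpretation)."""
    total = sum(ratio)
    if head_dim % total != 0:
        raise ValueError("head_dim must be divisible by the sum of the ratio")
    unit = head_dim // total
    return tuple(r * unit for r in ratio)


head_dim = HIDDEN_SIZE // NUM_HEADS           # 1024 / 16 = 64
t_dim, h_dim, w_dim = rope_split(head_dim)    # (16, 24, 24) under this assumption
```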

Citation

@inproceedings{LLaVA-OneVision-2,
  title={LLaVA-OneVision-2},
  author={llava-onevision contributors},
  booktitle={arXiv},
  year={2026}
}

@article{tang2026onevisionencoder,
  title={OneVision-Encoder: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence},
  author={Tang, Feilong and An, Xiang and Yan, Yunyao and Xie, Yin and Qin, Bin and Yang, Kaicheng and Shen, Yifei and Zhang, Yuanhan and Li, Chunyuan and Feng, Shikun and Chen, Changrui and Tan, Huajie and Hu, Ming and Zhang, Manyuan and Li, Bo and Feng, Ziyong and Liu, Ziwei and Ge, Zongyuan and Deng, Jiankang},
  journal={arXiv preprint arXiv:2602.08683},
  year={2026}
}