Create README.md
README.md
ADDED
@@ -0,0 +1,149 @@
# OneVision-Encoder

### Key Features

- **LLM-Aligned Architecture**: Unlike standard vision backbones, this model is optimized specifically for **Large Multimodal Models (LMMs)**, aiming for seamless feature alignment and strong performance when connected to language models.
- **True Native Resolution**: Supports dynamic, **fully native resolution** inputs directly. It processes images and videos at their original aspect ratios, with no tiling, cropping, padding, or resizing hacks.
- **Arbitrary Frame Support**: Processes video inputs with **any number of frames** (variable length). Removing the fixed-frame constraint allows flexible long-context video understanding limited only by memory.
- **Codec-Style Input Processing**: Implements a "OneVision" mechanism that treats video like a codec stream: it **samples dense frames sparsely** (selecting important patches from many frames) rather than sampling sparse frames densely, as traditional approaches do.
- **3D Rotary Position Embedding**: Uses a 4:6:6 split across the temporal, height, and width dimensions to capture spatiotemporal relationships across arbitrary sequence lengths.
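
To make "sampling dense frames sparsely" concrete, here is a minimal pure-Python sketch. The scoring rule (mean absolute change from the co-located patch in the previous frame) and the names `patch_scores` and `select_patches` are assumptions chosen for illustration. This README does not specify how the model actually picks patches.

```python
from typing import List, Tuple

Frame = List[List[float]]  # one grayscale frame as rows of pixel values


def patch_scores(prev: Frame, cur: Frame, patch: int) -> List[Tuple[float, int, int]]:
    """Score each patch of `cur` by its mean absolute change from `prev`."""
    grid_h, grid_w = len(cur) // patch, len(cur[0]) // patch
    scores = []
    for gh in range(grid_h):
        for gw in range(grid_w):
            total = 0.0
            for y in range(gh * patch, (gh + 1) * patch):
                for x in range(gw * patch, (gw + 1) * patch):
                    total += abs(cur[y][x] - prev[y][x])
            scores.append((total / (patch * patch), gh, gw))
    return scores


def select_patches(frames: List[Frame], patch: int, budget: int) -> List[Tuple[int, int, int]]:
    """Keep the `budget` highest-scoring (t, h, w) patches across all frames."""
    candidates = []
    # Frame 0 has no predecessor, so it is not scored in this sketch.
    for t in range(1, len(frames)):
        for score, gh, gw in patch_scores(frames[t - 1], frames[t], patch):
            candidates.append((score, t, gh, gw))
    # Largest change first; ties are broken by position so the result is deterministic.
    candidates.sort(key=lambda c: (-c[0], c[1], c[2], c[3]))
    return sorted((t, gh, gw) for _, t, gh, gw in candidates[:budget])
```

The returned `(t, h, w)` triples use the same row layout as `patch_positions` in the Quick Start video example, which is why a sparse selection of this kind can be described to the encoder by positions alone.
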

#### Downstream Tasks

- **Video benchmarks**: MVBench, VideoMME, Perception Test
- **Image understanding**: DocVQA, ChartQA, OCRBench
- **Action recognition**: SSv2, UCF101, Kinetics

### Quick Start

> [!IMPORTANT]
> **Transformers Version Compatibility:**
>
> - ✅ **`transformers==4.57.3`** (Recommended): Works with `AutoModel.from_pretrained()`
> - ⚠️ **`transformers>=5.0.0`**: Not currently supported. We are actively working on a fix.

> **Note on Inputs:**
> While the model is pre-trained with the configurations below, it supports **dynamic native resolution** and **arbitrary frame counts** during inference:
>
> - **Pre-training Image Base**: 448×448
> - **Pre-training Video Base**: 224×224 (256 tokens/frame)
> - **Inference**: Supports variable resolutions and frame lengths.

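
The compatibility note can be turned into a small guard that runs before the model is loaded. This is only a sketch: the helper name `transformers_support_status` and its three status labels are invented here. Versions other than the two named in the note are reported as untested, because this README says nothing about them.

```python
def transformers_support_status(version: str) -> str:
    """Classify a transformers version string against the compatibility note."""
    # Keep only the leading numeric release part, e.g. "5.0.0.dev0" -> (5, 0, 0).
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    release = tuple(parts[:3])

    if release == (4, 57, 3):
        return "recommended"
    if release and release[0] >= 5:
        return "unsupported"
    return "untested"  # the README makes no statement about other versions
```

Call it with `transformers.__version__` and warn or stop when the result is not `"recommended"`.
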

```python
from transformers import AutoModel, AutoImageProcessor
from PIL import Image
import torch

# Load model and preprocessor
model = AutoModel.from_pretrained(
    "lmms-lab-encoder/onevision-encoder-large",
    trust_remote_code=True,
    attn_implementation="flash_attention_2"
).to("cuda").eval()

preprocessor = AutoImageProcessor.from_pretrained(
    "lmms-lab-encoder/onevision-encoder-large",
    trust_remote_code=True
)

# Image inference: [B, C, H, W]
image = Image.open("path/to/your/image.jpg")  # Replace with your image path
pixel_values = preprocessor(images=image, return_tensors="pt")["pixel_values"].to("cuda")
with torch.no_grad():
    outputs = model(pixel_values)
    # outputs.last_hidden_state: [B, num_patches, hidden_size]
    # outputs.pooler_output: [B, hidden_size]

# Video inference: [B, C, T, H, W] with patch_positions
num_frames, target_frames = 16, 64
patch_size = 14
# Load video frames and preprocess each frame (replace with your video frame paths)
frames = [Image.open(f"path/to/frame_{i}.jpg") for i in range(num_frames)]
video_pixel_values = preprocessor(images=frames, return_tensors="pt")["pixel_values"]
# Reshape from [T, C, H, W] to [B, C, T, H, W]
video = video_pixel_values.unsqueeze(0).permute(0, 2, 1, 3, 4).to("cuda")

# Build patch_positions for temporal sampling: [B, num_frames * frame_tokens, 3]
frame_pos = torch.linspace(0, target_frames - 1, num_frames).long().cuda()  # [T]
grid_h, grid_w = video.shape[-2] // patch_size, video.shape[-1] // patch_size  # patch grid
frame_tokens = grid_h * grid_w

t_positions = frame_pos[:, None].repeat(1, frame_tokens).reshape(-1)  # [T * frame_tokens]
h_positions = torch.arange(grid_h, device="cuda").repeat_interleave(grid_w)
h_positions = h_positions.repeat(num_frames)  # [T * frame_tokens]
w_positions = torch.arange(grid_w, device="cuda").repeat(grid_h)
w_positions = w_positions.repeat(num_frames)  # [T * frame_tokens]

patch_positions = torch.stack([t_positions, h_positions, w_positions], dim=-1).unsqueeze(0)
# patch_positions example (256 tokens per frame, 16x16 patch grid):
#   Each row is [t, h, w].
#   First 4 patches of frame 0 (t=0):
#     patch_positions[0, 0:4, :] -> [[0, 0, 0], [0, 0, 1], [0, 0, 2], [0, 0, 3]]
#   First 4 patches of frame 1 (t=4):
#     patch_positions[0, 256:260, :] -> [[4, 0, 0], [4, 0, 1], [4, 0, 2], [4, 0, 3]]

with torch.no_grad():
    outputs = model(video, patch_positions=patch_positions)
```
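
The shape bookkeeping in the video example can be checked without a GPU or any tensor library. The pure-Python sketch below recomputes the patch grid and the `[t, h, w]` rows. The function names `patch_grid` and `build_patch_positions` are illustrative and not part of the library. The temporal indices use integer floor division in place of `torch.linspace(...).long()`, a choice made here to keep the arithmetic exact.

```python
from typing import List, Tuple


def patch_grid(height: int, width: int, patch_size: int = 14) -> Tuple[int, int]:
    """Return (grid_h, grid_w), the number of whole patches along each side."""
    return height // patch_size, width // patch_size


def build_patch_positions(
    num_frames: int, target_frames: int, grid_h: int, grid_w: int
) -> List[Tuple[int, int, int]]:
    """List (t, h, w) for every patch, frame by frame in row-major order."""
    if num_frames == 1:
        frame_pos = [0]
    else:
        # Spread num_frames indices evenly over [0, target_frames - 1].
        frame_pos = [
            (i * (target_frames - 1)) // (num_frames - 1) for i in range(num_frames)
        ]
    return [
        (t, h, w)
        for t in frame_pos
        for h in range(grid_h)
        for w in range(grid_w)
    ]


grid_h, grid_w = patch_grid(224, 224)          # 16 patches per side
tokens_per_frame = grid_h * grid_w             # 256, matching the note on inputs
positions = build_patch_positions(16, 64, grid_h, grid_w)
```

With these inputs, `positions[256:260]` holds the first four patches of frame 1 at `t=4`, matching the comment in the example above. A 448×448 image gives a 32×32 grid, or 1024 tokens.
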

### Loading from Source Code

```bash
git clone https://github.com/EvolvingLMMs-Lab/OneVision-Encoder.git
cd OneVision-Encoder
pip install -e .
```

```python
from onevision_encoder import OneVisionEncoderModel, OneVisionEncoderConfig
from transformers import AutoImageProcessor

model = OneVisionEncoderModel.from_pretrained(
    "lmms-lab-encoder/onevision-encoder-large-lang",
    trust_remote_code=True,
    attn_implementation="flash_attention_2"
).to("cuda").eval()

preprocessor = AutoImageProcessor.from_pretrained(
    "lmms-lab-encoder/onevision-encoder-large-lang",
    trust_remote_code=True
)
```

### LMM Probe Results

The probe is trained on a mixed dataset of 740K samples from LLaVA-OneVision and 800K samples from LLaVA-Video SFT. The training pipeline proceeds directly to Stage 2 fine-tuning.

We adopt a streamlined **native-resolution strategy** inspired by LLaVA-OneVision: when the input frame resolution matches the model's native input size, the frame is fed **directly**, without tiling or cropping, to evaluate the ViT's ability to handle **true native resolution** and **arbitrary frame sequences**.

<p align="center">
<picture>
<source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/probe_lmm_github_dark_fixed.png">
<source media="(prefers-color-scheme: light)" srcset="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/probe_lmm_github_light.png">
<img alt="LMM Probe Results" src="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/probe_lmm_github_light.png" width="800" style="max-width: 100%;">
</picture>
</p>

### Model Card

| Property | Value |
| --- | --- |
| **Model Type** | **LLM-Aligned** Vision Transformer (ViT) |
| **Architecture** | **HEVC-Style** / Codec-Like Vision Transformer |
| **Input Paradigm** | **Codec-Style** (Sparse Patch / Dense Frame) |
| **Resolution Strategy** | **True Native Resolution** (Dynamic, No Tiling) |
| **Temporal Context** | **Arbitrary Frame Count** (Variable Length Support) |
| **Hidden Size** | 1024 |
| **Intermediate Size** | 4096 |
| **Number of Layers** | 24 |
| **Number of Attention Heads** | 16 |
| **Patch Size** | 14 |
| **Positional Encoding** | 3D RoPE (4:6:6 split for T:H:W) |
| **Normalization** | Layer Normalization |
| **Activation Function** | GELU |
| **License** | Apache 2.0 |
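
As a worked example of the 4:6:6 positional split, the sketch below derives per-axis dimensions from the hidden size and head count in the table. It assumes the ratio is applied to the per-head dimension. The README does not say whether it is applied there or to the rotary frequency pairs, and `split_rope_dims` is a hypothetical helper, not a library function.

```python
from typing import Tuple


def split_rope_dims(
    head_dim: int, ratio: Tuple[int, int, int] = (4, 6, 6)
) -> Tuple[int, ...]:
    """Divide head_dim among (T, H, W) in proportion to `ratio`."""
    total = sum(ratio)
    if head_dim % total != 0:
        raise ValueError(f"head_dim={head_dim} is not divisible by {total}")
    unit = head_dim // total
    return tuple(r * unit for r in ratio)


hidden_size, num_heads = 1024, 16                 # values from the table above
head_dim = hidden_size // num_heads               # 64
t_dim, h_dim, w_dim = split_rope_dims(head_dim)   # 16, 24, 24 under this assumption
```

Under this assumption each axis receives an even number of dimensions, which rotary embeddings need because they rotate dimensions in pairs.
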