Commit bd417f2 by xiangan · verified · 1 Parent(s): 5486249

Create README.md

README.md ADDED (+149 -0)

# OneVision-Encoder

### Key Features

- **LLM-Aligned Architecture**: Unlike standard vision backbones, this model is optimized specifically for **Large Multimodal Models (LMMs)**, to improve feature alignment and downstream performance when connected to a language model.
- **True Native Resolution**: Accepts dynamic, **fully native resolution** inputs directly. Images and videos are processed at their original aspect ratios, with no tiling, cropping, padding, or resizing.
- **Arbitrary Frame Support**: Processes video inputs with **any number of frames** (variable length). Without a fixed-frame constraint, long-context video understanding is limited only by memory.
- **Codec-Style Input Processing**: Implements a "OneVision" mechanism that treats video like a codec stream: it **samples dense frames sparsely** (selecting important patches from many frames) rather than sampling sparse frames densely, as traditional approaches do.
- **3D Rotary Position Embedding**: Uses a 4:6:6 split across the temporal, height, and width dimensions to capture spatiotemporal relationships across arbitrary sequence lengths.

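To make "sampling dense frames sparsely" concrete, here is a minimal pure-Python sketch. It is not the model's actual selection logic, which this README does not describe. As a stand-in, it scores each patch by its absolute pixel change versus the previous frame (loosely analogous to a codec residual) and keeps the first frame in full like a key frame. The function name `select_patches` and its parameters are hypothetical.

```python
def select_patches(frames, patch_size, keep):
    """Keep the `keep` highest-scoring patches across all frames.

    frames: list of T frames, each an H x W list of lists of numbers.
    Returns a sorted list of (t, h, w) patch-grid coordinates.
    Assumption: importance = summed absolute change vs. the previous frame.
    """
    height, width = len(frames[0]), len(frames[0][0])
    grid_h, grid_w = height // patch_size, width // patch_size
    scored = []
    for t, frame in enumerate(frames):
        for gh in range(grid_h):
            for gw in range(grid_w):
                if t == 0:
                    # Assumption: treat frame 0 as a key frame and always prefer it.
                    score = float("inf")
                else:
                    prev = frames[t - 1]
                    score = sum(
                        abs(frame[y][x] - prev[y][x])
                        for y in range(gh * patch_size, (gh + 1) * patch_size)
                        for x in range(gw * patch_size, (gw + 1) * patch_size)
                    )
                scored.append((score, t, gh, gw))
    # Highest score first; Python's sort is stable, so ties keep scan order.
    scored.sort(key=lambda item: -item[0])
    return sorted((t, gh, gw) for _, t, gh, gw in scored[:keep])


# Toy clip: three 4x4 frames, 2x2 patches, so 4 patches per frame.
blank = [[0] * 4 for _ in range(4)]
frame1 = [row[:] for row in blank]
frame1[0][0] = 9          # change inside patch (h=0, w=0)
frame2 = [row[:] for row in frame1]
frame2[3][3] = 5          # change inside patch (h=1, w=1)

positions = select_patches([blank, frame1, frame2], patch_size=2, keep=6)
print(positions)
```

The returned `(t, h, w)` triples have the same shape as the rows of `patch_positions` used for video inference, which is how a sparse selection over many frames could be passed to a position-aware encoder.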
#### Downstream Tasks

- **Video benchmarks**: MVBench, VideoMME, Perception Test
- **Image understanding**: DocVQA, ChartQA, OCRBench
- **Action recognition**: SSv2, UCF101, Kinetics

### Quick Start

> [!IMPORTANT]
> **Transformers Version Compatibility:**
>
> - ✅ **`transformers==4.57.3`** (Recommended): Works with `AutoModel.from_pretrained()`
> - ⚠️ **`transformers>=5.0.0`**: Not currently supported. We are actively working on a fix.

> **Note on Inputs:**
> While the model is pre-trained with the configurations below, it supports **dynamic native resolution** and **arbitrary frame counts** during inference:
>
> - **Pre-training Image Base**: 448×448
> - **Pre-training Video Base**: 224×224 (256 tokens/frame)
> - **Inference**: Supports variable resolutions and frame lengths.

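The 256 tokens/frame figure follows from the patch size of 14: a 224×224 frame yields a 16×16 patch grid. The small sketch below shows this arithmetic. The helper names are illustrative, and it uses floor division as the Quick Start snippet does; whether the model needs input sizes to be exact multiples of the patch size is not stated here.

```python
def patch_grid(height, width, patch_size=14):
    """Return (grid_h, grid_w), the number of patches along each axis."""
    return height // patch_size, width // patch_size


def tokens_per_frame(height, width, patch_size=14):
    """Number of patch tokens produced by a single frame or image."""
    grid_h, grid_w = patch_grid(height, width, patch_size)
    return grid_h * grid_w


video_tokens = tokens_per_frame(224, 224)   # pre-training video base
image_tokens = tokens_per_frame(448, 448)   # pre-training image base
clip_tokens = 16 * video_tokens             # a 16-frame clip at 224x224

print(video_tokens, image_tokens, clip_tokens)  # 256 1024 4096
```

Because the token count grows with both resolution and frame count, this is also a quick way to estimate sequence length before running inference on a long or high-resolution video.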
```python
from transformers import AutoModel, AutoImageProcessor
from PIL import Image
import torch

# Load model and preprocessor
model = AutoModel.from_pretrained(
    "lmms-lab-encoder/onevision-encoder-large",
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
).to("cuda").eval()

preprocessor = AutoImageProcessor.from_pretrained(
    "lmms-lab-encoder/onevision-encoder-large",
    trust_remote_code=True,
)

# Image inference: [B, C, H, W]
image = Image.open("path/to/your/image.jpg")  # Replace with your image path
pixel_values = preprocessor(images=image, return_tensors="pt")["pixel_values"].to("cuda")
with torch.no_grad():
    outputs = model(pixel_values)
# outputs.last_hidden_state: [B, num_patches, hidden_size]
# outputs.pooler_output: [B, hidden_size]

# Video inference: [B, C, T, H, W] with patch_positions
# num_frames frames are loaded and spread over a timeline of target_frames positions
num_frames, target_frames = 16, 64
patch_size = 14
# Load video frames and preprocess each frame (replace with your video frame paths)
frames = [Image.open(f"path/to/frame_{i}.jpg") for i in range(num_frames)]
video_pixel_values = preprocessor(images=frames, return_tensors="pt")["pixel_values"]
# Reshape from [T, C, H, W] to [B, C, T, H, W]
video = video_pixel_values.unsqueeze(0).permute(0, 2, 1, 3, 4).to("cuda")

# Build patch_positions for temporal sampling: [B, num_frames * frame_tokens, 3]
frame_pos = torch.linspace(0, target_frames - 1, num_frames).long().cuda()  # [T]
grid_h, grid_w = video.shape[-2] // patch_size, video.shape[-1] // patch_size  # patch grid
frame_tokens = grid_h * grid_w

t_positions = frame_pos[:, None].repeat(1, frame_tokens).reshape(-1)  # [T * frame_tokens]
h_positions = torch.arange(grid_h, device="cuda").repeat_interleave(grid_w)
h_positions = h_positions.repeat(num_frames)  # [T * frame_tokens]
w_positions = torch.arange(grid_w, device="cuda").repeat(grid_h)
w_positions = w_positions.repeat(num_frames)  # [T * frame_tokens]

patch_positions = torch.stack([t_positions, h_positions, w_positions], dim=-1).unsqueeze(0)
# patch_positions example (256 tokens per frame, 16x16 patch grid):
#   Each row is [t, h, w].
#   First 4 patches of frame 0 (t=0):
#     patch_positions[0, 0:4, :]     -> [[0, 0, 0], [0, 0, 1], [0, 0, 2], [0, 0, 3]]
#   First 4 patches of frame 1 (t=4):
#     patch_positions[0, 256:260, :] -> [[4, 0, 0], [4, 0, 1], [4, 0, 2], [4, 0, 3]]

with torch.no_grad():
    outputs = model(video, patch_positions=patch_positions)
```

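The `patch_positions` layout can be checked without a GPU or the model weights. The sketch below rebuilds the same `[t, h, w]` rows in pure Python. The function name is illustrative, and the integer arithmetic mirrors `torch.linspace(...).long()` only up to floating-point rounding, so treat it as a way to reason about the layout and not as a drop-in replacement.

```python
def build_patch_positions(num_frames, target_frames, grid_h, grid_w):
    """Return a flat list of [t, h, w] rows, ordered frame by frame, row by row."""
    if num_frames == 1:
        frame_pos = [0]
    else:
        # Spread the loaded frames evenly over positions 0 .. target_frames - 1.
        frame_pos = [
            (i * (target_frames - 1)) // (num_frames - 1) for i in range(num_frames)
        ]
    return [
        [t, h, w]
        for t in frame_pos
        for h in range(grid_h)
        for w in range(grid_w)
    ]


# Same settings as the Quick Start example: 16 frames on a 64-step timeline,
# 224x224 inputs with patch size 14, i.e. a 16x16 grid.
rows = build_patch_positions(num_frames=16, target_frames=64, grid_h=16, grid_w=16)

print(len(rows))        # 4096
print(rows[0:4])        # [[0, 0, 0], [0, 0, 1], [0, 0, 2], [0, 0, 3]]
print(rows[256:260])    # [[4, 0, 0], [4, 0, 1], [4, 0, 2], [4, 0, 3]]
```

The temporal index of the second frame is 4, not 1, because the 16 loaded frames are placed on a 64-position timeline. This lets sparsely sampled frames keep their original temporal spacing.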
### Loading from Source Code

```bash
git clone https://github.com/EvolvingLMMs-Lab/OneVision-Encoder.git
cd OneVision-Encoder
pip install -e .
```

```python
from onevision_encoder import OneVisionEncoderModel, OneVisionEncoderConfig
from transformers import AutoImageProcessor

# Load the model class from the installed source package
model = OneVisionEncoderModel.from_pretrained(
    "lmms-lab-encoder/onevision-encoder-large-lang",
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
).to("cuda").eval()

# The image processor is still loaded through transformers
preprocessor = AutoImageProcessor.from_pretrained(
    "lmms-lab-encoder/onevision-encoder-large-lang",
    trust_remote_code=True,
)
```

### LMM Probe Results

The probe is trained on a mixed dataset of 740K samples from LLaVA-OneVision and 800K samples from LLaVA-Video SFT. The training pipeline proceeds directly to Stage 2 fine-tuning.

We adopt a streamlined **native-resolution strategy** inspired by LLaVA-OneVision: when the input frame resolution matches the model's native input size, the frame is fed **directly**, without tiling or cropping, to evaluate the ViT's ability to handle **true native resolution** and **arbitrary frame sequences**.

<p align="center">
  <picture>
    <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/probe_lmm_github_dark_fixed.png">
    <source media="(prefers-color-scheme: light)" srcset="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/probe_lmm_github_light.png">
    <img alt="LMM Probe Results" src="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/probe_lmm_github_light.png" width="800" style="max-width: 100%;">
  </picture>
</p>

### Model Card

| Property | Value |
| --- | --- |
| **Model Type** | **LLM-Aligned** Vision Transformer (ViT) |
| **Architecture** | **HEVC-Style** / Codec-Like Vision Transformer |
| **Input Paradigm** | **Codec-Style** (Sparse Patch / Dense Frame) |
| **Resolution Strategy** | **True Native Resolution** (Dynamic, No Tiling) |
| **Temporal Context** | **Arbitrary Frame Count** (Variable Length Support) |
| **Hidden Size** | 1024 |
| **Intermediate Size** | 4096 |
| **Number of Layers** | 24 |
| **Number of Attention Heads** | 16 |
| **Patch Size** | 14 |
| **Positional Encoding** | 3D RoPE (4:6:6 split for T:H:W) |
| **Normalization** | Layer Normalization |
| **Activation Function** | GELU |
| **License** | Apache 2.0 |

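A few derived quantities follow from the table. The sketch below computes the per-head dimension, one plausible reading of the 4:6:6 RoPE split, and a rough encoder size. It assumes the ratio is applied to the full per-head dimension (the actual implementation may apply it to the rotary half-dimension instead) and a standard ViT block with four attention projections and a two-layer MLP. Biases, norms, and embeddings are ignored, so the parameter count is an estimate and not an official figure.

```python
HIDDEN_SIZE = 1024
INTERMEDIATE_SIZE = 4096
NUM_LAYERS = 24
NUM_HEADS = 16

# Each attention head works on an equal slice of the hidden state.
head_dim = HIDDEN_SIZE // NUM_HEADS


def split_by_ratio(total, ratio=(4, 6, 6)):
    """Split `total` channels into parts proportional to `ratio` (T:H:W)."""
    unit, remainder = divmod(total, sum(ratio))
    if remainder:
        raise ValueError("total must be divisible by the sum of the ratio")
    return tuple(part * unit for part in ratio)


# Assumption: the 4:6:6 ratio divides the per-head dimension directly.
rope_dims = split_by_ratio(head_dim)

# Assumption: standard transformer block, weights only.
attn_params = 4 * HIDDEN_SIZE * HIDDEN_SIZE        # Q, K, V, and output projections
mlp_params = 2 * HIDDEN_SIZE * INTERMEDIATE_SIZE   # up and down projections
encoder_params = NUM_LAYERS * (attn_params + mlp_params)

print(head_dim, rope_dims, f"{encoder_params / 1e6:.0f}M")  # 64 (16, 24, 24) 302M
```

Under these assumptions, the height and width axes each receive more rotary channels than the temporal axis, and the transformer blocks alone account for roughly 302M weights.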