Upload folder using huggingface_hub

- README.md +41 -60
- config.json +1 -1

README.md CHANGED
@@ -4,58 +4,48 @@ license: apache-2.0

### Model Card

-| Property
-|----------|-------|
-| **Model Type**
-| **Architecture**
-| **Hidden Size**
-| **Intermediate Size**
-| **Number of Layers**
-| **Number of Attention Heads** | 16
-| **Patch Size**
-| **Image Resolution**
-| **Video Resolution**
-| **Positional Encoding**
-| **Normalization**
-| **Activation Function**
-| **License**

### Key Features

- **Codec-Style Patch Selection**: Instead of sampling sparse frames densely (all patches from few frames), OneVision Encoder samples dense frames sparsely (important patches from many frames).
- **3D Rotary Position Embedding**: Uses a 4:6:6 split for temporal, height, and width dimensions to capture spatiotemporal relationships.
-- **Native Resolution Support**: Supports native resolution input without tiling or cropping.
-- **Flash Attention 2**: Efficient attention implementation for improved performance and memory efficiency.

### Intended Use

-#### Primary Use Cases
-
-- **Video Understanding**: Action recognition, video captioning, video question answering
-- **Image Understanding**: Document understanding (DocVQA), chart understanding (ChartQA), OCR tasks
-- **Vision-Language Models**: As the vision encoder backbone for multimodal large language models
-
#### Downstream Tasks

- Video benchmarks: MVBench, VideoMME, Perception Test
- Image understanding: DocVQA, ChartQA, OCRBench
- Action recognition: SSv2, UCF101, Kinetics
-

### Quick Start
-

> [!IMPORTANT]
> **Transformers Version Compatibility:**
->
> - ⚠️ **`transformers>=5.0.0`**: Not currently supported. We are actively working on a fix.
-

> **Note:** This model supports native resolution input. For optimal performance:
> - **Image**: 448×448 resolution (pre-trained)
> - **Video**: 224×224 resolution with 256 tokens per frame (pre-trained)
-

```python
from transformers import AutoModel, AutoImageProcessor
from PIL import Image
@@ -81,29 +71,39 @@ with torch.no_grad():

# outputs.last_hidden_state: [B, num_patches, hidden_size]
# outputs.pooler_output: [B, hidden_size]

-# Video inference: [B, C, T, H, W] with
-num_frames,
# Load video frames and preprocess each frame (replace with your video frame paths)
frames = [Image.open(f"path/to/frame_{i}.jpg") for i in range(num_frames)]
video_pixel_values = preprocessor(images=frames, return_tensors="pt")["pixel_values"]
# Reshape from [T, C, H, W] to [B, C, T, H, W]
video = video_pixel_values.unsqueeze(0).permute(0, 2, 1, 3, 4).to("cuda")

-# Build
-frame_pos = torch.linspace(0, target_frames - 1, num_frames).long().cuda()
-
-
-
-
-
-
-

with torch.no_grad():
-
```

-### Loading from Source Code

```bash
git clone https://github.com/EvolvingLMMs-Lab/OneVision-Encoder.git
```
@@ -136,22 +136,3 @@ Training on a mixed dataset of 740K samples from LLaVA-OneVision and 800K sample

<img alt="LMM Probe Results" src="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/probe_lmm_github_light.png" width="800" style="max-width: 100%;">
</picture>
</p>
-
-### Attentive Probe Results
-
-Performance comparison of different vision encoders using Attentive Probe evaluation. Models are evaluated using single clip input and trained for 10 epochs across 8 action recognition datasets. Results show average performance and per-dataset scores for 8-frame and 16-frame configurations.
-
-<p align="center">
-<picture>
-<source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/fix_00_probe_video_github_dark.png">
-<source media="(prefers-color-scheme: light)" srcset="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/fix_00_probe_video_github_light.png">
-<img alt="LMM Probe Results" src="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/probe_lmm_github_light.png" width="900" style="max-width: 100%;">
-</picture>
-</p>
-
-### Codec Input
-
-> **TODO:** Add codec-style input documentation for temporal saliency-based patch selection.
-
----
### Model Card

+| Property | Value |
+| ----------------------------- | --------------------------------- |
+| **Model Type** | Vision Transformer (ViT) |
+| **Architecture** | HEVC-Style Vision Transformer |
+| **Hidden Size** | 1024 |
+| **Intermediate Size** | 4096 |
+| **Number of Layers** | 24 |
+| **Number of Attention Heads** | 16 |
+| **Patch Size** | 14 |
+| **Image Resolution** | 448×448 (pre-trained) |
+| **Video Resolution** | 224×224 with 256 tokens per frame |
+| **Positional Encoding** | 3D RoPE (4:6:6 split for T:H:W) |
+| **Normalization** | Layer Normalization |
+| **Activation Function** | GELU |
+| **License** | Apache 2.0 |

### Key Features

- **Codec-Style Patch Selection**: Instead of sampling sparse frames densely (all patches from few frames), OneVision Encoder samples dense frames sparsely (important patches from many frames).
- **3D Rotary Position Embedding**: Uses a 4:6:6 split for temporal, height, and width dimensions to capture spatiotemporal relationships.
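The 4:6:6 split is easiest to see in channel counts. Assuming (the card does not spell this out) that the ratio partitions each attention head's rotary frequency pairs — head_dim = 1024 / 16 = 64, i.e. 32 pairs — the split would allocate 8 pairs to time and 12 each to height and width. A hypothetical helper, not the model's actual code:

```python
# Hypothetical illustration of a 4:6:6 T:H:W split of RoPE frequency pairs.
# The real OneVision Encoder implementation may differ; the numbers follow
# the model card: hidden_size=1024, 16 heads -> head_dim=64 -> 32 rotary pairs.

def split_rope_pairs(head_dim: int, ratio=(4, 6, 6)):
    """Split head_dim // 2 rotary frequency pairs across (T, H, W) by ratio."""
    pairs = head_dim // 2          # each rotary pair covers 2 channels
    total = sum(ratio)
    dims = [pairs * r // total for r in ratio]
    dims[-1] += pairs - sum(dims)  # give any integer remainder to the last axis
    return tuple(dims)

print(split_rope_pairs(64))  # (8, 12, 12): 8 temporal, 12 height, 12 width pairs
```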
### Intended Use

#### Downstream Tasks

- Video benchmarks: MVBench, VideoMME, Perception Test
- Image understanding: DocVQA, ChartQA, OCRBench
- Action recognition: SSv2, UCF101, Kinetics

### Quick Start

> [!IMPORTANT]
> **Transformers Version Compatibility:**
+>
+> - ✅ **`transformers==4.53.1`** (Recommended): Works with `AutoModel.from_pretrained()`
> - ⚠️ **`transformers>=5.0.0`**: Not currently supported. We are actively working on a fix.

> **Note:** This model supports native resolution input. For optimal performance:
+>
> - **Image**: 448×448 resolution (pre-trained)
> - **Video**: 224×224 resolution with 256 tokens per frame (pre-trained)

```python
from transformers import AutoModel, AutoImageProcessor
from PIL import Image
# outputs.last_hidden_state: [B, num_patches, hidden_size]
# outputs.pooler_output: [B, hidden_size]

+# Video inference: [B, C, T, H, W] with patch_positions
+num_frames, target_frames = 16, 64
+patch_size = 14
# Load video frames and preprocess each frame (replace with your video frame paths)
frames = [Image.open(f"path/to/frame_{i}.jpg") for i in range(num_frames)]
video_pixel_values = preprocessor(images=frames, return_tensors="pt")["pixel_values"]
# Reshape from [T, C, H, W] to [B, C, T, H, W]
video = video_pixel_values.unsqueeze(0).permute(0, 2, 1, 3, 4).to("cuda")

+# Build patch_positions for temporal sampling: [B, num_frames * frame_tokens, 3]
+frame_pos = torch.linspace(0, target_frames - 1, num_frames).long().cuda()  # [T]
+grid_h, grid_w = video.shape[-2] // patch_size, video.shape[-1] // patch_size  # patch grid
+frame_tokens = grid_h * grid_w
+
+t_positions = frame_pos[:, None].repeat(1, frame_tokens).reshape(-1)  # [T * frame_tokens]
+h_positions = torch.arange(grid_h, device="cuda").repeat_interleave(grid_w)
+h_positions = h_positions.repeat(num_frames)  # [T * frame_tokens]
+w_positions = torch.arange(grid_w, device="cuda").repeat(grid_h)
+w_positions = w_positions.repeat(num_frames)  # [T * frame_tokens]
+
+patch_positions = torch.stack([t_positions, h_positions, w_positions], dim=-1).unsqueeze(0)
+# patch_positions example (256 tokens per frame, 16x16 patch grid):
+# Each row is [t, h, w].
+# First 4 patches of frame 0 (t=0):
+# patch_positions[0, 0:4, :] -> [[0, 0, 0], [0, 0, 1], [0, 0, 2], [0, 0, 3]]
+# First 4 patches of frame 1 (t=4):
+# patch_positions[0, 256:260, :] -> [[4, 0, 0], [4, 0, 1], [4, 0, 2], [4, 0, 3]]

with torch.no_grad():
+    outputs = model(video, patch_positions=patch_positions)
```
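The position layout built above can be sanity-checked without torch or a GPU. This standalone sketch re-derives the same [t, h, w] ordering in plain Python, using the snippet's values (16 frames spread over 64 temporal slots, 224×224 input, patch size 14); it uses the same even temporal spacing, though torch's float32 `linspace` could round a few slots differently:

```python
# Pure-Python re-derivation of the patch_positions layout from the snippet
# above (no torch/CUDA needed). Values mirror the example: 16 frames sampled
# into 64 temporal slots, 224x224 input, patch size 14 -> 16x16 grid.

num_frames, target_frames = 16, 64
grid_h = grid_w = 224 // 14              # 16x16 patch grid
frame_tokens = grid_h * grid_w           # 256 tokens per frame

# Even spacing over [0, target_frames - 1], truncated to integers:
frame_pos = [int(i * (target_frames - 1) / (num_frames - 1)) for i in range(num_frames)]

positions = [
    [t, h, w]
    for t in frame_pos
    for h in range(grid_h)
    for w in range(grid_w)
]

print(positions[:4])       # [[0, 0, 0], [0, 0, 1], [0, 0, 2], [0, 0, 3]]
print(positions[256:258])  # [[4, 0, 0], [4, 0, 1]]  (frame 1 lands at t=4)
print(len(positions))      # 4096 = 16 frames * 256 tokens per frame
```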

+### Loading from Source Code

```bash
git clone https://github.com/EvolvingLMMs-Lab/OneVision-Encoder.git
```

<img alt="LMM Probe Results" src="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/probe_lmm_github_light.png" width="800" style="max-width: 100%;">
</picture>
</p>
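Codec-style patch selection, per the Key Features above, keeps important patches from many frames rather than all patches from a few. The card's codec input documentation is still marked TODO, so the following is only a toy illustration of the idea (rank patches by inter-frame change and keep a budget), not the model's actual selection rule:

```python
# Toy illustration of codec-style patch selection: keep the patches that
# change most between consecutive frames ("dense frames, sparse patches").
# This is NOT the OneVision Encoder algorithm (its codec input docs are
# still a TODO); it only sketches the idea in plain Python.

def select_salient_patches(frames, budget):
    """frames: per-frame lists of patches (each patch a list of floats).
    Returns (frame_idx, patch_idx) pairs with the largest temporal change."""
    scored = []
    for t in range(1, len(frames)):
        for p, (cur, prev) in enumerate(zip(frames[t], frames[t - 1])):
            change = sum(abs(a - b) for a, b in zip(cur, prev))
            scored.append((change, t, p))
    scored.sort(reverse=True)
    return [(t, p) for _, t, p in scored[:budget]]

# 3 frames x 2 patches; only patch 1 changes, and most between frames 1 -> 2.
frames = [
    [[0.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.5, 0.5]],
    [[0.0, 0.0], [2.0, 2.0]],
]
print(select_salient_patches(frames, budget=2))  # [(2, 1), (1, 1)]
```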
config.json CHANGED

@@ -18,7 +18,7 @@

"patch_size": 14,
"rope_theta": 10000.0,
"rope_temporal_size": 64,
-"transformers_version": "4.57.
+"transformers_version": "4.57.3",
"use_head": true,
"auto_map": {
"AutoConfig": "configuration_onevision_encoder.OneVisionEncoderConfig",
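A few shape sanity checks fall out of the config fragment above. Only `patch_size`, `rope_theta`, and `rope_temporal_size` appear in the fragment; `hidden_size` and `num_attention_heads` below are taken from the model card table, so treat the merged dict as an assumption:

```python
import json

# Sanity-check shapes derived from the config fragment above. hidden_size and
# num_attention_heads are not in the fragment; they come from the model card.
config = json.loads("""
{
  "hidden_size": 1024,
  "num_attention_heads": 16,
  "patch_size": 14,
  "rope_theta": 10000.0,
  "rope_temporal_size": 64
}
""")

head_dim = config["hidden_size"] // config["num_attention_heads"]
tokens_per_video_frame = (224 // config["patch_size"]) ** 2
tokens_per_image = (448 // config["patch_size"]) ** 2

print(head_dim)                # 64
print(tokens_per_video_frame)  # 256 (matches "256 tokens per frame")
print(tokens_per_image)        # 1024 patches for a 448x448 image
```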