xiangan committed
Commit ecba030 · verified · 1 Parent(s): a083ac9

Upload folder using huggingface_hub

Files changed (2):
  1. README.md +41 -60
  2. config.json +1 -1
README.md CHANGED
@@ -4,58 +4,48 @@ license: apache-2.0
 
 ### Model Card
 
- | Property | Value |
- |----------|-------|
- | **Model Type** | Vision Transformer (ViT) |
- | **Architecture** | HEVC-Style Vision Transformer |
- | **Hidden Size** | 1024 |
- | **Intermediate Size** | 4096 |
- | **Number of Layers** | 24 |
- | **Number of Attention Heads** | 16 |
- | **Patch Size** | 16 |
- | **Image Resolution** | 448×448 (pre-trained) |
- | **Video Resolution** | 224×224 with 256 tokens per frame |
- | **Positional Encoding** | 3D RoPE (4:6:6 split for T:H:W) |
- | **Normalization** | Layer Normalization |
- | **Activation Function** | GELU |
- | **License** | Apache 2.0 |
 
 ### Key Features
 
 - **Codec-Style Patch Selection**: Instead of sampling sparse frames densely (all patches from few frames), OneVision Encoder samples dense frames sparsely (important patches from many frames).
 - **3D Rotary Position Embedding**: Uses a 4:6:6 split for temporal, height, and width dimensions to capture spatiotemporal relationships.
- - **Native Resolution Support**: Supports native resolution input without tiling or cropping.
- - **Flash Attention 2**: Efficient attention implementation for improved performance and memory efficiency.
 
 ### Intended Use
 
- #### Primary Use Cases
-
- - **Video Understanding**: Action recognition, video captioning, video question answering
- - **Image Understanding**: Document understanding (DocVQA), chart understanding (ChartQA), OCR tasks
- - **Vision-Language Models**: As the vision encoder backbone for multimodal large language models
-
 #### Downstream Tasks
 
 - Video benchmarks: MVBench, VideoMME, Perception Test
 - Image understanding: DocVQA, ChartQA, OCRBench
 - Action recognition: SSv2, UCF101, Kinetics
 
-
 ### Quick Start
 
-
 > [!IMPORTANT]
 > **Transformers Version Compatibility:**
- > - ✅ **`transformers==4.53.1`** (Recommended): Works with `AutoModel.from_pretrained()`
 > - ⚠️ **`transformers>=5.0.0`**: Not currently supported. We are actively working on a fix.
 
-
 > **Note:** This model supports native resolution input. For optimal performance:
 > - **Image**: 448×448 resolution (pre-trained)
 > - **Video**: 224×224 resolution with 256 tokens per frame (pre-trained)
 
-
 ```python
 from transformers import AutoModel, AutoImageProcessor
 from PIL import Image
@@ -81,29 +71,39 @@ with torch.no_grad():
 # outputs.last_hidden_state: [B, num_patches, hidden_size]
 # outputs.pooler_output: [B, hidden_size]
 
- # Video inference: [B, C, T, H, W] with visible_indices
- num_frames, frame_tokens, target_frames = 16, 256, 64
 # Load video frames and preprocess each frame (replace with your video frame paths)
 frames = [Image.open(f"path/to/frame_{i}.jpg") for i in range(num_frames)]
 video_pixel_values = preprocessor(images=frames, return_tensors="pt")["pixel_values"]
 # Reshape from [T, C, H, W] to [B, C, T, H, W]
 video = video_pixel_values.unsqueeze(0).permute(0, 2, 1, 3, 4).to("cuda")
 
- # Build visible_indices for temporal sampling
- frame_pos = torch.linspace(0, target_frames - 1, num_frames).long().cuda()
- visible_indices = (frame_pos.unsqueeze(-1) * frame_tokens + torch.arange(frame_tokens).cuda()).reshape(1, -1)
- # visible_indices example (with 256 tokens per frame):
- # Frame 0 (pos=0): indices [0, 1, 2, ..., 255]
- # Frame 1 (pos=4): indices [1024, 1025, 1026, ..., 1279]
- # Frame 2 (pos=8): indices [2048, 2049, 2050, ..., 2303]
- # ...
- # Frame 15 (pos=63): indices [16128, 16129, ..., 16383]
 
 with torch.no_grad():
-     outputs = model(video, visible_indices=visible_indices)
 ```
 
- ### Loading from Source Code
 
 ```bash
 git clone https://github.com/EvolvingLMMs-Lab/OneVision-Encoder.git
@@ -136,22 +136,3 @@ Training on a mixed dataset of 740K samples from LLaVA-OneVision and 800K sample
 <img alt="LMM Probe Results" src="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/probe_lmm_github_light.png" width="800" style="max-width: 100%;">
 </picture>
 </p>
-
- ### Attentive Probe Results
-
- Performance comparison of different vision encoders using Attentive Probe evaluation. Models are evaluated using single clip input and trained for 10 epochs across 8 action recognition datasets. Results show average performance and per-dataset scores for 8-frame and 16-frame configurations.
-
- <p align="center">
- <picture>
- <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/fix_00_probe_video_github_dark.png">
- <source media="(prefers-color-scheme: light)" srcset="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/fix_00_probe_video_github_light.png">
- <img alt="LMM Probe Results" src="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/probe_lmm_github_light.png" width="900" style="max-width: 100%;">
- </picture>
- </p>
-
- ### Codec Input
-
- > **TODO:** Add codec-style input documentation for temporal saliency-based patch selection.
-
- ---
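The removed `visible_indices` construction above flattens each sampled (frame, patch) pair into a single index via `frame_pos * frame_tokens + token`. A minimal pure-Python sketch (assuming the same integer truncation as `torch.linspace(...).long()`) reproduces the index blocks quoted in the deleted comments:

```python
# Sketch of the removed visible_indices mapping, without torch.
# Sampled frame positions: linspace(0, target_frames - 1, num_frames), truncated.
num_frames, frame_tokens, target_frames = 16, 256, 64

frame_pos = [int(i * (target_frames - 1) / (num_frames - 1)) for i in range(num_frames)]
# Each sampled frame occupies a contiguous block of frame_tokens indices.
visible_starts = [p * frame_tokens for p in frame_pos]

print(frame_pos[:3], frame_pos[-1])            # [0, 4, 8] 63
print(visible_starts[:3], visible_starts[-1])  # [0, 1024, 2048] 16128
```

This matches the deleted comments: frame 1 sits at temporal position 4, so its 256 tokens start at index 1024; frame 15 sits at position 63, starting at 16128.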
 
 
 ### Model Card
 
+ | Property | Value |
+ | ----------------------------- | --------------------------------- |
+ | **Model Type** | Vision Transformer (ViT) |
+ | **Architecture** | HEVC-Style Vision Transformer |
+ | **Hidden Size** | 1024 |
+ | **Intermediate Size** | 4096 |
+ | **Number of Layers** | 24 |
+ | **Number of Attention Heads** | 16 |
+ | **Patch Size** | 14 |
+ | **Image Resolution** | 448×448 (pre-trained) |
+ | **Video Resolution** | 224×224 with 256 tokens per frame |
+ | **Positional Encoding** | 3D RoPE (4:6:6 split for T:H:W) |
+ | **Normalization** | Layer Normalization |
+ | **Activation Function** | GELU |
+ | **License** | Apache 2.0 |
 
 ### Key Features
 
 - **Codec-Style Patch Selection**: Instead of sampling sparse frames densely (all patches from few frames), OneVision Encoder samples dense frames sparsely (important patches from many frames).
 - **3D Rotary Position Embedding**: Uses a 4:6:6 split for temporal, height, and width dimensions to capture spatiotemporal relationships.
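The codec-style selection itself is not yet documented (the codec-input section is still marked TODO in this commit). As a purely hypothetical illustration of "important patches from many frames", the sketch below keeps the top-k patches by inter-frame residual energy; the function name and procedure are assumptions, not the model's actual algorithm:

```python
# Hypothetical sketch of codec-style patch selection (NOT the model's actual
# algorithm): keep the k patches with the largest inter-frame residual, so
# many frames each contribute only their salient patches.
def select_salient_patches(frames, k):
    """frames: list of T frames, each a list of patch feature vectors.
    Returns up to k (frame_idx, patch_idx) pairs, most salient first."""
    scores = []
    for t in range(1, len(frames)):
        for p, (cur, prev) in enumerate(zip(frames[t], frames[t - 1])):
            residual = sum((a - b) ** 2 for a, b in zip(cur, prev))
            scores.append((residual, t, p))
    scores.sort(reverse=True)
    return [(t, p) for _, t, p in scores[:k]]

# Toy example: 3 frames x 2 patches; only patch 1 changes at frame 2.
frames = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [5.0, 5.0]],
]
print(select_salient_patches(frames, 1))  # [(2, 1)]
```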
 
 
 
 ### Intended Use
 
 #### Downstream Tasks
 
 - Video benchmarks: MVBench, VideoMME, Perception Test
 - Image understanding: DocVQA, ChartQA, OCRBench
 - Action recognition: SSv2, UCF101, Kinetics
 
 ### Quick Start
 
 > [!IMPORTANT]
 > **Transformers Version Compatibility:**
+ >
+ > - ✅ **`transformers==4.53.1`** (Recommended): Works with `AutoModel.from_pretrained()`
 > - ⚠️ **`transformers>=5.0.0`**: Not currently supported. We are actively working on a fix.
 
 > **Note:** This model supports native resolution input. For optimal performance:
+ >
 > - **Image**: 448×448 resolution (pre-trained)
 > - **Video**: 224×224 resolution with 256 tokens per frame (pre-trained)
 
 ```python
 from transformers import AutoModel, AutoImageProcessor
 from PIL import Image
 
 # outputs.last_hidden_state: [B, num_patches, hidden_size]
 # outputs.pooler_output: [B, hidden_size]
 
+ # Video inference: [B, C, T, H, W] with patch_positions
+ num_frames, target_frames = 16, 64
+ patch_size = 14
 # Load video frames and preprocess each frame (replace with your video frame paths)
 frames = [Image.open(f"path/to/frame_{i}.jpg") for i in range(num_frames)]
 video_pixel_values = preprocessor(images=frames, return_tensors="pt")["pixel_values"]
 # Reshape from [T, C, H, W] to [B, C, T, H, W]
 video = video_pixel_values.unsqueeze(0).permute(0, 2, 1, 3, 4).to("cuda")
 
+ # Build patch_positions for temporal sampling: [B, num_frames * frame_tokens, 3]
+ frame_pos = torch.linspace(0, target_frames - 1, num_frames).long().cuda()  # [T]
+ grid_h, grid_w = video.shape[-2] // patch_size, video.shape[-1] // patch_size  # patch grid
+ frame_tokens = grid_h * grid_w
+
+ t_positions = frame_pos[:, None].repeat(1, frame_tokens).reshape(-1)  # [T * frame_tokens]
+ h_positions = torch.arange(grid_h, device="cuda").repeat_interleave(grid_w)
+ h_positions = h_positions.repeat(num_frames)  # [T * frame_tokens]
+ w_positions = torch.arange(grid_w, device="cuda").repeat(grid_h)
+ w_positions = w_positions.repeat(num_frames)  # [T * frame_tokens]
+
+ patch_positions = torch.stack([t_positions, h_positions, w_positions], dim=-1).unsqueeze(0)
+ # patch_positions example (256 tokens per frame, 16x16 patch grid):
+ # Each row is [t, h, w].
+ # First 4 patches of frame 0 (t=0):
+ # patch_positions[0, 0:4, :] -> [[0, 0, 0], [0, 0, 1], [0, 0, 2], [0, 0, 3]]
+ # First 4 patches of frame 1 (t=4):
+ # patch_positions[0, 256:260, :] -> [[4, 0, 0], [4, 0, 1], [4, 0, 2], [4, 0, 3]]
 
 with torch.no_grad():
+     outputs = model(video, patch_positions=patch_positions)
 ```
 
+ ### Loading from Source Code
 
 ```bash
 git clone https://github.com/EvolvingLMMs-Lab/OneVision-Encoder.git
 
 <img alt="LMM Probe Results" src="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/probe_lmm_github_light.png" width="800" style="max-width: 100%;">
 </picture>
 </p>
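The new `patch_positions` layout from the Quick Start can be sanity-checked without a GPU. A pure-Python sketch of the same [t, h, w] rows, assuming the pre-trained 224×224 video setting with patch size 14 (a 16×16 grid, 256 tokens per frame):

```python
# Pure-Python equivalent of the patch_positions layout (no torch/CUDA needed):
# one [t, h, w] row per visible patch, frames ordered by sampled temporal position.
num_frames, target_frames = 16, 64
grid_h = grid_w = 224 // 14  # 16x16 patch grid -> 256 tokens per frame

frame_pos = [int(i * (target_frames - 1) / (num_frames - 1)) for i in range(num_frames)]
patch_positions = [
    [t, h, w]
    for t in frame_pos
    for h in range(grid_h)
    for w in range(grid_w)
]

print(patch_positions[0:2])      # [[0, 0, 0], [0, 0, 1]]
print(patch_positions[256:258])  # [[4, 0, 0], [4, 0, 1]]
```

The first 256 rows carry t=0; the next block starts at row 256 with t=4, matching the comments in the diff.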
config.json CHANGED
@@ -18,7 +18,7 @@
 "patch_size": 14,
 "rope_theta": 10000.0,
 "rope_temporal_size": 64,
- "transformers_version": "4.57.1",
+ "transformers_version": "4.57.3",
 "use_head": true,
 "auto_map": {
 "AutoConfig": "configuration_onevision_encoder.OneVisionEncoderConfig",
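For intuition on the 4:6:6 RoPE split mentioned in the README's model card: with hidden size 1024 and 16 attention heads, each head has 64 channels, and a 4:6:6 ratio would allocate them 16:24:24 to T:H:W. The sketch below shows one plausible layout using the config's `rope_theta`; the actual channel layout lives in the repo's modeling code and may differ:

```python
# Sketch: partition a 64-dim head (hidden 1024 / 16 heads) by the 4:6:6
# T:H:W ratio used for 3D RoPE. The exact channel layout is an assumption.
head_dim, rope_theta = 64, 10000.0
ratio = (4, 6, 6)
unit = head_dim // sum(ratio)          # 64 / 16 = 4 channels per ratio unit
dims = tuple(r * unit for r in ratio)  # (16, 24, 24) for (T, H, W)

# Per-axis inverse frequencies over channel pairs, standard RoPE style.
inv_freq = {
    axis: [rope_theta ** (-2 * i / d) for i in range(d // 2)]
    for axis, d in zip("thw", dims)
}

print(dims)                # (16, 24, 24)
print(len(inv_freq["t"]))  # 8
```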