Upload folder using huggingface_hub

- README.md +41 -60
- config.json +1 -1

README.md CHANGED
@@ -4,58 +4,48 @@ license: apache-2.0

### Model Card

-| Property
-|----------|-------|
-| **Model Type**
-| **Architecture**
-| **Hidden Size**
-| **Intermediate Size**
-| **Number of Layers**
-| **Number of Attention Heads** | 16
-| **Patch Size**
-| **Image Resolution**
-| **Video Resolution**
-| **Positional Encoding**
-| **Normalization**
-| **Activation Function**
-| **License**

### Key Features

- **Codec-Style Patch Selection**: Instead of sampling sparse frames densely (all patches from few frames), OneVision Encoder samples dense frames sparsely (important patches from many frames).
- **3D Rotary Position Embedding**: Uses a 4:6:6 split for temporal, height, and width dimensions to capture spatiotemporal relationships.
-- **Native Resolution Support**: Supports native resolution input without tiling or cropping.
-- **Flash Attention 2**: Efficient attention implementation for improved performance and memory efficiency.

### Intended Use

-#### Primary Use Cases
-
-- **Video Understanding**: Action recognition, video captioning, video question answering
-- **Image Understanding**: Document understanding (DocVQA), chart understanding (ChartQA), OCR tasks
-- **Vision-Language Models**: As the vision encoder backbone for multimodal large language models
-
#### Downstream Tasks

- Video benchmarks: MVBench, VideoMME, Perception Test
- Image understanding: DocVQA, ChartQA, OCRBench
- Action recognition: SSv2, UCF101, Kinetics
-

### Quick Start
-

> [!IMPORTANT]
> **Transformers Version Compatibility:**
->
> - ⚠️ **`transformers>=5.0.0`**: Not currently supported. We are actively working on a fix.
-

> **Note:** This model supports native resolution input. For optimal performance:
> - **Image**: 448×448 resolution (pre-trained)
> - **Video**: 224×224 resolution with 256 tokens per frame (pre-trained)
-

```python
from transformers import AutoModel, AutoImageProcessor
from PIL import Image
@@ -81,29 +71,39 @@ with torch.no_grad():

# outputs.last_hidden_state: [B, num_patches, hidden_size]
# outputs.pooler_output: [B, hidden_size]

-# Video inference: [B, C, T, H, W] with
-num_frames,
# Load video frames and preprocess each frame (replace with your video frame paths)
frames = [Image.open(f"path/to/frame_{i}.jpg") for i in range(num_frames)]
video_pixel_values = preprocessor(images=frames, return_tensors="pt")["pixel_values"]
# Reshape from [T, C, H, W] to [B, C, T, H, W]
video = video_pixel_values.unsqueeze(0).permute(0, 2, 1, 3, 4).to("cuda")

-# Build
-frame_pos = torch.linspace(0, target_frames - 1, num_frames).long().cuda()
-
-
-
-
-
-
-

with torch.no_grad():
-
```

-### Loading from Source Code

```bash
git clone https://github.com/EvolvingLMMs-Lab/OneVision-Encoder.git
```
@@ -136,22 +136,3 @@ Training on a mixed dataset of 740K samples from LLaVA-OneVision and 800K sample

<img alt="LMM Probe Results" src="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/probe_lmm_github_light.png" width="800" style="max-width: 100%;">
</picture>
</p>
-
-### Attentive Probe Results
-
-Performance comparison of different vision encoders using Attentive Probe evaluation. Models are evaluated using single clip input and trained for 10 epochs across 8 action recognition datasets. Results show average performance and per-dataset scores for 8-frame and 16-frame configurations.
-
-<p align="center">
-<picture>
-<source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/fix_00_probe_video_github_dark.png">
-<source media="(prefers-color-scheme: light)" srcset="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/fix_00_probe_video_github_light.png">
-<img alt="LMM Probe Results" src="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/probe_lmm_github_light.png" width="900" style="max-width: 100%;">
-</picture>
-</p>
-
-### Codec Input
-
-> **TODO:** Add codec-style input documentation for temporal saliency-based patch selection.
-
----
### Model Card

+| Property | Value |
+| ----------------------------- | --------------------------------- |
+| **Model Type** | Vision Transformer (ViT) |
+| **Architecture** | HEVC-Style Vision Transformer |
+| **Hidden Size** | 1024 |
+| **Intermediate Size** | 4096 |
+| **Number of Layers** | 24 |
+| **Number of Attention Heads** | 16 |
+| **Patch Size** | 14 |
+| **Image Resolution** | 448×448 (pre-trained) |
+| **Video Resolution** | 224×224 with 256 tokens per frame |
+| **Positional Encoding** | 3D RoPE (4:6:6 split for T:H:W) |
+| **Normalization** | Layer Normalization |
+| **Activation Function** | GELU |
+| **License** | Apache 2.0 |

### Key Features

- **Codec-Style Patch Selection**: Instead of sampling sparse frames densely (all patches from few frames), OneVision Encoder samples dense frames sparsely (important patches from many frames).
- **3D Rotary Position Embedding**: Uses a 4:6:6 split for temporal, height, and width dimensions to capture spatiotemporal relationships.
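The 4:6:6 split is easiest to see in channel counts. Assuming (the card does not spell this out) that the ratio partitions each attention head's rotary frequency pairs — head_dim = 1024 / 16 = 64, i.e. 32 pairs — the split would allocate 8 pairs to time and 12 each to height and width. A hypothetical helper, not the model's actual code:

```python
# Hypothetical illustration of a 4:6:6 T:H:W split of RoPE frequency pairs.
# The real OneVision Encoder implementation may differ; the numbers follow
# the model card: hidden_size=1024, 16 heads -> head_dim=64 -> 32 rotary pairs.

def split_rope_pairs(head_dim: int, ratio=(4, 6, 6)):
    """Split head_dim // 2 rotary frequency pairs across (T, H, W) by ratio."""
    pairs = head_dim // 2          # each rotary pair covers 2 channels
    total = sum(ratio)
    dims = [pairs * r // total for r in ratio]
    dims[-1] += pairs - sum(dims)  # give any integer remainder to the last axis
    return tuple(dims)

print(split_rope_pairs(64))  # (8, 12, 12): 8 temporal, 12 height, 12 width pairs
```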
### Intended Use

#### Downstream Tasks

- Video benchmarks: MVBench, VideoMME, Perception Test
- Image understanding: DocVQA, ChartQA, OCRBench
- Action recognition: SSv2, UCF101, Kinetics

### Quick Start

> [!IMPORTANT]
> **Transformers Version Compatibility:**
+>
+> - ✅ **`transformers==4.53.1`** (Recommended): Works with `AutoModel.from_pretrained()`
> - ⚠️ **`transformers>=5.0.0`**: Not currently supported. We are actively working on a fix.

> **Note:** This model supports native resolution input. For optimal performance:
+>
> - **Image**: 448×448 resolution (pre-trained)
> - **Video**: 224×224 resolution with 256 tokens per frame (pre-trained)

```python
from transformers import AutoModel, AutoImageProcessor
from PIL import Image
# outputs.last_hidden_state: [B, num_patches, hidden_size]
# outputs.pooler_output: [B, hidden_size]

+# Video inference: [B, C, T, H, W] with patch_positions
+num_frames, target_frames = 16, 64
+patch_size = 14
# Load video frames and preprocess each frame (replace with your video frame paths)
frames = [Image.open(f"path/to/frame_{i}.jpg") for i in range(num_frames)]
video_pixel_values = preprocessor(images=frames, return_tensors="pt")["pixel_values"]
# Reshape from [T, C, H, W] to [B, C, T, H, W]
video = video_pixel_values.unsqueeze(0).permute(0, 2, 1, 3, 4).to("cuda")

+# Build patch_positions for temporal sampling: [B, num_frames * frame_tokens, 3]
+frame_pos = torch.linspace(0, target_frames - 1, num_frames).long().cuda()  # [T]
+grid_h, grid_w = video.shape[-2] // patch_size, video.shape[-1] // patch_size  # patch grid
+frame_tokens = grid_h * grid_w
+
+t_positions = frame_pos[:, None].repeat(1, frame_tokens).reshape(-1)  # [T * frame_tokens]
+h_positions = torch.arange(grid_h, device="cuda").repeat_interleave(grid_w)
+h_positions = h_positions.repeat(num_frames)  # [T * frame_tokens]
+w_positions = torch.arange(grid_w, device="cuda").repeat(grid_h)
+w_positions = w_positions.repeat(num_frames)  # [T * frame_tokens]
+
+patch_positions = torch.stack([t_positions, h_positions, w_positions], dim=-1).unsqueeze(0)
+# patch_positions example (256 tokens per frame, 16x16 patch grid):
+# Each row is [t, h, w].
+# First 4 patches of frame 0 (t=0):
+# patch_positions[0, 0:4, :] -> [[0, 0, 0], [0, 0, 1], [0, 0, 2], [0, 0, 3]]
+# First 4 patches of frame 1 (t=4):
+# patch_positions[0, 256:260, :] -> [[4, 0, 0], [4, 0, 1], [4, 0, 2], [4, 0, 3]]

with torch.no_grad():
+    outputs = model(video, patch_positions=patch_positions)
```
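The position layout built above can be sanity-checked without torch or a GPU. This standalone sketch re-derives the same [t, h, w] ordering in plain Python, using the snippet's values (16 frames spread over 64 temporal slots, 224×224 input, patch size 14); it uses the same even temporal spacing, though torch's float32 `linspace` could round a few slots differently:

```python
# Pure-Python re-derivation of the patch_positions layout from the snippet
# above (no torch/CUDA needed). Values mirror the example: 16 frames sampled
# into 64 temporal slots, 224x224 input, patch size 14 -> 16x16 grid.

num_frames, target_frames = 16, 64
grid_h = grid_w = 224 // 14              # 16x16 patch grid
frame_tokens = grid_h * grid_w           # 256 tokens per frame

# Even spacing over [0, target_frames - 1], truncated to integers:
frame_pos = [int(i * (target_frames - 1) / (num_frames - 1)) for i in range(num_frames)]

positions = [
    [t, h, w]
    for t in frame_pos
    for h in range(grid_h)
    for w in range(grid_w)
]

print(positions[:4])       # [[0, 0, 0], [0, 0, 1], [0, 0, 2], [0, 0, 3]]
print(positions[256:258])  # [[4, 0, 0], [4, 0, 1]]  (frame 1 lands at t=4)
print(len(positions))      # 4096 = 16 frames * 256 tokens per frame
```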

+### Loading from Source Code

```bash
git clone https://github.com/EvolvingLMMs-Lab/OneVision-Encoder.git
```

<img alt="LMM Probe Results" src="https://raw.githubusercontent.com/anxiangsir/asset/main/OneVision/probe_lmm_github_light.png" width="800" style="max-width: 100%;">
</picture>
</p>
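Codec-style patch selection, per the Key Features above, keeps important patches from many frames rather than all patches from a few. The card's codec input documentation is still marked TODO, so the following is only a toy illustration of the idea (rank patches by inter-frame change and keep a budget), not the model's actual selection rule:

```python
# Toy illustration of codec-style patch selection: keep the patches that
# change most between consecutive frames ("dense frames, sparse patches").
# This is NOT the OneVision Encoder algorithm (its codec input docs are
# still a TODO); it only sketches the idea in plain Python.

def select_salient_patches(frames, budget):
    """frames: per-frame lists of patches (each patch a list of floats).
    Returns (frame_idx, patch_idx) pairs with the largest temporal change."""
    scored = []
    for t in range(1, len(frames)):
        for p, (cur, prev) in enumerate(zip(frames[t], frames[t - 1])):
            change = sum(abs(a - b) for a, b in zip(cur, prev))
            scored.append((change, t, p))
    scored.sort(reverse=True)
    return [(t, p) for _, t, p in scored[:budget]]

# 3 frames x 2 patches; only patch 1 changes, and most between frames 1 -> 2.
frames = [
    [[0.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.5, 0.5]],
    [[0.0, 0.0], [2.0, 2.0]],
]
print(select_salient_patches(frames, budget=2))  # [(2, 1), (1, 1)]
```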
config.json CHANGED

@@ -18,7 +18,7 @@

"patch_size": 14,
"rope_theta": 10000.0,
"rope_temporal_size": 64,
-"transformers_version": "4.57.
+"transformers_version": "4.57.3",
"use_head": true,
"auto_map": {
"AutoConfig": "configuration_onevision_encoder.OneVisionEncoderConfig",
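A few shape sanity checks fall out of the config fragment above. Only `patch_size`, `rope_theta`, and `rope_temporal_size` appear in the fragment; `hidden_size` and `num_attention_heads` below are taken from the model card table, so treat the merged dict as an assumption:

```python
import json

# Sanity-check shapes derived from the config fragment above. hidden_size and
# num_attention_heads are not in the fragment; they come from the model card.
config = json.loads("""
{
  "hidden_size": 1024,
  "num_attention_heads": 16,
  "patch_size": 14,
  "rope_theta": 10000.0,
  "rope_temporal_size": 64
}
""")

head_dim = config["hidden_size"] // config["num_attention_heads"]
tokens_per_video_frame = (224 // config["patch_size"]) ** 2
tokens_per_image = (448 // config["patch_size"]) ** 2

print(head_dim)                # 64
print(tokens_per_video_frame)  # 256 (matches "256 tokens per frame")
print(tokens_per_image)        # 1024 patches for a 448x448 image
```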