Update README.md

---
license: apache-2.0
---

### Model Details

| Property | Value |
|----------|-------|
| **Model Type** | Vision Transformer (ViT) |
| **Architecture** | HEVC-Style Vision Transformer |
| **Hidden Size** | 1024 |
| **Intermediate Size** | 4096 |
| **Number of Layers** | 24 |
| **Number of Attention Heads** | 16 |
| **Patch Size** | 16 |
| **Image Resolution** | 448×448 (pre-trained) |
| **Video Resolution** | 224×224 with 256 tokens per frame |
| **Positional Encoding** | 3D RoPE (4:6:6 split for T:H:W) |
| **Normalization** | Layer Normalization |
| **Activation Function** | GELU |
| **Attention Implementation** | Flash Attention 2 |
| **License** | Apache 2.0 |
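
For orientation, the table maps onto a configuration roughly like the sketch below. It is hand-written for illustration: `OneVisionEncoderConfig` and its field names are assumptions, not the repository's actual config class.

```python
from dataclasses import dataclass

@dataclass
class OneVisionEncoderConfig:
    """Illustrative config mirroring the Model Details table."""
    hidden_size: int = 1024
    intermediate_size: int = 4096
    num_hidden_layers: int = 24
    num_attention_heads: int = 16        # 1024 / 16 = 64 channels per head
    patch_size: int = 16
    image_size: int = 448                # pre-trained image resolution
    video_frame_size: int = 224          # 256 tokens per frame (per the table)
    rope_split: tuple = (4, 6, 6)        # T:H:W split of rotary channels
    hidden_act: str = "gelu"
    attn_implementation: str = "flash_attention_2"
```
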
### Key Features

- **Codec-Style Patch Selection**: Instead of sampling sparse frames densely (all patches from a few frames), OneVision Encoder samples dense frames sparsely (important patches from many frames); a sketch of the idea follows this list.
- **3D Rotary Position Embedding**: Uses a 4:6:6 split across the temporal, height, and width dimensions to capture spatiotemporal relationships; see the second sketch after this list.
- **Global Contrastive Learning**: Trained with a 2M concept bank for better-separated semantic clusters.
- **Native Resolution Support**: Accepts native-resolution input without tiling or cropping.
- **Flash Attention 2**: Efficient attention implementation for improved performance and memory efficiency.

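The codec-style selection can be pictured with a short sketch. This is a minimal illustration under assumed details, not the released implementation: the scoring rule here (frame-to-frame residual energy, loosely analogous to a video codec's residuals) and all names are hypothetical.

```python
import torch

def select_patches_codec_style(frames: torch.Tensor, keep_ratio: float = 0.25):
    """Keep the highest-scoring patches across many frames.

    frames: (T, C, H, W) clip, with H and W divisible by the patch size.
    Returns the kept patch vectors and their flat indices into (T * N).
    """
    T, C, H, W = frames.shape
    p = 16  # patch size from the Model Details table

    # Patchify: (T, C, H, W) -> (T, N, C * p * p), N patches per frame.
    patches = frames.unfold(2, p, p).unfold(3, p, p)
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(T, -1, C * p * p)

    # Score each patch by its residual against the previous frame; static
    # regions score low and can be dropped, moving regions survive.
    residual = patches - torch.roll(patches, shifts=1, dims=0)
    residual[0] = patches[0]  # first frame has no predecessor: use raw energy
    scores = residual.abs().mean(dim=-1).flatten()  # one score per (frame, patch)

    # Dense frames, sparse patches: keep the top fraction over the whole clip.
    k = int(keep_ratio * scores.numel())
    keep = scores.topk(k).indices
    return patches.reshape(-1, C * p * p)[keep], keep
```
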
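Similarly, the 4:6:6 positional split can be sketched as follows. This is a minimal illustration assuming a standard RoPE formulation; the released code's frequency base and application order may differ.

```python
import torch

def build_3d_rope(head_dim: int, t: int, h: int, w: int, theta: float = 10000.0):
    """cos/sin tables for a (t, h, w) token grid with a 4:6:6 channel split.

    Standard RoPE rotates channel pairs, so there are head_dim // 2
    frequencies to distribute across the three axes.
    """
    n_freq = head_dim // 2
    splits = [n_freq * r // 16 for r in (4, 6, 6)]  # e.g. 64-dim head -> 8:12:12
    splits[2] = n_freq - splits[0] - splits[1]      # absorb rounding

    def axis_table(n_pos: int, n: int) -> torch.Tensor:
        inv = theta ** (-torch.arange(n, dtype=torch.float32) / n)
        return torch.outer(torch.arange(n_pos, dtype=torch.float32), inv)

    ft = axis_table(t, splits[0])
    fh = axis_table(h, splits[1])
    fw = axis_table(w, splits[2])

    # Broadcast each axis table over the full grid, then concatenate along
    # channels: every token's angle depends on its (t, h, w) coordinates.
    angles = torch.cat([
        ft[:, None, None, :].expand(t, h, w, -1),
        fh[None, :, None, :].expand(t, h, w, -1),
        fw[None, None, :, :].expand(t, h, w, -1),
    ], dim=-1).reshape(t * h * w, n_freq)

    return angles.cos(), angles.sin()  # applied to q/k as in standard RoPE
```
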
### Intended Use
#### Primary Use Cases

- **Video Understanding**: Action recognition, video captioning, video question answering
- **Image Understanding**: Document understanding (DocVQA), chart understanding (ChartQA), OCR tasks
- **Vision-Language Models**: As the vision encoder backbone for multimodal large language models

#### Downstream Tasks

- Video benchmarks: MVBench, VideoMME, Perception Test
- Image understanding: DocVQA, ChartQA, OCRBench
- Action recognition: SSv2, UCF101, Kinetics

### Quick Start
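
A minimal loading sketch, assuming the checkpoint is served through `transformers` with custom code enabled; the repo id below is a placeholder, and the processor and output names may differ in the actual repository.

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

MODEL_ID = "your-org/onevision-encoder"  # placeholder, not the real repo id

processor = AutoImageProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,                   # Flash Attention 2 needs fp16/bf16
    attn_implementation="flash_attention_2",
    trust_remote_code=True,
).eval().to("cuda")

image = Image.open("example.jpg")
inputs = processor(images=image, return_tensors="pt").to("cuda")
with torch.no_grad():
    features = model(**inputs).last_hidden_state  # patch-level features
```
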
> **Note:** This model supports native resolution input. For optimal performance:
> - **Image**: 448×448 resolution (pre-trained)