xiangan committed on
Commit 8924d74 · 1 Parent(s): dd2f703

Update README.md

Files changed (1)
  1. README.md +42 -1
README.md CHANGED
@@ -2,8 +2,49 @@
  license: apache-2.0
  ---
 
+ ### Model Details
 
- ### Quick Start
+ | Property | Value |
+ |----------|-------|
+ | **Model Type** | Vision Transformer (ViT) |
+ | **Architecture** | HEVC-Style Vision Transformer |
+ | **Hidden Size** | 1024 |
+ | **Intermediate Size** | 4096 |
+ | **Number of Layers** | 24 |
+ | **Number of Attention Heads** | 16 |
+ | **Patch Size** | 16 |
+ | **Image Resolution** | 448×448 (pre-trained) |
+ | **Video Resolution** | 224×224 with 256 tokens per frame |
+ | **Positional Encoding** | 3D RoPE (4:6:6 split for T:H:W) |
+ | **Normalization** | Layer Normalization |
+ | **Activation Function** | GELU |
+ | **Attention Implementation** | Flash Attention 2 |
+ | **License** | Apache 2.0 |
+
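As a rough cross-check of the table values, the sketch below derives a few quantities from them. It rests on two assumptions that the README does not state: patches are square and non-overlapping, and the 4:6:6 RoPE ratio is applied proportionally to the per-head dimension. The helper names `patch_grid` and `split_by_ratio` are hypothetical and are not part of the model's code.

```python
def patch_grid(height, width, patch_size):
    """Return (rows, cols, tokens) for non-overlapping square patches."""
    if height % patch_size or width % patch_size:
        raise ValueError("resolution must be divisible by the patch size")
    rows, cols = height // patch_size, width // patch_size
    return rows, cols, rows * cols


def split_by_ratio(total, ratio):
    """Split `total` channels proportionally to `ratio` (assumed T:H:W)."""
    unit, remainder = divmod(total, sum(ratio))
    if remainder:
        raise ValueError("total is not divisible by the ratio sum")
    return tuple(unit * part for part in ratio)


# Values copied from the Model Details table.
HIDDEN_SIZE = 1024
NUM_HEADS = 16
PATCH_SIZE = 16

head_dim = HIDDEN_SIZE // NUM_HEADS               # channels per attention head
image_grid = patch_grid(448, 448, PATCH_SIZE)     # pre-trained image resolution
frame_grid = patch_grid(224, 224, PATCH_SIZE)     # video frame resolution
rope_split = split_by_ratio(head_dim, (4, 6, 6))  # assumed per-head T:H:W channels

print(head_dim, image_grid, frame_grid, rope_split)
```

Under these assumptions a 448×448 image maps to a 28×28 grid of 784 patches, and each 64-channel head would devote 16, 24, and 24 channels to the temporal, height, and width axes. The same formula gives 196 patches for a 224×224 frame, so it does not by itself account for the 256 tokens per frame listed for video.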
+ ### Key Features
+
+ - **Codec-Style Patch Selection**: Instead of sampling sparse frames densely (all patches from few frames), OneVision Encoder samples dense frames sparsely (important patches from many frames).
+ - **3D Rotary Position Embedding**: Uses a 4:6:6 split for temporal, height, and width dimensions to capture spatiotemporal relationships.
+ - **Global Contrastive Learning**: Trained with a 2M concept bank for better-separated semantic clusters.
+ - **Native Resolution Support**: Supports native resolution input without tiling or cropping.
+ - **Flash Attention 2**: Efficient attention implementation for improved performance and memory efficiency.
+
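The "dense frames, sparse patches" idea from the first bullet can be illustrated with a toy token budget. The sketch below is not the OneVision Encoder selection rule, which the README does not specify. It assumes a hypothetical importance score, the mean absolute change of each patch relative to the previous frame (loosely analogous to a codec residual), and keeps the highest-scoring patches across all frames. The function names `patch_scores` and `select_patches` are made up for this example.

```python
from heapq import nlargest


def patch_scores(frames, patch_size):
    """Mean absolute change from the previous frame, per patch.

    frames: list of T frames, each a list of H rows of W numbers.
    Returns {(t, row, col): score}; the first frame scores 0 everywhere
    because it is compared with itself (an assumption of this sketch).
    """
    scores = {}
    for t, frame in enumerate(frames):
        prev = frames[t - 1] if t > 0 else frame
        rows = len(frame) // patch_size
        cols = len(frame[0]) // patch_size
        for r in range(rows):
            for c in range(cols):
                total = 0.0
                for y in range(r * patch_size, (r + 1) * patch_size):
                    for x in range(c * patch_size, (c + 1) * patch_size):
                        total += abs(frame[y][x] - prev[y][x])
                scores[(t, r, c)] = total / patch_size ** 2
    return scores


def select_patches(frames, patch_size, budget):
    """Keep the `budget` highest-scoring (t, row, col) patches, sorted."""
    scores = patch_scores(frames, patch_size)
    return sorted(nlargest(budget, scores, key=scores.get))


# Toy clip: 4 frames of 4x4 pixels with 2x2 patches. One patch lights up
# in frame 2 and goes dark again in frame 3.
frames = [[[0.0] * 4 for _ in range(4)] for _ in range(4)]
for y in range(2, 4):
    for x in range(0, 2):
        frames[2][y][x] = 1.0

selected = select_patches(frames, patch_size=2, budget=2)
print(selected)
```

With a budget of two tokens, the selection spans two different frames and skips every static patch, whereas a dense sampler with the same budget would spend both tokens on a single frame.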
+ ### Intended Use
+
+ #### Primary Use Cases
+
+ - **Video Understanding**: Action recognition, video captioning, video question answering
+ - **Image Understanding**: Document understanding (DocVQA), chart understanding (ChartQA), OCR tasks
+ - **Vision-Language Models**: As the vision encoder backbone for multimodal large language models
+
+ #### Downstream Tasks
+
+ - Video benchmarks: MVBench, VideoMME, Perception Test
+ - Image understanding: DocVQA, ChartQA, OCRBench
+ - Action recognition: SSv2, UCF101, Kinetics
+
+
+ ### Quick Start
 
  > **Note:** This model supports native resolution input. For optimal performance:
  > - **Image**: 448×448 resolution (pre-trained)