# Oculus 0.1 Architecture

## Overview
Oculus is a ~3.3B parameter multimodal vision-language model that combines DINOv3, SigLIP2, and LFM2.5-1.2B, designed for Apple Silicon using MLX.

## Architecture Components

### 1. DINOv3 Encoder (ViT-L/16)
- **Model**: DINOv3 ViT-L/16 (pretrained)
- **Parameters**: ~1.7B
- **Input**: 224×224 images
- **Output**: 197 tokens (1 CLS + 196 patches)
- **Patch Grid**: 14×14
- **Feature Dimension**: 1024D
- **Capabilities**: Universal vision backbone, dense prediction

### 2. SigLIP2 Encoder (SO400M)
- **Model**: SigLIP2 SO400M (pretrained)
- **Parameters**: ~400M
- **Input**: 384×384 images
- **Output**: 576 patch tokens
- **Patch Grid**: 24×24
- **Feature Dimension**: 1152D
- **Capabilities**: Vision-language understanding, fine-grained features

### 3. Feature Fusion
- **Method**: Concatenation
- **Input**: DINOv3 patches (1024D) + SigLIP2 patches (1152D)
- **Output**: 2176D per spatial location
- **Note**: SigLIP2 features are resampled from 24×24 to 14×14 to match the DINOv3 grid (see the sketch below)

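A minimal MLX sketch of the fusion step. The resampling method is not specified above, so nearest-neighbor sampling is assumed here; bilinear interpolation would work equally well:

```python
import mlx.core as mx

def fuse_features(dino_patches, siglip_patches):
    """Fuse DINOv3 and SigLIP2 patch features on a shared 14x14 grid.

    dino_patches:   (B, 196, 1024)  DINOv3 patch tokens (CLS removed)
    siglip_patches: (B, 576, 1152)  SigLIP2 patch tokens (24x24 grid)
    returns:        (B, 196, 2176)
    """
    B = siglip_patches.shape[0]
    sig = siglip_patches.reshape(B, 24, 24, 1152)
    # Nearest-neighbor resample of the 24x24 grid down to 14x14.
    idx = mx.array([round(i * 23 / 13) for i in range(14)])
    sig = mx.take(sig, idx, axis=1)
    sig = mx.take(sig, idx, axis=2)
    sig = sig.reshape(B, 196, 1152)
    # Channel-wise concatenation: 1024 + 1152 = 2176 per spatial location.
    return mx.concatenate([dino_patches, sig], axis=-1)
```
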
### 4. Vision-Language Projector
- **Type**: 2-layer MLP with GELU
- **Input**: 2176D
- **Hidden**: 4352D
- **Output**: 1536D (LFM2.5 embedding dimension)
- **Parameters**: ~16M (2176×4352 + 4352×1536 weights, plus biases)

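The projector is fully specified above, so it maps directly onto MLX; this is a sketch, and the module layout in the actual code base may differ:

```python
import mlx.core as mx
import mlx.nn as nn

class VisionProjector(nn.Module):
    """2-layer MLP with GELU: 2176 -> 4352 -> 1536."""

    def __init__(self, in_dim: int = 2176, hidden_dim: int = 4352, out_dim: int = 1536):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, out_dim)

    def __call__(self, x: mx.array) -> mx.array:
        return self.fc2(nn.gelu(self.fc1(x)))

# (B, 196, 2176) fused features -> (B, 196, 1536) LM-ready tokens
tokens = VisionProjector()(mx.random.normal((2, 196, 2176)))
```
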
### 5. LFM2.5-1.2B Language Model
- **Model**: LFM2.5-1.2B-Base (pretrained)
- **Parameters**: ~1.2B
- **Architecture**: Hybrid transformer (full-attention + convolution layers)
- **Embedding Dimension**: 1536D
- **Depth**: 16 layers
- **Attention Heads**: 24
- **Vocab Size**: 131,072
- **Context Length**: 32,768 tokens
- **Why LFM2.5**: ~3× faster training and ~2× faster inference than Qwen3 on CPU

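The document does not spell out how the projected vision tokens enter LFM2.5; a common LLaVA-style scheme, assumed here purely for illustration, prefixes them to the text embeddings:

```python
import mlx.core as mx

def build_multimodal_inputs(vision_tokens: mx.array, text_embeddings: mx.array) -> mx.array:
    """Prefix projected vision tokens to the text token embeddings.

    vision_tokens:   (B, 196, 1536)  projector output
    text_embeddings: (B, T, 1536)    LFM2.5 embedding lookup of input_ids
    returns:         (B, 196 + T, 1536), consumed by the LM in place of
                     plain token embeddings (hypothetical integration point)
    """
    return mx.concatenate([vision_tokens, text_embeddings], axis=1)
```
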
### 6. Task-Specific Heads

#### Segmentation Head
- **Type**: MLP
- **Input**: 2176D
- **Hidden**: 256D
- **Output**: num_classes (e.g., 150 for ADE20K)
- **Output Shape**: (batch, 14, 14, num_classes) (see the sketch below)

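A sketch of the segmentation head under the dimensions listed above; the hidden activation is not stated, so GELU is assumed:

```python
import mlx.core as mx
import mlx.nn as nn

class SegmentationHead(nn.Module):
    """MLP head: 2176 -> 256 -> num_classes, reshaped onto the 14x14 grid."""

    def __init__(self, num_classes: int = 150, in_dim: int = 2176, hidden_dim: int = 256):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, num_classes)

    def __call__(self, fused: mx.array) -> mx.array:
        # fused: (B, 196, 2176) -> logits: (B, 14, 14, num_classes)
        x = self.fc2(nn.gelu(self.fc1(fused)))
        return x.reshape(x.shape[0], 14, 14, -1)
```
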
#### Classification Head
- **Type**: MLP
- **Input**: 2176D
- **Hidden**: 256D
- **Output**: num_classes (e.g., 1000 for ImageNet)
- **Uses**: CLS token from fused features

#### Detection Head
- **Type**: MLP
- **Input**: 2176D
- **Hidden**: 256D
- **Outputs** (see the sketch below):
  - Class logits: (batch, 196, anchors, num_classes)
  - Box predictions: (batch, 196, anchors, 4)

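The detection head follows the same pattern, with two parallel output projections reshaped per anchor. In this sketch the anchor count of 9 and 80 classes are taken from the Input/Output Shapes table below, and the GELU activation is an assumption:

```python
import mlx.core as mx
import mlx.nn as nn

class DetectionHead(nn.Module):
    """MLP head emitting per-patch, per-anchor class logits and box offsets."""

    def __init__(self, num_classes: int = 80, num_anchors: int = 9,
                 in_dim: int = 2176, hidden_dim: int = 256):
        super().__init__()
        self.fc = nn.Linear(in_dim, hidden_dim)
        self.cls_proj = nn.Linear(hidden_dim, num_anchors * num_classes)
        self.box_proj = nn.Linear(hidden_dim, num_anchors * 4)
        self.num_anchors = num_anchors
        self.num_classes = num_classes

    def __call__(self, fused: mx.array):
        # fused: (B, 196, 2176)
        h = nn.gelu(self.fc(fused))
        B, N, _ = h.shape
        cls = self.cls_proj(h).reshape(B, N, self.num_anchors, self.num_classes)
        box = self.box_proj(h).reshape(B, N, self.num_anchors, 4)
        return cls, box  # (B, 196, 9, 80), (B, 196, 9, 4)
```
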
#### OCR Head
- **Type**: CNN + MLP
- **Input**: 2176D
- **Outputs**:
  - Text logits: (batch, 14, 14, max_seq_len)
  - Geometry: (batch, 196, 4) as [x, y, w, h]

## Model Flow

```
Image 1 (224×224) ──→ DINOv3 Encoder ──→ 196 patches (14×14), 1024D ──┐
                                                                      │
Image 2 (384×384) ──→ SigLIP2 Encoder ──→ 576 patches (24×24), 1152D │
                                │                                     │
                       Resample to 14×14                              │
                                │                                     │
                                └────────── Concatenate ←─────────────┘
                                                 │
                                      2176D fused features
                                                 │
      ┌────────────────┬────────────────┬────────┴────────┬───────────────────┐
      ↓                ↓                ↓                 ↓                   ↓
Segmentation     Classification     Detection           OCR          Vision Projector
    Head              Head            Head              Head              (MLP)
      ↓                ↓                ↓                 ↓                   ↓
(14×14, classes)  (class_id)   (boxes + classes)  (text + geometry)  1536D embeddings
                                                                             ↓
                                                                      LFM2.5 LM (1.2B)
                                                                             ↓
                                                                       Generated Text
                                                                       (Caption / VQA)
```

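Putting the flow into code, a hypothetical top-level forward pass might look as follows; `model.dinov3`, `model.siglip2`, `model.lm.embed`, and the head attributes are illustrative names, not the actual API:

```python
import mlx.core as mx

def forward(model, image_224: mx.array, image_384: mx.array, input_ids: mx.array):
    dino = model.dinov3(image_224)[:, 1:, :]   # drop CLS -> (B, 196, 1024)
    sig = model.siglip2(image_384)             # (B, 576, 1152)
    fused = fuse_features(dino, sig)           # (B, 196, 2176), see Feature Fusion
    seg_logits = model.seg_head(fused)         # (B, 14, 14, num_classes)
    vis_tokens = model.projector(fused)        # (B, 196, 1536)
    txt_embeds = model.lm.embed(input_ids)     # (B, T, 1536)
    lm_logits = model.lm(mx.concatenate([vis_tokens, txt_embeds], axis=1))
    return seg_logits, lm_logits               # other heads follow the same pattern
```
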
## Parameter Count

| Component | Parameters |
|-----------|------------|
| DINOv3 Encoder | 1,700,000,000 |
| SigLIP2 Encoder | 400,000,000 |
| Projector | ~16,200,000 |
| LFM2.5 Language Model | 1,200,000,000 |
| Segmentation Head | 500,000 |
| Classification Head | 300,000 |
| Detection Head | 500,000 |
| OCR Head | 300,000 |
| **Total** | **~3,317,800,000** |

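As a sanity check, the component counts can be summed directly; the projector figure follows from its stated 2176 → 4352 → 1536 dimensions:

```python
components = {
    "dinov3": 1_700_000_000,
    "siglip2": 400_000_000,
    "projector": 2176 * 4352 + 4352 + 4352 * 1536 + 1536,  # 16,160,512
    "lfm2.5": 1_200_000_000,
    "heads": 500_000 + 300_000 + 500_000 + 300_000,
}
print(f"{sum(components.values()):,}")  # 3,317,760,512 (~3.3B)
```
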
## Training Strategy

### Stage 1: Connector Pretraining
- **Freeze**: All vision encoders, LFM2.5
- **Train**: Projector only (see the sketch below)
- **Data**: Image-caption pairs (CC3M, LAION)
- **Goal**: Align vision and language representations
- **Batch Size**: 8-16
- **Learning Rate**: 1e-3

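In MLX, the Stage 1 setup can be expressed with `freeze()`/`unfreeze()` on the submodules; the attribute names below are illustrative:

```python
import mlx.optimizers as optim

# Stage 1: only the projector receives gradients.
model.dinov3.freeze()
model.siglip2.freeze()
model.lm.freeze()
model.projector.unfreeze()

optimizer = optim.AdamW(learning_rate=1e-3)
# mlx.nn.value_and_grad(model, loss_fn) will then compute gradients only
# for the unfrozen projector parameters; Stages 2 and 3 just change which
# submodules are frozen and the learning rates.
```
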
### Stage 2: Head Training
- **Freeze**: Encoders, LFM2.5, Projector
- **Train**: Task heads only
- **Data**: Task-specific datasets
- **Goal**: Learn task-specific heads
- **Batch Size**: 8-16
- **Learning Rate**: 1e-3

### Stage 3: Full Fine-tuning
- **Freeze**: None
- **Train**: All components
- **Data**: Multi-task mix or a single target task
- **Goal**: End-to-end optimization
- **Learning Rate**: 1e-5 (encoders), 1e-4 (heads)

## Memory Requirements

| Mode | Memory |
|------|--------|
| Inference | ~10 GB |
| Training (frozen encoders) | ~12 GB |
| Training (full) | ~30 GB |

## Why LFM2.5?

- **~3× faster training** than Qwen3 on CPU
- **~2× faster decode/prefill** on CPU
- **Optimized for edge**: runs in under 1 GB of memory
- **Native MLX support**
- **Hybrid architecture**: mix of attention and convolution layers

## Comparison with Alternatives

| Aspect | Oculus (LFM2.5) | Oculus (Qwen2) |
|--------|-----------------|----------------|
| LM Parameters | 1.2B | 1.5B |
| Training Speed | ~3× faster | Baseline |
| Inference Speed | ~2× faster | Baseline |
| MLX Support | Native | Via mlx-lm |
| Edge Performance | Excellent | Good |

## Supported Tasks

| Task | Input | Output |
|------|-------|--------|
| Captioning | Image + prompt | Generated text |
| VQA | Image + question | Answer text |
| Segmentation | Image | Class per 14×14 grid cell |
| Classification | Image | Class label |
| Detection | Image | Boxes + classes |
| OCR | Image | Text + bounding boxes |
| Feature Extraction | Image | 2176D features |

## Input/Output Shapes

| Input | Shape |
|-------|-------|
| DINOv3 Image | (B, 3, 224, 224) |
| SigLIP2 Image | (B, 3, 384, 384) |
| Input IDs | (B, seq_len) |

| Output | Shape |
|--------|-------|
| Generated Text | (B, seq_len + new_tokens) |
| Segmentation | (B, 14, 14) |
| Classification | (B,) |
| Detection | (B, 196, 9, 80), (B, 196, 9, 4) |
| OCR Text | (B, 14, 14, max_seq_len) |