# Oculus 0.1 Architecture

## Overview
Oculus is a ~3.3B-parameter multimodal vision-language model that combines DINOv3, SigLIP2, and LFM2.5-1.2B. It is designed for Apple Silicon and built on MLX.
## Architecture Components

### 1. DINOv3 Encoder (ViT-L/16)

- **Model**: DINOv3 ViT-L/16 (pretrained)
- **Parameters**: ~1.7B
- **Input**: 224×224 images
- **Output**: 197 tokens (1 CLS + 196 patches)
- **Patch Grid**: 14×14
- **Feature Dimension**: 1024D
- **Capabilities**: Universal vision backbone, dense prediction
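The token counts above follow directly from the patch size; a quick arithmetic check:

```python
# Token-count check for the DINOv3 branch (ViT-L/16 at 224x224).
image_size = 224
patch_size = 16                      # the "/16" in ViT-L/16
grid = image_size // patch_size      # 14 patches per side
num_patches = grid * grid            # 14 x 14 = 196 patch tokens
num_tokens = 1 + num_patches         # +1 CLS token
print(grid, num_patches, num_tokens)  # 14 196 197
```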
### 2. SigLIP2 Encoder (SO400M)

- **Model**: SigLIP2 SO400M (pretrained)
- **Parameters**: ~400M
- **Input**: 384×384 images
- **Output**: 576 patch tokens
- **Patch Grid**: 24×24
- **Feature Dimension**: 1152D
- **Capabilities**: Vision-language understanding, fine-grained features

### 3. Feature Fusion

- **Method**: Concatenation
- **Input**: DINOv3 patches (1024D) + SigLIP2 patches (1152D)
- **Output**: 2176D per spatial location
- **Note**: SigLIP2 features are resampled to 14×14 to match the DINOv3 grid
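A minimal NumPy sketch of the fusion step, standing in for the MLX implementation. The pooling rule is an assumption — any spatial resampling (adaptive average pooling, bilinear, area) would serve the same purpose of bringing the 24×24 SigLIP2 grid down to DINOv3's 14×14:

```python
import numpy as np

def adaptive_avg_pool(x, out):
    """Average-pool an (H, W, C) grid down to (out, out, C) with variable-size bins."""
    h, w, c = x.shape
    y = np.zeros((out, out, c), dtype=x.dtype)
    for i in range(out):
        r0, r1 = i * h // out, -(-(i + 1) * h // out)   # bin rows (ceil division)
        for j in range(out):
            c0, c1 = j * w // out, -(-(j + 1) * w // out)
            y[i, j] = x[r0:r1, c0:c1].mean(axis=(0, 1))
    return y

dino = np.random.randn(14, 14, 1024)     # DINOv3 patch features
siglip = np.random.randn(24, 24, 1152)   # SigLIP2 patch features
fused = np.concatenate([dino, adaptive_avg_pool(siglip, 14)], axis=-1)
print(fused.shape)  # (14, 14, 2176)
```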
### 4. Vision-Language Projector

- **Type**: 2-layer MLP with GELU
- **Input**: 2176D
- **Hidden**: 4352D
- **Output**: 1536D (LFM2.5 embedding dimension)
- **Parameters**: ~5M
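A shape-level sketch of the projector's forward pass, again using NumPy in place of MLX. Weights are random placeholders, and the tanh-approximation GELU is shown only for self-containment:

```python
import numpy as np

def gelu(x):
    # Tanh approximation of GELU, for illustration.
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

w1 = np.random.randn(2176, 4352) * 0.01   # 2176D fused -> 4352D hidden
b1 = np.zeros(4352)
w2 = np.random.randn(4352, 1536) * 0.01   # 4352D hidden -> 1536D LM embedding
b2 = np.zeros(1536)

tokens = np.random.randn(1, 196, 2176)    # fused vision tokens
out = gelu(tokens @ w1 + b1) @ w2 + b2    # projected into LM embedding space
print(out.shape)  # (1, 196, 1536)
```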
### 5. LFM2.5-1.2B Language Model

- **Model**: LFM2.5-1.2B-Base (pretrained)
- **Parameters**: ~1.2B
- **Architecture**: Hybrid transformer (full_attention + conv layers)
- **Embedding Dimension**: 1536D
- **Depth**: 16 layers
- **Attention Heads**: 24
- **Vocab Size**: 131072
- **Context Length**: 32768 tokens
- **Why LFM2.5**: 3x faster training, 2x faster inference than Qwen3 on CPU
### 6. Task-Specific Heads

#### Segmentation Head

- **Type**: MLP
- **Input**: 2176D
- **Hidden**: 256D
- **Output**: num_classes (e.g., 150 for ADE20K)
- **Output Shape**: (batch, 14, 14, num_classes)

#### Classification Head

- **Type**: MLP
- **Input**: 2176D
- **Hidden**: 256D
- **Output**: num_classes (e.g., 1000 for ImageNet)
- **Uses**: CLS token from fused features

#### Detection Head

- **Type**: MLP
- **Input**: 2176D
- **Hidden**: 256D
- **Outputs**:
  - Class logits: (batch, 196, anchors, num_classes)
  - Box predictions: (batch, 196, anchors, 4)

#### OCR Head

- **Type**: CNN + MLP
- **Input**: 2176D
- **Outputs**:
  - Text logits: (batch, 14, 14, max_seq_len)
  - Geometry: (batch, 196, 4) [x, y, w, h]
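The MLP heads all share the same shape-preserving pattern: a per-location 2-layer MLP over the 14×14 fused grid. A NumPy sketch of the segmentation head (random placeholder weights; ReLU stands in for whatever activation the real head uses, which the spec above doesn't state):

```python
import numpy as np

num_classes = 150  # e.g. ADE20K
w1 = np.random.randn(2176, 256) * 0.01       # 2176D fused -> 256D hidden
w2 = np.random.randn(256, num_classes) * 0.01

features = np.random.randn(2, 14, 14, 2176)  # (batch, H, W, C) fused grid
hidden = np.maximum(features @ w1, 0)        # ReLU stand-in
logits = hidden @ w2                         # per-location class logits
print(logits.shape)  # (2, 14, 14, 150)
```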
## Model Flow

```
Input Image 1 (224×224) ──► DINOv3 Encoder ──► 196 patches (14×14), 1024D each
Input Image 2 (384×384) ──► SigLIP2 Encoder ──► 576 patches (24×24), 1152D each
                                                        │
                                                Resample to 14×14
                                                        │
                 DINOv3 patches ──► Concatenate ◄───────┘
                                         │
                                  2176D fused features
                                         │
           ┌─────────────────┬───────────┴─────────┐
           ▼                 ▼                     ▼
    Detection Head        OCR Head      Vision Projector (MLP)
           │                 │                     │
  (boxes + classes)  (text + geometry)      1536D embeddings
                                                   │
                          ┌────────────────────────┼────────────────────────┐
                          ▼                        ▼                        ▼
                  Segmentation Head       Classification Head        LFM2.5 LM (1.2B)
                          │                        │                        │
                  (14×14, classes)            (class_id)        Text Output (Caption/VQA)
                          │                        │                        │
                          ▼                        ▼                        ▼
                Segmentation Predictions  Classification Predictions  Generated Text
```
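The flow above can be traced purely at the level of tensor shapes (no weights involved; symbolic only):

```python
# Symbolic shape walk of the forward pass in the diagram above.
B = 1
dino_out   = (B, 196, 1024)                        # 224/16 = 14 -> 14*14 patches
siglip_out = (B, 576, 1152)                        # 384/16 = 24 -> 24*24 patches
resampled  = (B, 196, siglip_out[2])               # 24x24 grid pooled to 14x14
fused      = (B, 196, dino_out[2] + resampled[2])  # concatenation -> 2176D
projected  = (B, 196, 1536)                        # projector -> LM embedding width
print(fused[2], projected[2])  # 2176 1536
```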
## Parameter Count

| Component | Parameters |
|-----------|------------|
| DINOv3 Encoder | 1,700,000,000 |
| SigLIP2 Encoder | 400,000,000 |
| Projector | 5,000,000 |
| LFM2.5 Language Model | 1,200,000,000 |
| Segmentation Head | 500,000 |
| Classification Head | 300,000 |
| Detection Head | 500,000 |
| OCR Head | 300,000 |
| **Total** | **~3,306,600,000** |
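A quick check that the table's rows sum to the stated total:

```python
# Per-component counts, in the order listed in the table above.
components = [
    1_700_000_000,   # DINOv3 encoder
    400_000_000,     # SigLIP2 encoder
    5_000_000,       # projector
    1_200_000_000,   # LFM2.5 language model
    500_000,         # segmentation head
    300_000,         # classification head
    500_000,         # detection head
    300_000,         # OCR head
]
total = sum(components)
print(f"{total:,}")  # 3,306,600,000
```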
## Training Strategy

### Stage 1: Connector Pretraining

- **Freeze**: All vision encoders, LFM2.5
- **Train**: Projector only
- **Data**: Image-caption pairs (CC3M, LAION)
- **Goal**: Align vision and language representations
- **Batch Size**: 8-16
- **Learning Rate**: 1e-3

### Stage 2: Head Training

- **Freeze**: Encoders, LFM2.5, Projector
- **Train**: Task heads only
- **Data**: Task-specific datasets
- **Goal**: Learn task-specific heads
- **Batch Size**: 8-16
- **Learning Rate**: 1e-3

### Stage 3: Full Fine-tuning

- **Freeze**: None
- **Train**: All components
- **Data**: Multi-task or specific task
- **Goal**: End-to-end optimization
- **Learning Rate**: 1e-5 (encoders), 1e-4 (heads)
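The three-stage schedule can be expressed as per-component trainable flags. The component names here are illustrative, not the repo's actual module names:

```python
# Which component groups receive gradients in each training stage.
STAGES = {
    "stage1_connector": {"encoders": False, "lm": False, "projector": True,  "heads": False},
    "stage2_heads":     {"encoders": False, "lm": False, "projector": False, "heads": True},
    "stage3_full":      {"encoders": True,  "lm": True,  "projector": True,  "heads": True},
}

def trainable(stage):
    """Return the component groups that are unfrozen in the given stage."""
    return [name for name, on in STAGES[stage].items() if on]

print(trainable("stage1_connector"))  # ['projector']
```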
## Memory Requirements

| Mode | Memory |
|------|--------|
| Inference | ~10 GB |
| Training (frozen encoders) | ~12 GB |
| Training (full) | ~30 GB |
## Why LFM2.5?

- **3x faster training** than Qwen3 on CPU
- **2x faster decode/prefill** on CPU
- **Optimized for edge**: runs in under 1 GB of memory
- **Native MLX support**
- **Hybrid architecture**: a mix of attention and conv layers
## Comparison with Alternatives

| Aspect | Oculus (LFM2.5) | Oculus (Qwen2) |
|--------|-----------------|----------------|
| LM Parameters | 1.2B | 1.5B |
| Training Speed | 3x faster | Baseline |
| Inference Speed | 2x faster | Baseline |
| MLX Support | Native | Via mlx-lm |
| Edge Performance | Excellent | Good |
## Supported Tasks

| Task | Input | Output |
|------|-------|--------|
| Captioning | Image + prompt | Generated text |
| VQA | Image + question | Answer text |
| Segmentation | Image | Class per pixel |
| Classification | Image | Class label |
| Detection | Image | Boxes + classes |
| OCR | Image | Text + bounding boxes |
| Feature Extraction | Image | 2176D features |
## Input/Output Shapes

| Input | Shape |
|-------|-------|
| DINOv3 Image | (B, 3, 224, 224) |
| SigLIP2 Image | (B, 3, 384, 384) |
| Input IDs | (B, seq_len) |

| Output | Shape |
|--------|-------|
| Generated Text | (B, seq_len + new_tokens) |
| Segmentation | (B, 14, 14) |
| Classification | (B,) |
| Detection | (B, 196, 9, 80), (B, 196, 9, 4) |
| OCR Text | (B, 14, 14, max_seq_len) |
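Reading off the detection shapes: (B, 196, 9, 80) class logits and (B, 196, 9, 4) boxes imply 9 anchors per spatial location and an 80-class label space (an inference from the shape — it matches COCO's class count, which the document doesn't state explicitly):

```python
# Candidate detections per image implied by the detection output shapes.
locations, anchors, num_classes = 196, 9, 80   # 14x14 grid, 9 anchors, 80 classes
candidates = locations * anchors               # boxes scored per image
print(candidates)  # 1764
```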