# Oculus 0.1 Architecture

## Overview

Oculus is a ~3.3B-parameter multimodal vision-language model that combines DINOv3, SigLIP2, and LFM2.5-1.2B. It is designed for Apple Silicon using MLX.

## Architecture Components

### 1. DINOv3 Encoder (ViT-L/16)

- **Model**: DINOv3 ViT-L/16 (pretrained)
- **Parameters**: ~1.7B
- **Input**: 224×224 images
- **Output**: 197 tokens (1 CLS + 196 patches)
- **Patch Grid**: 14×14
- **Feature Dimension**: 1024D
- **Capabilities**: Universal vision backbone, dense prediction

### 2. SigLIP2 Encoder (SO400M)

- **Model**: SigLIP2 SO400M (pretrained)
- **Parameters**: ~400M
- **Input**: 384×384 images
- **Output**: 576 patch tokens
- **Patch Grid**: 24×24
- **Feature Dimension**: 1152D
- **Capabilities**: Vision-language understanding, fine-grained features

### 3. Feature Fusion

- **Method**: Concatenation
- **Input**: DINOv3 patches (1024D) + SigLIP2 patches (1152D)
- **Output**: 2176D per spatial location
- **Note**: SigLIP2 features are resampled to 14×14 to match the DINOv3 grid

### 4. Vision-Language Projector

- **Type**: 2-layer MLP with GELU
- **Input**: 2176D
- **Hidden**: 4352D
- **Output**: 1536D (LFM2.5 embedding dimension)
- **Parameters**: ~5M

### 5. LFM2.5-1.2B Language Model

- **Model**: LFM2.5-1.2B-Base (pretrained)
- **Parameters**: ~1.2B
- **Architecture**: Hybrid transformer (full-attention + conv layers)
- **Embedding Dimension**: 1536D
- **Depth**: 16 layers
- **Attention Heads**: 24
- **Vocab Size**: 131072
- **Context Length**: 32768 tokens
- **Why LFM2.5**: 3x faster training and 2x faster inference than Qwen3 on CPU

### 6. Task-Specific Heads

#### Segmentation Head

- **Type**: MLP
- **Input**: 2176D
- **Hidden**: 256D
- **Output**: num_classes (e.g., 150 for ADE20K)
- **Output Shape**: (batch, 14, 14, num_classes)

#### Classification Head

- **Type**: MLP
- **Input**: 2176D
- **Hidden**: 256D
- **Output**: num_classes (e.g., 1000 for ImageNet)
- **Uses**: CLS token from fused features

#### Detection Head

- **Type**: MLP
- **Input**: 2176D
- **Hidden**: 256D
- **Outputs**:
  - Class logits: (batch, 196, anchors, num_classes)
  - Box predictions: (batch, 196, anchors, 4)

#### OCR Head

- **Type**: CNN + MLP
- **Input**: 2176D
- **Outputs**:
  - Text logits: (batch, 14, 14, max_seq_len)
  - Geometry: (batch, 196, 4) [x, y, w, h]

## Model Flow

```
Input Image 1 (224×224) ──→ DINOv3 Encoder  ──→ 196 patches (14×14), 1024D each ──┐
                                                                                  ├──→ Concatenate ──→ 2176D fused features
Input Image 2 (384×384) ──→ SigLIP2 Encoder ──→ 576 patches (24×24), 1152D each ──┘
                                                (resampled to 14×14)

2176D fused features
 ├──→ Vision Projector (MLP) ──→ 1536D embeddings ──→ LFM2.5 LM (1.2B) ──→ Generated text (caption / VQA)
 ├──→ Segmentation Head   ──→ (14×14, num_classes) segmentation predictions
 ├──→ Classification Head ──→ class_id
 ├──→ Detection Head      ──→ boxes + classes
 └──→ OCR Head            ──→ text + geometry
```

## Parameter Count

| Component | Parameters |
|-----------|------------|
| DINOv3 Encoder | 1,700,000,000 |
| SigLIP2 Encoder | 400,000,000 |
| Projector | 5,000,000 |
| LFM2.5 Language Model | 1,200,000,000 |
| Segmentation Head | 500,000 |
| Classification Head | 300,000 |
| Detection Head | 500,000 |
| OCR Head | 300,000 |
| **Total** | **~3,306,600,000** |

## Training Strategy

### Stage 1: Connector Pretraining

- **Freeze**: Vision encoders, LFM2.5
- **Train**: Projector only
- **Data**: Image-caption pairs (CC3M, LAION)
- **Goal**: Align vision and language representations
- **Batch Size**: 8-16
- **Learning Rate**: 1e-3
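To make the Stage 1 setup concrete, here is a minimal MLX-style sketch of freezing everything except the projector and taking one optimizer step. The `TinyOculus` module and its single-`Linear` stand-ins for the encoders and LFM2.5 are hypothetical placeholders, not the actual Oculus classes, and the toy MSE loss stands in for the captioning objective Stage 1 would actually use.

```python
import mlx.core as mx
import mlx.nn as nn
import mlx.optimizers as optim


class TinyOculus(nn.Module):
    """Hypothetical stand-in with the same high-level layout as Oculus:
    two frozen encoders, a trainable 2176 -> 4352 -> 1536 projector, a frozen LM."""

    def __init__(self):
        super().__init__()
        self.dino = nn.Linear(32, 1024)      # placeholder for the DINOv3 encoder
        self.siglip = nn.Linear(32, 1152)    # placeholder for the SigLIP2 encoder
        self.projector = nn.Sequential(
            nn.Linear(2176, 4352), nn.GELU(), nn.Linear(4352, 1536)
        )
        self.lm = nn.Linear(1536, 1536)      # placeholder for LFM2.5-1.2B

    def __call__(self, x):
        fused = mx.concatenate([self.dino(x), self.siglip(x)], axis=-1)  # 2176D
        return self.lm(self.projector(fused))


model = TinyOculus()

# Stage 1: freeze everything, then unfreeze only the projector.
model.freeze()
model.projector.unfreeze()

optimizer = optim.AdamW(learning_rate=1e-3)


def loss_fn(model, x, target):
    # Toy MSE loss; real Stage 1 would use a language-modeling loss on captions.
    return nn.losses.mse_loss(model(x), target)


# One training step on random data (batch size 8, as in the Stage 1 settings).
x = mx.random.normal((8, 196, 32))
target = mx.random.normal((8, 196, 1536))
loss, grads = nn.value_and_grad(model, loss_fn)(model, x, target)
optimizer.update(model, grads)               # only projector weights are updated
mx.eval(model.parameters(), optimizer.state)
print(loss.item())
```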
### Stage 2: Head Training

- **Freeze**: Encoders, LFM2.5, Projector
- **Train**: Task heads only
- **Data**: Task-specific datasets
- **Goal**: Learn task-specific heads
- **Batch Size**: 8-16
- **Learning Rate**: 1e-3

### Stage 3: Full Fine-tuning

- **Freeze**: None
- **Train**: All components
- **Data**: Multi-task or a specific task
- **Goal**: End-to-end optimization
- **Learning Rate**: 1e-5 (encoders), 1e-4 (heads)

## Memory Requirements

| Mode | Memory |
|------|--------|
| Inference | ~10 GB |
| Training (frozen encoders) | ~12 GB |
| Training (full) | ~30 GB |

## Why LFM2.5?

- **3x faster training** than Qwen3 on CPU
- **2x faster decode/prefill** on CPU
- **Optimized for edge**: the LM runs in under 1 GB of memory
- **Native MLX support**
- **Hybrid architecture**: a mix of attention and conv layers

## Comparison with Alternatives

| Aspect | Oculus (LFM2.5) | Oculus (Qwen2) |
|--------|-----------------|----------------|
| LM Parameters | 1.2B | 1.5B |
| Training Speed | 3x faster | Baseline |
| Inference Speed | 2x faster | Baseline |
| MLX Support | Native | Via mlx-lm |
| Edge Performance | Excellent | Good |

## Supported Tasks

| Task | Input | Output |
|------|-------|--------|
| Captioning | Image + prompt | Generated text |
| VQA | Image + question | Answer text |
| Segmentation | Image | Class per pixel |
| Classification | Image | Class label |
| Detection | Image | Boxes + classes |
| OCR | Image | Text + bounding boxes |
| Feature Extraction | Image | 2176D features |

## Input/Output Shapes

| Input | Shape |
|-------|-------|
| DINOv3 Image | (B, 3, 224, 224) |
| SigLIP2 Image | (B, 3, 384, 384) |
| Input IDs | (B, seq_len) |

| Output | Shape |
|--------|-------|
| Generated Text | (B, seq_len + new_tokens) |
| Segmentation | (B, 14, 14) |
| Classification | (B,) |
| Detection | (B, 196, 9, 80), (B, 196, 9, 4) |
| OCR Text | (B, 14, 14, max_seq_len) |
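As a concrete illustration of the fusion and projection shapes listed above, here is a minimal MLX sketch. The names (`resample_grid`, `fuse`, `VisionProjector`) are hypothetical, and nearest-neighbor index selection stands in for whatever 24×24 → 14×14 resampling Oculus actually uses.

```python
import mlx.core as mx
import mlx.nn as nn


class VisionProjector(nn.Module):
    """2-layer MLP with GELU: 2176D fused features -> 1536D LM embeddings."""

    def __init__(self, in_dim=2176, hidden_dim=4352, out_dim=1536):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, out_dim)

    def __call__(self, x):
        return self.fc2(nn.gelu(self.fc1(x)))


def resample_grid(tokens, src=24, dst=14):
    # Nearest-neighbor resampling of a src×src patch grid to dst×dst along the
    # token axis: (B, src*src, D) -> (B, dst*dst, D).
    idx = [int(r * src / dst) * src + int(c * src / dst)
           for r in range(dst) for c in range(dst)]
    return mx.take(tokens, mx.array(idx), axis=1)


def fuse(dino_patches, siglip_patches):
    # dino_patches:   (B, 196, 1024) DINOv3 patch tokens (CLS token dropped)
    # siglip_patches: (B, 576, 1152) SigLIP2 patch tokens
    siglip_14 = resample_grid(siglip_patches)                   # (B, 196, 1152)
    return mx.concatenate([dino_patches, siglip_14], axis=-1)   # (B, 196, 2176)


# Shape check with random features in place of real encoder outputs.
dino = mx.random.normal((2, 196, 1024))
siglip = mx.random.normal((2, 576, 1152))
fused = fuse(dino, siglip)
embeds = VisionProjector()(fused)
print(fused.shape, embeds.shape)   # (2, 196, 2176) (2, 196, 1536)
```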