# Oculus 0.1 Architecture

## Overview

Oculus is a ~3.3B-parameter multimodal vision-language model that combines DINOv3, SigLIP2, and LFM2.5-1.2B. It is designed for Apple Silicon using MLX.

## Architecture Components

### 1. DINOv3 Encoder (ViT-L/16)
- **Model**: DINOv3 ViT-L/16 (pretrained)
- **Parameters**: ~1.7B
- **Input**: 224×224 images
- **Output**: 197 tokens (1 CLS + 196 patches)
- **Patch Grid**: 14×14
- **Feature Dimension**: 1024D
- **Capabilities**: Universal vision backbone, dense prediction

### 2. SigLIP2 Encoder (SO400M)
- **Model**: SigLIP2 SO400M (pretrained)
- **Parameters**: ~400M
- **Input**: 384×384 images
- **Output**: 576 patch tokens
- **Patch Grid**: 24×24
- **Feature Dimension**: 1152D
- **Capabilities**: Vision-language understanding, fine-grained features

### 3. Feature Fusion
- **Method**: Concatenation
- **Input**: DINOv3 patches (1024D) + SigLIP2 patches (1152D)
- **Output**: 2176D per spatial location
- **Note**: SigLIP2 features are resampled to 14×14 to match DINOv3's grid (see the sketch below)

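Since the two encoders produce different grid sizes, SigLIP2's 24×24 tokens must be mapped onto DINOv3's 14×14 grid before concatenation. The following is a minimal MLX sketch of this step; the spec does not name the resampling method, so nearest-neighbor selection is assumed here (bilinear interpolation would be an equally reasonable choice), and `resample_grid`/`fuse` are illustrative names, not codebase functions.

```python
import mlx.core as mx

def resample_grid(x: mx.array, src: int = 24, dst: int = 14) -> mx.array:
    """Resample (B, src*src, D) patch tokens to a dst×dst grid (nearest neighbor)."""
    B, _, D = x.shape
    x = x.reshape(B, src, src, D)
    # Index of the source cell whose center is nearest to each destination cell center.
    idx = mx.array([min(src - 1, int((i + 0.5) * src / dst)) for i in range(dst)])
    x = mx.take(x, idx, axis=1)   # rows
    x = mx.take(x, idx, axis=2)   # columns
    return x.reshape(B, dst * dst, D)

def fuse(dino_feats: mx.array, siglip_feats: mx.array) -> mx.array:
    # dino_feats:   (B, 196, 1024) -- DINOv3 patches on a 14×14 grid
    # siglip_feats: (B, 576, 1152) -- SigLIP2 patches on a 24×24 grid
    siglip_14 = resample_grid(siglip_feats)                  # (B, 196, 1152)
    return mx.concatenate([dino_feats, siglip_14], axis=-1)  # (B, 196, 2176)
```

Concatenating the 1024D and 1152D features at each grid location yields the 2176D vectors used by everything downstream.
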
### 4. Vision-Language Projector
- **Type**: 2-layer MLP with GELU (sketched below)
- **Input**: 2176D
- **Hidden**: 4352D
- **Output**: 1536D (LFM2.5 embedding dimension)
- **Parameters**: ~16M (2176×4352 + 4352×1536 weights, plus biases)

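A minimal MLX sketch of the projector as specified above (2-layer MLP with GELU); the class name is illustrative.

```python
import mlx.core as mx
import mlx.nn as nn

class VisionProjector(nn.Module):
    """2-layer MLP with GELU: 2176 -> 4352 -> 1536."""

    def __init__(self, in_dim: int = 2176, hidden_dim: int = 4352, out_dim: int = 1536):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, out_dim)

    def __call__(self, x: mx.array) -> mx.array:
        # x: (B, 196, 2176) fused patch features -> (B, 196, 1536) LM-space embeddings
        return self.fc2(nn.gelu(self.fc1(x)))
```
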
### 5. LFM2.5-1.2B Language Model
- **Model**: LFM2.5-1.2B-Base (pretrained)
- **Parameters**: ~1.2B
- **Architecture**: Hybrid transformer (full_attention + conv layers)
- **Embedding Dimension**: 1536D (projected vision tokens are spliced into this space; see the sketch below)
- **Depth**: 16 layers
- **Attention Heads**: 24
- **Vocab Size**: 131072
- **Context Length**: 32768 tokens
- **Why LFM2.5**: 3x faster training, 2x faster inference than Qwen3 on CPU

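The document does not state how the projected vision tokens enter the LM; the sketch below assumes the common LLaVA-style scheme of prepending them to the text token embeddings, so treat it as one plausible wiring rather than the documented one.

```python
import mlx.core as mx

def build_multimodal_sequence(vision_embeds: mx.array, text_embeds: mx.array) -> mx.array:
    """Prepend projected vision tokens to the text token embeddings.

    vision_embeds: (B, 196, 1536) -- projector output
    text_embeds:   (B, T, 1536)   -- LFM2.5 token embeddings
    """
    # The combined (B, 196 + T, 1536) sequence is fed to the LM, which then
    # attends over vision and text tokens jointly.
    return mx.concatenate([vision_embeds, text_embeds], axis=1)
```
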
### 6. Task-Specific Heads

#### Segmentation Head
- **Type**: MLP
- **Input**: 2176D
- **Hidden**: 256D
- **Output**: num_classes (e.g., 150 for ADE20K)
- **Output Shape**: (batch, 14, 14, num_classes) (see the sketch below)

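A minimal MLX sketch of the segmentation head; the spec says only "MLP", so the GELU activation is an assumption carried over from the projector.

```python
import mlx.core as mx
import mlx.nn as nn

class SegmentationHead(nn.Module):
    """Per-patch classifier over the 14×14 grid: 2176 -> 256 -> num_classes."""

    def __init__(self, num_classes: int = 150, in_dim: int = 2176, hidden_dim: int = 256):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, num_classes)

    def __call__(self, x: mx.array) -> mx.array:
        # x: (B, 196, 2176) fused features
        B = x.shape[0]
        logits = self.fc2(nn.gelu(self.fc1(x)))  # (B, 196, num_classes)
        return logits.reshape(B, 14, 14, -1)     # (B, 14, 14, num_classes)
```
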
#### Classification Head
- **Type**: MLP
- **Input**: 2176D
- **Hidden**: 256D
- **Output**: num_classes (e.g., 1000 for ImageNet)
- **Uses**: CLS token from fused features

#### Detection Head
- **Type**: MLP
- **Input**: 2176D
- **Hidden**: 256D
- **Outputs** (see the sketch below):
  - Class logits: (batch, 196, anchors, num_classes)
  - Box predictions: (batch, 196, anchors, 4)

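A sketch of the detection head under the stated shapes; `anchors=9` and `num_classes=80` come from the Input/Output Shapes table at the end of this document, while the shared hidden layer and activation are assumptions.

```python
import mlx.core as mx
import mlx.nn as nn

class DetectionHead(nn.Module):
    """Anchor-based per-patch detector with separate class and box branches."""

    def __init__(self, num_classes: int = 80, anchors: int = 9,
                 in_dim: int = 2176, hidden_dim: int = 256):
        super().__init__()
        self.anchors, self.num_classes = anchors, num_classes
        self.fc = nn.Linear(in_dim, hidden_dim)
        self.cls = nn.Linear(hidden_dim, anchors * num_classes)
        self.box = nn.Linear(hidden_dim, anchors * 4)

    def __call__(self, x: mx.array):
        # x: (B, 196, 2176) fused features
        B, N, _ = x.shape
        h = nn.gelu(self.fc(x))
        class_logits = self.cls(h).reshape(B, N, self.anchors, self.num_classes)
        box_preds = self.box(h).reshape(B, N, self.anchors, 4)
        return class_logits, box_preds  # (B, 196, 9, 80), (B, 196, 9, 4)
```
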
#### OCR Head
- **Type**: CNN + MLP
- **Input**: 2176D
- **Outputs** (see the sketch below):
  - Text logits: (batch, 14, 14, max_seq_len)
  - Geometry: (batch, 196, 4) [x, y, w, h]

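The OCR head is only described as "CNN + MLP", so the sketch below is speculative beyond its input/output shapes: the 3×3 convolution, 256-channel width, and `max_seq_len=64` default are all assumptions. Note that MLX convolutions expect channels-last (NHWC) input.

```python
import mlx.core as mx
import mlx.nn as nn

class OCRHead(nn.Module):
    """CNN + MLP OCR head emitting per-patch text logits and geometry."""

    def __init__(self, max_seq_len: int = 64, in_dim: int = 2176, hidden_dim: int = 256):
        super().__init__()
        self.conv = nn.Conv2d(in_dim, hidden_dim, kernel_size=3, padding=1)
        self.text = nn.Linear(hidden_dim, max_seq_len)
        self.geom = nn.Linear(hidden_dim, 4)  # [x, y, w, h] per patch

    def __call__(self, x: mx.array):
        # x: (B, 196, 2176) -> (B, 14, 14, 2176); MLX convolutions are NHWC.
        B = x.shape[0]
        h = nn.gelu(self.conv(x.reshape(B, 14, 14, -1)))  # (B, 14, 14, hidden_dim)
        text_logits = self.text(h)                        # (B, 14, 14, max_seq_len)
        geometry = self.geom(h).reshape(B, 196, 4)        # (B, 196, 4)
        return text_logits, geometry
```
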
## Model Flow

```
 Input Image 1 (224×224)          Input Image 2 (384×384)
           │                                │
           ▼                                ▼
    DINOv3 Encoder                   SigLIP2 Encoder
           │                                │
           ▼                                ▼
 196 patches (14×14)               576 patches (24×24)
   1024D per patch                   1152D per patch
           │                                │
           │                        resample to 14×14
           │                                │
           └───────────────┬────────────────┘
                           ▼
                      Concatenate
                           │
                           ▼
                  2176D fused features
                           │
          ┌────────────────┴────────────────┐
          ▼                                 ▼
  Task heads (below)             Vision Projector (MLP)
                                            │
                                            ▼
                                     1536D embeddings
                                            │
                                            ▼
                                     LFM2.5 LM (1.2B)
                                            │
                                            ▼
                               Generated Text (Caption / VQA)

                  2176D fused features
                           │
     ┌──────────────┬──────┴───────┬──────────────┐
     ▼              ▼              ▼              ▼
Segmentation  Classification   Detection        OCR
    Head           Head           Head          Head
     │              │              │              │
     ▼              ▼              ▼              ▼
  (14×14,      (class_id)      (boxes +       (text +
  classes)                     classes)      geometry)
```

## Parameter Count

| Component | Parameters |
|-----------|------------|
| DINOv3 Encoder | 1,700,000,000 |
| SigLIP2 Encoder | 400,000,000 |
| Projector | 16,000,000 |
| LFM2.5 Language Model | 1,200,000,000 |
| Segmentation Head | 500,000 |
| Classification Head | 300,000 |
| Detection Head | 500,000 |
| OCR Head | 300,000 |
| **Total** | **~3,317,600,000** |

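The total row is simply the sum of the component rows; a quick check:

```python
# Re-derive the total from the component rows above.
components = [
    1_700_000_000,  # DINOv3 Encoder
    400_000_000,    # SigLIP2 Encoder
    16_000_000,     # Projector
    1_200_000_000,  # LFM2.5 Language Model
    500_000,        # Segmentation Head
    300_000,        # Classification Head
    500_000,        # Detection Head
    300_000,        # OCR Head
]
print(f"{sum(components):,}")  # 3,317,600,000
```
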
## Training Strategy

### Stage 1: Connector Pretraining
- **Freeze**: All vision encoders, LFM2.5
- **Train**: Projector only (see the sketch below)
- **Data**: Image-caption pairs (CC3M, LAION)
- **Goal**: Align vision and language representations
- **Batch Size**: 8-16
- **Learning Rate**: 1e-3

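A minimal sketch of the Stage 1 freezing setup in MLX, using tiny stand-in modules so the snippet runs; the attribute names (`dino`, `siglip`, `lm`, `projector`) are illustrative, not the codebase's.

```python
import mlx.nn as nn
import mlx.optimizers as optim

# Stand-in submodules so the snippet runs; the real encoders/LM are far larger.
class Oculus(nn.Module):
    def __init__(self):
        super().__init__()
        self.dino = nn.Linear(8, 8)       # placeholder for DINOv3
        self.siglip = nn.Linear(8, 8)     # placeholder for SigLIP2
        self.lm = nn.Linear(8, 8)         # placeholder for LFM2.5
        self.projector = nn.Linear(8, 8)  # the only trainable part in Stage 1

model = Oculus()
model.freeze()              # freeze every parameter...
model.projector.unfreeze()  # ...then re-enable just the projector

optimizer = optim.Adam(learning_rate=1e-3)
# mlx.nn.value_and_grad(model, loss_fn) now yields gradients only for the
# unfrozen projector parameters.
```
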
### Stage 2: Head Training
- **Freeze**: Encoders, LFM2.5, Projector
- **Train**: Task heads only
- **Data**: Task-specific datasets
- **Goal**: Learn task-specific predictions on top of frozen features
- **Batch Size**: 8-16
- **Learning Rate**: 1e-3

### Stage 3: Full Fine-tuning
- **Freeze**: None
- **Train**: All components
- **Data**: Multi-task or task-specific
- **Goal**: End-to-end optimization
- **Learning Rate**: 1e-5 (encoders), 1e-4 (heads)

## Memory Requirements

| Mode | Memory |
|------|--------|
| Inference | ~10 GB |
| Training (frozen encoders) | ~12 GB |
| Training (full) | ~30 GB |

## Why LFM2.5?

- **3x faster training** than Qwen3 on CPU
- **2x faster decode/prefill** than Qwen3 on CPU
- **Optimized for edge** - runs in under 1 GB of memory
- **Native MLX support**
- **Hybrid architecture** - mix of attention and conv layers

## Comparison with Alternatives

| Aspect | Oculus (LFM2.5) | Oculus (Qwen2) |
|--------|-----------------|----------------|
| LM Parameters | 1.2B | 1.5B |
| Training Speed | 3x faster | Baseline |
| Inference Speed | 2x faster | Baseline |
| MLX Support | Native | Via mlx-lm |
| Edge Performance | Excellent | Good |

## Supported Tasks

| Task | Input | Output |
|------|-------|--------|
| Captioning | Image + prompt | Generated text |
| VQA | Image + question | Answer text |
| Segmentation | Image | Class map (14×14 grid) |
| Classification | Image | Class label |
| Detection | Image | Boxes + classes |
| OCR | Image | Text + bounding boxes |
| Feature Extraction | Image | 2176D features |

## Input/Output Shapes

| Input | Shape |
|-------|-------|
| DINOv3 Image | (B, 3, 224, 224) |
| SigLIP2 Image | (B, 3, 384, 384) |
| Input IDs | (B, seq_len) |

| Output | Shape |
|--------|-------|
| Generated Text | (B, seq_len + new_tokens) |
| Segmentation | (B, 14, 14) |
| Classification | (B,) |
| Detection | (B, 196, 9, 80), (B, 196, 9, 4) |
| OCR Text | (B, 14, 14, max_seq_len) |