# Oculus 0.1 Architecture

## Overview
Oculus is a ~3.3B-parameter multimodal vision-language model that combines DINOv3, SigLIP2, and LFM2.5-1.2B. It is designed for Apple Silicon and implemented in MLX.

## Architecture Components

### 1. DINOv3 Encoder (ViT-L/16)
- **Model**: DINOv3 ViT-L/16 (pretrained)
- **Parameters**: ~1.7B
- **Input**: 224Γ—224 images
- **Output**: 197 tokens (1 CLS + 196 patches)
- **Patch Grid**: 14Γ—14
- **Feature Dimension**: 1024D
- **Capabilities**: Universal vision backbone, dense prediction

### 2. SigLIP2 Encoder (SO400M)
- **Model**: SigLIP2 SO400M (pretrained)
- **Parameters**: ~400M
- **Input**: 384Γ—384 images
- **Output**: 576 patch tokens
- **Patch Grid**: 24Γ—24
- **Feature Dimension**: 1152D
- **Capabilities**: Vision-language understanding, fine-grained features
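
Both patch grids follow directly from a 16-pixel patch size; a quick check of the token arithmetic:

```python
# DINOv3: 224x224 input, 16x16 patches
dino_grid = 224 // 16               # 14
dino_tokens = dino_grid ** 2 + 1    # 196 patches + 1 CLS = 197

# SigLIP2: 384x384 input, 16x16 patches
siglip_grid = 384 // 16             # 24
siglip_tokens = siglip_grid ** 2    # 576 patches
```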

### 3. Feature Fusion
- **Method**: Concatenation
- **Input**: DINOv3 patches (1024D) + SigLIP2 patches (1152D)
- **Output**: 2176D per spatial location
- **Note**: SigLIP2 patch features are resampled from 24Γ—24 down to 14Γ—14 to match the DINOv3 grid (sketched below)
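
A minimal sketch of the fusion step in MLX. The resampling method is not specified above, so nearest-neighbor is assumed here (bilinear would be smoother); `resample_grid`, `fuse`, and the row-major patch layout are illustrative assumptions, not the repo's API:

```python
import mlx.core as mx

def resample_grid(x, src=24, dst=14):
    """Nearest-neighbor resample of a square patch grid.

    x: (B, src*src, D) patch tokens, assumed row-major.
    Returns (B, dst*dst, D).
    """
    B, _, D = x.shape
    x = x.reshape(B, src, src, D)
    # Map each destination cell to the nearest source cell.
    idx = mx.array([int(i * src / dst) for i in range(dst)])
    x = x[:, idx][:, :, idx]                 # (B, dst, dst, D)
    return x.reshape(B, dst * dst, D)

def fuse(dino_feats, siglip_feats):
    """(B, 196, 1024) + (B, 576, 1152) -> (B, 196, 2176)."""
    siglip_14 = resample_grid(siglip_feats)  # (B, 196, 1152)
    return mx.concatenate([dino_feats, siglip_14], axis=-1)
```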

### 4. Vision-Language Projector
- **Type**: 2-layer MLP with GELU
- **Input**: 2176D
- **Hidden**: 4352D
- **Output**: 1536D (LFM2.5 embedding dimension)
- **Parameters**: ~16M (2176Γ—4352 + 4352Γ—1536 weights, plus biases)
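
A minimal MLX sketch of the projector, built from the dimensions above (the class name is illustrative):

```python
import mlx.nn as nn

class VisionProjector(nn.Module):
    """2-layer MLP: 2176 -> 4352 -> 1536 with GELU.

    Weights: 2176*4352 + 4352*1536 ~= 16.2M (plus ~6K biases).
    """
    def __init__(self, in_dim=2176, hidden_dim=4352, out_dim=1536):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden_dim)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden_dim, out_dim)

    def __call__(self, x):
        return self.fc2(self.act(self.fc1(x)))
```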

### 5. LFM2.5-1.2B Language Model
- **Model**: LFM2.5-1.2B-Base (pretrained)
- **Parameters**: ~1.2B
- **Architecture**: Hybrid transformer (full_attention + conv layers)
- **Embedding Dimension**: 1536D
- **Depth**: 16 layers
- **Attention Heads**: 24
- **Vocab Size**: 131072
- **Context Length**: 32768 tokens
- **Why LFM2.5**: 3x faster training, 2x faster inference than Qwen3 on CPU

### 6. Task-Specific Heads

#### Segmentation Head
- **Type**: MLP
- **Input**: 2176D
- **Hidden**: 256D
- **Output**: num_classes (e.g., 150 for ADE20K)
- **Output Shape**: (batch, 14, 14, num_classes)

#### Classification Head
- **Type**: MLP
- **Input**: 2176D
- **Hidden**: 256D
- **Output**: num_classes (e.g., 1000 for ImageNet)
- **Uses**: CLS token from fused features

#### Detection Head
- **Type**: MLP
- **Input**: 2176D
- **Hidden**: 256D
- **Outputs**:
  - Class logits: (batch, 196, anchors, num_classes)
  - Box predictions: (batch, 196, anchors, 4)

#### OCR Head
- **Type**: CNN + MLP
- **Input**: 2176D
- **Outputs**:
  - Text logits: (batch, 14, 14, max_seq_len)
  - Geometry: (batch, 196, 4) [x, y, w, h]
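
As a representative example, a sketch of the segmentation head in MLX; the hidden activation is not specified above, so GELU is assumed, and the other heads follow the same pattern with different output shapes:

```python
import mlx.nn as nn

class SegmentationHead(nn.Module):
    """Per-patch MLP: 2176 -> 256 -> num_classes."""
    def __init__(self, in_dim=2176, hidden_dim=256, num_classes=150):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden_dim)
        self.act = nn.GELU()   # assumed; activation unspecified above
        self.fc2 = nn.Linear(hidden_dim, num_classes)

    def __call__(self, x):
        # x: (B, 196, 2176) fused patch features
        B = x.shape[0]
        logits = self.fc2(self.act(self.fc1(x)))   # (B, 196, num_classes)
        return logits.reshape(B, 14, 14, -1)       # (B, 14, 14, num_classes)
```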

## Model Flow

```
Input Image (resized to two resolutions)
        β”‚
        β”œβ”€β”€ 224Γ—224 ──→ DINOv3 Encoder ──→ 196 patches (14Γ—14), 1024D ────────┐
        β”‚                                                                     β”‚
        └── 384Γ—384 ──→ SigLIP2 Encoder ──→ 576 patches (24Γ—24), 1152D        β”‚
                                                  ↓                           β”‚
                                          Resample to 14Γ—14                   β”‚
                                                  ↓                           β”‚
                                             Concatenate ←────────────────────┘
                                                  ↓
                                         2176D fused features
                                                  ↓
        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
        ↓                  ↓                  ↓                  ↓                  ↓
  Segmentation       Classification       Detection            OCR          Vision Projector
      Head                Head               Head              Head              (MLP)
        ↓                  ↓                  ↓                  ↓                  ↓
 (14Γ—14, classes)      (class_id)     (boxes + classes)  (text + geometry)  1536D embeddings
                                                                                    ↓
                                                                             LFM2.5 LM (1.2B)
                                                                                    ↓
                                                                              Generated Text
                                                                              (Caption / VQA)
```
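
The flow for the captioning/VQA path, as a hedged end-to-end sketch. `model` and its submodule names (`dino`, `siglip`, `projector`, `lm`, `lm.embed`) are hypothetical placeholders mirroring the components above, and `fuse` is the fusion sketch from earlier:

```python
import mlx.core as mx

def forward(model, img_224, img_384, input_ids):
    dino_feats = model.dino(img_224)        # (B, 196, 1024), CLS dropped
    siglip_feats = model.siglip(img_384)    # (B, 576, 1152)
    fused = fuse(dino_feats, siglip_feats)  # (B, 196, 2176)
    vis_tokens = model.projector(fused)     # (B, 196, 1536)

    # Prepend visual tokens to the text embeddings, then decode.
    txt_tokens = model.lm.embed(input_ids)  # (B, T, 1536)
    return model.lm(mx.concatenate([vis_tokens, txt_tokens], axis=1))
```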

## Parameter Count

| Component | Parameters |
|-----------|------------|
| DINOv3 Encoder | 1,700,000,000 |
| SigLIP2 Encoder | 400,000,000 |
| Projector | 16,000,000 |
| LFM2.5 Language Model | 1,200,000,000 |
| Segmentation Head | 500,000 |
| Classification Head | 300,000 |
| Detection Head | 500,000 |
| OCR Head | 300,000 |
| **Total** | **~3,317,600,000** |

## Training Strategy

### Stage 1: Connector Pretraining
- **Freeze**: All vision encoders, LFM2.5
- **Train**: Projector only
- **Data**: Image-caption pairs (CC3M, LAION)
- **Goal**: Align vision and language representations
- **Batch Size**: 8-16
- **Learning Rate**: 1e-3

### Stage 2: Head Training
- **Freeze**: Encoders, LFM2.5, Projector
- **Train**: Task heads only
- **Data**: Task-specific datasets
- **Goal**: Learn task-specific heads
- **Batch Size**: 8-16
- **Learning Rate**: 1e-3

### Stage 3: Full Fine-tuning
- **Freeze**: None
- **Train**: All components
- **Data**: Multi-task or specific task
- **Goal**: End-to-end optimization
- **Learning Rate**: 1e-5 (encoders), 1e-4 (heads)
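
A sketch of how the stage-wise freezing might look in MLX, using `nn.Module.freeze()` / `.unfreeze()`; the `model` submodule names (`dino`, `siglip`, `projector`, `lm`, `heads`) are hypothetical:

```python
import mlx.optimizers as optim

def configure_stage(model, stage):
    model.unfreeze()                    # reset to fully trainable
    if stage == 1:                      # connector pretraining
        for m in (model.dino, model.siglip, model.lm, model.heads):
            m.freeze()
        return optim.Adam(learning_rate=1e-3)
    if stage == 2:                      # head training
        for m in (model.dino, model.siglip, model.lm, model.projector):
            m.freeze()
        return optim.Adam(learning_rate=1e-3)
    # Stage 3: nothing frozen. MLX optimizers take a single learning
    # rate, so the 1e-5 (encoders) / 1e-4 (heads) split would need
    # separate optimizer instances over disjoint parameter subsets.
    return optim.Adam(learning_rate=1e-5)
```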

## Memory Requirements

| Mode | Memory |
|------|--------|
| Inference | ~10 GB |
| Training (frozen encoders) | ~12 GB |
| Training (full) | ~30 GB |

## Why LFM2.5?

- **3x faster training** than Qwen3 on CPU
- **2x faster decode/prefill** on CPU
- **Optimized for edge** - the language model alone runs in under 1 GB of memory
- **Native MLX support**
- **Hybrid architecture** - mix of attention and conv layers

## Comparison with Alternatives

| Aspect | Oculus (LFM2.5) | Oculus (Qwen2) |
|--------|---------------|--------------|
| LM Parameters | 1.2B | 1.5B |
| Training Speed | 3x faster | Baseline |
| Inference Speed | 2x faster | Baseline |
| MLX Support | Native | Via mlx-lm |
| Edge Performance | Excellent | Good |

## Supported Tasks

| Task | Input | Output |
|------|-------|--------|
| Captioning | Image + prompt | Generated text |
| VQA | Image + question | Answer text |
| Segmentation | Image | Per-patch class map (14Γ—14) |
| Classification | Image | Class label |
| Detection | Image | Boxes + classes |
| OCR | Image | Text + bounding boxes |
| Feature Extraction | Image | 2176D features |

## Input/Output Shapes

| Input | Shape |
|-------|-------|
| DINOv3 Image | (B, 3, 224, 224) |
| SigLIP2 Image | (B, 3, 384, 384) |
| Input IDs | (B, seq_len) |

| Output | Shape |
|--------|-------|
| Generated Text | (B, seq_len + new_tokens) |
| Segmentation | (B, 14, 14) |
| Classification | (B,) |
| Detection | (B, 196, 9, 80), (B, 196, 9, 4) |
| OCR Text | (B, 14, 14, max_seq_len) |