# Oculus 0.1 Architecture
## Overview
Oculus is a ~3.8B parameter multimodal vision-language model combining DINOv3, SigLIP2, and LFM2.5-1.2B. It is designed for Apple Silicon using MLX.
## Architecture Components
### 1. DINOv3 Encoder (ViT-L/16)
- **Model**: DINOv3 ViT-L/16 (pretrained)
- **Parameters**: ~1.7B
- **Input**: 224×224 images
- **Output**: 197 tokens (1 CLS + 196 patches)
- **Patch Grid**: 14×14
- **Feature Dimension**: 1024D
- **Capabilities**: Universal vision backbone, dense prediction
### 2. SigLIP2 Encoder (SO400M)
- **Model**: SigLIP2 SO400M (pretrained)
- **Parameters**: ~400M
- **Input**: 384×384 images
- **Output**: 576 patch tokens
- **Patch Grid**: 24×24
- **Feature Dimension**: 1152D
- **Capabilities**: Vision-language understanding, fine-grained features
### 3. Feature Fusion
- **Method**: Concatenation
- **Input**: DINOv3 patches (1024D) + SigLIP2 patches (1152D)
- **Output**: 2176D per spatial location
- **Note**: SigLIP2 features are resampled to 14×14 to match the DINOv3 grid (see the sketch below)
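
The resampling method is not specified above; below is a minimal MLX sketch of the fusion step, assuming nearest-neighbour resampling of the 24×24 SigLIP2 grid down to 14×14 before concatenation (bilinear interpolation or pooling would slot in the same way). `resample_grid` and `fuse_features` are illustrative names, not the repo's actual helpers.

```python
import mlx.core as mx

def resample_grid(tokens, src=24, dst=14):
    """Resample a (B, src*src, D) token grid to (B, dst*dst, D) by nearest neighbour."""
    B, _, D = tokens.shape
    grid = tokens.reshape(B, src, src, D)
    idx = mx.floor(mx.arange(dst) * (src / dst)).astype(mx.int32)  # nearest source row/col
    grid = mx.take(mx.take(grid, idx, axis=1), idx, axis=2)
    return grid.reshape(B, dst * dst, D)

def fuse_features(dino_tokens, siglip_tokens):
    """dino_tokens: (B, 196, 1024), siglip_tokens: (B, 576, 1152) -> (B, 196, 2176)."""
    siglip_14 = resample_grid(siglip_tokens, src=24, dst=14)
    return mx.concatenate([dino_tokens, siglip_14], axis=-1)
```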
### 4. Vision-Language Projector
- **Type**: 2-layer MLP with GELU
- **Input**: 2176D
- **Hidden**: 4352D
- **Output**: 1536D (LFM2.5 embedding dimension)
- **Parameters**: ~5M
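
A minimal MLX sketch of the projector as specified (2176 → 4352 → 1536 with a GELU in between); the class and attribute names are illustrative, not the repo's actual module names.

```python
import mlx.nn as nn

class VisionProjector(nn.Module):
    """2-layer MLP mapping fused vision features into the LFM2.5 embedding space."""
    def __init__(self, in_dim=2176, hidden_dim=4352, out_dim=1536):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, out_dim)

    def __call__(self, x):
        # x: (B, 196, 2176) fused patch features -> (B, 196, 1536)
        return self.fc2(nn.gelu(self.fc1(x)))
```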
### 5. LFM2.5-1.2B Language Model
- **Model**: LFM2.5-1.2B-Base (pretrained)
- **Parameters**: ~1.2B
- **Architecture**: Hybrid transformer (full_attention + conv layers)
- **Embedding Dimension**: 1536D
- **Depth**: 16 layers
- **Attention Heads**: 24
- **Vocab Size**: 131072
- **Context Length**: 32768 tokens
- **Why LFM2.5**: 3x faster training and 2x faster inference than Qwen3 on CPU
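
How the projected vision tokens enter the language model is not spelled out above; a common pattern, and the assumption in this sketch, is to embed the prompt tokens and prepend the 196 projected vision embeddings before running the LM. `embed_tokens` stands in for the LM's token-embedding layer.

```python
import mlx.core as mx

def build_lm_inputs(vision_embeds, input_ids, embed_tokens):
    # vision_embeds: (B, 196, 1536) from the projector
    # input_ids:     (B, T) prompt token ids
    text_embeds = embed_tokens(input_ids)                        # (B, T, 1536)
    return mx.concatenate([vision_embeds, text_embeds], axis=1)  # (B, 196 + T, 1536)
```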
### 6. Task-Specific Heads
#### Segmentation Head
- **Type**: MLP
- **Input**: 2176D
- **Hidden**: 256D
- **Output**: num_classes (e.g., 150 for ADE20K)
- **Output Shape**: (batch, 14, 14, num_classes)
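
A sketch of the segmentation head with the dimensions listed above; the hidden activation is not stated, so a ReLU is assumed.

```python
import mlx.nn as nn

class SegmentationHead(nn.Module):
    """Per-patch MLP: 2176 -> 256 -> num_classes, reshaped to the 14×14 grid."""
    def __init__(self, in_dim=2176, hidden_dim=256, num_classes=150):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, num_classes)

    def __call__(self, fused):
        # fused: (B, 196, 2176) -> logits: (B, 14, 14, num_classes)
        logits = self.fc2(nn.relu(self.fc1(fused)))
        return logits.reshape(fused.shape[0], 14, 14, -1)
```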
#### Classification Head
- **Type**: MLP
- **Input**: 2176D
- **Hidden**: 256D
- **Output**: num_classes (e.g., 1000 for ImageNet)
- **Uses**: CLS token from fused features
#### Detection Head
- **Type**: MLP
- **Input**: 2176D
- **Hidden**: 256D
- **Outputs**:
- Class logits: (batch, 196, anchors, num_classes)
- Box predictions: (batch, 196, anchors, 4)
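
A sketch of the detection head, with the anchor and class counts (9 anchors, 80 classes) taken from the Input/Output Shapes table at the end of this document; the hidden activation is again assumed to be a ReLU.

```python
import mlx.nn as nn

class DetectionHead(nn.Module):
    """Per-patch anchor predictions: class logits and box offsets."""
    def __init__(self, in_dim=2176, hidden_dim=256, num_anchors=9, num_classes=80):
        super().__init__()
        self.num_anchors, self.num_classes = num_anchors, num_classes
        self.fc = nn.Linear(in_dim, hidden_dim)
        self.cls_out = nn.Linear(hidden_dim, num_anchors * num_classes)
        self.box_out = nn.Linear(hidden_dim, num_anchors * 4)

    def __call__(self, fused):
        # fused: (B, 196, 2176)
        B, N, _ = fused.shape
        h = nn.relu(self.fc(fused))
        cls = self.cls_out(h).reshape(B, N, self.num_anchors, self.num_classes)
        box = self.box_out(h).reshape(B, N, self.num_anchors, 4)
        return cls, box   # (B, 196, 9, 80), (B, 196, 9, 4)
```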
#### OCR Head
- **Type**: CNN + MLP
- **Input**: 2176D
- **Outputs**:
- Text logits: (batch, 14, 14, max_seq_len)
- Geometry: (batch, 196, 4) [x, y, w, h]
## Model Flow
```
Input Image 1 (224×224)            Input Image 2 (384×384)
          │                                  │
          ▼                                  ▼
    DINOv3 Encoder                     SigLIP2 Encoder
          │                                  │
          ▼                                  ▼
  196 patches (14×14)                576 patches (24×24)
    1024D per patch                    1152D per patch
          │                                  │
          │                                  ▼
          │                          Resample to 14×14
          │                                  │
          └────────────────┬─────────────────┘
                           ▼
              Concatenate ──→ 2176D fused features
                           │
             ┌─────────────┴──────────────┐
             ▼                            ▼
   Task Heads (input: 2176D)     Vision Projector (MLP)
             │                            │
             ▼                            ▼
  Segmentation   ──→ (14×14, classes)  1536D embeddings
  Classification ──→ (class_id)            │
  Detection      ──→ (boxes + classes)     ▼
  OCR            ──→ (text + geometry)  LFM2.5 LM (1.2B)
                                           │
                                           ▼
                               Generated Text (Caption/VQA)
```
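
Putting the flow above into code: a minimal MLX sketch that reuses `resample_grid` and the projector sketch from the earlier sections. The encoder call signatures (each returning a `(B, tokens, dim)` array) and the dropping of the DINOv3 CLS token before fusion are assumptions, since the encoder wrappers are not shown here.

```python
import mlx.core as mx

def encode_images(dino_encoder, siglip_encoder, projector, img224, img384):
    """Fuse DINOv3 + SigLIP2 patch features and project them to the LM width."""
    dino_tokens = dino_encoder(img224)[:, 1:, :]            # drop CLS -> (B, 196, 1024)
    siglip_tokens = siglip_encoder(img384)                   # (B, 576, 1152)
    siglip_tokens = resample_grid(siglip_tokens, src=24, dst=14)   # (B, 196, 1152)
    fused = mx.concatenate([dino_tokens, siglip_tokens], axis=-1)  # (B, 196, 2176)
    vision_embeds = projector(fused)                         # (B, 196, 1536)
    return fused, vision_embeds   # fused -> task heads, vision_embeds -> LFM2.5
```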
## Parameter Count
| Component | Parameters |
|-----------|------------|
| DINOv3 Encoder | 1,700,000,000 |
| SigLIP2 Encoder | 400,000,000 |
| Projector | 5,000,000 |
| LFM2.5 Language Model | 1,200,000,000 |
| Segmentation Head | 500,000 |
| Classification Head | 300,000 |
| Detection Head | 500,000 |
| OCR Head | 300,000 |
| **Total** | **~3,806,600,000** |
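
The per-component rows above can be sanity-checked against whatever checkpoints are actually loaded by flattening each module's parameter tree, e.g.:

```python
from mlx.utils import tree_flatten

def count_params(module):
    """Total number of scalar parameters in an mlx.nn.Module."""
    return sum(v.size for _, v in tree_flatten(module.parameters()))
```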
## Training Strategy
### Stage 1: Connector Pretraining
- **Freeze**: All vision encoders, LFM2.5
- **Train**: Projector only
- **Data**: Image-caption pairs (CC3M, LAION)
- **Goal**: Align vision and language representations
- **Batch Size**: 8-16
- **Learning Rate**: 1e-3
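
A minimal MLX sketch of the Stage 1 setup: freeze everything, unfreeze the projector, and step with Adam at 1e-3. `model`, `model.projector`, `caption_loss`, and `batch` are placeholder names, not the repo's actual API.

```python
import mlx.core as mx
import mlx.nn as nn
import mlx.optimizers as optim

model.freeze()              # freeze DINOv3, SigLIP2, LFM2.5 and all heads
model.projector.unfreeze()  # train only the vision-language projector
optimizer = optim.Adam(learning_rate=1e-3)

# caption_loss(model, batch) -> scalar next-token loss on image-caption pairs
loss_and_grad = nn.value_and_grad(model, caption_loss)
loss, grads = loss_and_grad(model, batch)     # grads cover trainable params only
optimizer.update(model, grads)
mx.eval(model.parameters(), optimizer.state)
```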
### Stage 2: Head Training
- **Freeze**: Encoders, LFM2.5, Projector
- **Train**: Task heads only
- **Data**: Task-specific datasets
- **Goal**: Learn task-specific heads
- **Batch Size**: 8-16
- **Learning Rate**: 1e-3
### Stage 3: Full Fine-tuning
- **Freeze**: None
- **Train**: All components
- **Data**: Multi-task or specific task
- **Goal**: End-to-end optimization
- **Learning Rate**: 1e-5 (encoders), 1e-4 (heads)
## Memory Requirements
| Mode | Memory |
|------|--------|
| Inference | ~10 GB |
| Training (frozen encoders) | ~12 GB |
| Training (full) | ~30 GB |
## Why LFM2.5?
- **3x faster training** than Qwen3 on CPU
- **2x faster decode/prefill** on CPU
- **Optimized for edge** - runs in under 1 GB of memory
- **Native MLX support**
- **Hybrid architecture** - mix of attention and conv layers
## Comparison with Alternatives
| Aspect | Oculus (LFM2.5) | Oculus (Qwen2) |
|--------|---------------|--------------|
| LM Parameters | 1.2B | 1.5B |
| Training Speed | 3x faster | Baseline |
| Inference Speed | 2x faster | Baseline |
| MLX Support | Native | Via mlx-lm |
| Edge Performance | Excellent | Good |
## Supported Tasks
| Task | Input | Output |
|------|-------|--------|
| Captioning | Image + prompt | Generated text |
| VQA | Image + question | Answer text |
| Segmentation | Image | Class per pixel |
| Classification | Image | Class label |
| Detection | Image | Boxes + classes |
| OCR | Image | Text + bounding boxes |
| Feature Extraction | Image | 2176D features |
## Input/Output Shapes
| Input | Shape |
|-------|-------|
| DINOv3 Image | (B, 3, 224, 224) |
| SigLIP2 Image | (B, 3, 384, 384) |
| Input IDs | (B, seq_len) |

| Output | Shape |
|--------|-------|
| Generated Text | (B, seq_len + new_tokens) |
| Segmentation | (B, 14, 14) |
| Classification | (B,) |
| Detection | (B, 196, 9, 80), (B, 196, 9, 4) |
| OCR Text | (B, 14, 14, max_seq_len) |
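
For quick reference, dummy inputs with the shapes listed above (batch size 1, channel-first layout as in the input table); the actual preprocessing pipeline is not shown here.

```python
import mlx.core as mx

img_dino   = mx.zeros((1, 3, 224, 224))          # DINOv3 image
img_siglip = mx.zeros((1, 3, 384, 384))          # SigLIP2 image
input_ids  = mx.zeros((1, 16), dtype=mx.int32)   # tokenized prompt (seq_len = 16 here)
```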