Oculus 0.1 Architecture

Overview

Oculus is a ~3.3B parameter multimodal vision-language model that combines DINOv3, SigLIP2, and LFM2.5-1.2B. It is designed for Apple Silicon using MLX.

Architecture Components

1. DINOv3 Encoder (ViT-L/16)

  • Model: DINOv3 ViT-L/16 (pretrained)
  • Parameters: ~1.7B
  • Input: 224×224 images
  • Output: 197 tokens (1 CLS + 196 patches)
  • Patch Grid: 14×14
  • Feature Dimension: 1024D
  • Capabilities: Universal vision backbone, dense prediction

2. SigLIP2 Encoder (SO400M)

  • Model: SigLIP2 SO400M (pretrained)
  • Parameters: ~400M
  • Input: 384×384 images
  • Output: 576 patch tokens
  • Patch Grid: 24×24
  • Feature Dimension: 1152D
  • Capabilities: Vision-language understanding, fine-grained features

3. Feature Fusion

  • Method: Concatenation
  • Input: DINOv3 patches (1024D) + SigLIP2 patches (1152D)
  • Output: 2176D per spatial location
  • Note: SigLIP2 features are resampled to 14×14 to match the DINOv3 grid (see the sketch below)
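
The resampling method is not specified above, so here is a minimal sketch assuming nearest-neighbour resampling of the SigLIP2 grid (bilinear interpolation or adaptive pooling would work equally well). The function name and the CLS-dropping convention are illustrative, not taken from the repository.

```python
import mlx.core as mx

def fuse_features(dino_tokens, siglip_patches):
    """Fuse DINOv3 and SigLIP2 patch features into 2176D per spatial location.

    dino_tokens:    (B, 197, 1024)  - 1 CLS + 196 patch tokens
    siglip_patches: (B, 576, 1152)  - 24x24 patch grid, flattened
    returns:        (B, 196, 2176)
    """
    B = siglip_patches.shape[0]
    dino = dino_tokens[:, 1:, :]                        # drop CLS -> (B, 196, 1024)

    # Nearest-neighbour resample 24x24 -> 14x14 (assumed; method unspecified in doc)
    sig = siglip_patches.reshape(B, 24, 24, 1152)
    idx = (mx.arange(14) * (24 / 14)).astype(mx.int32)  # source row/col indices
    sig = mx.take(sig, idx, axis=1)                     # rows: 24 -> 14
    sig = mx.take(sig, idx, axis=2)                     # cols: 24 -> 14
    sig = sig.reshape(B, 196, 1152)

    return mx.concatenate([dino, sig], axis=-1)         # (B, 196, 1024 + 1152)
```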

4. Vision-Language Projector

  • Type: 2-layer MLP with GELU
  • Input: 2176D
  • Hidden: 4352D
  • Output: 1536D (LFM2.5 embedding dimension)
  • Parameters: ~16M (2176×4352 + 4352×1536 weights, plus biases)
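
A minimal MLX sketch of this projector (the class and attribute names are illustrative):

```python
import mlx.nn as nn

class VisionProjector(nn.Module):
    """2-layer MLP with GELU: 2176 -> 4352 -> 1536 (the LFM2.5 embedding width)."""

    def __init__(self, in_dim: int = 2176, hidden_dim: int = 4352, out_dim: int = 1536):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, out_dim)

    def __call__(self, x):
        # x: (B, 196, 2176) fused patch features -> (B, 196, 1536)
        return self.fc2(nn.gelu(self.fc1(x)))
```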

5. LFM2.5-1.2B Language Model

  • Model: LFM2.5-1.2B-Base (pretrained)
  • Parameters: ~1.2B
  • Architecture: Hybrid transformer (full_attention + conv layers)
  • Embedding Dimension: 1536D
  • Depth: 16 layers
  • Attention Heads: 24
  • Vocab Size: 131072
  • Context Length: 32768 tokens
  • Why LFM2.5: 3x faster training, 2x faster inference than Qwen3 on CPU

6. Task-Specific Heads

Segmentation Head

  • Type: MLP
  • Input: 2176D
  • Hidden: 256D
  • Output: num_classes (e.g., 150 for ADE20K)
  • Output Shape: (batch, 14, 14, num_classes)

Classification Head

  • Type: MLP
  • Input: 2176D
  • Hidden: 256D
  • Output: num_classes (e.g., 1000 for ImageNet)
  • Uses: CLS token from fused features

Detection Head

  • Type: MLP
  • Input: 2176D
  • Hidden: 256D
  • Outputs:
    • Class logits: (batch, 196, anchors, num_classes)
    • Box predictions: (batch, 196, anchors, 4)

OCR Head

  • Type: CNN + MLP
  • Input: 2176D
  • Outputs:
    • Text logits: (batch, 14, 14, max_seq_len)
    • Geometry: (batch, 196, 4) [x, y, w, h]
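
All four heads share the same per-patch MLP pattern (2176 → 256 → task outputs). Below is a sketch of the segmentation head, assuming a ReLU between the two layers (the activation is not specified above) and an illustrative class name:

```python
import mlx.nn as nn

class SegmentationHead(nn.Module):
    """Per-patch MLP: 2176 -> 256 -> num_classes, reshaped onto the 14x14 grid."""

    def __init__(self, num_classes: int = 150, in_dim: int = 2176, hidden_dim: int = 256):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, num_classes)

    def __call__(self, fused):
        # fused: (B, 196, 2176) fused patch features
        B = fused.shape[0]
        logits = self.fc2(nn.relu(self.fc1(fused)))   # (B, 196, num_classes)
        return logits.reshape(B, 14, 14, -1)          # (B, 14, 14, num_classes)
```

The classification, detection, and OCR heads differ only in their output projections (and the OCR head's convolutional stem), so they are omitted here.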

Model Flow

Note: the same input image is resized to 224×224 for DINOv3 and 384×384 for SigLIP2. All task heads read the 2176D fused features directly (their inputs in section 6 are 2176D); only the language path goes through the projector.

Input Image 1 (224×224) ──→ DINOv3 Encoder
                              ↓
                        196 patches (14×14)
                        1024D per patch
                              ↓
                              └─────────────────┐
                                                │
Input Image 2 (384×384) ──→ SigLIP2 Encoder     │
                              ↓                 │
                        576 patches (24×24)     │
                        1152D per patch         │
                              ↓                 │
                        Resample to 14×14       │
                              ↓                 │
                              └────── Concatenate ──→ 2176D fused features
                                                            │
          ┌────────────────┬────────────────┬───────────────┼──────────────────┐
          ↓                ↓                ↓               ↓                  ↓
    Segmentation    Classification      Detection       OCR Head       Vision Projector
        Head             Head             Head                              (MLP)
          ↓                ↓                ↓               ↓                  ↓
  (14×14, classes)    (class_id)   (boxes + classes) (text + geometry) 1536D embeddings
                                                                               ↓
                                                                        LFM2.5 LM (1.2B)
                                                                               ↓
                                                                         Generated Text
                                                                         (Caption/VQA)
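
End to end, the language path of the flow above could look like the following sketch. It reuses the hypothetical fuse_features and VisionProjector from earlier; encoders, embed_text, and lfm are stand-ins for whatever the repository actually exposes.

```python
import mlx.core as mx

def vlm_forward(dino_img, siglip_img, input_ids, encoders, projector, lfm, embed_text):
    """Language path: encode both views, fuse, project, prefix to the text tokens."""
    dino_tokens = encoders.dino(dino_img)             # (B, 197, 1024)
    sig_patches = encoders.siglip(siglip_img)         # (B, 576, 1152)

    fused = fuse_features(dino_tokens, sig_patches)   # (B, 196, 2176)
    visual = projector(fused)                         # (B, 196, 1536)

    text = embed_text(input_ids)                      # (B, seq_len, 1536)
    # Visual tokens are prefixed to the text embeddings before the LM.
    return lfm(mx.concatenate([visual, text], axis=1))
```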

Parameter Count

| Component | Parameters |
|---|---|
| DINOv3 Encoder | 1,700,000,000 |
| SigLIP2 Encoder | 400,000,000 |
| Projector | ~16,000,000 |
| LFM2.5 Language Model | 1,200,000,000 |
| Segmentation Head | 500,000 |
| Classification Head | 300,000 |
| Detection Head | 500,000 |
| OCR Head | 300,000 |
| Total | ~3,317,600,000 |
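
The projector row follows directly from the MLP dimensions in section 4. As a quick check:

```python
# 2-layer MLP, 2176 -> 4352 -> 1536, weights plus biases
fc1 = 2176 * 4352 + 4352   # 9,474,304
fc2 = 4352 * 1536 + 1536   # 6,686,208
print(fc1 + fc2)           # 16,160,512  (~16M)
```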

Training Strategy

Stage 1: Connector Pretraining

  • Freeze: All vision encoders, LFM2.5
  • Train: Projector only
  • Data: Image-caption pairs (CC3M, LAION)
  • Goal: Align vision and language representations
  • Batch Size: 8-16
  • Learning Rate: 1e-3

Stage 2: Head Training

  • Freeze: Encoders, LFM2.5, Projector
  • Train: Task heads only
  • Data: Task-specific datasets
  • Goal: Learn task-specific heads
  • Batch Size: 8-16
  • Learning Rate: 1e-3

Stage 3: Full Fine-tuning

  • Freeze: None
  • Train: All components
  • Data: Multi-task or specific task
  • Goal: End-to-end optimization
  • Learning Rate: 1e-5 (encoders), 1e-4 (heads)
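
A minimal sketch of how these three stages map onto MLX's Module.freeze()/unfreeze(). The attribute names (model.projector, model.seg_head) are illustrative, not the repository's actual names:

```python
import mlx.optimizers as optim

# `model` is the assembled Oculus module (hypothetical attribute names below).

# Stage 1: connector pretraining - train the projector only
model.freeze()
model.projector.unfreeze()
opt = optim.Adam(learning_rate=1e-3)

# Stage 2: head training - freeze everything again, unfreeze the task heads
model.freeze()
model.seg_head.unfreeze()
opt = optim.Adam(learning_rate=1e-3)

# Stage 3: full fine-tuning - everything trainable; the per-component rates
# above would need separate optimizers or parameter groups, so the simple
# version uses a single conservative rate
model.unfreeze()
opt = optim.Adam(learning_rate=1e-5)
```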

Memory Requirements

| Mode | Memory |
|---|---|
| Inference | ~10 GB |
| Training (frozen encoders) | ~12 GB |
| Training (full) | ~30 GB |

Why LFM2.5?

  • 3x faster training than Qwen3 on CPU
  • 2x faster decode/prefill on CPU
  • Optimized for edge deployment: runs in under 1 GB of memory
  • Native MLX support
  • Hybrid architecture: a mix of attention and convolution layers

Comparison with Alternatives

| Aspect | Oculus (LFM2.5) | Oculus (Qwen2) |
|---|---|---|
| LM Parameters | 1.2B | 1.5B |
| Training Speed | 3x faster | Baseline |
| Inference Speed | 2x faster | Baseline |
| MLX Support | Native | Via mlx-lm |
| Edge Performance | Excellent | Good |

Supported Tasks

| Task | Input | Output |
|---|---|---|
| Captioning | Image + prompt | Generated text |
| VQA | Image + question | Answer text |
| Segmentation | Image | Class per pixel |
| Classification | Image | Class label |
| Detection | Image | Boxes + classes |
| OCR | Image | Text + bounding boxes |
| Feature Extraction | Image | 2176D features |

Input/Output Shapes

| Input | Shape |
|---|---|
| DINOv3 Image | (B, 3, 224, 224) |
| SigLIP2 Image | (B, 3, 384, 384) |
| Input IDs | (B, seq_len) |

| Output | Shape |
|---|---|
| Generated Text | (B, seq_len + new_tokens) |
| Segmentation | (B, 14, 14) |
| Classification | (B,) |
| Detection | (B, 196, 9, 80), (B, 196, 9, 4) |
| OCR Text | (B, 14, 14, max_seq_len) |
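
As a quick smoke test of the shapes above, using the hypothetical fuse_features and VisionProjector sketches from earlier:

```python
import mlx.core as mx

dino_tokens = mx.random.normal((2, 197, 1024))   # stand-in DINOv3 output
sig_patches = mx.random.normal((2, 576, 1152))   # stand-in SigLIP2 output

fused = fuse_features(dino_tokens, sig_patches)
print(fused.shape)                               # (2, 196, 2176)

proj = VisionProjector()
print(proj(fused).shape)                         # (2, 196, 1536)
```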