Oculus 0.1 Architecture

Overview

Oculus is a ~3.3B parameter multimodal vision-language model that combines DINOv3, SigLIP2, and LFM2.5-1.2B. It is designed for Apple Silicon using MLX.

Architecture Components

1. DINOv3 Encoder (ViT-L/16)

  • Model: DINOv3 ViT-L/16 (pretrained)
  • Parameters: ~1.7B
  • Input: 224×224 images
  • Output: 197 tokens (1 CLS + 196 patches)
  • Patch Grid: 14×14
  • Feature Dimension: 1024D
  • Capabilities: Universal vision backbone, dense prediction

2. SigLIP2 Encoder (SO400M)

  • Model: SigLIP2 SO400M (pretrained)
  • Parameters: ~400M
  • Input: 384×384 images
  • Output: 576 patch tokens
  • Patch Grid: 24×24
  • Feature Dimension: 1152D
  • Capabilities: Vision-language understanding, fine-grained features

3. Feature Fusion

  • Method: Concatenation
  • Input: DINOv3 patches (1024D) + SigLIP2 patches (1152D)
  • Output: 2176D per spatial location
  • Note: SigLIP2 features are resampled to 14×14 to match the DINOv3 grid (see the sketch below)
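
The resampling method is not specified above, so here is a minimal sketch assuming nearest-neighbour resampling of the SigLIP2 grid (bilinear interpolation or adaptive pooling would work equally well). The function name and the CLS-dropping convention are illustrative, not taken from the repository.

```python
import mlx.core as mx

def fuse_features(dino_tokens, siglip_patches):
    """Fuse DINOv3 and SigLIP2 patch features into 2176D per spatial location.

    dino_tokens:    (B, 197, 1024)  - 1 CLS + 196 patch tokens
    siglip_patches: (B, 576, 1152)  - 24x24 patch grid, flattened
    returns:        (B, 196, 2176)
    """
    B = siglip_patches.shape[0]
    dino = dino_tokens[:, 1:, :]                        # drop CLS -> (B, 196, 1024)

    # Nearest-neighbour resample 24x24 -> 14x14 (assumed; method unspecified in doc)
    sig = siglip_patches.reshape(B, 24, 24, 1152)
    idx = (mx.arange(14) * (24 / 14)).astype(mx.int32)  # source row/col indices
    sig = mx.take(sig, idx, axis=1)                     # rows: 24 -> 14
    sig = mx.take(sig, idx, axis=2)                     # cols: 24 -> 14
    sig = sig.reshape(B, 196, 1152)

    return mx.concatenate([dino, sig], axis=-1)         # (B, 196, 1024 + 1152)
```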

4. Vision-Language Projector

  • Type: 2-layer MLP with GELU
  • Input: 2176D
  • Hidden: 4352D
  • Output: 1536D (LFM2.5 embedding dimension)
  • Parameters: ~16M (2176×4352 + 4352×1536 weights, plus biases)
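
A minimal MLX sketch of this projector (the class and attribute names are illustrative):

```python
import mlx.nn as nn

class VisionProjector(nn.Module):
    """2-layer MLP with GELU: 2176 -> 4352 -> 1536 (the LFM2.5 embedding width)."""

    def __init__(self, in_dim: int = 2176, hidden_dim: int = 4352, out_dim: int = 1536):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, out_dim)

    def __call__(self, x):
        # x: (B, 196, 2176) fused patch features -> (B, 196, 1536)
        return self.fc2(nn.gelu(self.fc1(x)))
```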

5. LFM2.5-1.2B Language Model

  • Model: LFM2.5-1.2B-Base (pretrained)
  • Parameters: ~1.2B
  • Architecture: Hybrid transformer (full_attention + conv layers)
  • Embedding Dimension: 1536D
  • Depth: 16 layers
  • Attention Heads: 24
  • Vocab Size: 131072
  • Context Length: 32768 tokens
  • Why LFM2.5: 3x faster training, 2x faster inference than Qwen3 on CPU

6. Task-Specific Heads

Segmentation Head

  • Type: MLP
  • Input: 2176D
  • Hidden: 256D
  • Output: num_classes (e.g., 150 for ADE20K)
  • Output Shape: (batch, 14, 14, num_classes)

Classification Head

  • Type: MLP
  • Input: 2176D
  • Hidden: 256D
  • Output: num_classes (e.g., 1000 for ImageNet)
  • Uses: CLS token from fused features

Detection Head

  • Type: MLP
  • Input: 2176D
  • Hidden: 256D
  • Outputs:
    • Class logits: (batch, 196, anchors, num_classes)
    • Box predictions: (batch, 196, anchors, 4)

OCR Head

  • Type: CNN + MLP
  • Input: 2176D
  • Outputs:
    • Text logits: (batch, 14, 14, max_seq_len)
    • Geometry: (batch, 196, 4) [x, y, w, h]
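
All four heads share the same per-patch MLP pattern (2176 → 256 → task outputs). Below is a sketch of the segmentation head, assuming a ReLU between the two layers (the activation is not specified above) and an illustrative class name:

```python
import mlx.nn as nn

class SegmentationHead(nn.Module):
    """Per-patch MLP: 2176 -> 256 -> num_classes, reshaped onto the 14x14 grid."""

    def __init__(self, num_classes: int = 150, in_dim: int = 2176, hidden_dim: int = 256):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, num_classes)

    def __call__(self, fused):
        # fused: (B, 196, 2176) fused patch features
        B = fused.shape[0]
        logits = self.fc2(nn.relu(self.fc1(fused)))   # (B, 196, num_classes)
        return logits.reshape(B, 14, 14, -1)          # (B, 14, 14, num_classes)
```

The classification, detection, and OCR heads differ only in their output projections (and the OCR head's convolutional stem), so they are omitted here.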

Model Flow

Note: the same input image is resized to 224×224 for DINOv3 and 384×384 for SigLIP2. All task heads read the 2176D fused features directly (their inputs in section 6 are 2176D); only the language path goes through the projector.

Input Image 1 (224×224) ──→ DINOv3 Encoder
                              ↓
                        196 patches (14×14)
                        1024D per patch
                              ↓
                              └─────────────────┐
                                                │
Input Image 2 (384×384) ──→ SigLIP2 Encoder     │
                              ↓                 │
                        576 patches (24×24)     │
                        1152D per patch         │
                              ↓                 │
                        Resample to 14×14       │
                              ↓                 │
                              └────── Concatenate ──→ 2176D fused features
                                                            │
          ┌────────────────┬────────────────┬───────────────┼──────────────────┐
          ↓                ↓                ↓               ↓                  ↓
    Segmentation    Classification      Detection       OCR Head       Vision Projector
        Head             Head             Head                              (MLP)
          ↓                ↓                ↓               ↓                  ↓
  (14×14, classes)    (class_id)   (boxes + classes) (text + geometry) 1536D embeddings
                                                                               ↓
                                                                        LFM2.5 LM (1.2B)
                                                                               ↓
                                                                         Generated Text
                                                                         (Caption/VQA)
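
End to end, the language path of the flow above could look like the following sketch. It reuses the hypothetical fuse_features and VisionProjector from earlier; encoders, embed_text, and lfm are stand-ins for whatever the repository actually exposes.

```python
import mlx.core as mx

def vlm_forward(dino_img, siglip_img, input_ids, encoders, projector, lfm, embed_text):
    """Language path: encode both views, fuse, project, prefix to the text tokens."""
    dino_tokens = encoders.dino(dino_img)             # (B, 197, 1024)
    sig_patches = encoders.siglip(siglip_img)         # (B, 576, 1152)

    fused = fuse_features(dino_tokens, sig_patches)   # (B, 196, 2176)
    visual = projector(fused)                         # (B, 196, 1536)

    text = embed_text(input_ids)                      # (B, seq_len, 1536)
    # Visual tokens are prefixed to the text embeddings before the LM.
    return lfm(mx.concatenate([visual, text], axis=1))
```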

Parameter Count

| Component | Parameters |
|---|---|
| DINOv3 Encoder | 1,700,000,000 |
| SigLIP2 Encoder | 400,000,000 |
| Projector | ~16,000,000 |
| LFM2.5 Language Model | 1,200,000,000 |
| Segmentation Head | 500,000 |
| Classification Head | 300,000 |
| Detection Head | 500,000 |
| OCR Head | 300,000 |
| Total | ~3,317,600,000 |
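
The projector row follows directly from the MLP dimensions in section 4. As a quick check:

```python
# 2-layer MLP, 2176 -> 4352 -> 1536, weights plus biases
fc1 = 2176 * 4352 + 4352   # 9,474,304
fc2 = 4352 * 1536 + 1536   # 6,686,208
print(fc1 + fc2)           # 16,160,512  (~16M)
```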

Training Strategy

Stage 1: Connector Pretraining

  • Freeze: All vision encoders, LFM2.5
  • Train: Projector only
  • Data: Image-caption pairs (CC3M, LAION)
  • Goal: Align vision and language representations
  • Batch Size: 8-16
  • Learning Rate: 1e-3

Stage 2: Head Training

  • Freeze: Encoders, LFM2.5, Projector
  • Train: Task heads only
  • Data: Task-specific datasets
  • Goal: Learn task-specific heads
  • Batch Size: 8-16
  • Learning Rate: 1e-3

Stage 3: Full Fine-tuning

  • Freeze: None
  • Train: All components
  • Data: Multi-task or specific task
  • Goal: End-to-end optimization
  • Learning Rate: 1e-5 (encoders), 1e-4 (heads)
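
A minimal sketch of how these three stages map onto MLX's Module.freeze()/unfreeze(). The attribute names (model.projector, model.seg_head) are illustrative, not the repository's actual names:

```python
import mlx.optimizers as optim

# `model` is the assembled Oculus module (hypothetical attribute names below).

# Stage 1: connector pretraining - train the projector only
model.freeze()
model.projector.unfreeze()
opt = optim.Adam(learning_rate=1e-3)

# Stage 2: head training - freeze everything again, unfreeze the task heads
model.freeze()
model.seg_head.unfreeze()
opt = optim.Adam(learning_rate=1e-3)

# Stage 3: full fine-tuning - everything trainable; the per-component rates
# above would need separate optimizers or parameter groups, so the simple
# version uses a single conservative rate
model.unfreeze()
opt = optim.Adam(learning_rate=1e-5)
```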

Memory Requirements

| Mode | Memory |
|---|---|
| Inference | ~10 GB |
| Training (frozen encoders) | ~12 GB |
| Training (full) | ~30 GB |

Why LFM2.5?

  • 3x faster training than Qwen3 on CPU
  • 2x faster decode/prefill on CPU
  • Optimized for edge deployment: runs in under 1 GB of memory
  • Native MLX support
  • Hybrid architecture: a mix of attention and convolution layers

Comparison with Alternatives

| Aspect | Oculus (LFM2.5) | Oculus (Qwen2) |
|---|---|---|
| LM Parameters | 1.2B | 1.5B |
| Training Speed | 3x faster | Baseline |
| Inference Speed | 2x faster | Baseline |
| MLX Support | Native | Via mlx-lm |
| Edge Performance | Excellent | Good |

Supported Tasks

| Task | Input | Output |
|---|---|---|
| Captioning | Image + prompt | Generated text |
| VQA | Image + question | Answer text |
| Segmentation | Image | Class per pixel |
| Classification | Image | Class label |
| Detection | Image | Boxes + classes |
| OCR | Image | Text + bounding boxes |
| Feature Extraction | Image | 2176D features |

Input/Output Shapes

| Input | Shape |
|---|---|
| DINOv3 Image | (B, 3, 224, 224) |
| SigLIP2 Image | (B, 3, 384, 384) |
| Input IDs | (B, seq_len) |

| Output | Shape |
|---|---|
| Generated Text | (B, seq_len + new_tokens) |
| Segmentation | (B, 14, 14) |
| Classification | (B,) |
| Detection | (B, 196, 9, 80), (B, 196, 9, 4) |
| OCR Text | (B, 14, 14, max_seq_len) |
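
As a quick smoke test of the shapes above, using the hypothetical fuse_features and VisionProjector sketches from earlier:

```python
import mlx.core as mx

dino_tokens = mx.random.normal((2, 197, 1024))   # stand-in DINOv3 output
sig_patches = mx.random.normal((2, 576, 1152))   # stand-in SigLIP2 output

fused = fuse_features(dino_tokens, sig_patches)
print(fused.shape)                               # (2, 196, 2176)

proj = VisionProjector()
print(proj(fused).shape)                         # (2, 196, 1536)
```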