+---
+license: cc-by-nc-4.0
+language:
+- en
+pipeline_tag: image-text-to-text
+tags:
+- vision
+- multimodal
+- vision-language
+- segmentation
+- detection
+- ocr
+- dinov3
+- siglip2
+- lfm2.5
+base_model:
+- facebook/dinov3-vith16plus-pretrain-lvd1689m
+- google/siglip2-so400m-patch16-naflex
+- LiquidAI/LFM2.5-1.2B-Base
+---
+# Oculus 0.1
+A multimodal vision-language model combining DINOv3, SigLIP2, and LFM2.5.
+## What is this?
+Oculus is a single vision-language model that covers six tasks:
+- **Image Captioning**: Generate natural language descriptions
+- **Visual Question Answering**: Answer questions about images
+- **Semantic Segmentation**: Pixel-level class prediction
+- **Image Classification**: Global image classification
+- **Object Detection**: Bounding box prediction
+- **OCR**: Text detection and recognition
+## Model Architecture
+```
+Image (224×224) ──→ DINOv3 ViT-H/16+ ─┐
+                                       ├──→ Concatenate ──→ Projector ──→ LFM2.5-1.2B
+Image (384×384) ──→ SigLIP2 SO400M ──┘                          │
+                                                                 ├──→ Text Output (Caption/VQA)
+                                                    Segmentation Head ──→ Segmentation Map
+                                                   Classification Head ──→ Class Label
+                                                      Detection Head ──→ Boxes + Classes
+                                                          OCR Head ──→ Text + Geometry
+```
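The diagram does not say how the two token grids are aligned before concatenation. The sketch below is one possible reading, not the Oculus implementation: it assumes the 24×24 SigLIP2 grid is average-pooled down to the 14×14 DINOv3 grid, concatenated channel-wise, and passed through a 2-layer MLP. It uses NumPy instead of MLX so it runs anywhere; `adaptive_avg_pool_grid`, the GELU activation, and the hidden width are all hypothetical choices, and the batch dimension is omitted.

```python
import numpy as np

def adaptive_avg_pool_grid(tokens: np.ndarray, src: int, dst: int) -> np.ndarray:
    """Average-pool a (src*src, C) token grid down to (dst*dst, C)."""
    grid = tokens.reshape(src, src, -1)
    out = np.empty((dst, dst, grid.shape[-1]), dtype=grid.dtype)
    for i in range(dst):
        r0 = (i * src) // dst                # floor of window start
        r1 = -(-((i + 1) * src) // dst)      # ceil of window end
        for j in range(dst):
            c0 = (j * src) // dst
            c1 = -(-((j + 1) * src) // dst)
            out[i, j] = grid[r0:r1, c0:c1].mean(axis=(0, 1))
    return out.reshape(dst * dst, -1)

def gelu(x: np.ndarray) -> np.ndarray:
    """Tanh approximation of GELU (assumed activation)."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

rng = np.random.default_rng(0)
# Random stand-ins for encoder outputs
dinov3_feats = rng.standard_normal((196, 1280)).astype(np.float32)   # 14x14 tokens
siglip2_feats = rng.standard_normal((576, 1152)).astype(np.float32)  # 24x24 tokens

siglip2_aligned = adaptive_avg_pool_grid(siglip2_feats, src=24, dst=14)  # (196, 1152)
fused = np.concatenate([dinov3_feats, siglip2_aligned], axis=-1)          # (196, 2432)

hidden = 64  # hypothetical width, kept tiny for the demo
w1 = rng.standard_normal((2432, hidden)).astype(np.float32) * 0.02
w2 = rng.standard_normal((hidden, 1536)).astype(np.float32) * 0.02
projected = gelu(fused @ w1) @ w2                                         # (196, 1536)

print(fused.shape, projected.shape)
```

The fused 2432-wide features are what the task heads would consume, while the 1536-wide projection matches the language model input listed under Components.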
+## Components
+| Component | Model | Parameters | Input | Output |
+|-----------|-------|------------|-------|--------|
+| Vision Encoder 1 | DINOv3 ViT-H/16+ | 1.7B | 224×224 | 196×1280 |
+| Vision Encoder 2 | SigLIP2 SO400M | 400M | 384×384 | 576×1152 |
+| Fusion | Concatenation | - | 2432D | 2432D |
+| Projector | 2-layer MLP | ~5M | 2432D | 1536D |
+| Language Model | LFM2.5-1.2B | 1.2B | 1536D | Text |
+| Segmentation Head | MLP | ~0.5M | 2432D | 14×14×150 |
+| Classification Head | MLP | ~0.3M | 2432D | 1000 |
+| Detection Head | MLP | ~0.5M | 2432D | Boxes + Classes |
+| OCR Head | CNN + MLP | ~0.3M | 2432D | Text + Geometry |
+
+**Total: ~4.5B parameters**
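
The grid sizes and the fused width follow from the patch size and the two embedding widths. A small check, assuming square non-overlapping 16×16 patches for both encoders (as the checkpoint names suggest) and ignoring any class or register tokens; `num_patches` is a hypothetical helper, not part of the `oculus` package:

```python
PATCH = 16  # patch size implied by "vith16plus" and "patch16"

def num_patches(image_size: int, patch_size: int = PATCH) -> int:
    """Number of patch tokens for a square image (extra tokens excluded)."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be divisible by patch size")
    side = image_size // patch_size
    return side * side

dinov3_tokens = num_patches(224)   # 14 x 14 grid
siglip2_tokens = num_patches(384)  # 24 x 24 grid
fused_dim = 1280 + 1152            # channel-wise concatenation of both encoders

print(dinov3_tokens, siglip2_tokens, fused_dim)  # 196 576 2432
```

The 14×14 grid is the same one that appears in the segmentation output shape and as the 196 positions in the detection outputs.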
+## Usage
+### Basic Language Generation
+```python
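+from oculus import create_oculus_model
+import mlx.core as mx  # MLX arrays and random ops live in mlx.core
+model = create_oculus_model(num_classes=150)
+# Random tensors stand in for preprocessed images, one per encoder
+dinov3_image = mx.random.normal((1, 3, 224, 224))
+siglip2_image = mx.random.normal((1, 3, 384, 384))
+prompt = mx.array([[1, 2, 3, 4, 5]])  # Placeholder token IDs, not real tokenizer output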
+generated = model.generate(
+    input_ids=prompt,
+    x_dinov3=dinov3_image,
+    x_siglip2=siglip2_image,
+    max_new_tokens=512,
+    temperature=0.7,
+)
+print(f"Generated: {generated.tolist()}")
+```
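The card does not show how `generate` uses `temperature`. A common scheme, sketched here in plain Python as an assumption rather than the Oculus sampler, divides the logits by the temperature before the softmax, so values below 1 sharpen the distribution and values above 1 flatten it. `sample_with_temperature` is a hypothetical helper.

```python
import math
import random

def sample_with_temperature(logits, temperature=0.7, rng=random):
    """Sample one token index from logits after temperature scaling.

    Returns (index, probabilities). `rng` needs a `choices` method,
    e.g. the `random` module or a `random.Random` instance.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    probs = [value / total for value in exps]
    index = rng.choices(range(len(logits)), weights=probs, k=1)[0]
    return index, probs

token_id, probs = sample_with_temperature([2.0, 1.0, 0.1], temperature=0.7)
print(token_id, [round(p, 3) for p in probs])
```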
+### Visual Question Answering
+```python
+from oculus import create_oculus_model
+import mlx.core as mx
+model = create_oculus_model()
+# Random tensors stand in for preprocessed images, one per encoder
+dinov3_image = mx.random.normal((1, 3, 224, 224))
+siglip2_image = mx.random.normal((1, 3, 384, 384))
+question = mx.array([[1, 2, 3, 4, 5, 6, 7, 8]])  # Placeholder token IDs standing in for "What is in the image?"
+answer = model.generate(
+    input_ids=question,
+    x_dinov3=dinov3_image,
+    x_siglip2=siglip2_image,
+    max_new_tokens=100,
+)
+print(f"Answer: {answer.tolist()}")
+```
+### Semantic Segmentation
+```python
+from oculus import create_oculus_model
+import mlx.core as mx
+model = create_oculus_model(num_classes=150)  # ADE20K
+dinov3_image = mx.random.normal((1, 3, 224, 224))
+siglip2_image = mx.random.normal((1, 3, 384, 384))
+predictions = model.segment(dinov3_image, siglip2_image)
+print(f"Segmentation shape: {predictions.shape}")  # (1, 14, 14)
+```
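The segmentation output is at patch resolution (14×14), so each prediction covers a 16×16 pixel block of the 224×224 input. A minimal sketch of nearest-neighbour upsampling back to pixel resolution, using NumPy and a random stand-in for the model output; it assumes `model.segment` returns integer class IDs with the shape shown above, and `upsample_nearest` is a hypothetical helper:

```python
import numpy as np

def upsample_nearest(class_map: np.ndarray, scale: int) -> np.ndarray:
    """Repeat each cell of a (B, H, W) class-ID map `scale` times along H and W."""
    return class_map.repeat(scale, axis=1).repeat(scale, axis=2)

rng = np.random.default_rng(0)
patch_preds = rng.integers(0, 150, size=(1, 14, 14))  # stand-in for model.segment output
pixel_preds = upsample_nearest(patch_preds, scale=224 // 14)

print(pixel_preds.shape)  # (1, 224, 224)
```

An MLX array can usually be converted with `np.array(predictions)` before calling the helper.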
+### Image Classification
+```python
+from oculus import create_oculus_model
+import mlx.core as mx
+model = create_oculus_model(num_classes=1000)
+dinov3_image = mx.random.normal((4, 3, 224, 224))
+siglip2_image = mx.random.normal((4, 3, 384, 384))
+class_id = model.classify(dinov3_image, siglip2_image)
+print(f"Predicted classes: {class_id.tolist()}")
+```
+### Object Detection
+```python
+from oculus import create_oculus_model
+import mlx.core as mx
+model = create_oculus_model(num_classes=80)  # COCO
+dinov3_image = mx.random.normal((1, 3, 224, 224))
+siglip2_image = mx.random.normal((1, 3, 384, 384))
+cls_logits, bbox_preds = model.detect(dinov3_image, siglip2_image)
+print(f"Class logits: {cls_logits.shape}")  # (1, 196, 9, 80)
+print(f"Box predictions: {bbox_preds.shape}")  # (1, 196, 9, 4)
+```
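The raw outputs hold one prediction per grid position and per slot in the third axis (196 × 9, presumably anchors). The card does not describe post-processing, so the sketch below is only an assumption: it treats the logits as independent per-class sigmoid scores and keeps predictions above a threshold. It uses NumPy with synthetic inputs, leaves the box encoding untouched, and skips non-maximum suppression; `top_detections` is a hypothetical helper.

```python
import numpy as np

def top_detections(cls_logits, bbox_preds, score_threshold=0.5):
    """Keep predictions whose best class score exceeds the threshold.

    cls_logits: (B, P, A, C), bbox_preds: (B, P, A, 4).
    Returns (boxes, class_ids, scores), flattened over B, P and A.
    """
    scores = 1.0 / (1.0 + np.exp(-cls_logits))  # assumed: sigmoid per class
    best_class = scores.argmax(axis=-1)          # (B, P, A)
    best_score = scores.max(axis=-1)             # (B, P, A)
    keep = best_score > score_threshold          # boolean mask over (B, P, A)
    return bbox_preds[keep], best_class[keep], best_score[keep]

# Synthetic example: every logit is low except one confident prediction
cls_logits = np.full((1, 196, 9, 80), -10.0)
bbox_preds = np.zeros((1, 196, 9, 4))
cls_logits[0, 5, 2, 17] = 10.0
bbox_preds[0, 5, 2] = [0.1, 0.2, 0.3, 0.4]

boxes, class_ids, scores = top_detections(cls_logits, bbox_preds)
print(boxes.shape, class_ids)
```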
+### OCR
+```python
+from oculus import create_oculus_model
+import mlx.core as mx
+model = create_oculus_model()
+dinov3_image = mx.random.normal((1, 3, 224, 224))
+siglip2_image = mx.random.normal((1, 3, 384, 384))
+text_logits, geo_preds = model.ocr(dinov3_image, siglip2_image)
+print(f"Text logits: {text_logits.shape}")  # (14, 14, max_seq_len)
+print(f"Geometry: {geo_preds.shape}")  # (196, 4)
+```
+## Loading Pretrained Weights
+```python
+import os
+from oculus import (
+    create_oculus_model,
+    load_dinov3_from_hf,
+    load_siglip2_from_hf,
+    load_lfm2_from_hf,
+)
+model = create_oculus_model(num_classes=150)
+token = os.getenv("HF_TOKEN")
+load_dinov3_from_hf(
+    model.dinov3_encoder,
+    repo_id="facebook/dinov3-vith16plus-pretrain-lvd1689m",
+    token=token,
+)
+load_siglip2_from_hf(
+    model.siglip2_encoder,
+    repo_id="google/siglip2-so400m-patch16-naflex",
+    token=token,
+)
+load_lfm2_from_hf(
+    model.language_model,
+    repo_id="LiquidAI/LFM2.5-1.2B-Base",
+    token=token,
+)
+```
+## Running Examples
+```bash
+cd Oculus/src/models
+python oculus_example.py
+```
+## Performance
+| Task | Dataset | Metric | Expected |
+|------|---------|--------|----------|
+| Image Classification | ImageNet | Top-1 | ~75% |
+| Semantic Segmentation | ADE20K | mIoU | ~45% |
+| Object Detection | COCO | mAP | ~45% |
+| VQA | VQA2.0 | Accuracy | ~65% |
+## Memory Requirements
+| Mode | Memory |
+|------|--------|
+| Inference | ~10 GB |
+| Training (frozen encoders) | ~12 GB |
+| Training (full) | ~30 GB |
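
A rough sanity check on the inference figure: memory for the weights alone is the parameter count times the bytes per parameter. The card does not state the weight precision, so the 16-bit case below is an assumption; activations, caches, and optimizer state are not included. `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Memory for the weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9

total_params = 4.5e9  # total quoted in the Components section

for name, nbytes in [("float32", 4), ("float16/bfloat16", 2)]:
    print(f"{name}: {weight_memory_gb(total_params, nbytes):.1f} GB")
# float32: 18.0 GB
# float16/bfloat16: 9.0 GB
```

Under that assumption, 16-bit weights take about 9 GB, which is in line with the ~10 GB inference estimate once activations are added.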
+## Requirements
+```bash
+pip install mlx
+pip install huggingface_hub  # for pretrained weights
+```
+## Model Sources
+- DINOv3: [facebook/dinov3-vith16plus-pretrain-lvd1689m](https://huggingface.co/facebook/dinov3-vith16plus-pretrain-lvd1689m)
+- SigLIP2: [google/siglip2-so400m-patch16-naflex](https://huggingface.co/google/siglip2-so400m-patch16-naflex)
+- LFM2.5: [LiquidAI/LFM2.5-1.2B-Base](https://huggingface.co/LiquidAI/LFM2.5-1.2B-Base)
+## License
+CC-BY-NC-4.0
+## Contact
+- Organization: OceanirAI
+- GitHub: github.com/Oceanir