---
license: cc-by-nc-4.0
language:
- en
pipeline_tag: image-text-to-text
tags:
- vision
- multimodal
- vision-language
- segmentation
- detection
- ocr
- dinov3
- siglip2
- lfm2.5
base_model:
- facebook/dinov3-vith16plus-pretrain-lvd1689m
- google/siglip2-so400m-patch16-naflex
- LiquidAI/LFM2.5-1.2B-Base
---
# Oculus 0.1

A multimodal vision-language model combining DINOv3, SigLIP2, and LFM2.5.
## What is this?

Oculus is a general-purpose vision-language model for:

- **Image Captioning**: generate natural-language descriptions
- **Visual Question Answering**: answer questions about images
- **Semantic Segmentation**: pixel-level class prediction
- **Image Classification**: whole-image classification
- **Object Detection**: bounding-box prediction
- **OCR**: text detection and recognition
## Model Architecture

```
Image (224×224) ──▶ DINOv3 ViT-H/16+ ──┐
                                       ├──▶ Concatenate ──▶ Projector ──▶ LFM2.5-1.2B ──▶ Text Output (Caption/VQA)
Image (384×384) ──▶ SigLIP2 SO400M ────┘         │
                                                 ├──▶ Segmentation Head ──▶ Segmentation Map
                                                 ├──▶ Classification Head ──▶ Class Label
                                                 ├──▶ Detection Head ──▶ Boxes + Classes
                                                 └──▶ OCR Head ──▶ Text + Geometry
```
### Components

| Component | Model | Parameters | Input | Output |
|---|---|---|---|---|
| Vision Encoder 1 | DINOv3 ViT-H/16+ | 1.7B | 224×224 | 256×1280 |
| Vision Encoder 2 | SigLIP2 SO400M | 400M | 384×384 | 576×1152 |
| Fusion | Concatenation | - | 1280D + 1152D | 2432D |
| Projector | 2-layer MLP | ~5M | 2432D | 1536D |
| Language Model | LFM2.5-1.2B | 1.2B | 1536D | Text |
| Segmentation Head | MLP | ~0.5M | 2432D | 14×14×150 |
| Classification Head | MLP | ~0.3M | 2432D | 1000 |
| Detection Head | MLP | ~0.5M | 2432D | Boxes + Classes |
| OCR Head | CNN + MLP | ~0.3M | 2432D | Text + Geometry |

Total: ~3.3B parameters
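The fusion path can be sketched in plain NumPy. This is an illustrative reconstruction of the dimension flow only: the projector's hidden width and activation are assumptions, and both encoders' token sequences are assumed to be aligned on the 14×14 grid used by the task heads.

```python
import numpy as np

# Illustrative dimension flow (not Oculus code): per-token features from the
# two encoders are concatenated along the channel axis, then a 2-layer MLP
# projects them into the language model's embedding space.
tokens = 196                                  # 14x14 grid used by the task heads
dino_feats = np.random.randn(tokens, 1280)    # stand-in for DINOv3 features
siglip_feats = np.random.randn(tokens, 1152)  # stand-in for SigLIP2 features

fused = np.concatenate([dino_feats, siglip_feats], axis=-1)  # (196, 2432)

# Assumed 2-layer MLP projector: 2432 -> 2432 -> 1536, ReLU in between
w1 = np.random.randn(2432, 2432) * 0.02
w2 = np.random.randn(2432, 1536) * 0.02
projected = np.maximum(fused @ w1, 0.0) @ w2

print(fused.shape, projected.shape)  # (196, 2432) (196, 1536)
```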
## Usage

### Basic Language Generation

```python
import mlx.core as mx

from oculus import create_oculus_model

model = create_oculus_model(num_classes=150)

# Dummy inputs; replace with preprocessed images and tokenized text
dinov3_image = mx.random.normal((1, 3, 224, 224))
siglip2_image = mx.random.normal((1, 3, 384, 384))
prompt = mx.array([[1, 2, 3, 4, 5]])  # tokenized text

generated = model.generate(
    input_ids=prompt,
    x_dinov3=dinov3_image,
    x_siglip2=siglip2_image,
    max_new_tokens=512,
    temperature=0.7,
)
print(f"Generated: {generated.tolist()}")
```
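The `temperature` argument rescales the next-token distribution before sampling. A minimal NumPy sketch of that mechanism (illustrative only, not Oculus code):

```python
import numpy as np

def sample_with_temperature(logits, temperature=0.7, rng=None):
    """Scale logits by 1/temperature, softmax, then sample one token id."""
    if rng is None:
        rng = np.random.default_rng(0)
    scaled = logits / temperature
    scaled -= scaled.max()  # subtract max for numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return rng.choice(len(logits), p=probs)

# Toy 4-token vocabulary: lower temperature concentrates mass on the top logit
logits = np.array([2.0, 1.0, 0.1, -1.0])
token_id = sample_with_temperature(logits, temperature=0.7)
print(token_id)
```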
### Visual Question Answering

```python
import mlx.core as mx

from oculus import create_oculus_model

model = create_oculus_model()
dinov3_image = mx.random.normal((1, 3, 224, 224))
siglip2_image = mx.random.normal((1, 3, 384, 384))
question = mx.array([[1, 2, 3, 4, 5, 6, 7, 8]])  # e.g. "What is in the image?"

answer = model.generate(
    input_ids=question,
    x_dinov3=dinov3_image,
    x_siglip2=siglip2_image,
    max_new_tokens=100,
)
print(f"Answer: {answer.tolist()}")
```
### Semantic Segmentation

```python
import mlx.core as mx

from oculus import create_oculus_model

model = create_oculus_model(num_classes=150)  # ADE20K
dinov3_image = mx.random.normal((1, 3, 224, 224))
siglip2_image = mx.random.normal((1, 3, 384, 384))

predictions = model.segment(dinov3_image, siglip2_image)
print(f"Segmentation shape: {predictions.shape}")  # (1, 14, 14)
```
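The head predicts one label per 16×16 patch, so the 14×14 map must be upsampled to view it at input resolution. A nearest-neighbor sketch in NumPy (illustrative; any interpolation scheme could be substituted):

```python
import numpy as np

# One class label per 16x16 patch (14x14 grid for a 224x224 input);
# nearest-neighbor upsampling recovers a full-resolution mask.
patch = 16
seg_map = np.random.randint(0, 150, size=(1, 14, 14))  # stand-in for model.segment(...)

full_res = seg_map.repeat(patch, axis=1).repeat(patch, axis=2)
print(full_res.shape)  # (1, 224, 224)
```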
### Image Classification

```python
import mlx.core as mx

from oculus import create_oculus_model

model = create_oculus_model(num_classes=1000)  # ImageNet
dinov3_image = mx.random.normal((4, 3, 224, 224))  # batch of 4
siglip2_image = mx.random.normal((4, 3, 384, 384))

class_ids = model.classify(dinov3_image, siglip2_image)
print(f"Predicted classes: {class_ids.tolist()}")
```
### Object Detection

```python
import mlx.core as mx

from oculus import create_oculus_model

model = create_oculus_model(num_classes=80)  # COCO
dinov3_image = mx.random.normal((1, 3, 224, 224))
siglip2_image = mx.random.normal((1, 3, 384, 384))

cls_logits, bbox_preds = model.detect(dinov3_image, siglip2_image)
print(f"Class logits: {cls_logits.shape}")  # (1, 196, 9, 80)
print(f"Box predictions: {bbox_preds.shape}")  # (1, 196, 9, 4)
```
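These are dense per-anchor predictions and need post-processing before use. A NumPy sketch of one simple scheme, softmax plus a confidence threshold (an assumption; the actual decoding pipeline, including NMS, is not specified here):

```python
import numpy as np

# Stand-ins with the shapes printed above: (1, 196, 9, 80) class logits
# over 196 grid cells x 9 anchors, and (1, 196, 9, 4) box offsets.
rng = np.random.default_rng(0)
cls_logits = rng.normal(size=(1, 196, 9, 80))
bbox_preds = rng.random(size=(1, 196, 9, 4))

# Softmax over the class axis
e = np.exp(cls_logits - cls_logits.max(axis=-1, keepdims=True))
probs = e / e.sum(axis=-1, keepdims=True)

scores = probs.max(axis=-1)    # best class score per anchor
labels = probs.argmax(axis=-1) # best class id per anchor

keep = scores[0] > 0.5         # confidence threshold (tunable)
boxes = bbox_preds[0][keep]    # (num_kept, 4)
print(boxes.shape, labels[0][keep].shape)
```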
### OCR

```python
import mlx.core as mx

from oculus import create_oculus_model

model = create_oculus_model()
dinov3_image = mx.random.normal((1, 3, 224, 224))
siglip2_image = mx.random.normal((1, 3, 384, 384))

text_logits, geo_preds = model.ocr(dinov3_image, siglip2_image)
print(f"Text logits: {text_logits.shape}")  # (14, 14, max_seq_len)
print(f"Geometry: {geo_preds.shape}")  # (196, 4)
```
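How the per-cell text logits are decoded is not specified here; one common scheme is greedy CTC-style decoding (argmax per step, collapse repeats, drop blanks). An illustrative sketch, assuming blank id 0:

```python
import numpy as np

def greedy_ctc_decode(step_logits, blank=0):
    """Argmax per step, collapse consecutive repeats, drop the blank symbol."""
    ids = step_logits.argmax(axis=-1)
    out, prev = [], None
    for i in ids:
        if i != blank and i != prev:
            out.append(int(i))
        prev = i
    return out

# Toy logits for one grid cell: 6 steps over a 5-symbol alphabet (0 = blank)
logits = np.zeros((6, 5))
for step, sym in enumerate([1, 1, 0, 2, 2, 3]):
    logits[step, sym] = 1.0
print(greedy_ctc_decode(logits))  # [1, 2, 3]
```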
### Loading Pretrained Weights

```python
import os

from oculus import (
    create_oculus_model,
    load_dinov3_from_hf,
    load_siglip2_from_hf,
    load_lfm2_from_hf,
)

model = create_oculus_model(num_classes=150)
token = os.getenv("HF_TOKEN")  # gated repos require an access token

load_dinov3_from_hf(
    model.dinov3_encoder,
    repo_id="facebook/dinov3-vith16plus-pretrain-lvd1689m",
    token=token,
)
load_siglip2_from_hf(
    model.siglip2_encoder,
    repo_id="google/siglip2-so400m-patch16-naflex",
    token=token,
)
load_lfm2_from_hf(
    model.language_model,
    repo_id="LiquidAI/LFM2.5-1.2B-Base",
    token=token,
)
```
## Running Examples

```shell
cd Oculus/src/models
python oculus_example.py
```
## Performance

| Task | Dataset | Metric | Expected |
|---|---|---|---|
| Image Classification | ImageNet | Top-1 | ~75% |
| Semantic Segmentation | ADE20K | mIoU | ~45% |
| Object Detection | COCO | mAP | ~45% |
| VQA | VQAv2 | Accuracy | ~65% |
## Memory Requirements

| Mode | Memory |
|---|---|
| Inference | ~10 GB |
| Training (frozen encoders) | ~12 GB |
| Training (full) | ~30 GB |
## Requirements

```shell
pip install mlx
pip install huggingface_hub  # for pretrained weights
```
## Model Sources

- DINOv3: `facebook/dinov3-vith16plus-pretrain-lvd1689m`
- SigLIP2: `google/siglip2-so400m-patch16-naflex`
- LFM2.5: `LiquidAI/LFM2.5-1.2B-Base`
## License

CC-BY-NC-4.0

## Contact

- Organization: OceanirAI
- GitHub: github.com/Oceanir