---
license: cc-by-nc-4.0
language:
  - en
pipeline_tag: image-text-to-text
tags:
  - vision
  - multimodal
  - vision-language
  - segmentation
  - detection
  - ocr
  - dinov3
  - siglip2
  - lfm2.5
base_model:
  - facebook/dinov3-vith16plus-pretrain-lvd1689m
  - google/siglip2-so400m-patch16-naflex
  - LiquidAI/LFM2.5-1.2B-Base
---

# Oculus 0.1

A multimodal vision-language model combining DINOv3, SigLIP2, and LFM2.5.

## What is this?

Oculus is a universal vision-language model for:

- **Image Captioning**: generate natural language descriptions
- **Visual Question Answering**: answer questions about images
- **Semantic Segmentation**: pixel-level class prediction
- **Image Classification**: global image classification
- **Object Detection**: bounding box prediction
- **OCR**: text detection and recognition

## Model Architecture

```
Image (224×224) ──→ DINOv3 ViT-H/16+ ──┐
                                       ├──→ Concatenate ──→ Projector ──→ LFM2.5-1.2B
Image (384×384) ──→ SigLIP2 SO400M ────┘                         │
                                                                 ├──→ Text Output (Caption/VQA)
                                                    Segmentation Head ──→ Segmentation Map
                                                   Classification Head ──→ Class Label
                                                      Detection Head ──→ Boxes + Classes
                                                          OCR Head ──→ Text + Geometry
```

### Components

| Component | Model | Parameters | Input | Output |
|---|---|---|---|---|
| Vision Encoder 1 | DINOv3 ViT-H/16+ | 1.7B | 224×224 | 256×1280 |
| Vision Encoder 2 | SigLIP2 SO400M | 400M | 384×384 | 576×1152 |
| Fusion | Concatenation | – | 1280-D + 1152-D | 2432-D |
| Projector | 2-layer MLP | ~5M | 2432-D | 1536-D |
| Language Model | LFM2.5-1.2B | 1.2B | 1536-D | Text |
| Segmentation Head | MLP | ~0.5M | 2432-D | 14×14×150 |
| Classification Head | MLP | ~0.3M | 2432-D | 1000 classes |
| Detection Head | MLP | ~0.5M | 2432-D | Boxes + Classes |
| OCR Head | CNN + MLP | ~0.3M | 2432-D | Text + Geometry |

**Total:** ~3.3B parameters (1.7B + 400M + 1.2B encoders/LM, plus ~7M projector and heads)
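As a sanity check on the table's dimensions, the fusion and projection steps can be sketched in plain NumPy. This is a toy illustration, not the actual implementation: the common 196-token (14×14) grid for both encoders and the projector's 2048-unit hidden layer are assumptions.

```python
import numpy as np

# Toy stand-ins for encoder outputs (batch of 1).
# Assumption: both token sequences are resampled to a common
# 196-token (14x14) grid before channel-wise concatenation.
dinov3_tokens = np.random.randn(1, 196, 1280)   # DINOv3 features
siglip2_tokens = np.random.randn(1, 196, 1152)  # SigLIP2 features

# Channel-wise concatenation: 1280 + 1152 = 2432
fused = np.concatenate([dinov3_tokens, siglip2_tokens], axis=-1)
print(fused.shape)  # (1, 196, 2432)

# 2-layer MLP projector maps 2432-D -> 1536-D for the language model
# (hidden width 2048 and the ReLU are assumptions for illustration)
w1 = np.random.randn(2432, 2048) * 0.02
w2 = np.random.randn(2048, 1536) * 0.02
projected = np.maximum(fused @ w1, 0.0) @ w2
print(projected.shape)  # (1, 196, 1536)
```

The fused 2432-D features feed both the projector/language model and the task-specific heads.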

## Usage

### Basic Language Generation

```python
from oculus import create_oculus_model
import mlx.core as mx

model = create_oculus_model(num_classes=150)

# Random tensors stand in for preprocessed images
dinov3_image = mx.random.normal((1, 3, 224, 224))
siglip2_image = mx.random.normal((1, 3, 384, 384))
prompt = mx.array([[1, 2, 3, 4, 5]])  # Tokenized text

generated = model.generate(
    input_ids=prompt,
    x_dinov3=dinov3_image,
    x_siglip2=siglip2_image,
    max_new_tokens=512,
    temperature=0.7,
)
print(f"Generated: {generated.tolist()}")
```

### Visual Question Answering

```python
from oculus import create_oculus_model
import mlx.core as mx

model = create_oculus_model()

dinov3_image = mx.random.normal((1, 3, 224, 224))
siglip2_image = mx.random.normal((1, 3, 384, 384))

question = mx.array([[1, 2, 3, 4, 5, 6, 7, 8]])  # "What is in the image?"

answer = model.generate(
    input_ids=question,
    x_dinov3=dinov3_image,
    x_siglip2=siglip2_image,
    max_new_tokens=100,
)
print(f"Answer: {answer.tolist()}")
```

### Semantic Segmentation

```python
from oculus import create_oculus_model
import mlx.core as mx

model = create_oculus_model(num_classes=150)  # ADE20K

dinov3_image = mx.random.normal((1, 3, 224, 224))
siglip2_image = mx.random.normal((1, 3, 384, 384))

predictions = model.segment(dinov3_image, siglip2_image)
print(f"Segmentation shape: {predictions.shape}")  # (1, 14, 14)
```
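The 14×14 output is a per-patch class map. To overlay it on the 224×224 input you can upsample it, for example with a simple nearest-neighbor repeat (a NumPy sketch on a random stand-in for the model output; 16 = 224 / 14 pixels per patch):

```python
import numpy as np

# Random stand-in for the (1, 14, 14) class map from model.segment()
seg = np.random.randint(0, 150, size=(1, 14, 14))

# Nearest-neighbor upsample: expand each patch to a 16x16 pixel block
full = seg.repeat(16, axis=1).repeat(16, axis=2)
print(full.shape)  # (1, 224, 224)
```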

### Image Classification

```python
from oculus import create_oculus_model
import mlx.core as mx

model = create_oculus_model(num_classes=1000)

dinov3_image = mx.random.normal((4, 3, 224, 224))
siglip2_image = mx.random.normal((4, 3, 384, 384))

class_id = model.classify(dinov3_image, siglip2_image)
print(f"Predicted classes: {class_id.tolist()}")
```

### Object Detection

```python
from oculus import create_oculus_model
import mlx.core as mx

model = create_oculus_model(num_classes=80)  # COCO

dinov3_image = mx.random.normal((1, 3, 224, 224))
siglip2_image = mx.random.normal((1, 3, 384, 384))

cls_logits, bbox_preds = model.detect(dinov3_image, siglip2_image)
print(f"Class logits: {cls_logits.shape}")  # (1, 196, 9, 80)
print(f"Box predictions: {bbox_preds.shape}")  # (1, 196, 9, 4)
```
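The detection head returns raw per-anchor logits and box parameters (196 grid cells × 9 anchors). A minimal NumPy sketch of turning them into thresholded detections might look like this; the sigmoid scoring, the 0.7 confidence threshold, and the box format are assumptions, and non-maximum suppression is omitted:

```python
import numpy as np

# Random stand-ins for the model.detect() outputs
cls_logits = np.random.randn(1, 196, 9, 80)
bbox_preds = np.random.rand(1, 196, 9, 4)

scores = 1.0 / (1.0 + np.exp(-cls_logits))  # per-class sigmoid scores
best_cls = scores.argmax(axis=-1)           # best class per anchor
best_score = scores.max(axis=-1)

keep = best_score[0] > 0.7                  # confidence threshold (assumed)
boxes = bbox_preds[0][keep]                 # (N, 4) surviving boxes
labels = best_cls[0][keep]                  # (N,) class ids
print(boxes.shape, labels.shape)
```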

### OCR

```python
from oculus import create_oculus_model
import mlx.core as mx

model = create_oculus_model()

dinov3_image = mx.random.normal((1, 3, 224, 224))
siglip2_image = mx.random.normal((1, 3, 384, 384))

text_logits, geo_preds = model.ocr(dinov3_image, siglip2_image)
print(f"Text logits: {text_logits.shape}")  # (14, 14, max_seq_len)
print(f"Geometry: {geo_preds.shape}")  # (196, 4)
```

### Loading Pretrained Weights

```python
import os
from oculus import (
    create_oculus_model,
    load_dinov3_from_hf,
    load_siglip2_from_hf,
    load_lfm2_from_hf,
)

model = create_oculus_model(num_classes=150)

token = os.getenv("HF_TOKEN")

load_dinov3_from_hf(
    model.dinov3_encoder,
    repo_id="facebook/dinov3-vith16plus-pretrain-lvd1689m",
    token=token,
)

load_siglip2_from_hf(
    model.siglip2_encoder,
    repo_id="google/siglip2-so400m-patch16-naflex",
    token=token,
)

load_lfm2_from_hf(
    model.language_model,
    repo_id="LiquidAI/LFM2.5-1.2B-Base",
    token=token,
)
```

### Running Examples

```bash
cd Oculus/src/models
python oculus_example.py
```

## Performance

| Task | Dataset | Metric | Expected |
|---|---|---|---|
| Image Classification | ImageNet | Top-1 | ~75% |
| Semantic Segmentation | ADE20K | mIoU | ~45% |
| Object Detection | COCO | mAP | ~45% |
| VQA | VQA v2.0 | Accuracy | ~65% |

## Memory Requirements

| Mode | Memory |
|---|---|
| Inference | ~10 GB |
| Training (frozen encoders) | ~12 GB |
| Training (full) | ~30 GB |

## Requirements

```bash
pip install mlx
pip install huggingface_hub  # for pretrained weights
```

## Model Sources

- [facebook/dinov3-vith16plus-pretrain-lvd1689m](https://huggingface.co/facebook/dinov3-vith16plus-pretrain-lvd1689m)
- [google/siglip2-so400m-patch16-naflex](https://huggingface.co/google/siglip2-so400m-patch16-naflex)
- [LiquidAI/LFM2.5-1.2B-Base](https://huggingface.co/LiquidAI/LFM2.5-1.2B-Base)

## License

CC-BY-NC-4.0

## Contact

- **Organization**: OceanirAI
- **GitHub**: [github.com/Oceanir](https://github.com/Oceanir)