---
license: cc-by-nc-4.0
language:
  - en
pipeline_tag: image-text-to-text
library_name: transformers
tags:
  - vision
  - multimodal
  - vision-language
  - reasoning
  - detection
  - segmentation
  - ocr
  - vqa
  - captioning
base_model:
  - facebook/dinov2-large
  - google/siglip-base-patch16-224
  - Salesforce/blip-image-captioning-base
---

# Oculus 0.2

A unified vision-language model with multi-modal reasoning capabilities.

Oculus 0.2 is a hybrid-reasoning vision-language model that combines:

- **DINOv3** for semantic visual understanding
- **SigLIP2** for vision-language alignment
- a **trained projector** for vision-to-language mapping
- **optional reasoning** via thinking traces

## 🚀 What's New in Oculus 0.2

| Feature | Description |
|---|---|
| 🧠 Reasoning via thinking traces | Short, structured reasoning traces improve multi-step decisions and ambiguous spatial tasks |
| 🔍 Focus system (zoom & crop) | Automatically focuses on smaller regions for fine-grained perception |
| 📦 Multiple output modes | Text, point, box, and polygon outputs for different tasks |
| 📝 Improved captioning | Better descriptions with context awareness |
| ❓ Enhanced VQA | More accurate answers to visual questions |

## Output Modes

| Mode | Description | Use Case |
|---|---|---|
| 📝 Text | Natural-language output | Captioning, VQA, descriptions |
| 📍 Point | (x, y) coordinates + labels | Object counting, localization |
| 📦 Box | Bounding boxes + labels | Object detection |
| 🔷 Polygon | Segmentation masks | Semantic/instance segmentation |

## Quick Start

```python
from oculus_unified_model import OculusForConditionalGeneration
from PIL import Image

# Load model
model = OculusForConditionalGeneration.from_pretrained("OceanirAI/oculus-0.2")

# Load image
image = Image.open("your_image.jpg")

# Caption mode
output = model.generate(image, mode="text", prompt="Describe this image")
print(output.text)

# VQA mode
output = model.generate(image, mode="text", prompt="What color is the car?")
print(output.text)

# With reasoning traces
output = model.generate(image, mode="text", prompt="Count the people", think=True)
print(f"Thinking: {output.thinking_trace}")
print(f"Answer: {output.text}")

# Detection mode (bounding boxes)
output = model.generate(image, mode="box", prompt="Find all vehicles")
for box, label, conf in zip(output.boxes, output.labels, output.confidences):
    print(f"  {label}: {box} (conf={conf:.2f})")

# Point mode (counting)
output = model.generate(image, mode="point", prompt="Count the birds")
print(f"Found {len(output.points)} points")

# Segmentation mode
output = model.generate(image, mode="polygon", prompt="Segment the road")
print(f"Mask shape: {output.mask.shape}")
```

## Reasoning Mode

Enable thinking traces for complex reasoning tasks:

```python
output = model.generate(
    image,
    mode="text",
    prompt="How many people are sitting vs standing?",
    think=True,  # Enable reasoning
)

print(f"💭 Thinking: {output.thinking_trace}")
print(f"📝 Answer: {output.text}")
```

## Focus System

The Focus system enables zoom-and-crop for fine-grained perception:

```python
output = model.generate(
    image,
    mode="text",
    prompt="What does the small text say?",
    focus=True,  # Enable focus/zoom
)
```

## Architecture

```
Image ──→ DINOv3 ──┐
                   ├─→ Fusion ─→ Projector ─→ 64 tokens × 1536D
Image ──→ SigLIP2 ─┘                                  │
                                        ┌─────────────┴─────────────┐
                                        ↓                           ↓
                                     LM Head                   Task Heads
                                        ↓                           ↓
                                Text/Caption/VQA           Point/Box/Polygon
```
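To make the data flow concrete, here is a minimal PyTorch sketch of the fusion-and-projection stage. Everything except the 64 tokens × 1536D output stated above is an assumption for illustration (the input feature dimensions, concatenation-based fusion, and attention pooling), not the released implementation.

```python
import torch
import torch.nn as nn

class FusionProjector(nn.Module):
    """Illustrative sketch: fuse DINOv3 + SigLIP2 features, project to LM space.

    Assumptions (not from this card): 1024/768 feature dims, per-patch
    concatenation fusion, and attention pooling down to 64 tokens.
    """

    def __init__(self, dino_dim=1024, siglip_dim=768, lm_dim=1536, num_tokens=64):
        super().__init__()
        self.fuse = nn.Linear(dino_dim + siglip_dim, lm_dim)
        # Learned queries pool a variable number of patches down to 64 tokens.
        self.queries = nn.Parameter(torch.randn(num_tokens, lm_dim))
        self.pool = nn.MultiheadAttention(lm_dim, num_heads=8, batch_first=True)
        self.proj = nn.Sequential(
            nn.Linear(lm_dim, lm_dim), nn.GELU(), nn.Linear(lm_dim, lm_dim)
        )

    def forward(self, dino_feats, siglip_feats):
        # dino_feats: (B, N, 1024), siglip_feats: (B, N, 768), assumed patch-aligned
        fused = self.fuse(torch.cat([dino_feats, siglip_feats], dim=-1))
        q = self.queries.unsqueeze(0).expand(fused.size(0), -1, -1)
        pooled, _ = self.pool(q, fused, fused)  # (B, 64, 1536)
        return self.proj(pooled)                # 64 tokens × 1536D for the LM
```

Presumably the pooled 64 tokens are then consumed alongside text embeddings by the LM head and by the task heads, as the diagram indicates.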

## Model Details

| Component | Size | Description |
|---|---|---|
| DINOv3 encoder | 1.0B | Semantic visual features |
| SigLIP2 encoder | 400M | Vision-language-aligned features |
| Projector | 160M | Vision-to-language bridge |
| Detection head | 12M | Bounding-box prediction |
| Point head | 8M | Point localization |
| Segmentation head | 24M | Mask prediction |
| **Total** | **~1.6B** | Full model |

## Training

The model components were trained in stages:

1. **Projector:** trained on COCO Captions (5k paired images) for 3 epochs.
2. **Detection heads:** trained on COCO Detection for 5+ epochs using GIoU and focal losses (see the sketch below).
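As a rough illustration of that detection objective, here is a hedged sketch combining the two losses with `torchvision.ops`; the box format, target encoding, and loss weights are assumptions, not the actual training code.

```python
import torch
from torchvision.ops import generalized_box_iou_loss, sigmoid_focal_loss

def detection_loss(pred_boxes, pred_logits, target_boxes, target_onehot,
                   box_weight=2.0, cls_weight=1.0):
    """Hedged sketch of a GIoU + focal detection objective.

    Assumes (x1, y1, x2, y2) boxes already matched to targets and float
    one-hot class targets; the weights are illustrative, not Oculus's values.
    """
    box_loss = generalized_box_iou_loss(pred_boxes, target_boxes, reduction="mean")
    cls_loss = sigmoid_focal_loss(pred_logits, target_onehot, reduction="mean")
    return box_weight * box_loss + cls_weight * cls_loss
```

In practice a matching step (e.g. Hungarian matching) would pair predictions with ground-truth objects before a loss like this is applied.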

## Benchmarks & Evaluation

We use a comprehensive benchmark suite, `eval_benchmarks.py`, covering:

- **COCO Detection:** mAP evaluation (see the metric sketch below)
- **Car Part Damage:** specialized evaluation on the Hugging Face `moondream/car_part_damage` dataset
- **Counting:** accuracy on Pixmo-style counting tasks
- **VQA:** open-ended question-answering accuracy

To run the benchmarks:

```bash
python eval_benchmarks.py --model checkpoints/oculus_detection_v2/final
```

## 🔌 Python API Usage

To use Oculus in your own applications, import the `OculusPredictor`:

```python
from oculus_inference import OculusPredictor

# Initialize (automatically loads the best checkpoint)
model = OculusPredictor()

# 1. Object detection
results = model.detect("image.jpg")
print(f"Found {len(results['boxes'])} objects")

# 2. Visual question answering (reasoning)
answer = model.ask("image.jpg", "What is the person holding?")
print(f"Answer: {answer}")

# 3. Captioning
caption = model.caption("image.jpg")
print(f"Caption: {caption}")
```

## Requirements

```bash
pip install transformers torch pillow numpy
```

For Apple Silicon:

```bash
pip install mlx
```

## Citation

```bibtex
@misc{oculus2025,
  title={Oculus: Unified Vision-Language Model with Multi-Modal Reasoning},
  author={OceanirAI},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/OceanirAI/oculus-0.2}
}
```

## License

CC-BY-NC-4.0

## Contact