Oculus 0.1

Hybrid-reasoning vision-language model built on the Oceanir-Oculus OO1 Architecture.

A small model that outperforms systems 10x its size on visual reasoning and perception tasks, running on commodity GPUs or edge devices.

What's New in Oculus 0.1

Reasoning via Thinking Traces

Short, structured reasoning traces improve multi-step decisions, small-object understanding, and ambiguous spatial tasks.

answer = model.ask(image, "How many red cars on the left?", think=True)
# Output includes <think>...</think> reasoning trace
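Since the trace is embedded inline in the output, downstream code usually wants the answer and the trace separated. A minimal sketch, assuming the model emits at most one `<think>...</think>` block as in the example above (the helper name is illustrative, not part of the API):

```python
import re

def split_think(raw: str) -> tuple[str, str]:
    """Separate the <think>...</think> trace from the final answer.

    Assumes at most one trace block, as in the example output above.
    """
    match = re.search(r"<think>(.*?)</think>", raw, flags=re.DOTALL)
    if match is None:
        return "", raw.strip()
    trace = match.group(1).strip()
    answer = (raw[:match.start()] + raw[match.end():]).strip()
    return trace, answer

trace, answer = split_think("<think>Scan left half; count red cars.</think> Three.")
# trace == "Scan left half; count red cars."
# answer == "Three."
```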

Perceptive Tool Calling + Focus (Zoom & Crop)

Oculus can trigger tool calls to focus (zoom and crop) and re-query on smaller regions, dramatically improving fine-grained perception.

answer = model.ask(image, "Read the small text on the sign", focus=True)
# Model automatically zooms to relevant region
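Conceptually, focus boils down to cropping a sub-region and re-querying it. The geometry can be sketched as follows; the padding heuristic and helper name are illustrative assumptions, not part of the API:

```python
def clamp_box(box, width, height, pad=0.1):
    """Expand a [x1, y1, x2, y2] region by `pad` on each side and
    clamp it to the image bounds, so a zoomed crop keeps some context."""
    x1, y1, x2, y2 = box
    px, py = (x2 - x1) * pad, (y2 - y1) * pad
    return [max(0, int(x1 - px)), max(0, int(y1 - py)),
            min(width, int(x2 + px)), min(height, int(y2 + py))]

crop = clamp_box([50, 50, 150, 100], width=200, height=120)
# crop == [40, 45, 160, 105]
```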

Structured Outputs

More reliable structured output generation for consistent JSON and predictable downstream integration.

result = model.generate(image, prompt="List all objects", mode="json")
# Returns structured JSON: {"objects": [{"label": "car", "box": [x1,y1,x2,y2]}, ...]}
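Because the output is plain JSON in the shape shown in the comment above, it parses directly with the standard library. A quick sketch (the exact schema is an assumption based on that example):

```python
import json

# Hypothetical raw output matching the shape in the comment above.
raw = ('{"objects": [{"label": "car", "box": [10, 20, 110, 80]}, '
       '{"label": "sign", "box": [200, 40, 240, 90]}]}')

parsed = json.loads(raw)
labels = [obj["label"] for obj in parsed["objects"]]
boxes = {obj["label"]: obj["box"] for obj in parsed["objects"]}
# labels == ["car", "sign"]
# boxes["car"] == [10, 20, 110, 80]
```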

Complex OCR

Improved text recognition across cluttered, low-resolution, or distorted regions, enabling accurate extraction from documents, diagrams, labels, screens, and dense real-world scenes.

text = model.ocr(image)  # Extracts text from any visual content
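Per the Output Modes table below, OCR results also carry locations. If the fragments come back as (text, box) pairs (an assumption; the return shape is not fully specified here), a rough reading-order sort looks like this:

```python
def reading_order(fragments):
    """Sort OCR fragments top-to-bottom, then left-to-right.

    Assumes each fragment is (text, [x1, y1, x2, y2]). This naive sort
    ignores baseline jitter across a line; real layouts may need
    line clustering first.
    """
    return [text for text, box in
            sorted(fragments, key=lambda f: (f[1][1], f[1][0]))]

frags = [("world", [60, 10, 100, 20]),
         ("hello", [5, 10, 50, 20]),
         ("below", [5, 40, 50, 50])]
# reading_order(frags) == ["hello", "world", "below"]
```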

Desktop Use

Better performance on everyday desktop and mobile workflows such as UI understanding and navigation, making Oculus faster and more capable for agentic use cases.

elements = model.detect_ui(screenshot)
# Returns: [{"type": "button", "text": "Submit", "bbox": [x1,y1,x2,y2]}, ...]
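For agentic desktop use, a common next step is turning a detected element into a click target. A minimal sketch using the element shape from the example output above (the helper itself is illustrative, not part of the API):

```python
def click_point(element):
    """Center of a detected UI element's bounding box, e.g. as a click
    target for an automation agent. Assumes the [x1, y1, x2, y2] bbox
    shape shown in the example output above."""
    x1, y1, x2, y2 = element["bbox"]
    return ((x1 + x2) // 2, (y1 + y2) // 2)

submit = {"type": "button", "text": "Submit", "bbox": [100, 200, 180, 230]}
# click_point(submit) == (140, 215)
```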

Architecture

The Oceanir-Oculus OO1 Architecture is a hybrid vision-language design optimized for:

  • Visual reasoning outperforming systems 10x larger
  • Edge deployment on commodity GPUs
  • Grounded perception with spatial understanding
  • Tool calling and agentic workflows

Installation

pip install oceanir

Usage

from oceanir import Oculus

model = Oculus.from_pretrained("OceanirAI/Oculus-0.1")

# Basic VQA
answer = model.ask("image.jpg", "What is this?")

# With reasoning traces
answer = model.ask("scene.jpg", "Count the people", think=True)

# With focus/zoom for fine details
answer = model.ask("document.jpg", "Read the fine print", focus=True)

# Structured JSON output
result = model.generate("image.jpg", prompt="Describe objects", mode="json")

# OCR
text = model.ocr("screenshot.png")

# UI Detection
ui_elements = model.detect_ui("desktop.png")

# Object Detection with grounding
boxes = model.detect("image.jpg")

# Segmentation
mask = model.segment("image.jpg")

Output Modes

Mode       Method                                   Output
Text       model.ask(image, question)               Natural language answer
Reasoning  model.ask(image, question, think=True)   Answer with <think> trace
JSON       model.generate(image, mode="json")       Structured JSON
Points     model.generate(image, mode="point")      Object center points
Boxes      model.detect(image)                      Bounding boxes + labels
Polygons   model.segment(image)                     Segmentation masks
OCR        model.ocr(image)                         Extracted text + locations
UI         model.detect_ui(image)                   UI elements + types

Special Tokens

Token                 Purpose
<think>...</think>    Reasoning traces
<focus>...</focus>    Focus/zoom regions
<json>...</json>      Structured output
<box>...</box>        Bounding box coordinates
<point>...</point>    Point coordinates
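Grounding tokens like <box> can be pulled out of raw output with a simple scan. A sketch, assuming comma-separated integer coordinates inside the tag (the inner format is not specified above, so adjust to what the model actually emits):

```python
import re

def parse_boxes(output: str):
    """Extract bounding boxes from <box>x1,y1,x2,y2</box> tokens.

    The comma-separated coordinate format inside the tag is an
    assumption; the tag name matches the Special Tokens table.
    """
    return [tuple(int(v) for v in m.split(","))
            for m in re.findall(r"<box>(.*?)</box>", output)]

out = "Two cars: <box>10,20,60,80</box> and <box>100,20,150,80</box>."
# parse_boxes(out) == [(10, 20, 60, 80), (100, 20, 150, 80)]
```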

Use Cases

  • Robotics: Grounded perception for manipulation and navigation
  • Industrial Inspection: Defect detection and quality control
  • Document Processing: Complex OCR and form extraction
  • Media Search: Visual content understanding and retrieval
  • Desktop Automation: UI understanding for agentic workflows
  • Security: Visual monitoring and anomaly detection

What's in This Repo

  • trained_components/projector.npz - Vision-language projector
  • trained_components/heads.pth - Task heads (detection, segmentation, OCR, UI)
  • oculus_unified_model/ - Model code

License

Oceanir Research License - Non-commercial research only.

For commercial licensing: licensing@oceanir.ai
