Oculus 0.1

Hybrid-reasoning vision-language model built on the Oceanir-Oculus OO1 Architecture.

A small model that outperforms systems 10x its size on visual reasoning and perception tasks, running on commodity GPUs or edge devices.

What's New in Oculus 0.1

Reasoning via Thinking Traces

Short, structured reasoning traces improve multi-step decisions, small-object understanding, and ambiguous spatial tasks.

answer = model.ask(image, "How many red cars on the left?", think=True)
# Output includes <think>...</think> reasoning trace
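Since the trace is embedded inline in the output, downstream code usually wants the answer and the trace separated. A minimal sketch, assuming the model emits at most one `<think>...</think>` block as in the example above (the helper name is illustrative, not part of the API):

```python
import re

def split_think(raw: str) -> tuple[str, str]:
    """Separate the <think>...</think> trace from the final answer.

    Assumes at most one trace block, as in the example output above.
    """
    match = re.search(r"<think>(.*?)</think>", raw, flags=re.DOTALL)
    if match is None:
        return "", raw.strip()
    trace = match.group(1).strip()
    answer = (raw[:match.start()] + raw[match.end():]).strip()
    return trace, answer

trace, answer = split_think("<think>Scan left half; count red cars.</think> Three.")
# trace == "Scan left half; count red cars."
# answer == "Three."
```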

Perceptive Tool Calling + Focus (Zoom & Crop)

Oculus can trigger tool calls to focus (zoom and crop) and re-query on smaller regions, dramatically improving fine-grained perception.

answer = model.ask(image, "Read the small text on the sign", focus=True)
# Model automatically zooms to relevant region
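Conceptually, focus boils down to cropping a sub-region and re-querying it. The geometry can be sketched as follows; the padding heuristic and helper name are illustrative assumptions, not part of the API:

```python
def clamp_box(box, width, height, pad=0.1):
    """Expand a [x1, y1, x2, y2] region by `pad` on each side and
    clamp it to the image bounds, so a zoomed crop keeps some context."""
    x1, y1, x2, y2 = box
    px, py = (x2 - x1) * pad, (y2 - y1) * pad
    return [max(0, int(x1 - px)), max(0, int(y1 - py)),
            min(width, int(x2 + px)), min(height, int(y2 + py))]

crop = clamp_box([50, 50, 150, 100], width=200, height=120)
# crop == [40, 45, 160, 105]
```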

Structured Outputs

More reliable structured output generation for consistent JSON and predictable downstream integration.

result = model.generate(image, prompt="List all objects", mode="json")
# Returns structured JSON: {"objects": [{"label": "car", "box": [x1,y1,x2,y2]}, ...]}
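Because the output is plain JSON in the shape shown in the comment above, it parses directly with the standard library. A quick sketch (the exact schema is an assumption based on that example):

```python
import json

# Hypothetical raw output matching the shape in the comment above.
raw = ('{"objects": [{"label": "car", "box": [10, 20, 110, 80]}, '
       '{"label": "sign", "box": [200, 40, 240, 90]}]}')

parsed = json.loads(raw)
labels = [obj["label"] for obj in parsed["objects"]]
boxes = {obj["label"]: obj["box"] for obj in parsed["objects"]}
# labels == ["car", "sign"]
# boxes["car"] == [10, 20, 110, 80]
```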

Complex OCR

Improved text recognition across cluttered, low-resolution, or distorted regions, enabling accurate extraction from documents, diagrams, labels, screens, and dense real-world scenes.

text = model.ocr(image)  # Extracts text from any visual content
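Per the Output Modes table below, OCR results also carry locations. If the fragments come back as (text, box) pairs (an assumption; the return shape is not fully specified here), a rough reading-order sort looks like this:

```python
def reading_order(fragments):
    """Sort OCR fragments top-to-bottom, then left-to-right.

    Assumes each fragment is (text, [x1, y1, x2, y2]). This naive sort
    ignores baseline jitter across a line; real layouts may need
    line clustering first.
    """
    return [text for text, box in
            sorted(fragments, key=lambda f: (f[1][1], f[1][0]))]

frags = [("world", [60, 10, 100, 20]),
         ("hello", [5, 10, 50, 20]),
         ("below", [5, 40, 50, 50])]
# reading_order(frags) == ["hello", "world", "below"]
```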

Desktop Use

Better performance on everyday desktop and mobile workflows such as UI understanding and navigation, making Oculus faster and more capable for agentic use cases.

elements = model.detect_ui(screenshot)
# Returns: [{"type": "button", "text": "Submit", "bbox": [x1,y1,x2,y2]}, ...]
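For agentic desktop use, a common next step is turning a detected element into a click target. A minimal sketch using the element shape from the example output above (the helper itself is illustrative, not part of the API):

```python
def click_point(element):
    """Center of a detected UI element's bounding box, e.g. as a click
    target for an automation agent. Assumes the [x1, y1, x2, y2] bbox
    shape shown in the example output above."""
    x1, y1, x2, y2 = element["bbox"]
    return ((x1 + x2) // 2, (y1 + y2) // 2)

submit = {"type": "button", "text": "Submit", "bbox": [100, 200, 180, 230]}
# click_point(submit) == (140, 215)
```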

Architecture

The Oceanir-Oculus OO1 Architecture is a hybrid vision-language design optimized for:

  • Visual reasoning outperforming systems 10x larger
  • Edge deployment on commodity GPUs
  • Grounded perception with spatial understanding
  • Tool calling and agentic workflows

Installation

pip install oceanir

Usage

from oceanir import Oculus

model = Oculus.from_pretrained("OceanirAI/Oculus-0.1")

# Basic VQA
answer = model.ask("image.jpg", "What is this?")

# With reasoning traces
answer = model.ask("scene.jpg", "Count the people", think=True)

# With focus/zoom for fine details
answer = model.ask("document.jpg", "Read the fine print", focus=True)

# Structured JSON output
result = model.generate("image.jpg", prompt="Describe objects", mode="json")

# OCR
text = model.ocr("screenshot.png")

# UI Detection
ui_elements = model.detect_ui("desktop.png")

# Object Detection with grounding
boxes = model.detect("image.jpg")

# Segmentation
mask = model.segment("image.jpg")

Output Modes

Mode       Method                                   Output
Text       model.ask(image, question)               Natural language answer
Reasoning  model.ask(image, question, think=True)   Answer with <think> trace
JSON       model.generate(image, mode="json")       Structured JSON
Points     model.generate(image, mode="point")      Object center points
Boxes      model.detect(image)                      Bounding boxes + labels
Polygons   model.segment(image)                     Segmentation masks
OCR        model.ocr(image)                         Extracted text + locations
UI         model.detect_ui(image)                   UI elements + types

Special Tokens

Token                 Purpose
<think>...</think>    Reasoning traces
<focus>...</focus>    Focus/zoom regions
<json>...</json>      Structured output
<box>...</box>        Bounding box coordinates
<point>...</point>    Point coordinates
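Grounding tokens like <box> can be pulled out of raw output with a simple scan. A sketch, assuming comma-separated integer coordinates inside the tag (the inner format is not specified above, so adjust to what the model actually emits):

```python
import re

def parse_boxes(output: str):
    """Extract bounding boxes from <box>x1,y1,x2,y2</box> tokens.

    The comma-separated coordinate format inside the tag is an
    assumption; the tag name matches the Special Tokens table.
    """
    return [tuple(int(v) for v in m.split(","))
            for m in re.findall(r"<box>(.*?)</box>", output)]

out = "Two cars: <box>10,20,60,80</box> and <box>100,20,150,80</box>."
# parse_boxes(out) == [(10, 20, 60, 80), (100, 20, 150, 80)]
```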

Use Cases

  • Robotics: Grounded perception for manipulation and navigation
  • Industrial Inspection: Defect detection and quality control
  • Document Processing: Complex OCR and form extraction
  • Media Search: Visual content understanding and retrieval
  • Desktop Automation: UI understanding for agentic workflows
  • Security: Visual monitoring and anomaly detection

What's in This Repo

  • trained_components/projector.npz - Vision-language projector
  • trained_components/heads.pth - Task heads (detection, segmentation, OCR, UI)
  • oculus_unified_model/ - Model code

License

Oceanir Research License - Non-commercial research only.

For commercial licensing: licensing@oceanir.ai
