Oculus 0.1
Hybrid-reasoning vision-language model built on the Oceanir-Oculus OO1 Architecture.
A small model that outperforms systems 10x larger on visual reasoning and perception tasks while running on commodity GPUs or edge devices.
What's New in Oculus 0.1
Reasoning via Thinking Traces
Short, structured reasoning traces improve multi-step decisions, small-object understanding, and ambiguous spatial tasks.
answer = model.ask(image, "How many red cars on the left?", think=True)
# Output includes <think>...</think> reasoning trace
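The raw response embeds the trace inline, so downstream code usually wants to separate it from the final answer. A minimal sketch, assuming the `<think>...</think>` block precedes the answer text as in the example above (`split_think_trace` is illustrative, not part of the API):

```python
import re

def split_think_trace(raw: str) -> tuple[str, str]:
    """Split a raw response into (reasoning trace, final answer).

    Assumes an optional <think>...</think> block appears before
    the answer text.
    """
    match = re.search(r"<think>(.*?)</think>", raw, flags=re.DOTALL)
    if match is None:
        return "", raw.strip()
    trace = match.group(1).strip()
    answer = raw[match.end():].strip()
    return trace, answer

trace, answer = split_think_trace(
    "<think>Two red cars are left of the divider.</think>2"
)
# trace == "Two red cars are left of the divider.", answer == "2"
```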
Perceptive Tool Calling + Focus (Zoom & Crop)
Oculus can trigger tool calls to focus (zoom and crop) and re-query on smaller regions, dramatically improving fine-grained perception.
answer = model.ask(image, "Read the small text on the sign", focus=True)
# Model automatically zooms to relevant region
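The zoom step itself reduces to crop-box arithmetic. The helper below is purely illustrative (the library handles this internally): it assumes a focus region given as normalized `(x1, y1, x2, y2)` coordinates, pads it, and clamps the result to pixel bounds before a re-query:

```python
def focus_crop(region, width, height, pad=0.1):
    """Convert a normalized focus region (x1, y1, x2, y2 in [0, 1])
    into a padded, clamped pixel crop box for re-querying.

    pad expands the region by a fraction of its own size on each side.
    """
    x1, y1, x2, y2 = region
    px, py = (x2 - x1) * pad, (y2 - y1) * pad
    left = max(0, int((x1 - px) * width))
    top = max(0, int((y1 - py) * height))
    right = min(width, int((x2 + px) * width))
    bottom = min(height, int((y2 + py) * height))
    return left, top, right, bottom

# Center quarter of a 400x400 image, no padding:
box = focus_crop((0.25, 0.25, 0.75, 0.75), 400, 400, pad=0)
# box == (100, 100, 300, 300)
```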
Structured Outputs
More reliable structured output generation for consistent JSON and predictable downstream integration.
result = model.generate(image, prompt="List all objects", mode="json")
# Returns structured JSON: {"objects": [{"label": "car", "box": [x1,y1,x2,y2]}, ...]}
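Even with reliable structured output, downstream code should validate the JSON before trusting it. A minimal sketch, assuming the response matches the shape in the comment above (`parse_objects` and its field checks are illustrative, not part of the API):

```python
import json

def parse_objects(raw):
    """Parse and sanity-check a JSON-mode response of the shape
    {"objects": [{"label": ..., "box": [x1, y1, x2, y2]}, ...]}.

    Accepts either a JSON string or an already-decoded dict; drops
    entries missing a label or a 4-element box.
    """
    data = json.loads(raw) if isinstance(raw, str) else raw
    cleaned = []
    for obj in data.get("objects", []):
        box = obj.get("box", [])
        if obj.get("label") and len(box) == 4:
            cleaned.append({"label": obj["label"],
                            "box": [float(v) for v in box]})
    return cleaned
```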
Complex OCR
Improved text recognition across cluttered, low-resolution, or distorted regions, enabling accurate extraction from documents, diagrams, labels, screens, and dense real-world scenes.
text = model.ocr(image) # Extracts text from any visual content
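Text extracted from dense scenes often arrives unordered. Assuming each OCR hit carries a `text` string and an `[x1, y1, x2, y2]` pixel `bbox` (the exact return shape is not specified here), a rough reading-order sort might look like:

```python
def reading_order(items, row_tol=10):
    """Sort OCR items ({"text": ..., "bbox": [x1, y1, x2, y2]}) into
    rough reading order: group into rows by top edge (within row_tol
    pixels), then read each row left to right."""
    items = sorted(items, key=lambda it: it["bbox"][1])
    rows, current = [], []
    for it in items:
        if current and it["bbox"][1] - current[-1]["bbox"][1] > row_tol:
            rows.append(current)
            current = []
        current.append(it)
    if current:
        rows.append(current)
    ordered = []
    for row in rows:
        ordered.extend(sorted(row, key=lambda it: it["bbox"][0]))
    return " ".join(it["text"] for it in ordered)
```

This simple row-grouping heuristic breaks on rotated or multi-column layouts, but it covers the common case of roughly horizontal text.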
Desktop Use
Better performance on everyday desktop and mobile workflows such as UI understanding and navigation, making Oculus faster and more capable for agentic use cases.
elements = model.detect_ui(screenshot)
# Returns: [{"type": "button", "text": "Submit", "bbox": [x1,y1,x2,y2]}, ...]
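For agentic workflows, detected elements typically get turned into click targets. A hypothetical helper (`click_point` is not part of the API) that assumes the element shape shown in the comment above:

```python
def click_point(elements, text):
    """Find a UI element by its visible text (case-insensitive) and
    return the center of its bounding box as an (x, y) click target.
    Returns None if no element matches."""
    for el in elements:
        if el.get("text", "").lower() == text.lower():
            x1, y1, x2, y2 = el["bbox"]
            return (x1 + x2) / 2, (y1 + y2) / 2
    return None

elements = [{"type": "button", "text": "Submit", "bbox": [10, 20, 110, 60]}]
target = click_point(elements, "submit")
# target == (60.0, 40.0)
```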
Architecture
Oceanir-Oculus OO1 Architecture: a hybrid vision-language architecture optimized for:
- Visual reasoning that outperforms systems 10x larger
- Edge deployment on commodity GPUs
- Grounded perception with spatial understanding
- Tool calling and agentic workflows
Installation
pip install oceanir
Usage
from oceanir import Oculus
model = Oculus.from_pretrained("OceanirAI/Oculus-0.1")
# Basic VQA
answer = model.ask("image.jpg", "What is this?")
# With reasoning traces
answer = model.ask("scene.jpg", "Count the people", think=True)
# With focus/zoom for fine details
answer = model.ask("document.jpg", "Read the fine print", focus=True)
# Structured JSON output
result = model.generate(image, prompt="Describe objects", mode="json")
# OCR
text = model.ocr("screenshot.png")
# UI Detection
ui_elements = model.detect_ui("desktop.png")
# Object Detection with grounding
boxes = model.detect("image.jpg")
# Segmentation
mask = model.segment("image.jpg")
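When post-processing `detect` output, overlapping boxes are commonly filtered by intersection-over-union. A self-contained helper, assuming `[x1, y1, x2, y2]` boxes (the exact box format returned by `detect` is not documented here):

```python
def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```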
Output Modes
| Mode | Method | Output |
|---|---|---|
| Text | model.ask(image, question) | Natural language answer |
| Reasoning | model.ask(image, question, think=True) | Answer with <think> trace |
| JSON | model.generate(image, mode="json") | Structured JSON |
| Points | model.generate(image, mode="point") | Object center points |
| Boxes | model.detect(image) | Bounding boxes + labels |
| Polygons | model.segment(image) | Segmentation masks |
| OCR | model.ocr(image) | Extracted text + locations |
| UI | model.detect_ui(image) | UI elements + types |
Special Tokens
| Token | Purpose |
|---|---|
| <think>...</think> | Reasoning traces |
| <focus>...</focus> | Focus/zoom regions |
| <json>...</json> | Structured output |
| <box>...</box> | Bounding box coordinates |
| <point>...</point> | Point coordinates |
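Client code that consumes raw responses can parse these spans itself. A sketch for the `<box>` token, assuming each span wraps four comma-separated coordinates (the exact encoding is an assumption; the model card does not specify it):

```python
import re

def extract_boxes(raw):
    """Pull coordinate lists out of <box>...</box> spans in a raw
    response. Keeps only spans containing exactly four numbers."""
    boxes = []
    for span in re.findall(r"<box>(.*?)</box>", raw, flags=re.DOTALL):
        nums = [float(v) for v in re.findall(r"-?\d+(?:\.\d+)?", span)]
        if len(nums) == 4:
            boxes.append(nums)
    return boxes

boxes = extract_boxes("car <box>1, 2, 3, 4</box> truck <box>5,6,7,8</box>")
# boxes == [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
```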
Use Cases
- Robotics: Grounded perception for manipulation and navigation
- Industrial Inspection: Defect detection and quality control
- Document Processing: Complex OCR and form extraction
- Media Search: Visual content understanding and retrieval
- Desktop Automation: UI understanding for agentic workflows
- Security: Visual monitoring and anomaly detection
What's in This Repo
- trained_components/projector.npz - Vision-language projector
- trained_components/heads.pth - Task heads (detection, segmentation, OCR, UI)
- oculus_unified_model/ - Model code
License
Oceanir Research License - Non-commercial research only.
For commercial licensing: licensing@oceanir.ai