---
license: other
license_name: oceanir-research-license
license_link: LICENSE
language:
- en
library_name: oceanir
pipeline_tag: image-text-to-text
tags:
- vision
- multimodal
- vision-language
- vqa
- reasoning
- thinking-traces
- structured-output
- ocr
- ui-understanding
- tool-calling
- grounding
- robotics
- edge-deployment
- oculus
---

# Oculus 0.1

**Hybrid-reasoning vision-language model built on the Oceanir-Oculus OO1 Architecture.**

Small models that outperform systems 10x larger on visual reasoning and perception tasks, running on commodity GPUs or edge devices.

## What's New in Oculus 0.1

### Reasoning via Thinking Traces

Short, structured reasoning traces improve multi-step decisions, small-object understanding, and ambiguous spatial tasks.

```python
answer = model.ask(image, "How many red cars on the left?", think=True)
# Output includes ... reasoning trace
```

### Perceptive Tool Calling + Focus (Zoom & Crop)

Oculus can trigger tool calls to focus (zoom and crop) and re-query on smaller regions, dramatically improving fine-grained perception.

```python
answer = model.ask(image, "Read the small text on the sign", focus=True)
# Model automatically zooms to the relevant region
```

### Structured Outputs

More reliable structured output generation for consistent JSON and predictable downstream integration.

```python
result = model.generate(image, prompt="List all objects", mode="json")
# Returns structured JSON: {"objects": [{"label": "car", "box": [x1, y1, x2, y2]}, ...]}
```

### Complex OCR

Improved text recognition across cluttered, low-resolution, or distorted regions, enabling accurate extraction from documents, diagrams, labels, screens, and dense real-world scenes.

```python
text = model.ocr(image)
# Extracts text from any visual content
```

### Desktop Use

Better performance on everyday desktop and mobile workflows such as UI understanding and navigation, making Oculus faster and more capable for agentic use cases.
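For agentic workflows, the element list that `detect_ui` returns can be turned into click targets downstream. A minimal sketch, assuming only the list-of-dicts shape shown in this card's examples (`{"type": ..., "text": ..., "bbox": [x1, y1, x2, y2]}`); the helper names below are illustrative, not part of the oceanir API:

```python
# Sketch: consuming detect_ui-style output. Assumes the documented
# element shape: {"type": ..., "text": ..., "bbox": [x1, y1, x2, y2]}.
# find_element and center are hypothetical helpers, not oceanir API.

def find_element(elements, text, kind="button"):
    """Return the first element matching type and (case-insensitive) text."""
    for el in elements:
        if el["type"] == kind and el["text"].lower() == text.lower():
            return el
    return None

def center(bbox):
    """Center point of an [x1, y1, x2, y2] box, e.g. as a click target."""
    x1, y1, x2, y2 = bbox
    return ((x1 + x2) / 2, (y1 + y2) / 2)

# Hand-written sample in place of a real detect_ui call:
elements = [
    {"type": "button", "text": "Submit", "bbox": [100, 200, 180, 230]},
    {"type": "input", "text": "", "bbox": [100, 150, 300, 180]},
]
target = find_element(elements, "submit")
print(center(target["bbox"]))  # (140.0, 215.0)
```

A real pipeline would pass the resulting coordinates to whatever input-automation layer drives the desktop.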
```python
elements = model.detect_ui(screenshot)
# Returns: [{"type": "button", "text": "Submit", "bbox": [x1, y1, x2, y2]}, ...]
```

## Architecture

**Oceanir-Oculus OO1** is a hybrid vision-language architecture optimized for:

- Visual reasoning that outperforms systems 10x larger
- Edge deployment on commodity GPUs
- Grounded perception with spatial understanding
- Tool calling and agentic workflows

## Installation

```bash
pip install oceanir
```

## Usage

```python
from oceanir import Oculus

model = Oculus.from_pretrained("OceanirAI/Oculus-0.1")

# Basic VQA
answer = model.ask("image.jpg", "What is this?")

# With reasoning traces
answer = model.ask("scene.jpg", "Count the people", think=True)

# With focus/zoom for fine details
answer = model.ask("document.jpg", "Read the fine print", focus=True)

# Structured JSON output
result = model.generate("image.jpg", prompt="Describe objects", mode="json")

# OCR
text = model.ocr("screenshot.png")

# UI detection
ui_elements = model.detect_ui("desktop.png")

# Object detection with grounding
boxes = model.detect("image.jpg")

# Segmentation
mask = model.segment("image.jpg")
```

## Output Modes

| Mode | Method | Output |
|------|--------|--------|
| Text | `model.ask(image, question)` | Natural language answer |
| Reasoning | `model.ask(image, question, think=True)` | Answer with reasoning trace |
| JSON | `model.generate(image, mode="json")` | Structured JSON |
| Points | `model.generate(image, mode="point")` | Object center points |
| Boxes | `model.detect(image)` | Bounding boxes + labels |
| Polygons | `model.segment(image)` | Segmentation masks |
| OCR | `model.ocr(image)` | Extracted text + locations |
| UI | `model.detect_ui(image)` | UI elements + types |

## Special Tokens

| Token | Purpose |
|-------|---------|
| `...` | Reasoning traces |
| `...` | Focus/zoom regions |
| `...` | Structured output |
| `...` | Bounding box coordinates |
| `...` | Point coordinates |

## Use Cases

- **Robotics**: Grounded perception for manipulation and navigation
- **Industrial Inspection**: Defect detection and quality control
- **Document Processing**: Complex OCR and form extraction
- **Media Search**: Visual content understanding and retrieval
- **Desktop Automation**: UI understanding for agentic workflows
- **Security**: Visual monitoring and anomaly detection

## What's in This Repo

- `trained_components/projector.npz` - Vision-language projector
- `trained_components/heads.pth` - Task heads (detection, segmentation, OCR, UI)
- `oculus_unified_model/` - Model code

## License

Oceanir Research License - non-commercial research use only.

For commercial licensing: licensing@oceanir.ai