---
license: other
license_name: oceanir-research-license
license_link: LICENSE
language:
- en
library_name: oceanir
pipeline_tag: image-text-to-text
tags:
- vision
- multimodal
- vision-language
- vqa
- reasoning
- thinking-traces
- structured-output
- ocr
- ui-understanding
- tool-calling
- grounding
- robotics
- edge-deployment
- oculus
---
# Oculus 0.1
**Hybrid-reasoning vision-language model built on the Oceanir-Oculus OO1 Architecture.**
A small model that outperforms systems 10x larger on visual reasoning and perception tasks while running on commodity GPUs or edge devices.
## What's New in Oculus 0.1
### Reasoning via Thinking Traces
Short, structured reasoning traces improve multi-step decisions, small-object understanding, and ambiguous spatial tasks.
```python
answer = model.ask(image, "How many red cars on the left?", think=True)
# Output includes a reasoning trace alongside the final answer
```
### Perceptive Tool Calling + Focus (Zoom & Crop)
Oculus can trigger tool calls to focus (zoom and crop) and re-query on smaller regions — dramatically improving fine-grained perception.
```python
answer = model.ask(image, "Read the small text on the sign", focus=True)
# Model automatically zooms to relevant region
```
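The focus behavior happens inside the model, but the underlying zoom-and-crop idea can be illustrated manually: crop a region of interest, upscale it so small details occupy more pixels, and re-query. This is a minimal sketch; `crop_region` is a hypothetical helper, not part of the oceanir API, and the box coordinates are illustrative.

```python
from PIL import Image

def crop_region(image_path: str, box: tuple[int, int, int, int],
                scale: int = 2) -> Image.Image:
    """Crop a region of interest and upscale it so fine details
    are easier to read before re-querying the model."""
    image = Image.open(image_path)
    region = image.crop(box)  # (left, upper, right, lower)
    width, height = region.size
    return region.resize((width * scale, height * scale), Image.LANCZOS)

# crop = crop_region("sign.jpg", (120, 80, 340, 200))
# answer = model.ask(crop, "Read the small text on the sign")
```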
### Structured Outputs
More reliable structured output generation for consistent JSON and predictable downstream integration.
```python
result = model.generate(image, prompt="List all objects", mode="json")
# Returns structured JSON: {"objects": [{"label": "car", "box": [x1,y1,x2,y2]}, ...]}
```
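Because the JSON follows a predictable shape, downstream code can consume it directly. A small sketch, assuming a result shaped like the comment above (`{"objects": [{"label": ..., "box": ...}, ...]}`); `objects_with_label` is a hypothetical helper, not part of the oceanir API:

```python
def objects_with_label(result: dict, label: str) -> list[dict]:
    """Filter detected objects by label from a structured result."""
    return [obj for obj in result.get("objects", []) if obj["label"] == label]

# Example result in the documented shape
result = {"objects": [{"label": "car", "box": [10, 20, 110, 90]},
                      {"label": "person", "box": [200, 40, 240, 180]}]}
print(objects_with_label(result, "car"))
# [{'label': 'car', 'box': [10, 20, 110, 90]}]
```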
### Complex OCR
Improved text recognition across cluttered, low-resolution, or distorted regions — enabling accurate extraction from documents, diagrams, labels, screens, and dense real-world scenes.
```python
text = model.ocr(image) # Extracts text from any visual content
```
### Desktop Use
Better performance on everyday desktop and mobile workflows such as UI understanding and navigation, making Oculus faster and more capable for agentic use cases.
```python
elements = model.detect_ui(screenshot)
# Returns: [{"type": "button", "text": "Submit", "bbox": [x1,y1,x2,y2]}, ...]
```
## Architecture
**Oceanir-Oculus OO1 Architecture** — A hybrid vision-language architecture optimized for:
- Visual reasoning outperforming systems 10x larger
- Edge deployment on commodity GPUs
- Grounded perception with spatial understanding
- Tool calling and agentic workflows
## Installation
```bash
pip install oceanir
```
## Usage
```python
from oceanir import Oculus
model = Oculus.from_pretrained("OceanirAI/Oculus-0.1")
# Basic VQA
answer = model.ask("image.jpg", "What is this?")
# With reasoning traces
answer = model.ask("scene.jpg", "Count the people", think=True)
# With focus/zoom for fine details
answer = model.ask("document.jpg", "Read the fine print", focus=True)
# Structured JSON output
result = model.generate("image.jpg", prompt="Describe objects", mode="json")
# OCR
text = model.ocr("screenshot.png")
# UI Detection
ui_elements = model.detect_ui("desktop.png")
# Object Detection with grounding
boxes = model.detect("image.jpg")
# Segmentation
mask = model.segment("image.jpg")
```
## Output Modes
| Mode | Method | Output |
|------|--------|--------|
| Text | `model.ask(image, question)` | Natural language answer |
| Reasoning | `model.ask(image, question, think=True)` | Answer preceded by a reasoning trace |
| JSON | `model.generate(image, mode="json")` | Structured JSON |
| Points | `model.generate(image, mode="point")` | Object center points |
| Boxes | `model.detect(image)` | Bounding boxes + labels |
| Polygons | `model.segment(image)` | Segmentation masks |
| OCR | `model.ocr(image)` | Extracted text + locations |
| UI | `model.detect_ui(image)` | UI elements + types |
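A common next step with box-mode output is visualization. A minimal sketch using Pillow, assuming `model.detect` returns a list of dicts with `"label"` and `"bbox"` keys as in the examples above; `draw_boxes` is a hypothetical helper, not part of the oceanir API:

```python
from PIL import Image, ImageDraw

def draw_boxes(image: Image.Image, boxes: list[dict]) -> Image.Image:
    """Overlay labeled bounding boxes on a copy of the image."""
    annotated = image.copy()
    draw = ImageDraw.Draw(annotated)
    for det in boxes:
        x1, y1, x2, y2 = det["bbox"]
        draw.rectangle((x1, y1, x2, y2), outline="red", width=2)
        draw.text((x1, max(0, y1 - 12)), det["label"], fill="red")
    return annotated

# boxes = model.detect("image.jpg")
# draw_boxes(Image.open("image.jpg"), boxes).save("annotated.jpg")
```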
## Special Tokens
| Token | Purpose |
|-------|---------|
| `...` | Reasoning traces |
| `...` | Focus/zoom regions |
| `...` | Structured output |
| `...` | Bounding box coordinates |
| `...` | Point coordinates |
## Use Cases
- **Robotics**: Grounded perception for manipulation and navigation
- **Industrial Inspection**: Defect detection and quality control
- **Document Processing**: Complex OCR and form extraction
- **Media Search**: Visual content understanding and retrieval
- **Desktop Automation**: UI understanding for agentic workflows
- **Security**: Visual monitoring and anomaly detection
## What's in This Repo
- `trained_components/projector.npz` - Vision-language projector
- `trained_components/heads.pth` - Task heads (detection, segmentation, OCR, UI)
- `oculus_unified_model/` - Model code
## License
Oceanir Research License - Non-commercial research only.
For commercial licensing: licensing@oceanir.ai