---
license: other
license_name: oceanir-research-license
license_link: LICENSE
language:
- en
library_name: oceanir
pipeline_tag: image-text-to-text
tags:
- vision
- multimodal
- vision-language
- vqa
- reasoning
- thinking-traces
- structured-output
- ocr
- ui-understanding
- tool-calling
- grounding
- robotics
- edge-deployment
- oculus
---

# Oculus 0.1

**Hybrid-reasoning vision-language model built on the Oceanir-Oculus OO1 Architecture.**

Oculus 0.1 is a small model that outperforms systems 10x larger on visual reasoning and perception tasks while running on commodity GPUs or edge devices.

## What's New in Oculus 0.1

### Reasoning via Thinking Traces
Short, structured reasoning traces improve multi-step decisions, small-object understanding, and ambiguous spatial tasks.

```python
answer = model.ask(image, "How many red cars on the left?", think=True)
# Output includes <think>...</think> reasoning trace
```

### Perceptive Tool Calling + Focus (Zoom & Crop)
Oculus can trigger tool calls to focus (zoom and crop) and re-query on smaller regions — dramatically improving fine-grained perception.

```python
answer = model.ask(image, "Read the small text on the sign", focus=True)
# Model automatically zooms to relevant region
```
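The zoom-and-crop step can be sketched independently of the model: given a pixel-space region of interest, expand it by a margin and clamp it to the image bounds before re-querying. The `expand_and_clamp` helper below is a hypothetical illustration of that step, not part of the oceanir API.

```python
def expand_and_clamp(box, img_w, img_h, margin=0.15):
    """Expand an [x1, y1, x2, y2] pixel box by a relative margin,
    then clamp it to the image bounds."""
    x1, y1, x2, y2 = box
    dx = (x2 - x1) * margin
    dy = (y2 - y1) * margin
    return [
        max(0, int(x1 - dx)),
        max(0, int(y1 - dy)),
        min(img_w, int(x2 + dx)),
        min(img_h, int(y2 + dy)),
    ]

# Example: a small sign region in a 1920x1080 frame
crop = expand_and_clamp([100, 200, 200, 240], 1920, 1080)
# -> [85, 194, 215, 246]
```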

### Structured Outputs
Structured output generation is more reliable, producing consistent JSON for predictable downstream integration.

```python
result = model.generate(image, prompt="List all objects", mode="json")
# Returns structured JSON: {"objects": [{"label": "car", "box": [x1,y1,x2,y2]}, ...]}
```

### Complex OCR
Improved text recognition across cluttered, low-resolution, or distorted regions — enabling accurate extraction from documents, diagrams, labels, screens, and dense real-world scenes.

```python
text = model.ocr(image)  # Extracts text from any visual content
```

### Desktop Use
Better performance on everyday desktop and mobile workflows such as UI understanding and navigation, making Oculus faster and more capable for agentic use cases.

```python
elements = model.detect_ui(screenshot)
# Returns: [{"type": "button", "text": "Submit", "bbox": [x1,y1,x2,y2]}, ...]
```
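Given the element list shape shown above, locating a target control for an agentic click is plain filtering. The `find_element` helper below is a hypothetical sketch assuming that return shape, not part of the oceanir API.

```python
def find_element(elements, text, etype="button"):
    """Return the first UI element matching a type and (case-insensitive) text."""
    for el in elements:
        if el["type"] == etype and el["text"].lower() == text.lower():
            return el
    return None

elements = [
    {"type": "button", "text": "Cancel", "bbox": [10, 10, 80, 40]},
    {"type": "button", "text": "Submit", "bbox": [90, 10, 170, 40]},
]
submit = find_element(elements, "submit")

# Click target: center of the bounding box
x1, y1, x2, y2 = submit["bbox"]
center = ((x1 + x2) // 2, (y1 + y2) // 2)  # -> (130, 25)
```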

## Architecture

**Oceanir-Oculus OO1 Architecture** — A hybrid vision-language architecture optimized for:
- Visual reasoning outperforming systems 10x larger
- Edge deployment on commodity GPUs
- Grounded perception with spatial understanding
- Tool calling and agentic workflows

## Installation

```bash
pip install oceanir
```

## Usage

```python
from oceanir import Oculus

model = Oculus.from_pretrained("OceanirAI/Oculus-0.1")

# Basic VQA
answer = model.ask("image.jpg", "What is this?")

# With reasoning traces
answer = model.ask("scene.jpg", "Count the people", think=True)

# With focus/zoom for fine details
answer = model.ask("document.jpg", "Read the fine print", focus=True)

# Structured JSON output
result = model.generate("image.jpg", prompt="Describe objects", mode="json")

# OCR
text = model.ocr("screenshot.png")

# UI Detection
ui_elements = model.detect_ui("desktop.png")

# Object Detection with grounding
boxes = model.detect("image.jpg")

# Segmentation
mask = model.segment("image.jpg")
```

## Output Modes

| Mode | Method | Output |
|------|--------|--------|
| Text | `model.ask(image, question)` | Natural language answer |
| Reasoning | `model.ask(image, question, think=True)` | Answer with `<think>` trace |
| JSON | `model.generate(image, mode="json")` | Structured JSON |
| Points | `model.generate(image, mode="point")` | Object center points |
| Boxes | `model.detect(image)` | Bounding boxes + labels |
| Polygons | `model.segment(image)` | Segmentation masks |
| OCR | `model.ocr(image)` | Extracted text + locations |
| UI | `model.detect_ui(image)` | UI elements + types |
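JSON-mode output of the shape shown earlier (`{"objects": [...]}`) can be consumed with the standard library alone. A minimal sketch, assuming that shape:

```python
import json

raw = ('{"objects": [{"label": "car", "box": [10, 20, 110, 70]},'
       ' {"label": "person", "box": [120, 30, 160, 150]}]}')
data = json.loads(raw)

def box_area(box):
    """Area of an [x1, y1, x2, y2] box, floored at zero."""
    x1, y1, x2, y2 = box
    return max(0, x2 - x1) * max(0, y2 - y1)

# Sort detections by area, largest first
objects = sorted(data["objects"], key=lambda o: box_area(o["box"]), reverse=True)
labels = [o["label"] for o in objects]  # -> ["car", "person"]
```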

## Special Tokens

| Token | Purpose |
|-------|---------|
| `<think>...</think>` | Reasoning traces |
| `<focus>...</focus>` | Focus/zoom regions |
| `<json>...</json>` | Structured output |
| `<box>...</box>` | Bounding box coordinates |
| `<point>...</point>` | Point coordinates |
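When consuming raw generations, the paired tokens above can be extracted with a regular expression. The parser below is a sketch assuming the tokens are emitted in the literal `<tag>...</tag>` form listed; it is not part of the oceanir API.

```python
import re

# Matches any of the documented paired special tokens
TOKEN_RE = re.compile(r"<(think|focus|json|box|point)>(.*?)</\1>", re.DOTALL)

def parse_tokens(output):
    """Return a list of (token, content) pairs in generation order."""
    return [(m.group(1), m.group(2).strip()) for m in TOKEN_RE.finditer(output)]

sample = "<think>Count left-side red cars: two near the curb.</think>There are 2 red cars."
tokens = parse_tokens(sample)

# Strip traces to get the user-facing answer
answer = TOKEN_RE.sub("", sample).strip()  # -> "There are 2 red cars."
```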

## Use Cases

- **Robotics**: Grounded perception for manipulation and navigation
- **Industrial Inspection**: Defect detection and quality control
- **Document Processing**: Complex OCR and form extraction
- **Media Search**: Visual content understanding and retrieval
- **Desktop Automation**: UI understanding for agentic workflows
- **Security**: Visual monitoring and anomaly detection

## What's in This Repo

- `trained_components/projector.npz` - Vision-language projector
- `trained_components/heads.pth` - Task heads (detection, segmentation, OCR, UI)
- `oculus_unified_model/` - Model code

## License

Oceanir Research License - Non-commercial research only.

For commercial licensing: licensing@oceanir.ai