---
license: other
license_name: oceanir-research-license
license_link: LICENSE
language:
- en
library_name: oceanir
pipeline_tag: image-text-to-text
tags:
- vision
- multimodal
- vision-language
- vqa
- reasoning
- thinking-traces
- structured-output
- ocr
- ui-understanding
- tool-calling
- grounding
- robotics
- edge-deployment
- oculus
---
# Oculus 0.1
**Hybrid-reasoning vision-language model built on the Oceanir-Oculus OO1 Architecture.**
A small model that outperforms systems 10x larger on visual reasoning and perception tasks, running on commodity GPUs or edge devices.
## What's New in Oculus 0.1
### Reasoning via Thinking Traces
Short, structured reasoning traces improve multi-step decisions, small-object understanding, and ambiguous spatial tasks.
```python
answer = model.ask(image, "How many red cars on the left?", think=True)
# Output includes <think>...</think> reasoning trace
```
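If you want the trace and the final answer separately, the `<think>...</think>` span can be split off with a few lines of standard-library Python. This is a minimal sketch assuming the trace format shown above; `split_thinking` is a hypothetical helper, not part of the `oceanir` API.

```python
import re

def split_thinking(output: str) -> tuple[str, str]:
    """Separate the <think>...</think> trace from the final answer."""
    match = re.search(r"<think>(.*?)</think>", output, flags=re.DOTALL)
    if match is None:
        return "", output.strip()
    trace = match.group(1).strip()
    answer = (output[:match.start()] + output[match.end():]).strip()
    return trace, answer

trace, answer = split_thinking("<think>Two red cars are left of the divider.</think>2")
```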
### Perceptive Tool Calling + Focus (Zoom & Crop)
Oculus can trigger tool calls to focus (zoom and crop) and re-query on smaller regions — dramatically improving fine-grained perception.
```python
answer = model.ask(image, "Read the small text on the sign", focus=True)
# Model automatically zooms to relevant region
```
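The zoom-and-crop step boils down to computing a smaller crop box around a point of interest and clamping it to the image bounds. A sketch of that geometry (the `focus_crop` helper is illustrative, not part of the library):

```python
def focus_crop(img_w, img_h, cx, cy, zoom=2.0):
    """Compute a crop box centered on (cx, cy) that zooms in by `zoom`,
    clamped so the box stays fully inside the image."""
    crop_w, crop_h = img_w / zoom, img_h / zoom
    x1 = min(max(cx - crop_w / 2, 0), img_w - crop_w)
    y1 = min(max(cy - crop_h / 2, 0), img_h - crop_h)
    return (round(x1), round(y1), round(x1 + crop_w), round(y1 + crop_h))

# A 2x zoom near the top-left corner of a 1920x1080 image:
box = focus_crop(1920, 1080, 100, 100, zoom=2.0)
```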
### Structured Outputs
More reliable structured output generation for consistent JSON and predictable downstream integration.
```python
result = model.generate(image, prompt="List all objects", mode="json")
# Returns structured JSON: {"objects": [{"label": "car", "box": [x1,y1,x2,y2]}, ...]}
```
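Since `mode="json"` returns the schema shown in the comment, downstream code can validate it into typed records with the standard library. A sketch, assuming the `{"objects": [...]}` shape above (the `DetectedObject` class is an assumption for illustration):

```python
import json
from dataclasses import dataclass

@dataclass
class DetectedObject:
    label: str
    box: list  # [x1, y1, x2, y2]

def parse_objects(raw: str) -> list[DetectedObject]:
    """Parse the model's JSON output into typed records."""
    payload = json.loads(raw)
    return [DetectedObject(o["label"], o["box"]) for o in payload["objects"]]

objs = parse_objects('{"objects": [{"label": "car", "box": [10, 20, 110, 80]}]}')
```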
### Complex OCR
Improved text recognition across cluttered, low-resolution, or distorted regions — enabling accurate extraction from documents, diagrams, labels, screens, and dense real-world scenes.
```python
text = model.ocr(image)  # Extracts text from the image
```
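When OCR returns text spans with locations (as listed under Output Modes below), a common post-processing step is sorting them into reading order: top-to-bottom lines, left-to-right within a line. A minimal sketch assuming each item carries a `"text"` and a `[x1, y1, x2, y2]` `"bbox"` (a hypothetical format, not a documented one):

```python
def reading_order(items, line_tol=10):
    """Join OCR items top-to-bottom, then left-to-right within a line.
    Items whose top edges differ by at most `line_tol` px share a line."""
    items = sorted(items, key=lambda it: (it["bbox"][1], it["bbox"][0]))
    lines, current = [], []
    for it in items:
        if current and it["bbox"][1] - current[-1]["bbox"][1] > line_tol:
            lines.append(current)
            current = []
        current.append(it)
    if current:
        lines.append(current)
    return " ".join(
        " ".join(it["text"] for it in sorted(line, key=lambda it: it["bbox"][0]))
        for line in lines
    )

text = reading_order([
    {"text": "World", "bbox": [60, 5, 110, 20]},
    {"text": "Hello", "bbox": [10, 8, 55, 22]},
    {"text": "below", "bbox": [10, 40, 60, 55]},
])
```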
### Desktop Use
Better performance on everyday desktop and mobile workflows such as UI understanding and navigation, making Oculus faster and more capable for agentic use cases.
```python
elements = model.detect_ui(screenshot)
# Returns: [{"type": "button", "text": "Submit", "bbox": [x1,y1,x2,y2]}, ...]
```
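For agentic workflows, the element list shown above is typically searched for a target widget and reduced to a click point. A sketch over the documented output shape (`find_element` is an illustrative helper, not a library function):

```python
def find_element(elements, text, etype=None):
    """Return the first UI element matching `text` (and optional `etype`)."""
    for el in elements:
        if el["text"] == text and (etype is None or el["type"] == etype):
            return el
    return None

elements = [
    {"type": "textbox", "text": "Email", "bbox": [40, 60, 400, 90]},
    {"type": "button", "text": "Submit", "bbox": [40, 120, 160, 150]},
]
submit = find_element(elements, "Submit", etype="button")
# Center of the bounding box, e.g. for a synthetic click:
center = ((submit["bbox"][0] + submit["bbox"][2]) // 2,
          (submit["bbox"][1] + submit["bbox"][3]) // 2)
```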
## Architecture
**Oceanir-Oculus OO1 Architecture** — A hybrid vision-language architecture optimized for:
- Visual reasoning outperforming systems 10x larger
- Edge deployment on commodity GPUs
- Grounded perception with spatial understanding
- Tool calling and agentic workflows
## Installation
```bash
pip install oceanir
```
## Usage
```python
from oceanir import Oculus
model = Oculus.from_pretrained("OceanirAI/Oculus-0.1")
# Basic VQA
answer = model.ask("image.jpg", "What is this?")
# With reasoning traces
answer = model.ask("scene.jpg", "Count the people", think=True)
# With focus/zoom for fine details
answer = model.ask("document.jpg", "Read the fine print", focus=True)
# Structured JSON output
result = model.generate(image, prompt="Describe objects", mode="json")
# OCR
text = model.ocr("screenshot.png")
# UI Detection
ui_elements = model.detect_ui("desktop.png")
# Object Detection with grounding
boxes = model.detect("image.jpg")
# Segmentation
mask = model.segment("image.jpg")
```
## Output Modes
| Mode | Method | Output |
|------|--------|--------|
| Text | `model.ask(image, question)` | Natural language answer |
| Reasoning | `model.ask(image, question, think=True)` | Answer with `<think>` trace |
| JSON | `model.generate(image, mode="json")` | Structured JSON |
| Points | `model.generate(image, mode="point")` | Object center points |
| Boxes | `model.detect(image)` | Bounding boxes + labels |
| Polygons | `model.segment(image)` | Segmentation masks |
| OCR | `model.ocr(image)` | Extracted text + locations |
| UI | `model.detect_ui(image)` | UI elements + types |
## Special Tokens
| Token | Purpose |
|-------|---------|
| `<think>...</think>` | Reasoning traces |
| `<focus>...</focus>` | Focus/zoom regions |
| `<json>...</json>` | Structured output |
| `<box>...</box>` | Bounding box coordinates |
| `<point>...</point>` | Point coordinates |
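All five token pairs follow the same open/close pattern, so raw model output can be segmented with one regular expression. A minimal sketch assuming well-nested, non-overlapping spans as in the table above:

```python
import re

SPECIAL_TOKENS = ("think", "focus", "json", "box", "point")
TOKEN_RE = re.compile(r"<(%s)>(.*?)</\1>" % "|".join(SPECIAL_TOKENS), re.DOTALL)

def extract_segments(output: str) -> list[tuple[str, str]]:
    """Return (token, content) pairs for every special-token span in `output`."""
    return [(m.group(1), m.group(2)) for m in TOKEN_RE.finditer(output)]

segments = extract_segments("<think>zoom needed</think><box>12,8,40,30</box>")
```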
## Use Cases
- **Robotics**: Grounded perception for manipulation and navigation
- **Industrial Inspection**: Defect detection and quality control
- **Document Processing**: Complex OCR and form extraction
- **Media Search**: Visual content understanding and retrieval
- **Desktop Automation**: UI understanding for agentic workflows
- **Security**: Visual monitoring and anomaly detection
## What's in This Repo
- `trained_components/projector.npz` - Vision-language projector
- `trained_components/heads.pth` - Task heads (detection, segmentation, OCR, UI)
- `oculus_unified_model/` - Model code
## License
Oceanir Research License: non-commercial research use only.
For commercial licensing: licensing@oceanir.ai