|
|
--- |
|
|
license: other |
|
|
license_name: oceanir-research-license |
|
|
license_link: LICENSE |
|
|
language: |
|
|
- en |
|
|
library_name: oceanir |
|
|
pipeline_tag: image-text-to-text |
|
|
tags: |
|
|
- vision |
|
|
- multimodal |
|
|
- vision-language |
|
|
- vqa |
|
|
- reasoning |
|
|
- thinking-traces |
|
|
- structured-output |
|
|
- ocr |
|
|
- ui-understanding |
|
|
- tool-calling |
|
|
- grounding |
|
|
- robotics |
|
|
- edge-deployment |
|
|
- oculus |
|
|
--- |
|
|
|
|
|
# Oculus 0.1 |
|
|
|
|
|
**Hybrid-reasoning vision-language model built on the Oceanir-Oculus OO1 Architecture.** |
|
|
|
|
|
Oculus is a small model that outperforms systems 10x larger on visual reasoning and perception tasks while running on commodity GPUs or edge devices.
|
|
|
|
|
## What's New in Oculus 0.1 |
|
|
|
|
|
### Reasoning via Thinking Traces |
|
|
Short, structured reasoning traces improve multi-step decisions, small-object understanding, and ambiguous spatial tasks. |
|
|
|
|
|
```python |
|
|
answer = model.ask(image, "How many red cars on the left?", think=True) |
|
|
# Output includes <think>...</think> reasoning trace |
|
|
``` |
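
Assuming the trace is returned inline in the answer string, as the comment above suggests, a minimal stdlib sketch for separating the reasoning trace from the visible answer (the single-trace format is an assumption; adapt the pattern if your model version emits multiple traces):

```python
import re

def split_thinking(raw: str) -> tuple[str, str]:
    """Separate the <think>...</think> trace from the visible answer.

    Assumes at most one inline trace, as in the example output above.
    """
    match = re.search(r"<think>(.*?)</think>", raw, flags=re.DOTALL)
    trace = match.group(1).strip() if match else ""
    # Remove the trace so only the user-facing answer remains.
    answer = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL).strip()
    return trace, answer

raw = "<think>Left half: one red sedan, one red SUV.</think>Two red cars."
trace, answer = split_thinking(raw)  # answer == "Two red cars."
```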
|
|
|
|
|
### Perceptive Tool Calling + Focus (Zoom & Crop) |
|
|
Oculus can trigger tool calls to focus (zoom and crop) and re-query smaller regions, substantially improving fine-grained perception.
|
|
|
|
|
```python |
|
|
answer = model.ask(image, "Read the small text on the sign", focus=True) |
|
|
# Model automatically zooms to relevant region |
|
|
``` |
|
|
|
|
|
### Structured Outputs |
|
|
More reliable structured output generation for consistent JSON and predictable downstream integration. |
|
|
|
|
|
```python |
|
|
result = model.generate(image, prompt="List all objects", mode="json") |
|
|
# Returns structured JSON: {"objects": [{"label": "car", "box": [x1,y1,x2,y2]}, ...]} |
|
|
``` |
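
The `{"objects": [...]}` schema is an assumption taken from the comment above; a defensive parse with the stdlib `json` module keeps downstream code robust to malformed entries:

```python
import json

def parse_objects(raw: str) -> list[dict]:
    """Parse JSON-mode output into a list of object dicts.

    The {"objects": [...]} schema is assumed from the example above;
    adjust the key if your model version returns a different shape.
    """
    data = json.loads(raw)
    objects = data.get("objects", [])
    # Keep only entries with a label and a 4-element box.
    return [o for o in objects if "label" in o and len(o.get("box", [])) == 4]

raw = '{"objects": [{"label": "car", "box": [10, 20, 110, 80]}, {"label": "tree"}]}'
objects = parse_objects(raw)  # the box-less "tree" entry is dropped
```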
|
|
|
|
|
### Complex OCR |
|
|
Improved text recognition across cluttered, low-resolution, or distorted regions — enabling accurate extraction from documents, diagrams, labels, screens, and dense real-world scenes. |
|
|
|
|
|
```python |
|
|
text = model.ocr(image)  # Extract recognized text from the image
|
|
``` |
|
|
|
|
|
### Desktop Use |
|
|
Better performance on everyday desktop and mobile workflows such as UI understanding and navigation, making Oculus faster and more capable for agentic use cases. |
|
|
|
|
|
```python |
|
|
elements = model.detect_ui(screenshot) |
|
|
# Returns: [{"type": "button", "text": "Submit", "bbox": [x1,y1,x2,y2]}, ...] |
|
|
``` |
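
For agentic desktop use, a detected element is typically turned into a click target. A small sketch, assuming the `bbox` follows the `[x1, y1, x2, y2]` convention shown in the comment above:

```python
def click_point(element: dict) -> tuple[float, float]:
    """Center of a UI element's bounding box ([x1, y1, x2, y2] assumed)."""
    x1, y1, x2, y2 = element["bbox"]
    return ((x1 + x2) / 2, (y1 + y2) / 2)

# Sample output in the shape shown above (values are illustrative).
elements = [
    {"type": "input", "text": "", "bbox": [100, 150, 300, 180]},
    {"type": "button", "text": "Submit", "bbox": [100, 200, 180, 230]},
]
submit = next(e for e in elements if e["type"] == "button" and e["text"] == "Submit")
x, y = click_point(submit)  # (140.0, 215.0)
```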
|
|
|
|
|
## Architecture |
|
|
|
|
|
**Oceanir-Oculus OO1 Architecture** — A hybrid vision-language architecture optimized for: |
|
|
- Visual reasoning outperforming systems 10x larger |
|
|
- Edge deployment on commodity GPUs |
|
|
- Grounded perception with spatial understanding |
|
|
- Tool calling and agentic workflows |
|
|
|
|
|
## Installation |
|
|
|
|
|
```bash |
|
|
pip install oceanir |
|
|
``` |
|
|
|
|
|
## Usage |
|
|
|
|
|
```python |
|
|
from oceanir import Oculus |
|
|
|
|
|
model = Oculus.from_pretrained("OceanirAI/Oculus-0.1") |
|
|
|
|
|
# Basic VQA |
|
|
answer = model.ask("image.jpg", "What is this?") |
|
|
|
|
|
# With reasoning traces |
|
|
answer = model.ask("scene.jpg", "Count the people", think=True) |
|
|
|
|
|
# With focus/zoom for fine details |
|
|
answer = model.ask("document.jpg", "Read the fine print", focus=True) |
|
|
|
|
|
# Structured JSON output |
|
|
result = model.generate(image, prompt="Describe objects", mode="json") |
|
|
|
|
|
# OCR |
|
|
text = model.ocr("screenshot.png") |
|
|
|
|
|
# UI Detection |
|
|
ui_elements = model.detect_ui("desktop.png") |
|
|
|
|
|
# Object Detection with grounding |
|
|
boxes = model.detect("image.jpg") |
|
|
|
|
|
# Segmentation |
|
|
mask = model.segment("image.jpg") |
|
|
``` |
|
|
|
|
|
## Output Modes |
|
|
|
|
|
| Mode | Method | Output | |
|
|
|------|--------|--------| |
|
|
| Text | `model.ask(image, question)` | Natural language answer | |
|
|
| Reasoning | `model.ask(image, question, think=True)` | Answer with `<think>` trace | |
|
|
| JSON | `model.generate(image, mode="json")` | Structured JSON | |
|
|
| Points | `model.generate(image, mode="point")` | Object center points | |
|
|
| Boxes | `model.detect(image)` | Bounding boxes + labels | |
|
|
| Polygons | `model.segment(image)` | Segmentation masks | |
|
|
| OCR | `model.ocr(image)` | Extracted text + locations | |
|
|
| UI | `model.detect_ui(image)` | UI elements + types | |
|
|
|
|
|
## Special Tokens |
|
|
|
|
|
| Token | Purpose | |
|
|
|-------|---------| |
|
|
| `<think>...</think>` | Reasoning traces | |
|
|
| `<focus>...</focus>` | Focus/zoom regions | |
|
|
| `<json>...</json>` | Structured output | |
|
|
| `<box>...</box>` | Bounding box coordinates | |
|
|
| `<point>...</point>` | Point coordinates | |
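
The coordinate serialization inside `<box>...</box>` is not specified in the table above; assuming comma-separated integers, a sketch for extracting box coordinates from raw model text:

```python
import re

def parse_boxes(raw: str) -> list[list[int]]:
    """Extract <box>x1,y1,x2,y2</box> spans as coordinate lists.

    The comma-separated integer format inside the token is an assumption;
    adapt the inner pattern to your model's actual serialization.
    """
    boxes = []
    for span in re.findall(r"<box>(.*?)</box>", raw, flags=re.DOTALL):
        parts = [int(p) for p in re.findall(r"-?\d+", span)]
        if len(parts) == 4:  # skip malformed spans
            boxes.append(parts)
    return boxes

raw = "A car <box>10,20,110,80</box> next to a sign <box>200,30,240,90</box>."
boxes = parse_boxes(raw)  # [[10, 20, 110, 80], [200, 30, 240, 90]]
```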
|
|
|
|
|
## Use Cases |
|
|
|
|
|
- **Robotics**: Grounded perception for manipulation and navigation |
|
|
- **Industrial Inspection**: Defect detection and quality control |
|
|
- **Document Processing**: Complex OCR and form extraction |
|
|
- **Media Search**: Visual content understanding and retrieval |
|
|
- **Desktop Automation**: UI understanding for agentic workflows |
|
|
- **Security**: Visual monitoring and anomaly detection |
|
|
|
|
|
## What's in This Repo |
|
|
|
|
|
- `trained_components/projector.npz` - Vision-language projector |
|
|
- `trained_components/heads.pth` - Task heads (detection, segmentation, OCR, UI) |
|
|
- `oculus_unified_model/` - Model code |
|
|
|
|
|
## License |
|
|
|
|
|
Oceanir Research License: non-commercial research use only.
|
|
|
|
|
For commercial licensing: licensing@oceanir.ai |
|
|
|