|
|
--- |
|
|
license: other |
|
|
license_name: oceanir-research-license |
|
|
license_link: LICENSE |
|
|
language: |
|
|
- en |
|
|
library_name: oceanir |
|
|
pipeline_tag: image-text-to-text |
|
|
tags: |
|
|
- vision |
|
|
- multimodal |
|
|
- vision-language |
|
|
- vqa |
|
|
- reasoning |
|
|
- thinking-traces |
|
|
- structured-output |
|
|
- ocr |
|
|
- ui-understanding |
|
|
- tool-calling |
|
|
- grounding |
|
|
- robotics |
|
|
- edge-deployment |
|
|
- oculus |
|
|
--- |
|
|
|
|
|
# Oculus 0.1 |
|
|
|
|
|
**Hybrid-reasoning vision-language model built on the Oceanir-Oculus OO1 Architecture.** |
|
|
|
|
|
Oculus is a small model that outperforms systems 10x larger on visual reasoning and perception tasks while running on commodity GPUs or edge devices.
|
|
|
|
|
## What's New in Oculus 0.1 |
|
|
|
|
|
### Reasoning via Thinking Traces |
|
|
Short, structured reasoning traces improve multi-step decisions, small-object understanding, and ambiguous spatial tasks. |
|
|
|
|
|
```python |
|
|
answer = model.ask(image, "How many red cars on the left?", think=True) |
|
|
# Output includes <think>...</think> reasoning trace |
|
|
``` |
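
Assuming the trace is returned inline in the answer string, as the comment above suggests, a minimal stdlib sketch for separating the reasoning trace from the visible answer (the single-trace format is an assumption; adapt the pattern if your model version emits multiple traces):

```python
import re

def split_thinking(raw: str) -> tuple[str, str]:
    """Separate the <think>...</think> trace from the visible answer.

    Assumes at most one inline trace, as in the example output above.
    """
    match = re.search(r"<think>(.*?)</think>", raw, flags=re.DOTALL)
    trace = match.group(1).strip() if match else ""
    # Remove the trace so only the user-facing answer remains.
    answer = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL).strip()
    return trace, answer

raw = "<think>Left half: one red sedan, one red SUV.</think>Two red cars."
trace, answer = split_thinking(raw)  # answer == "Two red cars."
```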
|
|
|
|
|
### Perceptive Tool Calling + Focus (Zoom & Crop) |
|
|
Oculus can trigger tool calls to focus (zoom and crop) and re-query smaller regions, substantially improving fine-grained perception.
|
|
|
|
|
```python |
|
|
answer = model.ask(image, "Read the small text on the sign", focus=True) |
|
|
# Model automatically zooms to relevant region |
|
|
``` |
|
|
|
|
|
### Structured Outputs |
|
|
More reliable structured output generation for consistent JSON and predictable downstream integration. |
|
|
|
|
|
```python |
|
|
result = model.generate(image, prompt="List all objects", mode="json") |
|
|
# Returns structured JSON: {"objects": [{"label": "car", "box": [x1,y1,x2,y2]}, ...]} |
|
|
``` |
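
The `{"objects": [...]}` schema is an assumption taken from the comment above; a defensive parse with the stdlib `json` module keeps downstream code robust to malformed entries:

```python
import json

def parse_objects(raw: str) -> list[dict]:
    """Parse JSON-mode output into a list of object dicts.

    The {"objects": [...]} schema is assumed from the example above;
    adjust the key if your model version returns a different shape.
    """
    data = json.loads(raw)
    objects = data.get("objects", [])
    # Keep only entries with a label and a 4-element box.
    return [o for o in objects if "label" in o and len(o.get("box", [])) == 4]

raw = '{"objects": [{"label": "car", "box": [10, 20, 110, 80]}, {"label": "tree"}]}'
objects = parse_objects(raw)  # the box-less "tree" entry is dropped
```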
|
|
|
|
|
### Complex OCR |
|
|
Improved text recognition across cluttered, low-resolution, or distorted regions — enabling accurate extraction from documents, diagrams, labels, screens, and dense real-world scenes. |
|
|
|
|
|
```python |
|
|
text = model.ocr(image)  # Extract recognized text from the image
|
|
``` |
|
|
|
|
|
### Desktop Use |
|
|
Better performance on everyday desktop and mobile workflows such as UI understanding and navigation, making Oculus faster and more capable for agentic use cases. |
|
|
|
|
|
```python |
|
|
elements = model.detect_ui(screenshot) |
|
|
# Returns: [{"type": "button", "text": "Submit", "bbox": [x1,y1,x2,y2]}, ...] |
|
|
``` |
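
For agentic desktop use, a detected element is typically turned into a click target. A small sketch, assuming the `bbox` follows the `[x1, y1, x2, y2]` convention shown in the comment above:

```python
def click_point(element: dict) -> tuple[float, float]:
    """Center of a UI element's bounding box ([x1, y1, x2, y2] assumed)."""
    x1, y1, x2, y2 = element["bbox"]
    return ((x1 + x2) / 2, (y1 + y2) / 2)

# Sample output in the shape shown above (values are illustrative).
elements = [
    {"type": "input", "text": "", "bbox": [100, 150, 300, 180]},
    {"type": "button", "text": "Submit", "bbox": [100, 200, 180, 230]},
]
submit = next(e for e in elements if e["type"] == "button" and e["text"] == "Submit")
x, y = click_point(submit)  # (140.0, 215.0)
```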
|
|
|
|
|
## Architecture |
|
|
|
|
|
**Oceanir-Oculus OO1 Architecture** — A hybrid vision-language architecture optimized for: |
|
|
- Visual reasoning outperforming systems 10x larger |
|
|
- Edge deployment on commodity GPUs |
|
|
- Grounded perception with spatial understanding |
|
|
- Tool calling and agentic workflows |
|
|
|
|
|
## Installation |
|
|
|
|
|
```bash |
|
|
pip install oceanir |
|
|
``` |
|
|
|
|
|
## Usage |
|
|
|
|
|
```python |
|
|
from oceanir import Oculus |
|
|
|
|
|
model = Oculus.from_pretrained("OceanirAI/Oculus-0.1") |
|
|
|
|
|
# Basic VQA |
|
|
answer = model.ask("image.jpg", "What is this?") |
|
|
|
|
|
# With reasoning traces |
|
|
answer = model.ask("scene.jpg", "Count the people", think=True) |
|
|
|
|
|
# With focus/zoom for fine details |
|
|
answer = model.ask("document.jpg", "Read the fine print", focus=True) |
|
|
|
|
|
# Structured JSON output |
|
|
result = model.generate(image, prompt="Describe objects", mode="json") |
|
|
|
|
|
# OCR |
|
|
text = model.ocr("screenshot.png") |
|
|
|
|
|
# UI Detection |
|
|
ui_elements = model.detect_ui("desktop.png") |
|
|
|
|
|
# Object Detection with grounding |
|
|
boxes = model.detect("image.jpg") |
|
|
|
|
|
# Segmentation |
|
|
mask = model.segment("image.jpg") |
|
|
``` |
|
|
|
|
|
## Output Modes |
|
|
|
|
|
| Mode | Method | Output | |
|
|
|------|--------|--------| |
|
|
| Text | `model.ask(image, question)` | Natural language answer | |
|
|
| Reasoning | `model.ask(image, question, think=True)` | Answer with `<think>` trace | |
|
|
| JSON | `model.generate(image, mode="json")` | Structured JSON | |
|
|
| Points | `model.generate(image, mode="point")` | Object center points | |
|
|
| Boxes | `model.detect(image)` | Bounding boxes + labels | |
|
|
| Polygons | `model.segment(image)` | Segmentation masks | |
|
|
| OCR | `model.ocr(image)` | Extracted text + locations | |
|
|
| UI | `model.detect_ui(image)` | UI elements + types | |
|
|
|
|
|
## Special Tokens |
|
|
|
|
|
| Token | Purpose | |
|
|
|-------|---------| |
|
|
| `<think>...</think>` | Reasoning traces | |
|
|
| `<focus>...</focus>` | Focus/zoom regions | |
|
|
| `<json>...</json>` | Structured output | |
|
|
| `<box>...</box>` | Bounding box coordinates | |
|
|
| `<point>...</point>` | Point coordinates | |
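
The coordinate serialization inside `<box>...</box>` is not specified in the table above; assuming comma-separated integers, a sketch for extracting box coordinates from raw model text:

```python
import re

def parse_boxes(raw: str) -> list[list[int]]:
    """Extract <box>x1,y1,x2,y2</box> spans as coordinate lists.

    The comma-separated integer format inside the token is an assumption;
    adapt the inner pattern to your model's actual serialization.
    """
    boxes = []
    for span in re.findall(r"<box>(.*?)</box>", raw, flags=re.DOTALL):
        parts = [int(p) for p in re.findall(r"-?\d+", span)]
        if len(parts) == 4:  # skip malformed spans
            boxes.append(parts)
    return boxes

raw = "A car <box>10,20,110,80</box> next to a sign <box>200,30,240,90</box>."
boxes = parse_boxes(raw)  # [[10, 20, 110, 80], [200, 30, 240, 90]]
```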
|
|
|
|
|
## Use Cases |
|
|
|
|
|
- **Robotics**: Grounded perception for manipulation and navigation |
|
|
- **Industrial Inspection**: Defect detection and quality control |
|
|
- **Document Processing**: Complex OCR and form extraction |
|
|
- **Media Search**: Visual content understanding and retrieval |
|
|
- **Desktop Automation**: UI understanding for agentic workflows |
|
|
- **Security**: Visual monitoring and anomaly detection |
|
|
|
|
|
## What's in This Repo |
|
|
|
|
|
- `trained_components/projector.npz` - Vision-language projector |
|
|
- `trained_components/heads.pth` - Task heads (detection, segmentation, OCR, UI) |
|
|
- `oculus_unified_model/` - Model code |
|
|
|
|
|
## License |
|
|
|
|
|
Oceanir Research License: non-commercial research use only.
|
|
|
|
|
For commercial licensing: licensing@oceanir.ai |
|
|
|