---
license: other
license_name: oceanir-research-license
license_link: LICENSE
language:
- en
library_name: oceanir
pipeline_tag: image-text-to-text
tags:
- vision
- multimodal
- vision-language
- vqa
- reasoning
- thinking-traces
- structured-output
- ocr
- ui-understanding
- tool-calling
- grounding
- robotics
- edge-deployment
- oculus
---

# Oculus 0.1

**Hybrid-reasoning vision-language model built on the Oceanir-Oculus OO1 Architecture.**

Small models that outperform systems 10x larger on visual reasoning and perception tasks, running on commodity GPUs or edge devices.

## What's New in Oculus 0.1

### Reasoning via Thinking Traces

Short, structured reasoning traces improve multi-step decisions, small-object understanding, and ambiguous spatial tasks.

```python
answer = model.ask(image, "How many red cars on the left?", think=True)
# Output includes ... reasoning trace
```

### Perceptive Tool Calling + Focus (Zoom & Crop)

Oculus can trigger tool calls to focus (zoom and crop) and re-query on smaller regions, dramatically improving fine-grained perception.

```python
answer = model.ask(image, "Read the small text on the sign", focus=True)
# Model automatically zooms to the relevant region
```

### Structured Outputs

More reliable structured output generation for consistent JSON and predictable downstream integration.

```python
result = model.generate(image, prompt="List all objects", mode="json")
# Returns structured JSON: {"objects": [{"label": "car", "box": [x1, y1, x2, y2]}, ...]}
```

### Complex OCR

Improved text recognition across cluttered, low-resolution, or distorted regions, enabling accurate extraction from documents, diagrams, labels, screens, and dense real-world scenes.

```python
text = model.ocr(image)
# Extracts text from any visual content
```

### Desktop Use

Better performance on everyday desktop and mobile workflows such as UI understanding and navigation, making Oculus faster and more capable for agentic use cases.
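For agentic workflows, the element list that `detect_ui` returns can be turned into click targets downstream. A minimal sketch, assuming only the list-of-dicts shape shown in this card's examples (`{"type": ..., "text": ..., "bbox": [x1, y1, x2, y2]}`); the helper names below are illustrative, not part of the oceanir API:

```python
# Sketch: consuming detect_ui-style output. Assumes the documented
# element shape: {"type": ..., "text": ..., "bbox": [x1, y1, x2, y2]}.
# find_element and center are hypothetical helpers, not oceanir API.

def find_element(elements, text, kind="button"):
    """Return the first element matching type and (case-insensitive) text."""
    for el in elements:
        if el["type"] == kind and el["text"].lower() == text.lower():
            return el
    return None

def center(bbox):
    """Center point of an [x1, y1, x2, y2] box, e.g. as a click target."""
    x1, y1, x2, y2 = bbox
    return ((x1 + x2) / 2, (y1 + y2) / 2)

# Hand-written sample in place of a real detect_ui call:
elements = [
    {"type": "button", "text": "Submit", "bbox": [100, 200, 180, 230]},
    {"type": "input", "text": "", "bbox": [100, 150, 300, 180]},
]
target = find_element(elements, "submit")
print(center(target["bbox"]))  # (140.0, 215.0)
```

A real pipeline would pass the resulting coordinates to whatever input-automation layer drives the desktop.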
```python
elements = model.detect_ui(screenshot)
# Returns: [{"type": "button", "text": "Submit", "bbox": [x1, y1, x2, y2]}, ...]
```

## Architecture

**Oceanir-Oculus OO1** is a hybrid vision-language architecture optimized for:

- Visual reasoning that outperforms systems 10x larger
- Edge deployment on commodity GPUs
- Grounded perception with spatial understanding
- Tool calling and agentic workflows

## Installation

```bash
pip install oceanir
```

## Usage

```python
from oceanir import Oculus

model = Oculus.from_pretrained("OceanirAI/Oculus-0.1")

# Basic VQA
answer = model.ask("image.jpg", "What is this?")

# With reasoning traces
answer = model.ask("scene.jpg", "Count the people", think=True)

# With focus/zoom for fine details
answer = model.ask("document.jpg", "Read the fine print", focus=True)

# Structured JSON output
result = model.generate("image.jpg", prompt="Describe objects", mode="json")

# OCR
text = model.ocr("screenshot.png")

# UI detection
ui_elements = model.detect_ui("desktop.png")

# Object detection with grounding
boxes = model.detect("image.jpg")

# Segmentation
mask = model.segment("image.jpg")
```

## Output Modes

| Mode | Method | Output |
|------|--------|--------|
| Text | `model.ask(image, question)` | Natural language answer |
| Reasoning | `model.ask(image, question, think=True)` | Answer with reasoning trace |
| JSON | `model.generate(image, mode="json")` | Structured JSON |
| Points | `model.generate(image, mode="point")` | Object center points |
| Boxes | `model.detect(image)` | Bounding boxes + labels |
| Polygons | `model.segment(image)` | Segmentation masks |
| OCR | `model.ocr(image)` | Extracted text + locations |
| UI | `model.detect_ui(image)` | UI elements + types |

## Special Tokens

| Token | Purpose |
|-------|---------|
| `...` | Reasoning traces |
| `...` | Focus/zoom regions |
| `...` | Structured output |
| `...` | Bounding box coordinates |
| `...` | Point coordinates |

## Use Cases

- **Robotics**: Grounded perception for manipulation and navigation
- **Industrial Inspection**: Defect detection and quality control
- **Document Processing**: Complex OCR and form extraction
- **Media Search**: Visual content understanding and retrieval
- **Desktop Automation**: UI understanding for agentic workflows
- **Security**: Visual monitoring and anomaly detection

## What's in This Repo

- `trained_components/projector.npz` - Vision-language projector
- `trained_components/heads.pth` - Task heads (detection, segmentation, OCR, UI)
- `oculus_unified_model/` - Model code

## License

Oceanir Research License - non-commercial research use only.

For commercial licensing: licensing@oceanir.ai