Upload README.md with huggingface_hub
README.md CHANGED

tags:
- vqa
- reasoning
- chain-of-thought
- structured-output
- ocr
- ui-understanding
- tool-calling
- dinov3
- siglip2
- lfm2.5

# Oculus 0.1 (~3.8B params)

**Multimodal vision-language model with Isaac 0.2 features.**

## Architecture

| Component | Model | Parameters |
|-----------|-------|------------|
| Vision Encoder 1 | DINOv3 ViT-L/16 | ~1.7B |
| Vision Encoder 2 | SigLIP2 SO400M | ~400M |
| Projector | 2-layer MLP | ~5M |
| Language Model | LFM2.5-1.2B (Liquid AI) | ~1.2B |
| Task Heads | Seg/Det/OCR/UI | ~2M |
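
The table doesn't say how the two encoders are combined. Below is a minimal sketch of one plausible wiring, assuming per-patch features from both encoders are concatenated channel-wise before the 2-layer MLP projector (all names and dimensions are illustrative guesses, not this repo's code):

```python
import torch
import torch.nn as nn

class Projector(nn.Module):
    """2-layer MLP mapping fused vision features into the LM embedding space."""
    def __init__(self, vision_dim: int, lm_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.mlp(feats)

# Hypothetical fusion: concatenate per-patch features from both encoders.
dino_feats = torch.randn(1, 196, 1024)    # DINOv3 ViT-L/16 patch features (dims are guesses)
siglip_feats = torch.randn(1, 196, 1152)  # SigLIP2 SO400M patch features (dims are guesses)
fused = torch.cat([dino_feats, siglip_feats], dim=-1)
vision_tokens = Projector(vision_dim=2176, lm_dim=2048)(fused)  # fed to the LM as prefix tokens
```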

## Isaac 0.2 Features

### 1. Reasoning via Thinking Traces
Short, structured reasoning traces improve multi-step decisions, small-object understanding, and ambiguous spatial tasks.

```python
answer = model.ask(image, "How many red cars on the left?", think=True)
# Output includes <think>...</think> reasoning trace
```
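
Because the trace comes back inline with the answer, downstream code will usually want to strip it. A small stdlib helper (the exact return format is an assumption here):

```python
import re

def split_think(answer: str) -> tuple[str, str]:
    """Separate an inline <think>...</think> trace from the visible answer (assumed format)."""
    m = re.search(r"<think>(.*?)</think>", answer, flags=re.DOTALL)
    trace = m.group(1).strip() if m else ""
    visible = re.sub(r"<think>.*?</think>", "", answer, flags=re.DOTALL).strip()
    return trace, visible
```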

### 2. Perceptive Tool Calling + Focus (Zoom & Crop)
The model can trigger tool calls that focus (zoom and crop) on smaller regions and re-query them for fine-grained perception.

```python
answer = model.ask(image, "Read the small text on the sign", focus=True)
# Model automatically zooms to the relevant region
```
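
Outside the model, a focus step amounts to cropping and upscaling a region before re-encoding it. For intuition only, with Pillow and a made-up box:

```python
from PIL import Image

img = Image.open("image.jpg")
box = (420, 310, 660, 420)  # (left, upper, right, lower) from a hypothetical <focus> tag
region = img.crop(box).resize((480, 220))  # zoom the crop before re-querying the model
```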

### 3. Structured Outputs
Reliable JSON output generation for consistent downstream integration.

```python
result = model.generate(image, prompt="List all objects", mode="json")
# Returns structured JSON: {"objects": [{"label": "car", "confidence": 0.95}, ...]}
```
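
Whether `result` arrives as a parsed dict or a raw JSON string isn't specified, so defensive handling is cheap:

```python
import json

# Accept either a raw string or an already-parsed dict.
data = json.loads(result) if isinstance(result, str) else result
for obj in data.get("objects", []):
    print(obj["label"], obj.get("confidence"))
```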

### 4. Complex OCR
Improved text recognition across cluttered, low-resolution, or distorted regions.

```python
text = model.ocr(image)  # Extracts text from documents, diagrams, labels, screens
```

### 5. Desktop UI Understanding
Better performance on desktop and mobile workflows for agentic use cases.

```python
elements = model.detect_ui(screenshot)
# Returns: [{"type": "button", "text": "Submit", "bbox": [x1,y1,x2,y2]}, ...]
```
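
In an agent loop these detections would drive an input controller. A sketch with the third-party `pyautogui` package (not part of this repo), clicking the center of a detected button:

```python
import pyautogui  # third-party input controller, not bundled with Oculus

for el in elements:
    if el["type"] == "button" and el["text"] == "Submit":
        x1, y1, x2, y2 = el["bbox"]
        pyautogui.click((x1 + x2) / 2, (y1 + y2) / 2)  # click the button's center
        break
```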

## Usage

```python
from oceanir import Oculus

model = Oculus.from_pretrained("OceanirAI/Oculus-0.1")

# Basic VQA
answer = model.ask("image.jpg", "What is this?")

# With reasoning traces
answer = model.ask("scene.jpg", "Count the people", think=True)

# With focus/zoom for small objects
answer = model.ask("document.jpg", "Read the fine print", focus=True)

# Structured JSON output
result = model.generate("image.jpg", prompt="Describe objects", mode="json")

# OCR
text = model.ocr("screenshot.png")

# UI Detection
ui_elements = model.detect_ui("desktop.png")

# Detection
boxes = model.detect("image.jpg")

# Segmentation
mask = model.segment("image.jpg")
```

## What's in this repo

- `trained_components/projector.npz` - Vision-language projector
- `trained_components/heads.pth` - Task heads (detection, segmentation, OCR, UI)
- `oculus_unified_model/` - Model code
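
Both component files open with standard tooling, which makes a quick sanity check easy (the key names inside the archives aren't documented here):

```python
import numpy as np
import torch

proj = np.load("trained_components/projector.npz")
print(proj.files)  # array names in the projector archive

heads = torch.load("trained_components/heads.pth", map_location="cpu")
print(type(heads))  # a state dict or a module, depending on how it was saved
```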

Base models load from source repos:
- [facebook/dinov3-vitl16-pretrain-lvd1689m](https://huggingface.co/facebook/dinov3-vitl16-pretrain-lvd1689m)
- [google/siglip2-so400m-patch16-naflex](https://huggingface.co/google/siglip2-so400m-patch16-naflex)
- [LiquidAI/LFM2.5-1.2B-Base](https://huggingface.co/LiquidAI/LFM2.5-1.2B-Base)
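
Since the base weights aren't bundled here, pre-fetching them into the local cache avoids a slow first load (the DINOv3 repo may be gated and require accepting its license on the Hub first):

```python
from huggingface_hub import snapshot_download

# Pre-fetch base model weights into the local Hugging Face cache.
for repo_id in (
    "facebook/dinov3-vitl16-pretrain-lvd1689m",
    "google/siglip2-so400m-patch16-naflex",
    "LiquidAI/LFM2.5-1.2B-Base",
):
    snapshot_download(repo_id)
```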

## Special Tokens

| Token | Purpose |
|-------|---------|
| `<think>...</think>` | Reasoning traces |
| `<focus>...</focus>` | Focus/zoom regions |
| `<json>...</json>` | Structured output |

## License