kobiakor15 committed
Commit cb1db66 · verified · 1 Parent(s): 95c5fe2

Upload README.md with huggingface_hub

Files changed (1):
  1. README.md +63 -43
README.md CHANGED
@@ -12,39 +12,26 @@ tags:
  - vision-language
  - vqa
  - reasoning
- - chain-of-thought
  - structured-output
  - ocr
  - ui-understanding
  - tool-calling
- - dinov3
- - siglip2
- - lfm2.5
- - liquid-ai
  - oculus
- base_model:
- - facebook/dinov3-vitl16-pretrain-lvd1689m
- - google/siglip2-so400m-patch16-naflex
- - LiquidAI/LFM2.5-1.2B-Base
  ---

- # Oculus 0.1 (~3.8B params)

- **Multimodal vision-language model with Isaac 0.2 features.**

- ## Architecture
-
- | Component | Model | Parameters |
- |-----------|-------|------------|
- | Vision Encoder 1 | DINOv3 ViT-L/16 | ~1.7B |
- | Vision Encoder 2 | SigLIP2 SO400M | ~400M |
- | Projector | 2-layer MLP | ~5M |
- | Language Model | LFM2.5-1.2B (Liquid AI) | ~1.2B |
- | Task Heads | Seg/Det/OCR/UI | ~2M |

- ## Isaac 0.2 Features

- ### 1. Reasoning via Thinking Traces
  Short, structured reasoning traces improve multi-step decisions, small-object understanding, and ambiguous spatial tasks.

  ```python
@@ -52,37 +39,51 @@ answer = model.ask(image, "How many red cars on the left?", think=True)
  # Output includes <think>...</think> reasoning trace
  ```

- ### 2. Perceptive Tool Calling + Focus (Zoom & Crop)
- Trigger tool calls to focus (zoom and crop) and re-query on smaller regions for fine-grained perception.

  ```python
  answer = model.ask(image, "Read the small text on the sign", focus=True)
  # Model automatically zooms to relevant region
  ```

- ### 3. Structured Outputs
- Reliable JSON output generation for consistent downstream integration.

  ```python
  result = model.generate(image, prompt="List all objects", mode="json")
- # Returns structured JSON: {"objects": [{"label": "car", "confidence": 0.95}, ...]}
  ```

- ### 4. Complex OCR
- Improved text recognition across cluttered, low-resolution, or distorted regions.

  ```python
- text = model.ocr(image)  # Extracts text from documents, diagrams, labels, screens
  ```

- ### 5. Desktop UI Understanding
- Better performance on desktop and mobile workflows for agentic use cases.

  ```python
  elements = model.detect_ui(screenshot)
  # Returns: [{"type": "button", "text": "Submit", "bbox": [x1,y1,x2,y2]}, ...]
  ```

  ## Usage

  ```python
@@ -96,7 +97,7 @@ answer = model.ask("image.jpg", "What is this?")
  # With reasoning traces
  answer = model.ask("scene.jpg", "Count the people", think=True)

- # With focus/zoom for small objects
  answer = model.ask("document.jpg", "Read the fine print", focus=True)

  # Structured JSON output
@@ -108,23 +109,25 @@ text = model.ocr("screenshot.png")
  # UI Detection
  ui_elements = model.detect_ui("desktop.png")

- # Detection
  boxes = model.detect("image.jpg")

  # Segmentation
  mask = model.segment("image.jpg")
  ```

- ## What's in this repo

- - `trained_components/projector.npz` - Vision-language projector
- - `trained_components/heads.pth` - Task heads (detection, segmentation, OCR, UI)
- - `oculus_unified_model/` - Model code
-
- Base models load from source repos:
- - [facebook/dinov3-vitl16-pretrain-lvd1689m](https://huggingface.co/facebook/dinov3-vitl16-pretrain-lvd1689m)
- - [google/siglip2-so400m-patch16-naflex](https://huggingface.co/google/siglip2-so400m-patch16-naflex)
- - [LiquidAI/LFM2.5-1.2B-Base](https://huggingface.co/LiquidAI/LFM2.5-1.2B-Base)

  ## Special Tokens

@@ -133,6 +136,23 @@ Base models load from source repos:
  | `<think>...</think>` | Reasoning traces |
  | `<focus>...</focus>` | Focus/zoom regions |
  | `<json>...</json>` | Structured output |

  ## License
  - vision-language
  - vqa
  - reasoning
+ - thinking-traces
  - structured-output
  - ocr
  - ui-understanding
  - tool-calling
+ - grounding
+ - robotics
+ - edge-deployment
  - oculus
  ---

+ # Oculus 0.1

+ **Hybrid-reasoning vision-language model built on the Oceanir-Oculus OO1 Architecture.**

+ A small model that outperforms systems 10x larger on visual reasoning and perception tasks while running on commodity GPUs or edge devices.

+ ## What's New in Oculus 0.1

+ ### Reasoning via Thinking Traces
  Short, structured reasoning traces improve multi-step decisions, small-object understanding, and ambiguous spatial tasks.

  ```python
  answer = model.ask(image, "How many red cars on the left?", think=True)
  # Output includes <think>...</think> reasoning trace
  ```
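Downstream code usually wants the answer and the trace separately; a minimal stdlib sketch for splitting a `<think>...</think>` span out of a raw response string (the sample response is illustrative, not actual model output):

```python
import re

def split_thinking(response: str) -> tuple[str, str]:
    """Separate a <think>...</think> reasoning trace from the final answer."""
    match = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    trace = match.group(1).strip() if match else ""
    answer = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL).strip()
    return trace, answer

# Illustrative response shape, not actual model output
raw = "<think>Left half: two red cars near the curb.</think>There are 2 red cars."
trace, answer = split_thinking(raw)
print(answer)  # There are 2 red cars.
```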

+ ### Perceptive Tool Calling + Focus (Zoom & Crop)
+ Oculus can trigger tool calls to focus (zoom and crop) and re-query smaller regions, dramatically improving fine-grained perception.

  ```python
  answer = model.ask(image, "Read the small text on the sign", focus=True)
  # Model automatically zooms to relevant region
  ```
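A focus step amounts to cropping around a region and re-querying. A sketch of the crop arithmetic only, assuming the model emits a normalized `[x1, y1, x2, y2]` focus box (the `focus_crop` helper and the box values are hypothetical, not part of the library):

```python
def focus_crop(box, width, height, pad=0.1):
    """Convert a normalized [x1, y1, x2, y2] focus box to padded pixel coordinates."""
    x1, y1, x2, y2 = box
    # Pad the box by a fraction of its own size, clamped to the image bounds
    pw, ph = (x2 - x1) * pad, (y2 - y1) * pad
    left = max(0, round((x1 - pw) * width))
    top = max(0, round((y1 - ph) * height))
    right = min(width, round((x2 + pw) * width))
    bottom = min(height, round((y2 + ph) * height))
    return left, top, right, bottom

print(focus_crop([0.4, 0.4, 0.6, 0.6], 1000, 800))  # (380, 304, 620, 496)
```

The resulting tuple can be passed directly to `PIL.Image.Image.crop` before re-querying the model on the cropped region.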

+ ### Structured Outputs
+ More reliable structured output generation for consistent JSON and predictable downstream integration.

  ```python
  result = model.generate(image, prompt="List all objects", mode="json")
+ # Returns structured JSON: {"objects": [{"label": "car", "box": [x1,y1,x2,y2]}, ...]}
  ```
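Consuming code still has to parse the returned text; a defensive stdlib sketch that accepts either bare JSON or a `<json>...</json>`-wrapped string (the sample payload is illustrative):

```python
import json
import re

def parse_json_output(text: str) -> dict:
    """Parse model JSON output, tolerating an optional <json>...</json> wrapper."""
    match = re.search(r"<json>(.*?)</json>", text, flags=re.DOTALL)
    payload = match.group(1) if match else text
    return json.loads(payload)

# Illustrative payload shape, not actual model output
result = parse_json_output('<json>{"objects": [{"label": "car", "box": [10, 20, 50, 60]}]}</json>')
print(result["objects"][0]["label"])  # car
```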

+ ### Complex OCR
+ Improved text recognition across cluttered, low-resolution, or distorted regions, enabling accurate extraction from documents, diagrams, labels, screens, and dense real-world scenes.

  ```python
+ text = model.ocr(image)  # Extracts text from documents, diagrams, labels, and screens
  ```

+ ### Desktop Use
+ Better performance on everyday desktop and mobile workflows such as UI understanding and navigation, making Oculus faster and more capable for agentic use cases.

  ```python
  elements = model.detect_ui(screenshot)
  # Returns: [{"type": "button", "text": "Submit", "bbox": [x1,y1,x2,y2]}, ...]
  ```
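For agentic workflows, a detected element's bbox is typically reduced to a click target; a minimal sketch over the list-of-dicts shape shown above (the sample element is illustrative):

```python
def click_point(element: dict) -> tuple[int, int]:
    """Center of a UI element's [x1, y1, x2, y2] bounding box, for a click action."""
    x1, y1, x2, y2 = element["bbox"]
    return (x1 + x2) // 2, (y1 + y2) // 2

# Illustrative element in the shape returned by detect_ui
button = {"type": "button", "text": "Submit", "bbox": [100, 200, 180, 240]}
print(click_point(button))  # (140, 220)
```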

+ ## Architecture
+
+ **Oceanir-Oculus OO1 Architecture**: a hybrid vision-language architecture optimized for:
+ - Visual reasoning that outperforms systems 10x larger
+ - Edge deployment on commodity GPUs
+ - Grounded perception with spatial understanding
+ - Tool calling and agentic workflows
+
+ ## Installation
+
+ ```bash
+ pip install oceanir
+ ```
+
  ## Usage

  ```python
 
  # With reasoning traces
  answer = model.ask("scene.jpg", "Count the people", think=True)

+ # With focus/zoom for fine details
  answer = model.ask("document.jpg", "Read the fine print", focus=True)

  # Structured JSON output
 
  # UI Detection
  ui_elements = model.detect_ui("desktop.png")

+ # Object detection with grounding
  boxes = model.detect("image.jpg")

  # Segmentation
  mask = model.segment("image.jpg")
  ```

+ ## Output Modes

+ | Mode | Method | Output |
+ |------|--------|--------|
+ | Text | `model.ask(image, question)` | Natural language answer |
+ | Reasoning | `model.ask(image, question, think=True)` | Answer with `<think>` trace |
+ | JSON | `model.generate(image, mode="json")` | Structured JSON |
+ | Points | `model.generate(image, mode="point")` | Object center points |
+ | Boxes | `model.detect(image)` | Bounding boxes + labels |
+ | Polygons | `model.segment(image)` | Segmentation masks |
+ | OCR | `model.ocr(image)` | Extracted text + locations |
+ | UI | `model.detect_ui(image)` | UI elements + types |

  ## Special Tokens

  | `<think>...</think>` | Reasoning traces |
  | `<focus>...</focus>` | Focus/zoom regions |
  | `<json>...</json>` | Structured output |
+ | `<box>...</box>` | Bounding box coordinates |
+ | `<point>...</point>` | Point coordinates |
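Grounded coordinates can be recovered from the token stream with a small parser; a sketch assuming each `<box>` tag carries comma-separated integers (the exact coordinate format inside the tags is an assumption, so adapt the regex to the model's actual output):

```python
import re

def parse_boxes(text: str) -> list[list[int]]:
    """Extract [x1, y1, x2, y2] boxes from <box>x1,y1,x2,y2</box> spans."""
    return [
        [int(v) for v in span.split(",")]
        for span in re.findall(r"<box>(.*?)</box>", text)
    ]

# Illustrative grounded response, not actual model output
print(parse_boxes("a cat <box>12,30,88,90</box> and a dog <box>5,5,40,60</box>"))
# [[12, 30, 88, 90], [5, 5, 40, 60]]
```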
+
+ ## Use Cases
+
+ - **Robotics**: Grounded perception for manipulation and navigation
+ - **Industrial Inspection**: Defect detection and quality control
+ - **Document Processing**: Complex OCR and form extraction
+ - **Media Search**: Visual content understanding and retrieval
+ - **Desktop Automation**: UI understanding for agentic workflows
+ - **Security**: Visual monitoring and anomaly detection
+
+ ## What's in This Repo
+
+ - `trained_components/projector.npz` - Vision-language projector
+ - `trained_components/heads.pth` - Task heads (detection, segmentation, OCR, UI)
+ - `oculus_unified_model/` - Model code

  ## License