Upload README.md with huggingface_hub
README.md CHANGED

tags:
- vqa
- reasoning
- chain-of-thought
- structured-output
- ocr
- ui-understanding
- tool-calling
- dinov3
- siglip2
- lfm2.5

# Oculus 0.1 (~3.8B params)

**Multimodal vision-language model with Isaac 0.2 features.**

## Architecture

| Component | Model | Parameters |
|-----------|-------|------------|
| Vision Encoder 1 | DINOv3 ViT-L/16 | ~1.7B |
| Vision Encoder 2 | SigLIP2 SO400M | ~400M |
| Projector | 2-layer MLP | ~5M |
| Language Model | LFM2.5-1.2B (Liquid AI) | ~1.2B |
| Task Heads | Seg/Det/OCR/UI | ~2M |
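
The table doesn't say how the two encoders are combined. Below is a minimal sketch of one plausible wiring, assuming per-patch features from both encoders are concatenated channel-wise before the 2-layer MLP projector (all names and dimensions are illustrative guesses, not this repo's code):

```python
import torch
import torch.nn as nn

class Projector(nn.Module):
    """2-layer MLP mapping fused vision features into the LM embedding space."""
    def __init__(self, vision_dim: int, lm_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.mlp(feats)

# Hypothetical fusion: concatenate per-patch features from both encoders.
dino_feats = torch.randn(1, 196, 1024)    # DINOv3 ViT-L/16 patch features (dims are guesses)
siglip_feats = torch.randn(1, 196, 1152)  # SigLIP2 SO400M patch features (dims are guesses)
fused = torch.cat([dino_feats, siglip_feats], dim=-1)
vision_tokens = Projector(vision_dim=2176, lm_dim=2048)(fused)  # fed to the LM as prefix tokens
```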

## Isaac 0.2 Features

### 1. Reasoning via Thinking Traces
Short, structured reasoning traces improve multi-step decisions, small-object understanding, and ambiguous spatial tasks.

```python
answer = model.ask(image, "How many red cars on the left?", think=True)
# Output includes <think>...</think> reasoning trace
```
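
Because the trace comes back inline with the answer, downstream code will usually want to strip it. A small stdlib helper (the exact return format is an assumption here):

```python
import re

def split_think(answer: str) -> tuple[str, str]:
    """Separate an inline <think>...</think> trace from the visible answer (assumed format)."""
    m = re.search(r"<think>(.*?)</think>", answer, flags=re.DOTALL)
    trace = m.group(1).strip() if m else ""
    visible = re.sub(r"<think>.*?</think>", "", answer, flags=re.DOTALL).strip()
    return trace, visible
```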

### 2. Perceptive Tool Calling + Focus (Zoom & Crop)
The model can trigger tool calls that focus (zoom and crop) on smaller regions and re-query them for fine-grained perception.

```python
answer = model.ask(image, "Read the small text on the sign", focus=True)
# Model automatically zooms to the relevant region
```
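
Outside the model, a focus step amounts to cropping and upscaling a region before re-encoding it. For intuition only, with Pillow and a made-up box:

```python
from PIL import Image

img = Image.open("image.jpg")
box = (420, 310, 660, 420)  # (left, upper, right, lower) from a hypothetical <focus> tag
region = img.crop(box).resize((480, 220))  # zoom the crop before re-querying the model
```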

### 3. Structured Outputs
Reliable JSON output generation for consistent downstream integration.

```python
result = model.generate(image, prompt="List all objects", mode="json")
# Returns structured JSON: {"objects": [{"label": "car", "confidence": 0.95}, ...]}
```
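
Whether `result` arrives as a parsed dict or a raw JSON string isn't specified, so defensive handling is cheap:

```python
import json

# Accept either a raw string or an already-parsed dict.
data = json.loads(result) if isinstance(result, str) else result
for obj in data.get("objects", []):
    print(obj["label"], obj.get("confidence"))
```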

### 4. Complex OCR
Improved text recognition across cluttered, low-resolution, or distorted regions.

```python
text = model.ocr(image)  # Extracts text from documents, diagrams, labels, screens
```

### 5. Desktop UI Understanding
Better performance on desktop and mobile workflows for agentic use cases.

```python
elements = model.detect_ui(screenshot)
# Returns: [{"type": "button", "text": "Submit", "bbox": [x1,y1,x2,y2]}, ...]
```
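
In an agent loop these detections would drive an input controller. A sketch with the third-party `pyautogui` package (not part of this repo), clicking the center of a detected button:

```python
import pyautogui  # third-party input controller, not bundled with Oculus

for el in elements:
    if el["type"] == "button" and el["text"] == "Submit":
        x1, y1, x2, y2 = el["bbox"]
        pyautogui.click((x1 + x2) / 2, (y1 + y2) / 2)  # click the button's center
        break
```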

## Usage

```python
from oceanir import Oculus

model = Oculus.from_pretrained("OceanirAI/Oculus-0.1")

# Basic VQA
answer = model.ask("image.jpg", "What is this?")

# With reasoning traces
answer = model.ask("scene.jpg", "Count the people", think=True)

# With focus/zoom for small objects
answer = model.ask("document.jpg", "Read the fine print", focus=True)

# Structured JSON output
result = model.generate("image.jpg", prompt="Describe objects", mode="json")

# OCR
text = model.ocr("screenshot.png")

# UI Detection
ui_elements = model.detect_ui("desktop.png")

# Detection
boxes = model.detect("image.jpg")

# Segmentation
mask = model.segment("image.jpg")
```

## What's in this repo

- `trained_components/projector.npz` - Vision-language projector
- `trained_components/heads.pth` - Task heads (detection, segmentation, OCR, UI)
- `oculus_unified_model/` - Model code
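
Both component files open with standard tooling, which makes a quick sanity check easy (the key names inside the archives aren't documented here):

```python
import numpy as np
import torch

proj = np.load("trained_components/projector.npz")
print(proj.files)  # array names in the projector archive

heads = torch.load("trained_components/heads.pth", map_location="cpu")
print(type(heads))  # a state dict or a module, depending on how it was saved
```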

Base models load from source repos:
- [facebook/dinov3-vitl16-pretrain-lvd1689m](https://huggingface.co/facebook/dinov3-vitl16-pretrain-lvd1689m)
- [google/siglip2-so400m-patch16-naflex](https://huggingface.co/google/siglip2-so400m-patch16-naflex)
- [LiquidAI/LFM2.5-1.2B-Base](https://huggingface.co/LiquidAI/LFM2.5-1.2B-Base)
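
Since the base weights aren't bundled here, pre-fetching them into the local cache avoids a slow first load (the DINOv3 repo may be gated and require accepting its license on the Hub first):

```python
from huggingface_hub import snapshot_download

# Pre-fetch base model weights into the local Hugging Face cache.
for repo_id in (
    "facebook/dinov3-vitl16-pretrain-lvd1689m",
    "google/siglip2-so400m-patch16-naflex",
    "LiquidAI/LFM2.5-1.2B-Base",
):
    snapshot_download(repo_id)
```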

## Special Tokens

| Token | Purpose |
|-------|---------|
| `<think>...</think>` | Reasoning traces |
| `<focus>...</focus>` | Focus/zoom regions |
| `<json>...</json>` | Structured output |

## License