kobiakor15 committed
Commit cb1db66 · verified · 1 Parent(s): 95c5fe2

Upload README.md with huggingface_hub

Files changed (1):
  1. README.md +63 -43
README.md CHANGED
@@ -12,39 +12,26 @@ tags:
  - vision-language
  - vqa
  - reasoning
- - chain-of-thought
  - structured-output
  - ocr
  - ui-understanding
  - tool-calling
- - dinov3
- - siglip2
- - lfm2.5
- - liquid-ai
  - oculus
- base_model:
- - facebook/dinov3-vitl16-pretrain-lvd1689m
- - google/siglip2-so400m-patch16-naflex
- - LiquidAI/LFM2.5-1.2B-Base
  ---

- # Oculus 0.1 (~3.8B params)

- **Multimodal vision-language model with Isaac 0.2 features.**

- ## Architecture
-
- | Component | Model | Parameters |
- |-----------|-------|------------|
- | Vision Encoder 1 | DINOv3 ViT-L/16 | ~1.7B |
- | Vision Encoder 2 | SigLIP2 SO400M | ~400M |
- | Projector | 2-layer MLP | ~5M |
- | Language Model | LFM2.5-1.2B (Liquid AI) | ~1.2B |
- | Task Heads | Seg/Det/OCR/UI | ~2M |

- ## Isaac 0.2 Features

- ### 1. Reasoning via Thinking Traces
  Short, structured reasoning traces improve multi-step decisions, small-object understanding, and ambiguous spatial tasks.

  ```python
@@ -52,37 +39,51 @@ answer = model.ask(image, "How many red cars on the left?", think=True)
  # Output includes <think>...</think> reasoning trace
  ```

- ### 2. Perceptive Tool Calling + Focus (Zoom & Crop)
- Trigger tool calls to focus (zoom and crop) and re-query on smaller regions for fine-grained perception.

  ```python
  answer = model.ask(image, "Read the small text on the sign", focus=True)
  # Model automatically zooms to relevant region
  ```

- ### 3. Structured Outputs
- Reliable JSON output generation for consistent downstream integration.

  ```python
  result = model.generate(image, prompt="List all objects", mode="json")
- # Returns structured JSON: {"objects": [{"label": "car", "confidence": 0.95}, ...]}
  ```

- ### 4. Complex OCR
- Improved text recognition across cluttered, low-resolution, or distorted regions.

  ```python
- text = model.ocr(image)  # Extracts text from documents, diagrams, labels, screens
  ```

- ### 5. Desktop UI Understanding
- Better performance on desktop and mobile workflows for agentic use cases.

  ```python
  elements = model.detect_ui(screenshot)
  # Returns: [{"type": "button", "text": "Submit", "bbox": [x1,y1,x2,y2]}, ...]
  ```

  ## Usage

  ```python
@@ -96,7 +97,7 @@ answer = model.ask("image.jpg", "What is this?")
  # With reasoning traces
  answer = model.ask("scene.jpg", "Count the people", think=True)

- # With focus/zoom for small objects
  answer = model.ask("document.jpg", "Read the fine print", focus=True)

  # Structured JSON output
@@ -108,23 +109,25 @@ text = model.ocr("screenshot.png")
  # UI Detection
  ui_elements = model.detect_ui("desktop.png")

- # Detection
  boxes = model.detect("image.jpg")

  # Segmentation
  mask = model.segment("image.jpg")
  ```

- ## What's in this repo

- - `trained_components/projector.npz` - Vision-language projector
- - `trained_components/heads.pth` - Task heads (detection, segmentation, OCR, UI)
- - `oculus_unified_model/` - Model code
-
- Base models load from source repos:
- - [facebook/dinov3-vitl16-pretrain-lvd1689m](https://huggingface.co/facebook/dinov3-vitl16-pretrain-lvd1689m)
- - [google/siglip2-so400m-patch16-naflex](https://huggingface.co/google/siglip2-so400m-patch16-naflex)
- - [LiquidAI/LFM2.5-1.2B-Base](https://huggingface.co/LiquidAI/LFM2.5-1.2B-Base)

  ## Special Tokens

@@ -133,6 +136,23 @@ Base models load from source repos:
  | `<think>...</think>` | Reasoning traces |
  | `<focus>...</focus>` | Focus/zoom regions |
  | `<json>...</json>` | Structured output |

  ## License
  - vision-language
  - vqa
  - reasoning
+ - thinking-traces
  - structured-output
  - ocr
  - ui-understanding
  - tool-calling
+ - grounding
+ - robotics
+ - edge-deployment
  - oculus
  ---

+ # Oculus 0.1

+ **Hybrid-reasoning vision-language model built on the Oceanir-Oculus OO1 Architecture.**

+ A small model that outperforms systems 10x larger on visual reasoning and perception tasks while running on commodity GPUs or edge devices.

+ ## What's New in Oculus 0.1

+ ### Reasoning via Thinking Traces
  Short, structured reasoning traces improve multi-step decisions, small-object understanding, and ambiguous spatial tasks.

  ```python
  answer = model.ask(image, "How many red cars on the left?", think=True)
  # Output includes <think>...</think> reasoning trace
  ```
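Downstream code usually wants the answer and the trace separately; a minimal stdlib sketch for splitting a `<think>...</think>` span out of a raw response string (the sample response is illustrative, not actual model output):

```python
import re

def split_thinking(response: str) -> tuple[str, str]:
    """Separate a <think>...</think> reasoning trace from the final answer."""
    match = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    trace = match.group(1).strip() if match else ""
    answer = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL).strip()
    return trace, answer

# Illustrative response shape, not actual model output
raw = "<think>Left half: two red cars near the curb.</think>There are 2 red cars."
trace, answer = split_thinking(raw)
print(answer)  # There are 2 red cars.
```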

+ ### Perceptive Tool Calling + Focus (Zoom & Crop)
+ Oculus can trigger tool calls to focus (zoom and crop) and re-query smaller regions, dramatically improving fine-grained perception.

  ```python
  answer = model.ask(image, "Read the small text on the sign", focus=True)
  # Model automatically zooms to relevant region
  ```
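A focus step amounts to cropping around a region and re-querying. A sketch of the crop arithmetic only, assuming the model emits a normalized `[x1, y1, x2, y2]` focus box (the `focus_crop` helper and the box values are hypothetical, not part of the library):

```python
def focus_crop(box, width, height, pad=0.1):
    """Convert a normalized [x1, y1, x2, y2] focus box to padded pixel coordinates."""
    x1, y1, x2, y2 = box
    # Pad the box by a fraction of its own size, clamped to the image bounds
    pw, ph = (x2 - x1) * pad, (y2 - y1) * pad
    left = max(0, round((x1 - pw) * width))
    top = max(0, round((y1 - ph) * height))
    right = min(width, round((x2 + pw) * width))
    bottom = min(height, round((y2 + ph) * height))
    return left, top, right, bottom

print(focus_crop([0.4, 0.4, 0.6, 0.6], 1000, 800))  # (380, 304, 620, 496)
```

The resulting tuple can be passed directly to `PIL.Image.Image.crop` before re-querying the model on the cropped region.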

+ ### Structured Outputs
+ More reliable structured output generation for consistent JSON and predictable downstream integration.

  ```python
  result = model.generate(image, prompt="List all objects", mode="json")
+ # Returns structured JSON: {"objects": [{"label": "car", "box": [x1,y1,x2,y2]}, ...]}
  ```
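Consuming code still has to parse the returned text; a defensive stdlib sketch that accepts either bare JSON or a `<json>...</json>`-wrapped string (the sample payload is illustrative):

```python
import json
import re

def parse_json_output(text: str) -> dict:
    """Parse model JSON output, tolerating an optional <json>...</json> wrapper."""
    match = re.search(r"<json>(.*?)</json>", text, flags=re.DOTALL)
    payload = match.group(1) if match else text
    return json.loads(payload)

# Illustrative payload shape, not actual model output
result = parse_json_output('<json>{"objects": [{"label": "car", "box": [10, 20, 50, 60]}]}</json>')
print(result["objects"][0]["label"])  # car
```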

+ ### Complex OCR
+ Improved text recognition across cluttered, low-resolution, or distorted regions, enabling accurate extraction from documents, diagrams, labels, screens, and dense real-world scenes.

  ```python
+ text = model.ocr(image)  # Extracts text from documents, diagrams, labels, and screens
  ```

+ ### Desktop Use
+ Better performance on everyday desktop and mobile workflows such as UI understanding and navigation, making Oculus faster and more capable for agentic use cases.

  ```python
  elements = model.detect_ui(screenshot)
  # Returns: [{"type": "button", "text": "Submit", "bbox": [x1,y1,x2,y2]}, ...]
  ```
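For agentic workflows, a detected element's bbox is typically reduced to a click target; a minimal sketch over the list-of-dicts shape shown above (the sample element is illustrative):

```python
def click_point(element: dict) -> tuple[int, int]:
    """Center of a UI element's [x1, y1, x2, y2] bounding box, for a click action."""
    x1, y1, x2, y2 = element["bbox"]
    return (x1 + x2) // 2, (y1 + y2) // 2

# Illustrative element in the shape returned by detect_ui
button = {"type": "button", "text": "Submit", "bbox": [100, 200, 180, 240]}
print(click_point(button))  # (140, 220)
```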

+ ## Architecture
+
+ **Oceanir-Oculus OO1 Architecture**: a hybrid vision-language architecture optimized for:
+ - Visual reasoning that outperforms systems 10x larger
+ - Edge deployment on commodity GPUs
+ - Grounded perception with spatial understanding
+ - Tool calling and agentic workflows
+
+ ## Installation
+
+ ```bash
+ pip install oceanir
+ ```
+
  ## Usage

  ```python
 
  # With reasoning traces
  answer = model.ask("scene.jpg", "Count the people", think=True)

+ # With focus/zoom for fine details
  answer = model.ask("document.jpg", "Read the fine print", focus=True)

  # Structured JSON output
 
  # UI Detection
  ui_elements = model.detect_ui("desktop.png")

+ # Object detection with grounding
  boxes = model.detect("image.jpg")

  # Segmentation
  mask = model.segment("image.jpg")
  ```

+ ## Output Modes

+ | Mode | Method | Output |
+ |------|--------|--------|
+ | Text | `model.ask(image, question)` | Natural language answer |
+ | Reasoning | `model.ask(image, question, think=True)` | Answer with `<think>` trace |
+ | JSON | `model.generate(image, mode="json")` | Structured JSON |
+ | Points | `model.generate(image, mode="point")` | Object center points |
+ | Boxes | `model.detect(image)` | Bounding boxes + labels |
+ | Polygons | `model.segment(image)` | Segmentation masks |
+ | OCR | `model.ocr(image)` | Extracted text + locations |
+ | UI | `model.detect_ui(image)` | UI elements + types |

  ## Special Tokens

  | `<think>...</think>` | Reasoning traces |
  | `<focus>...</focus>` | Focus/zoom regions |
  | `<json>...</json>` | Structured output |
+ | `<box>...</box>` | Bounding box coordinates |
+ | `<point>...</point>` | Point coordinates |
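Grounded coordinates can be recovered from the token stream with a small parser; a sketch assuming each `<box>` tag carries comma-separated integers (the exact coordinate format inside the tags is an assumption, so adapt the regex to the model's actual output):

```python
import re

def parse_boxes(text: str) -> list[list[int]]:
    """Extract [x1, y1, x2, y2] boxes from <box>x1,y1,x2,y2</box> spans."""
    return [
        [int(v) for v in span.split(",")]
        for span in re.findall(r"<box>(.*?)</box>", text)
    ]

# Illustrative grounded response, not actual model output
print(parse_boxes("a cat <box>12,30,88,90</box> and a dog <box>5,5,40,60</box>"))
# [[12, 30, 88, 90], [5, 5, 40, 60]]
```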
+
+ ## Use Cases
+
+ - **Robotics**: Grounded perception for manipulation and navigation
+ - **Industrial Inspection**: Defect detection and quality control
+ - **Document Processing**: Complex OCR and form extraction
+ - **Media Search**: Visual content understanding and retrieval
+ - **Desktop Automation**: UI understanding for agentic workflows
+ - **Security**: Visual monitoring and anomaly detection
+
+ ## What's in This Repo
+
+ - `trained_components/projector.npz` - Vision-language projector
+ - `trained_components/heads.pth` - Task heads (detection, segmentation, OCR, UI)
+ - `oculus_unified_model/` - Model code

  ## License