kobiakor15 committed on
Commit
8eb7430
·
verified ·
1 Parent(s): a26f847

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +77 -45
README.md CHANGED
@@ -13,10 +13,10 @@ tags:
  - vqa
  - reasoning
  - chain-of-thought
- - instruction-following
- - segmentation
- - detection
  - ocr
  - dinov3
  - siglip2
  - lfm2.5
@@ -30,49 +30,57 @@ base_model:

  # Oculus 0.1 (~3.8B params)

- **Multimodal vision-language model combining DINOv3, SigLIP2, and LFM2.5.**

  ## Architecture

- | Component | Model | Parameters | Source |
- |-----------|-------|------------|--------|
- | Vision Encoder 1 | DINOv3 ViT-L/16 | ~1.7B | [facebook/dinov3-vitl16-pretrain-lvd1689m](https://huggingface.co/facebook/dinov3-vitl16-pretrain-lvd1689m) |
- | Vision Encoder 2 | SigLIP2 SO400M | ~400M | [google/siglip2-so400m-patch16-naflex](https://huggingface.co/google/siglip2-so400m-patch16-naflex) |
- | Projector | 2-layer MLP | ~5M | This repo |
- | Language Model | LFM2.5-1.2B | ~1.2B | [LiquidAI/LFM2.5-1.2B-Base](https://huggingface.co/LiquidAI/LFM2.5-1.2B-Base) |
- | Task Heads | Seg/Det/OCR | ~1.5M | This repo |

- ```
- Image (224x224) --> DINOv3 ViT-L/16 --\
-                                        +--> Concat --> Projector --> LFM2.5-1.2B --> Text
- Image (384x384) --> SigLIP2 SO400M ---/                                  |
-                                                                          +--> Segmentation Head
-                                                                          +--> Detection Head
-                                                                          +--> OCR Head
  ```
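The fused-encoder path in the diagram can be sketched in a few lines of numpy. The 1024 and 1152 widths follow the standard DINOv3 ViT-L/16 and SigLIP2 SO400M configs; the 2048 LM width and the random weights are placeholders, not values taken from this repo:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pooled per-image features from each encoder. 1024 and 1152 are the standard
# widths for DINOv3 ViT-L/16 and SigLIP2 SO400M; the 2048 LM width below is a
# placeholder, not confirmed by this repo.
dino_feat = rng.standard_normal((1, 1024))    # DINOv3 branch (224x224 input)
siglip_feat = rng.standard_normal((1, 1152))  # SigLIP2 branch (384x384 input)

# Concat --> 2-layer MLP projector --> language-model embedding space
fused = np.concatenate([dino_feat, siglip_feat], axis=-1)  # (1, 2176)
w1 = rng.standard_normal((2176, 2048)) * 0.01
w2 = rng.standard_normal((2048, 2048)) * 0.01
visual_tokens = np.maximum(fused @ w1, 0.0) @ w2  # ReLU between the two layers

print(visual_tokens.shape)  # (1, 2048)
```

The trained projector supplies the real weights; this only shows the tensor plumbing between the frozen encoders and the language model.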

- ## What's in this repo

- This repo contains **only the trained components**:
- - `trained_components/projector.npz` - Vision-language projector
- - `trained_components/heads.pth` - Task heads (segmentation, detection, OCR)
- - `oculus_unified_model/` - Model code

- Base models (DINOv3, SigLIP2, LFM2.5) are loaded from their source repos.
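A minimal sketch of consuming the split-checkpoint layout with numpy: the `projector.npz` path comes from the list above, but the key names and shapes here are invented placeholders (the real file defines its own):

```python
import os
import tempfile

import numpy as np

# Mimic the split-checkpoint layout: projector weights stored as a .npz
# archive of named arrays. Key names and shapes are invented placeholders.
path = os.path.join(tempfile.mkdtemp(), "projector.npz")
np.savez(path, w1=np.zeros((8, 4)), w2=np.zeros((4, 4)))

ckpt = np.load(path)
print(sorted(ckpt.files))  # ['w1', 'w2']
print(ckpt["w1"].shape)    # (8, 4)
```

The `.pth` task heads would be loaded analogously with `torch.load`.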
 
- ## Capabilities

- - **VQA**: Visual question answering
- - **Reasoning**: Chain-of-thought with `<think>...</think>` tokens
- - **Captioning**: Image descriptions
- - **Segmentation**: Pixel-level prediction (150 classes)
- - **Detection**: Object detection (80 COCO classes)
- - **OCR**: Text detection and recognition

- ## Installation

- ```bash
- pip install oceanir
  ```

  ## Usage
@@ -82,25 +90,49 @@ from oceanir import Oculus

  model = Oculus.from_pretrained("OceanirAI/Oculus-0.1")

- # VQA
- answer = model.ask("image.jpg", "What is in this image?")

- # Reasoning
- answer = model.ask("scene.jpg", "How many people?", think=True)

- # Captioning
- caption = model.caption("image.jpg")

  # Detection
  boxes = model.detect("image.jpg")
  ```

- ## Why LFM2.5?

- - 3x faster training than Qwen on CPU
- - 2x faster inference on CPU
- - Native MLX support
- - Optimized for edge devices

  ## License
 
 
  - vqa
  - reasoning
  - chain-of-thought
+ - structured-output
  - ocr
+ - ui-understanding
+ - tool-calling
  - dinov3
  - siglip2
  - lfm2.5
 

  # Oculus 0.1 (~3.8B params)

+ **Multimodal vision-language model with Isaac 0.2 features.**

  ## Architecture

+ | Component | Model | Parameters |
+ |-----------|-------|------------|
+ | Vision Encoder 1 | DINOv3 ViT-L/16 | ~1.7B |
+ | Vision Encoder 2 | SigLIP2 SO400M | ~400M |
+ | Projector | 2-layer MLP | ~5M |
+ | Language Model | LFM2.5-1.2B (Liquid AI) | ~1.2B |
+ | Task Heads | Seg/Det/OCR/UI | ~2M |

+ ## Isaac 0.2 Features
+
+ ### 1. Reasoning via Thinking Traces
+ Short, structured reasoning traces improve multi-step decisions, small-object understanding, and ambiguous spatial tasks.
+
+ ```python
+ answer = model.ask(image, "How many red cars on the left?", think=True)
+ # Output includes a <think>...</think> reasoning trace
  ```

+ ### 2. Perceptive Tool Calling + Focus (Zoom & Crop)
+ The model can trigger tool calls that zoom and crop, then re-query smaller regions for fine-grained perception.

+ ```python
+ answer = model.ask(image, "Read the small text on the sign", focus=True)
+ # The model automatically zooms to the relevant region
+ ```

+ ### 3. Structured Outputs
+ Reliable JSON output generation for consistent downstream integration.

+ ```python
+ result = model.generate(image, prompt="List all objects", mode="json")
+ # Returns structured JSON: {"objects": [{"label": "car", "confidence": 0.95}, ...]}
+ ```
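When `mode="json"` holds its contract, downstream code can parse the result directly. A sketch using the example payload above, assuming the model hands back the JSON as a plain string:

```python
import json

# The example payload documented above, as the raw string a caller would
# receive when mode="json" is honored.
raw = '{"objects": [{"label": "car", "confidence": 0.95}]}'

data = json.loads(raw)  # raises json.JSONDecodeError if the contract is broken
labels = [obj["label"] for obj in data["objects"]]
print(labels)  # ['car']
```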

+ ### 4. Complex OCR
+ Improved text recognition across cluttered, low-resolution, or distorted regions.

+ ```python
+ text = model.ocr(image)  # Extracts text from documents, diagrams, labels, screens
+ ```

+ ### 5. Desktop UI Understanding
+ Better performance on desktop and mobile workflows for agentic use cases.
+
+ ```python
+ elements = model.detect_ui(screenshot)
+ # Returns: [{"type": "button", "text": "Submit", "bbox": [x1,y1,x2,y2]}, ...]
  ```

  ## Usage
 

  model = Oculus.from_pretrained("OceanirAI/Oculus-0.1")

+ # Basic VQA
+ answer = model.ask("image.jpg", "What is this?")
+
+ # With reasoning traces
+ answer = model.ask("scene.jpg", "Count the people", think=True)

+ # With focus/zoom for small objects
+ answer = model.ask("document.jpg", "Read the fine print", focus=True)

+ # Structured JSON output
+ result = model.generate("image.jpg", prompt="Describe objects", mode="json")
+
+ # OCR
+ text = model.ocr("screenshot.png")
+
+ # UI Detection
+ ui_elements = model.detect_ui("desktop.png")

  # Detection
  boxes = model.detect("image.jpg")
+
+ # Segmentation
+ mask = model.segment("image.jpg")
  ```

+ ## What's in this repo
+
+ - `trained_components/projector.npz` - Vision-language projector
+ - `trained_components/heads.pth` - Task heads (detection, segmentation, OCR, UI)
+ - `oculus_unified_model/` - Model code
+
+ Base models are loaded from their source repos:
+ - [facebook/dinov3-vitl16-pretrain-lvd1689m](https://huggingface.co/facebook/dinov3-vitl16-pretrain-lvd1689m)
+ - [google/siglip2-so400m-patch16-naflex](https://huggingface.co/google/siglip2-so400m-patch16-naflex)
+ - [LiquidAI/LFM2.5-1.2B-Base](https://huggingface.co/LiquidAI/LFM2.5-1.2B-Base)
+
+ ## Special Tokens

+ | Token | Purpose |
+ |-------|---------|
+ | `<think>...</think>` | Reasoning traces |
+ | `<focus>...</focus>` | Focus/zoom regions |
+ | `<json>...</json>` | Structured output |
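Given these delimiters, a caller can split a reasoning trace from the final answer with a regex. A sketch over a made-up raw generation (the exact output framing is an assumption):

```python
import re

# Hypothetical raw generation using the delimiters from the table above.
raw = "<think>Two cars are on the left; one is red.</think>There is 1 red car."

match = re.search(r"<think>(.*?)</think>", raw, flags=re.DOTALL)
trace = match.group(1) if match else ""
answer = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL).strip()

print(trace)   # Two cars are on the left; one is red.
print(answer)  # There is 1 red car.
```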

  ## License