kobiakor15 committed on
Commit ad39c92 · verified · 1 Parent(s): 0ff66ab

Upload oculus_unified_model/README.md with huggingface_hub

Files changed (1): oculus_unified_model/README.md (+220 lines, new file)
---
license: cc-by-nc-4.0
language:
- en
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- vision
- multimodal
- vision-language
- reasoning
- detection
- segmentation
- ocr
- vqa
- captioning
base_model:
- facebook/dinov2-large
- google/siglip-base-patch16-224
- Salesforce/blip-image-captioning-base
---

# Oculus 0.2

**A unified vision-language model with multi-modal reasoning capabilities.**

Oculus 0.2 is a hybrid-reasoning vision-language model that combines:
- **DINOv3** for semantic visual understanding
- **SigLIP2** for vision-language alignment
- **Trained projector** for vision-to-language mapping
- **Optional reasoning** via thinking traces

## 🚀 What's New in Oculus 0.2

| Feature | Description |
|---------|-------------|
| **🧠 Reasoning via thinking traces** | Short, structured reasoning traces improve multi-step decisions and ambiguous spatial tasks |
| **🔍 Focus system (zoom & crop)** | Automatically focuses on smaller regions for fine-grained perception |
| **📦 Multiple output modes** | Text, point, box, and polygon outputs for different tasks |
| **📝 Improved captioning** | Better descriptions with context awareness |
| **❓ Enhanced VQA** | More accurate answers to visual questions |

## Output Modes

| Mode | Description | Use Case |
|------|-------------|----------|
| **📝 Text** | Natural-language output | Captioning, VQA, descriptions |
| **📍 Point** | (x, y) coordinates + labels | Object counting, localization |
| **📦 Box** | Bounding boxes + labels | Object detection |
| **🔷 Polygon** | Segmentation masks | Semantic/instance segmentation |

## Quick Start

```python
from oculus_unified_model import OculusForConditionalGeneration
from PIL import Image

# Load model
model = OculusForConditionalGeneration.from_pretrained("OceanirAI/oculus-0.2")

# Load image
image = Image.open("your_image.jpg")

# Caption mode
output = model.generate(image, mode="text", prompt="Describe this image")
print(output.text)

# VQA mode
output = model.generate(image, mode="text", prompt="What color is the car?")
print(output.text)

# With reasoning traces
output = model.generate(image, mode="text", prompt="Count the people", think=True)
print(f"Thinking: {output.thinking_trace}")
print(f"Answer: {output.text}")

# Detection mode (bounding boxes)
output = model.generate(image, mode="box", prompt="Find all vehicles")
for box, label, conf in zip(output.boxes, output.labels, output.confidences):
    print(f"  {label}: {box} (conf={conf:.2f})")

# Point mode (counting)
output = model.generate(image, mode="point", prompt="Count the birds")
print(f"Found {len(output.points)} points")

# Segmentation mode
output = model.generate(image, mode="polygon", prompt="Segment the road")
print(f"Mask shape: {output.mask.shape}")
```
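
Box-mode results can be overlaid on the image with Pillow. A minimal sketch, using stand-in boxes and labels in place of `output.boxes` / `output.labels`, and assuming boxes are `(x1, y1, x2, y2)` pixel coordinates:

```python
from PIL import Image, ImageDraw

# Stand-in data; in practice use output.boxes / output.labels from box mode.
image = Image.new("RGB", (640, 480), "white")
boxes = [(50, 60, 200, 180), (300, 100, 420, 260)]
labels = ["car", "truck"]

draw = ImageDraw.Draw(image)
for box, label in zip(boxes, labels):
    draw.rectangle(box, outline="red", width=3)          # draw the bounding box
    draw.text((box[0], box[1] - 12), label, fill="red")  # label above the box
image.save("detections.jpg")
```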

## Reasoning Mode

Enable thinking traces for complex reasoning tasks:

```python
output = model.generate(
    image,
    mode="text",
    prompt="How many people are sitting vs standing?",
    think=True  # Enable reasoning
)

print(f"💭 Thinking: {output.thinking_trace}")
print(f"📝 Answer: {output.text}")
```

## Focus System

The Focus system enables zoom-and-crop for fine-grained perception:

```python
output = model.generate(
    image,
    mode="text",
    prompt="What does the small text say?",
    focus=True  # Enable focus/zoom
)
```
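
Conceptually, focus mode amounts to cropping a candidate region and re-running inference on the enlarged crop. A rough sketch of the idea (the helper below is hypothetical, not part of the released API):

```python
from PIL import Image

def zoom_and_crop(image, box, out_size=(448, 448)):
    """Crop box = (left, top, right, bottom), then upsample for a second pass."""
    return image.crop(box).resize(out_size, Image.LANCZOS)

# Example: zoom into a 200x150 region of a larger image.
img = Image.new("RGB", (1024, 768))
patch = zoom_and_crop(img, (100, 100, 300, 250))
# patch would then be fed back through the model for fine-grained reading.
```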

## Architecture

```
Image ─→ DINOv3 ──┐
                  ├─→ Fusion ─→ Projector ─→ 64 tokens × 1536D
Image ─→ SigLIP2 ─┘                 │
                    ┌───────────────┴───────────────┐
                    ↓                               ↓
                 LM Head                       Task Heads
                    ↓                               ↓
            Text/Caption/VQA               Point/Box/Polygon
```
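
At shape level, the fusion-and-projection path can be sketched as below. The module structure (cross-attention pooling with 64 learned queries) and the encoder widths are illustrative assumptions drawn from the diagram, not the released implementation:

```python
import torch
import torch.nn as nn

class ProjectorSketch(nn.Module):
    """Illustrative: fuse two encoders' patch features into 64 x 1536 tokens."""
    def __init__(self, dino_dim=1024, siglip_dim=768, n_tokens=64, out_dim=1536):
        super().__init__()
        self.fuse = nn.Linear(dino_dim + siglip_dim, out_dim)   # Fusion step
        self.queries = nn.Parameter(torch.randn(n_tokens, out_dim))
        self.attn = nn.MultiheadAttention(out_dim, num_heads=8, batch_first=True)

    def forward(self, dino_feats, siglip_feats):
        # dino_feats: (B, N, dino_dim); siglip_feats: (B, N, siglip_dim)
        fused = self.fuse(torch.cat([dino_feats, siglip_feats], dim=-1))
        q = self.queries.unsqueeze(0).expand(fused.size(0), -1, -1)
        tokens, _ = self.attn(q, fused, fused)  # pool N patches into 64 queries
        return tokens  # (B, 64, 1536), consumed by the LM head and task heads

tokens = ProjectorSketch()(torch.randn(2, 196, 1024), torch.randn(2, 196, 768))
```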

## Model Details

| Component | Size | Description |
|-----------|------|-------------|
| DINOv3 Encoder | 1.0B | Semantic visual features |
| SigLIP2 Encoder | 400M | Vision-language aligned features |
| Projector | 160M | Vision-to-language bridge |
| Detection Head | 12M | Bounding box prediction |
| Point Head | 8M | Point localization |
| Segmentation Head | 24M | Mask prediction |
| **Total** | **~1.6B** | Full model |
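
The per-component sizes above are consistent with the stated total:

```python
# Component parameter counts in millions, taken from the table above.
sizes_m = {"DINOv3": 1000, "SigLIP2": 400, "Projector": 160,
           "Detection": 12, "Point": 8, "Segmentation": 24}
total_b = sum(sizes_m.values()) / 1000
print(f"{total_b:.3f}B")  # -> 1.604B, i.e. ~1.6B
```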

## Training

The model components were trained in stages:
1. **Projector**: Trained on COCO Captions (5k paired images) for 3 epochs.
2. **Detection Heads**: Trained on COCO Detection for 5+ epochs using GIoU and Focal Loss.
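
For reference, the GIoU loss used for the detection heads penalizes both poor overlap and slack in the smallest enclosing box. A self-contained sketch of the standard formulation (not this project's exact code):

```python
import torch

def giou_loss(pred, target, eps=1e-7):
    """GIoU loss for boxes in (x1, y1, x2, y2) format, shape (N, 4)."""
    # Intersection area
    lt = torch.max(pred[:, :2], target[:, :2])
    rb = torch.min(pred[:, 2:], target[:, 2:])
    inter = (rb - lt).clamp(min=0).prod(dim=1)
    # Union area
    area_p = (pred[:, 2:] - pred[:, :2]).prod(dim=1)
    area_t = (target[:, 2:] - target[:, :2]).prod(dim=1)
    union = area_p + area_t - inter
    iou = inter / (union + eps)
    # Smallest box enclosing both pred and target
    enclose = (torch.max(pred[:, 2:], target[:, 2:])
               - torch.min(pred[:, :2], target[:, :2])).prod(dim=1)
    giou = iou - (enclose - union) / (enclose + eps)
    return (1.0 - giou).mean()

perfect = torch.tensor([[0.0, 0.0, 1.0, 1.0]])
loss = giou_loss(perfect, perfect)  # ~0 for a perfect match
```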

## Benchmarks & Evaluation

We use a comprehensive benchmark suite, `eval_benchmarks.py`, covering:
- **COCO Detection**: mAP evaluation
- **Car Part Damage**: Specialized evaluation on the Hugging Face `moondream/car_part_damage` dataset
- **Counting**: Accuracy on Pixmo-style counting tasks
- **VQA**: Open-ended question-answering accuracy

To run benchmarks:
```bash
python eval_benchmarks.py --model checkpoints/oculus_detection_v2/final
```

## 🔌 Python API Usage

To use Oculus in your own applications, import the `OculusPredictor`:

```python
from oculus_inference import OculusPredictor

# Initialize (automatically loads the best checkpoint)
model = OculusPredictor()

# 1. Object detection
results = model.detect("image.jpg")
print(f"Found {len(results['boxes'])} objects")

# 2. Visual question answering (reasoning)
answer = model.ask("image.jpg", "What is the person holding?")
print(f"Answer: {answer}")

# 3. Captioning
caption = model.caption("image.jpg")
print(f"Caption: {caption}")
```

## Requirements

```bash
pip install transformers torch pillow numpy
```

For Apple Silicon:
```bash
pip install mlx
```

## Citation

```bibtex
@misc{oculus2025,
  title={Oculus: Unified Vision-Language Model with Multi-Modal Reasoning},
  author={OceanirAI},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/OceanirAI/oculus-0.2}
}
```

## License

CC-BY-NC-4.0

## Contact

- **Organization**: OceanirAI
- **GitHub**: [github.com/Oceanir](https://github.com/Oceanir)