---
license: cc-by-nc-4.0
language:
- en
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- vision
- multimodal
- vision-language
- reasoning
- detection
- segmentation
- ocr
- vqa
- captioning
base_model:
- facebook/dinov2-large
- google/siglip-base-patch16-224
- Salesforce/blip-image-captioning-base
---

# Oculus 0.2

**A unified vision-language model with multi-modal reasoning capabilities.**

Oculus 0.2 is a hybrid-reasoning vision-language model that combines:
- **DINOv3** for semantic visual understanding
- **SigLIP2** for vision-language alignment
- **Trained Projector** for vision-to-language mapping
- **Optional Reasoning** via thinking traces

## 🚀 What's New in Oculus 0.2

| Feature | Description |
|---------|-------------|
| **🧠 Reasoning via Thinking Traces** | Short, structured reasoning traces improve multi-step decisions and ambiguous spatial tasks |
| **🔍 Focus System (Zoom & Crop)** | Automatically focuses on smaller regions for fine-grained perception |
| **📦 Multiple Output Modes** | Text, Point, Box, and Polygon outputs for different tasks |
| **📝 Improved Captioning** | Better descriptions with context awareness |
| **❓ Enhanced VQA** | More accurate answers to visual questions |

## Output Modes

| Mode | Description | Use Case |
|------|-------------|----------|
| **📝 Text** | Natural language output | Captioning, VQA, descriptions |
| **📍 Point** | (x, y) coordinates + labels | Object counting, localization |
| **📦 Box** | Bounding boxes + labels | Object detection |
| **🔷 Polygon** | Segmentation masks | Semantic/instance segmentation |

## Quick Start

```python
from oculus_unified_model import OculusForConditionalGeneration
from PIL import Image

# Load model
model = OculusForConditionalGeneration.from_pretrained("OceanirAI/oculus-0.2")

# Load image
image = Image.open("your_image.jpg")

# Caption mode
output = model.generate(image, mode="text", prompt="Describe this image")
print(output.text)

# VQA mode
output = model.generate(image, mode="text", prompt="What color is the car?")
print(output.text)

# With reasoning traces
output = model.generate(image, mode="text", prompt="Count the people", think=True)
print(f"Thinking: {output.thinking_trace}")
print(f"Answer: {output.text}")

# Detection mode (bounding boxes)
output = model.generate(image, mode="box", prompt="Find all vehicles")
for box, label, conf in zip(output.boxes, output.labels, output.confidences):
    print(f"  {label}: {box} (conf={conf:.2f})")

# Point mode (counting)
output = model.generate(image, mode="point", prompt="Count the birds")
print(f"Found {len(output.points)} points")

# Segmentation mode
output = model.generate(image, mode="polygon", prompt="Segment the road")
print(f"Mask shape: {output.mask.shape}")
```

## Reasoning Mode

Enable thinking traces for complex reasoning tasks:

```python
output = model.generate(
    image,
    mode="text",
    prompt="How many people are sitting vs standing?",
    think=True  # Enable reasoning
)

print(f"💭 Thinking: {output.thinking_trace}")
print(f"📝 Answer: {output.text}")
```

## Focus System

The Focus system enables zoom-and-crop for fine-grained perception:

```python
output = model.generate(
    image,
    mode="text", 
    prompt="What does the small text say?",
    focus=True  # Enable focus/zoom
)
```
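
The `focus=True` flag handles the zoom internally. Conceptually it is a two-pass operation: localize a region, crop it, and re-run the question on the crop. A rough manual equivalent using the same `generate` API is shown below; this is an illustrative sketch, not the internal implementation, and the margin and prompts are arbitrary:

```python
from PIL import Image

image = Image.open("your_image.jpg")

# Pass 1: localize the region of interest with box mode
loc = model.generate(image, mode="box", prompt="Find the small text")
x1, y1, x2, y2 = loc.boxes[0]

# Pass 2: crop with a small margin and re-ask on the zoomed view
margin = 10
crop = image.crop((max(0, x1 - margin), max(0, y1 - margin),
                   min(image.width, x2 + margin), min(image.height, y2 + margin)))
output = model.generate(crop, mode="text", prompt="What does the small text say?")
print(output.text)
```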

## Architecture

```
Image → DINOv3 ────┐
                   ├→ Fusion → Projector → 64 tokens × 1536D ───┐
Image → SigLIP2 ───┘                                            │
                                                                ↓
                                              ┌─────────────────┴───────────────┐
                                              │                                 │
                                              ↓                                 ↓
                                         LM Head                         Task Heads
                                              │                                 │
                                              ↓                                 ↓
                                      Text/Caption/VQA              Point/Box/Polygon
```
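
The projector design is not spelled out above, so here is a minimal sketch of how the two encoders' patch features could be fused and pooled into the 64 × 1536-dimensional visual tokens shown in the diagram. Module names, feature dimensions, and the attention-pooling choice are assumptions for illustration, not the released implementation:

```python
import torch
import torch.nn as nn

class DualEncoderProjector(nn.Module):
    """Illustrative fusion + projection: DINOv3 and SigLIP2 patch features
    are concatenated, fused, and pooled down to 64 tokens of width 1536."""

    def __init__(self, dino_dim=1024, siglip_dim=768, lm_dim=1536, num_tokens=64):
        super().__init__()
        self.fuse = nn.Linear(dino_dim + siglip_dim, lm_dim)        # channel-wise fusion
        self.queries = nn.Parameter(torch.randn(num_tokens, lm_dim) * 0.02)
        self.pool = nn.MultiheadAttention(lm_dim, num_heads=8, batch_first=True)
        self.proj = nn.Sequential(nn.Linear(lm_dim, lm_dim), nn.GELU(),
                                  nn.Linear(lm_dim, lm_dim))

    def forward(self, dino_feats, siglip_feats):
        # dino_feats:   (B, N, dino_dim)   patch features from DINOv3
        # siglip_feats: (B, N, siglip_dim) patch features from SigLIP2, same grid
        fused = self.fuse(torch.cat([dino_feats, siglip_feats], dim=-1))   # (B, N, 1536)
        q = self.queries.unsqueeze(0).expand(fused.size(0), -1, -1)        # (B, 64, 1536)
        pooled, _ = self.pool(q, fused, fused)                             # (B, 64, 1536)
        return self.proj(pooled)   # visual tokens consumed by the LM and task heads

# Shape check with dummy features
tokens = DualEncoderProjector()(torch.randn(1, 256, 1024), torch.randn(1, 256, 768))
print(tokens.shape)  # torch.Size([1, 64, 1536])
```

This sketch assumes both encoders run on the same image and produce patch sequences of equal length; in practice the grids may differ and require interpolation before fusion.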

## Model Details

| Component | Size | Description |
|-----------|------|-------------|
| DINOv3 Encoder | 1.0B | Semantic visual features |
| SigLIP2 Encoder | 400M | Vision-language aligned features |
| Projector | 160M | Vision-to-language bridge |
| Detection Head | 12M | Bounding box prediction |
| Point Head | 8M | Point localization |
| Segmentation Head | 24M | Mask prediction |
| **Total** | **~1.6B** | Full model |

## Training

The model components were trained in stages:
1. **Projector**: Trained on COCO Captions (5k paired images) for 3 epochs.
2. **Detection Heads**: Trained on COCO Detection for 5+ epochs using GIoU and focal losses (see the sketch below).
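
The detection-head loss is not specified beyond "GIoU and focal". Below is a minimal sketch assuming one-to-one matched predictions and torchvision's loss ops; the 2:1 weighting is illustrative, not the trained configuration:

```python
import torch.nn.functional as F
from torchvision.ops import generalized_box_iou_loss, sigmoid_focal_loss

def detection_loss(pred_boxes, pred_logits, gt_boxes, gt_labels,
                   box_weight=2.0, cls_weight=1.0):
    """Hypothetical per-image loss for the detection head: GIoU for box
    regression plus sigmoid focal loss for classification."""
    # pred_boxes / gt_boxes: (N, 4) in (x1, y1, x2, y2); pred_logits: (N, num_classes)
    giou = generalized_box_iou_loss(pred_boxes, gt_boxes, reduction="mean")
    targets = F.one_hot(gt_labels, num_classes=pred_logits.size(-1)).float()
    focal = sigmoid_focal_loss(pred_logits, targets, alpha=0.25, gamma=2.0, reduction="mean")
    return box_weight * giou + cls_weight * focal
```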

## Benchmarks & Evaluation

We evaluate with the benchmark suite `eval_benchmarks.py`, which covers:
- **COCO Detection**: mAP evaluation
- **Car Part Damage**: Specialized evaluation on the Hugging Face `moondream/car_part_damage` dataset
- **Counting**: Accuracy on Pixmo-style counting tasks
- **VQA**: Open-ended question answering accuracy

To run benchmarks:
```bash
python eval_benchmarks.py --model checkpoints/oculus_detection_v2/final
```
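
For a quick sanity check outside the harness, counting accuracy can also be approximated directly from point-mode outputs. This snippet is illustrative only and not part of `eval_benchmarks.py`; the sample list and exact-match metric are placeholders:

```python
from PIL import Image

# Placeholder (image path, prompt, ground-truth count) triples
counting_samples = [
    ("birds.jpg", "Count the birds", 4),
    ("people.jpg", "Count the people", 7),
]

correct = 0
for path, prompt, true_count in counting_samples:
    out = model.generate(Image.open(path), mode="point", prompt=prompt)
    correct += int(len(out.points) == true_count)

print(f"Counting accuracy: {correct / len(counting_samples):.2%}")
```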

## 🔌 Python API Usage

To use Oculus in your own applications, import the `OculusPredictor` class:

```python
from oculus_inference import OculusPredictor

# Initialize (automatically loads best checkpoint)
model = OculusPredictor()

# 1. Object Detection
results = model.detect("image.jpg")
print(f"Found {len(results['boxes'])} objects")

# 2. Visual Question Answering (Reasoning)
answer = model.ask("image.jpg", "What is the person holding?")
print(f"Answer: {answer}")

# 3. Captioning
caption = model.caption("image.jpg")
print(f"Caption: {caption}")
```

## Requirements

```bash
pip install transformers torch pillow numpy
```

For Apple Silicon:
```bash
pip install mlx
```

## Citation

```bibtex
@misc{oculus2025,
  title={Oculus: Unified Vision-Language Model with Multi-Modal Reasoning},
  author={OceanirAI},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/OceanirAI/oculus-0.2}
}
```

## License

CC-BY-NC-4.0

## Contact

- **Organization**: OceanirAI
- **GitHub**: [github.com/Oceanir](https://github.com/Oceanir)