---
license: cc-by-nc-4.0
language:
- en
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- vision
- multimodal
- vision-language
- reasoning
- detection
- segmentation
- ocr
- vqa
- captioning
base_model:
- facebook/dinov2-large
- google/siglip-base-patch16-224
- Salesforce/blip-image-captioning-base
---
# Oculus 0.2
**A unified vision-language model with multi-modal reasoning capabilities.**
Oculus 0.2 is a hybrid-reasoning vision-language model that combines:
- **DINOv3** for semantic visual understanding
- **SigLIP2** for vision-language alignment
- **Trained Projector** for vision-to-language mapping
- **Optional Reasoning** via thinking traces
## What's New in Oculus 0.2
| Feature | Description |
|---------|-------------|
| **Reasoning via Thinking Traces** | Short, structured reasoning traces improve multi-step decisions and ambiguous spatial tasks |
| **Focus System (Zoom & Crop)** | Automatically focuses on smaller regions for fine-grained perception |
| **Multiple Output Modes** | Text, Point, Box, and Polygon outputs for different tasks |
| **Improved Captioning** | Better descriptions with context awareness |
| **Enhanced VQA** | More accurate answers to visual questions |
## Output Modes
| Mode | Description | Use Case |
|------|-------------|----------|
| **Text** | Natural language output | Captioning, VQA, descriptions |
| **Point** | (x, y) coordinates + labels | Object counting, localization |
| **Box** | Bounding boxes + labels | Object detection |
| **Polygon** | Segmentation masks | Semantic/instance segmentation |
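Each mode returns a different payload, so callers typically branch on the output type. The sketch below models the idea with plain dataclasses; the class and field names are illustrative assumptions, not the model's actual output types:

```python
from dataclasses import dataclass

# Hypothetical containers mirroring three of the output modes above.
@dataclass
class TextOutput:
    text: str

@dataclass
class PointOutput:
    points: list  # [(x, y), ...]
    labels: list

@dataclass
class BoxOutput:
    boxes: list   # [(x1, y1, x2, y2), ...]
    labels: list
    confidences: list

def summarize(output):
    """Render any mode's output as a one-line summary."""
    if isinstance(output, TextOutput):
        return output.text
    if isinstance(output, PointOutput):
        return f"{len(output.points)} points"
    if isinstance(output, BoxOutput):
        return f"{len(output.boxes)} boxes"
    raise TypeError(f"unknown output type: {type(output).__name__}")

print(summarize(BoxOutput(boxes=[(0, 0, 10, 10)], labels=["car"], confidences=[0.9])))
```

A dispatcher like this keeps downstream code (visualization, metrics) independent of which mode produced the result.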
## Quick Start
```python
from oculus_unified_model import OculusForConditionalGeneration
from PIL import Image
# Load model
model = OculusForConditionalGeneration.from_pretrained("OceanirAI/oculus-0.2")
# Load image
image = Image.open("your_image.jpg")
# Caption mode
output = model.generate(image, mode="text", prompt="Describe this image")
print(output.text)
# VQA mode
output = model.generate(image, mode="text", prompt="What color is the car?")
print(output.text)
# With reasoning traces
output = model.generate(image, mode="text", prompt="Count the people", think=True)
print(f"Thinking: {output.thinking_trace}")
print(f"Answer: {output.text}")
# Detection mode (bounding boxes)
output = model.generate(image, mode="box", prompt="Find all vehicles")
for box, label, conf in zip(output.boxes, output.labels, output.confidences):
    print(f"{label}: {box} (conf={conf:.2f})")
# Point mode (counting)
output = model.generate(image, mode="point", prompt="Count the birds")
print(f"Found {len(output.points)} points")
# Segmentation mode
output = model.generate(image, mode="polygon", prompt="Segment the road")
print(f"Mask shape: {output.mask.shape}")
```
## Reasoning Mode
Enable thinking traces for complex reasoning tasks:
```python
output = model.generate(
image,
mode="text",
prompt="How many people are sitting vs standing?",
think=True # Enable reasoning
)
print(f"Thinking: {output.thinking_trace}")
print(f"Answer: {output.text}")
```
## Focus System
The Focus system enables zoom-and-crop for fine-grained perception:
```python
output = model.generate(
image,
mode="text",
prompt="What does the small text say?",
focus=True # Enable focus/zoom
)
```
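The focus implementation itself is internal to the model, but the zoom-and-crop idea is simple: expand the region of interest by a margin (so the crop keeps context) and clamp it to the image bounds. A minimal sketch, where the function and its parameters are assumptions rather than Oculus API:

```python
def focus_crop_box(region, image_size, margin=0.25):
    """Expand an (x1, y1, x2, y2) region by a relative margin,
    clamped to the image bounds, so the crop retains some context."""
    x1, y1, x2, y2 = region
    w, h = image_size
    dx = (x2 - x1) * margin  # horizontal padding
    dy = (y2 - y1) * margin  # vertical padding
    return (max(0, int(x1 - dx)), max(0, int(y1 - dy)),
            min(w, int(x2 + dx)), min(h, int(y2 + dy)))

# A region near the top-left corner gets clamped at 0:
print(focus_crop_box((10, 10, 110, 60), (640, 480)))  # -> (0, 0, 135, 72)
```

The resulting box can be passed straight to `PIL.Image.crop` before re-running perception on the zoomed view.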
## Architecture
```
Image ──> DINOv3 ───┐
                    ├──> Fusion ──> Projector ──> 64 tokens × 1536D
Image ──> SigLIP2 ──┘                                   │
                                        ┌───────────────┴───────────────┐
                                        ▼                               ▼
                                     LM Head                       Task Heads
                                        │                               │
                                        ▼                               ▼
                               Text/Caption/VQA                Point/Box/Polygon
```
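The fusion-and-projection step can be sketched with NumPy. The concatenate-then-pool-then-linear-map structure and the encoder feature widths below are assumptions, chosen only to make the 64 × 1536 token shape concrete:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed encoder outputs: 256 patch tokens each (widths are illustrative).
dino_feats = rng.standard_normal((256, 1024))    # DINOv3-style features
siglip_feats = rng.standard_normal((256, 768))   # SigLIP2-style features

# Fusion: concatenate along the channel axis.
fused = np.concatenate([dino_feats, siglip_feats], axis=-1)  # (256, 1792)

# Projector sketch: pool 256 patches down to 64 tokens (4-patch mean),
# then a linear map into a 1536-D language embedding space.
pooled = fused.reshape(64, 4, -1).mean(axis=1)               # (64, 1792)
W = rng.standard_normal((pooled.shape[-1], 1536)) * 0.02     # learned in practice
tokens = pooled @ W

print(tokens.shape)  # -> (64, 1536)
```

In the real model the projector weights are trained (see Training below); the random `W` here only demonstrates the shapes flowing through the diagram.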
## Model Details
| Component | Size | Description |
|-----------|------|-------------|
| DINOv3 Encoder | 1.0B | Semantic visual features |
| SigLIP2 Encoder | 400M | Vision-language aligned features |
| Projector | 160M | Vision-to-language bridge |
| Detection Head | 12M | Bounding box prediction |
| Point Head | 8M | Point localization |
| Segmentation Head | 24M | Mask prediction |
| **Total** | **~1.6B** | Full model |
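The component sizes in the table add up to the quoted total:

```python
# Parameter counts from the table above, in millions.
components = {
    "DINOv3 encoder": 1000,
    "SigLIP2 encoder": 400,
    "Projector": 160,
    "Detection head": 12,
    "Point head": 8,
    "Segmentation head": 24,
}
total_m = sum(components.values())  # 1604M
print(f"{total_m / 1000:.2f}B parameters")  # -> "1.60B parameters"
```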
## Training
The model components were trained in stages:
1. **Projector**: Trained on COCO Captions (5k paired images) for 3 epochs.
2. **Detection Heads**: Trained on COCO Detection for 5+ epochs using GIoU and Focal Loss.
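GIoU (generalized IoU) extends plain IoU with a penalty for the empty area of the smallest enclosing box, which keeps a useful training signal even when predicted and ground-truth boxes do not overlap. A minimal sketch for axis-aligned (x1, y1, x2, y2) boxes (the training code's actual implementation may differ):

```python
def giou(box_a, box_b):
    """Generalized IoU between two (x1, y1, x2, y2) boxes, in [-1, 1]."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection area (zero if the boxes are disjoint).
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    # Union of the two box areas.
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    iou = inter / union
    # Smallest enclosing box; its empty area is the GIoU penalty.
    enclose = ((max(ax2, bx2) - min(ax1, bx1)) *
               (max(ay2, by2) - min(ay1, by1)))
    return iou - (enclose - union) / enclose

# Identical boxes give the maximum score:
print(giou((0, 0, 2, 2), (0, 0, 2, 2)))  # -> 1.0
```

The corresponding regression loss is `1 - giou(pred, target)`, typically combined with a classification term such as Focal Loss, as in step 2 above.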
## Benchmarks & Evaluation
We use a comprehensive benchmark suite `eval_benchmarks.py` covering:
- **COCO Detection**: mAP evaluation
- **Car Part Damage**: Specialized evaluation on HuggingFace `moondream/car_part_damage` dataset
- **Counting**: Accuracy on Pixmo-style counting tasks
- **VQA**: Open-ended question answering accuracy
To run benchmarks:
```bash
python eval_benchmarks.py --model checkpoints/oculus_detection_v2/final
```
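For the counting benchmark, the usual metric is exact-match accuracy over predicted vs. ground-truth counts. A hedged sketch of that computation, assuming the benchmark yields count pairs (this is not the script's actual code):

```python
def counting_accuracy(pairs):
    """Exact-match accuracy over (predicted, ground_truth) count pairs."""
    if not pairs:
        return 0.0
    hits = sum(1 for pred, gt in pairs if pred == gt)
    return hits / len(pairs)

print(counting_accuracy([(3, 3), (5, 4), (2, 2), (1, 1)]))  # -> 0.75
```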
## Python API Usage
To use Oculus in your own applications, import the `OculusPredictor` class:
```python
from oculus_inference import OculusPredictor
# Initialize (automatically loads best checkpoint)
model = OculusPredictor()
# 1. Object Detection
results = model.detect("image.jpg")
print(f"Found {len(results['boxes'])} objects")
# 2. Visual Question Answering (Reasoning)
answer = model.ask("image.jpg", "What is the person holding?")
print(f"Answer: {answer}")
# 3. Captioning
caption = model.caption("image.jpg")
print(f"Caption: {caption}")
```
## Requirements
```bash
pip install transformers torch pillow numpy
```
For Apple Silicon:
```bash
pip install mlx
```
## Citation
```bibtex
@misc{oculus2025,
title={Oculus: Unified Vision-Language Model with Multi-Modal Reasoning},
author={OceanirAI},
year={2025},
publisher={Hugging Face},
url={https://huggingface.co/OceanirAI/oculus-0.2}
}
```
## License
CC-BY-NC-4.0
## Contact
- **Organization**: OceanirAI
- **GitHub**: [github.com/Oceanir](https://github.com/Oceanir)