---
license: cc-by-nc-4.0
language:
- en
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- vision
- multimodal
- vision-language
- reasoning
- detection
- segmentation
- ocr
- vqa
- captioning
base_model:
- facebook/dinov2-large
- google/siglip-base-patch16-224
- Salesforce/blip-image-captioning-base
---

# Oculus 0.2

**A unified vision-language model with multi-modal reasoning capabilities.**

Oculus 0.2 is a hybrid-reasoning vision-language model that combines:

- **DINOv3** for semantic visual understanding
- **SigLIP2** for vision-language alignment
- **Trained Projector** for vision-to-language mapping
- **Optional Reasoning** via thinking traces

## What's New in Oculus 0.2

| Feature | Description |
|---------|-------------|
| **Reasoning via Thinking Traces** | Short, structured reasoning traces improve multi-step decisions and ambiguous spatial tasks |
| **Focus System (Zoom & Crop)** | Automatically focuses on smaller regions for fine-grained perception |
| **Multiple Output Modes** | Text, point, box, and polygon outputs for different tasks |
| **Improved Captioning** | Better descriptions with context awareness |
| **Enhanced VQA** | More accurate answers to visual questions |

## Output Modes

| Mode | Description | Use Case |
|------|-------------|----------|
| **Text** | Natural-language output | Captioning, VQA, descriptions |
| **Point** | (x, y) coordinates + labels | Object counting, localization |
| **Box** | Bounding boxes + labels | Object detection |
| **Polygon** | Segmentation masks | Semantic/instance segmentation |

## Quick Start

```python
from oculus_unified_model import OculusForConditionalGeneration
from PIL import Image

# Load model
model = OculusForConditionalGeneration.from_pretrained("OceanirAI/oculus-0.2")

# Load image
image = Image.open("your_image.jpg")

# Caption mode
output = model.generate(image, mode="text", prompt="Describe this image")
print(output.text)

# VQA mode
output = model.generate(image, mode="text", prompt="What color is the car?")
print(output.text)

# With reasoning traces
output = model.generate(image, mode="text", prompt="Count the people", think=True)
print(f"Thinking: {output.thinking_trace}")
print(f"Answer: {output.text}")

# Detection mode (bounding boxes)
output = model.generate(image, mode="box", prompt="Find all vehicles")
for box, label, conf in zip(output.boxes, output.labels, output.confidences):
    print(f"  {label}: {box} (conf={conf:.2f})")

# Point mode (counting)
output = model.generate(image, mode="point", prompt="Count the birds")
print(f"Found {len(output.points)} points")

# Segmentation mode
output = model.generate(image, mode="polygon", prompt="Segment the road")
print(f"Mask shape: {output.mask.shape}")
```

## Reasoning Mode

Enable thinking traces for complex reasoning tasks:

```python
output = model.generate(
    image,
    mode="text",
    prompt="How many people are sitting vs. standing?",
    think=True  # Enable reasoning
)

print(f"Thinking: {output.thinking_trace}")
print(f"Answer: {output.text}")
```

## Focus System

The Focus system enables zoom-and-crop for fine-grained perception:

```python
output = model.generate(
    image,
    mode="text",
    prompt="What does the small text say?",
    focus=True  # Enable focus/zoom
)
```

## Architecture

```
Image → DINOv3 ───┐
                  ├─ Fusion → Projector → 64 tokens × 1536D
Image → SigLIP2 ──┘                           │
                        ┌─────────────────────┴──────────┐
                        │                                │
                     LM Head                        Task Heads
                        │                                │
                Text/Caption/VQA                Point/Box/Polygon
```

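The fusion-and-projection step in the diagram can be sketched numerically. This is a minimal NumPy sketch with assumed patch counts and encoder widths; the model's real dimensions, pooling scheme, and learned weights will differ:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: real encoder dims and token counts may differ.
NUM_PATCHES = 256                    # patch tokens per encoder (assumption)
DINO_DIM, SIGLIP_DIM = 1024, 768     # per-encoder feature widths (assumption)
NUM_QUERY_TOKENS, LM_DIM = 64, 1536  # 64 tokens x 1536D, as in the diagram

# Stand-ins for the two encoders' per-patch features.
dino_feats = rng.standard_normal((NUM_PATCHES, DINO_DIM))
siglip_feats = rng.standard_normal((NUM_PATCHES, SIGLIP_DIM))

# Fusion: concatenate per-patch features along the channel axis.
fused = np.concatenate([dino_feats, siglip_feats], axis=-1)  # (256, 1792)

# Projector: pool patches down to 64 query tokens, then map to the LM width.
pool = rng.standard_normal((NUM_QUERY_TOKENS, NUM_PATCHES)) / NUM_PATCHES
proj = rng.standard_normal((fused.shape[-1], LM_DIM)) / np.sqrt(fused.shape[-1])
vision_tokens = pool @ fused @ proj

print(vision_tokens.shape)  # (64, 1536)
```

The resulting 64 × 1536 token block is what the language model and the task heads consume.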
## Model Details

| Component | Size | Description |
|-----------|------|-------------|
| DINOv3 Encoder | 1.0B | Semantic visual features |
| SigLIP2 Encoder | 400M | Vision-language-aligned features |
| Projector | 160M | Vision-to-language bridge |
| Detection Head | 12M | Bounding-box prediction |
| Point Head | 8M | Point localization |
| Segmentation Head | 24M | Mask prediction |
| **Total** | **~1.6B** | Full model |

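As a quick sanity check, the component sizes above sum to the stated total:

```python
# Parameter counts in millions, taken from the table above.
components = {
    "dinov3_encoder": 1000,
    "siglip2_encoder": 400,
    "projector": 160,
    "detection_head": 12,
    "point_head": 8,
    "segmentation_head": 24,
}

total_m = sum(components.values())
print(f"{total_m / 1000:.2f}B")  # 1.60B
```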
## Training

The model components were trained in stages:

1. **Projector**: trained on COCO Captions (5k paired images) for 3 epochs.
2. **Detection Heads**: trained on COCO Detection for 5+ epochs using GIoU and Focal Loss.

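For reference, GIoU rewards overlap like plain IoU but also penalizes slack in the smallest enclosing box, which gives a useful gradient even for non-overlapping boxes. A minimal sketch for axis-aligned `(x1, y1, x2, y2)` boxes, independent of the actual training code:

```python
def giou(box_a, box_b):
    """Generalized IoU for two (x1, y1, x2, y2) boxes; the loss is 1 - giou."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Intersection area (zero if the boxes do not overlap).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    # Union area.
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    iou = inter / union

    # Smallest enclosing box; GIoU subtracts the fraction of it that is empty.
    cx1, cy1 = min(ax1, bx1), min(ay1, by1)
    cx2, cy2 = max(ax2, bx2), max(ay2, by2)
    enclose = (cx2 - cx1) * (cy2 - cy1)
    return iou - (enclose - union) / enclose

print(giou((0, 0, 2, 2), (0, 0, 2, 2)))  # 1.0 for identical boxes
print(giou((0, 0, 1, 1), (1, 0, 2, 1)))  # 0.0 for touching boxes
```

Unlike IoU, the value goes negative for well-separated boxes, so `1 - giou` still provides a training signal when predictions miss the target entirely.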
## Benchmarks & Evaluation

We use a comprehensive benchmark suite, `eval_benchmarks.py`, covering:

- **COCO Detection**: mAP evaluation
- **Car Part Damage**: specialized evaluation on the Hugging Face `moondream/car_part_damage` dataset
- **Counting**: accuracy on Pixmo-style counting tasks
- **VQA**: open-ended question-answering accuracy

To run benchmarks:

```bash
python eval_benchmarks.py --model checkpoints/oculus_detection_v2/final
```

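The counting benchmark reduces to exact-match accuracy over predicted counts. A minimal sketch of that metric; the actual scoring in `eval_benchmarks.py` may differ:

```python
def counting_accuracy(predicted, target):
    """Fraction of images whose predicted count exactly matches ground truth."""
    assert len(predicted) == len(target)
    correct = sum(p == t for p, t in zip(predicted, target))
    return correct / len(predicted)

# e.g. counts derived from point-mode outputs vs. annotated counts
print(counting_accuracy([3, 5, 2, 7], [3, 5, 1, 7]))  # 0.75
```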

## Python API Usage

To use Oculus in your own applications, import the `OculusPredictor`:

```python
from oculus_inference import OculusPredictor

# Initialize (automatically loads the best checkpoint)
model = OculusPredictor()

# 1. Object detection
results = model.detect("image.jpg")
print(f"Found {len(results['boxes'])} objects")

# 2. Visual question answering (reasoning)
answer = model.ask("image.jpg", "What is the person holding?")
print(f"Answer: {answer}")

# 3. Captioning
caption = model.caption("image.jpg")
print(f"Caption: {caption}")
```

## Requirements

```bash
pip install transformers torch pillow numpy
```

For Apple Silicon:

```bash
pip install mlx
```

## Citation

```bibtex
@misc{oculus2025,
  title={Oculus: Unified Vision-Language Model with Multi-Modal Reasoning},
  author={OceanirAI},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/OceanirAI/oculus-0.2}
}
```

## License

CC-BY-NC-4.0

## Contact

- **Organization**: OceanirAI
- **GitHub**: [github.com/Oceanir](https://github.com/Oceanir)
|