---
license: cc-by-nc-4.0
language:
- en
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- vision
- multimodal
- vision-language
- reasoning
- detection
- segmentation
- ocr
- vqa
- captioning
base_model:
- facebook/dinov2-large
- google/siglip-base-patch16-224
- Salesforce/blip-image-captioning-base
---
# Oculus 0.2
**A unified vision-language model with multi-modal reasoning capabilities.**
Oculus 0.2 is a hybrid-reasoning vision-language model that combines:
- **DINOv3** for semantic visual understanding
- **SigLIP2** for vision-language alignment
- **Trained Projector** for vision-to-language mapping
- **Optional Reasoning** via thinking traces
## 🚀 What's New in Oculus 0.2
| Feature | Description |
|---------|-------------|
| **🧠 Reasoning via Thinking Traces** | Short, structured reasoning traces improve performance on multi-step decisions and ambiguous spatial tasks |
| **🔍 Focus System (Zoom & Crop)** | Automatically zooms into smaller regions for fine-grained perception |
| **📦 Multiple Output Modes** | Text, Point, Box, and Polygon outputs for different tasks |
| **📝 Improved Captioning** | More detailed, context-aware descriptions |
| **❓ Enhanced VQA** | More accurate answers to visual questions |
## Output Modes
| Mode | Description | Use Case |
|------|-------------|----------|
| **πŸ“ Text** | Natural language output | Captioning, VQA, descriptions |
| **πŸ“ Point** | (x, y) coordinates + labels | Object counting, localization |
| **πŸ“¦ Box** | Bounding boxes + labels | Object detection |
| **πŸ”· Polygon** | Segmentation masks | Semantic/instance segmentation |
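For orientation, the sketch below summarizes which fields each mode populates on the returned result. The attribute names mirror those used in the Quick Start example that follows; the container class itself is a hypothetical illustration, not the class shipped with the model.

```python
from dataclasses import dataclass, field
from typing import Optional

import numpy as np

# Hypothetical result container; field names mirror the attributes used in the
# Quick Start example below (text, thinking_trace, points, boxes, labels,
# confidences, mask). Not the actual class published with the model.
@dataclass
class OculusOutput:
    text: Optional[str] = None                        # "text" mode: caption / VQA answer
    thinking_trace: Optional[str] = None              # populated when think=True
    points: list = field(default_factory=list)        # "point" mode: [(x, y), ...]
    boxes: list = field(default_factory=list)         # "box" mode: one [x1, y1, x2, y2] per object
    labels: list = field(default_factory=list)        # class labels for points/boxes
    confidences: list = field(default_factory=list)   # per-prediction scores
    mask: Optional[np.ndarray] = None                 # "polygon" mode: segmentation mask
```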
## Quick Start
```python
from oculus_unified_model import OculusForConditionalGeneration
from PIL import Image
# Load model
model = OculusForConditionalGeneration.from_pretrained("OceanirAI/oculus-0.2")
# Load image
image = Image.open("your_image.jpg")
# Caption mode
output = model.generate(image, mode="text", prompt="Describe this image")
print(output.text)
# VQA mode
output = model.generate(image, mode="text", prompt="What color is the car?")
print(output.text)
# With reasoning traces
output = model.generate(image, mode="text", prompt="Count the people", think=True)
print(f"Thinking: {output.thinking_trace}")
print(f"Answer: {output.text}")
# Detection mode (bounding boxes)
output = model.generate(image, mode="box", prompt="Find all vehicles")
for box, label, conf in zip(output.boxes, output.labels, output.confidences):
    print(f" {label}: {box} (conf={conf:.2f})")
# Point mode (counting)
output = model.generate(image, mode="point", prompt="Count the birds")
print(f"Found {len(output.points)} points")
# Segmentation mode
output = model.generate(image, mode="polygon", prompt="Segment the road")
print(f"Mask shape: {output.mask.shape}")
```
## Reasoning Mode
Enable thinking traces for complex reasoning tasks:
```python
output = model.generate(
    image,
    mode="text",
    prompt="How many people are sitting vs standing?",
    think=True,  # Enable reasoning
)
print(f"💭 Thinking: {output.thinking_trace}")
print(f"📝 Answer: {output.text}")
```
## Focus System
The Focus system enables zoom-and-crop for fine-grained perception:
```python
output = model.generate(
    image,
    mode="text",
    prompt="What does the small text say?",
    focus=True,  # Enable focus/zoom
)
```
## Architecture
```
Image ──→ DINOv3 ──┐
                   ├──→ Fusion ──→ Projector ──→ 64 tokens × 1536D
Image ──→ SigLIP2 ─┘                                     │
                                       ┌─────────────────┴─────────────────┐
                                       ↓                                   ↓
                                    LM Head                            Task Heads
                                       ↓                                   ↓
                               Text/Caption/VQA                    Point/Box/Polygon
```
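As a rough illustration of the diagram above, here is a minimal PyTorch sketch of a dual-encoder fusion and projector stage. The encoder feature widths, the learned-query pooling, and the module names are assumptions made for illustration; this is not the released implementation. The output matches the "64 tokens × 1536D" interface shown in the diagram.

```python
import torch
import torch.nn as nn

class FusionProjector(nn.Module):
    """Illustrative sketch (assumed design): fuse DINOv3 and SigLIP2 patch
    features and project them to 64 visual tokens of width 1536."""

    def __init__(self, dino_dim=1024, siglip_dim=768, lm_dim=1536, num_tokens=64):
        super().__init__()
        self.dino_proj = nn.Linear(dino_dim, lm_dim)      # per-encoder projection
        self.siglip_proj = nn.Linear(siglip_dim, lm_dim)
        self.queries = nn.Parameter(torch.randn(num_tokens, lm_dim))  # learned pooling queries
        self.attn = nn.MultiheadAttention(lm_dim, num_heads=8, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(lm_dim, lm_dim), nn.GELU(), nn.Linear(lm_dim, lm_dim))

    def forward(self, dino_feats, siglip_feats):
        # dino_feats: (B, N1, dino_dim); siglip_feats: (B, N2, siglip_dim)
        fused = torch.cat([self.dino_proj(dino_feats), self.siglip_proj(siglip_feats)], dim=1)
        q = self.queries.unsqueeze(0).expand(fused.size(0), -1, -1)   # (B, 64, lm_dim)
        tokens, _ = self.attn(q, fused, fused)   # cross-attend learned queries to fused patches
        return self.mlp(tokens)                  # (B, 64, 1536) visual tokens for the LM
```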
## Model Details
| Component | Size | Description |
|-----------|------|-------------|
| DINOv3 Encoder | 1.0B | Semantic visual features |
| SigLIP2 Encoder | 400M | Vision-language aligned features |
| Projector | 160M | Vision-to-language bridge |
| Detection Head | 12M | Bounding box prediction |
| Point Head | 8M | Point localization |
| Segmentation Head | 24M | Mask prediction |
| **Total** | **~1.6B** | Full model |
## Training
The model components were trained in stages:
1. **Projector**: Trained on COCO Captions (5k image-caption pairs) for 3 epochs.
2. **Detection Heads**: Trained on COCO Detection for 5+ epochs using GIoU and Focal Loss (see the sketch below).
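As a rough illustration of that detection objective, the sketch below combines a GIoU box-regression loss with a focal classification loss via `torchvision.ops`. The function name, tensor shapes, and loss weights are assumptions for illustration, not the repository's training code.

```python
from torchvision.ops import generalized_box_iou_loss, sigmoid_focal_loss

def detection_loss(pred_boxes, target_boxes, pred_logits, target_onehot,
                   box_weight=2.0, cls_weight=1.0):
    """Illustrative detection objective: GIoU for boxes + focal loss for classes.
    The 2.0 / 1.0 weighting is an assumed choice, not taken from the repo."""
    # pred_boxes / target_boxes: (N, 4) in (x1, y1, x2, y2) format
    box_loss = generalized_box_iou_loss(pred_boxes, target_boxes, reduction="mean")
    # pred_logits / target_onehot: (N, num_classes), raw logits vs. one-hot float targets
    cls_loss = sigmoid_focal_loss(pred_logits, target_onehot, reduction="mean")
    return box_weight * box_loss + cls_weight * cls_loss
```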
## Benchmarks & Evaluation
Evaluation uses the benchmark suite `eval_benchmarks.py`, which covers:
- **COCO Detection**: mAP evaluation
- **Car Part Damage**: Specialized evaluation on HuggingFace `moondream/car_part_damage` dataset
- **Counting**: Accuracy on Pixmo-style counting tasks
- **VQA**: Open-ended question answering accuracy
To run benchmarks:
```bash
python eval_benchmarks.py --model checkpoints/oculus_detection_v2/final
```
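For example, a counting entry can be scored by comparing the number of predicted points against the ground-truth count. The snippet below is only a sketch: the sample schema (`image_path`, `prompt`, `count`) is hypothetical and this is not the actual `eval_benchmarks.py` logic.

```python
from PIL import Image

def counting_accuracy(model, samples):
    """Exact-match counting accuracy over point-mode predictions (sketch only)."""
    correct = 0
    for sample in samples:  # each sample: {"image_path", "prompt", "count"} (assumed schema)
        image = Image.open(sample["image_path"])
        output = model.generate(image, mode="point", prompt=sample["prompt"])
        correct += int(len(output.points) == sample["count"])
    return correct / len(samples)
```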
## 🔌 Python API Usage
To use Oculus in your own applications, import `OculusPredictor`:
```python
from oculus_inference import OculusPredictor
# Initialize (automatically loads best checkpoint)
model = OculusPredictor()
# 1. Object Detection
results = model.detect("image.jpg")
print(f"Found {len(results['boxes'])} objects")
# 2. Visual Question Answering (Reasoning)
answer = model.ask("image.jpg", "What is the person holding?")
print(f"Answer: {answer}")
# 3. Captioning
caption = model.caption("image.jpg")
print(f"Caption: {caption}")
```
## Requirements
```bash
pip install transformers torch pillow numpy
```
For Apple Silicon:
```bash
pip install mlx
```
## Citation
```bibtex
@misc{oculus2025,
  title={Oculus: Unified Vision-Language Model with Multi-Modal Reasoning},
  author={OceanirAI},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/OceanirAI/oculus-0.2}
}
```
## License
CC BY-NC 4.0 (Attribution-NonCommercial 4.0 International)
## Contact
- **Organization**: OceanirAI
- **GitHub**: [github.com/Oceanir](https://github.com/Oceanir)