---
license: cc-by-nc-4.0
language:
- en
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- vision
- multimodal
- vision-language
- reasoning
- detection
- segmentation
- ocr
- vqa
- captioning
base_model:
- facebook/dinov2-large
- google/siglip-base-patch16-224
- Salesforce/blip-image-captioning-base
---

# Oculus 0.2

**A unified vision-language model with multi-modal reasoning capabilities.**

Oculus 0.2 is a hybrid-reasoning vision-language model that combines:

- **DINOv3** for semantic visual understanding
- **SigLIP2** for vision-language alignment
- **Trained Projector** for vision-to-language mapping
- **Optional Reasoning** via thinking traces

## 🚀 What's New in Oculus 0.2

| Feature | Description |
|---------|-------------|
| **🧠 Reasoning via Thinking Traces** | Short, structured reasoning traces improve multi-step decisions and ambiguous spatial tasks |
| **🔍 Focus System (Zoom & Crop)** | Automatically focuses on smaller regions for fine-grained perception |
| **📦 Multiple Output Modes** | Text, Point, Box, and Polygon outputs for different tasks |
| **📝 Improved Captioning** | Better descriptions with context awareness |
| **❓ Enhanced VQA** | More accurate answers to visual questions |

## Output Modes

| Mode | Description | Use Case |
|------|-------------|----------|
| **📝 Text** | Natural language output | Captioning, VQA, descriptions |
| **📍 Point** | (x, y) coordinates + labels | Object counting, localization |
| **📦 Box** | Bounding boxes + labels | Object detection |
| **🔷 Polygon** | Segmentation masks | Semantic/instance segmentation |

## Quick Start

```python
from oculus_unified_model import OculusForConditionalGeneration
from PIL import Image

# Load model
model = OculusForConditionalGeneration.from_pretrained("OceanirAI/oculus-0.2")

# Load image
image = Image.open("your_image.jpg")

# Caption mode
output = model.generate(image, mode="text", prompt="Describe this image")
print(output.text)

# VQA mode
output = model.generate(image, mode="text", prompt="What color is the car?")
print(output.text)

# With reasoning traces
output = model.generate(image, mode="text", prompt="Count the people", think=True)
print(f"Thinking: {output.thinking_trace}")
print(f"Answer: {output.text}")

# Detection mode (bounding boxes)
output = model.generate(image, mode="box", prompt="Find all vehicles")
for box, label, conf in zip(output.boxes, output.labels, output.confidences):
    print(f"  {label}: {box} (conf={conf:.2f})")

# Point mode (counting)
output = model.generate(image, mode="point", prompt="Count the birds")
print(f"Found {len(output.points)} points")

# Segmentation mode
output = model.generate(image, mode="polygon", prompt="Segment the road")
print(f"Mask shape: {output.mask.shape}")
```

## Reasoning Mode

Enable thinking traces for complex reasoning tasks:

```python
output = model.generate(
    image,
    mode="text",
    prompt="How many people are sitting vs standing?",
    think=True  # Enable reasoning
)
print(f"💭 Thinking: {output.thinking_trace}")
print(f"📝 Answer: {output.text}")
```

## Focus System

The Focus system enables zoom-and-crop for fine-grained perception:

```python
output = model.generate(
    image,
    mode="text",
    prompt="What does the small text say?",
    focus=True  # Enable focus/zoom
)
```

## Architecture

```
Image → DINOv3 ────┐
                   ├→ Fusion → Projector → 64 tokens × 1536D ───┐
Image → SigLIP2 ───┘                                            │
                                                                ↓
                                ┌─────────────────────────────────┐
                                │                                 │
                                ↓                                 ↓
                             LM Head                         Task Heads
                                │                                 │
                                ↓                                 ↓
                        Text/Caption/VQA                 Point/Box/Polygon
```
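The projector pools the fused DINOv3 and SigLIP2 features into a fixed set of 64 visual tokens at the language model's hidden size (1536D). The sketch below shows one plausible way to wire that path in PyTorch; the module layout, feature dimensions, and cross-attention pooling are illustrative assumptions, not the exact Oculus implementation.

```python
import torch
import torch.nn as nn

class VisionProjectorSketch(nn.Module):
    """Illustrative fusion + projection path (assumed dims, not the real Oculus weights)."""

    def __init__(self, dino_dim=1024, siglip_dim=768, lm_dim=1536, num_tokens=64):
        super().__init__()
        # Concat-and-project fusion of the two encoders' patch features
        self.fusion = nn.Linear(dino_dim + siglip_dim, lm_dim)
        # 64 learned query tokens that pool the patch grid via cross-attention
        self.queries = nn.Parameter(torch.randn(num_tokens, lm_dim))
        self.attn = nn.MultiheadAttention(lm_dim, num_heads=8, batch_first=True)
        self.proj = nn.Sequential(nn.Linear(lm_dim, lm_dim), nn.GELU(), nn.Linear(lm_dim, lm_dim))

    def forward(self, dino_feats, siglip_feats):
        # dino_feats: (B, N, dino_dim), siglip_feats: (B, N, siglip_dim) — same patch grid assumed
        fused = self.fusion(torch.cat([dino_feats, siglip_feats], dim=-1))   # (B, N, lm_dim)
        queries = self.queries.expand(fused.size(0), -1, -1)                 # (B, 64, lm_dim)
        pooled, _ = self.attn(queries, fused, fused)                         # cross-attention pooling
        return self.proj(pooled)                                             # (B, 64, 1536) visual tokens

# Example: 196 patches per encoder, batch of 1
tokens = VisionProjectorSketch()(torch.randn(1, 196, 1024), torch.randn(1, 196, 768))
print(tokens.shape)  # torch.Size([1, 64, 1536])
```

These 64 visual tokens are what the LM head and the task heads (point/box/polygon) consume downstream.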
## Model Details

| Component | Size | Description |
|-----------|------|-------------|
| DINOv3 Encoder | 1.0B | Semantic visual features |
| SigLIP2 Encoder | 400M | Vision-language aligned features |
| Projector | 160M | Vision-to-language bridge |
| Detection Head | 12M | Bounding box prediction |
| Point Head | 8M | Point localization |
| Segmentation Head | 24M | Mask prediction |
| **Total** | **~1.6B** | Full model |

## Training

The model components were trained in stages:

1. **Projector**: Trained on COCO Captions (5k paired images) for 3 epochs.
2. **Detection Heads**: Trained on COCO Detection for 5+ epochs using GIoU and Focal Loss.

## Benchmarks & Evaluation

We use a comprehensive benchmark suite, `eval_benchmarks.py`, covering:

- **COCO Detection**: mAP evaluation
- **Car Part Damage**: Specialized evaluation on the Hugging Face `moondream/car_part_damage` dataset
- **Counting**: Accuracy on Pixmo-style counting tasks
- **VQA**: Open-ended question answering accuracy

To run benchmarks:

```bash
python eval_benchmarks.py --model checkpoints/oculus_detection_v2/final
```

## 🔌 Python API Usage

To use Oculus in your own applications, simply import the `OculusPredictor`:

```python
from oculus_inference import OculusPredictor

# Initialize (automatically loads the best checkpoint)
model = OculusPredictor()

# 1. Object Detection
results = model.detect("image.jpg")
print(f"Found {len(results['boxes'])} objects")

# 2. Visual Question Answering (Reasoning)
answer = model.ask("image.jpg", "What is the person holding?")
print(f"Answer: {answer}")

# 3. Captioning
caption = model.caption("image.jpg")
print(f"Caption: {caption}")
```

## Requirements

```bash
pip install transformers torch pillow numpy
```

For Apple Silicon:

```bash
pip install mlx
```

## Citation

```bibtex
@misc{oculus2025,
  title={Oculus: Unified Vision-Language Model with Multi-Modal Reasoning},
  author={OceanirAI},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/OceanirAI/oculus-0.2}
}
```

## License

CC-BY-NC-4.0

## Contact

- **Organization**: OceanirAI
- **GitHub**: [github.com/Oceanir](https://github.com/Oceanir)