---
license: other
license_name: oceanir-research-license
license_link: LICENSE
language:
- en
library_name: oceanir
pipeline_tag: image-text-to-text
tags:
- vision
- multimodal
- vision-language
- vqa
- image-captioning
- object-detection
- oculus
- research
- training
base_model:
- facebook/dinov3-vith16plus-pretrain-lvd1689m
- google/siglip2-base-patch16-224
- LiquidAI/LFM2.5-1.2B-Instruct-MLX-bf16
---
# Oculus - Complete Training Repository
This repository contains the complete Oculus vision-language model, including all training code, checkpoints, and documentation.
## Quick Links
| Model | Description | Link |
|-------|-------------|------|
| **Oculus-0.1-Instruct** | Instruction-tuned for VQA/captioning | [HuggingFace](https://huggingface.co/OceanirAI/Oculus-0.1-Instruct) |
| **Oculus-0.1-Reasoning** | Chain-of-thought reasoning | [HuggingFace](https://huggingface.co/OceanirAI/Oculus-0.1-Reasoning) |
| **oceanir** | Python SDK | [PyPI](https://pypi.org/project/oceanir/) |
## Installation
```bash
pip install oceanir
```
```python
from oceanir import Oculus
model = Oculus.from_pretrained("OceanirAI/Oculus-0.1-Instruct")
answer = model.ask("image.jpg", "What is this?")
```
## Architecture
Oculus combines state-of-the-art vision encoders with a powerful language model:
### Vision Encoders
- **DINOv3 ViT-H/16+** (`facebook/dinov3-vith16plus-pretrain-lvd1689m`)
- Self-supervised vision transformer trained on LVD-1689M
- 1024 hidden, 24 layers, 16 heads
- **SigLIP2** (`google/siglip2-base-patch16-224`)
- Vision-language contrastive model
- 1152 hidden, 27 layers, 16 heads
### Language Model
- **LiquidAI LFM 2.5 1.2B Instruct** (`LiquidAI/LFM2.5-1.2B-Instruct-MLX-bf16`)
- 1.2B parameters, 1536 embedding dim
- 131K vocab, 32K context window
### Architecture Specs
| Component | Specification |
|-----------|--------------|
| DINOv3 | ViT-H/16+, 1024D, 24L, 16H |
| SigLIP2 | Base, 1152D, 27L, 16H |
| Fusion | Concatenation → 2176D |
| Projector | 2176 → 4352 → 1536 |
| LFM 2.5 | 1.2B params, 1536D, 16L, 24H |
| Detection | 80 classes (COCO) |
| Segmentation | 150 classes (ADE20K) |
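
The projector row above describes a two-layer MLP applied to the concatenated encoder features. Below is a minimal sketch of that fusion path, assuming a plain PyTorch MLP and aligned token grids from both encoders; the class and parameter names here are illustrative and do not mirror the actual code in `modeling_oculus.py`.

```python
# Illustrative sketch only: shapes follow the spec table above
# (DINOv3 1024D + SigLIP2 1152D -> 2176D -> 4352D -> 1536D).
# FusionProjector is a hypothetical name, not a class from this repo.
import torch
import torch.nn as nn

class FusionProjector(nn.Module):
    def __init__(self, dino_dim=1024, siglip_dim=1152, hidden_dim=4352, lm_dim=1536):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dino_dim + siglip_dim, hidden_dim),  # 2176 -> 4352
            nn.GELU(),
            nn.Linear(hidden_dim, lm_dim),                  # 4352 -> 1536
        )

    def forward(self, dino_tokens, siglip_tokens):
        # Both inputs: (batch, num_patches, dim); assumes the two encoders
        # produce the same number of patch tokens so they can be concatenated
        # along the feature axis.
        fused = torch.cat([dino_tokens, siglip_tokens], dim=-1)  # (B, N, 2176)
        return self.mlp(fused)                                   # (B, N, 1536)

projector = FusionProjector()
vis = projector(torch.randn(1, 196, 1024), torch.randn(1, 196, 1152))
print(vis.shape)  # torch.Size([1, 196, 1536])
```

The 2176-dim input is simply the DINOv3 width (1024) plus the SigLIP2 width (1152), and the 1536-dim output matches the LFM 2.5 embedding size, so projected vision tokens can be fed to the language model alongside text tokens.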
## Repository Structure
```
OceanirAI/Oculus/
├── config.json                      # Main model config
├── README.md                        # This file
│
├── oculus_unified_model/            # Model implementation
│   ├── __init__.py
│   ├── modeling_oculus.py           # OculusForConditionalGeneration
│   ├── configuration_oculus.py      # OculusConfig
│   └── processing_oculus.py         # OculusProcessor
│
├── training/                        # Training scripts
│   ├── train_oculus.py              # Base projector training
│   ├── train_detection.py           # Detection head training
│   ├── train_detection_extended.py
│   ├── train_instruction_tuning.py  # Instruct variant
│   ├── train_reasoning_v2.py        # Reasoning variant
│   └── train_oculus_coco.py         # COCO training
│
├── logs/                            # Training logs
│   ├── training_instruct_v1.log
│   ├── training_reasoning_v2.log
│   └── training_v2_final.log
│
├── checkpoints/                     # Model checkpoints
│   ├── oculus/final/                # Base projector
│   │   ├── projector.npz            # Vision projector weights (~822MB)
│   │   └── config.json
│   │
│   ├── oculus_detection/final/      # Detection checkpoint
│   │   ├── projector.npz            # Projector weights (~800MB)
│   │   ├── heads.pth                # Detection heads (~35MB)
│   │   └── benchmark_results.json
│   │
│   ├── oculus_instruct_v1/          # Instruction-tuned VQA
│   │   └── vqa_model/
│   │       ├── model.safetensors    # BLIP VQA weights (~1.5GB)
│   │       ├── tokenizer.json
│   │       └── config.json
│   │
│   └── oculus_reasoning_v2/         # Reasoning VQA
│       └── vqa_model/
│           ├── model.safetensors    # BLIP VQA weights (~1.5GB)
│           ├── tokenizer.json
│           └── config.json
│
├── docs/                            # Documentation
│   ├── ARCHITECTURE.md
│   ├── BENCHMARK_README.md
│   └── TRAINING_ROADMAP.md
│
├── oculus_inference.py              # Inference script
├── demo_oculus.py                   # Demo script
├── benchmark_vlm.py                 # Benchmarking
└── eval_benchmarks.py               # Evaluation
```
## Training
### Base Projector Training
```bash
python training/train_oculus.py
```
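
The separate `projector.npz` checkpoint above suggests that this stage updates only the vision projector while both encoders and the language model stay frozen. Below is a minimal sketch of that freezing pattern under that assumption, using placeholder modules rather than the actual contents of `train_oculus.py`.

```python
# Hypothetical sketch of the base stage: encoders and the LM stay frozen,
# only the projector receives gradients. All modules here are stand-ins.
import torch
import torch.nn as nn

vision_encoders = nn.Linear(16, 2176)   # stand-in for the frozen DINOv3 + SigLIP2 features
language_model = nn.Linear(1536, 10)    # stand-in for the frozen LFM 2.5 backbone
projector = nn.Sequential(nn.Linear(2176, 4352), nn.GELU(), nn.Linear(4352, 1536))

for frozen in (vision_encoders, language_model):
    frozen.requires_grad_(False)        # no gradient updates for the frozen parts

optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-4)

for step in range(10):
    images = torch.randn(8, 16)                  # dummy batch
    with torch.no_grad():
        feats = vision_encoders(images)          # frozen forward pass
    logits = language_model(projector(feats))    # gradients flow only into the projector
    loss = logits.pow(2).mean()                  # dummy objective
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```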
### Detection Head Training
```bash
python training/train_detection.py
```
### Instruction Tuning
```bash
python training/train_instruction_tuning.py
```
### Reasoning Training
```bash
python training/train_reasoning_v2.py
```
## Features
- **Visual Question Answering (VQA)** - Answer questions about images
- **Image Captioning** - Generate natural descriptions
- **Object Detection** - Detect objects with bounding boxes (80 COCO classes)
- **Object Counting** - Count objects via point prediction
- **Semantic Segmentation** - Pixel-level understanding (150 ADE20K classes)
- **Chain-of-Thought Reasoning** - Step-by-step thinking traces
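
All of these tasks go through the `ask` entry point shown in the installation section. A short sketch follows, assuming the Reasoning variant exposes the same `from_pretrained`/`ask` interface and that tasks are selected via the prompt; the prompts below are illustrative, not a documented task API.

```python
from oceanir import Oculus

# Assumption: the Reasoning variant shares the from_pretrained/ask interface
# shown in the quick start; prompts here are examples, not fixed task modes.
model = Oculus.from_pretrained("OceanirAI/Oculus-0.1-Reasoning")

caption = model.ask("street.jpg", "Describe this image in one sentence.")
count = model.ask("street.jpg", "How many cars are visible?")
reasoned = model.ask("street.jpg", "Is it safe to cross here? Think step by step.")
print(reasoned)
```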
## License
**Oceanir Research License v1.0**
**Permitted:**
- Academic research
- Educational use
- Publishing papers with results
- Personal experimentation
**Not Permitted:**
- Commercial use
- Training commercial models
- Commercial products/services
For commercial licensing: licensing@oceanir.ai
## Citation
```bibtex
@software{oculus2026,
title={Oculus Vision-Language Model},
author={OceanirAI},
year={2026},
url={https://huggingface.co/OceanirAI/Oculus}
}
```
## Links
- [Oculus-0.1-Instruct](https://huggingface.co/OceanirAI/Oculus-0.1-Instruct)
- [Oculus-0.1-Reasoning](https://huggingface.co/OceanirAI/Oculus-0.1-Reasoning)
- [Oceanir SDK (PyPI)](https://pypi.org/project/oceanir/)
- [GitHub](https://github.com/OceanirAI/oceanir)