---
license: other
license_name: oceanir-research-license
license_link: LICENSE
language:
- en
library_name: oceanir
pipeline_tag: image-text-to-text
tags:
- vision
- multimodal
- vision-language
- vqa
- image-captioning
- object-detection
- oculus
- research
- training
base_model:
- facebook/dinov3-vith16plus-pretrain-lvd1689m
- google/siglip2-base-patch16-224
- LiquidAI/LFM2.5-1.2B-Instruct-MLX-bf16
---

# Oculus - Complete Training Repository

This repository contains the complete Oculus vision-language model, including all training code, checkpoints, and documentation.

## Quick Links

| Model | Description | Link |
|-------|-------------|------|
| **Oculus-0.1-Instruct** | Instruction-tuned for VQA/captioning | [HuggingFace](https://huggingface.co/OceanirAI/Oculus-0.1-Instruct) |
| **Oculus-0.1-Reasoning** | Chain-of-thought reasoning | [HuggingFace](https://huggingface.co/OceanirAI/Oculus-0.1-Reasoning) |
| **oceanir** | Python SDK | [PyPI](https://pypi.org/project/oceanir/) |

## Installation

```bash
pip install oceanir
```

```python
from oceanir import Oculus

model = Oculus.from_pretrained("OceanirAI/Oculus-0.1-Instruct")
answer = model.ask("image.jpg", "What is this?")
```

## Architecture

Oculus combines two state-of-the-art vision encoders with a compact instruction-tuned language model:

### Vision Encoders

- **DINOv3 ViT-H/16+** (`facebook/dinov3-vith16plus-pretrain-lvd1689m`)
  - Self-supervised vision transformer trained on LVD-1689M
  - 1024 hidden, 24 layers, 16 heads
- **SigLIP2** (`google/siglip2-base-patch16-224`)
  - Vision-language contrastive model
  - 1152 hidden, 27 layers, 16 heads

### Language Model

- **LiquidAI LFM 2.5 1.2B Instruct** (`LiquidAI/LFM2.5-1.2B-Instruct-MLX-bf16`)
  - 1.2B parameters, 1536 embedding dim
  - 131K vocab, 32K context window

### Architecture Specs

| Component | Specification |
|-----------|---------------|
| DINOv3 | ViT-H/16+, 1024D, 24L, 16H |
| SigLIP2 | Base, 1152D, 27L, 16H |
| Fusion | Concatenation → 2176D |
| Projector | 2176 → 4352 → 1536 |
| LFM 2.5 | 1.2B params, 1536D, 16L, 24H |
| Detection | 80 classes (COCO) |
| Segmentation | 150 classes (ADE20K) |

## Repository Structure

```
OceanirAI/Oculus/
├── config.json                      # Main model config
├── README.md                        # This file
│
├── oculus_unified_model/            # Model implementation
│   ├── __init__.py
│   ├── modeling_oculus.py           # OculusForConditionalGeneration
│   ├── configuration_oculus.py      # OculusConfig
│   └── processing_oculus.py         # OculusProcessor
│
├── training/                        # Training scripts
│   ├── train_oculus.py              # Base projector training
│   ├── train_detection.py           # Detection head training
│   ├── train_detection_extended.py
│   ├── train_instruction_tuning.py  # Instruct variant
│   ├── train_reasoning_v2.py        # Reasoning variant
│   └── train_oculus_coco.py         # COCO training
│
├── logs/                            # Training logs
│   ├── training_instruct_v1.log
│   ├── training_reasoning_v2.log
│   └── training_v2_final.log
│
├── checkpoints/                     # Model checkpoints
│   ├── oculus/final/                # Base projector
│   │   ├── projector.npz            # Vision projector weights (~822MB)
│   │   └── config.json
│   │
│   ├── oculus_detection/final/      # Detection checkpoint
│   │   ├── projector.npz            # Projector weights (~800MB)
│   │   ├── heads.pth                # Detection heads (~35MB)
│   │   └── benchmark_results.json
│   │
│   ├── oculus_instruct_v1/          # Instruction-tuned VQA
│   │   └── vqa_model/
│   │       ├── model.safetensors    # BLIP VQA weights (~1.5GB)
│   │       ├── tokenizer.json
│   │       └── config.json
│   │
│   └── oculus_reasoning_v2/         # Reasoning VQA
│       └── vqa_model/
│           ├── model.safetensors    # BLIP VQA weights (~1.5GB)
│           ├── tokenizer.json
│           └── config.json
│
├── docs/                            # Documentation
│   ├── ARCHITECTURE.md
│   ├── BENCHMARK_README.md
│   └── TRAINING_ROADMAP.md
│
├── oculus_inference.py              # Inference script
├── demo_oculus.py                   # Demo script
├── benchmark_vlm.py                 # Benchmarking
└── eval_benchmarks.py               # Evaluation
```

## Training

### Base Projector Training

```bash
python training/train_oculus.py
```

### Detection Head Training

```bash
python training/train_detection.py
```

### Instruction Tuning

```bash
python training/train_instruction_tuning.py
```

### Reasoning Training

```bash
python training/train_reasoning_v2.py
```

## Features

- **Visual Question Answering (VQA)** - Answer questions about images
- **Image Captioning** - Generate natural descriptions
- **Object Detection** - Detect objects with bounding boxes (80 COCO classes)
- **Object Counting** - Count objects via point prediction
- **Semantic Segmentation** - Pixel-level understanding (150 ADE20K classes)
- **Chain-of-Thought Reasoning** - Step-by-step thinking traces

## License

**Oceanir Research License v1.0**

**Permitted:**
- Academic research
- Educational use
- Publishing papers with results
- Personal experimentation

**Not Permitted:**
- Commercial use
- Training commercial models
- Commercial products/services

For commercial licensing: licensing@oceanir.ai

## Citation

```bibtex
@software{oculus2026,
  title={Oculus Vision-Language Model},
  author={OceanirAI},
  year={2026},
  url={https://huggingface.co/OceanirAI/Oculus}
}
```

## Links

- [Oculus-0.1-Instruct](https://huggingface.co/OceanirAI/Oculus-0.1-Instruct)
- [Oculus-0.1-Reasoning](https://huggingface.co/OceanirAI/Oculus-0.1-Reasoning)
- [Oceanir SDK (PyPI)](https://pypi.org/project/oceanir/)
- [GitHub](https://github.com/OceanirAI/oceanir)
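## Appendix: How the Fusion and Projector Dimensions Fit Together

To make the Architecture Specs concrete, here is a minimal NumPy sketch of the fusion-plus-projector path: per-patch features from the two encoders are concatenated (1024 + 1152 = 2176) and mapped through a two-layer MLP (2176 → 4352 → 1536) into the language model's embedding space. This is an illustration of the stated dimensions only; the layer names, GELU activation, random initialization, and 196-token count (14×14 patches at 224/16) are assumptions, not the actual `oculus_unified_model` implementation.

```python
import numpy as np

# Dimensions taken from the Architecture Specs table.
D_DINO, D_SIGLIP = 1024, 1152       # per-token feature dims of the two encoders
D_FUSED = D_DINO + D_SIGLIP         # 2176 after concatenation
D_HIDDEN, D_LM = 4352, 1536         # projector hidden dim and LFM 2.5 embedding dim

rng = np.random.default_rng(0)
W1 = rng.standard_normal((D_FUSED, D_HIDDEN)) * 0.02
b1 = np.zeros(D_HIDDEN)
W2 = rng.standard_normal((D_HIDDEN, D_LM)) * 0.02
b2 = np.zeros(D_LM)

def gelu(x):
    # tanh approximation of GELU (activation choice is an assumption)
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def project(dino_tokens, siglip_tokens):
    """Concatenate both encoders' token features, then project them into the
    language model's embedding space via a two-layer MLP."""
    fused = np.concatenate([dino_tokens, siglip_tokens], axis=-1)  # (N, 2176)
    return gelu(fused @ W1 + b1) @ W2 + b2                         # (N, 1536)

# 196 tokens = 14x14 patches for a 224px image with 16px patches (assumed).
tokens = project(rng.standard_normal((196, D_DINO)),
                 rng.standard_normal((196, D_SIGLIP)))
print(tokens.shape)  # (196, 1536)
```

The projected tokens would then be prepended (or interleaved) with text embeddings before entering the language model; that wiring lives in `oculus_unified_model/modeling_oculus.py`.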