Oculus - Complete Training Repository

This repository contains the complete Oculus vision-language model, including all training code, checkpoints, and documentation.

Quick Links

Model                  Description                            Link
Oculus-0.1-Instruct    Instruction-tuned for VQA/captioning   HuggingFace
Oculus-0.1-Reasoning   Chain-of-thought reasoning             HuggingFace
oceanir                Python SDK                             PyPI

Installation

pip install oceanir

Quick start:

from oceanir import Oculus

# Load the instruction-tuned variant and ask a question about a local image
model = Oculus.from_pretrained("OceanirAI/Oculus-0.1-Instruct")
answer = model.ask("image.jpg", "What is this?")
print(answer)

Architecture

Oculus combines state-of-the-art vision encoders with a powerful language model:

Vision Encoders

  • DINOv3 ViT-H/16+ (facebook/dinov3-vith16plus-pretrain-lvd1689m)
    • Self-supervised vision transformer trained on LVD-1689M
    • 1024 hidden, 24 layers, 16 heads
  • SigLIP2 (google/siglip2-base-patch16-224)
    • Vision-language contrastive model
    • 1152 hidden, 27 layers, 16 heads

Language Model

  • LiquidAI LFM 2.5 1.2B Instruct (LiquidAI/LFM2.5-1.2B-Instruct-MLX-bf16)
    • 1.2B parameters, 1536 embedding dim
    • 131K vocab, 32K context window

Architecture Specs

Component      Specification
DINOv3         ViT-H/16+, 1024D, 24L, 16H
SigLIP2        Base, 1152D, 27L, 16H
Fusion         Concatenation → 2176D
Projector      2176 → 4352 → 1536
LFM 2.5        1.2B params, 1536D, 16L, 24H
Detection      80 classes (COCO)
Segmentation   150 classes (ADE20K)
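
The fusion path is simple enough to sketch. Below is a minimal PyTorch illustration of concatenation plus a two-layer MLP projector using the dimensions from the table; the class name, GELU activation, and exact layer layout are assumptions, not the actual code in oculus_unified_model/modeling_oculus.py.

import torch
import torch.nn as nn

class FusionProjector(nn.Module):
    """Concatenate per-patch DINOv3 (1024D) and SigLIP2 (1152D) features,
    then project 2176 -> 4352 -> 1536 to match the LFM embedding size."""

    def __init__(self, dino_dim=1024, siglip_dim=1152, lm_dim=1536):
        super().__init__()
        fused_dim = dino_dim + siglip_dim              # 2176
        self.mlp = nn.Sequential(
            nn.Linear(fused_dim, fused_dim * 2),       # 2176 -> 4352
            nn.GELU(),
            nn.Linear(fused_dim * 2, lm_dim),          # 4352 -> 1536
        )

    def forward(self, dino_feats, siglip_feats):
        # Both inputs are (batch, num_patches, dim); this assumes the two
        # encoders are run at resolutions that yield matching patch counts.
        return self.mlp(torch.cat([dino_feats, siglip_feats], dim=-1))

# Smoke test with dummy features
tokens = FusionProjector()(torch.randn(1, 196, 1024), torch.randn(1, 196, 1152))
print(tokens.shape)  # torch.Size([1, 196, 1536])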

Repository Structure

OceanirAI/Oculus/
├── config.json                    # Main model config
├── README.md                      # This file
│
├── oculus_unified_model/          # Model implementation
│   ├── __init__.py
│   ├── modeling_oculus.py         # OculusForConditionalGeneration
│   ├── configuration_oculus.py    # OculusConfig
│   └── processing_oculus.py       # OculusProcessor
│
├── training/                      # Training scripts
│   ├── train_oculus.py            # Base projector training
│   ├── train_detection.py         # Detection head training
│   ├── train_detection_extended.py
│   ├── train_instruction_tuning.py # Instruct variant
│   ├── train_reasoning_v2.py      # Reasoning variant
│   └── train_oculus_coco.py       # COCO training
│
├── logs/                          # Training logs
│   ├── training_instruct_v1.log
│   ├── training_reasoning_v2.log
│   └── training_v2_final.log
│
├── checkpoints/                   # Model checkpoints
│   ├── oculus/final/              # Base projector
│   │   ├── projector.npz          # Vision projector weights (~822MB)
│   │   └── config.json
│   │
│   ├── oculus_detection/final/    # Detection checkpoint
│   │   ├── projector.npz          # Projector weights (~800MB)
│   │   ├── heads.pth              # Detection heads (~35MB)
│   │   └── benchmark_results.json
│   │
│   ├── oculus_instruct_v1/        # Instruction-tuned VQA
│   │   └── vqa_model/
│   │       ├── model.safetensors  # BLIP VQA weights (~1.5GB)
│   │       ├── tokenizer.json
│   │       └── config.json
│   │
│   └── oculus_reasoning_v2/       # Reasoning VQA
│       └── vqa_model/
│           ├── model.safetensors  # BLIP VQA weights (~1.5GB)
│           ├── tokenizer.json
│           └── config.json
│
├── docs/                          # Documentation
│   ├── ARCHITECTURE.md
│   ├── BENCHMARK_README.md
│   └── TRAINING_ROADMAP.md
│
├── oculus_inference.py            # Inference script
├── demo_oculus.py                 # Demo script
├── benchmark_vlm.py               # Benchmarking
└── eval_benchmarks.py             # Evaluation

Training

Base Projector Training

python training/train_oculus.py

Detection Head Training

python training/train_detection.py

Instruction Tuning

python training/train_instruction_tuning.py

Reasoning Training

python training/train_reasoning_v2.py
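
Each stage writes its outputs under checkpoints/ (see the repository tree above). Here is a minimal sketch for inspecting them; it assumes heads.pth is a flat state dict, and the actual key names are unknown until printed:

import numpy as np
import torch

# Vision projector weights are stored as a numpy archive
projector = np.load("checkpoints/oculus/final/projector.npz")
for name in projector.files:
    array = projector[name]
    print(f"{name}: shape={array.shape}, dtype={array.dtype}")

# Detection heads are a torch checkpoint; load on CPU to inspect
heads = torch.load("checkpoints/oculus_detection/final/heads.pth", map_location="cpu")
for name, tensor in heads.items():
    print(f"{name}: {tuple(tensor.shape)}")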

Features

  • Visual Question Answering (VQA) - Answer questions about images
  • Image Captioning - Generate natural descriptions
  • Object Detection - Detect with bounding boxes (80 COCO classes)
  • Object Counting - Count objects via point prediction
  • Semantic Segmentation - Pixel-level understanding (150 ADE20K classes)
  • Chain-of-Thought Reasoning - Step-by-step thinking traces
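
Language-driven tasks go through the same ask() entry point shown under Installation. A short hedged sketch of the two released variants follows; the prompt strings and image file names are illustrative, and whether captioning and counting route through ask() this way is an assumption:

from oceanir import Oculus

# Instruction-tuned variant: VQA, captioning, and counting via prompts
instruct = Oculus.from_pretrained("OceanirAI/Oculus-0.1-Instruct")
print(instruct.ask("kitchen.jpg", "Describe this image."))
print(instruct.ask("kitchen.jpg", "How many mugs are on the counter?"))

# Reasoning variant: emits step-by-step traces before answering
reasoning = Oculus.from_pretrained("OceanirAI/Oculus-0.1-Reasoning")
print(reasoning.ask("receipt.jpg", "What is the total? Think step by step."))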

License

Oceanir Research License v1.0

Permitted:

  • Academic research
  • Educational use
  • Publishing papers with results
  • Personal experimentation

Not Permitted:

  • Commercial use
  • Training commercial models
  • Commercial products/services

For commercial licensing: licensing@oceanir.ai

Citation

@software{oculus2026,
  title={Oculus Vision-Language Model},
  author={OceanirAI},
  year={2026},
  url={https://huggingface.co/OceanirAI/Oculus}
}
