---
license: other
license_name: oceanir-research-license
license_link: LICENSE
language:
- en
library_name: oceanir
pipeline_tag: image-text-to-text
tags:
- vision
- multimodal
- vision-language
- vqa
- image-captioning
- object-detection
- oculus
- research
- training
base_model:
- facebook/dinov3-vith16plus-pretrain-lvd1689m
- google/siglip2-base-patch16-224
- LiquidAI/LFM2.5-1.2B-Instruct-MLX-bf16
---
# Oculus - Complete Training Repository
This repository contains the complete Oculus vision-language model, including all training code, checkpoints, and documentation.
## Quick Links
| Model | Description | Link |
|-------|-------------|------|
| **Oculus-0.1-Instruct** | Instruction-tuned for VQA/captioning | [HuggingFace](https://huggingface.co/OceanirAI/Oculus-0.1-Instruct) |
| **Oculus-0.1-Reasoning** | Chain-of-thought reasoning | [HuggingFace](https://huggingface.co/OceanirAI/Oculus-0.1-Reasoning) |
| **oceanir** | Python SDK | [PyPI](https://pypi.org/project/oceanir/) |
## Installation
```bash
pip install oceanir
```
```python
from oceanir import Oculus
model = Oculus.from_pretrained("OceanirAI/Oculus-0.1-Instruct")
answer = model.ask("image.jpg", "What is this?")
```
## Architecture
Oculus combines state-of-the-art vision encoders with a powerful language model:
### Vision Encoders
- **DINOv3 ViT-H/16+** (`facebook/dinov3-vith16plus-pretrain-lvd1689m`)
- Self-supervised vision transformer trained on LVD-1689M
- 1024 hidden, 24 layers, 16 heads
- **SigLIP2** (`google/siglip2-base-patch16-224`)
- Vision-language contrastive model
- 1152 hidden, 27 layers, 16 heads
### Language Model
- **LiquidAI LFM 2.5 1.2B Instruct** (`LiquidAI/LFM2.5-1.2B-Instruct-MLX-bf16`)
- 1.2B parameters, 1536 embedding dim
- 131K vocab, 32K context window
### Architecture Specs
| Component | Specification |
|-----------|--------------|
| DINOv3 | ViT-H/16+, 1024D, 24L, 16H |
| SigLIP2 | Base, 1152D, 27L, 16H |
| Fusion | Concatenation → 2176D |
| Projector | 2176 → 4352 → 1536 |
| LFM 2.5 | 1.2B params, 1536D, 16L, 24H |
| Detection | 80 classes (COCO) |
| Segmentation | 150 classes (ADE20K) |
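
The projector row above describes a two-layer MLP applied to the concatenated encoder features. Below is a minimal sketch of that fusion path, assuming a plain PyTorch MLP and aligned token grids from both encoders; the class and parameter names here are illustrative and do not mirror the actual code in `modeling_oculus.py`.

```python
# Illustrative sketch only: shapes follow the spec table above
# (DINOv3 1024D + SigLIP2 1152D -> 2176D -> 4352D -> 1536D).
# FusionProjector is a hypothetical name, not a class from this repo.
import torch
import torch.nn as nn

class FusionProjector(nn.Module):
    def __init__(self, dino_dim=1024, siglip_dim=1152, hidden_dim=4352, lm_dim=1536):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dino_dim + siglip_dim, hidden_dim),  # 2176 -> 4352
            nn.GELU(),
            nn.Linear(hidden_dim, lm_dim),                  # 4352 -> 1536
        )

    def forward(self, dino_tokens, siglip_tokens):
        # Both inputs: (batch, num_patches, dim); assumes the two encoders
        # produce the same number of patch tokens so they can be concatenated
        # along the feature axis.
        fused = torch.cat([dino_tokens, siglip_tokens], dim=-1)  # (B, N, 2176)
        return self.mlp(fused)                                   # (B, N, 1536)

projector = FusionProjector()
vis = projector(torch.randn(1, 196, 1024), torch.randn(1, 196, 1152))
print(vis.shape)  # torch.Size([1, 196, 1536])
```

The 2176-dim input is simply the DINOv3 width (1024) plus the SigLIP2 width (1152), and the 1536-dim output matches the LFM 2.5 embedding size, so projected vision tokens can be fed to the language model alongside text tokens.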
## Repository Structure
```
OceanirAI/Oculus/
├── config.json                      # Main model config
├── README.md                        # This file
│
├── oculus_unified_model/            # Model implementation
│   ├── __init__.py
│   ├── modeling_oculus.py           # OculusForConditionalGeneration
│   ├── configuration_oculus.py      # OculusConfig
│   └── processing_oculus.py         # OculusProcessor
│
├── training/                        # Training scripts
│   ├── train_oculus.py              # Base projector training
│   ├── train_detection.py           # Detection head training
│   ├── train_detection_extended.py
│   ├── train_instruction_tuning.py  # Instruct variant
│   ├── train_reasoning_v2.py        # Reasoning variant
│   └── train_oculus_coco.py         # COCO training
│
├── logs/                            # Training logs
│   ├── training_instruct_v1.log
│   ├── training_reasoning_v2.log
│   └── training_v2_final.log
│
├── checkpoints/                     # Model checkpoints
│   ├── oculus/final/                # Base projector
│   │   ├── projector.npz            # Vision projector weights (~822MB)
│   │   └── config.json
│   │
│   ├── oculus_detection/final/      # Detection checkpoint
│   │   ├── projector.npz            # Projector weights (~800MB)
│   │   ├── heads.pth                # Detection heads (~35MB)
│   │   └── benchmark_results.json
│   │
│   ├── oculus_instruct_v1/          # Instruction-tuned VQA
│   │   └── vqa_model/
│   │       ├── model.safetensors    # BLIP VQA weights (~1.5GB)
│   │       ├── tokenizer.json
│   │       └── config.json
│   │
│   └── oculus_reasoning_v2/         # Reasoning VQA
│       └── vqa_model/
│           ├── model.safetensors    # BLIP VQA weights (~1.5GB)
│           ├── tokenizer.json
│           └── config.json
│
├── docs/                            # Documentation
│   ├── ARCHITECTURE.md
│   ├── BENCHMARK_README.md
│   └── TRAINING_ROADMAP.md
│
├── oculus_inference.py              # Inference script
├── demo_oculus.py                   # Demo script
├── benchmark_vlm.py                 # Benchmarking
└── eval_benchmarks.py               # Evaluation
```
## Training
### Base Projector Training
```bash
python training/train_oculus.py
```
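
The separate `projector.npz` checkpoint above suggests that this stage updates only the vision projector while both encoders and the language model stay frozen. Below is a minimal sketch of that freezing pattern under that assumption, using placeholder modules rather than the actual contents of `train_oculus.py`.

```python
# Hypothetical sketch of the base stage: encoders and the LM stay frozen,
# only the projector receives gradients. All modules here are stand-ins.
import torch
import torch.nn as nn

vision_encoders = nn.Linear(16, 2176)   # stand-in for the frozen DINOv3 + SigLIP2 features
language_model = nn.Linear(1536, 10)    # stand-in for the frozen LFM 2.5 backbone
projector = nn.Sequential(nn.Linear(2176, 4352), nn.GELU(), nn.Linear(4352, 1536))

for frozen in (vision_encoders, language_model):
    frozen.requires_grad_(False)        # no gradient updates for the frozen parts

optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-4)

for step in range(10):
    images = torch.randn(8, 16)                  # dummy batch
    with torch.no_grad():
        feats = vision_encoders(images)          # frozen forward pass
    logits = language_model(projector(feats))    # gradients flow only into the projector
    loss = logits.pow(2).mean()                  # dummy objective
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```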
### Detection Head Training
```bash
python training/train_detection.py
```
### Instruction Tuning
```bash
python training/train_instruction_tuning.py
```
### Reasoning Training
```bash
python training/train_reasoning_v2.py
```
## Features
- **Visual Question Answering (VQA)** - Answer questions about images
- **Image Captioning** - Generate natural descriptions
- **Object Detection** - Detect objects with bounding boxes (80 COCO classes)
- **Object Counting** - Count objects via point prediction
- **Semantic Segmentation** - Pixel-level understanding (150 ADE20K classes)
- **Chain-of-Thought Reasoning** - Step-by-step thinking traces
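
All of these tasks go through the `ask` entry point shown in the installation section. A short sketch follows, assuming the Reasoning variant exposes the same `from_pretrained`/`ask` interface and that tasks are selected via the prompt; the prompts below are illustrative, not a documented task API.

```python
from oceanir import Oculus

# Assumption: the Reasoning variant shares the from_pretrained/ask interface
# shown in the quick start; prompts here are examples, not fixed task modes.
model = Oculus.from_pretrained("OceanirAI/Oculus-0.1-Reasoning")

caption = model.ask("street.jpg", "Describe this image in one sentence.")
count = model.ask("street.jpg", "How many cars are visible?")
reasoned = model.ask("street.jpg", "Is it safe to cross here? Think step by step.")
print(reasoned)
```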
## License
**Oceanir Research License v1.0**
**Permitted:**
- Academic research
- Educational use
- Publishing papers with results
- Personal experimentation
**Not Permitted:**
- Commercial use
- Training commercial models
- Commercial products/services
For commercial licensing: licensing@oceanir.ai
## Citation
```bibtex
@software{oculus2026,
title={Oculus Vision-Language Model},
author={OceanirAI},
year={2026},
url={https://huggingface.co/OceanirAI/Oculus}
}
```
## Links
- [Oculus-0.1-Instruct](https://huggingface.co/OceanirAI/Oculus-0.1-Instruct)
- [Oculus-0.1-Reasoning](https://huggingface.co/OceanirAI/Oculus-0.1-Reasoning)
- [Oceanir SDK (PyPI)](https://pypi.org/project/oceanir/)
- [GitHub](https://github.com/OceanirAI/oceanir)