---
license: other
license_name: oceanir-research-license
license_link: LICENSE
language:
- en
library_name: oceanir
pipeline_tag: image-text-to-text
tags:
- vision
- multimodal
- vision-language
- vqa
- image-captioning
- object-detection
- oculus
- research
- training
base_model:
- facebook/dinov3-vith16plus-pretrain-lvd1689m
- google/siglip2-base-patch16-224
- LiquidAI/LFM2.5-1.2B-Instruct-MLX-bf16
---

# Oculus - Complete Training Repository

This repository contains the complete Oculus vision-language model, including all training code, checkpoints, and documentation.

## Quick Links

| Model | Description | Link |
|-------|-------------|------|
| **Oculus-0.1-Instruct** | Instruction-tuned for VQA/captioning | [HuggingFace](https://huggingface.co/OceanirAI/Oculus-0.1-Instruct) |
| **Oculus-0.1-Reasoning** | Chain-of-thought reasoning | [HuggingFace](https://huggingface.co/OceanirAI/Oculus-0.1-Reasoning) |
| **oceanir** | Python SDK | [PyPI](https://pypi.org/project/oceanir/) |

## Installation

```bash
pip install oceanir
```

```python
from oceanir import Oculus

# Download and load the instruction-tuned checkpoint from the Hugging Face Hub
model = Oculus.from_pretrained("OceanirAI/Oculus-0.1-Instruct")

# Ask a free-form question about a local image
answer = model.ask("image.jpg", "What is this?")
print(answer)
```

## Architecture

Oculus combines two complementary vision encoders with a compact instruction-tuned language model:

### Vision Encoders
- **DINOv3 ViT-H/16+** (`facebook/dinov3-vith16plus-pretrain-lvd1689m`)
  - Self-supervised vision transformer trained on LVD-1689M
  - 1024-dim hidden size, 24 layers, 16 heads

- **SigLIP2** (`google/siglip2-base-patch16-224`)
  - Vision-language contrastive model
  - 1152-dim hidden size, 27 layers, 16 heads

### Language Model
- **LiquidAI LFM 2.5 1.2B Instruct** (`LiquidAI/LFM2.5-1.2B-Instruct-MLX-bf16`)
  - 1.2B parameters, 1536 embedding dim
  - 131K vocab, 32K context window

### Architecture Specs

| Component | Specification |
|-----------|--------------|
| DINOv3 | ViT-H/16+, 1024D, 24L, 16H |
| SigLIP2 | Base, 1152D, 27L, 16H |
| Fusion | Concatenation → 2176D |
| Projector | 2176 → 4352 → 1536 |
| LFM 2.5 | 1.2B params, 1536D, 16L, 24H |
| Detection | 80 classes (COCO) |
| Segmentation | 150 classes (ADE20K) |
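
To make the fusion and projector shapes concrete, here is a minimal PyTorch sketch matching the table above. The layer sizes (1024 + 1152 = 2176 → 4352 → 1536) come from the spec; the GELU activation, the module name, and the assumption that both encoders emit aligned token grids are illustrative guesses, not confirmed details (the real wiring lives in `oculus_unified_model/modeling_oculus.py`).

```python
import torch
import torch.nn as nn


class OculusProjectorSketch(nn.Module):
    """Illustrative fusion + projector MLP; dimensions follow the spec table above.

    Assumptions (not confirmed by this repo): GELU activation, channel-wise
    concatenation of per-token features, and equal token counts from both encoders.
    """

    def __init__(self, dino_dim=1024, siglip_dim=1152, hidden_dim=4352, lm_dim=1536):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(dino_dim + siglip_dim, hidden_dim),  # 2176 -> 4352
            nn.GELU(),
            nn.Linear(hidden_dim, lm_dim),                 # 4352 -> 1536 (LFM embedding dim)
        )

    def forward(self, dino_feats: torch.Tensor, siglip_feats: torch.Tensor) -> torch.Tensor:
        # dino_feats: (B, N, 1024), siglip_feats: (B, N, 1152)
        fused = torch.cat([dino_feats, siglip_feats], dim=-1)  # (B, N, 2176)
        return self.proj(fused)                                # (B, N, 1536)
```

In a LLaVA-style design, these projected tokens would then be spliced into the language model's input embedding sequence alongside the text tokens.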

## Repository Structure

```
OceanirAI/Oculus/
├── config.json                    # Main model config
├── README.md                      # This file
│
├── oculus_unified_model/          # Model implementation
│   ├── __init__.py
│   ├── modeling_oculus.py         # OculusForConditionalGeneration
│   ├── configuration_oculus.py    # OculusConfig
│   └── processing_oculus.py       # OculusProcessor
│
├── training/                      # Training scripts
│   ├── train_oculus.py            # Base projector training
│   ├── train_detection.py         # Detection head training
│   ├── train_detection_extended.py
│   ├── train_instruction_tuning.py # Instruct variant
│   ├── train_reasoning_v2.py      # Reasoning variant
│   └── train_oculus_coco.py       # COCO training
│
├── logs/                          # Training logs
│   ├── training_instruct_v1.log
│   ├── training_reasoning_v2.log
│   └── training_v2_final.log
│
├── checkpoints/                   # Model checkpoints
│   ├── oculus/final/              # Base projector
│   │   ├── projector.npz          # Vision projector weights (~822MB)
│   │   └── config.json
│   │
│   ├── oculus_detection/final/    # Detection checkpoint
│   │   ├── projector.npz          # Projector weights (~800MB)
│   │   ├── heads.pth              # Detection heads (~35MB)
│   │   └── benchmark_results.json
│   │
│   ├── oculus_instruct_v1/        # Instruction-tuned VQA
│   │   └── vqa_model/
│   │       ├── model.safetensors  # BLIP VQA weights (~1.5GB)
│   │       ├── tokenizer.json
│   │       └── config.json
│   │
│   └── oculus_reasoning_v2/       # Reasoning VQA
│       └── vqa_model/
│           ├── model.safetensors  # BLIP VQA weights (~1.5GB)
│           ├── tokenizer.json
│           └── config.json
│
├── docs/                          # Documentation
│   ├── ARCHITECTURE.md
│   ├── BENCHMARK_README.md
│   └── TRAINING_ROADMAP.md
│
├── oculus_inference.py            # Inference script
├── demo_oculus.py                 # Demo script
├── benchmark_vlm.py               # Benchmarking
└── eval_benchmarks.py             # Evaluation
```

## Training

### Base Projector Training
```bash
python training/train_oculus.py
```

### Detection Head Training
```bash
python training/train_detection.py
```

### Instruction Tuning
```bash
python training/train_instruction_tuning.py
```

### Reasoning Training
```bash
python training/train_reasoning_v2.py
```
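
Each stage writes its outputs under `checkpoints/` (see the repository tree above); the base stage, for example, produces `checkpoints/oculus/final/projector.npz`. As a quick sanity check after training, the archive can be inspected with NumPy. This is a generic sketch: the array key names inside the `.npz` are repository-specific, so it enumerates them rather than assuming any.

```python
import numpy as np

# List every array stored in the trained projector checkpoint.
# Key names are repository-specific, so print them instead of guessing.
ckpt = np.load("checkpoints/oculus/final/projector.npz")
for name in ckpt.files:
    print(f"{name}: shape={ckpt[name].shape}, dtype={ckpt[name].dtype}")
```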

## Features

- **Visual Question Answering (VQA)** - Answer questions about images
- **Image Captioning** - Generate natural descriptions
- **Object Detection** - Detect with bounding boxes (80 COCO classes)
- **Object Counting** - Count objects via point prediction
- **Semantic Segmentation** - Pixel-level understanding (150 ADE20K classes)
- **Chain-of-Thought Reasoning** - Step-by-step thinking traces
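
Most of these tasks can be exercised through the same `ask()` interface shown in the installation section by phrasing the task in the prompt. The prompts below are illustrative only; any task-specific APIs (e.g. for bounding boxes or segmentation masks) are not shown here, since their SDK signatures are not documented in this README.

```python
from oceanir import Oculus

model = Oculus.from_pretrained("OceanirAI/Oculus-0.1-Instruct")

# Captioning: phrase the task as an instruction
print(model.ask("photo.jpg", "Describe this image in one sentence."))

# VQA: ask a direct question
print(model.ask("photo.jpg", "What color is the car?"))

# Counting
print(model.ask("photo.jpg", "How many people are in this image?"))
```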

## License

**Oceanir Research License v1.0**

**Permitted:**
- Academic research
- Educational use
- Publishing papers with results
- Personal experimentation

**Not Permitted:**
- Commercial use
- Training commercial models
- Commercial products/services

For commercial licensing, contact licensing@oceanir.ai.

## Citation

```bibtex
@software{oculus2026,
  title={Oculus Vision-Language Model},
  author={OceanirAI},
  year={2026},
  url={https://huggingface.co/OceanirAI/Oculus}
}
```

## Links

- [Oculus-0.1-Instruct](https://huggingface.co/OceanirAI/Oculus-0.1-Instruct)
- [Oculus-0.1-Reasoning](https://huggingface.co/OceanirAI/Oculus-0.1-Reasoning)
- [Oceanir SDK (PyPI)](https://pypi.org/project/oceanir/)
- [GitHub](https://github.com/OceanirAI/oceanir)