kobiakor15 committed on
Commit 79310dc · verified · 1 Parent(s): a3d5104

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +150 -198
README.md CHANGED
@@ -1,251 +1,203 @@
  ---
- license: cc-by-nc-4.0
  language:
  - en
  pipeline_tag: image-text-to-text
  tags:
  - vision
  - multimodal
  - vision-language
- - segmentation
- - detection
- - ocr
- - dinov3
- - siglip2
- - lfm2.5
  base_model:
  - facebook/dinov3-vith16plus-pretrain-lvd1689m
- - google/siglip2-so400m-patch16-naflex
- - LiquidAI/LFM2.5-1.2B-Base
  ---
 
- # Oculus 0.1

- A multimodal vision-language model combining DINOv3, SigLIP2, and LFM2.5.

- ## What is this?

- Oculus is a universal vision-language model for:
- - **Image Captioning**: Generate natural language descriptions
- - **Visual Question Answering**: Answer questions about images
- - **Semantic Segmentation**: Pixel-level class prediction
- - **Image Classification**: Global image classification
- - **Object Detection**: Bounding box prediction
- - **OCR**: Text detection and recognition

- ## Model Architecture
 
- ```
- Image (224×224) ──→ DINOv3 ViT-L/16 ──┐
-                                       ├──→ Concatenate ──→ Projector ──→ LFM2.5-1.2B
- Image (384×384) ──→ SigLIP2 SO400M  ──┘                                       │
-                                                                               ├──→ Text Output (Caption/VQA)
- Segmentation Head ──→ Segmentation Map
- Classification Head ──→ Class Label
- Detection Head ──→ Boxes + Classes
- OCR Head ──→ Text + Geometry
- ```
-
- ## Components
-
- | Component | Model | Parameters | Input | Output |
- |-----------|-------|------------|-------|--------|
- | Vision Encoder 1 | DINOv3 ViT-H/16+ | 1.7B | 224×224 | 256×1280 |
- | Vision Encoder 2 | SigLIP2 SO400M | 400M | 384×384 | 576×1152 |
- | Fusion | Concatenation | - | 2432D | 2432D |
- | Projector | 2-layer MLP | ~5M | 2432D | 1536D |
- | Language Model | LFM2.5-1.2B | 1.2B | 1536D | Text |
- | Segmentation Head | MLP | ~0.5M | 2432D | 14×14×150 |
- | Classification Head | MLP | ~0.3M | 2432D | 1000 |
- | Detection Head | MLP | ~0.5M | 2432D | Boxes + Classes |
- | OCR Head | CNN + MLP | ~0.3M | 2432D | Text + Geometry |
-
- **Total: ~4.5B parameters**
-
- ## Usage
-
- ### Basic Language Generation
-
- ```python
- from oculus import create_oculus_model
- import mx
-
- model = create_oculus_model(num_classes=150)
-
- dinov3_image = mx.random.normal((1, 3, 224, 224))
- siglip2_image = mx.random.normal((1, 3, 384, 384))
- prompt = mx.array([[1, 2, 3, 4, 5]]) # Tokenized text
-
- generated = model.generate(
-     input_ids=prompt,
-     x_dinov3=dinov3_image,
-     x_siglip2=siglip2_image,
-     max_new_tokens=512,
-     temperature=0.7,
- )
- print(f"Generated: {generated.tolist()}")
  ```
 
- ### Visual Question Answering
-
  ```python
- from oculus import create_oculus_model
- import mx
-
- model = create_oculus_model()
-
- dinov3_image = mx.random.normal((1, 3, 224, 224))
- siglip2_image = mx.random.normal((1, 3, 384, 384))

- question = mx.array([[1, 2, 3, 4, 5, 6, 7, 8]]) # "What is in the image?"
-
- answer = model.generate(
-     input_ids=question,
-     x_dinov3=dinov3_image,
-     x_siglip2=siglip2_image,
-     max_new_tokens=100,
- )
- print(f"Answer: {answer.tolist()}")
  ```
 
- ### Semantic Segmentation
-
- ```python
- from oculus import create_oculus_model
- import mx

- model = create_oculus_model(num_classes=150) # ADE20K

- dinov3_image = mx.random.normal((1, 3, 224, 224))
- siglip2_image = mx.random.normal((1, 3, 384, 384))

- predictions = model.segment(dinov3_image, siglip2_image)
- print(f"Segmentation shape: {predictions.shape}") # (1, 14, 14)
- ```

- ### Image Classification

- ```python
- from oculus import create_oculus_model
- import mx

- model = create_oculus_model(num_classes=1000)

- dinov3_image = mx.random.normal((4, 3, 224, 224))
- siglip2_image = mx.random.normal((4, 3, 384, 384))

- class_id = model.classify(dinov3_image, siglip2_image)
- print(f"Predicted classes: {class_id.tolist()}")
  ```
-
- ### Object Detection
-
- ```python
- from oculus import create_oculus_model
- import mx
-
- model = create_oculus_model(num_classes=80) # COCO
-
- dinov3_image = mx.random.normal((1, 3, 224, 224))
- siglip2_image = mx.random.normal((1, 3, 384, 384))
-
- cls_logits, bbox_preds = model.detect(dinov3_image, siglip2_image)
- print(f"Class logits: {cls_logits.shape}") # (1, 196, 9, 80)
- print(f"Box predictions: {bbox_preds.shape}") # (1, 196, 9, 4)
  ```
 
- ### OCR
-
- ```python
- from oculus import create_oculus_model
- import mx
-
- model = create_oculus_model()
-
- dinov3_image = mx.random.normal((1, 3, 224, 224))
- siglip2_image = mx.random.normal((1, 3, 384, 384))

- text_logits, geo_preds = model.ocr(dinov3_image, siglip2_image)
- print(f"Text logits: {text_logits.shape}") # (14, 14, max_seq_len)
- print(f"Geometry: {geo_preds.shape}") # (196, 4)
  ```
 
- ## Loading Pretrained Weights
-
- ```python
- import os
- from oculus import (
-     create_oculus_model,
-     load_dinov3_from_hf,
-     load_siglip2_from_hf,
-     load_lfm2_from_hf,
- )
-
- model = create_oculus_model(num_classes=150)
-
- token = os.getenv("HF_TOKEN")
-
- load_dinov3_from_hf(
-     model.dinov3_encoder,
-     repo_id="facebook/dinov3-vith16plus-pretrain-lvd1689m",
-     token=token,
- )
-
- load_siglip2_from_hf(
-     model.siglip2_encoder,
-     repo_id="google/siglip2-so400m-patch16-naflex",
-     token=token,
- )
-
- load_lfm2_from_hf(
-     model.language_model,
-     repo_id="LiquidAI/LFM2.5-1.2B-Base",
-     token=token,
- )
  ```
 
- ## Running Examples
-
  ```bash
- cd Oculus/src/models
- python oculus_example.py
  ```
 
- ## Performance

- | Task | Dataset | Metric | Expected |
- |------|---------|--------|----------|
- | Image Classification | ImageNet | Top-1 | ~75% |
- | Semantic Segmentation | ADE20K | mIoU | ~45% |
- | Object Detection | COCO | mAP | ~45% |
- | VQA | VQA2.0 | Accuracy | ~65% |

- ## Memory Requirements

- | Mode | Memory |
- |------|--------|
- | Inference | ~10 GB |
- | Training (frozen encoders) | ~12 GB |
- | Training (full) | ~30 GB |

- ## Requirements

- ```bash
- pip install mlx
- pip install huggingface_hub # for pretrained weights
- ```

- ## Model Sources

- - DINOv3: [facebook/dinov3-vith16plus-pretrain-lvd1689m](https://huggingface.co/facebook/dinov3-vith16plus-pretrain-lvd1689m)
- - SigLIP2: [google/siglip2-so400m-patch16-naflex](https://huggingface.co/google/siglip2-so400m-patch16-naflex)
- - LFM2.5: [LiquidAI/LFM2.5-1.2B-Base](https://huggingface.co/LiquidAI/LFM2.5-1.2B-Base)

- ## License

- CC-BY-NC-4.0

- ## Contact

- - Organization: OceanirAI
- - GitHub: github.com/Oceanir
 
 
 
  ---
+ license: other
+ license_name: oceanir-research-license
+ license_link: LICENSE
  language:
  - en
+ library_name: oceanir
  pipeline_tag: image-text-to-text
  tags:
  - vision
  - multimodal
  - vision-language
+ - vqa
+ - image-captioning
+ - object-detection
+ - oculus
+ - research
+ - training
  base_model:
  - facebook/dinov3-vith16plus-pretrain-lvd1689m
+ - google/siglip2-base-patch16-224
+ - LiquidAI/LFM2.5-1.2B-Instruct-MLX-bf16
  ---
 
+ # Oculus - Complete Training Repository

+ This repository contains the complete Oculus vision-language model including all training code, checkpoints, and documentation.

+ ## Quick Links

+ | Model | Description | Link |
+ |-------|-------------|------|
+ | **Oculus-0.1-Instruct** | Instruction-tuned for VQA/captioning | [HuggingFace](https://huggingface.co/OceanirAI/Oculus-0.1-Instruct) |
+ | **Oculus-0.1-Reasoning** | Chain-of-thought reasoning | [HuggingFace](https://huggingface.co/OceanirAI/Oculus-0.1-Reasoning) |
+ | **oceanir** | Python SDK | [PyPI](https://pypi.org/project/oceanir/) |
 
+ ## Installation

+ ```bash
+ pip install oceanir
  ```

  ```python
+ from oceanir import Oculus

+ model = Oculus.from_pretrained("OceanirAI/Oculus-0.1-Instruct")
+ answer = model.ask("image.jpg", "What is this?")
  ```
 
+ ## Architecture

+ Oculus combines state-of-the-art vision encoders with a powerful language model:

+ ### Vision Encoders
+ - **DINOv3 ViT-H/16+** (`facebook/dinov3-vith16plus-pretrain-lvd1689m`)
+   - Self-supervised vision transformer trained on LVD-1689M
+   - 1024 hidden, 24 layers, 16 heads

+ - **SigLIP2** (`google/siglip2-base-patch16-224`)
+   - Vision-language contrastive model
+   - 1152 hidden, 27 layers, 16 heads

+ ### Language Model
+ - **LiquidAI LFM 2.5 1.2B Instruct** (`LiquidAI/LFM2.5-1.2B-Instruct-MLX-bf16`)
+   - 1.2B parameters, 1536 embedding dim
+   - 131K vocab, 32K context window
 
+ ### Architecture Specs

+ | Component | Specification |
+ |-----------|--------------|
+ | DINOv3 | ViT-H/16+, 1024D, 24L, 16H |
+ | SigLIP2 | Base, 1152D, 27L, 16H |
+ | Fusion | Concatenation → 2176D |
+ | Projector | 2176 → 4352 → 1536 |
+ | LFM 2.5 | 1.2B params, 1536D, 16L, 24H |
+ | Detection | 80 classes (COCO) |
+ | Segmentation | 150 classes (ADE20K) |
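
The Fusion and Projector rows of the new README describe a shape pipeline: per-token features of width 1024 and 1152 are concatenated to 2176, expanded to 4352, and projected to the 1536-wide language-model embedding. A minimal NumPy sketch of those shapes follows. The random weights, the GELU activation, the token count of 196, and the names `init_projector`, `gelu`, and `project` are assumptions for illustration and are not taken from the repository.

```python
import numpy as np

DINO_DIM, SIGLIP_DIM = 1024, 1152      # per-token widths from the specs table
FUSED_DIM = DINO_DIM + SIGLIP_DIM      # 2176 after concatenation
HIDDEN_DIM, LM_DIM = 4352, 1536        # projector: 2176 -> 4352 -> 1536


def init_projector(rng):
    """Random weights for a 2-layer MLP (illustrative values only)."""
    return {
        "w1": rng.normal(0.0, 0.02, (FUSED_DIM, HIDDEN_DIM)),
        "b1": np.zeros(HIDDEN_DIM),
        "w2": rng.normal(0.0, 0.02, (HIDDEN_DIM, LM_DIM)),
        "b2": np.zeros(LM_DIM),
    }


def gelu(x):
    """tanh approximation of GELU; the activation choice is an assumption."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def project(dino_tokens, siglip_tokens, params):
    """Concatenate per-token features and map them to the LM width."""
    fused = np.concatenate([dino_tokens, siglip_tokens], axis=-1)
    hidden = gelu(fused @ params["w1"] + params["b1"])
    return hidden @ params["w2"] + params["b2"]


rng = np.random.default_rng(0)
params = init_projector(rng)
num_tokens = 196  # assumed; both encoders must yield the same token count
dino = rng.normal(size=(1, num_tokens, DINO_DIM))
siglip = rng.normal(size=(1, num_tokens, SIGLIP_DIM))
vision_embeds = project(dino, siglip, params)
print(vision_embeds.shape)
```

Concatenating along the feature axis only works when both encoders emit the same number of tokens, so a real pipeline would need to resample or pool one of the two token grids first.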
 
+ ## Repository Structure

  ```
+ OceanirAI/Oculus/
+ ├── config.json  # Main model config
+ ├── README.md  # This file
+ │
+ ├── oculus_unified_model/  # Model implementation
+ │   ├── __init__.py
+ │   ├── modeling_oculus.py  # OculusForConditionalGeneration
+ │   ├── configuration_oculus.py  # OculusConfig
+ │   └── processing_oculus.py  # OculusProcessor
+ │
+ ├── training/  # Training scripts
+ │   ├── train_oculus.py  # Base projector training
+ │   ├── train_detection.py  # Detection head training
+ │   ├── train_detection_extended.py
+ │   ├── train_instruction_tuning.py  # Instruct variant
+ │   ├── train_reasoning_v2.py  # Reasoning variant
+ │   └── train_oculus_coco.py  # COCO training
+ │
+ ├── logs/  # Training logs
+ │   ├── training_instruct_v1.log
+ │   ├── training_reasoning_v2.log
+ │   └── training_v2_final.log
+ │
+ ├── checkpoints/  # Model checkpoints
+ │   ├── oculus/final/  # Base projector
+ │   │   ├── projector.npz  # Vision projector weights (~822MB)
+ │   │   └── config.json
+ │   │
+ │   ├── oculus_detection/final/  # Detection checkpoint
+ │   │   ├── projector.npz  # Projector weights (~800MB)
+ │   │   ├── heads.pth  # Detection heads (~35MB)
+ │   │   └── benchmark_results.json
+ │   │
+ │   ├── oculus_instruct_v1/  # Instruction-tuned VQA
+ │   │   └── vqa_model/
+ │   │       ├── model.safetensors  # BLIP VQA weights (~1.5GB)
+ │   │       ├── tokenizer.json
+ │   │       └── config.json
+ │   │
+ │   └── oculus_reasoning_v2/  # Reasoning VQA
+ │       └── vqa_model/
+ │           ├── model.safetensors  # BLIP VQA weights (~1.5GB)
+ │           ├── tokenizer.json
+ │           └── config.json
+ │
+ ├── docs/  # Documentation
+ │   ├── ARCHITECTURE.md
+ │   ├── BENCHMARK_README.md
+ │   └── TRAINING_ROADMAP.md
+ │
+ ├── oculus_inference.py  # Inference script
+ ├── demo_oculus.py  # Demo script
+ ├── benchmark_vlm.py  # Benchmarking
+ └── eval_benchmarks.py  # Evaluation
  ```
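
The tree lists the projector checkpoints as `.npz` archives. A minimal sketch for listing the arrays in such a file with NumPy follows. The helper name `summarize_npz` is hypothetical, the key names stored in the real `projector.npz` are not documented in this README, and the demo writes a small stand-in file rather than reading a real checkpoint.

```python
import os
import tempfile

import numpy as np


def summarize_npz(path):
    """Return {array_name: (shape, dtype)} for every array in an .npz file."""
    with np.load(path) as archive:
        return {name: (archive[name].shape, str(archive[name].dtype))
                for name in archive.files}


# Demo on a stand-in archive; the key names here are made up for illustration.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "projector.npz")
    np.savez(demo_path,
             w1=np.zeros((4, 8), dtype=np.float32),
             b1=np.zeros(8, dtype=np.float32))
    summary = summarize_npz(demo_path)

for name, (shape, dtype) in summary.items():
    print(name, shape, dtype)
```

For a downloaded checkpoint, the same helper could be pointed at `checkpoints/oculus/final/projector.npz` to see which arrays it holds before wiring them into a model.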
 
+ ## Training

+ ### Base Projector Training
+ ```bash
+ python training/train_oculus.py
  ```

+ ### Detection Head Training
+ ```bash
+ python training/train_detection.py
  ```

+ ### Instruction Tuning
  ```bash
+ python training/train_instruction_tuning.py
  ```

+ ### Reasoning Training
+ ```bash
+ python training/train_reasoning_v2.py
+ ```
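
The four training commands can be chained from one script. The sketch below builds the command list and, by default, only prints it. The stage order (projector, detection, instruction tuning, reasoning) is an assumption based on the order of the sections, and `build_commands` and `run_stages` are hypothetical helpers, not part of the repository.

```python
import subprocess
import sys
from pathlib import Path

# Script paths as listed in the Training section; the order is assumed.
STAGES = [
    "training/train_oculus.py",
    "training/train_detection.py",
    "training/train_instruction_tuning.py",
    "training/train_reasoning_v2.py",
]


def build_commands(repo_root="."):
    """Return one argv list per training stage, using the current interpreter."""
    root = Path(repo_root)
    return [[sys.executable, str(root / script)] for script in STAGES]


def run_stages(repo_root=".", dry_run=True):
    """Run each stage in order, stopping at the first failure."""
    for argv in build_commands(repo_root):
        print(" ".join(argv))
        if dry_run:
            continue
        if not Path(argv[1]).is_file():
            raise FileNotFoundError(argv[1])
        subprocess.run(argv, check=True)  # raises CalledProcessError on failure


commands = build_commands()
run_stages(dry_run=True)  # prints the commands without executing them
```

Calling `run_stages(dry_run=False)` from the repository root would execute the scripts one after another; whether later stages actually depend on earlier checkpoints is not stated in the README.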
 
+ ## Features

+ - **Visual Question Answering (VQA)** - Answer questions about images
+ - **Image Captioning** - Generate natural descriptions
+ - **Object Detection** - Detect with bounding boxes (80 COCO classes)
+ - **Object Counting** - Count objects via point prediction
+ - **Semantic Segmentation** - Pixel-level understanding (150 ADE20K classes)
+ - **Chain-of-Thought Reasoning** - Step-by-step thinking traces

+ ## License

+ **Oceanir Research License v1.0**

+ **Permitted:**
+ - Academic research
+ - Educational use
+ - Publishing papers with results
+ - Personal experimentation

+ **Not Permitted:**
+ - Commercial use
+ - Training commercial models
+ - Commercial products/services

+ For commercial licensing: licensing@oceanir.ai
 
+ ## Citation

+ ```bibtex
+ @software{oculus2026,
+   title={Oculus Vision-Language Model},
+   author={OceanirAI},
+   year={2026},
+   url={https://huggingface.co/OceanirAI/Oculus}
+ }
+ ```

+ ## Links

+ - [Oculus-0.1-Instruct](https://huggingface.co/OceanirAI/Oculus-0.1-Instruct)
+ - [Oculus-0.1-Reasoning](https://huggingface.co/OceanirAI/Oculus-0.1-Reasoning)
+ - [Oceanir SDK (PyPI)](https://pypi.org/project/oceanir/)
+ - [GitHub](https://github.com/OceanirAI/oceanir)