Commit d933e76 · verified · committed by kobiakor15
1 Parent(s): 4145f82

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +251 -0
README.md ADDED
@@ -0,0 +1,251 @@
---
license: cc-by-nc-4.0
language:
- en
pipeline_tag: image-text-to-text
tags:
- vision
- multimodal
- vision-language
- segmentation
- detection
- ocr
- dinov3
- siglip2
- lfm2.5
base_model:
- facebook/dinov3-vith16plus-pretrain-lvd1689m
- google/siglip2-so400m-patch16-naflex
- LiquidAI/LFM2.5-1.2B-Base
---

# Oculus 0.1

A multimodal vision-language model combining DINOv3, SigLIP2, and LFM2.5.

## What is this?

Oculus is a universal vision-language model for:
- **Image Captioning**: Generate natural language descriptions
- **Visual Question Answering**: Answer questions about images
- **Semantic Segmentation**: Pixel-level class prediction
- **Image Classification**: Global image classification
- **Object Detection**: Bounding box prediction
- **OCR**: Text detection and recognition

## Model Architecture

```
Image (224×224) ──→ DINOv3 ViT-H/16+ ──┐
                                       ├──→ Concatenate ──→ Projector ──→ LFM2.5-1.2B
Image (384×384) ──→ SigLIP2 SO400M ────┘                                       │
                                                                               ├──→ Text Output (Caption/VQA)
                                                                               Segmentation Head ──→ Segmentation Map
                                                                               Classification Head ──→ Class Label
                                                                               Detection Head ──→ Boxes + Classes
                                                                               OCR Head ──→ Text + Geometry
```

## Components

| Component | Model | Parameters | Input | Output |
|-----------|-------|------------|-------|--------|
| Vision Encoder 1 | DINOv3 ViT-H/16+ | 1.7B | 224×224 | 256×1280 |
| Vision Encoder 2 | SigLIP2 SO400M | 400M | 384×384 | 576×1152 |
| Fusion | Concatenation | - | 2432D | 2432D |
| Projector | 2-layer MLP | ~5M | 2432D | 1536D |
| Language Model | LFM2.5-1.2B | 1.2B | 1536D | Text |
| Segmentation Head | MLP | ~0.5M | 2432D | 14×14×150 |
| Classification Head | MLP | ~0.3M | 2432D | 1000 |
| Detection Head | MLP | ~0.5M | 2432D | Boxes + Classes |
| OCR Head | CNN + MLP | ~0.3M | 2432D | Text + Geometry |

**Total: ~4.5B parameters**

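
Several shapes in the table follow from the patch size and the encoder widths. The sketch below reproduces that arithmetic in plain Python. The 16-pixel patch size is read off the `vith16plus` and `patch16` checkpoint names, and `patch_grid` is a hypothetical helper, not part of the `oculus` package.

```python
def patch_grid(image_size: int, patch_size: int = 16) -> int:
    """Return the number of patches along one side of a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    return image_size // patch_size


# Encoder widths taken from the Components table.
DINOV3_DIM = 1280
SIGLIP2_DIM = 1152

# Concatenating the two feature vectors gives the fused width.
fused_dim = DINOV3_DIM + SIGLIP2_DIM

# A 384x384 image with 16x16 patches yields a 24x24 grid of tokens.
siglip2_tokens = patch_grid(384) ** 2

# A 224x224 image yields the 14x14 grid used by the dense heads.
dense_grid = patch_grid(224)

print(fused_dim, siglip2_tokens, dense_grid)  # 2432 576 14
```
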
## Usage

### Basic Language Generation

```python
from oculus import create_oculus_model
import mlx.core as mx

model = create_oculus_model(num_classes=150)

# Random tensors stand in for preprocessed images, one per encoder resolution.
dinov3_image = mx.random.normal((1, 3, 224, 224))
siglip2_image = mx.random.normal((1, 3, 384, 384))
prompt = mx.array([[1, 2, 3, 4, 5]])  # Placeholder token IDs for a tokenized prompt

generated = model.generate(
    input_ids=prompt,
    x_dinov3=dinov3_image,
    x_siglip2=siglip2_image,
    max_new_tokens=512,
    temperature=0.7,
)
print(f"Generated: {generated.tolist()}")
```

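
The `temperature` argument is not explained in this README. Assuming `model.generate` applies standard temperature scaling to the logits before sampling (an assumption, not confirmed by the source), its effect can be sketched in plain Python. `softmax_with_temperature` is a hypothetical name used only for illustration.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum for numerical stability.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
neutral = softmax_with_temperature(logits, temperature=1.0)
sharper = softmax_with_temperature(logits, temperature=0.7)

# A temperature below 1 concentrates probability on the largest logit.
print(sharper[0] > neutral[0])  # True
```

Under this assumption, values below 1.0 such as the 0.7 used above make generation more deterministic, while values above 1.0 make it more varied.
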
### Visual Question Answering

```python
from oculus import create_oculus_model
import mlx.core as mx

model = create_oculus_model()

# Random tensors stand in for preprocessed images.
dinov3_image = mx.random.normal((1, 3, 224, 224))
siglip2_image = mx.random.normal((1, 3, 384, 384))

# Placeholder token IDs standing in for the tokenized question "What is in the image?"
question = mx.array([[1, 2, 3, 4, 5, 6, 7, 8]])

answer = model.generate(
    input_ids=question,
    x_dinov3=dinov3_image,
    x_siglip2=siglip2_image,
    max_new_tokens=100,
)
print(f"Answer: {answer.tolist()}")
```

### Semantic Segmentation

```python
from oculus import create_oculus_model
import mlx.core as mx

model = create_oculus_model(num_classes=150)  # ADE20K

# Random tensors stand in for preprocessed images.
dinov3_image = mx.random.normal((1, 3, 224, 224))
siglip2_image = mx.random.normal((1, 3, 384, 384))

predictions = model.segment(dinov3_image, siglip2_image)
print(f"Segmentation shape: {predictions.shape}")  # (1, 14, 14)
```

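
The segmentation output is a coarse 14×14 map, one label per 16×16 patch. To overlay it on the 224×224 input it has to be upsampled. The plain-Python sketch below uses nearest-neighbour repetition on nested lists in place of an MLX array. `upsample_nearest` is a hypothetical helper, not part of the `oculus` package.

```python
def upsample_nearest(label_map: list[list[int]], scale: int) -> list[list[int]]:
    """Repeat every label `scale` times along both axes."""
    upsampled = []
    for row in label_map:
        wide_row = [label for label in row for _ in range(scale)]
        # Copy the widened row so the output rows are independent lists.
        upsampled.extend(list(wide_row) for _ in range(scale))
    return upsampled


# A dummy 14x14 map whose label equals the row index.
coarse = [[r for _ in range(14)] for r in range(14)]
full = upsample_nearest(coarse, scale=16)

print(len(full), len(full[0]))  # 224 224
```
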
### Image Classification

```python
from oculus import create_oculus_model
import mlx.core as mx

model = create_oculus_model(num_classes=1000)

# A batch of four random tensors stands in for preprocessed images.
dinov3_image = mx.random.normal((4, 3, 224, 224))
siglip2_image = mx.random.normal((4, 3, 384, 384))

class_id = model.classify(dinov3_image, siglip2_image)
print(f"Predicted classes: {class_id.tolist()}")
```

### Object Detection

```python
from oculus import create_oculus_model
import mlx.core as mx

model = create_oculus_model(num_classes=80)  # COCO

# Random tensors stand in for preprocessed images.
dinov3_image = mx.random.normal((1, 3, 224, 224))
siglip2_image = mx.random.normal((1, 3, 384, 384))

cls_logits, bbox_preds = model.detect(dinov3_image, siglip2_image)
print(f"Class logits: {cls_logits.shape}")  # (1, 196, 9, 80)
print(f"Box predictions: {bbox_preds.shape}")  # (1, 196, 9, 4)
```

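
The second axis of both outputs has 196 entries, which matches the 14×14 patch grid, and the third axis suggests 9 predictions per cell. Assuming the cells are stored in row-major order (an assumption, since the source does not state the layout), a flat cell index maps back to pixel coordinates as sketched below. `cell_center` is a hypothetical helper.

```python
def cell_center(index: int, grid: int = 14, patch: int = 16) -> tuple[float, float]:
    """Return the (x, y) pixel center of a grid cell, assuming row-major order."""
    if not 0 <= index < grid * grid:
        raise IndexError("cell index out of range")
    row, col = divmod(index, grid)
    return (col + 0.5) * patch, (row + 0.5) * patch


print(cell_center(0))    # (8.0, 8.0)
print(cell_center(195))  # (216.0, 216.0)
```
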
### OCR

```python
from oculus import create_oculus_model
import mlx.core as mx

model = create_oculus_model()

# Random tensors stand in for preprocessed images.
dinov3_image = mx.random.normal((1, 3, 224, 224))
siglip2_image = mx.random.normal((1, 3, 384, 384))

text_logits, geo_preds = model.ocr(dinov3_image, siglip2_image)
print(f"Text logits: {text_logits.shape}")  # (14, 14, max_seq_len)
print(f"Geometry: {geo_preds.shape}")  # (196, 4)
```

## Loading Pretrained Weights

```python
import os
from oculus import (
    create_oculus_model,
    load_dinov3_from_hf,
    load_siglip2_from_hf,
    load_lfm2_from_hf,
)

model = create_oculus_model(num_classes=150)

# Read the Hugging Face access token from the environment.
token = os.getenv("HF_TOKEN")

load_dinov3_from_hf(
    model.dinov3_encoder,
    repo_id="facebook/dinov3-vith16plus-pretrain-lvd1689m",
    token=token,
)

load_siglip2_from_hf(
    model.siglip2_encoder,
    repo_id="google/siglip2-so400m-patch16-naflex",
    token=token,
)

load_lfm2_from_hf(
    model.language_model,
    repo_id="LiquidAI/LFM2.5-1.2B-Base",
    token=token,
)
```

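
`os.getenv` returns `None` when `HF_TOKEN` is unset, so a missing token would only surface once a download is attempted. A small guard can report the problem earlier. This is a sketch: `require_hf_token` is a hypothetical helper, and the source does not state which of the base repositories actually need a token.

```python
import os


def require_hf_token(var_name: str = "HF_TOKEN") -> str:
    """Return the token from the environment or raise a clear error."""
    token = os.getenv(var_name)
    if not token:
        raise RuntimeError(
            f"{var_name} is not set; export it before loading gated weights."
        )
    return token


# Demonstration with a throwaway variable so a real token is left untouched.
os.environ["DEMO_HF_TOKEN"] = "hf_example"
print(require_hf_token("DEMO_HF_TOKEN"))  # hf_example
```
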
## Running Examples

```bash
cd Oculus/src/models
python oculus_example.py
```

## Performance

| Task | Dataset | Metric | Expected |
|------|---------|--------|----------|
| Image Classification | ImageNet | Top-1 | ~75% |
| Semantic Segmentation | ADE20K | mIoU | ~45% |
| Object Detection | COCO | mAP | ~45% |
| VQA | VQA2.0 | Accuracy | ~65% |

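
For reference, mIoU averages the per-class intersection-over-union. The sketch below computes it for flat label lists in plain Python. It is a generic illustration of the metric, not the evaluation script behind the table, and it skips classes that appear in neither the predictions nor the ground truth, which is one common convention.

```python
def mean_iou(pred: list[int], target: list[int], num_classes: int) -> float:
    """Mean intersection-over-union over classes present in pred or target."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious = []
    for cls in range(num_classes):
        intersection = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
        union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
        if union > 0:  # Skip classes absent from both inputs.
            ious.append(intersection / union)
    return sum(ious) / len(ious) if ious else 0.0


pred = [0, 0, 1, 1]
target = [0, 1, 1, 1]
# Class 0 scores 1/2 and class 1 scores 2/3, so the mean is 7/12.
print(mean_iou(pred, target, num_classes=2))
```
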
## Memory Requirements

| Mode | Memory |
|------|--------|
| Inference | ~10 GB |
| Training (frozen encoders) | ~12 GB |
| Training (full) | ~30 GB |

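
As a rough cross-check of the inference figure, weight memory is approximately the parameter count times the bytes per parameter. Assuming 16-bit weights (the source does not state the precision), ~4.5B parameters occupy about 9 GB before activations and caches are counted. `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate weight memory in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# ~4.5B parameters stored as 16-bit floats.
print(weight_memory_gb(4.5e9))  # 9.0
```
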
## Requirements

```bash
pip install mlx
pip install huggingface_hub  # for pretrained weights
```

## Model Sources

- DINOv3: [facebook/dinov3-vith16plus-pretrain-lvd1689m](https://huggingface.co/facebook/dinov3-vith16plus-pretrain-lvd1689m)
- SigLIP2: [google/siglip2-so400m-patch16-naflex](https://huggingface.co/google/siglip2-so400m-patch16-naflex)
- LFM2.5: [LiquidAI/LFM2.5-1.2B-Base](https://huggingface.co/LiquidAI/LFM2.5-1.2B-Base)

## License

CC-BY-NC-4.0

## Contact

- Organization: OceanirAI
- GitHub: github.com/Oceanir