kobiakor15 committed on
Commit
f214cd0
·
verified ·
1 Parent(s): 8f0eac4

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +51 -32
README.md CHANGED
@@ -14,24 +14,47 @@ tags:
  - reasoning
  - chain-of-thought
  - instruction-following
  - oculus
  - standalone
 ---

- # Oculus 0.1 (Unified ~8GB)

- **Complete standalone vision-language model with both instruction-following and chain-of-thought reasoning.**

- Oculus 0.1 combines the best of both worlds:
  - **Instruct**: Natural instruction following, image captioning, VQA
  - **Reasoning**: Chain-of-thought thinking with `<think>...</think>` tokens
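The `<think>...</think>` convention named in the bullet above is all the card specifies; a minimal sketch of how a caller might split the thinking trace from the final answer (the raw response string, and the idea that the answer follows the closing tag, are assumptions for illustration, not part of this model card):

```python
import re

def split_thinking(response: str) -> tuple[str, str]:
    """Separate the chain-of-thought trace from the final answer.

    Assumes the model emits `<think>...</think>` before the answer,
    per the tag convention above; the exact format is an assumption.
    """
    match = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    if match is None:
        # No thinking block: treat the whole response as the answer.
        return "", response.strip()
    thinking = match.group(1).strip()
    answer = response[match.end():].strip()
    return thinking, answer

# Hypothetical raw output from model.ask(image, question, think=True):
raw = "<think>The sign is red and octagonal.</think>It is a stop sign."
thought, answer = split_thinking(raw)
```

If the model ever omits the tags, the helper degrades gracefully and returns the whole response as the answer.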
 
 
 
- This package includes ALL model weights bundled together:
- - DINOv3-Large vision encoder (~2.3GB)
- - SigLIP vision encoder (~1.1GB)
- - BLIP language models (~3GB)
- - Trained projector & heads (~835MB)
- - Unified VQA model (~1.5GB)

  ## Installation

@@ -64,33 +87,29 @@ caption = model.caption("image.jpg")
  results = model.detect("image.jpg")
  ```
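The `model.detect("image.jpg")` call above is shown without its return format; a sketch of post-filtering detections, assuming a hypothetical list-of-dicts layout (`label`/`score`/`box`) that this card does not document:

```python
def filter_detections(results, min_score=0.5):
    """Keep confident detections, sorted best-first.

    The {'label', 'score', 'box'} layout is a hypothetical format
    for illustration; the model card does not specify it.
    """
    kept = [r for r in results if r["score"] >= min_score]
    return sorted(kept, key=lambda r: r["score"], reverse=True)

# Hypothetical output of model.detect("image.jpg") over COCO classes:
results = [
    {"label": "dog", "score": 0.91, "box": [10, 20, 110, 220]},
    {"label": "cat", "score": 0.32, "box": [5, 5, 50, 60]},
]
top = filter_detections(results)
```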

- ## Capabilities

- | Task | Method | Description |
- |------|--------|-------------|
- | VQA | `model.ask(image, question)` | Answer questions about images |
- | Reasoning | `model.ask(image, question, think=True)` | Chain-of-thought reasoning |
- | Captioning | `model.caption(image)` | Generate image descriptions |
- | Detection | `model.detect(image)` | Object detection (80 COCO classes) |

- ## Model Structure

- ```
- Oculus-0.1/
- ├── config.json
- ├── vision_encoders/
- │   ├── dinov3-large/        # DINOv3 ViT-L (~2.3GB)
- │   └── siglip-base/         # SigLIP (~1.1GB)
- ├── language_model/
- │   ├── blip-captioning/     # BLIP captioning
- │   └── blip-vqa-finetuned/  # Unified VQA (~1.5GB)
- ├── trained_components/
- │   ├── projector.npz        # Vision projector (~800MB)
- │   └── heads.pth            # Detection heads (~35MB)
- └── oculus_unified_model/    # Model code
- ```

- ## Total Size: ~8GB

  ## License

 
  - reasoning
  - chain-of-thought
  - instruction-following
+ - segmentation
+ - detection
+ - ocr
+ - dinov3
+ - siglip2
+ - lfm2.5
  - oculus
  - standalone
+ base_model:
+ - facebook/dinov3-vith16plus-pretrain-lvd1689m
+ - google/siglip2-so400m-patch16-naflex
+ - LiquidAI/LFM2.5-1.2B-Base
 ---

+ # Oculus 0.1 (~4.5B params)

+ **Multimodal vision-language model combining DINOv3, SigLIP2, and LFM2.5.**
+
+ Oculus 0.1 combines:
+ - **DINOv3 ViT-H/16+**: Universal vision backbone (~1.7B params)
+ - **SigLIP2 SO400M**: Vision-language understanding (~400M params)
+ - **LFM2.5-1.2B**: Liquid AI's language model (~1.2B params)
+
+ ## Capabilities

  - **Instruct**: Natural instruction following, image captioning, VQA
  - **Reasoning**: Chain-of-thought thinking with `<think>...</think>` tokens
+ - **Segmentation**: Pixel-level class prediction
+ - **Detection**: Object detection (80 COCO classes)
+ - **OCR**: Text detection and recognition

+ ## Architecture
+
+ ```
+ Image (224x224) --> DINOv3 ViT-H/16+ --\
+                                         +--> Concat --> Projector --> LFM2.5-1.2B --> Text
+ Image (384x384) --> SigLIP2 SO400M ----/                     |
+                                                              +--> Segmentation Head
+                                                              +--> Detection Head
+                                                              +--> OCR Head
+ ```
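The diagram's dataflow (concatenate the two encoders' features, project, then feed LFM2.5) can be sketched numerically. Every shape below is an illustrative assumption: both encoders are taken to emit the same 196-token grid so channel-wise concatenation works, and the hidden sizes are placeholders, not the real models' dimensions; only the concat-then-project ordering comes from the diagram:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder patch-token outputs for each encoder (shapes assumed).
dino_tokens = rng.normal(size=(196, 1280))    # stand-in for DINOv3 features
siglip_tokens = rng.normal(size=(196, 1152))  # stand-in for SigLIP2 features

# Concatenate along the feature dimension, as in the diagram's "Concat".
fused = np.concatenate([dino_tokens, siglip_tokens], axis=-1)  # (196, 2432)

# 2-layer MLP projector (per the components table) into an assumed LLM width.
llm_dim = 2048
w1 = rng.normal(scale=0.02, size=(fused.shape[-1], 4096))
w2 = rng.normal(scale=0.02, size=(4096, llm_dim))

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

# Visual embeddings ready to prepend to the language model's input sequence.
visual_embeds = gelu(fused @ w1) @ w2  # (196, llm_dim)
```

The task heads in the diagram would branch off the projected features in the same way, each with its own small output layer.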

  ## Installation

 
  results = model.detect("image.jpg")
  ```

+ ## Model Components

+ | Component | Model | Parameters |
+ |-----------|-------|------------|
+ | Vision Encoder 1 | DINOv3 ViT-H/16+ | ~1.7B |
+ | Vision Encoder 2 | SigLIP2 SO400M | ~400M |
+ | Projector | 2-layer MLP | ~5M |
+ | Language Model | LFM2.5-1.2B (Liquid AI) | ~1.2B |
+ | Task Heads | Seg/Det/OCR | ~1.5M |
+ | **Total** | | **~4.5B** |

+ ## Why LFM2.5?

+ - 3x faster training than Qwen3 on CPU
+ - 2x faster inference on CPU
+ - Native MLX support
+ - Optimized for edge devices
+
+ ## Model Sources

+ - DINOv3: [facebook/dinov3-vith16plus-pretrain-lvd1689m](https://huggingface.co/facebook/dinov3-vith16plus-pretrain-lvd1689m)
+ - SigLIP2: [google/siglip2-so400m-patch16-naflex](https://huggingface.co/google/siglip2-so400m-patch16-naflex)
+ - LFM2.5: [LiquidAI/LFM2.5-1.2B-Base](https://huggingface.co/LiquidAI/LFM2.5-1.2B-Base)

  ## License