kobiakor15 committed on
Commit
f214cd0
·
verified ·
1 Parent(s): 8f0eac4

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +51 -32
README.md CHANGED
@@ -14,24 +14,47 @@ tags:
  - reasoning
  - chain-of-thought
  - instruction-following
  - oculus
  - standalone
 ---

- # Oculus 0.1 (Unified ~8GB)

- **Complete standalone vision-language model with both instruction-following and chain-of-thought reasoning.**

- Oculus 0.1 combines the best of both worlds:
  - **Instruct**: Natural instruction following, image captioning, VQA
  - **Reasoning**: Chain-of-thought thinking with `<think>...</think>` tokens
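The `<think>...</think>` convention named in the bullet above is all the card specifies; a minimal sketch of how a caller might split the thinking trace from the final answer (the raw response string, and the idea that the answer follows the closing tag, are assumptions for illustration, not part of this model card):

```python
import re

def split_thinking(response: str) -> tuple[str, str]:
    """Separate the chain-of-thought trace from the final answer.

    Assumes the model emits `<think>...</think>` before the answer,
    per the tag convention above; the exact format is an assumption.
    """
    match = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    if match is None:
        # No thinking block: treat the whole response as the answer.
        return "", response.strip()
    thinking = match.group(1).strip()
    answer = response[match.end():].strip()
    return thinking, answer

# Hypothetical raw output from model.ask(image, question, think=True):
raw = "<think>The sign is red and octagonal.</think>It is a stop sign."
thought, answer = split_thinking(raw)
```

If the model ever omits the tags, the helper degrades gracefully and returns the whole response as the answer.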
 
 
 
- This package includes ALL model weights bundled together:
- - DINOv3-Large vision encoder (~2.3GB)
- - SigLIP vision encoder (~1.1GB)
- - BLIP language models (~3GB)
- - Trained projector & heads (~835MB)
- - Unified VQA model (~1.5GB)

  ## Installation

@@ -64,33 +87,29 @@ caption = model.caption("image.jpg")
  results = model.detect("image.jpg")
  ```
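The `model.detect("image.jpg")` call above is shown without its return format; a sketch of post-filtering detections, assuming a hypothetical list-of-dicts layout (`label`/`score`/`box`) that this card does not document:

```python
def filter_detections(results, min_score=0.5):
    """Keep confident detections, sorted best-first.

    The {'label', 'score', 'box'} layout is a hypothetical format
    for illustration; the model card does not specify it.
    """
    kept = [r for r in results if r["score"] >= min_score]
    return sorted(kept, key=lambda r: r["score"], reverse=True)

# Hypothetical output of model.detect("image.jpg") over COCO classes:
results = [
    {"label": "dog", "score": 0.91, "box": [10, 20, 110, 220]},
    {"label": "cat", "score": 0.32, "box": [5, 5, 50, 60]},
]
top = filter_detections(results)
```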

- ## Capabilities

- | Task | Method | Description |
- |------|--------|-------------|
- | VQA | `model.ask(image, question)` | Answer questions about images |
- | Reasoning | `model.ask(image, question, think=True)` | Chain-of-thought reasoning |
- | Captioning | `model.caption(image)` | Generate image descriptions |
- | Detection | `model.detect(image)` | Object detection (80 COCO classes) |

- ## Model Structure

- ```
- Oculus-0.1/
- ├── config.json
- ├── vision_encoders/
- │   ├── dinov3-large/        # DINOv3 ViT-L (~2.3GB)
- │   └── siglip-base/         # SigLIP (~1.1GB)
- ├── language_model/
- │   ├── blip-captioning/     # BLIP captioning
- │   └── blip-vqa-finetuned/  # Unified VQA (~1.5GB)
- ├── trained_components/
- │   ├── projector.npz        # Vision projector (~800MB)
- │   └── heads.pth            # Detection heads (~35MB)
- └── oculus_unified_model/    # Model code
- ```

- ## Total Size: ~8GB

  ## License

 
  - reasoning
  - chain-of-thought
  - instruction-following
+ - segmentation
+ - detection
+ - ocr
+ - dinov3
+ - siglip2
+ - lfm2.5
  - oculus
  - standalone
+ base_model:
+ - facebook/dinov3-vith16plus-pretrain-lvd1689m
+ - google/siglip2-so400m-patch16-naflex
+ - LiquidAI/LFM2.5-1.2B-Base
 ---

+ # Oculus 0.1 (~4.5B params)

+ **Multimodal vision-language model combining DINOv3, SigLIP2, and LFM2.5.**
+
+ Oculus 0.1 combines:
+ - **DINOv3 ViT-H/16+**: Universal vision backbone (~1.7B params)
+ - **SigLIP2 SO400M**: Vision-language understanding (~400M params)
+ - **LFM2.5-1.2B**: Liquid AI's language model (~1.2B params)
+
+ ## Capabilities

  - **Instruct**: Natural instruction following, image captioning, VQA
  - **Reasoning**: Chain-of-thought thinking with `<think>...</think>` tokens
+ - **Segmentation**: Pixel-level class prediction
+ - **Detection**: Object detection (80 COCO classes)
+ - **OCR**: Text detection and recognition

+ ## Architecture
+
+ ```
+ Image (224x224) --> DINOv3 ViT-H/16+ --\
+                                         +--> Concat --> Projector --> LFM2.5-1.2B --> Text
+ Image (384x384) --> SigLIP2 SO400M ----/                     |
+                                                              +--> Segmentation Head
+                                                              +--> Detection Head
+                                                              +--> OCR Head
+ ```
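The diagram's dataflow (concatenate the two encoders' features, project, then feed LFM2.5) can be sketched numerically. Every shape below is an illustrative assumption: both encoders are taken to emit the same 196-token grid so channel-wise concatenation works, and the hidden sizes are placeholders, not the real models' dimensions; only the concat-then-project ordering comes from the diagram:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder patch-token outputs for each encoder (shapes assumed).
dino_tokens = rng.normal(size=(196, 1280))    # stand-in for DINOv3 features
siglip_tokens = rng.normal(size=(196, 1152))  # stand-in for SigLIP2 features

# Concatenate along the feature dimension, as in the diagram's "Concat".
fused = np.concatenate([dino_tokens, siglip_tokens], axis=-1)  # (196, 2432)

# 2-layer MLP projector (per the components table) into an assumed LLM width.
llm_dim = 2048
w1 = rng.normal(scale=0.02, size=(fused.shape[-1], 4096))
w2 = rng.normal(scale=0.02, size=(4096, llm_dim))

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

# Visual embeddings ready to prepend to the language model's input sequence.
visual_embeds = gelu(fused @ w1) @ w2  # (196, llm_dim)
```

The task heads in the diagram would branch off the projected features in the same way, each with its own small output layer.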

  ## Installation

 
  results = model.detect("image.jpg")
  ```

+ ## Model Components

+ | Component | Model | Parameters |
+ |-----------|-------|------------|
+ | Vision Encoder 1 | DINOv3 ViT-H/16+ | ~1.7B |
+ | Vision Encoder 2 | SigLIP2 SO400M | ~400M |
+ | Projector | 2-layer MLP | ~5M |
+ | Language Model | LFM2.5-1.2B (Liquid AI) | ~1.2B |
+ | Task Heads | Seg/Det/OCR | ~1.5M |
+ | **Total** | | **~4.5B** |

+ ## Why LFM2.5?

+ - 3x faster training than Qwen3 on CPU
+ - 2x faster inference on CPU
+ - Native MLX support
+ - Optimized for edge devices
+
+ ## Model Sources

+ - DINOv3: [facebook/dinov3-vith16plus-pretrain-lvd1689m](https://huggingface.co/facebook/dinov3-vith16plus-pretrain-lvd1689m)
+ - SigLIP2: [google/siglip2-so400m-patch16-naflex](https://huggingface.co/google/siglip2-so400m-patch16-naflex)
+ - LFM2.5: [LiquidAI/LFM2.5-1.2B-Base](https://huggingface.co/LiquidAI/LFM2.5-1.2B-Base)

  ## License