Update README.md
README.md
CHANGED
|
@@ -1,12 +1,13 @@
|
|
| 1 |
---
|
| 2 |
title: README
|
| 3 |
-
emoji:
|
| 4 |
-
colorFrom:
|
| 5 |
-
colorTo:
|
| 6 |
sdk: static
|
| 7 |
pinned: false
|
| 8 |
license: apache-2.0
|
| 9 |
---
|
|
|
|
| 10 |
<div align="center">
|
| 11 |
|
| 12 |
<br>
|
|
@@ -15,101 +16,191 @@ license: apache-2.0
|
|
| 15 |
|
| 16 |
<br><br>
|
| 17 |
|
| 18 |
# Inject Vision Into Any Language Model.
|
| 19 |
|
| 20 |
**Open-source framework for adding multimodal vision capabilities to any HuggingFace LLM.**<br>
|
| 21 |
-
**
|
| 22 |
|
|
|
|
| 23 |
|
| 24 |
-
---
|
| 25 |
|
| 26 |
<br>
|
| 27 |
|
|
|
|
|
|
|
| 28 |
</div>
|
| 29 |
|
| 30 |
## What is OpenLLaVA?
|
| 31 |
|
| 32 |
-
**OpenLLaVA** is
|
| 33 |
|
| 34 |
-
The framework
|
| 35 |
|
| 36 |
The central design goal: **when a new language model drops, you should have a vision version in 48 hours.**
|
| 37 |
|
| 38 |
-
|
| 39 |
-
|
| 40 |
-
---
|
| 41 |
|
| 42 |
<br>
|
| 43 |
|
| 44 |
-
|
| 45 |
|
| 46 |
## Quickstart
|
| 47 |
|
| 48 |
-
</div>
|
| 49 |
-
|
| 50 |
-
<br>
|
| 51 |
-
|
| 52 |
```bash
|
| 53 |
-
pip install openllava
|
|
|
|
|
|
|
|
|
|
| 54 |
```
|
| 55 |
|
| 56 |
-
```python
|
| 57 |
-
from openllava import patch_model
|
| 58 |
-
from transformers import AutoModelForCausalLM, AutoTokenizer
|
| 59 |
|
| 60 |
-
|
| 61 |
-
|
| 62 |
-
tokenizer = AutoTokenizer.from_pretrained("your-org/your-llm")
|
| 63 |
|
| 64 |
-
model =
|
| 65 |
-
|
| 66 |
vision_encoder="google/siglip2-so400m-patch14-384",
|
| 67 |
-
|
| 68 |
)
|
| 69 |
```
|
| 70 |
|
| 71 |
That's it. OpenLLaVA auto-detects hidden dimensions, builds the projector, and patches the tokenizer. No boilerplate. No config files.
|
| 72 |
|
| 73 |
-
|
| 74 |
|
| 75 |
-
|
|
|
|
| 76 |
|
| 77 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 78 |
|
| 79 |
-
|
|
|
|
| 80 |
|
| 81 |
-
##
|
| 82 |
|
| 83 |
-
|
|
|
|
| 84 |
|
| 85 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 86 |
|
| 87 |
-
|
| 88 |
-
|
| 89 |
-
<td width="50%" valign="top">
|
| 90 |
|
| 91 |
-
|
| 92 |
|
| 93 |
-
|
| 94 |
|
| 95 |
<br>
|
| 96 |
|
| 97 |
-
|
| 98 |
|
| 99 |
-
|
| 100 |
|
| 101 |
-
<
|
|
|
|
| 102 |
<td width="50%" valign="top">
|
| 103 |
|
| 104 |
-
**Model
|
| 105 |
-
|
| 106 |
-
|
| 107 |
|
| 108 |
-
<
|
| 109 |
-
|
| 110 |
-
**Training Engine**
|
| 111 |
|
| 112 |
-
|
| 113 |
|
| 114 |
</td>
|
| 115 |
</tr>
|
|
@@ -119,94 +210,135 @@ Two-phase training built in. Phase 1: projector warmup with frozen LLM. Phase 2:
|
|
| 119 |
|
| 120 |
---
|
| 121 |
|
| 122 |
-
|
| 123 |
|
| 124 |
-
|
| 125 |
|
| 126 |
-
|
| 127 |
|
| 128 |
-
|
| 129 |
|
| 130 |
-
|
| 131 |
|
| 132 |
| Layer | Technology | Purpose |
|
| 133 |
|:------|:----------:|:--------|
|
| 134 |
-
| CUDA Kernels | C/CUDA | Fused projector ops,
|
| 135 |
-
| Core | C++ | Memory management, tensor routing |
|
| 136 |
| Bindings | pybind11 | C++ ↔ Python bridge |
|
| 137 |
-
|
|
| 138 |
-
|
|
|
|
|
|
|
|
| 139 |
|
| 140 |
<br>
|
| 141 |
|
| 142 |
---
|
| 143 |
|
| 144 |
-
|
| 145 |
|
| 146 |
-
|
| 147 |
|
| 148 |
-
|
| 149 |
|
| 150 |
-
|
| 151 |
|
| 152 |
-
|
|
|
|
|
|
|
| 153 |
|
| 154 |
```python
|
| 155 |
-
|
| 156 |
|
| 157 |
-
|
| 158 |
-
|
|
|
|
|
|
|
| 159 |
vision_encoder="google/siglip2-so400m-patch14-384",
|
| 160 |
-
pretrain_dataset="liuhaotian/LLaVA-Pretrain", # Phase 1
|
| 161 |
-
instruct_dataset="liuhaotian/LLaVA-Instruct-150K", # Phase 2
|
| 162 |
-
lora_r=64,
|
| 163 |
-
lora_alpha=128,
|
| 164 |
-
lora_target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
|
| 165 |
)
|
| 166 |
-
|
| 167 |
-
trainer.train() # Handles both phases automatically
|
| 168 |
```
|
| 169 |
|
| 170 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 171 |
|
| 172 |
<br>
|
| 173 |
|
| 174 |
---
|
| 175 |
|
| 176 |
-
<br>
|
| 177 |
-
|
| 178 |
-
<div align="center">
|
| 179 |
-
|
| 180 |
## OpceanAI Vision Models
|
| 181 |
|
| 182 |
-
|
| 183 |
-
|
| 184 |
-
<br>
|
| 185 |
-
|
| 186 |
-
OpceanAI uses OpenLLaVA to publish vision versions of new language models as they release. These are the models built with the framework:
|
| 187 |
-
|
| 188 |
-
<br>
|
| 189 |
|
| 190 |
<table>
|
| 191 |
<tr>
|
| 192 |
-
<td width="
|
| 193 |
|
| 194 |
-
**Yaki
|
| 195 |
|
| 196 |
-
Vision-language model
|
| 197 |
|
| 198 |
-
|
|
|
|
|
|
|
| 199 |
|
| 200 |
-
[ | In development |
|
| 291 |
-
| Two-phase training engine | In development |
|
| 292 |
-
| Fused CUDA projector kernel | Planned |
|
| 293 |
-
| C++ memory core | Planned |
|
| 294 |
-
| GGUF vision export | Planned |
|
| 295 |
-
| Multi-encoder support (BRAVE-style) | Planned |
|
| 296 |
-
|
| 297 |
-
</td>
|
| 298 |
-
<td width="50%" valign="top">
|
| 299 |
-
|
| 300 |
-
**Vision Models**
|
| 301 |
-
|
| 302 |
-
| Model | Status |
|
| 303 |
-
|:------|:------:|
|
| 304 |
-
| Yuuki NxG VL | Released |
|
| 305 |
-
| Yaki YuuKi+ Vision (8B) | In development |
|
| 306 |
-
| Community model pipeline | Planned |
|
| 307 |
-
|
| 308 |
-
</td>
|
| 309 |
-
</tr>
|
| 310 |
-
</table>
|
| 311 |
|
| 312 |
<br>
|
| 313 |
|
| 314 |
---
|
| 315 |
|
| 316 |
-
<br>
|
| 317 |
-
|
| 318 |
-
<div align="center">
|
| 319 |
-
|
| 320 |
-
## Contributing
|
| 321 |
-
|
| 322 |
-
</div>
|
| 323 |
-
|
| 324 |
-
<br>
|
| 325 |
-
|
| 326 |
-
OpenLLaVA is built to be extended. If you patch a model family that isn't supported yet, the contribution belongs in the framework. If you find a faster kernel implementation, open a PR.
|
| 327 |
-
|
| 328 |
-
The project is maintained by OpceanAI but owned by the community.
|
| 329 |
-
|
| 330 |
-
```bash
|
| 331 |
-
git clone https://github.com/OpceanAI/openllava
|
| 332 |
-
cd openllava
|
| 333 |
-
pip install -e ".[dev]"
|
| 334 |
-
```
|
| 335 |
-
|
| 336 |
-
<br>
|
| 337 |
-
|
| 338 |
-
---
|
| 339 |
-
|
| 340 |
-
<br>
|
| 341 |
-
|
| 342 |
<div align="center">
|
| 343 |
|
| 344 |
## Built by OpceanAI
|
| 345 |
|
| 346 |
-
|
| 347 |
-
|
| 348 |
-
<br>
|
| 349 |
-
|
| 350 |
-
<div align="center">
|
| 351 |
-
|
| 352 |
-
OpenLLaVA is the vision infrastructure layer of [OpceanAI](https://huggingface.co/OpceanAI), an independent AI research organization operating with no institutional funding, no cloud compute budget, and no team. Every model in the OpceanAI vision pipeline is trained on Google Colab Pro and validated on consumer hardware.
|
| 353 |
|
| 354 |
<br>
|
| 355 |
|
| 356 |
[](https://huggingface.co/OpceanAI)
|
| 357 |
|
| 358 |
-
[](https://github.com/sponsors/aguitauwu)
|
| 361 |
|
|
@@ -363,16 +413,10 @@ OpenLLaVA is the vision infrastructure layer of [OpceanAI](https://huggingface.c
|
|
| 363 |
|
| 364 |
---
|
| 365 |
|
| 366 |
-
<br>
|
| 367 |
-
|
| 368 |
**Open framework. Open models. Zero budget. Measurable results.**
|
| 369 |
|
| 370 |
-
|
| 371 |
-
|
| 372 |
-
[](https://github.com/OpceanAI/openllava)
|
| 373 |
|
| 374 |
-
|
| 375 |
|
| 376 |
-
|
| 377 |
-
|
| 378 |
-
</div>
|
|
|
|
| 1 |
---
|
| 2 |
title: README
|
| 3 |
+
emoji: ποΈ
|
| 4 |
+
colorFrom: purple
|
| 5 |
+
colorTo: indigo
|
| 6 |
sdk: static
|
| 7 |
pinned: false
|
| 8 |
license: apache-2.0
|
| 9 |
---
|
| 10 |
+
|
| 11 |
<div align="center">
|
| 12 |
|
| 13 |
<br>
|
|
|
|
| 16 |
|
| 17 |
<br><br>
|
| 18 |
|
| 19 |
+
<img src="https://img.shields.io/badge/OpenLLaVA-v3.0.0-7C4DFF?style=for-the-badge&labelColor=0A0A0A" alt="v3.0.0">
|
| 20 |
+
|
| 21 |
+
<img src="https://img.shields.io/badge/License-Apache--2.0-7C4DFF?style=for-the-badge&labelColor=0A0A0A" alt="License">
|
| 22 |
+
|
| 23 |
+
<img src="https://img.shields.io/badge/Python-3.10+-3776AB?style=for-the-badge&labelColor=0A0A0A&logo=python&logoColor=3776AB" alt="Python">
|
| 24 |
+
|
| 25 |
+
<img src="https://img.shields.io/badge/PyTorch-2.3+-EE4C2C?style=for-the-badge&labelColor=0A0A0A&logo=pytorch&logoColor=EE4C2C" alt="PyTorch">
|
| 26 |
+
|
| 27 |
+
<br><br>
|
| 28 |
+
|
| 29 |
+
<img src="https://img.shields.io/badge/CUDA-8.0%2B-76B900?style=for-the-badge&labelColor=0A0A0A&logo=nvidia&logoColor=76B900" alt="CUDA">
|
| 30 |
+
|
| 31 |
+
<img src="https://img.shields.io/badge/ROCm-AMD-ED2B23?style=for-the-badge&labelColor=0A0A0A" alt="ROCm">
|
| 32 |
+
|
| 33 |
+
<img src="https://img.shields.io/badge/TPU-Google-4285F4?style=for-the-badge&labelColor=0A0A0A" alt="TPU">
|
| 34 |
+
|
| 35 |
+
<img src="https://img.shields.io/badge/MLX-Apple-555555?style=for-the-badge&labelColor=0A0A0A&logo=apple&logoColor=white" alt="MLX">
|
| 36 |
+
|
| 37 |
+
<img src="https://img.shields.io/badge/XPU-Intel-0071C5?style=for-the-badge&labelColor=0A0A0A&logo=intel&logoColor=0071C5" alt="XPU">
|
| 38 |
+
|
| 39 |
+
<br><br>
|
| 40 |
+
|
| 41 |
# Inject Vision Into Any Language Model.
|
| 42 |
|
| 43 |
**Open-source framework for adding multimodal vision capabilities to any HuggingFace LLM.**<br>
|
| 44 |
+
**Architecture-agnostic. Multi-backend. Production-ready. Built by [OpceanAI](https://huggingface.co/OpceanAI).**
|
| 45 |
|
| 46 |
+
<br>
|
| 47 |
|
| 48 |
+
[GitHub](https://github.com/OpceanAI/openllava)
|
| 49 |
+
|
| 50 |
+
[Hugging Face](https://huggingface.co/Openllava)
|
| 51 |
+
|
| 52 |
+
[Sponsor](https://github.com/sponsors/aguitauwu)
|
| 53 |
|
| 54 |
<br>
|
| 55 |
|
| 56 |
+
---
|
| 57 |
+
|
| 58 |
</div>
|
| 59 |
|
| 60 |
## What is OpenLLaVA?
|
| 61 |
|
| 62 |
+
**OpenLLaVA** is a comprehensive open-source framework for injecting vision capabilities into any language model. It provides a complete pipeline, from model construction through training, inference, serving, export, and evaluation, all accessible through a unified Python API and CLI.
|
| 63 |
|
| 64 |
+
The framework supports any LLM architecture (Llama, Mistral, Qwen, Gemma, Phi, DeepSeek, and more) and any HuggingFace-compatible vision encoder. It automatically detects model dimensions, constructs the appropriate projector, patches the tokenizer with visual tokens, and configures the full training and inference pipelines.
|
| 65 |
|
| 66 |
The central design goal: **when a new language model drops, you should have a vision version in 48 hours.**
|
| 67 |
|
| 68 |
+
> OpenLLaVA is backend-agnostic. The same code runs on CUDA, ROCm, Apple MLX, Intel XPU, Google TPU, and CPU, with automatic hardware detection and configuration selection.
|
|
|
|
|
|
|
| 69 |
|
| 70 |
<br>
|
| 71 |
|
| 72 |
+
---
|
| 73 |
|
| 74 |
## Quickstart
|
| 75 |
|
|
|
|
|
|
|
|
|
|
|
|
|
| 76 |
```bash
pip install openllava            # Core
pip install "openllava[cli]"     # With CLI tools (quoted so shells such as zsh do not expand the brackets)
pip install "openllava[serve]"   # With serving
pip install "openllava[all]"     # Full installation
```
|
| 82 |
|
| 83 |
+
### Inject Vision Into Any LLM
|
|
|
|
|
|
|
| 84 |
|
| 85 |
```python
from openllava import OpenLLaVA, Backend

# Pair a causal LM with a vision encoder; Backend.AUTO selects the hardware backend.
model = OpenLLaVA(
    llm="meta-llama/Meta-Llama-3-8B",  # Hugging Face Hub ID of the base LLM
    vision_encoder="google/siglip2-so400m-patch14-384",
    backend=Backend.AUTO,
)
```
|
| 94 |
|
| 95 |
That's it. OpenLLaVA auto-detects hidden dimensions, builds the projector, and patches the tokenizer. No boilerplate. No config files.
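
As a rough illustration of what dimension auto-detection involves, the sketch below reads the hidden size from a config-like object by trying common attribute names. `detect_hidden_size` and the candidate list are hypothetical and are not OpenLLaVA's actual API; the attribute names themselves (`hidden_size`, `n_embd`, `d_model`) are ones that various Hugging Face model configs use.

```python
from types import SimpleNamespace

# Hypothetical helper, not part of OpenLLaVA: attribute names that different
# Hugging Face configs use for the model's hidden dimension.
HIDDEN_SIZE_ATTRS = ("hidden_size", "n_embd", "d_model")


def detect_hidden_size(config) -> int:
    """Return the first hidden-size attribute found on a config-like object."""
    for name in HIDDEN_SIZE_ATTRS:
        value = getattr(config, name, None)
        if isinstance(value, int) and value > 0:
            return value
    raise ValueError(f"no hidden-size attribute among {HIDDEN_SIZE_ATTRS}")


# Stand-ins for real configs; the numbers are illustrative.
llama_like = SimpleNamespace(hidden_size=4096)
gpt2_like = SimpleNamespace(n_embd=768)

print(detect_hidden_size(llama_like))
print(detect_hidden_size(gpt2_like))
```

A projector builder could then size its output layer from the detected LLM width and its input layer from the vision encoder's width.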
|
| 96 |
|
| 97 |
+
### Train with LoRA
|
| 98 |
|
| 99 |
```python
# Attach LoRA adapters to the language model.
model.lora(r=64, alpha=128, dropout=0.05)

# Phase 1 uses the pretraining (alignment) data, phase 2 the instruction data.
model.train(
    phase1=dict(dataset="liuhaotian/LLaVA-Pretrain", samples=100_000),
    phase2=dict(dataset="liuhaotian/LLaVA-Instruct-150K", learning_rate=2e-4),
    resume=True,
)

# Upload the trained model to the Hub.
model.push("my-org/my-vision-model")
```
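
To make the `r=64, alpha=128` setting concrete, here is a small worked calculation using the standard LoRA formulation, in which the low-rank update is scaled by `alpha / r` and each adapted weight of shape `d_out x d_in` gains `r * (d_in + d_out)` trainable parameters. The 4096-wide projection is an illustrative assumption, not a figure from OpenLLaVA, and both function names are made up for the example.

```python
def lora_scaling(r: int, alpha: float) -> float:
    """Scaling factor applied to the low-rank update B @ A in standard LoRA."""
    return alpha / r


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added to one linear layer: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


# r and alpha come from the example above; the 4096 x 4096 layer is an assumed size.
print(lora_scaling(r=64, alpha=128))             # 2.0
print(lora_params(d_in=4096, d_out=4096, r=64))  # 524288
```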
|
| 110 |
|
| 111 |
+
### FastVisionModel API
|
| 112 |
|
| 113 |
```python
from openllava.api import FastVisionModel

# Load a published vision model in 4-bit precision.
model, tokenizer = FastVisionModel.from_pretrained(
    "Openllava/Yaki",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Wrap the model with LoRA adapters for fine-tuning.
model = FastVisionModel.get_peft_model(model, r=16, alpha=32)
```
|
|
|
|
| 124 |
|
| 125 |
+
### Serve as OpenAI-Compatible API
|
| 126 |
|
| 127 |
```bash
openllava serve Openllava/Yaki --port 8000
```
|
| 130 |
+
|
| 131 |
```python
from openai import OpenAI

# Point the official OpenAI client at the local OpenLLaVA server.
client = OpenAI(api_key="openllava", base_url="http://localhost:8000/v1")

response = client.chat.completions.create(
    model="yaki",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
        ],
    }],
)

# Print the model's reply.
print(response.choices[0].message.content)
```
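
For local images, OpenAI-style chat payloads commonly carry a base64 data URL in place of a remote URL. The helper below is a sketch under the assumption that the OpenLLaVA server accepts data URLs the same way; `image_message` is a hypothetical name, not part of either library.

```python
import base64


def image_message(text: str, image_bytes: bytes, mime: str = "image/jpeg") -> dict:
    """Build one user message with a text part and an inline base64 image part."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}},
        ],
    }


# Placeholder bytes stand in for a real file read with open(path, "rb").read().
message = image_message("What is in this image?", b"\xff\xd8\xff")
print(message["content"][1]["image_url"]["url"])
```

The resulting dictionary can be passed as an element of the `messages` list in a chat completion request.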
|
| 147 |
|
| 148 |
<br>
|
| 149 |
|
| 150 |
+
---
|
| 151 |
|
| 152 |
+
## Key Features
|
| 153 |
|
| 154 |
+
<table>
|
| 155 |
+
<tr>
|
| 156 |
<td width="50%" valign="top">
|
| 157 |
|
| 158 |
+
**Model Construction**
|
| 159 |
+
- Vision injection into any HuggingFace LLM in 3 lines
|
| 160 |
+
- AnyRes dynamic high-resolution with patch grouping
|
| 161 |
+
- YakiProjector: configurable MLP alignment
|
| 162 |
+
- Auto-detects hidden dims, attention heads, vocab size
|
| 163 |
+
- Supports LoRA-patched models
|
| 164 |
+
|
| 165 |
+
**Training Pipeline**
|
| 166 |
+
- 3-phase training: alignment → instruction → RL
|
| 167 |
+
- LoRA, LoRA+, DoRA, QLoRA, Split LoRA, LoRAGA, LoRAFA
|
| 168 |
+
- BitNet ternary training (b1.58)
|
| 169 |
+
- MoE + LoRA fusion
|
| 170 |
+
- FP8 training on H100
|
| 171 |
+
- Padding-free + sequence packing
|
| 172 |
+
- Curriculum learning
|
| 173 |
+
|
| 174 |
+
**RL Alignment**
|
| 175 |
+
- DPO, GRPO, ORPO, PPO
|
| 176 |
+
- Composable reward functions
|
| 177 |
+
- Visual reasoning reward support
|
| 178 |
|
| 179 |
+
</td>
|
| 180 |
+
<td width="50%" valign="top">
|
|
|
|
| 181 |
|
| 182 |
+
**Inference & Serving**
|
| 183 |
+
- Continuous batching
|
| 184 |
+
- PagedAttention (4x memory efficiency)
|
| 185 |
+
- Speculative decoding (Eagle, Medusa, NGram)
|
| 186 |
+
- KV cache: quantization, eviction, compression
|
| 187 |
+
- OpenAI-compatible FastAPI server
|
| 188 |
+
- Streaming support
|
| 189 |
+
|
| 190 |
+
**40+ Optimizations**
|
| 191 |
+
- torch.compile full-graph
|
| 192 |
+
- GPTQ / AWQ / FP4 / NVFP4
|
| 193 |
+
- GaLore gradient projection
|
| 194 |
+
- torchao integration
|
| 195 |
+
- EMA training stability
|
| 196 |
+
- Selective activation checkpointing
|
| 197 |
+
|
| 198 |
+
**Distributed Training**
|
| 199 |
+
- FSDP2, DeepSpeed ZeRO (0-3)
|
| 200 |
+
- Tensor, Pipeline, Expert parallelism
|
| 201 |
+
- Ring Attention (long context)
|
| 202 |
+
- Heterogeneous GPU+CPU+TPU training
|
| 203 |
+
- Auto-parallelism detection
|
| 204 |
|
| 205 |
</td>
|
| 206 |
</tr>
|
|
|
|
| 210 |
|
| 211 |
---
|
| 212 |
|
| 213 |
+
## Multi-Backend Support
|
| 214 |
|
| 215 |
| Backend | Hardware | Status |
|:--------|:---------|:-------|
| CUDA | NVIDIA GPUs (Ampere, Ada, Hopper, Blackwell) | ✅ Production |
| ROCm | AMD GPUs (MI250, MI300X, RX 7000) | ✅ Production |
| CPU FP32 | x86-64 and ARM CPUs (AVX-512, AVX2, NEON) | ✅ Production |
| TPU (XLA/SPMD) | Google TPU v3-v5 | 🔶 Beta |
| MLX | Apple Silicon M1-M4 | 🔶 Beta |
| XPU | Intel Arc, Data Center GPU | 🔶 Beta |
| Heterogeneous | GPU + CPU + TPU mixed | 🔶 Beta |
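
One way to picture automatic backend selection is a fixed preference order checked against whatever the host reports as available. The sketch below is an assumption about the general approach, not OpenLLaVA's implementation: `pick_backend` and its priority order are hypothetical, and a real probe would query the framework runtime instead of taking a set of names.

```python
# Hypothetical preference order: accelerators first, CPU as the fallback.
PRIORITY = ("cuda", "rocm", "tpu", "mlx", "xpu", "cpu")


def pick_backend(available: set[str]) -> str:
    """Return the highest-priority backend present in `available`."""
    for name in PRIORITY:
        if name in available:
            return name
    raise RuntimeError("no supported backend available")


print(pick_backend({"cpu", "mlx"}))
print(pick_backend({"cpu"}))
```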
|
| 224 |
|
| 225 |
+
<br>
|
| 226 |
|
| 227 |
+
---
|
| 228 |
|
| 229 |
+
## Stack
|
| 230 |
|
| 231 |
| Layer | Technology | Purpose |
|:------|:----------:|:--------|
| CUDA Kernels | C/CUDA | Fused projector ops, cross-attention, VQ lookup |
| Core | C++ | Memory management, tensor routing, async streams |
| Bindings | pybind11 | C++ ↔ Python bridge |
| Triton | OpenAI Triton | Fused attention, RoPE, SwiGLU, RMSNorm |
| API | Python | Public interface, FastVisionModel, Trainer |
| Backends | CUDA/ROCm/MLX/TPU/XPU | Hardware abstraction |
| Export | GGUF/ONNX/SafeTensors/vLLM/MLX | Deployment formats |
|
| 240 |
|
| 241 |
<br>
|
| 242 |
|
| 243 |
---
|
| 244 |
|
| 245 |
+
## Architecture
|
| 246 |
|
| 247 |
```
┌─────────────────────────────────────────────────────────────┐
│                     OpenLLaVA Framework                     │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Input: Image + Text                                        │
│         │                                                   │
│  ┌──────▼───────────────────────────────────────────────┐   │
│  │  Vision Encoder (SigLIP2, CLIP, DINOv2, any HF)      │   │
│  └──────────────────────┬───────────────────────────────┘   │
│                         │ patch features                    │
│  ┌──────────────────────▼───────────────────────────────┐   │
│  │  YakiProjector: Patch Grouping 3×3 + MLP 2-layer     │   │
│  │  [vision_dim × 9] → [llm_dim]                        │   │
│  └──────────────────────┬───────────────────────────────┘   │
│                         │ vision embeddings                 │
│  ┌──────────────────────▼───────────────────────────────┐   │
│  │  Language Model (any AutoModelForCausalLM)           │   │
│  │  QLoRA 4-bit NF4 · LoRA r=64 · Flash Attention       │   │
│  └──────────────────────┬───────────────────────────────┘   │
│                         │                                   │
│  Output: Text + <think> reasoning blocks                    │
└─────────────────────────────────────────────────────────────┘
```
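
The shape arithmetic in the diagram can be checked with a few lines. This sketch assumes a 384-pixel input with 14-pixel patches (a 27×27 grid) and a vision width of 1152; these numbers are illustrative assumptions, and `grouped_shapes` is a hypothetical helper, not OpenLLaVA code.

```python
def grouped_shapes(image_size: int, patch_size: int, vision_dim: int, group: int = 3):
    """Return (tokens_before, tokens_after, projector_input_dim) for group x group patch grouping."""
    side = image_size // patch_size            # patches per side (floor division)
    if side % group != 0:
        raise ValueError("patch grid is not divisible by the group size")
    tokens_before = side * side
    tokens_after = (side // group) ** 2        # each group x group block becomes one token
    projector_in = vision_dim * group * group  # concatenated features of the grouped patches
    return tokens_before, tokens_after, projector_in


print(grouped_shapes(image_size=384, patch_size=14, vision_dim=1152))
```

Under these assumed sizes, 729 patch features collapse to 81 visual tokens, each a 10368-wide vector that the two-layer MLP would map to the LLM width.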
|
| 271 |
|
| 272 |
+
<br>
|
| 273 |
|
| 274 |
+
---
|
| 275 |
|
| 276 |
+
## Yadis Architecture
|
| 277 |
+
|
| 278 |
+
Yadis is OpenLLaVA's flagship multimodal architecture: the long-term evolution of the framework, combining discrete visual tokens, MLP projection, and cross-attention in every LLM layer.
|
| 279 |
|
| 280 |
```python
# Yadis Routing: multiple vision experts with an MoE router
from openllava import OpenLLaVA, experts

model = OpenLLaVA(
    llm="OpceanAI/OwO-32B",
    architecture="yadis_routing",
    experts=[
        experts.Visual("google/siglip2-so400m-patch14-384"),
        experts.OCR("deepseek-ai/DeepSeek-OCR-2"),
    ],
)

# Yadis Full: discrete tokens + cross-attention per layer
model = OpenLLaVA(
    llm="OpceanAI/OwO-32B",
    architecture="yadis_full",
    vision_encoder="google/siglip2-so400m-patch14-384",
)
```
|
| 300 |
|
| 301 |
| Mode | Description |
|------|-------------|
| `llava` | LLaVA-style MLP projection (default) |
| `yadis_routing` | Multiple expert encoders + MoE router |
| `yadis_full` | Discrete tokens + cross-attention per layer |
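
The `yadis_routing` mode is described as an MoE router over expert encoders. As a generic illustration of that idea (not the Yadis implementation), the sketch below turns per-expert router scores into softmax gating weights and keeps the top-k experts. The function name and the scores are made up for the example.

```python
import math


def route(scores: dict[str, float], top_k: int = 1) -> dict[str, float]:
    """Softmax the router scores, keep the top_k experts, and renormalize their weights."""
    peak = max(scores.values())  # subtract the max for numerical stability
    exp = {name: math.exp(s - peak) for name, s in scores.items()}
    total = sum(exp.values())
    probs = {name: e / total for name, e in exp.items()}
    kept = sorted(probs, key=probs.get, reverse=True)[:top_k]
    kept_total = sum(probs[name] for name in kept)
    return {name: probs[name] / kept_total for name in kept}


# Illustrative scores: a text-heavy image might score the OCR expert higher.
weights = route({"visual": 0.5, "ocr": 2.0}, top_k=2)
print(weights)
```

In a routed model, the expert outputs would then be combined using these weights before reaching the language model.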
|
| 306 |
|
| 307 |
<br>
|
| 308 |
|
| 309 |
---
|
| 310 |
|
|
|
|
|
|
|
|
|
|
|
|
|
| 311 |
## OpceanAI Vision Models
|
| 312 |
|
| 313 |
+
OpceanAI uses OpenLLaVA to publish vision versions of new language models within 48 hours of release.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 314 |
|
| 315 |
<table>
|
| 316 |
<tr>
|
| 317 |
+
<td width="33%" valign="top">
|
| 318 |
|
| 319 |
+
**Yaki v1**
|
| 320 |
|
| 321 |
+
Vision-language model on Yuuki RxG 8B. Complex visual reasoning, bilingual ES/EN, preserves `<think>` chain-of-thought for multimodal tasks.
|
| 322 |
|
| 323 |
+
Base: DeepSeek-R1-Qwen3-8B finetune<br>
|
| 324 |
+
Encoder: SigLIP 2 SO400M<br>
|
| 325 |
+
LoRA: r=64, alpha=128
|
| 326 |
|
| 327 |
+
[Openllava/Yaki](https://huggingface.co/Openllava/Yaki)
|
| 328 |
|
| 329 |
</td>
|
| 330 |
+
<td width="33%" valign="top">
|
| 331 |
+
|
| 332 |
+
**Yaki v2** *(planned)*
|
| 333 |
+
|
| 334 |
+
Built on Yuuki ExG 14B with cross-attention architecture (OpenLLaVA v4).
|
| 335 |
|
| 336 |
+
</td>
|
| 337 |
+
<td width="33%" valign="top">
|
| 338 |
|
| 339 |
+
**Yaki v3** *(planned)*
|
| 340 |
|
| 341 |
+
Built on OwO 32B with full Yadis routing architecture. OCR + visual experts.
|
| 342 |
|
| 343 |
</td>
|
| 344 |
</tr>
|
|
|
|
| 348 |
|
| 349 |
---
|
| 350 |
|
|
|
|
|
|
|
|
|
|
|
|
|
| 351 |
## Philosophy
|
| 352 |
|
|
|
|
|
|
|
|
|
|
|
|
|
| 353 |
<table>
|
| 354 |
<tr>
|
| 355 |
<td width="50%" valign="top">
|
| 356 |
|
| 357 |
+
**Architecture Agnostic by Design**
|
| 358 |
|
| 359 |
+
Most existing multimodal frameworks are hardcoded to specific model families. OpenLLaVA is not. The projector adapts to any hidden dimension. The patcher works on any causal LM. The training engine handles any tokenizer.
|
|
|
|
|
|
|
|
|
|
| 360 |
|
| 361 |
**Speed Over Ceremony**
|
| 362 |
|
| 363 |
+
When a new model drops, the window to publish a vision version is 48–72 hours. OpenLLaVA is designed for that constraint: minimal configuration, automated phase management, one-command training.
|
| 364 |
|
| 365 |
</td>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 366 |
<td width="50%" valign="top">
|
| 367 |
|
| 368 |
**Low Level Where It Matters**
|
| 369 |
|
| 370 |
+
The projector is the critical path. The CUDA kernel for the fused MLP and the C++ memory manager exist because training throughput on a single GPU is the binding constraint for a zero-budget lab.
|
|
|
|
|
|
|
|
|
|
| 371 |
|
| 372 |
**Fully Open**
|
| 373 |
|
|
|
|
| 381 |
|
| 382 |
---
|
| 383 |
|
|
|
|
|
|
|
|
|
|
|
|
|
| 384 |
## Roadmap
|
| 385 |
|
| 386 |
| Version | Features | Status |
|---------|----------|--------|
| **v1-v3** | LLaVA-style, QLoRA, AnyRes, 3-phase pipeline, multi-backend | ✅ Done |
| **v4-v5** | CUDA kernels, GGUF vision export, CPU offloading, cross-attention | Active |
| **v6-v7** | Discrete visual tokens (VQ-VAE), multi-expert routing | Planned |
| **v8-v9** | Video support, hybrid architectures | Planned |
| **v10** | Yadis complete, omnimodal prep | Planned |
|
| 393 |
|
| 394 |
<br>
|
| 395 |
|
| 396 |
---
|
| 397 |
|
| 398 |
<div align="center">
|
| 399 |
|
| 400 |
## Built by OpceanAI
|
| 401 |
|
| 402 |
+
OpenLLaVA is the vision infrastructure layer of [OpceanAI](https://huggingface.co/OpceanAI), an independent AI research organization built on zero budget, consumer hardware, and measurable results.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 403 |
|
| 404 |
<br>
|
| 405 |
|
| 406 |
[OpceanAI on Hugging Face](https://huggingface.co/OpceanAI)
|
| 407 |
|
| 408 |
+
[GitHub](https://github.com/OpceanAI/openllava)
|
| 409 |
|
| 410 |
[Sponsor](https://github.com/sponsors/aguitauwu)
|
| 411 |
|
|
|
|
| 413 |
|
| 414 |
---
|
| 415 |
|
|
|
|
|
|
|
| 416 |
**Open framework. Open models. Zero budget. Measurable results.**
|
| 417 |
|
| 418 |
+
[OpenLLaVA on GitHub](https://github.com/OpceanAI/openllava)
|
|
|
|
|
|
|
| 419 |
|
| 420 |
+
*Inject vision into any language model.*
|
| 421 |
|
| 422 |
+
</div>
|
|
|
|
|
|