Update README.md
README.md
CHANGED
|
@@ -5,6 +5,385 @@ colorFrom: red
|
|
| 5 |
colorTo: pink
|
| 6 |
sdk: static
|
| 7 |
pinned: false
|
|
|
|
| 8 |
---
|
|
|
|
| 9 |
|
| 10 |
-
| 5 |
colorTo: pink
|
| 6 |
sdk: static
|
| 7 |
pinned: false
|
| 8 |
+
license: apache-2.0
|
| 9 |
---
|
| 10 |
+
<div align="center">
|
| 11 |
|
| 12 |
+
<br>
|
| 13 |
+
|
| 14 |
+
<img src="https://img.shields.io/badge/%F0%9F%91%81%EF%B8%8F-OPENLLAVA-0D1117?style=for-the-badge&labelColor=0D1117" alt="OpenLLaVA" height="60">
|
| 15 |
+
|
| 16 |
+
<br><br>
|
| 17 |
+
|
| 18 |
+
# Inject Vision Into Any Language Model.
|
| 19 |
+
|
| 20 |
+
**Open-source framework for adding multimodal vision capabilities to any HuggingFace LLM.**<br>
|
| 21 |
+
**Low-level. Fast. Free. Built by [OpceanAI](https://huggingface.co/OpceanAI).**
|
| 22 |
+
|
| 23 |
+
<br>
|
| 24 |
+
|
| 25 |
+
[PyPI](https://pypi.org/project/openllava)
|
| 26 |
+
|
| 27 |
+
[Hugging Face](https://huggingface.co/OpenLLaVA)
|
| 28 |
+
|
| 29 |
+
[License: Apache 2.0](https://opensource.org/licenses/Apache-2.0)
|
| 30 |
+
|
| 31 |
+
[Discord](https://discord.gg/openllava)
|
| 32 |
+
|
| 33 |
+
<br>
|
| 34 |
+
|
| 35 |
+
---
|
| 36 |
+
|
| 37 |
+
<br>
|
| 38 |
+
|
| 39 |
+
</div>
|
| 40 |
+
|
| 41 |
+
## What is OpenLLaVA?
|
| 42 |
+
|
| 43 |
+
**OpenLLaVA** is an open-source framework that injects vision capabilities into any HuggingFace causal language model, without tying you to a specific architecture or backend. It builds on the LLaVA-style projection architecture and exposes a clean Python API; custom CUDA kernels and a C++ core are planned (see the Roadmap).
|
| 44 |
+
|
| 45 |
+
The framework is developed and maintained by **OpceanAI** as infrastructure for their vision model pipeline. Every model OpceanAI releases through OpenLLaVA feeds improvements back into the framework.
|
| 46 |
+
|
| 47 |
+
The central design goal: **when a new language model drops, you should have a vision version in 48 hours.**
|
| 48 |
+
|
| 49 |
+
<br>
|
| 50 |
+
|
| 51 |
+
---
|
| 52 |
+
|
| 53 |
+
<br>
|
| 54 |
+
|
| 55 |
+
<div align="center">
|
| 56 |
+
|
| 57 |
+
## Quickstart
|
| 58 |
+
|
| 59 |
+
</div>
|
| 60 |
+
|
| 61 |
+
<br>
|
| 62 |
+
|
| 63 |
+
```bash
pip install openllava
```
|
| 66 |
+
|
| 67 |
+
```python
from openllava import patch_model
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any HuggingFace model. Any vision encoder.
model = AutoModelForCausalLM.from_pretrained("your-org/your-llm")
tokenizer = AutoTokenizer.from_pretrained("your-org/your-llm")

model = patch_model(
    model,
    vision_encoder="google/siglip2-so400m-patch14-384",
    projector_layers=3,
)
```
|
| 81 |
+
|
| 82 |
+
That's it. OpenLLaVA auto-detects hidden dimensions, builds the projector, and patches the tokenizer. No boilerplate. No config files.
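
As a rough illustration of what patching the tokenizer involves, the sketch below registers an `<image>` token in a toy vocabulary and expands each placeholder into one slot per vision patch. The helper names (`add_image_token`, `num_vision_tokens`, `expand_image_placeholders`), the toy vocabulary, and the assumption that the encoder emits one token per 14×14 patch of a 384×384 image with no class token are all hypothetical; this is not OpenLLaVA's actual API.

```python
# Illustrative sketch only: helper names and the toy vocabulary are hypothetical,
# not part of the OpenLLaVA API.

IMAGE_TOKEN = "<image>"


def add_image_token(vocab: dict[str, int]) -> int:
    """Register the image token in the vocabulary and return its id."""
    if IMAGE_TOKEN not in vocab:
        # Next free id; a real embedding table would have to grow by one row.
        vocab[IMAGE_TOKEN] = len(vocab)
    return vocab[IMAGE_TOKEN]


def num_vision_tokens(image_size: int, patch_size: int) -> int:
    """Assume one token per non-overlapping square patch and no class token."""
    per_side = image_size // patch_size
    return per_side * per_side


def expand_image_placeholders(input_ids: list[int], image_id: int, n_tokens: int) -> list[int]:
    """Replace each image placeholder with n_tokens copies, reserving slots for projected features."""
    expanded: list[int] = []
    for token_id in input_ids:
        expanded.extend([image_id] * n_tokens if token_id == image_id else [token_id])
    return expanded


vocab = {"<s>": 0, "describe": 1, "this": 2}
image_id = add_image_token(vocab)
n_tokens = num_vision_tokens(image_size=384, patch_size=14)
prompt_ids = [vocab["<s>"], image_id, vocab["describe"], vocab["this"]]
expanded_ids = expand_image_placeholders(prompt_ids, image_id, n_tokens)
print(image_id, n_tokens, len(expanded_ids))
```

In a real pipeline, the reserved slots would be overwritten by the projector's output embeddings before the sequence reaches the LLM.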
|
| 83 |
+
|
| 84 |
+
<br>
|
| 85 |
+
|
| 86 |
+
---
|
| 87 |
+
|
| 88 |
+
<br>
|
| 89 |
+
|
| 90 |
+
<div align="center">
|
| 91 |
+
|
| 92 |
+
## Architecture
|
| 93 |
+
|
| 94 |
+
</div>
|
| 95 |
+
|
| 96 |
+
<br>
|
| 97 |
+
|
| 98 |
+
<table>
|
| 99 |
+
<tr>
|
| 100 |
+
<td width="50%" valign="top">
|
| 101 |
+
|
| 102 |
+
**Vision Encoder**
|
| 103 |
+
|
| 104 |
+
Any encoder from HuggingFace — SigLIP 2, CLIP, EVA-CLIP, InternViT. OpenLLaVA auto-reads the output dimension and handles tokenization regardless of encoder architecture.
|
| 105 |
+
|
| 106 |
+
<br>
|
| 107 |
+
|
| 108 |
+
**Projector Engine**
|
| 109 |
+
|
| 110 |
+
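3-layer MLP with GELU activation. A fused CUDA kernel, intended to outperform a naive PyTorch implementation, is planned; the PyTorch projector comes first (see the Roadmap). The hidden dimension is computed automatically from the encoder output size and the LLM input size.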
|
| 111 |
+
|
| 112 |
+
</td>
|
| 113 |
+
<td width="50%" valign="top">
|
| 114 |
+
|
| 115 |
+
**Model Patcher**
|
| 116 |
+
|
| 117 |
+
Patches any HuggingFace causal LM to accept vision tokens. Adds `<image>` special token, extends the embedding layer, and wires the projector output into the LLM input stream. Supports LoRA-patched models.
|
| 118 |
+
|
| 119 |
+
<br>
|
| 120 |
+
|
| 121 |
+
**Training Engine**
|
| 122 |
+
|
| 123 |
+
Two-phase training built in. Phase 1: projector warmup with frozen LLM. Phase 2: joint fine-tuning with LoRA. Gradient checkpointing, Flash Attention 2, and bfloat16 enabled by default.
|
| 124 |
+
|
| 125 |
+
</td>
|
| 126 |
+
</tr>
|
| 127 |
+
</table>
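
To make the Projector Engine description concrete, here is a minimal pure-Python sketch of a 3-layer MLP with GELU that maps vision features to the LLM width. The rule that every hidden layer uses the LLM width is an assumption borrowed from LLaVA-style projectors, the widths 1152 and 4096 are example values, and the function names are hypothetical. It shows the shapes and the unfused math only, not the planned CUDA kernel.

```python
# Illustrative sketch only: the hidden-width rule (hidden = LLM width) is an assumption,
# not a documented OpenLLaVA behavior. Function names are hypothetical.
import math
import random


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def projector_shapes(vision_dim: int, llm_dim: int, num_layers: int = 3) -> list[tuple[int, int]]:
    """(in, out) sizes of each linear layer; every layer after the first is llm_dim -> llm_dim."""
    dims = [vision_dim] + [llm_dim] * num_layers
    return list(zip(dims[:-1], dims[1:]))


def parameter_count(shapes: list[tuple[int, int]]) -> int:
    """Weights plus biases over all linear layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in shapes)


def init_layers(shapes: list[tuple[int, int]], seed: int = 0) -> list[tuple[list[list[float]], list[float]]]:
    """Small random weights and zero biases, one (weight, bias) pair per layer."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in shapes:
        weight = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
        layers.append((weight, [0.0] * n_out))
    return layers


def project(token: list[float], layers: list[tuple[list[list[float]], list[float]]]) -> list[float]:
    """Apply Linear -> GELU -> Linear -> GELU -> Linear to one vision token."""
    for index, (weight, bias) in enumerate(layers):
        token = [sum(w * x for w, x in zip(row, token)) + b for row, b in zip(weight, bias)]
        if index < len(layers) - 1:  # no activation after the final layer
            token = [gelu(v) for v in token]
    return token


# Full-size shapes: 1152-wide vision features into a 4096-wide LLM (example widths).
full_shapes = projector_shapes(vision_dim=1152, llm_dim=4096)
print(full_shapes, parameter_count(full_shapes))

# Tiny forward pass so the example runs instantly in pure Python.
tiny_layers = init_layers(projector_shapes(vision_dim=4, llm_dim=6))
projected = project([0.5, -1.0, 0.25, 2.0], tiny_layers)
print(len(projected))
```

Under these assumptions the projector has roughly 38M parameters, small next to a multi-billion-parameter LLM, which is why it can be warmed up on its own while the LLM stays frozen.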
|
| 128 |
+
|
| 129 |
+
<br>
|
| 130 |
+
|
| 131 |
+
---
|
| 132 |
+
|
| 133 |
+
<br>
|
| 134 |
+
|
| 135 |
+
<div align="center">
|
| 136 |
+
|
| 137 |
+
## Stack
|
| 138 |
+
|
| 139 |
+
</div>
|
| 140 |
+
|
| 141 |
+
<br>
|
| 142 |
+
|
| 143 |
+
| Layer | Technology | Purpose |
|
| 144 |
+
|:------|:----------:|:--------|
|
| 145 |
+
| CUDA Kernels | C/CUDA | Fused projector ops, vision token attention |
|
| 146 |
+
| Core | C++ | Memory management, tensor routing |
|
| 147 |
+
| Bindings | pybind11 | C++ → Python bridge |
|
| 148 |
+
| API | Python | Public interface |
|
| 149 |
+
| Export | HuggingFace | Standard model format + GGUF |
|
| 150 |
+
|
| 151 |
+
<br>
|
| 152 |
+
|
| 153 |
+
---
|
| 154 |
+
|
| 155 |
+
<br>
|
| 156 |
+
|
| 157 |
+
<div align="center">
|
| 158 |
+
|
| 159 |
+
## Training Pipeline
|
| 160 |
+
|
| 161 |
+
</div>
|
| 162 |
+
|
| 163 |
+
<br>
|
| 164 |
+
|
| 165 |
+
```python
from openllava import OpenLLaVATrainer

# `model` is the patched model from the Quickstart.
trainer = OpenLLaVATrainer(
    model=model,
    vision_encoder="google/siglip2-so400m-patch14-384",
    pretrain_dataset="liuhaotian/LLaVA-Pretrain",        # Phase 1
    instruct_dataset="liuhaotian/LLaVA-Instruct-150K",   # Phase 2
    lora_r=64,
    lora_alpha=128,
    lora_target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

trainer.train()  # Handles both phases automatically
```
|
| 180 |
+
|
| 181 |
+
OpenLLaVA manages phase transitions, learning rate schedules, and checkpoint saving. You run one command.
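
The two phases can be pictured as a switch over which parameter groups receive gradients. The sketch below is a simplified model of that idea, not the trainer's internals: the group names, the `Phase` class, and the assumption that the vision encoder stays frozen throughout are hypothetical. The LoRA values mirror the trainer call above, and the 4096 width is an example.

```python
# Illustrative sketch only: group names and the frozen vision encoder are assumptions;
# the LoRA values (r=64, alpha=128) mirror the trainer call above.
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    name: str
    trainable: frozenset[str]


PHASES = (
    # Phase 1: projector warmup with the LLM frozen.
    Phase("projector_warmup", frozenset({"projector"})),
    # Phase 2: joint fine-tuning; base LLM weights stay frozen, LoRA adapters train.
    Phase("joint_finetune", frozenset({"projector", "lora_adapters"})),
)

ALL_GROUPS = ("vision_encoder", "projector", "llm_base", "lora_adapters")


def requires_grad(phase: Phase) -> dict[str, bool]:
    """Map every parameter group to whether it receives gradients in this phase."""
    return {group: group in phase.trainable for group in ALL_GROUPS}


def lora_scale(r: int, alpha: int) -> float:
    """Standard LoRA scaling applied to the low-rank update: alpha / r."""
    return alpha / r


def lora_params_per_module(d_in: int, d_out: int, r: int) -> int:
    """A is r x d_in and B is d_out x r, so the adapter adds r * (d_in + d_out) parameters."""
    return r * (d_in + d_out)


for phase in PHASES:
    print(phase.name, requires_grad(phase))

print(lora_scale(r=64, alpha=128))
print(lora_params_per_module(d_in=4096, d_out=4096, r=64))
```

With `lora_r=64` and `lora_alpha=128` the low-rank update is scaled by 2.0, and each adapted 4096×4096 projection adds 524,288 trainable parameters under these example dimensions.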
|
| 182 |
+
|
| 183 |
+
<br>
|
| 184 |
+
|
| 185 |
+
---
|
| 186 |
+
|
| 187 |
+
<br>
|
| 188 |
+
|
| 189 |
+
<div align="center">
|
| 190 |
+
|
| 191 |
+
## OpceanAI Vision Models
|
| 192 |
+
|
| 193 |
+
</div>
|
| 194 |
+
|
| 195 |
+
<br>
|
| 196 |
+
|
| 197 |
+
OpceanAI uses OpenLLaVA to publish vision versions of new language models as they are released. These are the models built with the framework:
|
| 198 |
+
|
| 199 |
+
<br>
|
| 200 |
+
|
| 201 |
+
<table>
|
| 202 |
+
<tr>
|
| 203 |
+
<td width="50%" valign="top">
|
| 204 |
+
|
| 205 |
+
**Yaki YuuKi+ Vision** *(in development)*
|
| 206 |
+
|
| 207 |
+
Vision-language model built on Yuuki RxG 8B (DeepSeek-R1-Qwen2.5-8B fine-tune). It targets complex visual reasoning, is bilingual (ES/EN), and preserves the Yuuki `<think>` chain-of-thought behavior for multimodal tasks.
|
| 208 |
+
|
| 209 |
+
Vision encoder: SigLIP 2 SO400M · LoRA r=64
|
| 210 |
+
|
| 211 |
+
[](https://huggingface.co/OpceanAI)
|
| 212 |
+
|
| 213 |
+
</td>
|
| 214 |
+
<td width="50%" valign="top">
|
| 215 |
+
|
| 216 |
+
**Yuuki NxG VL**
|
| 217 |
+
|
| 218 |
+
7B vision-language model fine-tuned from Qwen2.5-VL-7B-Instruct. Extends the NxG model family to multimodal tasks. The first OpceanAI vision model and the validation case for the OpenLLaVA pipeline.
|
| 219 |
+
|
| 220 |
+
[Yuuki-NxG-vl on Hugging Face](https://huggingface.co/OpceanAI/Yuuki-NxG-vl)
|
| 221 |
+
|
| 222 |
+
</td>
|
| 223 |
+
</tr>
|
| 224 |
+
</table>
|
| 225 |
+
|
| 226 |
+
<br>
|
| 227 |
+
|
| 228 |
+
---
|
| 229 |
+
|
| 230 |
+
<br>
|
| 231 |
+
|
| 232 |
+
<div align="center">
|
| 233 |
+
|
| 234 |
+
## Philosophy
|
| 235 |
+
|
| 236 |
+
</div>
|
| 237 |
+
|
| 238 |
+
<br>
|
| 239 |
+
|
| 240 |
+
<table>
|
| 241 |
+
<tr>
|
| 242 |
+
<td width="50%" valign="top">
|
| 243 |
+
|
| 244 |
+
**Model-Agnostic by Design**
|
| 245 |
+
|
| 246 |
+
Most major frameworks for multimodal training (LLaVA, LLaVA-Next, InstructBLIP) are tied to specific model families. OpenLLaVA is not. The projector adapts to any hidden dimension. The patcher works on any causal LM. The training engine handles any tokenizer.
|
| 247 |
+
|
| 248 |
+
</td>
|
| 249 |
+
<td width="50%" valign="top">
|
| 250 |
+
|
| 251 |
+
**Speed Over Ceremony**
|
| 252 |
+
|
| 253 |
+
When a new language model drops, the window to publish a vision version is 48–72 hours before the ecosystem moves on. OpenLLaVA is designed for that constraint — minimal configuration, automated phase management, one-command training.
|
| 254 |
+
|
| 255 |
+
</td>
|
| 256 |
+
</tr>
|
| 257 |
+
</table>
|
| 258 |
+
|
| 259 |
+
<table>
|
| 260 |
+
<tr>
|
| 261 |
+
<td width="50%" valign="top">
|
| 262 |
+
|
| 263 |
+
**Low Level Where It Matters**
|
| 264 |
+
|
| 265 |
+
The projector is the critical path. Everything else can be Python. The CUDA kernel for the fused MLP op and the C++ memory manager exist because training throughput on a single A100 is the binding constraint for a zero-budget lab.
|
| 266 |
+
|
| 267 |
+
</td>
|
| 268 |
+
<td width="50%" valign="top">
|
| 269 |
+
|
| 270 |
+
**Fully Open**
|
| 271 |
+
|
| 272 |
+
Apache 2.0. No gating. No commercial restrictions. The framework exists so that any researcher — with any model, any hardware, any budget — can build a competitive vision-language model.
|
| 273 |
+
|
| 274 |
+
</td>
|
| 275 |
+
</tr>
|
| 276 |
+
</table>
|
| 277 |
+
|
| 278 |
+
<br>
|
| 279 |
+
|
| 280 |
+
---
|
| 281 |
+
|
| 282 |
+
<br>
|
| 283 |
+
|
| 284 |
+
<div align="center">
|
| 285 |
+
|
| 286 |
+
## Roadmap
|
| 287 |
+
|
| 288 |
+
</div>
|
| 289 |
+
|
| 290 |
+
<br>
|
| 291 |
+
|
| 292 |
+
<table>
|
| 293 |
+
<tr>
|
| 294 |
+
<td width="50%" valign="top">
|
| 295 |
+
|
| 296 |
+
**Framework**
|
| 297 |
+
|
| 298 |
+
| Feature | Status |
|
| 299 |
+
|:--------|:------:|
|
| 300 |
+
| Python API + model patcher | In development |
|
| 301 |
+
| MLP projector (PyTorch) | In development |
|
| 302 |
+
| Two-phase training engine | In development |
|
| 303 |
+
| Fused CUDA projector kernel | Planned |
|
| 304 |
+
| C++ memory core | Planned |
|
| 305 |
+
| GGUF vision export | Planned |
|
| 306 |
+
| Multi-encoder support (BRAVE-style) | Planned |
|
| 307 |
+
|
| 308 |
+
</td>
|
| 309 |
+
<td width="50%" valign="top">
|
| 310 |
+
|
| 311 |
+
**Vision Models**
|
| 312 |
+
|
| 313 |
+
| Model | Status |
|
| 314 |
+
|:------|:------:|
|
| 315 |
+
| Yuuki NxG VL | Released |
|
| 316 |
+
| Yaki YuuKi+ Vision (8B) | In development |
|
| 317 |
+
| Community model pipeline | Planned |
|
| 318 |
+
|
| 319 |
+
</td>
|
| 320 |
+
</tr>
|
| 321 |
+
</table>
|
| 322 |
+
|
| 323 |
+
<br>
|
| 324 |
+
|
| 325 |
+
---
|
| 326 |
+
|
| 327 |
+
<br>
|
| 328 |
+
|
| 329 |
+
<div align="center">
|
| 330 |
+
|
| 331 |
+
## Contributing
|
| 332 |
+
|
| 333 |
+
</div>
|
| 334 |
+
|
| 335 |
+
<br>
|
| 336 |
+
|
| 337 |
+
OpenLLaVA is built to be extended. If you patch a model family that isn't supported yet, the contribution belongs in the framework. If you find a faster kernel implementation, open a PR.
|
| 338 |
+
|
| 339 |
+
The project is maintained by OpceanAI but owned by the community.
|
| 340 |
+
|
| 341 |
+
```bash
git clone https://github.com/OpceanAI/openllava
cd openllava
pip install -e ".[dev]"
```
|
| 346 |
+
|
| 347 |
+
<br>
|
| 348 |
+
|
| 349 |
+
---
|
| 350 |
+
|
| 351 |
+
<br>
|
| 352 |
+
|
| 353 |
+
<div align="center">
|
| 354 |
+
|
| 355 |
+
## Built by OpceanAI
|
| 356 |
+
|
| 357 |
+
</div>
|
| 358 |
+
|
| 359 |
+
<br>
|
| 360 |
+
|
| 361 |
+
<div align="center">
|
| 362 |
+
|
| 363 |
+
OpenLLaVA is the vision infrastructure layer of [OpceanAI](https://huggingface.co/OpceanAI) — an independent AI research organization operating with no institutional funding, no cloud compute budget, and no team. Every model in the OpceanAI vision pipeline is trained on Google Colab Pro and validated on consumer hardware.
|
| 364 |
+
|
| 365 |
+
<br>
|
| 366 |
+
|
| 367 |
+
[](https://huggingface.co/OpceanAI)
|
| 368 |
+
|
| 369 |
+
[](https://huggingface.co/OpceanAI)
|
| 370 |
+
|
| 371 |
+
[Sponsor](https://github.com/sponsors/aguitauwu)
|
| 372 |
+
|
| 373 |
+
<br>
|
| 374 |
+
|
| 375 |
+
---
|
| 376 |
+
|
| 377 |
+
<br>
|
| 378 |
+
|
| 379 |
+
**Open framework. Open models. Zero budget. Measurable results.**
|
| 380 |
+
|
| 381 |
+
<br>
|
| 382 |
+
|
| 383 |
+
[GitHub](https://github.com/OpceanAI/openllava)
|
| 384 |
+
|
| 385 |
+
<br>
|
| 386 |
+
|
| 387 |
+
*The fastest path from any language model to a vision-language model.*
|
| 388 |
+
|
| 389 |
+
</div>
|