---
title: README
colorFrom: purple
colorTo: indigo
sdk: static
pinned: false
license: apache-2.0
---
<div align="center">
<br>
<img src="https://img.shields.io/badge/OpenLLaVA-v3.0.0-7C4DFF?style=for-the-badge&labelColor=0A0A0A" alt="v3.0.0">
<img src="https://img.shields.io/badge/License-Apache--2.0-7C4DFF?style=for-the-badge&labelColor=0A0A0A" alt="License">
<img src="https://img.shields.io/badge/Python-3.10+-3776AB?style=for-the-badge&labelColor=0A0A0A&logo=python&logoColor=3776AB" alt="Python">
<img src="https://img.shields.io/badge/PyTorch-2.3+-EE4C2C?style=for-the-badge&labelColor=0A0A0A&logo=pytorch&logoColor=EE4C2C" alt="PyTorch">
<br><br>
<img src="https://img.shields.io/badge/CUDA-8.0%2B-76B900?style=for-the-badge&labelColor=0A0A0A&logo=nvidia&logoColor=76B900" alt="CUDA">
<img src="https://img.shields.io/badge/ROCm-AMD-ED2B23?style=for-the-badge&labelColor=0A0A0A" alt="ROCm">
<img src="https://img.shields.io/badge/TPU-Google-4285F4?style=for-the-badge&labelColor=0A0A0A" alt="TPU">
<img src="https://img.shields.io/badge/MLX-Apple-555555?style=for-the-badge&labelColor=0A0A0A&logo=apple&logoColor=white" alt="MLX">
<img src="https://img.shields.io/badge/XPU-Intel-0071C5?style=for-the-badge&labelColor=0A0A0A&logo=intel&logoColor=0071C5" alt="XPU">
<br><br>
# Inject Vision Into Any Language Model.
**Open-source framework for adding multimodal vision capabilities to any HuggingFace LLM.**<br>
**Architecture-agnostic. Multi-backend. Production-ready. Built by [OpceanAI](https://huggingface.co/OpceanAI).**
<br>
[GitHub](https://github.com/OpceanAI/openllava)
[Hugging Face](https://huggingface.co/Openllava)
[Sponsor](https://github.com/sponsors/aguitauwu)
<br>
</div>
## What is OpenLLaVA?
**OpenLLaVA** is a comprehensive open-source framework for injecting vision capabilities into any language model. It provides a complete pipeline, from model construction through training, inference, serving, export, and evaluation, all accessible through a unified Python API and CLI.
The framework supports any LLM architecture (Llama, Mistral, Qwen, Gemma, Phi, DeepSeek, and more) and any HuggingFace-compatible vision encoder. It automatically detects model dimensions, constructs the appropriate projector, patches the tokenizer with visual tokens, and configures the full training and inference pipelines.
The central design goal: **when a new language model drops, you should have a vision version in 48 hours.**
> OpenLLaVA is backend-agnostic. The same code runs on CUDA, ROCm, Apple MLX, Intel XPU, Google TPU, and CPU, with automatic hardware detection and optimal configuration selection.
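As an illustration, `Backend.AUTO`-style hardware detection can be sketched roughly as below. This is a hypothetical helper; the probe order and return names are illustrative, not OpenLLaVA's actual selection logic.

```python
# Hypothetical sketch of automatic backend detection.
# Probe order and names are illustrative, not OpenLLaVA's real logic.
def detect_backend() -> str:
    try:
        import torch
    except ImportError:
        return "cpu"
    if torch.cuda.is_available():
        # torch.version.hip is set on ROCm builds of PyTorch
        return "rocm" if torch.version.hip else "cuda"
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"  # Apple Silicon; MLX itself is a separate runtime
    return "cpu"
```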
<br>
## Quickstart
```bash
pip install openllava # Core
pip install openllava[cli] # With CLI tools
pip install openllava[serve] # With serving
pip install openllava[all] # Full installation
```
### Inject Vision Into Any LLM
```python
from openllava import OpenLLaVA, Backend
model = OpenLLaVA(
    llm="meta-llama/Llama-3-8B",
    vision_encoder="google/siglip2-so400m-patch14-384",
    backend=Backend.AUTO,
)
```
OpenLLaVA auto-detects hidden dimensions, builds the projector, and patches the tokenizer. No boilerplate. No config files.
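The auto-detection step amounts to reading each model's configuration and sizing the projector accordingly. A minimal sketch, using a hypothetical helper over plain config dicts rather than OpenLLaVA's internals:

```python
# Hypothetical sketch of hidden-dimension auto-detection.
# Real HuggingFace configs expose these values via AutoConfig; plain dicts
# are used here so the sketch stays self-contained.
def detect_dims(llm_config: dict, vision_config: dict) -> tuple[int, int]:
    """Return (vision_dim, llm_dim), tolerating common config key names."""
    llm_dim = llm_config.get("hidden_size") or llm_config.get("n_embd")
    vision_dim = vision_config.get("hidden_size") or vision_config.get("projection_dim")
    return vision_dim, llm_dim
```

From these two numbers the framework can build a projector mapping vision features into the LLM's embedding space without any user-supplied configuration.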
### Train with LoRA
```python
model.lora(r=64, alpha=128, dropout=0.05)
model.train(
    phase1=dict(dataset="liuhaotian/LLaVA-Pretrain", samples=100_000),
    phase2=dict(dataset="liuhaotian/LLaVA-Instruct-150K", learning_rate=2e-4),
    resume=True,
)
model.push("my-org/my-vision-model")
```
### FastVisionModel API
```python
from openllava.api import FastVisionModel
model, tokenizer = FastVisionModel.from_pretrained(
    "Openllava/Yaki",
    max_seq_length=2048,
    load_in_4bit=True,
)
model = FastVisionModel.get_peft_model(model, r=16, alpha=32)
```
### Serve as OpenAI-Compatible API
```bash
openllava serve Openllava/Yaki --port 8000
```
```python
from openai import OpenAI
client = OpenAI(api_key="openllava", base_url="http://localhost:8000/v1")
response = client.chat.completions.create(
    model="yaki",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
        ],
    }],
)
```
<br>
## Key Features
<table>
<tr>
<td width="50%" valign="top">
**Model Construction**
- Vision injection into any HuggingFace LLM in 3 lines
- AnyRes dynamic high-resolution with patch grouping
- YakiProjector: configurable MLP alignment
- Auto-detects hidden dimensions, attention heads, vocabulary size
- Supports LoRA-patched models
**Training Pipeline**
- 3-phase training: alignment, instruction tuning, RL alignment
- LoRA, LoRA+, DoRA, QLoRA, Split LoRA, LoRAGA, LoRAFA
- BitNet ternary training (b1.58)
- MoE + LoRA fusion
- FP8 training on H100
- Padding-free and sequence packing
- Curriculum learning
**RL Alignment**
- DPO, GRPO, ORPO, PPO
- Composable reward functions
- Visual reasoning reward support
</td>
<td width="50%" valign="top">
**Inference and Serving**
- Continuous batching
- PagedAttention (4x memory efficiency)
- Speculative decoding (Eagle, Medusa, NGram)
- KV cache: quantization, eviction, compression
- OpenAI-compatible FastAPI server
- Streaming support
**Optimization Suite (40+)**
- torch.compile full-graph compilation
- GPTQ / AWQ / FP4 / NVFP4 quantization
- GaLore gradient projection
- torchao integration
- EMA training stability
- Selective activation checkpointing
**Distributed Training**
- FSDP2, DeepSpeed ZeRO (stages 0-3)
- Tensor, Pipeline, Expert parallelism
- Ring Attention for long context
- Heterogeneous GPU + CPU + TPU training
- Auto-parallelism detection
</td>
</tr>
</table>
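The composable reward functions listed under RL Alignment could be sketched as below. The API is hypothetical (names and signatures are not OpenLLaVA's actual interface); the point is that rewards are plain callables combined by weighted sum.

```python
# Hypothetical sketch of composable reward functions for RL alignment.
# Each reward scores one completion string; combine() takes weighted rewards.
from typing import Callable

Reward = Callable[[str], float]

def length_penalty(max_len: int = 512) -> Reward:
    """Penalize completions longer than max_len, scaled by the overshoot."""
    return lambda text: -max(0, len(text) - max_len) / max_len

def keyword_bonus(keyword: str, bonus: float = 1.0) -> Reward:
    """Reward completions containing a marker, e.g. a <think> block."""
    return lambda text: bonus if keyword in text else 0.0

def combine(*weighted: tuple[float, Reward]) -> Reward:
    """Weighted sum of component rewards."""
    return lambda text: sum(w * r(text) for w, r in weighted)
```

A combined reward such as `combine((1.0, keyword_bonus("<think>")), (0.5, length_penalty(512)))` could then be handed to a DPO/GRPO-style trainer as a single callable.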
<br>
## Multi-Backend Support
| Backend | Hardware | Status |
|:--------|:---------|:-------|
| CUDA | NVIDIA GPUs (Ampere, Ada, Hopper, Blackwell) | Production |
| ROCm | AMD GPUs (MI250, MI300X, RX 7000) | Production |
| CPU FP32 | Any x86/x64 CPU (AVX-512, AVX2, NEON) | Production |
| TPU (XLA/SPMD) | Google TPU v3-v5 | Beta |
| MLX | Apple Silicon M1-M4 | Beta |
| XPU | Intel Arc, Data Center GPU | Beta |
| Heterogeneous | GPU + CPU + TPU mixed | Beta |
<br>
## Stack
| Layer | Technology | Purpose |
|:------|:----------:|:--------|
| CUDA Kernels | C/CUDA | Fused projector ops, cross-attention, VQ lookup |
| Core | C++ | Memory management, tensor routing, async streams |
| Bindings | pybind11 | C++ to Python bridge |
| Triton | OpenAI Triton | Fused attention, RoPE, SwiGLU, RMSNorm |
| API | Python | Public interface, FastVisionModel, Trainer |
| Backends | CUDA/ROCm/MLX/TPU/XPU | Hardware abstraction |
| Export | GGUF/ONNX/SafeTensors/vLLM/MLX | Deployment formats |
<br>
## Architecture
**Image + Text** feeds into a **Vision Encoder** (SigLIP2, CLIP, DINOv2, or any HuggingFace encoder), whose patch features are passed through the **YakiProjector** (Patch Grouping 3x3 + MLP 2-layer, mapping `vision_dim x 9` to `llm_dim`). The projected embeddings are merged with text embeddings and passed to the **Language Model** (any `AutoModelForCausalLM`, with QLoRA 4-bit NF4 and LoRA r=64), which generates text output including `<think>` reasoning blocks when applicable.
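A minimal sketch of the projector described above, assuming `torch` is available. Layer choices (GELU, plain `nn.Linear`) are illustrative; the real YakiProjector implementation may differ.

```python
import torch
import torch.nn as nn

class YakiProjector(nn.Module):
    """Sketch of the projector above: group each 3x3 patch neighborhood,
    then map the concatenated features (vision_dim * 9) to llm_dim
    with a 2-layer MLP. Layer choices are illustrative."""

    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim * 9, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (batch, H, W, vision_dim) with H and W divisible by 3
        b, h, w, d = patches.shape
        grouped = patches.reshape(b, h // 3, 3, w // 3, 3, d)
        # Collapse each 3x3 neighborhood into one token of size 9 * vision_dim
        grouped = grouped.permute(0, 1, 3, 2, 4, 5).reshape(b, (h // 3) * (w // 3), 9 * d)
        return self.mlp(grouped)
```

For a SigLIP-style 384px/patch14 encoder (a 27x27 patch grid), this grouping reduces 729 patch features to 81 visual tokens in the LLM's embedding space, a 9x cut in sequence length.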
<br>
## Yadis Architecture
Yadis is OpenLLaVA's flagship multimodal architecture and the long-term evolution of the framework, combining discrete visual tokens, MLP projection, and cross-attention per LLM layer.
```python
# Yadis Routing: multiple vision experts with MoE router
from openllava import OpenLLaVA, experts

model = OpenLLaVA(
    llm="OpceanAI/OwO-32B",
    architecture="yadis_routing",
    experts=[
        experts.Visual("google/siglip2-so400m-patch14-384"),
        experts.OCR("deepseek-ai/DeepSeek-OCR-2"),
    ],
)

# Yadis Full: discrete tokens + cross-attention per layer
model = OpenLLaVA(
    llm="OpceanAI/OwO-32B",
    architecture="yadis_full",
    vision_encoder="google/siglip2-so400m-patch14-384",
)
```
| Mode | Description |
|:-----|:------------|
| `llava` | LLaVA-style MLP projection (default) |
| `yadis_routing` | Multiple expert encoders with MoE router |
| `yadis_full` | Discrete visual tokens with cross-attention per layer |
<br>
## OpceanAI Vision Models
OpceanAI uses OpenLLaVA to publish vision versions of new language models within 48 hours of release.
<table>
<tr>
<td width="33%" valign="top">
**Yaki v1**
Vision-language model built on Yuuki RxG 8B. Designed for complex visual reasoning with bilingual support (ES/EN). Preserves the `<think>` chain-of-thought behavior of the base model for multimodal tasks.
Base: DeepSeek-R1-Qwen3-8B fine-tune<br>
Encoder: SigLIP 2 SO400M<br>
LoRA: r=64, alpha=128
[Yaki on Hugging Face](https://huggingface.co/Openllava/Yaki)
</td>
<td width="33%" valign="top">
**Yaki v2** *(planned)*
Built on Yuuki ExG 14B with cross-attention architecture (OpenLLaVA v4).
</td>
<td width="33%" valign="top">
**Yaki v3** *(planned)*
Built on OwO 32B with full Yadis routing architecture, combining visual and OCR expert encoders.
</td>
</tr>
</table>
<br>
## Philosophy
<table>
<tr>
<td width="50%" valign="top">
**Architecture Agnostic by Design**
Every existing multimodal framework is hardcoded to specific model families. OpenLLaVA is not. The projector adapts to any hidden dimension. The patcher works on any causal LM. The training engine handles any tokenizer.
**Speed Over Ceremony**
When a new model is released, the window to publish a vision version is 48 to 72 hours. OpenLLaVA is designed for that constraint: minimal configuration, automated phase management, one-command training.
</td>
<td width="50%" valign="top">
**Low Level Where It Matters**
The projector is the critical path. The CUDA kernel for the fused MLP and the C++ memory manager exist because training throughput on a single GPU is the binding constraint for a zero-budget research organization.
**Fully Open**
Apache 2.0. No gating. No commercial restrictions. The framework exists so that any researcher, with any model, any hardware, any budget, can build a competitive vision-language model.
</td>
</tr>
</table>
<br>
## Roadmap
| Version | Features | Status |
|:--------|:---------|:-------|
| v1 - v3 | LLaVA-style, QLoRA, AnyRes, 3-phase pipeline, multi-backend | Released |
| v4 - v5 | CUDA kernels, GGUF vision export, CPU offloading, cross-attention | Active |
| v6 - v7 | Discrete visual tokens (VQ-VAE), multi-expert routing | Planned |
| v8 - v9 | Video support, hybrid architectures | Planned |
| v10 | Yadis complete, omnimodal preparation | Planned |
<br>
<div align="center">
## Built by OpceanAI
OpenLLaVA is the vision infrastructure layer of [OpceanAI](https://huggingface.co/OpceanAI), an independent AI research organization operating with no institutional funding, no cloud compute budget, and no team. Every model in the OpceanAI vision pipeline is trained on consumer hardware and validated on standard benchmarks.
<br>
[Hugging Face](https://huggingface.co/OpceanAI)
[GitHub](https://github.com/OpceanAI/openllava)
[Sponsor](https://github.com/sponsors/aguitauwu)
<br>
**Open framework. Open models. Zero budget. Measurable results.**
*Inject vision into any language model.*
</div>