---
title: README
colorFrom: purple
colorTo: indigo
sdk: static
pinned: false
license: apache-2.0
---

<div align="center">

<br>

<img src="https://img.shields.io/badge/OpenLLaVA-v3.0.0-7C4DFF?style=for-the-badge&labelColor=0A0A0A" alt="v3.0.0">
&nbsp;
<img src="https://img.shields.io/badge/License-Apache--2.0-7C4DFF?style=for-the-badge&labelColor=0A0A0A" alt="License">
&nbsp;
<img src="https://img.shields.io/badge/Python-3.10+-3776AB?style=for-the-badge&labelColor=0A0A0A&logo=python&logoColor=3776AB" alt="Python">
&nbsp;
<img src="https://img.shields.io/badge/PyTorch-2.3+-EE4C2C?style=for-the-badge&labelColor=0A0A0A&logo=pytorch&logoColor=EE4C2C" alt="PyTorch">

<br><br>

<img src="https://img.shields.io/badge/CUDA-8.0%2B-76B900?style=for-the-badge&labelColor=0A0A0A&logo=nvidia&logoColor=76B900" alt="CUDA">
&nbsp;
<img src="https://img.shields.io/badge/ROCm-AMD-ED2B23?style=for-the-badge&labelColor=0A0A0A" alt="ROCm">
&nbsp;
<img src="https://img.shields.io/badge/TPU-Google-4285F4?style=for-the-badge&labelColor=0A0A0A" alt="TPU">
&nbsp;
<img src="https://img.shields.io/badge/MLX-Apple-555555?style=for-the-badge&labelColor=0A0A0A&logo=apple&logoColor=white" alt="MLX">
&nbsp;
<img src="https://img.shields.io/badge/XPU-Intel-0071C5?style=for-the-badge&labelColor=0A0A0A&logo=intel&logoColor=0071C5" alt="XPU">

<br><br>

# Inject Vision Into Any Language Model.

**Open-source framework for adding multimodal vision capabilities to any HuggingFace LLM.**<br>
**Architecture-agnostic. Multi-backend. Production-ready. Built by [OpceanAI](https://huggingface.co/OpceanAI).**

<br>

[![GitHub](https://img.shields.io/badge/GitHub-OpceanAI%2Fopenllava-0D1117?style=for-the-badge&logo=github)](https://github.com/OpceanAI/openllava)
&nbsp;
[![HuggingFace](https://img.shields.io/badge/Models-Hugging_Face-ffd21e?style=for-the-badge&logo=huggingface&logoColor=black)](https://huggingface.co/Openllava)
&nbsp;
[![Sponsor](https://img.shields.io/badge/Sponsor-GitHub_Sponsors-ea4aaa?style=for-the-badge&logo=githubsponsors&logoColor=white)](https://github.com/sponsors/aguitauwu)

<br>

</div>

## What is OpenLLaVA?

**OpenLLaVA** is a comprehensive open-source framework for injecting vision capabilities into any language model. It provides a complete pipeline, from model construction through training, inference, serving, export, and evaluation, all accessible through a unified Python API and CLI.

The framework supports any LLM architecture (Llama, Mistral, Qwen, Gemma, Phi, DeepSeek, and more) and any HuggingFace-compatible vision encoder. It automatically detects model dimensions, constructs the appropriate projector, patches the tokenizer with visual tokens, and configures the full training and inference pipelines.

The central design goal: **when a new language model drops, you should have a vision version in 48 hours.**

> OpenLLaVA is backend-agnostic. The same code runs on CUDA, ROCm, Apple MLX, Intel XPU, Google TPU, and CPU, with automatic hardware detection and optimal configuration selection.

<br>

## Quickstart

```bash
pip install openllava        # Core
pip install openllava[cli]   # With CLI tools
pip install openllava[serve] # With serving
pip install openllava[all]   # Full installation
```

### Inject Vision Into Any LLM

```python
from openllava import OpenLLaVA, Backend

model = OpenLLaVA(
    llm="meta-llama/Llama-3-8B",
    vision_encoder="google/siglip2-so400m-patch14-384",
    backend=Backend.AUTO,
)
```

OpenLLaVA auto-detects hidden dimensions, builds the projector, and patches the tokenizer. No boilerplate. No config files.

### Train with LoRA

```python
model.lora(r=64, alpha=128, dropout=0.05)

model.train(
    phase1=dict(dataset="liuhaotian/LLaVA-Pretrain", samples=100_000),
    phase2=dict(dataset="liuhaotian/LLaVA-Instruct-150K", learning_rate=2e-4),
    resume=True,
)

model.push("my-org/my-vision-model")
```

### FastVisionModel API

```python
from openllava.api import FastVisionModel

model, tokenizer = FastVisionModel.from_pretrained(
    "Openllava/Yaki",
    max_seq_length=2048,
    load_in_4bit=True,
)

model = FastVisionModel.get_peft_model(model, r=16, alpha=32)
```

### Serve as OpenAI-Compatible API

```bash
openllava serve Openllava/Yaki --port 8000
```

```python
from openai import OpenAI

client = OpenAI(api_key="openllava", base_url="http://localhost:8000/v1")

response = client.chat.completions.create(
    model="yaki",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
        ],
    }],
)
```

<br>

## Key Features

<table>
<tr>
<td width="50%" valign="top">

**Model Construction**
- Vision injection into any HuggingFace LLM in 3 lines
- AnyRes dynamic high-resolution with patch grouping
- YakiProjector: configurable MLP alignment
- Auto-detects hidden dimensions, attention heads, vocabulary size
- Supports LoRA-patched models

**Training Pipeline**
- 3-phase training: alignment, instruction tuning, RL alignment
- LoRA, LoRA+, DoRA, QLoRA, Split LoRA, LoRAGA, LoRAFA
- BitNet ternary training (b1.58)
- MoE + LoRA fusion
- FP8 training on H100
- Padding-free and sequence packing
- Curriculum learning

**RL Alignment**
- DPO, GRPO, ORPO, PPO
- Composable reward functions
- Visual reasoning reward support

</td>
<td width="50%" valign="top">

**Inference and Serving**
- Continuous batching
- PagedAttention (4x memory efficiency)
- Speculative decoding (Eagle, Medusa, NGram)
- KV cache: quantization, eviction, compression
- OpenAI-compatible FastAPI server
- Streaming support

**Optimization Suite (40+)**
- torch.compile full-graph compilation
- GPTQ / AWQ / FP4 / NVFP4 quantization
- GaLore gradient projection
- torchao integration
- EMA training stability
- Selective activation checkpointing

**Distributed Training**
- FSDP2, DeepSpeed ZeRO (stages 0-3)
- Tensor, Pipeline, Expert parallelism
- Ring Attention for long context
- Heterogeneous GPU + CPU + TPU training
- Auto-parallelism detection

</td>
</tr>
</table>
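
The composable reward functions listed under RL Alignment can be illustrated with a small sketch. This is a hypothetical example, not the actual OpenLLaVA API: the names `length_penalty`, `mentions_image`, and `compose` are assumptions made for illustration.

```python
# Hypothetical sketch of composable reward functions for RL alignment.
# These names are illustrative; the real OpenLLaVA reward API may differ.

def length_penalty(completion: str) -> float:
    """Mildly discourage overly long completions."""
    return -0.001 * len(completion)

def mentions_image(completion: str) -> float:
    """Toy visual-grounding signal: reward references to the image."""
    return 1.0 if "image" in completion.lower() else 0.0

def compose(*fns, weights=None):
    """Combine several reward functions into one weighted-sum reward."""
    weights = weights or [1.0] * len(fns)
    def reward(completion: str) -> float:
        return sum(w * f(completion) for w, f in zip(weights, fns))
    return reward

reward = compose(length_penalty, mentions_image)
print(round(reward("The image shows a cat."), 3))  # -> 0.978
```

A composed reward of this shape would then be handed to a DPO/GRPO-style trainer, which scores sampled completions and optimizes the policy against the weighted sum.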

<br>

## Multi-Backend Support

| Backend | Hardware | Status |
|:--------|:---------|:-------|
| CUDA | NVIDIA GPUs (Ampere, Ada, Hopper, Blackwell) | Production |
| ROCm | AMD GPUs (MI250, MI300X, RX 7000) | Production |
| CPU FP32 | x86-64 (AVX-512, AVX2) and ARM (NEON) CPUs | Production |
| TPU (XLA/SPMD) | Google TPU v3-v5 | Beta |
| MLX | Apple Silicon M1-M4 | Beta |
| XPU | Intel Arc, Data Center GPU | Beta |
| Heterogeneous | GPU + CPU + TPU mixed | Beta |
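
Conceptually, `Backend.AUTO` can be thought of as a priority scan over the available accelerators, falling back to CPU. The sketch below is purely illustrative; OpenLLaVA's actual detection logic and priority order are assumptions here.

```python
# Illustrative sketch of AUTO backend selection as a priority scan.
# The ordering and the probing flags are assumptions, not OpenLLaVA's code.

def detect_backend(cuda=False, rocm=False, mlx=False, xpu=False):
    """Return the first available backend in priority order, else CPU."""
    for name, available in [("cuda", cuda), ("rocm", rocm),
                            ("mlx", mlx), ("xpu", xpu)]:
        if available:
            return name
    return "cpu"

print(detect_backend(rocm=True))  # -> rocm
print(detect_backend())           # -> cpu
```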

<br>

## Stack

| Layer | Technology | Purpose |
|:------|:----------:|:--------|
| CUDA Kernels | C/CUDA | Fused projector ops, cross-attention, VQ lookup |
| Core | C++ | Memory management, tensor routing, async streams |
| Bindings | pybind11 | C++ to Python bridge |
| Triton | OpenAI Triton | Fused attention, RoPE, SwiGLU, RMSNorm |
| API | Python | Public interface, FastVisionModel, Trainer |
| Backends | CUDA/ROCm/MLX/TPU/XPU | Hardware abstraction |
| Export | GGUF/ONNX/SafeTensors/vLLM/MLX | Deployment formats |

<br>

## Architecture

**Image + Text** feeds into a **Vision Encoder** (SigLIP2, CLIP, DINOv2, or any HuggingFace encoder), whose patch features pass through the **YakiProjector** (3x3 patch grouping followed by a 2-layer MLP, mapping `vision_dim x 9` to `llm_dim`). The projected embeddings are merged with the text embeddings and passed to the **Language Model** (any `AutoModelForCausalLM`, with 4-bit NF4 QLoRA and LoRA r=64), which generates text output, including `<think>` reasoning blocks when applicable.
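
The dimension arithmetic of the 3x3 patch-grouping step can be sketched in plain Python. This is an illustrative toy (list-based, with a tiny feature dimension); the real YakiProjector operates on batched tensors and follows the grouping with the 2-layer MLP.

```python
# Toy sketch of 3x3 patch grouping: each non-overlapping 3x3 block of
# patch feature vectors is concatenated into one vector, so the feature
# dimension grows from vision_dim to vision_dim * 9 before the MLP.

def group_patches(patches, grid, group=3):
    rows, cols = grid
    assert rows % group == 0 and cols % group == 0
    grouped = []
    for r in range(0, rows, group):
        for c in range(0, cols, group):
            block = []
            for dr in range(group):
                for dc in range(group):
                    block.extend(patches[(r + dr) * cols + (c + dc)])
            grouped.append(block)
    return grouped

# Example: a 27x27 patch grid with 4-dim features collapses to
# 81 grouped tokens of dimension 4 * 9 = 36.
patches = [[0.0] * 4 for _ in range(27 * 27)]
grouped = group_patches(patches, grid=(27, 27))
print(len(grouped), len(grouped[0]))  # -> 81 36
```

Grouping cuts the visual token count by 9x, which is why the MLP's input width is `vision_dim x 9` rather than `vision_dim`.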

<br>

## Yadis Architecture

Yadis is OpenLLaVA's flagship multimodal architecture: the long-term evolution of the framework, combining discrete visual tokens, MLP projection, and cross-attention per LLM layer.

```python
# Yadis Routing: multiple vision experts with MoE router
from openllava import OpenLLaVA, experts

model = OpenLLaVA(
    llm="OpceanAI/OwO-32B",
    architecture="yadis_routing",
    experts=[
        experts.Visual("google/siglip2-so400m-patch14-384"),
        experts.OCR("deepseek-ai/DeepSeek-OCR-2"),
    ],
)

# Yadis Full: discrete tokens + cross-attention per layer
model = OpenLLaVA(
    llm="OpceanAI/OwO-32B",
    architecture="yadis_full",
    vision_encoder="google/siglip2-so400m-patch14-384",
)
```

| Mode | Description |
|:-----|:------------|
| `llava` | LLaVA-style MLP projection (default) |
| `yadis_routing` | Multiple expert encoders with MoE router |
| `yadis_full` | Discrete visual tokens with cross-attention per layer |

<br>

## OpceanAI Vision Models

OpceanAI uses OpenLLaVA to publish vision versions of new language models within 48 hours of release.

<table>
<tr>
<td width="33%" valign="top">

**Yaki v1**

Vision-language model built on Yuuki RxG 8B. Designed for complex visual reasoning with bilingual support (ES/EN). Preserves the `<think>` chain-of-thought behavior of the base model for multimodal tasks.

Base: DeepSeek-R1-Qwen3-8B fine-tune<br>
Encoder: SigLIP 2 SO400M<br>
LoRA: r=64, alpha=128

[![Status](https://img.shields.io/badge/Status-Training-orange?style=flat-square)](https://huggingface.co/Openllava/Yaki)

</td>
<td width="33%" valign="top">

**Yaki v2** *(planned)*

Built on Yuuki ExG 14B with cross-attention architecture (OpenLLaVA v4).

</td>
<td width="33%" valign="top">

**Yaki v3** *(planned)*

Built on OwO 32B with full Yadis routing architecture, combining visual and OCR expert encoders.

</td>
</tr>
</table>

<br>

## Philosophy

<table>
<tr>
<td width="50%" valign="top">

**Architecture Agnostic by Design**

Every existing multimodal framework is hardcoded to specific model families. OpenLLaVA is not. The projector adapts to any hidden dimension. The patcher works on any causal LM. The training engine handles any tokenizer.

**Speed Over Ceremony**

When a new model is released, the window to publish a vision version is 48 to 72 hours. OpenLLaVA is designed for that constraint: minimal configuration, automated phase management, one-command training.

</td>
<td width="50%" valign="top">

**Low Level Where It Matters**

The projector is the critical path. The CUDA kernel for the fused MLP and the C++ memory manager exist because training throughput on a single GPU is the binding constraint for a zero-budget research organization.

**Fully Open**

Apache 2.0. No gating. No commercial restrictions. The framework exists so that any researcher, with any model, any hardware, any budget, can build a competitive vision-language model.

</td>
</tr>
</table>

<br>

## Roadmap

| Version | Features | Status |
|:--------|:---------|:-------|
| v1 - v3 | LLaVA-style, QLoRA, AnyRes, 3-phase pipeline, multi-backend | Released |
| v4 - v5 | CUDA kernels, GGUF vision export, CPU offloading, cross-attention | Active |
| v6 - v7 | Discrete visual tokens (VQ-VAE), multi-expert routing | Planned |
| v8 - v9 | Video support, hybrid architectures | Planned |
| v10 | Yadis complete, omnimodal preparation | Planned |

<br>

<div align="center">

## Built by OpceanAI

OpenLLaVA is the vision infrastructure layer of [OpceanAI](https://huggingface.co/OpceanAI), an independent AI research organization operating with no institutional funding, no cloud compute budget, and no team. Every model in the OpceanAI vision pipeline is trained on consumer hardware and validated on standard benchmarks.

<br>

[![OpceanAI](https://img.shields.io/badge/OpceanAI-Research-0D1117?style=for-the-badge)](https://huggingface.co/OpceanAI)
&nbsp;
[![GitHub](https://img.shields.io/badge/GitHub-OpceanAI-0D1117?style=for-the-badge&logo=github)](https://github.com/OpceanAI/openllava)
&nbsp;
[![Sponsor](https://img.shields.io/badge/Sponsor-GitHub_Sponsors-ea4aaa?style=for-the-badge&logo=githubsponsors&logoColor=white)](https://github.com/sponsors/aguitauwu)

<br>

**Open framework. Open models. Zero budget. Measurable results.**

[![OpenLLaVA](https://img.shields.io/badge/OpenLLaVA-v3.0.0-0D1117?style=for-the-badge)](https://github.com/OpceanAI/openllava)

*Inject vision into any language model.*

</div>