OpceanAI committed on
Commit 13dc5f5 · verified · 1 Parent(s): 2505b94

Update README.md

Files changed (1)
  1. README.md +240 -196
README.md CHANGED
@@ -1,12 +1,13 @@
1
  ---
2
  title: README
3
- emoji: 👀
4
- colorFrom: red
5
- colorTo: pink
6
  sdk: static
7
  pinned: false
8
  license: apache-2.0
9
  ---
 
10
  <div align="center">
11
 
12
  <br>
@@ -15,101 +16,191 @@ license: apache-2.0
15
 
16
  <br><br>
17
 
18
  # Inject Vision Into Any Language Model.
19
 
20
  **Open-source framework for adding multimodal vision capabilities to any HuggingFace LLM.**<br>
21
- **Low-level. Fast. Free. Built by [OpceanAI](https://huggingface.co/OpceanAI).**
22
 
 
23
 
24
- ---
 
 
 
 
25
 
26
  <br>
27
 
 
 
28
  </div>
29
 
30
  ## What is OpenLLaVA?
31
 
32
- **OpenLLaVA** is an open-source framework that injects vision capabilities into any language model: no architecture restrictions, no hardcoded backends, no compromises. It is built on the LLaVA-style projection architecture and extended with custom CUDA kernels, a C++ core, and a clean Python API.
33
 
34
- The framework is developed and maintained by **OpceanAI** as infrastructure for their vision model pipeline. Every model OpceanAI releases through OpenLLaVA feeds improvements back into the framework.
35
 
36
  The central design goal: **when a new language model drops, you should have a vision version in 48 hours.**
37
 
38
- <br>
39
-
40
- ---
41
 
42
  <br>
43
 
44
- <div align="center">
45
 
46
  ## Quickstart
47
 
48
- </div>
49
-
50
- <br>
51
-
52
  ```bash
53
- pip install openllava
 
 
 
54
  ```
55
 
56
- ```python
57
- from openllava import patch_model
58
- from transformers import AutoModelForCausalLM, AutoTokenizer
59
 
60
- # Any HuggingFace model. Any vision encoder.
61
- model = AutoModelForCausalLM.from_pretrained("your-org/your-llm")
62
- tokenizer = AutoTokenizer.from_pretrained("your-org/your-llm")
63
 
64
- model = patch_model(
65
- model,
66
  vision_encoder="google/siglip2-so400m-patch14-384",
67
- projector_layers=3,
68
  )
69
  ```
70
 
71
  That's it. OpenLLaVA auto-detects hidden dimensions, builds the projector, and patches the tokenizer. No boilerplate. No config files.
72
 
73
- <br>
74
 
75
- ---
 
76
 
77
- <br>
 
 
 
 
78
 
79
- <div align="center">
 
80
 
81
- ## Architecture
82
 
83
- </div>
 
84
 
85
- <br>
 
 
 
 
86
 
87
- <table>
88
- <tr>
89
- <td width="50%" valign="top">
90
 
91
- **Vision Encoder**
92
 
93
- Any encoder from HuggingFace: SigLIP 2, CLIP, EVA-CLIP, InternViT. OpenLLaVA auto-reads the output dimension and handles tokenization regardless of encoder architecture.
94
 
95
  <br>
96
 
97
- **Projector Engine**
98
 
99
- 3-layer MLP with GELU activation, implemented as a fused CUDA kernel. Faster than a naive PyTorch implementation by design. Hidden dimension auto-computed from encoder output → LLM input.
100
 
101
- </td>
 
102
  <td width="50%" valign="top">
103
 
104
- **Model Patcher**
105
-
106
- Patches any HuggingFace causal LM to accept vision tokens. Adds `<image>` special token, extends the embedding layer, and wires the projector output into the LLM input stream. Supports LoRA-patched models.
107
 
108
- <br>
109
-
110
- **Training Engine**
111
 
112
- Two-phase training built in. Phase 1: projector warmup with frozen LLM. Phase 2: joint fine-tuning with LoRA. Gradient checkpointing, Flash Attention 2, and bfloat16 enabled by default.
113
 
114
  </td>
115
  </tr>
@@ -119,94 +210,135 @@ Two-phase training built in. Phase 1: projector warmup with frozen LLM. Phase 2:
119
 
120
  ---
121
 
122
- <br>
123
 
124
- <div align="center">
 
 
 
 
 
 
 
 
125
 
126
- ## Stack
127
 
128
- </div>
129
 
130
- <br>
131
 
132
  | Layer | Technology | Purpose |
133
  |:------|:----------:|:--------|
134
- | CUDA Kernels | C/CUDA | Fused projector ops, vision token attention |
135
- | Core | C++ | Memory management, tensor routing |
136
  | Bindings | pybind11 | C++ → Python bridge |
137
- | API | Python | Public interface |
138
- | Export | HuggingFace | Standard model format + GGUF |
 
 
139
 
140
  <br>
141
 
142
  ---
143
 
144
- <br>
145
 
146
- <div align="center">
147
 
148
- ## Training Pipeline
149
 
150
- </div>
151
 
152
- <br>
 
 
153
 
154
  ```python
155
- from openllava import OpenLLaVATrainer
 
 
 
 
 
 
 
 
 
 
156
 
157
- trainer = OpenLLaVATrainer(
158
- model=model,
 
 
159
  vision_encoder="google/siglip2-so400m-patch14-384",
160
- pretrain_dataset="liuhaotian/LLaVA-Pretrain", # Phase 1
161
- instruct_dataset="liuhaotian/LLaVA-Instruct-150K", # Phase 2
162
- lora_r=64,
163
- lora_alpha=128,
164
- lora_target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
165
  )
166
-
167
- trainer.train() # Handles both phases automatically
168
  ```
169
 
170
- OpenLLaVA manages phase transitions, learning rate schedules, and checkpoint saving. You run one command.
 
 
 
 
171
 
172
  <br>
173
 
174
  ---
175
 
176
- <br>
177
-
178
- <div align="center">
179
-
180
  ## OpceanAI Vision Models
181
 
182
- </div>
183
-
184
- <br>
185
-
186
- OpceanAI uses OpenLLaVA to publish vision versions of new language models as they release. These are the models built with the framework:
187
-
188
- <br>
189
 
190
  <table>
191
  <tr>
192
- <td width="50%" valign="top">
193
 
194
- **Yaki YuuKi+ Vision** *(in development)*
195
 
196
- Vision-language model built on Yuuki RxG 8B (DeepSeek-R1-Qwen2.5-8B fine-tune). Complex visual reasoning, bilingual (ES/EN), preserves the Yuuki `<think>` chain-of-thought behavior for multimodal tasks.
197
 
198
- Vision encoder: SigLIP 2 SO400M · LoRA r=64
 
 
199
 
200
- [![Status](https://img.shields.io/badge/Status-In_Development-orange?style=flat-square)](https://huggingface.co/OpceanAI)
201
 
202
  </td>
203
- <td width="50%" valign="top">
 
 
 
 
204
 
205
- **Yuuki NxG VL**
 
206
 
207
- 7B vision-language model fine-tuned from Qwen2.5-VL-7B-Instruct. Extends the NxG model family to multimodal tasks. The first OpceanAI vision model and the validation case for the OpenLLaVA pipeline.
208
 
209
- [![Model](https://img.shields.io/badge/Yuuki_NxG_VL-HuggingFace-ffd21e?style=flat-square&logo=huggingface&logoColor=black)](https://huggingface.co/OpceanAI/Yuuki-NxG-vl)
210
 
211
  </td>
212
  </tr>
@@ -216,45 +348,26 @@ Vision encoder: SigLIP 2 SO400M · LoRA r=64
216
 
217
  ---
218
 
219
- <br>
220
-
221
- <div align="center">
222
-
223
  ## Philosophy
224
 
225
- </div>
226
-
227
- <br>
228
-
229
  <table>
230
  <tr>
231
  <td width="50%" valign="top">
232
 
233
- **Model-Agnostic by Design**
234
 
235
- Every major framework for multimodal training (LLaVA, LLaVA-Next, InstructBLIP) is hardcoded to specific model families. OpenLLaVA is not. The projector adapts to any hidden dimension. The patcher works on any causal LM. The training engine handles any tokenizer.
236
-
237
- </td>
238
- <td width="50%" valign="top">
239
 
240
  **Speed Over Ceremony**
241
 
242
- When a new language model drops, the window to publish a vision version is 48–72 hours before the ecosystem moves on. OpenLLaVA is designed for that constraint: minimal configuration, automated phase management, one-command training.
243
 
244
  </td>
245
- </tr>
246
- </table>
247
-
248
- <table>
249
- <tr>
250
  <td width="50%" valign="top">
251
 
252
  **Low Level Where It Matters**
253
 
254
- The projector is the critical path. Everything else can be Python. The CUDA kernel for the fused MLP op and the C++ memory manager exist because training throughput on a single A100 is the binding constraint for a zero-budget lab.
255
-
256
- </td>
257
- <td width="50%" valign="top">
258
 
259
  **Fully Open**
260
 
@@ -268,94 +381,31 @@ Apache 2.0. No gating. No commercial restrictions. The framework exists so that
268
 
269
  ---
270
 
271
- <br>
272
-
273
- <div align="center">
274
-
275
  ## Roadmap
276
 
277
- </div>
278
-
279
- <br>
280
-
281
- <table>
282
- <tr>
283
- <td width="50%" valign="top">
284
-
285
- **Framework**
286
-
287
- | Feature | Status |
288
- |:--------|:------:|
289
- | Python API + model patcher | In development |
290
- | MLP projector (PyTorch) | In development |
291
- | Two-phase training engine | In development |
292
- | Fused CUDA projector kernel | Planned |
293
- | C++ memory core | Planned |
294
- | GGUF vision export | Planned |
295
- | Multi-encoder support (BRAVE-style) | Planned |
296
-
297
- </td>
298
- <td width="50%" valign="top">
299
-
300
- **Vision Models**
301
-
302
- | Model | Status |
303
- |:------|:------:|
304
- | Yuuki NxG VL | Released |
305
- | Yaki YuuKi+ Vision (8B) | In development |
306
- | Community model pipeline | Planned |
307
-
308
- </td>
309
- </tr>
310
- </table>
311
 
312
  <br>
313
 
314
  ---
315
 
316
- <br>
317
-
318
- <div align="center">
319
-
320
- ## Contributing
321
-
322
- </div>
323
-
324
- <br>
325
-
326
- OpenLLaVA is built to be extended. If you patch a model family that isn't supported yet, the contribution belongs in the framework. If you find a faster kernel implementation, open a PR.
327
-
328
- The project is maintained by OpceanAI but owned by the community.
329
-
330
- ```bash
331
- git clone https://github.com/OpceanAI/openllava
332
- cd openllava
333
- pip install -e ".[dev]"
334
- ```
335
-
336
- <br>
337
-
338
- ---
339
-
340
- <br>
341
-
342
  <div align="center">
343
 
344
  ## Built by OpceanAI
345
 
346
- </div>
347
-
348
- <br>
349
-
350
- <div align="center">
351
-
352
- OpenLLaVA is the vision infrastructure layer of [OpceanAI](https://huggingface.co/OpceanAI), an independent AI research organization operating with no institutional funding, no cloud compute budget, and no team. Every model in the OpceanAI vision pipeline is trained on Google Colab Pro and validated on consumer hardware.
353
 
354
  <br>
355
 
356
  [![OpceanAI](https://img.shields.io/badge/OpceanAI-Research-0D1117?style=for-the-badge)](https://huggingface.co/OpceanAI)
357
  &nbsp;
358
- [![HuggingFace](https://img.shields.io/badge/Models-Hugging_Face-ffd21e?style=for-the-badge&logo=huggingface&logoColor=black)](https://huggingface.co/OpceanAI)
359
  &nbsp;
360
  [![Sponsor](https://img.shields.io/badge/Sponsor-GitHub_Sponsors-ea4aaa?style=for-the-badge&logo=githubsponsors&logoColor=white)](https://github.com/sponsors/aguitauwu)
361
 
@@ -363,16 +413,10 @@ OpenLLaVA is the vision infrastructure layer of [OpceanAI](https://huggingface.c
363
 
364
  ---
365
 
366
- <br>
367
-
368
  **Open framework. Open models. Zero budget. Measurable results.**
369
 
370
- <br>
371
-
372
- [![OpenLLaVA](https://img.shields.io/badge/OpenLLaVA-2026-0D1117?style=for-the-badge)](https://github.com/OpceanAI/openllava)
373
 
374
- <br>
375
 
376
- *The fastest path from any language model to a vision-language model.*
377
-
378
- </div>
 
1
  ---
2
  title: README
3
+ emoji: 👁️
4
+ colorFrom: purple
5
+ colorTo: indigo
6
  sdk: static
7
  pinned: false
8
  license: apache-2.0
9
  ---
10
+
11
  <div align="center">
12
 
13
  <br>
 
16
 
17
  <br><br>
18
 
19
+ <img src="https://img.shields.io/badge/OpenLLaVA-v3.0.0-7C4DFF?style=for-the-badge&labelColor=0A0A0A" alt="v3.0.0">
20
+ &nbsp;
21
+ <img src="https://img.shields.io/badge/License-Apache--2.0-7C4DFF?style=for-the-badge&labelColor=0A0A0A" alt="License">
22
+ &nbsp;
23
+ <img src="https://img.shields.io/badge/Python-3.10+-3776AB?style=for-the-badge&labelColor=0A0A0A&logo=python&logoColor=3776AB" alt="Python">
24
+ &nbsp;
25
+ <img src="https://img.shields.io/badge/PyTorch-2.3+-EE4C2C?style=for-the-badge&labelColor=0A0A0A&logo=pytorch&logoColor=EE4C2C" alt="PyTorch">
26
+
27
+ <br><br>
28
+
29
+ <img src="https://img.shields.io/badge/CUDA-8.0%2B-76B900?style=for-the-badge&labelColor=0A0A0A&logo=nvidia&logoColor=76B900" alt="CUDA">
30
+ &nbsp;
31
+ <img src="https://img.shields.io/badge/ROCm-AMD-ED2B23?style=for-the-badge&labelColor=0A0A0A" alt="ROCm">
32
+ &nbsp;
33
+ <img src="https://img.shields.io/badge/TPU-Google-4285F4?style=for-the-badge&labelColor=0A0A0A" alt="TPU">
34
+ &nbsp;
35
+ <img src="https://img.shields.io/badge/MLX-Apple-555555?style=for-the-badge&labelColor=0A0A0A&logo=apple&logoColor=white" alt="MLX">
36
+ &nbsp;
37
+ <img src="https://img.shields.io/badge/XPU-Intel-0071C5?style=for-the-badge&labelColor=0A0A0A&logo=intel&logoColor=0071C5" alt="XPU">
38
+
39
+ <br><br>
40
+
41
  # Inject Vision Into Any Language Model.
42
 
43
  **Open-source framework for adding multimodal vision capabilities to any HuggingFace LLM.**<br>
44
+ **Architecture-agnostic. Multi-backend. Production-ready. Built by [OpceanAI](https://huggingface.co/OpceanAI).**
45
 
46
+ <br>
47
 
48
+ [![GitHub](https://img.shields.io/badge/GitHub-OpceanAI%2Fopenllava-0D1117?style=for-the-badge&logo=github)](https://github.com/OpceanAI/openllava)
49
+ &nbsp;
50
+ [![HuggingFace](https://img.shields.io/badge/Models-Hugging_Face-ffd21e?style=for-the-badge&logo=huggingface&logoColor=black)](https://huggingface.co/Openllava)
51
+ &nbsp;
52
+ [![Sponsor](https://img.shields.io/badge/Sponsor-GitHub_Sponsors-ea4aaa?style=for-the-badge&logo=githubsponsors&logoColor=white)](https://github.com/sponsors/aguitauwu)
53
 
54
  <br>
55
 
56
+ ---
57
+
58
  </div>
59
 
60
  ## What is OpenLLaVA?
61
 
62
+ **OpenLLaVA** is a comprehensive open-source framework for injecting vision capabilities into any language model. It provides a complete pipeline, from model construction through training, inference, serving, export, and evaluation, all accessible through a unified Python API and CLI.
63
 
64
+ The framework supports any LLM architecture (Llama, Mistral, Qwen, Gemma, Phi, DeepSeek, and more) and any HuggingFace-compatible vision encoder. It automatically detects model dimensions, constructs the appropriate projector, patches the tokenizer with visual tokens, and configures the full training and inference pipelines.
65
 
66
  The central design goal: **when a new language model drops, you should have a vision version in 48 hours.**
67
 
68
+ > OpenLLaVA is backend-agnostic. The same code runs on CUDA, ROCm, Apple MLX, Intel XPU, Google TPU, and CPU, with automatic hardware detection and optimal configuration selection.
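
To make "automatic hardware detection" concrete, here is a minimal sketch of how a backend probe could look. It is not OpenLLaVA's `Backend.AUTO` logic, which this diff does not show. The function name `detect_backend` and the returned labels are hypothetical, and PyTorch's MPS device stands in for Apple hardware because the sketch does not assume MLX is installed.

```python
def detect_backend() -> str:
    """Return a backend label for the current machine (illustrative only)."""
    try:
        import torch
    except ImportError:
        return "cpu"  # no PyTorch available: fall back to CPU

    if torch.cuda.is_available():
        # ROCm builds of PyTorch expose AMD GPUs through the torch.cuda API
        # and set torch.version.hip to a version string.
        return "rocm" if getattr(torch.version, "hip", None) else "cuda"

    # torch.xpu only exists in recent PyTorch releases, so guard the lookup.
    xpu = getattr(torch, "xpu", None)
    if xpu is not None and hasattr(xpu, "is_available") and xpu.is_available():
        return "xpu"

    # Apple Silicon via PyTorch's Metal backend (stand-in for MLX here).
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"

    return "cpu"


print(detect_backend())
```

The checks run from most to least specialized accelerator, so a machine with several options reports the first match; a real implementation would probably also need a TPU probe, which is omitted here.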
 
 
69
 
70
  <br>
71
 
72
+ ---
73
 
74
  ## Quickstart
75
 
 
 
 
 
76
  ```bash
77
+ pip install openllava            # Core
+ pip install "openllava[cli]"     # With CLI tools
+ pip install "openllava[serve]"   # With serving
+ pip install "openllava[all]"     # Full installation
81
  ```
82
 
83
+ ### Inject Vision Into Any LLM
 
 
84
 
85
+ ```python
86
+ from openllava import OpenLLaVA, Backend

+ model = OpenLLaVA(
+     llm="meta-llama/Llama-3-8B",
      vision_encoder="google/siglip2-so400m-patch14-384",
+     backend=Backend.AUTO,
  )
93
  ```
94
 
95
  That's it. OpenLLaVA auto-detects hidden dimensions, builds the projector, and patches the tokenizer. No boilerplate. No config files.
96
 
97
+ ### Train with LoRA
98
 
99
+ ```python
100
+ model.lora(r=64, alpha=128, dropout=0.05)
+
+ model.train(
+     phase1=dict(dataset="liuhaotian/LLaVA-Pretrain", samples=100_000),
+     phase2=dict(dataset="liuhaotian/LLaVA-Instruct-150K", learning_rate=2e-4),
+     resume=True,
+ )
+
+ model.push("my-org/my-vision-model")
109
+ ```
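The `r=64, alpha=128` setting above can be read through standard LoRA arithmetic: the low-rank update is scaled by `alpha / r`, and each adapted weight matrix gains `r * (d_in + d_out)` trainable parameters. The sketch below works that out for one square projection. The helper `lora_stats` and the 4096-wide layer are illustrative assumptions, not values taken from OpenLLaVA.

```python
def lora_stats(d_in: int, d_out: int, r: int, alpha: int) -> dict:
    """Parameter count and scaling factor for one LoRA-adapted linear layer."""
    full_params = d_in * d_out          # frozen base weight W
    adapter_params = r * (d_in + d_out)  # A is (r x d_in), B is (d_out x r)
    return {
        "scaling": alpha / r,            # multiplier applied to B @ A
        "adapter_params": adapter_params,
        "full_params": full_params,
        "ratio": adapter_params / full_params,
    }


# Hypothetical 4096 x 4096 attention projection with the README's r and alpha.
stats = lora_stats(d_in=4096, d_out=4096, r=64, alpha=128)
print(stats)
```

Under these assumptions the adapter trains about 3% of the parameters of the layer it wraps, which is why LoRA fits on a single GPU where full fine-tuning would not.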
110
 
111
+ ### FastVisionModel API
112
 
113
+ ```python
114
+ from openllava.api import FastVisionModel
+
+ model, tokenizer = FastVisionModel.from_pretrained(
+     "Openllava/Yaki",
+     max_seq_length=2048,
+     load_in_4bit=True,
+ )
+
+ model = FastVisionModel.get_peft_model(model, r=16, alpha=32)
123
+ ```
 
124
 
125
+ ### Serve as OpenAI-Compatible API
126
 
127
+ ```bash
128
+ openllava serve Openllava/Yaki --port 8000
129
+ ```
130
+
131
+ ```python
132
+ from openai import OpenAI
+
+ client = OpenAI(api_key="openllava", base_url="http://localhost:8000/v1")
+
+ response = client.chat.completions.create(
+     model="yaki",
+     messages=[{
+         "role": "user",
+         "content": [
+             {"type": "text", "text": "What is in this image?"},
+             {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
+         ],
+     }],
+ )
146
+ ```
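The request above points at a remote image URL. The OpenAI chat format also allows an image to be embedded inline as a base64 `data:` URL, which is handy for local files. The sketch below builds such a message with the standard library only. The helper name `image_message` is hypothetical, and whether the OpenLLaVA server accepts data URLs is an assumption that this diff does not confirm.

```python
import base64
import json


def image_message(prompt: str, image_bytes: bytes, mime: str = "image/jpeg") -> dict:
    """Build a chat message whose image is embedded as a base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}},
        ],
    }


# Placeholder bytes; in practice read them with open(path, "rb").read().
fake_image = b"\xff\xd8\xff\xe0 not a real JPEG"
message = image_message("What is in this image?", fake_image)
print(json.dumps(message, indent=2))
```

The resulting dictionary can be passed in the `messages` list of `client.chat.completions.create` in place of the URL-based message.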
147
 
148
  <br>
149
 
150
+ ---
151
 
152
+ ## Key Features
153
 
154
+ <table>
155
+ <tr>
156
  <td width="50%" valign="top">
157
 
158
+ **Model Construction**
159
+ - Vision injection into any HuggingFace LLM in 3 lines
160
+ - AnyRes dynamic high-resolution with patch grouping
161
+ - YakiProjector: configurable MLP alignment
162
+ - Auto-detects hidden dims, attention heads, vocab size
163
+ - Supports LoRA-patched models
164
+
165
+ **Training Pipeline**
166
+ - 3-phase training: alignment → instruction → RL
167
+ - LoRA, LoRA+, DoRA, QLoRA, Split LoRA, LoRAGA, LoRAFA
168
+ - BitNet ternary training (b1.58)
169
+ - MoE + LoRA fusion
170
+ - FP8 training on H100
171
+ - Padding-free + sequence packing
172
+ - Curriculum learning
173
+
174
+ **RL Alignment**
175
+ - DPO, GRPO, ORPO, PPO
176
+ - Composable reward functions
177
+ - Visual reasoning reward support
178
 
179
+ </td>
180
+ <td width="50%" valign="top">
 
181
 
182
+ **Inference & Serving**
183
+ - Continuous batching
184
+ - PagedAttention (4x memory efficiency)
185
+ - Speculative decoding (Eagle, Medusa, NGram)
186
+ - KV cache: quantization, eviction, compression
187
+ - OpenAI-compatible FastAPI server
188
+ - Streaming support
189
+
190
+ **40+ Optimizations**
191
+ - torch.compile full-graph
192
+ - GPTQ / AWQ / FP4 / NVFP4
193
+ - GaLore gradient projection
194
+ - torchao integration
195
+ - EMA training stability
196
+ - Selective activation checkpointing
197
+
198
+ **Distributed Training**
199
+ - FSDP2, DeepSpeed ZeRO (0-3)
200
+ - Tensor, Pipeline, Expert parallelism
201
+ - Ring Attention (long context)
202
+ - Heterogeneous GPU+CPU+TPU training
203
+ - Auto-parallelism detection
204
 
205
  </td>
206
  </tr>
 
210
 
211
  ---
212
 
213
+ ## Multi-Backend Support
214
 
215
+ | Backend | Hardware | Status |
+ |:--------|:---------|:-------|
+ | CUDA | NVIDIA GPUs (Ampere, Ada, Hopper, Blackwell) | ✅ Production |
+ | ROCm | AMD GPUs (MI250, MI300X, RX 7000) | ✅ Production |
+ | CPU FP32 | Any x86-64 or ARM CPU (AVX-512, AVX2, NEON) | ✅ Production |
+ | TPU (XLA/SPMD) | Google TPU v3-v5 | 🔶 Beta |
+ | MLX | Apple Silicon M1-M4 | 🔶 Beta |
+ | XPU | Intel Arc, Data Center GPU | 🔶 Beta |
+ | Heterogeneous | GPU + CPU + TPU mixed | 🔶 Beta |
224
 
225
+ <br>
226
 
227
+ ---
228
 
229
+ ## Stack
230
 
231
  | Layer | Technology | Purpose |
232
  |:------|:----------:|:--------|
233
+ | CUDA Kernels | C/CUDA | Fused projector ops, cross-attention, VQ lookup |
234
+ | Core | C++ | Memory management, tensor routing, async streams |
235
  | Bindings | pybind11 | C++ → Python bridge |
236
+ | Triton | OpenAI Triton | Fused attention, RoPE, SwiGLU, RMSNorm |
237
+ | API | Python | Public interface, FastVisionModel, Trainer |
238
+ | Backends | CUDA/ROCm/MLX/TPU/XPU | Hardware abstraction |
239
+ | Export | GGUF/ONNX/SafeTensors/vLLM/MLX | Deployment formats |
240
 
241
  <br>
242
 
243
  ---
244
 
245
+ ## Architecture
246
 
247
+ ```
248
+ ┌─────────────────────────────────────────────────────────────┐
+ │                     OpenLLaVA Framework                     │
+ ├─────────────────────────────────────────────────────────────┤
+ │                                                             │
+ │  Input: Image + Text                                        │
+ │         │                                                   │
+ │  ┌──────▼──────────────────────────────────────────────┐    │
+ │  │  Vision Encoder (SigLIP2, CLIP, DINOv2, any HF)     │    │
+ │  └──────────────────────┬──────────────────────────────┘    │
+ │                         │ patch features                    │
+ │  ┌──────────────────────▼──────────────────────────────┐    │
+ │  │  YakiProjector: Patch Grouping 3×3 + MLP 2-layer    │    │
+ │  │  [vision_dim × 9] → [llm_dim]                       │    │
+ │  └──────────────────────┬──────────────────────────────┘    │
+ │                         │ vision embeddings                 │
+ │  ┌──────────────────────▼──────────────────────────────┐    │
+ │  │  Language Model (any AutoModelForCausalLM)          │    │
+ │  │  QLoRA 4-bit NF4 · LoRA r=64 · Flash Attention      │    │
+ │  └──────────────────────┬──────────────────────────────┘    │
+ │                         │                                   │
+ │  Output: Text + <think> reasoning blocks                    │
+ └─────────────────────────────────────────────────────────────┘
270
+ ```
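The `[vision_dim × 9] → [llm_dim]` step in the diagram implies that each 3×3 neighborhood of patch features is concatenated into one longer vector before the MLP. That cuts the visual token count by 9 while multiplying the feature width by 9. The sketch below shows only that grouping step, in plain Python on nested lists. It is an illustration of the idea rather than the YakiProjector implementation, and `group_patches` is a hypothetical name.

```python
def group_patches(grid, k=3):
    """Concatenate each k x k block of patch vectors into one longer vector.

    grid is a rows x cols nested list of feature vectors (lists of floats).
    Returns a flat list of (rows/k * cols/k) vectors, each k*k times longer.
    """
    rows, cols = len(grid), len(grid[0])
    if rows % k or cols % k:
        raise ValueError("grid size must be divisible by k")
    tokens = []
    for r0 in range(0, rows, k):
        for c0 in range(0, cols, k):
            merged = []
            # Walk the block row by row and append each patch's features.
            for r in range(r0, r0 + k):
                for c in range(c0, c0 + k):
                    merged.extend(grid[r][c])
            tokens.append(merged)
    return tokens


# Toy 6 x 6 grid of 4-dimensional patch features; each patch holds its index.
dim = 4
grid = [[[float(r * 6 + c)] * dim for c in range(6)] for r in range(6)]
tokens = group_patches(grid)
print(len(tokens), len(tokens[0]))
```

By the same arithmetic, a 27×27 patch grid would shrink from 729 tokens to 81, each nine times wider, before the MLP maps it to the LLM's hidden size.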
271
 
272
+ <br>
273
 
274
+ ---
275
 
276
+ ## Yadis Architecture
277
+
278
+ Yadis is OpenLLaVA's flagship multimodal architecture: the long-term evolution of the framework, combining discrete visual tokens, MLP projection, and cross-attention per LLM layer.
279
 
280
  ```python
281
+ # Yadis Routing: multiple vision experts with MoE router
+ from openllava import OpenLLaVA, experts
+
+ model = OpenLLaVA(
+     llm="OpceanAI/OwO-32B",
+     architecture="yadis_routing",
+     experts=[
+         experts.Visual("google/siglip2-so400m-patch14-384"),
+         experts.OCR("deepseek-ai/DeepSeek-OCR-2"),
+     ],
+ )

+ # Yadis Full: discrete tokens + cross-attention per layer
+ model = OpenLLaVA(
+     llm="OpceanAI/OwO-32B",
+     architecture="yadis_full",
      vision_encoder="google/siglip2-so400m-patch14-384",
  )
299
  ```
300
 
301
+ | Mode | Description |
302
+ |------|-------------|
303
+ | `llava` | LLaVA-style MLP projection (default) |
304
+ | `yadis_routing` | Multiple expert encoders + MoE router |
305
+ | `yadis_full` | Discrete tokens + cross-attention per layer |
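
The `yadis_routing` mode is described only as "multiple expert encoders + MoE router". As a rough illustration of what such a router computes, the sketch below applies generic top-k softmax gating to per-expert scores. This is the textbook mixture-of-experts gate, not the Yadis router; the function `route` and the expert names are hypothetical.

```python
import math


def route(scores: dict, top_k: int = 1) -> dict:
    """Softmax over expert scores, keep the top_k experts, renormalize."""
    peak = max(scores.values())
    # Subtract the max before exponentiating for numerical stability.
    exps = {name: math.exp(s - peak) for name, s in scores.items()}
    total = sum(exps.values())
    probs = {name: e / total for name, e in exps.items()}
    chosen = sorted(probs, key=probs.get, reverse=True)[:top_k]
    norm = sum(probs[name] for name in chosen)
    return {name: probs[name] / norm for name in chosen}


# Hypothetical router logits for one image region.
weights = route({"visual": 2.0, "ocr": 0.5}, top_k=2)
print(weights)
```

The returned weights would then mix the experts' feature vectors; with `top_k=1` only the highest-scoring encoder contributes.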
306
 
307
  <br>
308
 
309
  ---
310
 
 
 
 
 
311
  ## OpceanAI Vision Models
312
 
313
+ OpceanAI uses OpenLLaVA to publish vision versions of new language models within 48 hours of release.
 
 
 
 
 
 
314
 
315
  <table>
316
  <tr>
317
+ <td width="33%" valign="top">
318
 
319
+ **Yaki v1**
320
 
321
+ Vision-language model on Yuuki RxG 8B. Complex visual reasoning, bilingual ES/EN, preserves `<think>` chain-of-thought for multimodal tasks.
322
 
323
+ Base: DeepSeek-R1-Qwen3-8B fine-tune<br>
324
+ Encoder: SigLIP 2 SO400M<br>
325
+ LoRA: r=64, alpha=128
326
 
327
+ [![Status](https://img.shields.io/badge/Status-Training-orange?style=flat-square)](https://huggingface.co/Openllava/Yaki)
328
 
329
  </td>
330
+ <td width="33%" valign="top">
331
+
332
+ **Yaki v2** *(planned)*
333
+
334
+ Built on Yuuki ExG 14B with cross-attention architecture (OpenLLaVA v4).
335
 
336
+ </td>
337
+ <td width="33%" valign="top">
338
 
339
+ **Yaki v3** *(planned)*
340
 
341
+ Built on OwO 32B with full Yadis routing architecture. OCR + visual experts.
342
 
343
  </td>
344
  </tr>
 
348
 
349
  ---
350
 
 
 
 
 
351
  ## Philosophy
352
 
 
 
 
 
353
  <table>
354
  <tr>
355
  <td width="50%" valign="top">
356
 
357
+ **Architecture Agnostic by Design**
358
 
359
+ Every existing multimodal framework is hardcoded to specific model families. OpenLLaVA is not. The projector adapts to any hidden dimension. The patcher works on any causal LM. The training engine handles any tokenizer.
 
 
 
360
 
361
  **Speed Over Ceremony**
362
 
363
+ When a new model drops, the window to publish a vision version is 48–72 hours. OpenLLaVA is designed for that constraint: minimal configuration, automated phase management, one-command training.
364
 
365
  </td>
 
 
 
 
 
366
  <td width="50%" valign="top">
367
 
368
  **Low Level Where It Matters**
369
 
370
+ The projector is the critical path. The CUDA kernel for the fused MLP and the C++ memory manager exist because training throughput on a single GPU is the binding constraint for a zero-budget lab.
 
 
 
371
 
372
  **Fully Open**
373
 
 
381
 
382
  ---
383
 
 
 
 
 
384
  ## Roadmap
385
 
386
+ | Version | Features | Status |
+ |---------|----------|--------|
+ | **v1-v3** | LLaVA-style, QLoRA, AnyRes, 3-phase pipeline, multi-backend | ✅ Done |
+ | **v4-v5** | CUDA kernels, GGUF vision export, CPU offloading, cross-attention | 🔄 Active |
+ | **v6-v7** | Discrete visual tokens (VQ-VAE), multi-expert routing | 📋 Planned |
+ | **v8-v9** | Video support, hybrid architectures | 📋 Planned |
+ | **v10** | Yadis complete, omnimodal prep | 📋 Planned |
393
 
394
  <br>
395
 
396
  ---
397
 
398
  <div align="center">
399
 
400
  ## Built by OpceanAI
401
 
402
+ OpenLLaVA is the vision infrastructure layer of [OpceanAI](https://huggingface.co/OpceanAI), an independent AI research organization built from zero budget, consumer hardware, and measurable results.
403
 
404
  <br>
405
 
406
  [![OpceanAI](https://img.shields.io/badge/OpceanAI-Research-0D1117?style=for-the-badge)](https://huggingface.co/OpceanAI)
407
  &nbsp;
408
+ [![GitHub](https://img.shields.io/badge/GitHub-OpceanAI-0D1117?style=for-the-badge&logo=github)](https://github.com/OpceanAI/openllava)
409
  &nbsp;
410
  [![Sponsor](https://img.shields.io/badge/Sponsor-GitHub_Sponsors-ea4aaa?style=for-the-badge&logo=githubsponsors&logoColor=white)](https://github.com/sponsors/aguitauwu)
411
 
 
413
 
414
  ---
415
 
 
 
416
  **Open framework. Open models. Zero budget. Measurable results.**
417
 
418
+ [![OpenLLaVA](https://img.shields.io/badge/OpenLLaVA-v3.0.0-0D1117?style=for-the-badge)](https://github.com/OpceanAI/openllava)
 
 
419
 
420
+ *Inject vision into any language model.*
421
 
422
+ </div>