OpceanAI committed on
Commit
ef301a2
·
verified ·
1 Parent(s): 47ea145

Update README.md

Files changed (1)
  1. README.md +380 -1
README.md CHANGED
@@ -5,6 +5,385 @@ colorFrom: red
5
  colorTo: pink
6
  sdk: static
7
  pinned: false
 
8
  ---
 
9
 
10
- Edit this `README.md` markdown file to author your organization card.
5
  colorTo: pink
6
  sdk: static
7
  pinned: false
8
+ license: apache-2.0
9
  ---
10
+ <div align="center">
11
 
12
+ <br>
13
+
14
+ <img src="https://img.shields.io/badge/%F0%9F%91%81%EF%B8%8F-OPENLLAVA-0D1117?style=for-the-badge&labelColor=0D1117" alt="OpenLLaVA" height="60">
15
+
16
+ <br><br>
17
+
18
+ # Inject Vision Into Any Language Model.
19
+
20
+ **Open-source framework for adding multimodal vision capabilities to any HuggingFace LLM.**<br>
21
+ **Low-level. Fast. Free. Built by [OpceanAI](https://huggingface.co/OpceanAI).**
22
+
23
+ <br>
24
+
25
+ [![PyPI](https://img.shields.io/badge/PyPI-openllava-3775a9?style=for-the-badge&logo=pypi&logoColor=white)](https://pypi.org/project/openllava)
26
+ &nbsp;
27
+ [![HuggingFace](https://img.shields.io/badge/Models-Hugging_Face-ffd21e?style=for-the-badge&logo=huggingface&logoColor=black)](https://huggingface.co/OpenLLaVA)
28
+ &nbsp;
29
+ [![License](https://img.shields.io/badge/License-Apache_2.0-green?style=for-the-badge)](https://opensource.org/licenses/Apache-2.0)
30
+ &nbsp;
31
+ [![Discord](https://img.shields.io/badge/Discord-Community-5865F2?style=for-the-badge&logo=discord&logoColor=white)](https://discord.gg/openllava)
32
+
33
+ <br>
34
+
35
+ ---
36
+
37
+ <br>
38
+
39
+ </div>
40
+
41
+ ## What is OpenLLaVA?
42
+
43
+ **OpenLLaVA** is an open-source framework that injects vision capabilities into any language model: no architecture restrictions, no hardcoded backends, no compromises. It builds on the LLaVA-style projection architecture and is designed to extend it with custom CUDA kernels, a C++ core (both listed as planned in the Roadmap), and a clean Python API.
44
+
45
+ The framework is developed and maintained by **OpceanAI** as infrastructure for their vision model pipeline. Every model OpceanAI releases through OpenLLaVA feeds improvements back into the framework.
46
+
47
+ The central design goal: **when a new language model drops, you should have a vision version in 48 hours.**
48
+
49
+ <br>
50
+
51
+ ---
52
+
53
+ <br>
54
+
55
+ <div align="center">
56
+
57
+ ## Quickstart
58
+
59
+ </div>
60
+
61
+ <br>
62
+
63
+ ```bash
64
+ pip install openllava
65
+ ```
66
+
67
+ ```python
68
+ from openllava import patch_model
69
+ from transformers import AutoModelForCausalLM, AutoTokenizer
70
+
71
+ # Any HuggingFace model. Any vision encoder.
72
+ model = AutoModelForCausalLM.from_pretrained("your-org/your-llm")
73
+ tokenizer = AutoTokenizer.from_pretrained("your-org/your-llm")
74
+
75
+ model = patch_model(
76
+     model,
77
+     vision_encoder="google/siglip2-so400m-patch14-384",
78
+     projector_layers=3,
79
+ )
80
+ ```
81
+
82
+ That's it. OpenLLaVA auto-detects hidden dimensions, builds the projector, and patches the tokenizer. No boilerplate. No config files.
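
How the projector could be sized from the two hidden dimensions can be sketched in plain Python. This is a minimal sketch, not OpenLLaVA's code: `LinearShape` and `projector_shapes` are hypothetical names, the rule "every hidden layer is as wide as the LLM embedding" is an assumption borrowed from the LLaVA-1.5 MLP projector, and the 1152 and 4096 sizes are only illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinearShape:
    """Shape of one linear layer in the projector (hypothetical helper)."""
    in_features: int
    out_features: int

    @property
    def num_params(self) -> int:
        # Weight matrix plus bias vector.
        return self.in_features * self.out_features + self.out_features


def projector_shapes(encoder_dim: int, llm_dim: int, num_layers: int = 3) -> list[LinearShape]:
    """Layer shapes for an MLP mapping encoder features to LLM embeddings.

    Assumption: every layer after the first is llm_dim wide. The README does
    not state the sizing rule OpenLLaVA actually uses.
    """
    if num_layers < 1:
        raise ValueError("num_layers must be >= 1")
    dims = [encoder_dim] + [llm_dim] * num_layers
    return [LinearShape(a, b) for a, b in zip(dims, dims[1:])]


# Illustrative sizes: 1152-dim vision features, 4096-dim LLM embeddings.
shapes = projector_shapes(encoder_dim=1152, llm_dim=4096, num_layers=3)
total_params = sum(s.num_params for s in shapes)
print(shapes)
print(total_params)
```

Under these assumptions only the first layer depends on the vision encoder, so swapping encoders changes one weight matrix and leaves the rest of the projector shape untouched.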
83
+
84
+ <br>
85
+
86
+ ---
87
+
88
+ <br>
89
+
90
+ <div align="center">
91
+
92
+ ## Architecture
93
+
94
+ </div>
95
+
96
+ <br>
97
+
98
+ <table>
99
+ <tr>
100
+ <td width="50%" valign="top">
101
+
102
+ **Vision Encoder**
103
+
104
+ Any encoder from HuggingFace — SigLIP 2, CLIP, EVA-CLIP, InternViT. OpenLLaVA auto-reads the output dimension and handles tokenization regardless of encoder architecture.
105
+
106
+ <br>
107
+
108
+ **Projector Engine**
109
+
110
+ 3-layer MLP with GELU activation. A fused CUDA kernel for this op is planned (see Roadmap) and is intended to be faster than a naive PyTorch implementation. Hidden dimension auto-computed from encoder output → LLM input.
111
+
112
+ </td>
113
+ <td width="50%" valign="top">
114
+
115
+ **Model Patcher**
116
+
117
+ Patches any HuggingFace causal LM to accept vision tokens. Adds an `<image>` special token, extends the embedding layer, and wires the projector output into the LLM input stream. Supports LoRA-patched models.
118
+
119
+ <br>
120
+
121
+ **Training Engine**
122
+
123
+ Two-phase training built in. Phase 1: projector warmup with frozen LLM. Phase 2: joint fine-tuning with LoRA. Gradient checkpointing, Flash Attention 2, and bfloat16 enabled by default.
124
+
125
+ </td>
126
+ </tr>
127
+ </table>
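
The step of wiring projector output into the LLM input stream can be illustrated with a toy example. This is a sketch under stated assumptions, not the framework's implementation: `splice_image_features` and `toy_embed` are hypothetical names, and the approach shown (replacing the single `<image>` placeholder with one embedding per vision patch) is the usual LLaVA-style scheme, which the README does not confirm in detail.

```python
IMAGE_TOKEN = "<image>"  # placeholder token named in the README


def splice_image_features(tokens, text_embed, image_features):
    """Build the LLM input sequence, expanding each IMAGE_TOKEN.

    tokens:         list of token strings
    text_embed:     callable mapping a text token to its embedding vector
    image_features: projected vectors, one per vision patch
    """
    sequence = []
    for tok in tokens:
        if tok == IMAGE_TOKEN:
            # One input slot per projected patch vector.
            sequence.extend(image_features)
        else:
            sequence.append(text_embed(tok))
    return sequence


def toy_embed(tok):
    # Stand-in for the LLM embedding table: a 2-dim vector per token.
    return [float(len(tok)), 0.0]


# Toy data: three image patches, already projected to the LLM width (2).
patches = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
merged = splice_image_features(["Describe", IMAGE_TOKEN, "please"], toy_embed, patches)
print(len(merged))
```

The sequence grows by the number of patches minus one per image, which is why attention masks and labels have to be expanded in the same way during training.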
128
+
129
+ <br>
130
+
131
+ ---
132
+
133
+ <br>
134
+
135
+ <div align="center">
136
+
137
+ ## Stack
138
+
139
+ </div>
140
+
141
+ <br>
142
+
143
+ | Layer | Technology | Purpose |
144
+ |:------|:----------:|:--------|
145
+ | CUDA Kernels | C/CUDA | Fused projector ops, vision token attention |
146
+ | Core | C++ | Memory management, tensor routing |
147
+ | Bindings | pybind11 | C++ → Python bridge |
148
+ | API | Python | Public interface |
149
+ | Export | HuggingFace | Standard model format + GGUF |
150
+
151
+ <br>
152
+
153
+ ---
154
+
155
+ <br>
156
+
157
+ <div align="center">
158
+
159
+ ## Training Pipeline
160
+
161
+ </div>
162
+
163
+ <br>
164
+
165
+ ```python
166
+ from openllava import OpenLLaVATrainer
167
+
168
+ trainer = OpenLLaVATrainer(
169
+     model=model,
170
+     vision_encoder="google/siglip2-so400m-patch14-384",
171
+     pretrain_dataset="liuhaotian/LLaVA-Pretrain",  # Phase 1
172
+     instruct_dataset="liuhaotian/LLaVA-Instruct-150K",  # Phase 2
173
+     lora_r=64,
174
+     lora_alpha=128,
175
+     lora_target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
176
+ )
177
+
178
+ trainer.train() # Handles both phases automatically
179
+ ```
180
+
181
+ OpenLLaVA manages phase transitions, learning rate schedules, and checkpoint saving. You run one command.
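
The two phases can be written down as a small plan of what is trainable when. This is a minimal sketch, assuming the base LLM weights stay frozen in both phases and that the projector keeps training in phase 2; `PhasePlan`, `two_phase_plan`, and `lora_scaling` are hypothetical names, not part of the OpenLLaVA API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhasePlan:
    """Which parameter groups receive gradients in one training phase."""
    name: str
    dataset: str
    train_projector: bool
    train_lora: bool       # LoRA adapters on the LLM attention projections
    train_llm_base: bool   # assumption: base LLM weights are never updated


def two_phase_plan(pretrain_dataset: str, instruct_dataset: str) -> list[PhasePlan]:
    return [
        # Phase 1: projector warmup with the LLM frozen.
        PhasePlan("projector_warmup", pretrain_dataset, True, False, False),
        # Phase 2: joint fine-tuning of the projector and LoRA adapters.
        PhasePlan("joint_lora_finetune", instruct_dataset, True, True, False),
    ]


def lora_scaling(r: int, alpha: int) -> float:
    # Standard LoRA scales the low-rank update by alpha / r.
    return alpha / r


plan = two_phase_plan("liuhaotian/LLaVA-Pretrain", "liuhaotian/LLaVA-Instruct-150K")
scale = lora_scaling(r=64, alpha=128)
for phase in plan:
    print(phase)
print(scale)
```

With `lora_r=64` and `lora_alpha=128` as in the snippet above, the adapter update is scaled by 2.0.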
182
+
183
+ <br>
184
+
185
+ ---
186
+
187
+ <br>
188
+
189
+ <div align="center">
190
+
191
+ ## OpceanAI Vision Models
192
+
193
+ </div>
194
+
195
+ <br>
196
+
197
+ OpceanAI uses OpenLLaVA to publish vision versions of new language models as they release. These are the models built with the framework:
198
+
199
+ <br>
200
+
201
+ <table>
202
+ <tr>
203
+ <td width="50%" valign="top">
204
+
205
+ **Yaki YuuKi+ Vision** *(in development)*
206
+
207
+ Vision-language model built on Yuuki RxG 8B (DeepSeek-R1-Qwen2.5-8B fine-tune). Complex visual reasoning, bilingual (ES/EN), preserves the Yuuki `<think>` chain-of-thought behavior for multimodal tasks.
208
+
209
+ Vision encoder: SigLIP 2 SO400M · LoRA r=64
210
+
211
+ [![Status](https://img.shields.io/badge/Status-In_Development-orange?style=flat-square)](https://huggingface.co/OpceanAI)
212
+
213
+ </td>
214
+ <td width="50%" valign="top">
215
+
216
+ **Yuuki NxG VL**
217
+
218
+ 7B vision-language model fine-tuned from Qwen2.5-VL-7B-Instruct. It extends the NxG model family to multimodal tasks and is the first OpceanAI vision model, serving as the validation case for the OpenLLaVA pipeline.
219
+
220
+ [![Model](https://img.shields.io/badge/Yuuki_NxG_VL-HuggingFace-ffd21e?style=flat-square&logo=huggingface&logoColor=black)](https://huggingface.co/OpceanAI/Yuuki-NxG-vl)
221
+
222
+ </td>
223
+ </tr>
224
+ </table>
225
+
226
+ <br>
227
+
228
+ ---
229
+
230
+ <br>
231
+
232
+ <div align="center">
233
+
234
+ ## Philosophy
235
+
236
+ </div>
237
+
238
+ <br>
239
+
240
+ <table>
241
+ <tr>
242
+ <td width="50%" valign="top">
243
+
244
+ **Model-Agnostic by Design**
245
+
246
+ Established multimodal training frameworks such as LLaVA, LLaVA-NeXT, and InstructBLIP are tied to specific model families. OpenLLaVA is not. The projector adapts to any hidden dimension. The patcher works on any causal LM. The training engine handles any tokenizer.
247
+
248
+ </td>
249
+ <td width="50%" valign="top">
250
+
251
+ **Speed Over Ceremony**
252
+
253
+ When a new language model drops, the window to publish a vision version is 48–72 hours before the ecosystem moves on. OpenLLaVA is designed for that constraint — minimal configuration, automated phase management, one-command training.
254
+
255
+ </td>
256
+ </tr>
257
+ </table>
258
+
259
+ <table>
260
+ <tr>
261
+ <td width="50%" valign="top">
262
+
263
+ **Low Level Where It Matters**
264
+
265
+ The projector is the critical path. Everything else can be Python. The CUDA kernel for the fused MLP op and the C++ memory manager exist because training throughput on a single A100 is the binding constraint for a zero-budget lab.
266
+
267
+ </td>
268
+ <td width="50%" valign="top">
269
+
270
+ **Fully Open**
271
+
272
+ Apache 2.0. No gating. No commercial restrictions. The framework exists so that any researcher — with any model, any hardware, any budget — can build a competitive vision-language model.
273
+
274
+ </td>
275
+ </tr>
276
+ </table>
277
+
278
+ <br>
279
+
280
+ ---
281
+
282
+ <br>
283
+
284
+ <div align="center">
285
+
286
+ ## Roadmap
287
+
288
+ </div>
289
+
290
+ <br>
291
+
292
+ <table>
293
+ <tr>
294
+ <td width="50%" valign="top">
295
+
296
+ **Framework**
297
+
298
+ | Feature | Status |
299
+ |:--------|:------:|
300
+ | Python API + model patcher | In development |
301
+ | MLP projector (PyTorch) | In development |
302
+ | Two-phase training engine | In development |
303
+ | Fused CUDA projector kernel | Planned |
304
+ | C++ memory core | Planned |
305
+ | GGUF vision export | Planned |
306
+ | Multi-encoder support (BRAVE-style) | Planned |
307
+
308
+ </td>
309
+ <td width="50%" valign="top">
310
+
311
+ **Vision Models**
312
+
313
+ | Model | Status |
314
+ |:------|:------:|
315
+ | Yuuki NxG VL | Released |
316
+ | Yaki YuuKi+ Vision (8B) | In development |
317
+ | Community model pipeline | Planned |
318
+
319
+ </td>
320
+ </tr>
321
+ </table>
322
+
323
+ <br>
324
+
325
+ ---
326
+
327
+ <br>
328
+
329
+ <div align="center">
330
+
331
+ ## Contributing
332
+
333
+ </div>
334
+
335
+ <br>
336
+
337
+ OpenLLaVA is built to be extended. If you patch a model family that isn't supported yet, the contribution belongs in the framework. If you find a faster kernel implementation, open a PR.
338
+
339
+ The project is maintained by OpceanAI but owned by the community.
340
+
341
+ ```bash
342
+ git clone https://github.com/OpceanAI/openllava
343
+ cd openllava
344
+ pip install -e ".[dev]"
345
+ ```
346
+
347
+ <br>
348
+
349
+ ---
350
+
351
+ <br>
352
+
353
+ <div align="center">
354
+
355
+ ## Built by OpceanAI
356
+
357
+ </div>
358
+
359
+ <br>
360
+
361
+ <div align="center">
362
+
363
+ OpenLLaVA is the vision infrastructure layer of [OpceanAI](https://huggingface.co/OpceanAI) — an independent AI research organization operating with no institutional funding, no cloud compute budget, and no team. Every model in the OpceanAI vision pipeline is trained on Google Colab Pro and validated on consumer hardware.
364
+
365
+ <br>
366
+
367
+ [![OpceanAI](https://img.shields.io/badge/OpceanAI-Research-0D1117?style=for-the-badge)](https://huggingface.co/OpceanAI)
368
+ &nbsp;
369
+ [![HuggingFace](https://img.shields.io/badge/Models-Hugging_Face-ffd21e?style=for-the-badge&logo=huggingface&logoColor=black)](https://huggingface.co/OpceanAI)
370
+ &nbsp;
371
+ [![Sponsor](https://img.shields.io/badge/Sponsor-GitHub_Sponsors-ea4aaa?style=for-the-badge&logo=githubsponsors&logoColor=white)](https://github.com/sponsors/aguitauwu)
372
+
373
+ <br>
374
+
375
+ ---
376
+
377
+ <br>
378
+
379
+ **Open framework. Open models. Zero budget. Measurable results.**
380
+
381
+ <br>
382
+
383
+ [![OpenLLaVA](https://img.shields.io/badge/OpenLLaVA-2026-0D1117?style=for-the-badge)](https://github.com/OpceanAI/openllava)
384
+
385
+ <br>
386
+
387
+ *The fastest path from any language model to a vision-language model.*
388
+
389
+ </div>