OpceanAI committed on
Commit 56d2bb9 · verified · 1 Parent(s): 13dc5f5

Update README.md

Files changed (1)
  1. README.md +35 -87
README.md CHANGED
@@ -1,6 +1,5 @@
 ---
 title: README
- emoji: 👁️
 colorFrom: purple
 colorTo: indigo
 sdk: static
@@ -12,10 +11,6 @@ license: apache-2.0
 
 <br>
 
- <img src="https://img.shields.io/badge/%F0%9F%91%81%EF%B8%8F-OPENLLAVA-0D1117?style=for-the-badge&labelColor=0D1117" alt="OpenLLaVA" height="60">
-
- <br><br>
-
 <img src="https://img.shields.io/badge/OpenLLaVA-v3.0.0-7C4DFF?style=for-the-badge&labelColor=0A0A0A" alt="v3.0.0">
 &nbsp;
 <img src="https://img.shields.io/badge/License-Apache--2.0-7C4DFF?style=for-the-badge&labelColor=0A0A0A" alt="License">
@@ -53,8 +48,6 @@ license: apache-2.0
 
 <br>
 
- ---
-
 </div>
 
 ## What is OpenLLaVA?
@@ -69,8 +62,6 @@ The central design goal: **when a new language model drops, you should have a vi
 
 <br>
 
- ---
-
 ## Quickstart
 
 ```bash
@@ -92,7 +83,7 @@ model = OpenLLaVA(
 )
 ```
 
- That's it. OpenLLaVA auto-detects hidden dimensions, builds the projector, and patches the tokenizer. No boilerplate. No config files.
+ OpenLLaVA auto-detects hidden dimensions, builds the projector, and patches the tokenizer. No boilerplate. No config files.
 
 ### Train with LoRA
 
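For orientation, the auto-detection described in the hunk above amounts to reading the hidden sizes from the two Hugging Face configs and sizing the projector from them. A minimal sketch of that idea — illustrative only, not OpenLLaVA source; the helper name and the 2-layer MLP shape are assumptions:

```python
# Illustrative sketch of dimension auto-detection — not OpenLLaVA's implementation.
# The function name and the 2-layer MLP shape are assumptions for this example.
import torch.nn as nn
from transformers import AutoConfig

def build_projector(vision_model_id: str, llm_id: str) -> nn.Module:
    vision_cfg = AutoConfig.from_pretrained(vision_model_id)
    # vision encoders expose hidden_size either at the top level or under vision_config
    vision_dim = getattr(vision_cfg, "vision_config", vision_cfg).hidden_size
    llm_dim = AutoConfig.from_pretrained(llm_id).hidden_size
    # project vision features into the language model's embedding space
    return nn.Sequential(
        nn.Linear(vision_dim, llm_dim),
        nn.GELU(),
        nn.Linear(llm_dim, llm_dim),
    )
```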
@@ -147,8 +138,6 @@ response = client.chat.completions.create(
 
 <br>
 
- ---
-
 ## Key Features
 
 <table>
@@ -159,16 +148,16 @@ response = client.chat.completions.create(
 - Vision injection into any HuggingFace LLM in 3 lines
 - AnyRes dynamic high-resolution with patch grouping
 - YakiProjector: configurable MLP alignment
- - Auto-detects hidden dims, attention heads, vocab size
+ - Auto-detects hidden dimensions, attention heads, vocabulary size
 - Supports LoRA-patched models
 
 **Training Pipeline**
- - 3-phase training: alignment instruction RL
+ - 3-phase training: alignment, instruction tuning, RL alignment
 - LoRA, LoRA+, DoRA, QLoRA, Split LoRA, LoRAGA, LoRAFA
 - BitNet ternary training (b1.58)
 - MoE + LoRA fusion
 - FP8 training on H100
- - Padding-free + sequence packing
+ - Padding-free and sequence packing
 - Curriculum learning
 
 **RL Alignment**
@@ -179,7 +168,7 @@ response = client.chat.completions.create(
 </td>
 <td width="50%" valign="top">
 
- **Inference & Serving**
+ **Inference and Serving**
 - Continuous batching
 - PagedAttention (4x memory efficiency)
 - Speculative decoding (Eagle, Medusa, NGram)
@@ -187,19 +176,19 @@ response = client.chat.completions.create(
 - OpenAI-compatible FastAPI server
 - Streaming support
 
- **40+ Optimizations**
- - torch.compile full-graph
- - GPTQ / AWQ / FP4 / NVFP4
+ **Optimization Suite (40+)**
+ - torch.compile full-graph compilation
+ - GPTQ / AWQ / FP4 / NVFP4 quantization
 - GaLore gradient projection
 - torchao integration
 - EMA training stability
 - Selective activation checkpointing
 
 **Distributed Training**
- - FSDP2, DeepSpeed ZeRO (0-3)
+ - FSDP2, DeepSpeed ZeRO (stages 0-3)
 - Tensor, Pipeline, Expert parallelism
- - Ring Attention (long context)
- - Heterogeneous GPU+CPU+TPU training
+ - Ring Attention for long context
+ - Heterogeneous GPU + CPU + TPU training
 - Auto-parallelism detection
 
 </td>
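The README's own serving example is not visible in the hunks shown here. For orientation only, a request against any OpenAI-compatible endpoint with an image attached generally looks like the sketch below; the server URL and model name are placeholders, not values from this repository:

```python
# Generic OpenAI-compatible request with an image — illustrative only.
# The base_url and model name are placeholders, not taken from this repo.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

with open("chart.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="yaki-v1",  # placeholder model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What does this chart show?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```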
@@ -208,31 +197,27 @@ response = client.chat.completions.create(
 
 <br>
 
- ---
-
 ## Multi-Backend Support
 
 | Backend | Hardware | Status |
 |:--------|:---------|:-------|
 | CUDA | NVIDIA GPUs (Ampere, Ada, Hopper, Blackwell) | Production |
 | ROCm | AMD GPUs (MI250, MI300X, RX 7000) | Production |
 | CPU FP32 | Any x86/x64 CPU (AVX-512, AVX2, NEON) | Production |
- | TPU (XLA/SPMD) | Google TPU v3-v5 | 🔶 Beta |
- | MLX | Apple Silicon M1-M4 | 🔶 Beta |
- | XPU | Intel Arc, Data Center GPU | 🔶 Beta |
- | Heterogeneous | GPU + CPU + TPU mixed | 🔶 Beta |
+ | TPU (XLA/SPMD) | Google TPU v3-v5 | Beta |
+ | MLX | Apple Silicon M1-M4 | Beta |
+ | XPU | Intel Arc, Data Center GPU | Beta |
+ | Heterogeneous | GPU + CPU + TPU mixed | Beta |
 
 <br>
 
- ---
-
 ## Stack
 
 | Layer | Technology | Purpose |
 |:------|:----------:|:--------|
 | CUDA Kernels | C/CUDA | Fused projector ops, cross-attention, VQ lookup |
 | Core | C++ | Memory management, tensor routing, async streams |
- | Bindings | pybind11 | C++ Python bridge |
+ | Bindings | pybind11 | C++ to Python bridge |
 | Triton | OpenAI Triton | Fused attention, RoPE, SwiGLU, RMSNorm |
 | API | Python | Public interface, FastVisionModel, Trainer |
 | Backends | CUDA/ROCm/MLX/TPU/XPU | Hardware abstraction |
@@ -240,39 +225,12 @@ response = client.chat.completions.create(
 
 <br>
 
- ---
-
 ## Architecture
 
- [ASCII architecture diagram removed — Input: Image + Text → Vision Encoder (SigLIP2, CLIP, DINOv2, any HF) → YakiProjector (Patch Grouping 3×3 + MLP 2-layer, [vision_dim × 9] → [llm_dim]) → Language Model (any AutoModelForCausalLM; QLoRA 4-bit NF4 · LoRA r=64 · Flash Attention) → Output: Text + <think> reasoning blocks]
+ **Image + Text** feeds into a **Vision Encoder** (SigLIP2, CLIP, DINOv2, or any HuggingFace encoder), whose patch features are passed through the **YakiProjector** (Patch Grouping 3x3 + MLP 2-layer, mapping `vision_dim x 9` to `llm_dim`). The projected embeddings are merged with text embeddings and passed to the **Language Model** (any `AutoModelForCausalLM`, with QLoRA 4-bit NF4 and LoRA r=64), which generates text output including `<think>` reasoning blocks when applicable.
 
 <br>
 
- ---
-
 ## Yadis Architecture
 
 Yadis is OpenLLaVA's flagship multimodal architecture — the long-term evolution of the framework combining discrete visual tokens, MLP projection, and cross-attention per LLM layer.
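The Architecture paragraph added above fully determines the projector's shape: group each 3×3 block of vision patches, concatenate them, and map the result through a 2-layer MLP into the language model's embedding space. A minimal PyTorch sketch of that idea — illustrative only; the class name and activation are assumptions, not OpenLLaVA's YakiProjector source:

```python
# Minimal sketch of the projector described above — not the OpenLLaVA implementation.
# Groups each 3x3 block of vision patches into one token, then projects
# [vision_dim * 9] -> [llm_dim] with a 2-layer MLP.
import torch
import torch.nn as nn

class PatchGroupingProjector(nn.Module):
    def __init__(self, vision_dim: int, llm_dim: int, group: int = 3):
        super().__init__()
        self.group = group
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim * group * group, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: [batch, H*W, vision_dim] with a square patch grid (H == W)
        b, n, d = patches.shape
        side = int(n ** 0.5)
        g = self.group
        assert side * side == n and side % g == 0, "expects a square grid divisible by the group size"
        x = patches.view(b, side, side, d)
        # fold each g x g neighbourhood into a single feature vector
        x = x.view(b, side // g, g, side // g, g, d)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (side // g) ** 2, g * g * d)
        return self.mlp(x)  # [batch, (side/g)^2, llm_dim]
```

With a 3×3 grouping, a 24×24 patch grid (576 patches) collapses to 64 grouped tokens before entering the language model, which is where most of the sequence-length savings come from.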
@@ -299,15 +257,13 @@ model = OpenLLaVA(
 ```
 
 | Mode | Description |
- |------|-------------|
+ |:-----|:------------|
 | `llava` | LLaVA-style MLP projection (default) |
- | `yadis_routing` | Multiple expert encoders + MoE router |
- | `yadis_full` | Discrete tokens + cross-attention per layer |
+ | `yadis_routing` | Multiple expert encoders with MoE router |
+ | `yadis_full` | Discrete visual tokens with cross-attention per layer |
 
 <br>
 
- ---
-
 ## OpceanAI Vision Models
 
 OpceanAI uses OpenLLaVA to publish vision versions of new language models within 48 hours of release.
@@ -318,9 +274,9 @@ OpceanAI uses OpenLLaVA to publish vision versions of new language models within
 
 **Yaki v1**
 
- Vision-language model on Yuuki RxG 8B. Complex visual reasoning, bilingual ES/EN, preserves `<think>` chain-of-thought for multimodal tasks.
+ Vision-language model built on Yuuki RxG 8B. Designed for complex visual reasoning with bilingual support (ES/EN). Preserves the `<think>` chain-of-thought behavior of the base model for multimodal tasks.
 
- Base: DeepSeek-R1-Qwen3-8B finetune<br>
+ Base: DeepSeek-R1-Qwen3-8B fine-tune<br>
 Encoder: SigLIP 2 SO400M<br>
 LoRA: r=64, alpha=128
 
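For readers who want to reproduce the adapter shape listed on the card above, the r=64 / alpha=128 setting maps onto a standard PEFT configuration roughly as follows — a sketch only; the dropout value and target modules are assumptions, not the published Yaki v1 recipe:

```python
# Sketch of the adapter settings on the Yaki v1 card, expressed as a standard
# PEFT LoraConfig. The dropout and target_modules below are assumptions for
# illustration, not values taken from the Yaki v1 training recipe.
from peft import LoraConfig

lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```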
@@ -338,7 +294,7 @@ Built on Yuuki ExG 14B with cross-attention architecture (OpenLLaVA v4).
 
 **Yaki v3** *(planned)*
 
- Built on OwO 32B with full Yadis routing architecture. OCR + visual experts.
+ Built on OwO 32B with full Yadis routing architecture, combining visual and OCR expert encoders.
 
 </td>
 </tr>
@@ -346,8 +302,6 @@ Built on OwO 32B with full Yadis routing architecture. OCR + visual experts.
 
 <br>
 
- ---
-
 ## Philosophy
 
 <table>
@@ -360,14 +314,14 @@ Every existing multimodal framework is hardcoded to specific model families. Ope
 
 **Speed Over Ceremony**
 
- When a new model drops, the window to publish a vision version is 4872 hours. OpenLLaVA is designed for that constraint — minimal configuration, automated phase management, one-command training.
+ When a new model is released, the window to publish a vision version is 48 to 72 hours. OpenLLaVA is designed for that constraint — minimal configuration, automated phase management, one-command training.
 
 </td>
 <td width="50%" valign="top">
 
 **Low Level Where It Matters**
 
- The projector is the critical path. The CUDA kernel for the fused MLP and the C++ memory manager exist because training throughput on a single GPU is the binding constraint for a zero-budget lab.
+ The projector is the critical path. The CUDA kernel for the fused MLP and the C++ memory manager exist because training throughput on a single GPU is the binding constraint for a zero-budget research organization.
 
 **Fully Open**
 
@@ -379,27 +333,23 @@ Apache 2.0. No gating. No commercial restrictions. The framework exists so that
 
 <br>
 
- ---
-
 ## Roadmap
 
 | Version | Features | Status |
- |---------|----------|--------|
- | **v1-v3** | LLaVA-style, QLoRA, AnyRes, 3-phase pipeline, multi-backend | Done |
- | **v4-v5** | CUDA kernels, GGUF vision export, CPU offloading, cross-attention | 🔄 Active |
- | **v6-v7** | Discrete visual tokens (VQ-VAE), multi-expert routing | 📋 Planned |
- | **v8-v9** | Video support, hybrid architectures | 📋 Planned |
- | **v10** | Yadis complete, omnimodal prep | 📋 Planned |
+ |:--------|:---------|:-------|
+ | v1 - v3 | LLaVA-style, QLoRA, AnyRes, 3-phase pipeline, multi-backend | Released |
+ | v4 - v5 | CUDA kernels, GGUF vision export, CPU offloading, cross-attention | Active |
+ | v6 - v7 | Discrete visual tokens (VQ-VAE), multi-expert routing | Planned |
+ | v8 - v9 | Video support, hybrid architectures | Planned |
+ | v10 | Yadis complete, omnimodal preparation | Planned |
 
 <br>
 
- ---
-
 <div align="center">
 
 ## Built by OpceanAI
 
- OpenLLaVA is the vision infrastructure layer of [OpceanAI](https://huggingface.co/OpceanAI) — an independent AI research organization built from zero budget, consumer hardware, and measurable results.
+ OpenLLaVA is the vision infrastructure layer of [OpceanAI](https://huggingface.co/OpceanAI) — an independent AI research organization operating with no institutional funding, no cloud compute budget, and no team. Every model in the OpceanAI vision pipeline is trained on consumer hardware and validated on standard benchmarks.
 
 <br>
 
@@ -411,8 +361,6 @@ OpenLLaVA is the vision infrastructure layer of [OpceanAI](https://huggingface.c
 
 <br>
 
- ---
-
 **Open framework. Open models. Zero budget. Measurable results.**
 
 [![OpenLLaVA](https://img.shields.io/badge/OpenLLaVA-v3.0.0-0D1117?style=for-the-badge)](https://github.com/OpceanAI/openllava)
 