Update README.md
README.md (changed)
@@ -1,6 +1,5 @@
 ---
 title: README
-emoji: 👁️
 colorFrom: purple
 colorTo: indigo
 sdk: static
@@ -12,10 +11,6 @@ license: apache-2.0
 
 <br>
 
-<img src="https://img.shields.io/badge/%F0%9F%91%81%EF%B8%8F-OPENLLAVA-0D1117?style=for-the-badge&labelColor=0D1117" alt="OpenLLaVA" height="60">
-
-<br><br>
-
 <img src="https://img.shields.io/badge/OpenLLaVA-v3.0.0-7C4DFF?style=for-the-badge&labelColor=0A0A0A" alt="v3.0.0">
 
 <img src="https://img.shields.io/badge/License-Apache--2.0-7C4DFF?style=for-the-badge&labelColor=0A0A0A" alt="License">
@@ -53,8 +48,6 @@ license: apache-2.0
 
 <br>
 
----
-
 </div>
 
 ## What is OpenLLaVA?
@@ -69,8 +62,6 @@ The central design goal: **when a new language model drops, you should have a vi
 
 <br>
 
----
-
 ## Quickstart
 
 ```bash
@@ -92,7 +83,7 @@ model = OpenLLaVA(
 )
 ```
 
-
+OpenLLaVA auto-detects hidden dimensions, builds the projector, and patches the tokenizer. No boilerplate. No config files.
 
 ### Train with LoRA
 
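
A minimal sketch of what "auto-detects hidden dimensions and builds the projector" can mean in practice, using only plain `transformers` and `torch`. This is illustrative, not OpenLLaVA internals; the function name, activation choice, and model IDs are assumptions.

```python
# Illustrative sketch only (not OpenLLaVA source): read hidden sizes from
# Hugging Face configs and size a 2-layer vision-to-LLM projector from them.
import torch.nn as nn
from transformers import AutoConfig

def build_projector(vision_id: str, llm_id: str) -> nn.Sequential:
    vcfg = AutoConfig.from_pretrained(vision_id)
    # Composite vision-language configs (e.g. SigLIP) nest the size under vision_config.
    vision_dim = getattr(vcfg, "vision_config", vcfg).hidden_size
    llm_dim = AutoConfig.from_pretrained(llm_id).hidden_size
    # 2-layer MLP mapping vision features into the LLM embedding space.
    return nn.Sequential(
        nn.Linear(vision_dim, llm_dim),
        nn.GELU(),
        nn.Linear(llm_dim, llm_dim),
    )

# Example (placeholder model IDs):
# projector = build_projector("google/siglip-so400m-patch14-384", "Qwen/Qwen2.5-7B-Instruct")
```
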
@@ -147,8 +138,6 @@ response = client.chat.completions.create(
 
 <br>
 
----
-
 ## Key Features
 
 <table>
@@ -159,16 +148,16 @@ response = client.chat.completions.create(
 - Vision injection into any HuggingFace LLM in 3 lines
 - AnyRes dynamic high-resolution with patch grouping
 - YakiProjector: configurable MLP alignment
-- Auto-detects hidden
+- Auto-detects hidden dimensions, attention heads, vocabulary size
 - Supports LoRA-patched models
 
 **Training Pipeline**
-- 3-phase training: alignment
+- 3-phase training: alignment, instruction tuning, RL alignment
 - LoRA, LoRA+, DoRA, QLoRA, Split LoRA, LoRAGA, LoRAFA
 - BitNet ternary training (b1.58)
 - MoE + LoRA fusion
 - FP8 training on H100
-- Padding-free
+- Padding-free and sequence packing
 - Curriculum learning
 
 **RL Alignment**
@@ -179,7 +168,7 @@ response = client.chat.completions.create(
 </td>
 <td width="50%" valign="top">
 
-**Inference
+**Inference and Serving**
 - Continuous batching
 - PagedAttention (4x memory efficiency)
 - Speculative decoding (Eagle, Medusa, NGram)
@@ -187,19 +176,19 @@ response = client.chat.completions.create(
 - OpenAI-compatible FastAPI server
 - Streaming support
 
-**40+
-- torch.compile full-graph
-- GPTQ / AWQ / FP4 / NVFP4
+**Optimization Suite (40+)**
+- torch.compile full-graph compilation
+- GPTQ / AWQ / FP4 / NVFP4 quantization
 - GaLore gradient projection
 - torchao integration
 - EMA training stability
 - Selective activation checkpointing
 
 **Distributed Training**
-- FSDP2, DeepSpeed ZeRO (0-3)
+- FSDP2, DeepSpeed ZeRO (stages 0-3)
 - Tensor, Pipeline, Expert parallelism
-- Ring Attention
-- Heterogeneous GPU+CPU+TPU training
+- Ring Attention for long context
+- Heterogeneous GPU + CPU + TPU training
 - Auto-parallelism detection
 
 </td>
@@ -208,31 +197,27 @@ response = client.chat.completions.create(
 
 <br>
 
----
-
 ## Multi-Backend Support
 
 | Backend | Hardware | Status |
 |:--------|:---------|:-------|
-| CUDA | NVIDIA GPUs (Ampere, Ada, Hopper, Blackwell) |
-| ROCm | AMD GPUs (MI250, MI300X, RX 7000) |
-| CPU FP32 | Any x86/x64 CPU (AVX-512, AVX2, NEON) |
-| TPU (XLA/SPMD) | Google TPU v3-v5 |
-| MLX | Apple Silicon M1-M4 |
-| XPU | Intel Arc, Data Center GPU |
-| Heterogeneous | GPU + CPU + TPU mixed |
+| CUDA | NVIDIA GPUs (Ampere, Ada, Hopper, Blackwell) | Production |
+| ROCm | AMD GPUs (MI250, MI300X, RX 7000) | Production |
+| CPU FP32 | Any x86/x64 CPU (AVX-512, AVX2, NEON) | Production |
+| TPU (XLA/SPMD) | Google TPU v3-v5 | Beta |
+| MLX | Apple Silicon M1-M4 | Beta |
+| XPU | Intel Arc, Data Center GPU | Beta |
+| Heterogeneous | GPU + CPU + TPU mixed | Beta |
 
 <br>
 
----
-
 ## Stack
 
 | Layer | Technology | Purpose |
 |:------|:----------:|:--------|
 | CUDA Kernels | C/CUDA | Fused projector ops, cross-attention, VQ lookup |
 | Core | C++ | Memory management, tensor routing, async streams |
-| Bindings | pybind11 | C++
+| Bindings | pybind11 | C++ to Python bridge |
 | Triton | OpenAI Triton | Fused attention, RoPE, SwiGLU, RMSNorm |
 | API | Python | Public interface, FastVisionModel, Trainer |
 | Backends | CUDA/ROCm/MLX/TPU/XPU | Hardware abstraction |
@@ -240,39 +225,12 @@ response = client.chat.completions.create(
 
 <br>
 
----
-
 ## Architecture
 
-```
-┌─────────────────────────────────────────────────────────────┐
-│                    OpenLLaVA Framework                       │
-├─────────────────────────────────────────────────────────────┤
-│                                                              │
-│  Input: Image + Text                                         │
-│     │                                                        │
-│  ┌──▼───────────────────────────────────────────────────┐   │
-│  │  Vision Encoder (SigLIP2, CLIP, DINOv2, any HF)       │   │
-│  └──────────────────────┬────────────────────────────────┘   │
-│                         │ patch features                     │
-│  ┌──────────────────────▼────────────────────────────────┐   │
-│  │  YakiProjector — Patch Grouping 3×3 + MLP 2-layer      │   │
-│  │  [vision_dim × 9] → [llm_dim]                           │   │
-│  └──────────────────────┬────────────────────────────────┘   │
-│                         │ vision embeddings                  │
-│  ┌──────────────────────▼────────────────────────────────┐   │
-│  │  Language Model (any AutoModelForCausalLM)             │   │
-│  │  QLoRA 4-bit NF4 · LoRA r=64 · Flash Attention         │   │
-│  └──────────────────────┬────────────────────────────────┘   │
-│                         │                                     │
-│  Output: Text + <think> reasoning blocks                      │
-└─────────────────────────────────────────────────────────────┘
-```
+**Image + Text** feeds into a **Vision Encoder** (SigLIP2, CLIP, DINOv2, or any HuggingFace encoder), whose patch features are passed through the **YakiProjector** (Patch Grouping 3x3 + MLP 2-layer, mapping `vision_dim x 9` to `llm_dim`). The projected embeddings are merged with text embeddings and passed to the **Language Model** (any `AutoModelForCausalLM`, with QLoRA 4-bit NF4 and LoRA r=64), which generates text output including `<think>` reasoning blocks when applicable.
 
 <br>
 
----
-
 ## Yadis Architecture
 
 Yadis is OpenLLaVA's flagship multimodal architecture — the long-term evolution of the framework combining discrete visual tokens, MLP projection, and cross-attention per LLM layer.
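
A minimal PyTorch sketch of the projector idea described in the architecture paragraph above: group each 3x3 patch neighborhood into one token and project `[vision_dim x 9]` to `[llm_dim]` with a 2-layer MLP. This is illustrative only, not the YakiProjector implementation; the class name, GELU activation, and grid shapes are assumptions.

```python
# Illustrative sketch (not OpenLLaVA source): 3x3 patch grouping + 2-layer MLP projector.
import torch
import torch.nn as nn

class PatchGroupingProjector(nn.Module):
    def __init__(self, vision_dim: int, llm_dim: int, group: int = 3):
        super().__init__()
        self.group = group
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim * group * group, llm_dim),  # [vision_dim x 9] -> [llm_dim]
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (batch, H, W, vision_dim) grid of encoder features, H and W divisible by 3
        b, h, w, d = patches.shape
        g = self.group
        # Concatenate each 3x3 neighborhood into a single token:
        # (batch, H/3 * W/3, vision_dim * 9)
        grouped = (
            patches.reshape(b, h // g, g, w // g, g, d)
            .permute(0, 1, 3, 2, 4, 5)
            .reshape(b, (h // g) * (w // g), g * g * d)
        )
        return self.mlp(grouped)  # (batch, num_groups, llm_dim), ready to splice into the LLM input

# Example: a 27x27 grid of 1152-dim vision features projected into a 4096-dim LLM space.
tokens = PatchGroupingProjector(1152, 4096)(torch.randn(1, 27, 27, 1152))
print(tokens.shape)  # torch.Size([1, 81, 4096])
```
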
@@ -299,15 +257,13 @@ model = OpenLLaVA(
 ```
 
 | Mode | Description |
-|-----
+|:-----|:------------|
 | `llava` | LLaVA-style MLP projection (default) |
-| `yadis_routing` | Multiple expert encoders
-| `yadis_full` | Discrete tokens
+| `yadis_routing` | Multiple expert encoders with MoE router |
+| `yadis_full` | Discrete visual tokens with cross-attention per layer |
 
 <br>
 
----
-
 ## OpceanAI Vision Models
 
 OpceanAI uses OpenLLaVA to publish vision versions of new language models within 48 hours of release.
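
The `yadis_full` mode pairs discrete visual tokens with cross-attention at every LLM layer. A rough sketch of that per-layer cross-attention idea is shown below; it is illustrative only, not the Yadis implementation, and the module name, head count, and residual-plus-norm layout are assumptions.

```python
# Illustrative sketch (not OpenLLaVA/Yadis source): one cross-attention block that lets
# text hidden states attend to visual tokens; one such block per decoder layer.
import torch
import torch.nn as nn

class VisualCrossAttentionBlock(nn.Module):
    def __init__(self, llm_dim: int, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(llm_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(llm_dim)

    def forward(self, text_hidden: torch.Tensor, visual_tokens: torch.Tensor) -> torch.Tensor:
        # text_hidden: (batch, seq, llm_dim); visual_tokens: (batch, n_vis, llm_dim)
        attended, _ = self.cross_attn(query=text_hidden, key=visual_tokens, value=visual_tokens)
        return self.norm(text_hidden + attended)  # residual connection around the cross-attention

block = VisualCrossAttentionBlock(llm_dim=4096)
out = block(torch.randn(2, 16, 4096), torch.randn(2, 81, 4096))
print(out.shape)  # torch.Size([2, 16, 4096])
```
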
@@ -318,9 +274,9 @@ OpceanAI uses OpenLLaVA to publish vision versions of new language models within
 
 **Yaki v1**
 
-Vision-language model on Yuuki RxG 8B.
+Vision-language model built on Yuuki RxG 8B. Designed for complex visual reasoning with bilingual support (ES/EN). Preserves the `<think>` chain-of-thought behavior of the base model for multimodal tasks.
 
-Base: DeepSeek-R1-Qwen3-8B
+Base: DeepSeek-R1-Qwen3-8B fine-tune<br>
 Encoder: SigLIP 2 SO400M<br>
 LoRA: r=64, alpha=128
 
@@ -338,7 +294,7 @@ Built on Yuuki ExG 14B with cross-attention architecture (OpenLLaVA v4).
 
 **Yaki v3** *(planned)*
 
-Built on OwO 32B with full Yadis routing architecture
+Built on OwO 32B with full Yadis routing architecture, combining visual and OCR expert encoders.
 
 </td>
 </tr>
@@ -346,8 +302,6 @@ Built on OwO 32B with full Yadis routing architecture. OCR + visual experts.
 
 <br>
 
----
-
 ## Philosophy
 
 <table>
@@ -360,14 +314,14 @@ Every existing multimodal framework is hardcoded to specific model families. Ope
 
 **Speed Over Ceremony**
 
-When a new model
+When a new model is released, the window to publish a vision version is 48 to 72 hours. OpenLLaVA is designed for that constraint — minimal configuration, automated phase management, one-command training.
 
 </td>
 <td width="50%" valign="top">
 
 **Low Level Where It Matters**
 
-The projector is the critical path. The CUDA kernel for the fused MLP and the C++ memory manager exist because training throughput on a single GPU is the binding constraint for a zero-budget
+The projector is the critical path. The CUDA kernel for the fused MLP and the C++ memory manager exist because training throughput on a single GPU is the binding constraint for a zero-budget research organization.
 
 **Fully Open**
 
@@ -379,27 +333,23 @@ Apache 2.0. No gating. No commercial restrictions. The framework exists so that
 
 <br>
 
----
-
 ## Roadmap
 
 | Version | Features | Status |
-|--------
-
-
-
-
-
+|:--------|:---------|:-------|
+| v1 - v3 | LLaVA-style, QLoRA, AnyRes, 3-phase pipeline, multi-backend | Released |
+| v4 - v5 | CUDA kernels, GGUF vision export, CPU offloading, cross-attention | Active |
+| v6 - v7 | Discrete visual tokens (VQ-VAE), multi-expert routing | Planned |
+| v8 - v9 | Video support, hybrid architectures | Planned |
+| v10 | Yadis complete, omnimodal preparation | Planned |
 
 <br>
 
----
-
 <div align="center">
 
 ## Built by OpceanAI
 
-OpenLLaVA is the vision infrastructure layer of [OpceanAI](https://huggingface.co/OpceanAI) — an independent AI research organization
+OpenLLaVA is the vision infrastructure layer of [OpceanAI](https://huggingface.co/OpceanAI) — an independent AI research organization operating with no institutional funding, no cloud compute budget, and no team. Every model in the OpceanAI vision pipeline is trained on consumer hardware and validated on standard benchmarks.
 
 <br>
 
<br>
|
| 405 |
|
|
@@ -411,8 +361,6 @@ OpenLLaVA is the vision infrastructure layer of [OpceanAI](https://huggingface.c
|
|
| 411 |
|
| 412 |
<br>
|
| 413 |
|
| 414 |
-
---
|
| 415 |
-
|
| 416 |
**Open framework. Open models. Zero budget. Measurable results.**
|