---
title: README
colorFrom: purple
colorTo: indigo
sdk: static
pinned: false
license: apache-2.0
---
**Model Construction**
- Vision injection into any HuggingFace LLM in 3 lines
- AnyRes dynamic high-resolution with patch grouping
- YakiProjector: configurable MLP alignment (a minimal sketch follows these feature lists)
- Auto-detects hidden dimensions, attention heads, vocabulary size
- Supports LoRA-patched models

**Training Pipeline**
- 3-phase training: alignment, instruction tuning, RL alignment
- LoRA, LoRA+, DoRA, QLoRA, Split LoRA, LoRAGA, LoRAFA
- BitNet ternary training (b1.58)
- MoE + LoRA fusion
- FP8 training on H100
- Padding-free and sequence packing
- Curriculum learning

**RL Alignment**
- DPO, GRPO, ORPO, PPO
- Composable reward functions
- Visual reasoning reward support

**Inference and Serving**
- Continuous batching
- PagedAttention (4x memory efficiency)
- Speculative decoding (Eagle, Medusa, NGram)
- KV cache: quantization, eviction, compression
- OpenAI-compatible FastAPI server
- Streaming support

**Optimization Suite (40+)**
- torch.compile full-graph compilation
- GPTQ / AWQ / FP4 / NVFP4 quantization
- GaLore gradient projection
- torchao integration
- EMA training stability
- Selective activation checkpointing

**Distributed Training**
- FSDP2, DeepSpeed ZeRO (stages 0-3)
- Tensor, Pipeline, Expert parallelism
- Ring Attention for long context
- Heterogeneous GPU + CPU + TPU training
- Auto-parallelism detection
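To make the projector idea concrete, here is a minimal sketch of a configurable MLP projector that maps vision-encoder patch features into the LLM's hidden size. The class name, dimensions, and layer layout are illustrative assumptions, not the actual YakiProjector implementation or API.

```python
# Illustrative sketch only: a configurable two-layer MLP projector of the kind
# described above. Names and dimensions are assumptions, not the real API.
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    def __init__(self, vision_dim: int, llm_dim: int, hidden_mult: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, llm_dim * hidden_mult),
            nn.GELU(),
            nn.Linear(llm_dim * hidden_mult, llm_dim),
        )

    def forward(self, vision_features: torch.Tensor) -> torch.Tensor:
        # vision_features: (batch, num_patches, vision_dim) from the image encoder
        return self.net(vision_features)  # (batch, num_patches, llm_dim)

# Example: SigLIP-style 1152-d patch features projected into a 4096-d LLM.
proj = MLPProjector(vision_dim=1152, llm_dim=4096)
tokens = proj(torch.randn(1, 729, 1152))
```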
**Yaki v1**
Vision-language model built on Yuuki RxG 8B. Designed for complex visual reasoning with bilingual support (ES/EN).
- Encoder: SigLIP 2 SO400M
- LoRA: r=64, alpha=128

[huggingface.co/Openllava/Yaki](https://huggingface.co/Openllava/Yaki)
**Yaki v2** *(planned)*
Built on Yuuki ExG 14B with cross-attention architecture (OpenLLaVA v4).

**Yaki v3** *(planned)*
Built on OwO 32B with full Yadis routing architecture, combining visual and OCR expert encoders.
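Yaki v1's adapter setting (r=64, alpha=128) corresponds to a fairly standard PEFT-style configuration. The snippet below is a hedged sketch of how such an adapter could be declared with the Hugging Face `peft` library; the base checkpoint, target modules, and dropout are illustrative assumptions, not Yaki's published recipe.

```python
# Sketch of a LoRA configuration matching r=64, alpha=128, expressed with `peft`.
# Base model, target modules, and dropout are placeholders for illustration.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B")  # placeholder base LLM
lora = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()
```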
**Architecture Agnostic by Design**
Every existing multimodal framework is hardcoded to specific model families. OpenLLaVA is not. The projector adapts to any hidden dimension. The patcher works on any causal LM. The training engine handles any tokenizer.

**Speed Over Ceremony**
When a new model is released, the window to publish a vision version is 48 to 72 hours. OpenLLaVA is designed for that constraint: minimal configuration, automated phase management, one-command training.

**Low Level Where It Matters**
The projector is the critical path. The CUDA kernel for the fused MLP and the C++ memory manager exist because training throughput on a single GPU is the binding constraint for a zero-budget research organization. (A plain-PyTorch sketch of the fusion idea follows below.)

**Fully Open**
Apache 2.0. No gating. No commercial restrictions. The framework exists so that any researcher, with any model, any hardware, any budget, can build a competitive vision-language model.
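For a concrete picture of what fusing the projector buys, the sketch below uses `torch.compile` to fuse the projector's Linear → GELU → Linear sequence into fewer kernels. This is a simplified stand-in for the hand-written CUDA kernel described above, not the framework's actual implementation; the dimensions are illustrative assumptions.

```python
# Minimal sketch: fusing the projector MLP with torch.compile instead of a
# custom CUDA kernel. Dimensions are placeholders for illustration.
import torch
import torch.nn as nn

projector = nn.Sequential(
    nn.Linear(1152, 8192),
    nn.GELU(),
    nn.Linear(8192, 4096),
).cuda().half()

# mode="max-autotune" lets the compiler search for fused GEMM + epilogue kernels,
# avoiding extra memory round-trips between the three ops.
fused_projector = torch.compile(projector, mode="max-autotune")

patches = torch.randn(16, 729, 1152, device="cuda", dtype=torch.half)
out = fused_projector(patches)  # (16, 729, 4096)
```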