---
library_name: transformers
tags:
- hyper-efficient
- long-context
- randnla
- matryoshka
- sub-quadratic
- muon
- research
license: mit
language:
- en
metrics:
- perplexity
---
# MaximusLLM
MaximusLLM is a long-context language model designed for hyper-efficient architecture and training. It introduces a new paradigm for scaling to long context while reducing training VRAM by ~40% and increasing throughput by over 17x compared to optimized standard Cross-Entropy baselines.
## Model Details
### Model Description
- **Developed by:** Yousef Gamaleldin (Independent Researcher)
- **Model type:** Transformer with Bifurcated Latent Attention
- **Language(s) (NLP):** English
- **License:** MIT
- **Finetuned from model:** Trained from scratch (base), followed by instruction alignment.
- **Tokenizer:** Gemma 3 (262,144 vocab size)
### Model Sources
- **Repository:** [yousefg/MaximusLLM](https://github.com/yousefg/MaximusLLM)
- **Technical Reports:**
- *MAXIS: A Hyper-Efficient Paradigm for Scalable Long-Context LLM Training*
- *Bifurcated Latent Attention: Scaling LLMs to Infinite Context via Asymmetric Causal RandNLA*
## Bias, Risks, and Limitations
MaximusLLM (190M) is an architectural proof-of-concept. While it demonstrates extreme efficiency, its absolute knowledge capacity is limited by its parameter count. Users should expect hallucinations.
## How to Get Started with the Model
```python
from transformers import AutoTokenizer

from src.model import Model, Config
from src.lora import blockswap_attention_layers
from src.infer import general_generate_fn

# Load the model's Gemma 3 tokenizer (262,144-token vocabulary).
tokenizer = AutoTokenizer.from_pretrained("yousefg/MaximusLLM")

config = Config.from_pretrained("yousefg/MaximusLLM")
model = Model(config, device="cuda")
blockswap_attention_layers(model)

# Gemma-style multi-turn chat format.
prompt = "<start_of_turn>user\nWhat is the capital of France?<end_of_turn>\n<start_of_turn>model\n"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
output = general_generate_fn(model, inputs, tokenizer, max_new_tokens=50)
print(tokenizer.decode(output[0]))
```
## Training Details
### Training Data
1. **Pre-training:** A high-quality subset of `HuggingFaceFW/fineweb-edu`.
2. **Narrative Alignment:** `roneneldan/TinyStories` to stabilize linguistic fluidity.
3. **Instruction Alignment:** `HuggingFaceH4/ultrachat_200k` using a multi-turn conversational format.
### Training Procedure
Maximus utilizes a specialized training pipeline to maintain FP32 master weight stability while achieving FP16 throughput.
#### Training Hyperparameters
- **Optimizers:**
- **Muon:** Applied to all 2D weight matrices (Attention/MLP) with LR 0.02 (Pre-train) and 0.005 (SFT).
- **AdamW:** Applied to Embeddings, Head, and Norms (LR 4e-4).
- **Loss Function:** **MAXIS Loss** (Unnormalized Ghost Logits + Matryoshka Auxiliary loss).
- **Precision:** FP32 Master Weights, FP16 Mixed Precision (Autocast).
- **Effective Batch Size:** 64 to 256 (via Gradient Accumulation).
- **Context Length:** Scaled from 2,048 to 8,192 native (Long-context phase).
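To make the dual-optimizer setup above concrete, here is a minimal sketch of routing parameters between the two groups. The `Tiny` module, the name-based routing rule, and `split_params` are illustrative assumptions, not the project's actual code; `Muon` itself is the project-specific optimizer and is only shown as a comment.

```python
import torch
import torch.nn as nn

# Sketch of the optimizer split described in the card (routing rule is an
# assumption): Muon takes the 2D attention/MLP matrices; AdamW takes
# embeddings, the head, norms, and biases.

class Tiny(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(100, 32)          # embedding -> AdamW
        self.attn = nn.Linear(32, 32, bias=False)   # 2D matrix -> Muon
        self.mlp = nn.Linear(32, 64)                # weight -> Muon, bias -> AdamW
        self.norm = nn.LayerNorm(64)                # 1D params -> AdamW
        self.head = nn.Linear(64, 100, bias=False)  # head -> AdamW per the card

def split_params(model):
    muon, adamw = [], []
    for name, p in model.named_parameters():
        if p.ndim >= 2 and "embed" not in name and "head" not in name:
            muon.append(p)
        else:
            adamw.append(p)
    return muon, adamw

muon_params, adamw_params = split_params(Tiny())
# LRs from the card: Muon 0.02 (pre-train) / 0.005 (SFT); AdamW 4e-4.
adamw = torch.optim.AdamW(adamw_params, lr=4e-4)
# muon = Muon(muon_params, lr=0.02)  # project-specific optimizer, assumed
```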
#### Speeds, Sizes, Times
- **Throughput:** 2.81 updates/sec (17.5x faster than Liger-fused Cross-Entropy).
- **VRAM Savings:** 38.7% reduction in peak memory usage.
- **Scaling:** $O(N \cdot K)$ complexity achieved via Query Chunking and KV-compression.
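The $O(N \cdot K)$ scaling can be illustrated with a toy sketch: each query chunk attends to a fixed set of $K$ compressed KV slots instead of all $N$ past tokens, so total cost grows as $N \cdot K$ rather than $N^2$. This deliberately omits the model's Top-K detail path and asymmetric causal mask; it only shows where the complexity bound comes from, and all names here are illustrative.

```python
import torch

def chunked_compressed_attention(q, k_c, v_c, chunk=128):
    """q: (N, d) queries; k_c, v_c: (K, d) compressed keys/values, K << N.

    Each chunk does a (chunk x K) score matrix, so the whole pass is O(N*K)
    in time while only materializing chunk-sized score blocks in memory.
    """
    d = q.shape[-1]
    outs = []
    for i in range(0, q.shape[0], chunk):        # N / chunk iterations
        scores = q[i:i + chunk] @ k_c.T / d**0.5  # (chunk, K)
        outs.append(torch.softmax(scores, dim=-1) @ v_c)
    return torch.cat(outs)                        # (N, d)

q = torch.randn(256, 16)
k_c, v_c = torch.randn(32, 16), torch.randn(32, 16)  # K = 32 compressed slots
out = chunked_compressed_attention(q, k_c, v_c)
```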
## Technical Specifications
### Model Architecture and Objective
MaximusLLM utilizes three core innovations:
1. **MAXIS Loss:** A Matryoshka-structured loss using **Dynamic Variance Ghost Logits** to simulate the full-vocabulary distribution, preventing the "premature saturation" common in sampled softmax.
2. **RandNLA Attention:** Bifurcates the KV-cache into a **Top-K Detail Path** (lossless) and a **Causal Kronecker Sketch Path** (compressed background). It uses an **Asymmetric Causal Mask** to remain strictly autoregressive.
3. **Fisher SVD:** Leverages the Fisher Information Matrix ($\sum (\frac{\partial L}{\partial W})^2$) to optimally initialize latent spaces, preserving pre-trained intelligence during architectural transitions.
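One plausible reading of the Fisher SVD idea, sketched under stated assumptions (this is not the paper's exact procedure): accumulate the diagonal Fisher estimate $\sum (\frac{\partial L}{\partial W})^2$, scale the rows of $W$ by its magnitude so a truncated SVD preserves the directions the loss is most sensitive to, then undo the scaling in the resulting factors.

```python
import torch

def fisher_svd_init(W, grad_sq, rank):
    """Fisher-weighted low-rank factorization sketch: A @ B ~ W.

    grad_sq: accumulated squared gradients (dL/dW)^2, same shape as W,
    serving as a diagonal Fisher estimate.
    """
    # Per-row sensitivity weight; clamp avoids division by zero below.
    f = grad_sq.sum(dim=1, keepdim=True).sqrt().clamp_min(1e-8)  # (out, 1)
    U, S, Vh = torch.linalg.svd(f * W, full_matrices=False)
    A = (U[:, :rank] * S[:rank]) / f   # (out, rank), row scaling undone
    B = Vh[:rank]                      # (rank, in)
    return A, B

W = torch.randn(8, 6)
A, B = fisher_svd_init(W, torch.rand(8, 6), rank=6)  # full rank: exact recovery
```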
### Compute Infrastructure
#### Hardware
- **Primary:** NVIDIA Tesla T4 (16GB VRAM) / 2x Tesla T4 via Kaggle/Cloud.
- **Secondary:** Benchmarked on NVIDIA L4 (24GB VRAM).
#### Software
- **Framework:** PyTorch 2.5+ (2.9+ recommended for training)
- **Compiler:** `torch.compile` (Hollow-compilation of inner blocks for stability).
## Citation
**MAXIS Loss:**
```bibtex
@article{gamaleldin2026maxis,
  title={MAXIS: A Hyper-Efficient Paradigm for Scalable Long-Context LLM Training},
  author={Gamaleldin, Yousef},
  journal={SSRN: Artificial Intelligence eJournal},
  year={2026}
}
```
**RandNLA Attention:**
```bibtex
@article{gamaleldin2026randnla,
  title={Bifurcated Latent Attention: Scaling LLMs to Infinite Context via Asymmetric Causal RandNLA},
  author={Gamaleldin, Yousef},
  journal={SSRN: Artificial Intelligence eJournal},
  year={2026}
}
```
## Model Card Contact
Yousef Gamaleldin - [yrafat38@gmail.com](mailto:yrafat38@gmail.com)