---
library_name: transformers
tags:
- hyper-efficient
- long-context
- randnla
- matryoshka
- sub-quadratic
- muon
- research
license: mit
language:
- en
metrics:
- perplexity
---
# MaximusLLM
MaximusLLM is a long-context language model designed for hyper-efficient architecture and training. It introduces a new paradigm for scaling to long context while reducing training VRAM by ~40% and increasing throughput by over 17x compared to optimized standard Cross-Entropy baselines.
## Model Details
### Model Description
- **Developed by:** Yousef Gamaleldin (Independent Researcher)
- **Model type:** Transformer with Bifurcated Latent Attention
- **Language(s) (NLP):** English
- **License:** MIT
- **Finetuned from model:** Trained from scratch (Base) followed by Instruction Pre-training.
- **Tokenizer:** Gemma 3 (262,144 vocab size)
### Model Sources
- **Repository:** [yousefg/MaximusLLM](https://github.com/yousefg/MaximusLLM)
- **Technical Reports:**
- *MAXIS: A Hyper-Efficient Paradigm for Scalable Long-Context LLM Training*
- *Bifurcated Latent Attention: Scaling LLMs to Infinite Context via Asymmetric Causal RandNLA*
## Bias, Risks, and Limitations
MaximusLLM (190M) is an architectural proof-of-concept. While it demonstrates extreme efficiency, its absolute knowledge capacity is limited by its parameter count. Users should expect hallucinations.
## How to Get Started with the Model
```python
from transformers import AutoTokenizer

from src.model import Model, Config
from src.lora import blockswap_attention_layers
from src.infer import general_generate_fn

# Maximus uses the Gemma 3 tokenizer (262,144-token vocabulary);
# substitute the Gemma 3 checkpoint id you have access to.
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-1b-it")

config = Config.from_pretrained("yousefg/MaximusLLM")
model = Model(config, device="cuda")
blockswap_attention_layers(model)

prompt = "<start_of_turn>user\nWhat is the capital of France?<end_of_turn>\n<start_of_turn>model\n"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
output = general_generate_fn(model, inputs, tokenizer, max_new_tokens=50)
print(tokenizer.decode(output[0]))
```
## Training Details
### Training Data
1. **Pre-training:** A high-quality subset of `HuggingFaceFW/fineweb-edu`.
2. **Narrative Alignment:** `roneneldan/TinyStories` to stabilize linguistic fluidity.
3. **Instruction Alignment:** `HuggingFaceH4/ultrachat_200k` using a multi-turn conversational format.
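The multi-turn conversational format follows the Gemma-style turn markers shown in the usage example above. A minimal sketch of the rendering (the helper name is hypothetical; the actual SFT formatting code lives in the repo):

```python
def format_chat(turns):
    """Render a multi-turn conversation with Gemma-style turn markers.

    `turns` is a list of (role, text) pairs, with role in {"user", "model"}.
    Hypothetical helper for illustration only.
    """
    out = []
    for role, text in turns:
        out.append(f"<start_of_turn>{role}\n{text}<end_of_turn>\n")
    # Leave the prompt open for the model's next turn.
    out.append("<start_of_turn>model\n")
    return "".join(out)

prompt = format_chat([("user", "What is the capital of France?")])
```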
### Training Procedure
Maximus utilizes a specialized training pipeline to maintain FP32 master weight stability while achieving FP16 throughput.
#### Training Hyperparameters
- **Optimizers:**
- **Muon:** Applied to all 2D weight matrices (Attention/MLP) with LR 0.02 (Pre-train) and 0.005 (SFT).
- **AdamW:** Applied to Embeddings, Head, and Norms (LR 4e-4).
- **Loss Function:** **MAXIS Loss** (Unnormalized Ghost Logits + Matryoshka Auxiliary loss).
- **Precision:** FP32 Master Weights, FP16 Mixed Precision (Autocast).
- **Effective Batch Size:** 64 to 256 (via Gradient Accumulation).
- **Context Length:** Scaled from 2,048 to 8,192 native (Long-context phase).
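The Muon/AdamW split above can be sketched as a simple routing rule over parameter shapes; this is an illustrative assumption about the grouping logic (the actual group construction lives in the training script):

```python
def split_param_groups(named_params):
    """Route 2D weight matrices to Muon; embeddings, head, and norms to AdamW.

    `named_params` yields (name, ndim) pairs. Sketch only: embeddings and
    the LM head are 2D but are conventionally kept on AdamW, since Muon's
    orthogonalized update is designed for hidden-layer matrices.
    """
    muon, adamw = [], []
    for name, ndim in named_params:
        if ndim == 2 and not any(k in name for k in ("embed", "head")):
            muon.append(name)
        else:
            adamw.append(name)
    return muon, adamw

params = [("embed.weight", 2), ("blocks.0.attn.q.weight", 2),
          ("blocks.0.mlp.fc1.weight", 2), ("blocks.0.norm.weight", 1),
          ("head.weight", 2)]
muon, adamw = split_param_groups(params)
```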
#### Speeds, Sizes, Times
- **Throughput:** 2.81 updates/sec (17.5x faster than Liger-fused Cross-Entropy).
- **VRAM Savings:** 38.7% reduction in peak memory usage.
- **Scaling:** $O(N \cdot K)$ complexity achieved via Query Chunking and KV-compression.
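The $O(N \cdot K)$ scaling follows from attending query chunks against a compressed KV cache of fixed length $K$. A minimal sketch (no causal mask or Top-K detail path shown; this is not the repo's implementation):

```python
import numpy as np

def chunked_attention(q, k_comp, v_comp, chunk=128):
    """Attend N queries, in chunks, against a length-K compressed KV cache.

    Peak score-matrix memory per step is O(chunk * K) rather than O(N * N).
    Illustrative only: real RandNLA attention adds masking and a detail path.
    """
    outs = []
    for i in range(0, q.shape[0], chunk):
        qc = q[i:i + chunk]                               # (chunk, d)
        scores = qc @ k_comp.T / np.sqrt(q.shape[1])      # (chunk, K)
        scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
        w = np.exp(scores)
        w /= w.sum(axis=-1, keepdims=True)                # row-wise softmax
        outs.append(w @ v_comp)                           # (chunk, d)
    return np.concatenate(outs, axis=0)

rng = np.random.default_rng(0)
N, K, d = 512, 64, 32
q = rng.normal(size=(N, d))
k_comp = rng.normal(size=(K, d))
v_comp = rng.normal(size=(K, d))
out = chunked_attention(q, k_comp, v_comp)
```

Chunking only changes how rows are grouped, so the result is identical to processing all queries at once.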
## Technical Specifications
### Model Architecture and Objective
MaximusLLM utilizes three core innovations:
1. **MAXIS Loss:** A Matryoshka-structured loss using **Dynamic Variance Ghost Logits** to simulate the full-vocabulary distribution, preventing the "premature saturation" common in sampled softmax.
2. **RandNLA Attention:** Bifurcates the KV-cache into a **Top-K Detail Path** (lossless) and a **Causal Kronecker Sketch Path** (compressed background). It uses an **Asymmetric Causal Mask** to remain strictly autoregressive.
3. **Fisher SVD:** Leverages the Fisher Information Matrix ($\sum (\frac{\partial L}{\partial W})^2$) to optimally initialize latent spaces, preserving pre-trained intelligence during architectural transitions.
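The Fisher SVD idea can be sketched as scaling a weight matrix by the empirical Fisher diagonal $\sum_i (\frac{\partial L}{\partial W})^2$ before truncating its SVD, so that high-Fisher directions survive the truncation. The exact procedure in the paper may differ; the factorization below is an assumption for illustration:

```python
import numpy as np

def fisher_svd_init(W, grad_samples, rank):
    """Fisher-weighted SVD factorization of a weight matrix (sketch).

    Scales columns by the square root of the empirical Fisher diagonal,
    truncates the SVD, and folds the scale back into the output factor.
    """
    F = sum(g ** 2 for g in grad_samples)           # empirical Fisher diag
    scale = np.sqrt(F.mean(axis=0) + 1e-8)          # per-column importance
    U, S, Vt = np.linalg.svd(W * scale, full_matrices=False)
    down = U[:, :rank] * np.sqrt(S[:rank])          # input -> latent
    up = (np.sqrt(S[:rank])[:, None] * Vt[:rank]) / scale  # latent -> output
    return down, up

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 16))
grads = [rng.normal(size=(16, 16)) for _ in range(4)]
down, up = fisher_svd_init(W, grads, rank=8)
```

At full rank the factorization reconstructs `W` exactly, which makes it a safe starting point for architectural transitions.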
### Compute Infrastructure
#### Hardware
- **Primary:** NVIDIA Tesla T4 (16GB VRAM) / 2x Tesla T4 via Kaggle/Cloud.
- **Secondary:** Benchmarked on NVIDIA L4 (24GB VRAM).
#### Software
- **Framework:** PyTorch 2.5+ (2.9+ for training)
- **Compiler:** `torch.compile` (Hollow-compilation of inner blocks for stability).
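Hollow compilation amounts to compiling each inner block separately while leaving the outer driver loop in eager mode, so graph breaks in generation logic cannot invalidate the compiled kernels. A sketch under stated assumptions: `Block` is a stand-in module (the real blocks live in `src.model`), and the `eager` backend is used here only so the example runs without a GPU toolchain.

```python
import torch
from torch import nn

class Block(nn.Module):
    """Stand-in residual MLP block (hypothetical; not the repo's block)."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):
        return x + self.mlp(self.norm(x))

blocks = nn.ModuleList(Block(32) for _ in range(2))

# "Hollow" compilation: compile each inner block, keep the outer loop eager.
# Swap backend="eager" for the default inductor backend in real training.
for i, blk in enumerate(blocks):
    blocks[i] = torch.compile(blk, backend="eager")

x = torch.randn(1, 8, 32)
for blk in blocks:
    x = blk(x)
```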
## Citation
**MAXIS Loss:**
```bibtex
@article{gamaleldin2026maxis,
title={MAXIS: A Hyper-Efficient Paradigm for Scalable Long-Context LLM Training},
author={Gamaleldin, Yousef},
journal={SSRN: Artificial Intelligence eJournal},
year={2026}
}
```
**RandNLA Attention:**
```bibtex
@article{gamaleldin2026randnla,
title={Bifurcated Latent Attention: Scaling LLMs to Infinite Context via Asymmetric Causal RandNLA},
author={Gamaleldin, Yousef},
journal={SSRN: Artificial Intelligence eJournal},
year={2026}
}
```
## Model Card Contact
Yousef Gamaleldin - [yrafat38@gmail.com](mailto:yrafat38@gmail.com)