---
library_name: transformers
tags:
- hyper-efficient
- long-context
- randnla
- matryoshka
- sub-quadratic
- muon
- research
license: mit
language:
- en
metrics:
- perplexity
---

# MaximusLLM

MaximusLLM is a long-context language model with a hyper-efficient architecture and training pipeline. It introduces a new paradigm for scaling to long context while reducing training VRAM by ~40% and increasing throughput by over 17x compared to optimized standard Cross-Entropy baselines.

## Model Details

### Model Description

- **Developed by:** Yousef Gamaleldin (Independent Researcher)
- **Model type:** Transformer with Bifurcated Latent Attention
- **Language(s) (NLP):** English
- **License:** MIT
- **Finetuned from model:** Trained from scratch (base), followed by instruction pre-training.
- **Tokenizer:** Gemma 3 (262,144 vocab size)

### Model Sources

- **Repository:** [yousefg/MaximusLLM](https://github.com/yousefg/MaximusLLM)
- **Technical Reports:**
  - *MAXIS: A Hyper-Efficient Paradigm for Scalable Long-Context LLM Training*
  - *Bifurcated Latent Attention: Scaling LLMs to Infinite Context via Asymmetric Causal RandNLA*
|
| ## Bias, Risks, and Limitations |
|
|
| MaximusLLM (190M) is an architectural proof-of-concept. While it demonstrates extreme efficiency, its absolute knowledge capacity is limited by its parameter count. Users should expect hallucinations. |
|
|

## How to Get Started with the Model

```python
from transformers import AutoTokenizer

from src.model import Model, Config
from src.lora import blockswap_attention_layers
from src.infer import general_generate_fn

# The card specifies the Gemma 3 tokenizer; loading it from the model repo
# assumes the tokenizer files are published alongside the weights.
tokenizer = AutoTokenizer.from_pretrained("yousefg/MaximusLLM")

config = Config.from_pretrained("yousefg/MaximusLLM")
model = Model(config, device="cuda")
blockswap_attention_layers(model)

# Gemma-style chat template: a user turn followed by an open model turn
prompt = "<start_of_turn>user\nWhat is the capital of France?<end_of_turn>\n<start_of_turn>model\n"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
output = general_generate_fn(model, inputs, tokenizer, max_new_tokens=50)
print(tokenizer.decode(output[0]))
```

## Training Details

### Training Data

1. **Pre-training:** A high-quality subset of `HuggingFaceFW/fineweb-edu`.
2. **Narrative Alignment:** `roneneldan/TinyStories` to stabilize linguistic fluidity.
3. **Instruction Alignment:** `HuggingFaceH4/ultrachat_200k` using a multi-turn conversational format.
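
The exact multi-turn template is not spelled out here, but the turn markers in the usage example above suggest a Gemma-style format. A minimal sketch of how a conversation might be rendered for training (the `format_conversation` helper is hypothetical, not the actual preprocessing code):

```python
def format_conversation(turns):
    """Render a multi-turn chat as Gemma-style turn markers.

    `turns` is a list of (role, text) pairs, with role in {"user", "model"}.
    Illustrative only: the real ultrachat_200k preprocessing may differ.
    """
    parts = []
    for role, text in turns:
        parts.append(f"<start_of_turn>{role}\n{text}<end_of_turn>\n")
    return "".join(parts)

example = format_conversation([
    ("user", "Hi!"),
    ("model", "Hello! How can I help?"),
    ("user", "Tell me a joke."),
])
print(example)
```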

### Training Procedure

Maximus utilizes a specialized training pipeline to maintain FP32 master-weight stability while achieving FP16 throughput.

#### Training Hyperparameters

- **Optimizers:**
  - **Muon:** Applied to all 2D weight matrices (Attention/MLP) with LR 0.02 (pre-train) and 0.005 (SFT).
  - **AdamW:** Applied to Embeddings, Head, and Norms (LR 4e-4).
- **Loss Function:** **MAXIS Loss** (Unnormalized Ghost Logits + Matryoshka auxiliary loss).
- **Precision:** FP32 master weights, FP16 mixed precision (autocast).
- **Effective Batch Size:** 64 to 256 (via gradient accumulation).
- **Context Length:** Scaled from 2,048 to 8,192 native (long-context phase).
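
The Muon/AdamW split can be expressed as a simple parameter-grouping rule: rank-2 weight matrices go to Muon, while embeddings, the output head, and all 1D tensors go to AdamW. A minimal sketch (the `split_param_groups` helper and its name-based filters are illustrative, not the actual training code):

```python
def split_param_groups(named_shapes):
    """Partition parameters into Muon vs. AdamW groups by tensor rank.

    2D weight matrices (attention/MLP projections) go to Muon; everything
    else (embeddings, head, norms, biases) goes to AdamW.
    """
    muon, adamw = [], []
    for name, shape in named_shapes:
        is_matrix = len(shape) == 2
        # Embeddings and the output head are 2D but are optimized with AdamW.
        if is_matrix and not any(k in name for k in ("embed", "head")):
            muon.append(name)
        else:
            adamw.append(name)
    return muon, adamw

params = [
    ("embed.weight", (262144, 768)),
    ("blocks.0.attn.q_proj.weight", (768, 768)),
    ("blocks.0.mlp.up.weight", (3072, 768)),
    ("blocks.0.norm.weight", (768,)),
    ("head.weight", (262144, 768)),
]
muon_group, adamw_group = split_param_groups(params)
```

Each group would then be handed to its respective optimizer with the learning rates listed above.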

#### Speeds, Sizes, Times

- **Throughput:** 2.81 updates/sec (17.5x faster than Liger-fused Cross-Entropy).
- **VRAM Savings:** 38.7% reduction in peak memory usage.
- **Scaling:** $O(N \cdot K)$ complexity achieved via Query Chunking and KV-compression.
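
A back-of-envelope view of the $O(N \cdot K)$ claim: with query chunking, each chunk of queries attends to a fixed-size compressed KV of $K$ entries, so total score-matrix work is $N \cdot K$ rather than $N^2$, and peak memory per chunk is `chunk * K`. A sketch of the arithmetic (the helper and the choice $K = 512$ are illustrative):

```python
def attention_score_entries(n_queries, kv_size, chunk=1024):
    """Count score-matrix entries computed when queries are processed in
    chunks against a fixed-size (compressed) KV of `kv_size` entries."""
    total = 0
    for start in range(0, n_queries, chunk):
        rows = min(chunk, n_queries - start)
        total += rows * kv_size
    return total

N, K = 8192, 512
full = N * N                         # dense attention: O(N^2) entries
sub = attention_score_entries(N, K)  # chunked + compressed KV: O(N*K)
print(f"dense/compressed ratio: {full / sub:.0f}x")
```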

## Technical Specifications

### Model Architecture and Objective

MaximusLLM utilizes three core innovations:

1. **MAXIS Loss:** A Matryoshka-structured loss using **Dynamic Variance Ghost Logits** to simulate the full-vocabulary distribution, preventing the "premature saturation" common in sampled softmax.
2. **RandNLA Attention:** Bifurcates the KV-cache into a **Top-K Detail Path** (lossless) and a **Causal Kronecker Sketch Path** (compressed background). It uses an **Asymmetric Causal Mask** to remain strictly autoregressive.
3. **Fisher SVD:** Leverages the Fisher Information Matrix ($\sum (\frac{\partial L}{\partial W})^2$) to optimally initialize latent spaces, preserving pre-trained intelligence during architectural transitions.
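
The Fisher-SVD step can be sketched as a truncated SVD of a weight matrix scaled element-wise by the root of the accumulated squared gradients, so the factorization is biased toward directions the loss is sensitive to. The specific weighting and factor split below are an assumption about the method, not the published implementation:

```python
import numpy as np

def fisher_svd_init(W, sq_grads, rank):
    """Truncated SVD of a Fisher-weighted matrix as a latent-space init.

    `sq_grads` holds accumulated squared gradients (the diagonal Fisher
    estimate from the card). The weighting scheme here is an assumption.
    """
    fisher_weight = np.sqrt(sq_grads + 1e-8)
    U, S, Vt = np.linalg.svd(W * fisher_weight, full_matrices=False)
    # Keep the top-`rank` directions as down/up projection factors.
    down = U[:, :rank] * S[:rank]  # (d_out, rank)
    up = Vt[:rank, :]              # (rank, d_in)
    return down, up

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))
G2 = rng.random((64, 32))  # stand-in for accumulated (dL/dW)^2
down, up = fisher_svd_init(W, G2, rank=8)
```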

### Compute Infrastructure

#### Hardware

- **Primary:** NVIDIA Tesla T4 (16GB VRAM) / 2x Tesla T4 via Kaggle/Cloud.
- **Secondary:** Benchmarked on NVIDIA L4 (24GB VRAM).

#### Software

- **Framework:** PyTorch 2.5+ (2.9+ recommended for training).
- **Compiler:** `torch.compile` (hollow compilation of inner blocks for stability).

## Citation

**MAXIS Loss:**

```bibtex
@article{gamaleldin2026maxis,
  title={MAXIS: A Hyper-Efficient Paradigm for Scalable Long-Context LLM Training},
  author={Gamaleldin, Yousef},
  journal={SSRN: Artificial Intelligence eJournal},
  year={2026}
}
```

**RandNLA Attention:**

```bibtex
@article{gamaleldin2026randnla,
  title={Bifurcated Latent Attention: Scaling LLMs to Infinite Context via Asymmetric Causal RandNLA},
  author={Gamaleldin, Yousef},
  journal={SSRN: Artificial Intelligence eJournal},
  year={2026}
}
```

## Model Card Contact

Yousef Gamaleldin - [yrafat38@gmail.com](mailto:yrafat38@gmail.com)