---
library_name: transformers
tags:
- hyper-efficient
- long-context
- randnla
- matryoshka
- sub-quadratic
- muon
- research
license: mit
language:
- en
metrics:
- perplexity
---

# MaximusLLM

MaximusLLM is a long-context language model designed for hyper-efficient architecture and training. It introduces a new paradigm for scaling to long context while reducing training VRAM by ~40% and increasing throughput by over 17x compared to optimized standard Cross-Entropy baselines.

## Model Details

### Model Description

- **Developed by:** Yousef Gamaleldin (Independent Researcher)
- **Model type:** Transformer with Bifurcated Latent Attention
- **Language(s) (NLP):** English
- **License:** MIT
- **Finetuned from model:** Trained from scratch (Base), followed by Instruction Pre-training.
- **Tokenizer:** Gemma 3 (262,144 vocab size)

### Model Sources

- **Repository:** [yousefg/MaximusLLM](https://github.com/yousefg/MaximusLLM)
- **Technical Reports:**
  - *MAXIS: A Hyper-Efficient Paradigm for Scalable Long-Context LLM Training*
  - *Bifurcated Latent Attention: Scaling LLMs to Infinite Context via Asymmetric Causal RandNLA*

## Bias, Risks, and Limitations

MaximusLLM (190M) is an architectural proof-of-concept. While it demonstrates extreme efficiency, its absolute knowledge capacity is limited by its parameter count. Users should expect hallucinations.

## How to Get Started with the Model

```python
from transformers import AutoTokenizer

from src.model import Model, Config
from src.lora import blockswap_attention_layers
from src.infer import general_generate_fn

config = Config.from_pretrained("yousefg/MaximusLLM")
model = Model(config, device="cuda")
blockswap_attention_layers(model)

# Assumed: the repo ships the Gemma 3 tokenizer files; adjust the path if needed.
tokenizer = AutoTokenizer.from_pretrained("yousefg/MaximusLLM")

prompt = "user\nWhat is the capital of France?\nmodel\n"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
output = general_generate_fn(model, inputs, tokenizer, max_new_tokens=50)
print(tokenizer.decode(output[0]))
```

## Training Details

### Training Data

1. **Pre-training:** A high-quality subset of `HuggingFaceFW/fineweb-edu`.
2. **Narrative Alignment:** `roneneldan/TinyStories` to stabilize linguistic fluidity.
3. **Instruction Alignment:** `HuggingFaceH4/ultrachat_200k` using a multi-turn conversational format.

### Training Procedure

MaximusLLM utilizes a specialized training pipeline to maintain FP32 master-weight stability while achieving FP16 throughput.

#### Training Hyperparameters

- **Optimizers:**
  - **Muon:** Applied to all 2D weight matrices (Attention/MLP) with LR 0.02 (Pre-train) and 0.005 (SFT).
  - **AdamW:** Applied to Embeddings, Head, and Norms (LR 4e-4).
- **Loss Function:** **MAXIS Loss** (Unnormalized Ghost Logits + Matryoshka Auxiliary loss).
- **Precision:** FP32 Master Weights, FP16 Mixed Precision (Autocast).
- **Effective Batch Size:** 64 to 256 (via Gradient Accumulation).
- **Context Length:** Scaled from 2,048 to 8,192 native (Long-context phase).

#### Speeds, Sizes, Times

- **Throughput:** 2.81 updates/sec (17.5x faster than Liger-fused Cross-Entropy).
- **VRAM Savings:** 38.7% reduction in peak memory usage.
- **Scaling:** $O(N \cdot K)$ complexity achieved via Query Chunking and KV-compression.

## Technical Specifications

### Model Architecture and Objective

MaximusLLM utilizes three core innovations:

1. **MAXIS Loss:** A Matryoshka-structured loss using **Dynamic Variance Ghost Logits** to simulate the full-vocabulary distribution, preventing the "premature saturation" common in sampled softmax.
2. **RandNLA Attention:** Bifurcates the KV-cache into a **Top-K Detail Path** (lossless) and a **Causal Kronecker Sketch Path** (compressed background). It uses an **Asymmetric Causal Mask** to remain strictly autoregressive.
3. **Fisher SVD:** Leverages the Fisher Information Matrix ($\sum (\frac{\partial L}{\partial W})^2$) to optimally initialize latent spaces, preserving pre-trained intelligence during architectural transitions.

### Compute Infrastructure

#### Hardware

- **Primary:** NVIDIA Tesla T4 (16GB VRAM) / 2x Tesla T4 via Kaggle/Cloud.
- **Secondary:** Benchmarked on NVIDIA L4 (24GB VRAM).

#### Software

- **Framework:** PyTorch 2.5+ (or 2.9+ for training)
- **Compiler:** `torch.compile` (hollow compilation of inner blocks for stability).

## Citation

**MAXIS Loss:**

```bibtex
@article{gamaleldin2026maxis,
  title={MAXIS: A Hyper-Efficient Paradigm for Scalable Long-Context LLM Training},
  author={Gamaleldin, Yousef},
  journal={SSRN: Artificial Intelligence eJournal},
  year={2026}
}
```

**RandNLA Attention:**

```bibtex
@article{gamaleldin2026randnla,
  title={Bifurcated Latent Attention: Scaling LLMs to Infinite Context via Asymmetric Causal RandNLA},
  author={Gamaleldin, Yousef},
  journal={SSRN: Artificial Intelligence eJournal},
  year={2026}
}
```

## Model Card Contact

Yousef Gamaleldin - [yrafat38@gmail.com](mailto:yrafat38@gmail.com)
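**Appendix: hollow compilation sketch.** The "hollow compilation" strategy noted under Software — compiling each inner transformer block separately while leaving the outer embedding/loop/head eager, so any graph break stays local to one block — can be sketched as follows. This is a minimal illustration with a toy block, not MaximusLLM's actual code; the `Block` structure is an assumption, and `backend="eager"` is used only so the sketch runs without a C++ toolchain (drop it to get real inductor compilation).

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    # Stand-in for one transformer block (norm + MLP); names are illustrative.
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        return x + self.mlp(self.norm(x))

class HollowCompiledModel(nn.Module):
    # "Hollow" compilation: each inner block is compiled individually, while
    # the embedding, the block loop, and the head remain eager for stability.
    def __init__(self, vocab=256, dim=64, depth=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.blocks = nn.ModuleList(
            torch.compile(Block(dim), backend="eager") for _ in range(depth)
        )
        self.head = nn.Linear(dim, vocab)

    def forward(self, ids):
        x = self.embed(ids)
        for blk in self.blocks:  # eager outer loop over compiled blocks
            x = blk(x)
        return self.head(x)

model = HollowCompiledModel()
logits = model(torch.randint(0, 256, (2, 8)))
print(logits.shape)  # (batch=2, seq=8, vocab=256)
```

Compiling per-block rather than end-to-end trades some fusion opportunity for much shorter compile times and isolation of recompilation triggers, which is the stability benefit the Software section alludes to.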
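**Appendix: Fisher SVD sketch.** The Fisher SVD initialization described under Model Architecture can be illustrated as follows: accumulate a diagonal Fisher estimate $\sum (\frac{\partial L}{\partial W})^2$, use it to weight the matrix before a truncated SVD, and undo the weighting to obtain a low-rank latent initialization. The row-wise importance weighting here follows the general Fisher-weighted SVD idea from the literature; the toy loss and the exact construction in the technical report may differ.

```python
import torch

torch.manual_seed(0)
d_in, d_out, rank = 32, 16, 4
W = torch.randn(d_out, d_in, requires_grad=True)

# Accumulate a diagonal Fisher estimate F = sum_i (dL/dW)^2 over toy batches.
fisher = torch.zeros_like(W)
for _ in range(8):
    x = torch.randn(64, d_in)
    loss = (x @ W.T).pow(2).mean()  # stand-in training loss
    (g,) = torch.autograd.grad(loss, W)
    fisher += g.pow(2)

# Fisher-weighted SVD: scale rows by an importance weight derived from the
# Fisher diagonal, factor, then undo the scaling on the left factor.
row_imp = fisher.sum(dim=1).sqrt().clamp_min(1e-8)   # per-output-row importance
U, S, Vh = torch.linalg.svd(row_imp[:, None] * W, full_matrices=False)
A = (U[:, :rank] * S[:rank]) / row_imp[:, None]      # (d_out, rank)
B = Vh[:rank]                                        # (rank, d_in)
W_low = A @ B                                        # rank-r latent initialization
print(W_low.shape)  # (16, 32)
```

The weighting biases the truncation toward preserving the rows the loss is most sensitive to, which is how a Fisher-informed initialization can preserve pre-trained behavior better than a plain SVD of `W`.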