---
library_name: transformers
tags:
- hyper-efficient
- long-context
- randnla
- matryoshka
- sub-quadratic
- muon
- research
license: mit
language:
- en
metrics:
- perplexity
---
# MaximusLLM
MaximusLLM is a long-context language model designed for hyper-efficient architecture and training. It introduces a new paradigm for scaling to long context while reducing training VRAM by ~40% and increasing throughput by over 17x compared to optimized standard Cross-Entropy baselines.
## Model Details
### Model Description
- **Developed by:** Yousef Gamaleldin (Independent Researcher)
- **Model type:** Transformer with Bifurcated Latent Attention
- **Language(s) (NLP):** English
- **License:** MIT
- **Finetuned from model:** Trained from scratch (base), followed by instruction alignment.
- **Tokenizer:** Gemma 3 (262,144 vocab size)
### Model Sources
- **Repository:** [yousefg/MaximusLLM](https://github.com/yousefg/MaximusLLM)
- **Technical Reports:**
- *MAXIS: A Hyper-Efficient Paradigm for Scalable Long-Context LLM Training*
- *Bifurcated Latent Attention: Scaling LLMs to Infinite Context via Asymmetric Causal RandNLA*
## Bias, Risks, and Limitations
MaximusLLM (190M) is an architectural proof-of-concept. While it demonstrates extreme efficiency, its absolute knowledge capacity is limited by its parameter count. Users should expect hallucinations.
## How to Get Started with the Model
```python
from transformers import AutoTokenizer

from src.model import Model, Config
from src.lora import blockswap_attention_layers
from src.infer import general_generate_fn

# Load the model's Gemma 3 tokenizer (262,144-token vocabulary).
tokenizer = AutoTokenizer.from_pretrained("yousefg/MaximusLLM")

config = Config.from_pretrained("yousefg/MaximusLLM")
model = Model(config, device="cuda")
blockswap_attention_layers(model)

# Gemma-style multi-turn chat format.
prompt = "<start_of_turn>user\nWhat is the capital of France?<end_of_turn>\n<start_of_turn>model\n"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
output = general_generate_fn(model, inputs, tokenizer, max_new_tokens=50)
print(tokenizer.decode(output[0]))
```
## Training Details
### Training Data
1. **Pre-training:** A high-quality subset of `HuggingFaceFW/fineweb-edu`.
2. **Narrative Alignment:** `roneneldan/TinyStories` to stabilize linguistic fluidity.
3. **Instruction Alignment:** `HuggingFaceH4/ultrachat_200k` using a multi-turn conversational format.
### Training Procedure
Maximus utilizes a specialized training pipeline to maintain FP32 master weight stability while achieving FP16 throughput.
#### Training Hyperparameters
- **Optimizers:**
- **Muon:** Applied to all 2D weight matrices (Attention/MLP) with LR 0.02 (Pre-train) and 0.005 (SFT).
- **AdamW:** Applied to Embeddings, Head, and Norms (LR 4e-4).
- **Loss Function:** **MAXIS Loss** (Unnormalized Ghost Logits + Matryoshka Auxiliary loss).
- **Precision:** FP32 Master Weights, FP16 Mixed Precision (Autocast).
- **Effective Batch Size:** 64 to 256 (via Gradient Accumulation).
- **Context Length:** Scaled from 2,048 to 8,192 native (Long-context phase).
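To make the dual-optimizer setup above concrete, here is a minimal sketch of routing parameters between the two groups. The `Tiny` module, the name-based routing rule, and `split_params` are illustrative assumptions, not the project's actual code; `Muon` itself is the project-specific optimizer and is only shown as a comment.

```python
import torch
import torch.nn as nn

# Sketch of the optimizer split described in the card (routing rule is an
# assumption): Muon takes the 2D attention/MLP matrices; AdamW takes
# embeddings, the head, norms, and biases.

class Tiny(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(100, 32)          # embedding -> AdamW
        self.attn = nn.Linear(32, 32, bias=False)   # 2D matrix -> Muon
        self.mlp = nn.Linear(32, 64)                # weight -> Muon, bias -> AdamW
        self.norm = nn.LayerNorm(64)                # 1D params -> AdamW
        self.head = nn.Linear(64, 100, bias=False)  # head -> AdamW per the card

def split_params(model):
    muon, adamw = [], []
    for name, p in model.named_parameters():
        if p.ndim >= 2 and "embed" not in name and "head" not in name:
            muon.append(p)
        else:
            adamw.append(p)
    return muon, adamw

muon_params, adamw_params = split_params(Tiny())
# LRs from the card: Muon 0.02 (pre-train) / 0.005 (SFT); AdamW 4e-4.
adamw = torch.optim.AdamW(adamw_params, lr=4e-4)
# muon = Muon(muon_params, lr=0.02)  # project-specific optimizer, assumed
```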
#### Speeds, Sizes, Times
- **Throughput:** 2.81 updates/sec (17.5x faster than Liger-fused Cross-Entropy).
- **VRAM Savings:** 38.7% reduction in peak memory usage.
- **Scaling:** $O(N \cdot K)$ complexity achieved via Query Chunking and KV-compression.
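The $O(N \cdot K)$ scaling can be illustrated with a toy sketch: each query chunk attends to a fixed set of $K$ compressed KV slots instead of all $N$ past tokens, so total cost grows as $N \cdot K$ rather than $N^2$. This deliberately omits the model's Top-K detail path and asymmetric causal mask; it only shows where the complexity bound comes from, and all names here are illustrative.

```python
import torch

def chunked_compressed_attention(q, k_c, v_c, chunk=128):
    """q: (N, d) queries; k_c, v_c: (K, d) compressed keys/values, K << N.

    Each chunk does a (chunk x K) score matrix, so the whole pass is O(N*K)
    in time while only materializing chunk-sized score blocks in memory.
    """
    d = q.shape[-1]
    outs = []
    for i in range(0, q.shape[0], chunk):        # N / chunk iterations
        scores = q[i:i + chunk] @ k_c.T / d**0.5  # (chunk, K)
        outs.append(torch.softmax(scores, dim=-1) @ v_c)
    return torch.cat(outs)                        # (N, d)

q = torch.randn(256, 16)
k_c, v_c = torch.randn(32, 16), torch.randn(32, 16)  # K = 32 compressed slots
out = chunked_compressed_attention(q, k_c, v_c)
```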
## Technical Specifications
### Model Architecture and Objective
MaximusLLM utilizes three core innovations:
1. **MAXIS Loss:** A Matryoshka-structured loss using **Dynamic Variance Ghost Logits** to simulate the full-vocabulary distribution, preventing the "premature saturation" common in sampled softmax.
2. **RandNLA Attention:** Bifurcates the KV-cache into a **Top-K Detail Path** (lossless) and a **Causal Kronecker Sketch Path** (compressed background). It uses an **Asymmetric Causal Mask** to remain strictly autoregressive.
3. **Fisher SVD:** Leverages the Fisher Information Matrix ($\sum (\frac{\partial L}{\partial W})^2$) to optimally initialize latent spaces, preserving pre-trained intelligence during architectural transitions.
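One plausible reading of the Fisher SVD idea, sketched under stated assumptions (this is not the paper's exact procedure): accumulate the diagonal Fisher estimate $\sum (\frac{\partial L}{\partial W})^2$, scale the rows of $W$ by its magnitude so a truncated SVD preserves the directions the loss is most sensitive to, then undo the scaling in the resulting factors.

```python
import torch

def fisher_svd_init(W, grad_sq, rank):
    """Fisher-weighted low-rank factorization sketch: A @ B ~ W.

    grad_sq: accumulated squared gradients (dL/dW)^2, same shape as W,
    serving as a diagonal Fisher estimate.
    """
    # Per-row sensitivity weight; clamp avoids division by zero below.
    f = grad_sq.sum(dim=1, keepdim=True).sqrt().clamp_min(1e-8)  # (out, 1)
    U, S, Vh = torch.linalg.svd(f * W, full_matrices=False)
    A = (U[:, :rank] * S[:rank]) / f   # (out, rank), row scaling undone
    B = Vh[:rank]                      # (rank, in)
    return A, B

W = torch.randn(8, 6)
A, B = fisher_svd_init(W, torch.rand(8, 6), rank=6)  # full rank: exact recovery
```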
### Compute Infrastructure
#### Hardware
- **Primary:** NVIDIA Tesla T4 (16GB VRAM) / 2x Tesla T4 via Kaggle/Cloud.
- **Secondary:** Benchmarked on NVIDIA L4 (24GB VRAM).
#### Software
- **Framework:** PyTorch 2.5+ (2.9+ recommended for training)
- **Compiler:** `torch.compile` (Hollow-compilation of inner blocks for stability).
## Citation
**MAXIS Loss:**
```bibtex
@article{gamaleldin2026maxis,
  title={MAXIS: A Hyper-Efficient Paradigm for Scalable Long-Context LLM Training},
  author={Gamaleldin, Yousef},
  journal={SSRN: Artificial Intelligence eJournal},
  year={2026}
}
```
**RandNLA Attention:**
```bibtex
@article{gamaleldin2026randnla,
  title={Bifurcated Latent Attention: Scaling LLMs to Infinite Context via Asymmetric Causal RandNLA},
  author={Gamaleldin, Yousef},
  journal={SSRN: Artificial Intelligence eJournal},
  year={2026}
}
```
## Model Card Contact
Yousef Gamaleldin - [yrafat38@gmail.com](mailto:yrafat38@gmail.com)