---
library_name: transformers
tags:
- hyper-efficient
- long-context
- randnla
- matryoshka
- sub-quadratic
- muon
- research
license: mit
language:
- en
metrics:
- perplexity
---

# MaximusLLM

MaximusLLM is a long-context language model with a hyper-efficient architecture and training pipeline. It introduces a new paradigm for scaling to long context while reducing training VRAM by ~40% and increasing throughput by over 17x compared to optimized standard Cross-Entropy baselines.

## Model Details

### Model Description

- **Developed by:** Yousef Gamaleldin (Independent Researcher)
- **Model type:** Transformer with Bifurcated Latent Attention
- **Language(s) (NLP):** English
- **License:** MIT
- **Finetuned from model:** Trained from scratch (base), followed by instruction pre-training.
- **Tokenizer:** Gemma 3 (262,144 vocab size)

### Model Sources

- **Repository:** [yousefg/MaximusLLM](https://github.com/yousefg/MaximusLLM)
- **Technical Reports:**
  - *MAXIS: A Hyper-Efficient Paradigm for Scalable Long-Context LLM Training*
  - *Bifurcated Latent Attention: Scaling LLMs to Infinite Context via Asymmetric Causal RandNLA*
|
| ## Bias, Risks, and Limitations |
|
|
| MaximusLLM (190M) is an architectural proof-of-concept. While it demonstrates extreme efficiency, its absolute knowledge capacity is limited by its parameter count. Users should expect hallucinations. |
|
|

## How to Get Started with the Model

```python
from transformers import AutoTokenizer

from src.model import Model, Config
from src.lora import blockswap_attention_layers
from src.infer import general_generate_fn

# The card specifies the Gemma 3 tokenizer; loading it from the model repo
# assumes the tokenizer files are published alongside the weights.
tokenizer = AutoTokenizer.from_pretrained("yousefg/MaximusLLM")

config = Config.from_pretrained("yousefg/MaximusLLM")
model = Model(config, device="cuda")
blockswap_attention_layers(model)

# Gemma-style chat template: a user turn followed by an open model turn
prompt = "<start_of_turn>user\nWhat is the capital of France?<end_of_turn>\n<start_of_turn>model\n"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
output = general_generate_fn(model, inputs, tokenizer, max_new_tokens=50)
print(tokenizer.decode(output[0]))
```

## Training Details

### Training Data

1. **Pre-training:** A high-quality subset of `HuggingFaceFW/fineweb-edu`.
2. **Narrative Alignment:** `roneneldan/TinyStories` to stabilize linguistic fluidity.
3. **Instruction Alignment:** `HuggingFaceH4/ultrachat_200k` using a multi-turn conversational format.
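
The exact multi-turn template is not spelled out here, but the turn markers in the usage example above suggest a Gemma-style format. A minimal sketch of how a conversation might be rendered for training (the `format_conversation` helper is hypothetical, not the actual preprocessing code):

```python
def format_conversation(turns):
    """Render a multi-turn chat as Gemma-style turn markers.

    `turns` is a list of (role, text) pairs, with role in {"user", "model"}.
    Illustrative only: the real ultrachat_200k preprocessing may differ.
    """
    parts = []
    for role, text in turns:
        parts.append(f"<start_of_turn>{role}\n{text}<end_of_turn>\n")
    return "".join(parts)

example = format_conversation([
    ("user", "Hi!"),
    ("model", "Hello! How can I help?"),
    ("user", "Tell me a joke."),
])
print(example)
```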

### Training Procedure

Maximus utilizes a specialized training pipeline to maintain FP32 master-weight stability while achieving FP16 throughput.

#### Training Hyperparameters

- **Optimizers:**
  - **Muon:** Applied to all 2D weight matrices (Attention/MLP) with LR 0.02 (pre-train) and 0.005 (SFT).
  - **AdamW:** Applied to Embeddings, Head, and Norms (LR 4e-4).
- **Loss Function:** **MAXIS Loss** (Unnormalized Ghost Logits + Matryoshka auxiliary loss).
- **Precision:** FP32 master weights, FP16 mixed precision (autocast).
- **Effective Batch Size:** 64 to 256 (via gradient accumulation).
- **Context Length:** Scaled from 2,048 to 8,192 native (long-context phase).
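
The Muon/AdamW split can be expressed as a simple parameter-grouping rule: rank-2 weight matrices go to Muon, while embeddings, the output head, and all 1D tensors go to AdamW. A minimal sketch (the `split_param_groups` helper and its name-based filters are illustrative, not the actual training code):

```python
def split_param_groups(named_shapes):
    """Partition parameters into Muon vs. AdamW groups by tensor rank.

    2D weight matrices (attention/MLP projections) go to Muon; everything
    else (embeddings, head, norms, biases) goes to AdamW.
    """
    muon, adamw = [], []
    for name, shape in named_shapes:
        is_matrix = len(shape) == 2
        # Embeddings and the output head are 2D but are optimized with AdamW.
        if is_matrix and not any(k in name for k in ("embed", "head")):
            muon.append(name)
        else:
            adamw.append(name)
    return muon, adamw

params = [
    ("embed.weight", (262144, 768)),
    ("blocks.0.attn.q_proj.weight", (768, 768)),
    ("blocks.0.mlp.up.weight", (3072, 768)),
    ("blocks.0.norm.weight", (768,)),
    ("head.weight", (262144, 768)),
]
muon_group, adamw_group = split_param_groups(params)
```

Each group would then be handed to its respective optimizer with the learning rates listed above.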

#### Speeds, Sizes, Times

- **Throughput:** 2.81 updates/sec (17.5x faster than Liger-fused Cross-Entropy).
- **VRAM Savings:** 38.7% reduction in peak memory usage.
- **Scaling:** $O(N \cdot K)$ complexity achieved via Query Chunking and KV-compression.
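
A back-of-envelope view of the $O(N \cdot K)$ claim: with query chunking, each chunk of queries attends to a fixed-size compressed KV of $K$ entries, so total score-matrix work is $N \cdot K$ rather than $N^2$, and peak memory per chunk is `chunk * K`. A sketch of the arithmetic (the helper and the choice $K = 512$ are illustrative):

```python
def attention_score_entries(n_queries, kv_size, chunk=1024):
    """Count score-matrix entries computed when queries are processed in
    chunks against a fixed-size (compressed) KV of `kv_size` entries."""
    total = 0
    for start in range(0, n_queries, chunk):
        rows = min(chunk, n_queries - start)
        total += rows * kv_size
    return total

N, K = 8192, 512
full = N * N                         # dense attention: O(N^2) entries
sub = attention_score_entries(N, K)  # chunked + compressed KV: O(N*K)
print(f"dense/compressed ratio: {full / sub:.0f}x")
```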

## Technical Specifications

### Model Architecture and Objective

MaximusLLM utilizes three core innovations:

1. **MAXIS Loss:** A Matryoshka-structured loss using **Dynamic Variance Ghost Logits** to simulate the full-vocabulary distribution, preventing the "premature saturation" common in sampled softmax.
2. **RandNLA Attention:** Bifurcates the KV-cache into a **Top-K Detail Path** (lossless) and a **Causal Kronecker Sketch Path** (compressed background). It uses an **Asymmetric Causal Mask** to remain strictly autoregressive.
3. **Fisher SVD:** Leverages the Fisher Information Matrix ($\sum (\frac{\partial L}{\partial W})^2$) to optimally initialize latent spaces, preserving pre-trained intelligence during architectural transitions.
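
The Fisher-SVD step can be sketched as a truncated SVD of a weight matrix scaled element-wise by the root of the accumulated squared gradients, so the factorization is biased toward directions the loss is sensitive to. The specific weighting and factor split below are an assumption about the method, not the published implementation:

```python
import numpy as np

def fisher_svd_init(W, sq_grads, rank):
    """Truncated SVD of a Fisher-weighted matrix as a latent-space init.

    `sq_grads` holds accumulated squared gradients (the diagonal Fisher
    estimate from the card). The weighting scheme here is an assumption.
    """
    fisher_weight = np.sqrt(sq_grads + 1e-8)
    U, S, Vt = np.linalg.svd(W * fisher_weight, full_matrices=False)
    # Keep the top-`rank` directions as down/up projection factors.
    down = U[:, :rank] * S[:rank]  # (d_out, rank)
    up = Vt[:rank, :]              # (rank, d_in)
    return down, up

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))
G2 = rng.random((64, 32))  # stand-in for accumulated (dL/dW)^2
down, up = fisher_svd_init(W, G2, rank=8)
```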

### Compute Infrastructure

#### Hardware

- **Primary:** NVIDIA Tesla T4 (16GB VRAM) / 2x Tesla T4 via Kaggle/Cloud.
- **Secondary:** Benchmarked on NVIDIA L4 (24GB VRAM).

#### Software

- **Framework:** PyTorch 2.5+ (2.9+ recommended for training).
- **Compiler:** `torch.compile` (hollow compilation of inner blocks for stability).

## Citation

**MAXIS Loss:**

```bibtex
@article{gamaleldin2026maxis,
  title={MAXIS: A Hyper-Efficient Paradigm for Scalable Long-Context LLM Training},
  author={Gamaleldin, Yousef},
  journal={SSRN: Artificial Intelligence eJournal},
  year={2026}
}
```

**RandNLA Attention:**

```bibtex
@article{gamaleldin2026randnla,
  title={Bifurcated Latent Attention: Scaling LLMs to Infinite Context via Asymmetric Causal RandNLA},
  author={Gamaleldin, Yousef},
  journal={SSRN: Artificial Intelligence eJournal},
  year={2026}
}
```

## Model Card Contact

Yousef Gamaleldin - [yrafat38@gmail.com](mailto:yrafat38@gmail.com)