---
library_name: transformers
tags:
- hyper-efficient
- long-context
- randnla
- matryoshka
- sub-quadratic
- muon
- research
license: mit
language:
- en
metrics:
- perplexity
---


# MaximusLLM

MaximusLLM is a long-context language model designed for hyper-efficient architecture and training. It introduces a new paradigm for scaling to long context while reducing training VRAM by ~40% and increasing throughput by over 17x compared to optimized standard Cross-Entropy baselines.

## Model Details

### Model Description

- **Developed by:** Yousef Gamaleldin (Independent Researcher)
- **Model type:** Transformer with Bifurcated Latent Attention
- **Language(s) (NLP):** English
- **License:** MIT
- **Finetuned from model:** Trained from scratch (base model), followed by instruction alignment.
- **Tokenizer:** Gemma 3 (262,144 vocab size)

### Model Sources

- **Repository:** [yousefg/MaximusLLM](https://github.com/yousefg/MaximusLLM)
- **Technical Reports:** 
  - *MAXIS: A Hyper-Efficient Paradigm for Scalable Long-Context LLM Training*
  - *Bifurcated Latent Attention: Scaling LLMs to Infinite Context via Asymmetric Causal RandNLA*

## Bias, Risks, and Limitations

MaximusLLM (190M) is an architectural proof-of-concept. While it demonstrates extreme efficiency, its absolute knowledge capacity is limited by its parameter count. Users should expect hallucinations.

## How to Get Started with the Model

```python
from transformers import AutoTokenizer

from src.model import Model, Config
from src.lora import blockswap_attention_layers
from src.infer import general_generate_fn

config = Config.from_pretrained("yousefg/MaximusLLM")
model = Model(config, device="cuda")
blockswap_attention_layers(model)

# Load the Gemma 3 tokenizer (262,144-token vocabulary) the model was trained with.
tokenizer = AutoTokenizer.from_pretrained("yousefg/MaximusLLM")

prompt = "<start_of_turn>user\nWhat is the capital of France?<end_of_turn>\n<start_of_turn>model\n"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
output = general_generate_fn(model, inputs, tokenizer, max_new_tokens=50)
print(tokenizer.decode(output[0]))
```

## Training Details

### Training Data

1.  **Pre-training:** A high-quality subset of `HuggingFaceFW/fineweb-edu`.
2.  **Narrative Alignment:** `roneneldan/TinyStories` to stabilize linguistic fluidity.
3.  **Instruction Alignment:** `HuggingFaceH4/ultrachat_200k` using a multi-turn conversational format.

### Training Procedure

Maximus utilizes a specialized training pipeline to maintain FP32 master weight stability while achieving FP16 throughput.

#### Training Hyperparameters

- **Optimizers:** 
  - **Muon:** Applied to all 2D weight matrices (Attention/MLP) with LR 0.02 (Pre-train) and 0.005 (SFT).
  - **AdamW:** Applied to Embeddings, Head, and Norms (LR 4e-4).
- **Loss Function:** **MAXIS Loss** (Unnormalized Ghost Logits + Matryoshka Auxiliary loss).
- **Precision:** FP32 Master Weights, FP16 Mixed Precision (Autocast).
- **Effective Batch Size:** 64 to 256 (via Gradient Accumulation).
- **Context Length:** Scaled from 2,048 to 8,192 native (Long-context phase).
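The Muon/AdamW split above amounts to a parameter-grouping pass: 2D weight matrices inside attention and MLP blocks go to Muon, while embeddings, the output head, norms, and biases go to AdamW. A minimal sketch of that grouping, assuming standard PyTorch module naming (the name-based embedding/head filter and the toy model below are illustrative, not the repo's actual code):

```python
import torch
import torch.nn as nn

def build_param_groups(model: nn.Module):
    """Split parameters as the card describes: 2D matrices (attention/MLP
    weights) go to Muon; embeddings, head, norms, and biases go to AdamW."""
    muon_params, adamw_params = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        # Embedding and head weights are also 2D, so exclude them by name.
        # The substrings checked here are an assumption about module naming.
        is_embedding_or_head = "embed" in name or "head" in name
        if p.ndim == 2 and not is_embedding_or_head:
            muon_params.append(p)
        else:
            adamw_params.append(p)
    return muon_params, adamw_params

class Toy(nn.Module):
    """Toy model with one of each parameter kind to exercise the grouping."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(100, 32)  # 2D, but routed to AdamW by name
        self.proj = nn.Linear(32, 32)       # 2D weight -> Muon; bias -> AdamW
        self.norm = nn.LayerNorm(32)        # 1D weight/bias -> AdamW

model = Toy()
muon_params, adamw_params = build_param_groups(model)
# AdamW covers embeddings, norms, and biases at LR 4e-4, per the card.
adamw = torch.optim.AdamW(adamw_params, lr=4e-4)
```

The Muon group would then be handed to whichever Muon implementation the training pipeline uses, at LR 0.02 for pre-training and 0.005 for SFT.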

#### Speeds, Sizes, Times

- **Throughput:** 2.81 updates/sec (17.5x faster than Liger-fused Cross-Entropy).
- **VRAM Savings:** 38.7% reduction in peak memory usage.
- **Scaling:** $O(N \cdot K)$ complexity achieved via Query Chunking and KV-compression.
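The $O(N \cdot K)$ scaling above follows from attending query chunks against a fixed-size compressed KV cache of $K$ entries rather than the full $N$-length sequence. A shape-level sketch under that assumption (this illustrates the complexity argument only, not the repo's RandNLA kernel):

```python
import torch
import torch.nn.functional as F

def chunked_attention(q, k_small, v_small, chunk=128):
    """Process queries in chunks against a compressed KV cache of fixed
    size K, so peak score-matrix memory is O(chunk * K) and total compute
    is O(N * K) instead of O(N^2). Illustrative stand-in only."""
    scale = q.size(-1) ** 0.5
    outs = []
    for start in range(0, q.size(1), chunk):
        q_chunk = q[:, start:start + chunk]            # (B, chunk, D)
        scores = q_chunk @ k_small.transpose(-2, -1)   # (B, chunk, K)
        outs.append(F.softmax(scores / scale, dim=-1) @ v_small)
    return torch.cat(outs, dim=1)                      # (B, N, D)

B, N, K, D = 1, 1024, 64, 32
q = torch.randn(B, N, D)
k_small, v_small = torch.randn(B, K, D), torch.randn(B, K, D)
out = chunked_attention(q, k_small, v_small)
print(tuple(out.shape))  # (1, 1024, 32)
```

In the actual architecture the compressed cache would come from the Kronecker sketch path, with the Top-K detail path handled separately.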

## Technical Specifications

### Model Architecture and Objective

MaximusLLM utilizes three core innovations:
1.  **MAXIS Loss:** A Matryoshka-structured loss using **Dynamic Variance Ghost Logits** to simulate the full-vocabulary distribution, preventing the "premature saturation" common in sampled softmax.
2.  **RandNLA Attention:** Bifurcates the KV-cache into a **Top-K Detail Path** (lossless) and a **Causal Kronecker Sketch Path** (compressed background). It uses an **Asymmetric Causal Mask** to remain strictly autoregressive.
3.  **Fisher SVD:** Leverages the Fisher Information Matrix ($\sum (\frac{\partial L}{\partial W})^2$) to optimally initialize latent spaces, preserving pre-trained intelligence during architectural transitions.
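The Fisher diagonal $\sum (\frac{\partial L}{\partial W})^2$ from point 3 can be accumulated with a plain squared-gradient pass over calibration batches. A minimal sketch assuming a standard PyTorch training loop (the toy model and loss below are illustrative; the repo's Fisher SVD routine itself is not reproduced here):

```python
import torch
import torch.nn as nn

def fisher_diagonal(model, batches, loss_fn):
    """Accumulate the diagonal Fisher approximation sum((dL/dW)^2) over a
    set of (input, target) batches. The result can then weight an SVD of
    each matrix so the latent init preserves high-Fisher directions."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for x, y in batches:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    return fisher

# Toy usage: a single linear layer and a few random calibration batches.
model = nn.Linear(8, 4)
batches = [(torch.randn(16, 8), torch.randn(16, 4)) for _ in range(3)]
fisher = fisher_diagonal(model, batches, nn.MSELoss())
print(tuple(fisher["weight"].shape))  # (4, 8), entrywise non-negative
```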

### Compute Infrastructure

#### Hardware
- **Primary:** NVIDIA Tesla T4 (16GB VRAM) / 2x Tesla T4 via Kaggle/Cloud.
- **Secondary:** Benchmarked on NVIDIA L4 (24GB VRAM).

#### Software
- **Framework:** PyTorch 2.5+ (2.9+ recommended for training)
- **Compiler:** `torch.compile` (Hollow-compilation of inner blocks for stability).

## Citation

**MAXIS Loss:**
```bibtex
@article{gamaleldin2026maxis,
  title={MAXIS: A Hyper-Efficient Paradigm for Scalable Long-Context LLM Training},
  author={Gamaleldin, Yousef},
  journal={SSRN: Artificial Intelligence eJournal},
  year={2026}
}
```

**RandNLA Attention:**
```bibtex
@article{gamaleldin2026randnla,
  title={Bifurcated Latent Attention: Scaling LLMs to Infinite Context via Asymmetric Causal RandNLA},
  author={Gamaleldin, Yousef},
  journal={SSRN: Artificial Intelligence eJournal},
  year={2026}
}
```

## Model Card Contact
Yousef Gamaleldin - [yrafat38@gmail.com](mailto:yrafat38@gmail.com)