---
license: apache-2.0
language:
- en
library_name: pytorch
pipeline_tag: text-generation
tags:
- causal-lm
- small-language-model
- mla
- multi-head-latent-attention
- muon
- pretrained
datasets:
- HuggingFaceFW/fineweb-edu
---

# TinyLM 275M (MLA + Muon)

A **275M-parameter** small language model trained from scratch with
**Multi-head Latent Attention (MLA, DeepSeek-V2 style)** and the **Muon
optimizer**, on **8B unique tokens** of FineWeb-Edu. Benchmarked against
TinyLlama-1.1B as part of a 4-arm architecture ablation.

- **Source code:** https://github.com/shivnarainms22/TinyLM
- **Full ablation results:** see [`results/hpc_rerun_ablation.md`](https://github.com/shivnarainms22/TinyLM/blob/main/results/hpc_rerun_ablation.md) in the repo
- **All four ablation checkpoints:** [`Shiv-22/tinylm-checkpoints-v2`](https://huggingface.co/Shiv-22/tinylm-checkpoints-v2)

This repo holds the **Run D** arm of the ablation (MLA + Muon), the
best-performing of the four and the model intended for downstream use.

---

## Model details

| | |
|---|---|
| Parameters | 274.6M |
| Architecture | TinyLlama-style decoder-only Transformer with MLA |
| Layers | 18 |
| Hidden size | 1024 |
| Attention heads | 16 |
| MLA latent dim | 512 (decoupled RoPE 64) |
| FFN hidden | 2816 (SwiGLU) |
| Context length | 2048 |
| Vocab | 32,000 |
| Tokenizer | [`meta-llama/Llama-2-7b-hf`](https://huggingface.co/meta-llama/Llama-2-7b-hf) |
| Tied embeddings | Yes |
| Precision | bfloat16 |

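To give a rough sense of what the MLA dimensions above buy at inference time, the sketch below compares per-sequence KV-cache size for standard multi-head attention against a DeepSeek-V2 style MLA cache. It assumes that head dimension is hidden size divided by head count, and that an MLA inference path would cache only the 512-dim latent plus the 64-dim decoupled RoPE key per token and layer. Whether this repo's code actually caches that way is not stated here, so treat the numbers as illustrative.

```python
# Values taken from the model-details table.
N_LAYERS = 18
HIDDEN = 1024        # assumed equal to n_heads * head_dim (16 * 64)
LATENT_DIM = 512     # MLA compressed KV latent
ROPE_DIM = 64        # decoupled RoPE key
CONTEXT = 2048
BYTES_PER_VALUE = 2  # bfloat16


def kv_cache_bytes(values_per_token_per_layer: int) -> int:
    """Cache size in bytes for one full-context sequence."""
    return N_LAYERS * CONTEXT * values_per_token_per_layer * BYTES_PER_VALUE


# Standard MHA stores full keys and values for every head.
mha_values = 2 * HIDDEN
# Assumption: MLA stores only the shared latent and the RoPE key.
mla_values = LATENT_DIM + ROPE_DIM

mha_mib = kv_cache_bytes(mha_values) / 2**20
mla_mib = kv_cache_bytes(mla_values) / 2**20
ratio = mha_values / mla_values

print(f"MHA cache: {mha_mib:.1f} MiB, MLA cache: {mla_mib:.1f} MiB, ratio {ratio:.2f}x")
```

Under these assumptions the cached state per token shrinks from 2048 to 576 values per layer.
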
## Training

| | |
|---|---|
| Dataset | [HuggingFaceFW/fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) (8B unique tokens) |
| Tokens processed | ~24B (~3 epochs) |
| Steps | 23,000 (warmup 2,000) |
| Effective batch | 512 sequences × 2048 tokens ≈ 1.05M tokens/step |
| Optimizer | **Muon** for matrix params (lr 0.02) + **AdamW** for scalar/embed/LM-head/LN (lr 0.001, wd 0.1) |
| LR schedule | Cosine with linear warmup |
| Grad clip | 1.0 |
| Hardware | A100-40GB (Northeastern Explorer HPC) |
| Codebase base | [modded-nanogpt](https://github.com/KellerJordan/modded-nanogpt) |

Pure FineWeb-Edu throughout (no annealing mix, no instruction tuning).

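The token budget and schedule in the table can be reproduced with a few lines of arithmetic. The sketch below recomputes tokens per step, total tokens, and epochs, and gives one common form of "cosine with linear warmup". The exact warmup formula and the final learning rate are not given in this card, so `lr_at` and its `min_lr=0.0` default are assumptions rather than the repo's scheduler.

```python
import math

# Values from the training table.
SEQS_PER_STEP = 512
SEQ_LEN = 2048
TOTAL_STEPS = 23_000
WARMUP_STEPS = 2_000
UNIQUE_TOKENS = 8e9
PEAK_LR_MUON = 0.02

tokens_per_step = SEQS_PER_STEP * SEQ_LEN         # about 1.05M
tokens_processed = tokens_per_step * TOTAL_STEPS  # about 24B
epochs = tokens_processed / UNIQUE_TOKENS         # about 3


def lr_at(step: int, peak_lr: float, min_lr: float = 0.0) -> float:
    """Hypothetical schedule: linear warmup to peak_lr, then cosine decay to min_lr."""
    if step < WARMUP_STEPS:
        return peak_lr * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


print(f"{tokens_per_step:,} tokens/step, {tokens_processed / 1e9:.1f}B tokens, {epochs:.2f} epochs")
for step in (0, 1_999, 12_500, 22_999):
    print(step, f"{lr_at(step, PEAK_LR_MUON):.5f}")
```

The same function applies to the AdamW parameter group by passing its peak rate of 0.001.
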
## Evaluation

0-shot eval via [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness). HellaSwag and ARC-Easy reported as `acc_norm` (length-normalized accuracy); LAMBADA and Winogrande as `acc`.

| Benchmark | Metric | TinyLM 275M (this model) | TinyLlama-1.1B baseline | Δ |
|---|---|---:|---:|---:|
| HellaSwag | acc_norm | 41.23% | 59.1% | −17.9 |
| ARC-Easy | acc_norm | 51.22% | 55.7% | −4.5 |
| LAMBADA | acc | 36.81% | 58.9% | −22.1 |
| Winogrande | acc | 51.30% | 58.9% | −7.6 |
| **Average** | | **45.14%** | 58.2% | **−13.1** |

Baseline = [`TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T`](https://huggingface.co/TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T).

### Ablation (full 2×2)

The four arms differ only in attention class and matrix optimizer; all other
training settings are identical.

| | AdamW | Muon | Δ (Muon − AdamW) |
|:---|:---:|:---:|:---:|
| **MHA** | Run A: 43.62% | Run C: 44.64% | +1.02 |
| **MLA** | Run B: 44.11% | **Run D: 45.14%** *(this model)* | +1.03 |
| **Δ (MLA − MHA)** | +0.49 | +0.50 | n/a |

**Findings:**

- **Muon contributes ~+1.0 pt avg, consistent across attention type** (+1.02 with MHA, +1.03 with MLA).
- **MLA contributes ~+0.5 pt avg, consistent across optimizer** (+0.49 with AdamW, +0.50 with Muon).
- **Effects are roughly additive:** sum of individual effects = +1.51, observed Run A → Run D = +1.52. Single-seed eval, so interactions below the ~1% noise floor are not detectable.

HellaSwag and LAMBADA are the cleanest signals (monotonic A < B < C < D);
ARC-Easy and Winogrande sit within stderr across all four arms.

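The additivity claim can be checked directly from the four averages in the 2×2 table. The sketch below computes each single-factor effect relative to Run A, the additive prediction for Run D, and the leftover interaction term. The variable names are illustrative and not taken from the repo.

```python
# Average scores (%) from the 2x2 ablation table.
run_a = 43.62  # MHA + AdamW
run_b = 44.11  # MLA + AdamW
run_c = 44.64  # MHA + Muon
run_d = 45.14  # MLA + Muon (this model)

# Single-factor effects measured from the Run A baseline.
muon_effect = run_c - run_a  # swap optimizer only
mla_effect = run_b - run_a   # swap attention only

# If the two effects were perfectly additive, Run D would gain their sum.
additive_prediction = muon_effect + mla_effect
observed_gain = run_d - run_a
interaction = observed_gain - additive_prediction

print(f"Muon effect:         {muon_effect:+.2f}")
print(f"MLA effect:          {mla_effect:+.2f}")
print(f"Additive prediction: {additive_prediction:+.2f}")
print(f"Observed A -> D:     {observed_gain:+.2f}")
print(f"Interaction:         {interaction:+.2f}")
```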

---

## Usage

This model uses a custom architecture (MLA with decoupled RoPE) that is not
in Hugging Face `transformers`. To load it, install the source repo:

```bash
git clone https://github.com/shivnarainms22/TinyLM
cd TinyLM
pip install torch transformers huggingface_hub
```

Then, from the repo root:

```python
import torch
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer
from tinylm.model import TinyLM, ModelConfig

# Download checkpoint
ckpt_path = hf_hub_download(repo_id="Shiv-22/tinylm", filename="step_22999.pt")
ckpt = torch.load(ckpt_path, map_location="cpu", weights_only=True)

# Build and load model
model = TinyLM(ModelConfig(**ckpt["config"]))
state = ckpt["model"]
# Checkpoints saved from a torch.compile-wrapped model prefix every key with "_orig_mod."
if any(k.startswith("_orig_mod.") for k in state):
    state = {k.removeprefix("_orig_mod."): v for k, v in state.items()}  # Python 3.9+
model.load_state_dict(state)
model.eval().to("cuda").to(torch.bfloat16)

# Tokenize and generate
# The Llama 2 tokenizer repo is gated: it needs an accepted license and an authenticated login.
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
prompt_ids = tok.encode("The capital of France is", return_tensors="pt").to("cuda")

# Greedy decoding: re-run the full sequence each step and append the argmax token
with torch.no_grad():
    for _ in range(20):
        logits = model(prompt_ids)
        next_id = logits[0, -1].argmax(dim=-1, keepdim=True).unsqueeze(0)
        prompt_ids = torch.cat([prompt_ids, next_id], dim=1)

print(tok.decode(prompt_ids[0]))
```

## Limitations

- **Small training budget for a base model.** 8B unique tokens / ~24B processed is well below modern SLMs (TinyLlama-1.1B saw 3T). Absolute benchmark numbers reflect that.
- **Below the 1.1B baseline on all four tasks.** The headline finding is the *architecture* comparison (MLA+Muon vs MHA+AdamW), not raw absolute capability.
- **Pretrain-only.** No instruction tuning, no RLHF, no safety filtering beyond what FineWeb-Edu already applies upstream.
- **Winogrande at ~51% is essentially chance** (50% on a binary task), so this benchmark is not a meaningful capability signal at 275M scale.
- **Single-seed evals.** Stderrs of ~0.5–1.0% on `acc_norm` metrics; differences smaller than that should be read as noise.
- **Custom architecture.** Not compatible with `transformers.AutoModel.from_pretrained` out of the box.

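The noise-floor remarks above can be sanity-checked with a simple binomial standard error, `sqrt(p * (1 - p) / n)`. The sketch below applies it to the four reported accuracies. The evaluation-set sizes are assumptions based on the standard splits these tasks are usually scored on, not values taken from this card. lm-evaluation-harness reports its own stderr, which may differ slightly from this approximation.

```python
import math


def binomial_stderr(acc: float, n: int) -> float:
    """Standard error of an accuracy estimated from n independent examples."""
    return math.sqrt(acc * (1.0 - acc) / n)


# (accuracy from the evaluation table, assumed number of eval examples)
results = {
    "HellaSwag": (0.4123, 10_042),
    "ARC-Easy": (0.5122, 2_376),
    "LAMBADA": (0.3681, 5_153),
    "Winogrande": (0.5130, 1_267),
}

stderrs = {task: binomial_stderr(acc, n) for task, (acc, n) in results.items()}
for task, se in stderrs.items():
    print(f"{task:<11} stderr ~ {100 * se:.2f} pts")

# How many standard errors is Winogrande above the 50% chance level?
wino_acc, _ = results["Winogrande"]
wino_z = (wino_acc - 0.5) / stderrs["Winogrande"]
print(f"Winogrande is {wino_z:.2f} standard errors above chance")
```

With these assumed sizes, Winogrande lands less than one standard error above 50%, in line with treating it as chance-level.
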
## Project history

- **v1** (2026-05, RunPod A100-80GB): single MLA+Muon training run on 1B unique tokens repeated ~21×. Established the training pipeline but the data looping hurt long-range coherence (LAMBADA acc 29.2%). v1 weights preserved at [`Shiv-22/tinylm-checkpoints`](https://huggingface.co/Shiv-22/tinylm-checkpoints) for contrast.
- **HPC re-run** (2026-05, Northeastern Explorer A100-40GB): full 4-arm ablation on 8B unique tokens. The weights in this repo are from the **Run D** arm of that re-run.

Re-running the same MLA+Muon arm with the data fix (1B×21 → 8B unique) was
worth **+3.97 pts** average, roughly 2.6× the architecture-and-optimizer
ablation gain. Data quality dominates architecture at this scale.

## Citation

```bibtex
@misc{tinylm-275m,
  author = {Shivnarain},
  title = {TinyLM 275M: A small language model with MLA and Muon},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/Shiv-22/tinylm},
}
```

## License

Apache 2.0. Inherits the permissive terms of [modded-nanogpt](https://github.com/KellerJordan/modded-nanogpt) (MIT) for the codebase and [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) (ODC-By) for the training data.