---
license: apache-2.0
language:
- en
library_name: pytorch
pipeline_tag: text-generation
tags:
- causal-lm
- small-language-model
- mla
- multi-head-latent-attention
- muon
- pretrained
datasets:
- HuggingFaceFW/fineweb-edu
---

# TinyLM 275M (MLA + Muon)

A **275M parameter** small language model trained from scratch with **Multi-head Latent Attention (MLA, DeepSeek-V2 style)** and the **Muon optimizer**, on **8B unique tokens** of FineWeb-Edu. Benchmarked against TinyLlama-1.1B as part of a 4-arm architecture ablation.

- **Source code:** https://github.com/shivnarainms22/TinyLM
- **Full ablation results:** see [`results/hpc_rerun_ablation.md`](https://github.com/shivnarainms22/TinyLM/blob/main/results/hpc_rerun_ablation.md) in the repo
- **All four ablation checkpoints:** [`Shiv-22/tinylm-checkpoints-v2`](https://huggingface.co/Shiv-22/tinylm-checkpoints-v2)

This repo holds the **Run D** arm of the ablation (MLA + Muon), the best-performing of the four and the model intended for downstream use.

---

## Model details

| | |
|---|---|
| Parameters | 274.6M |
| Architecture | TinyLlama-style decoder-only Transformer with MLA |
| Layers | 18 |
| Hidden size | 1024 |
| Attention heads | 16 |
| MLA latent dim | 512 (decoupled RoPE 64) |
| FFN hidden | 2816 (SwiGLU) |
| Context length | 2048 |
| Vocab | 32,000 |
| Tokenizer | [`meta-llama/Llama-2-7b-hf`](https://huggingface.co/meta-llama/Llama-2-7b-hf) |
| Tied embeddings | Yes |
| Precision | bfloat16 |

## Training

| | |
|---|---|
| Dataset | [HuggingFaceFW/fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) (8B unique tokens) |
| Tokens processed | ~24B (~3 epochs) |
| Steps | 23,000 (warmup 2,000) |
| Effective batch | 512 sequences × 2048 tokens ≈ 1.05M tokens/step |
| Optimizer | **Muon** for matrix params (lr 0.02) + **AdamW** for scalar/embed/LM-head/LN (lr 0.001, wd 0.1) |
| LR schedule | Cosine with linear warmup |
| Grad clip | 1.0 |
| Hardware | A100-40GB (Northeastern Explorer HPC) |
| Codebase base | [modded-nanogpt](https://github.com/KellerJordan/modded-nanogpt) |

Pure FineWeb-Edu throughout (no annealing mix, no instruction tuning).

## Evaluation

0-shot eval via [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness). HellaSwag and ARC-Easy reported as `acc_norm` (length-normalized accuracy); LAMBADA and Winogrande as `acc`.

| Benchmark | Metric | TinyLM 275M (this model) | TinyLlama-1.1B baseline | Δ |
|---|---|---:|---:|---:|
| HellaSwag | acc_norm | 41.23% | 59.1% | −17.9 |
| ARC-Easy | acc_norm | 51.22% | 55.7% | −4.5 |
| LAMBADA | acc | 36.81% | 58.9% | −22.1 |
| Winogrande | acc | 51.30% | 58.9% | −7.6 |
| **Average** | | **45.14%** | 58.2% | **−13.1** |

Baseline = [`TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T`](https://huggingface.co/TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T).

### Ablation (full 2×2)

The four arms differ only in attention class and matrix optimizer; all other training settings are identical.

| | AdamW | Muon | Δ (Muon − AdamW) |
|:---|:---:|:---:|:---:|
| **MHA** | Run A: 43.62% | Run C: 44.64% | +1.02 |
| **MLA** | Run B: 44.11% | **Run D: 45.14%** *(this model)* | +1.03 |
| **Δ (MLA − MHA)** | +0.49 | +0.50 | n/a |

**Findings:**

- **Muon contributes ~+1.0 pt avg, consistent across attention type** (+1.02 with MHA, +1.03 with MLA).
- **MLA contributes ~+0.5 pt avg, consistent across optimizer** (+0.49 with AdamW, +0.50 with Muon).
- **Effects are roughly additive:** sum of individual effects = +1.51, observed Run A → Run D = +1.52.

Single-seed eval, so interactions below the ~1% noise floor are not detectable. HellaSwag and LAMBADA are the cleanest signals (monotonic A < B < C < D); ARC-Easy and Winogrande sit within stderr across all four arms.

---

## Usage

This model uses a custom architecture (MLA with decoupled RoPE) that is not in HuggingFace `transformers`. To load it, install the source repo:

```bash
git clone https://github.com/shivnarainms22/TinyLM
cd TinyLM
pip install torch transformers huggingface_hub
```

Then:

```python
import torch
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer
from tinylm.model import TinyLM, ModelConfig

# Download checkpoint
ckpt_path = hf_hub_download(repo_id="Shiv-22/tinylm", filename="step_22999.pt")
ckpt = torch.load(ckpt_path, map_location="cpu", weights_only=True)

# Build and load model
model = TinyLM(ModelConfig(**ckpt["config"]))
state = ckpt["model"]
if any(k.startswith("_orig_mod.") for k in state):
    state = {k.removeprefix("_orig_mod."): v for k, v in state.items()}
model.load_state_dict(state)
model.eval().to("cuda").to(torch.bfloat16)

# Tokenize and generate
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
prompt_ids = tok.encode("The capital of France is", return_tensors="pt").to("cuda")
with torch.no_grad():
    for _ in range(20):
        logits = model(prompt_ids)
        next_id = logits[0, -1].argmax(dim=-1, keepdim=True).unsqueeze(0)
        prompt_ids = torch.cat([prompt_ids, next_id], dim=1)
print(tok.decode(prompt_ids[0]))
```

## Limitations

- **Small training budget for a base model.** 8B unique tokens / ~24B processed is well below modern SLMs (TinyLlama-1.1B saw 3T). Absolute benchmark numbers reflect that.
- **Below the 1.1B baseline on all four tasks.** The headline portfolio finding is the *architecture* comparison (MLA+Muon vs MHA+AdamW), not raw absolute capability.
- **Pretrain-only.** No instruction tuning, no RLHF, no safety filtering beyond what FineWeb-Edu already applies upstream.
- **Winogrande at ~51% is essentially chance** (50% binary task), so this benchmark is not a meaningful capability signal at 275M scale.
- **Single-seed evals.** Stderrs of ~0.5–1.0% on `acc_norm` metrics; differences smaller than that should be read as noise.
- **Custom architecture.** Not compatible with `transformers.AutoModel.from_pretrained` out of the box.

## Project history

- **v1** (2026-05, RunPod A100-80GB): single MLA+Muon training run on 1B unique tokens repeated ~21×. Established the training pipeline but the data looping hurt long-range coherence (LAMBADA acc 29.2%). v1 weights preserved at [`Shiv-22/tinylm-checkpoints`](https://huggingface.co/Shiv-22/tinylm-checkpoints) for contrast.
- **HPC re-run** (2026-05, Northeastern Explorer A100-40GB): full 4-arm ablation on 8B unique tokens. The weights in this repo are from the **Run D** arm of that re-run.

Re-running the same MLA+Muon arm with the data fix (1B×21 → 8B unique) was worth **+3.97 pts** average, roughly 2.6× the architecture-and-optimizer ablation gain. Data quality dominates architecture at this scale.

## Citation

```bibtex
@misc{tinylm-275m,
  author = {Shivnarain},
  title = {TinyLM 275M: A small language model with MLA and Muon},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/Shiv-22/tinylm},
}
```

## License

Apache 2.0. Inherits the permissive terms of [modded-nanogpt](https://github.com/KellerJordan/modded-nanogpt) (MIT) for the codebase and [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) (ODC-By) for the training data.
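
## Appendix: checking the ablation arithmetic

The additivity claim in the ablation findings can be checked with a few lines of arithmetic. The sketch below is illustrative only: it hard-codes the four averages from the 2×2 ablation table, and the names `scores` and `decompose` are hypothetical, not part of the TinyLM codebase. It averages each factor's gain over the two levels of the other factor and reports the interaction as the difference between Muon's gain under MLA and its gain under MHA.

```python
# Average 0-shot accuracy (%) per arm, copied from the 2x2 ablation table.
scores = {
    ("MHA", "AdamW"): 43.62,  # Run A
    ("MLA", "AdamW"): 44.11,  # Run B
    ("MHA", "Muon"): 44.64,   # Run C
    ("MLA", "Muon"): 45.14,   # Run D (this model)
}


def decompose(s):
    """Return (muon_effect, mla_effect, interaction) in percentage points."""
    # Gain from switching AdamW -> Muon, within each attention type.
    muon_with_mha = s[("MHA", "Muon")] - s[("MHA", "AdamW")]
    muon_with_mla = s[("MLA", "Muon")] - s[("MLA", "AdamW")]
    # Gain from switching MHA -> MLA, within each optimizer.
    mla_with_adamw = s[("MLA", "AdamW")] - s[("MHA", "AdamW")]
    mla_with_muon = s[("MLA", "Muon")] - s[("MHA", "Muon")]
    # Main effects: average each factor's gain over the other factor's levels.
    muon_effect = (muon_with_mha + muon_with_mla) / 2
    mla_effect = (mla_with_adamw + mla_with_muon) / 2
    # Interaction: how much Muon's gain changes when MLA is switched on.
    interaction = muon_with_mla - muon_with_mha
    return muon_effect, mla_effect, interaction


muon, mla, interaction = decompose(scores)
total = scores[("MLA", "Muon")] - scores[("MHA", "AdamW")]  # Run A -> Run D

print(f"Muon main effect: {muon:+.3f} pt")
print(f"MLA main effect:  {mla:+.3f} pt")
print(f"Interaction:      {interaction:+.3f} pt")
print(f"Run A -> Run D:   {total:+.3f} pt")
```

With single-seed evals, an interaction of roughly 0.01 pt sits far below the ~1% noise floor noted in the Evaluation section, so it is best read as "no detectable interaction" rather than as evidence of exact additivity.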