---
license: apache-2.0
language:
  - en
library_name: pytorch
pipeline_tag: text-generation
tags:
  - causal-lm
  - small-language-model
  - mla
  - multi-head-latent-attention
  - muon
  - pretrained
datasets:
  - HuggingFaceFW/fineweb-edu
---

# TinyLM 275M (MLA + Muon)

A **275M-parameter** small language model trained from scratch with
**Multi-head Latent Attention (MLA, DeepSeek-V2 style)** and the **Muon
optimizer**, on **8B unique tokens** of FineWeb-Edu. It is one arm of a 4-arm
(2×2) attention-and-optimizer ablation and is benchmarked against
TinyLlama-1.1B.

- **Source code:** https://github.com/shivnarainms22/TinyLM
- **Full ablation results:** see [`results/hpc_rerun_ablation.md`](https://github.com/shivnarainms22/TinyLM/blob/main/results/hpc_rerun_ablation.md) in the repo
- **All four ablation checkpoints:** [`Shiv-22/tinylm-checkpoints-v2`](https://huggingface.co/Shiv-22/tinylm-checkpoints-v2)

This repo holds the **Run D** arm of the ablation (MLA + Muon) — the
best-performing of the four and the model intended for downstream use.

---

## Model details

| | |
|---|---|
| Parameters | 274.6M |
| Architecture | TinyLlama-style decoder-only Transformer with MLA |
| Layers | 18 |
| Hidden size | 1024 |
| Attention heads | 16 |
| MLA latent dim | 512 (decoupled RoPE 64) |
| FFN hidden | 2816 (SwiGLU) |
| Context length | 2048 |
| Vocab | 32,000 |
| Tokenizer | [`meta-llama/Llama-2-7b-hf`](https://huggingface.co/meta-llama/Llama-2-7b-hf) |
| Tied embeddings | Yes |
| Precision | bfloat16 |
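
To give a feel for what the MLA latent dimension buys at inference time, here is a rough KV-cache estimate built from the table above. It is a sketch under assumptions, not a description of this repo's inference code: it assumes a DeepSeek-V2-style cache (one compressed latent plus one shared decoupled-RoPE key per token per layer), and it assumes a plain MHA layer would cache full keys and values of width equal to the hidden size.

```python
# Rough per-sequence KV-cache size, using values from the model details table.
# ASSUMPTION: MLA caches (latent + shared RoPE key) per token per layer,
# as in DeepSeek-V2. The inference code in the source repo may not do this.

N_LAYERS = 18
HIDDEN = 1024          # assumed: 16 heads x 64 dims per head
LATENT_DIM = 512       # MLA latent dim
ROPE_DIM = 64          # decoupled RoPE dim
CONTEXT = 2048
BYTES_PER_VALUE = 2    # bfloat16


def cache_bytes(values_per_token_per_layer: int) -> int:
    """Bytes needed to cache one full-context sequence across all layers."""
    return values_per_token_per_layer * N_LAYERS * CONTEXT * BYTES_PER_VALUE


mha_values = 2 * HIDDEN              # full keys + full values
mla_values = LATENT_DIM + ROPE_DIM   # compressed latent + shared RoPE key

mha_mib = cache_bytes(mha_values) / 2**20
mla_mib = cache_bytes(mla_values) / 2**20

print(f"MHA cache: {mha_mib:.1f} MiB per sequence")
print(f"MLA cache: {mla_mib:.1f} MiB per sequence")
print(f"Reduction: {mha_values / mla_values:.2f}x")
```

Under these assumptions the cache shrinks from 2048 to 576 values per token per layer.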

## Training

| | |
|---|---|
| Dataset | [HuggingFaceFW/fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) (8B unique tokens) |
| Tokens processed | ~24B (~3 epochs) |
| Steps | 23,000 (warmup 2,000) |
| Effective batch | 512 sequences × 2048 tokens ≈ 1.05M tokens/step |
| Optimizer | **Muon** for matrix params (lr 0.02) + **AdamW** for scalar/embed/LM-head/LN (lr 0.001, wd 0.1) |
| LR schedule | Cosine with linear warmup |
| Grad clip | 1.0 |
| Hardware | A100-40GB (Northeastern Explorer HPC) |
| Codebase base | [modded-nanogpt](https://github.com/KellerJordan/modded-nanogpt) |

Pure FineWeb-Edu throughout (no annealing mix, no instruction tuning).
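
The token budget and schedule rows in the table can be checked with a few lines of arithmetic. The sketch below recomputes tokens per step, total tokens, and epochs from the table, and shows one common form of "cosine with linear warmup". The decay floor (`min_lr = 0`) and the exact warmup formula are assumptions; the table only gives the peak LRs, the warmup length, and the step count.

```python
import math

SEQS_PER_STEP = 512
SEQ_LEN = 2048
TOTAL_STEPS = 23_000
WARMUP_STEPS = 2_000
UNIQUE_TOKENS = 8e9

tokens_per_step = SEQS_PER_STEP * SEQ_LEN
tokens_processed = tokens_per_step * TOTAL_STEPS
epochs = tokens_processed / UNIQUE_TOKENS


def lr_at(step: int, peak_lr: float, min_lr: float = 0.0) -> float:
    """Linear warmup to peak_lr, then cosine decay to min_lr.

    ASSUMPTION: min_lr defaults to 0; the actual floor is not stated.
    """
    if step < WARMUP_STEPS:
        return peak_lr * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


print(f"tokens/step:      {tokens_per_step:,}")
print(f"tokens processed: {tokens_processed / 1e9:.1f}B")
print(f"epochs over 8B:   {epochs:.2f}")
for step in (0, 1_999, 12_500, 22_999):
    print(f"step {step:>6}: Muon lr = {lr_at(step, peak_lr=0.02):.5f}")
```

The same `lr_at` function applies to the AdamW group by passing `peak_lr=0.001`.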

## Evaluation

0-shot eval via [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness). HellaSwag and ARC-Easy reported as `acc_norm` (length-normalized accuracy); LAMBADA and Winogrande as `acc`.

| Benchmark | Metric | TinyLM 275M (this model) | TinyLlama-1.1B baseline | Δ |
|---|---|---:|---:|---:|
| HellaSwag | acc_norm | 41.23% | 59.1% | −17.9 |
| ARC-Easy | acc_norm | 51.22% | 55.7% | −4.5 |
| LAMBADA | acc | 36.81% | 58.9% | −22.1 |
| Winogrande | acc | 51.30% | 58.9% | −7.6 |
| **Average** | | **45.14%** | 58.2% | **−13.1** |

Baseline = [`TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T`](https://huggingface.co/TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T).
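
For readers unfamiliar with the `acc` / `acc_norm` distinction in the table above, the toy example below shows how length normalization can change which answer is picked. It is a minimal sketch, assuming normalization by the byte length of each candidate completion; the candidate strings and log-likelihoods are made up for illustration and do not come from any benchmark.

```python
def pick(loglikelihoods, choices, normalize):
    """Return the index of the highest-scoring choice.

    With normalize=True, each summed log-likelihood is divided by the
    byte length of its completion (ASSUMED normalization scheme).
    """
    scores = [
        ll / len(text.encode("utf-8")) if normalize else ll
        for ll, text in zip(loglikelihoods, choices)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical example: the longer, correct completion has a lower summed
# log-likelihood simply because it contains more tokens.
choices = ["sat.", "sat down on the warm windowsill."]
lls = [-6.0, -12.0]
gold = 1

print("acc picks:     ", pick(lls, choices, normalize=False))
print("acc_norm picks:", pick(lls, choices, normalize=True))
```

Raw `acc` favors the short completion, while the normalized score favors the longer one, which is why multiple-choice tasks with uneven completion lengths are usually reported as `acc_norm`.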

### Ablation (full 2×2)

The four arms differ only in attention class and matrix optimizer; all other
training settings are identical.

|  | AdamW | Muon | Δ (Muon − AdamW) |
|:---|:---:|:---:|:---:|
| **MHA** | Run A: 43.62% | Run C: 44.64% | +1.02 |
| **MLA** | Run B: 44.11% | **Run D: 45.14%** *(this model)* | +1.03 |
| **Δ (MLA − MHA)** | +0.49 | +0.50 | — |

**Findings:**

- **Muon contributes ~+1.0 pt avg, consistent across attention type** (+1.02 with MHA, +1.03 with MLA).
- **MLA contributes ~+0.5 pt avg, consistent across optimizer** (+0.49 with AdamW, +0.50 with Muon).
- **Effects are roughly additive** — sum of individual effects = +1.51, observed Run A → Run D = +1.52. Single-seed eval, so interactions below the ~1% noise floor are not detectable.

HellaSwag and LAMBADA are the cleanest signals (monotonic A < B < C < D);
ARC-Easy and Winogrande sit within stderr across all four arms.
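
The main effects and the interaction term in the 2×2 table can be recomputed directly from the four averages. This is a sketch of standard factorial-design arithmetic applied to the numbers above; it is not taken from the source repo.

```python
# Average accuracy (%) for each arm of the 2x2 ablation, from the table above.
avg = {
    ("MHA", "AdamW"): 43.62,  # Run A
    ("MLA", "AdamW"): 44.11,  # Run B
    ("MHA", "Muon"): 44.64,   # Run C
    ("MLA", "Muon"): 45.14,   # Run D
}

muon_with_mha = avg[("MHA", "Muon")] - avg[("MHA", "AdamW")]
muon_with_mla = avg[("MLA", "Muon")] - avg[("MLA", "AdamW")]
mla_with_adamw = avg[("MLA", "AdamW")] - avg[("MHA", "AdamW")]
mla_with_muon = avg[("MLA", "Muon")] - avg[("MHA", "Muon")]

# Main effect = mean of the two simple effects for that factor.
muon_effect = (muon_with_mha + muon_with_mla) / 2
mla_effect = (mla_with_adamw + mla_with_muon) / 2

# Interaction = how much the Muon gain changes when switching MHA -> MLA.
interaction = muon_with_mla - muon_with_mha

total = avg[("MLA", "Muon")] - avg[("MHA", "AdamW")]

print(f"Muon main effect: {muon_effect:+.3f}")
print(f"MLA main effect:  {mla_effect:+.3f}")
print(f"Interaction:      {interaction:+.3f}")
print(f"Run A -> Run D:   {total:+.3f}")
```

An interaction of about 0.01 pt is far below the stated noise floor, which is what "roughly additive" means here.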

---

## Usage

This model uses a custom architecture (MLA with decoupled RoPE) that is not
available in Hugging Face `transformers`. To load it, clone the source repo
and install the dependencies:

```bash
git clone https://github.com/shivnarainms22/TinyLM
cd TinyLM
pip install torch transformers huggingface_hub
```

Then:

```python
import torch
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer
from tinylm.model import TinyLM, ModelConfig

# Download checkpoint
ckpt_path = hf_hub_download(repo_id="Shiv-22/tinylm", filename="step_22999.pt")
ckpt = torch.load(ckpt_path, map_location="cpu", weights_only=True)

# Build and load model
model = TinyLM(ModelConfig(**ckpt["config"]))
state = ckpt["model"]
if any(k.startswith("_orig_mod.") for k in state):
    state = {k.removeprefix("_orig_mod."): v for k, v in state.items()}
model.load_state_dict(state)
model.eval().to("cuda").to(torch.bfloat16)

# Tokenize and generate
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
prompt_ids = tok.encode("The capital of France is", return_tensors="pt").to("cuda")

with torch.no_grad():
    for _ in range(20):
        logits = model(prompt_ids)
        next_id = logits[0, -1].argmax(dim=-1, keepdim=True).unsqueeze(0)
        prompt_ids = torch.cat([prompt_ids, next_id], dim=1)

print(tok.decode(prompt_ids[0]))
```

## Limitations

- **Small training budget for a base model.** 8B unique tokens / ~24B processed is well below modern SLMs (TinyLlama-1.1B saw 3T). Absolute benchmark numbers reflect that.
- **Below the 1.1B baseline on all four tasks.** The headline finding is the *architecture and optimizer* comparison (MLA+Muon vs MHA+AdamW), not absolute capability.
- **Pretrain-only.** No instruction tuning, no RLHF, no safety filtering beyond what FineWeb-Edu already applies upstream.
- **Winogrande at ~51% is essentially chance** (50% binary task) — this benchmark is not a meaningful capability signal at 275M scale.
- **Single-seed evals.** Stderrs of ~0.5–1.0% on `acc_norm` metrics; differences smaller than that should be read as noise.
- **Custom architecture.** Not compatible with `transformers.AutoModel.from_pretrained` out of the box.
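
The chance-level and noise claims above can be sanity-checked with a binomial standard error. The sketch below uses the accuracies from the evaluation table; the example counts are assumptions based on the commonly used evaluation split sizes for each task and should be checked against the actual harness output.

```python
import math

# task: (accuracy, number of eval examples)
# ASSUMPTION: example counts are the usual split sizes, not read from logs.
results = {
    "hellaswag": (0.4123, 10_042),
    "arc_easy": (0.5122, 2_376),
    "lambada": (0.3681, 5_153),
    "winogrande": (0.5130, 1_267),
}


def binomial_stderr(acc: float, n: int) -> float:
    """Standard error of an accuracy estimated from n independent examples."""
    return math.sqrt(acc * (1.0 - acc) / n)


for task, (acc, n) in results.items():
    print(f"{task:<11} acc={acc:.2%}  stderr={binomial_stderr(acc, n):.2%}")

# How far is Winogrande from the 50% chance level, in standard errors?
wino_acc, wino_n = results["winogrande"]
z = (wino_acc - 0.5) / math.sqrt(0.25 / wino_n)
print(f"winogrande vs chance: z = {z:.2f}")
```

With these counts, Winogrande sits less than one standard error above chance, which is consistent with treating it as uninformative at this scale.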

## Project history

- **v1** (2026-05, RunPod A100-80GB): single MLA+Muon training run on 1B unique tokens repeated ~21×. Established the training pipeline but the data looping hurt long-range coherence (LAMBADA acc 29.2%). v1 weights preserved at [`Shiv-22/tinylm-checkpoints`](https://huggingface.co/Shiv-22/tinylm-checkpoints) for contrast.
- **HPC re-run** (2026-05, Northeastern Explorer A100-40GB): full 4-arm ablation on 8B unique tokens. The weights in this repo are from the **Run D** arm of that re-run.

Re-running the same MLA+Muon arm with the data fix (1B×21 → 8B unique) was
worth **+3.97 pts** average, roughly 2.6× the combined architecture-and-optimizer
gain from the ablation (+1.52). At this scale, avoiding heavy data repetition
matters more than the choice of attention class or optimizer.

## Citation

```bibtex
@misc{tinylm-275m,
  author       = {Shivnarain},
  title        = {TinyLM 275M: A small language model with MLA and Muon},
  year         = {2026},
  publisher    = {HuggingFace},
  url          = {https://huggingface.co/Shiv-22/tinylm},
}
```

## License

Apache 2.0. The codebase builds on [modded-nanogpt](https://github.com/KellerJordan/modded-nanogpt) (MIT), and the training data is [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) (ODC-By); both are permissively licensed.