TinyLM: 275M Parameter SLM
A 275M-parameter causal language model trained from scratch on 1B unique tokens of FineWeb-Edu, using Multi-head Latent Attention (MLA, introduced in DeepSeek-V2) and the Muon optimizer.
Built as a research portfolio piece to study the effect of modern KV-cache compression (MLA) and second-order-style optimizer improvements (Muon) at the 275M scale.
Model Details
| Property | Value |
|---|---|
| Parameters | ~275M |
| Architecture | Transformer (MLA) |
| Layers | 18 |
| d_model | 1024 |
| Attention heads | 16 |
| KV latent dim (MLA) | 512 |
| Decoupled RoPE dim | 64 |
| FFN hidden | 2816 (SwiGLU) |
| Context length | 2048 |
| Vocab size | 32,000 |
| Tokenizer | Llama-2 |
Attention (MLA): the KV cache stores d_latent + d_rope = 512 + 64 = 576 values
per token per layer instead of n_heads × head_dim × 2 = 16 × 64 × 2 = 2048,
a roughly 3.6× KV cache reduction at inference time. Positional information is
carried only by the d_rope = 64 decoupled RoPE branch; the latent path itself
carries no positional information.
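The cache arithmetic above can be checked in a few lines. This is a sketch that uses only values from the Model Details table; `head_dim = d_model / n_heads` is inferred rather than stated, and the byte counts assume the cache is held in BF16 (2 bytes per value), which is an assumption about inference that the card does not state.

```python
# Values taken from the Model Details table.
n_layers, d_model, n_heads = 18, 1024, 16
d_latent, d_rope, ctx = 512, 64, 2048

head_dim = d_model // n_heads            # 64 (inferred, not stated in the card)
mha_per_token = n_heads * head_dim * 2   # keys + values per layer: 2048
mla_per_token = d_latent + d_rope        # compressed latent + RoPE key: 576
reduction = mha_per_token / mla_per_token

BYTES_PER_VALUE = 2  # assumption: cache stored in BF16


def cache_mib(values_per_token: int) -> float:
    """KV cache size in MiB for one full-context sequence across all layers."""
    return values_per_token * n_layers * ctx * BYTES_PER_VALUE / 2**20


print(f"reduction: {reduction:.2f}x")
print(f"standard MHA cache: {cache_mib(mha_per_token):.1f} MiB")
print(f"MLA cache:          {cache_mib(mla_per_token):.1f} MiB")
```

Under the BF16 assumption, a single 2048-token sequence needs 144 MiB of cache with standard multi-head attention and 40.5 MiB with MLA.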
Optimizer (Muon): Newton-Schulz orthogonalization is applied to weight gradients before the AdamW update step, providing an approximate second-order curvature correction without the cost of a full Hessian.
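To make the orthogonalization step concrete, here is a minimal sketch of the classical cubic Newton-Schulz iteration, X ← 1.5·X − 0.5·X·Xᵀ·X, in plain Python. It is an illustration only: the training code is not shown in this card, the helper names are hypothetical, and Keller Jordan's reference Muon uses a tuned higher-order polynomial on GPU tensors instead of this cubic form.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def newton_schulz_orthogonalize(g, steps=12, eps=1e-12):
    """Approximate the nearest semi-orthogonal matrix to g (hypothetical helper)."""
    # Scale by the Frobenius norm so every singular value is at most 1;
    # the cubic iteration converges for singular values in (0, sqrt(3)).
    fro = math.sqrt(sum(v * v for row in g for v in row)) + eps
    x = [[v / fro for v in row] for row in g]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)  # X X^T X
        # Each singular value s is mapped to 1.5*s - 0.5*s**3, which pushes it toward 1.
        x = [[1.5 * xi - 0.5 * yi for xi, yi in zip(rx, ry)]
             for rx, ry in zip(x, xxt_x)]
    return x


# Toy 2x3 "gradient"; for a wide matrix the result has orthonormal rows.
g = [[3.0, 1.0, 0.0],
     [0.0, 2.0, 1.0]]
q = newton_schulz_orthogonalize(g)
gram = matmul(q, transpose(q))  # should be close to the 2x2 identity
```

The iteration needs only matrix multiplications, which is why it is cheap compared with an explicit SVD or a Hessian-based method.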
Other: RMSNorm, SwiGLU activations, tied input/output embeddings, BF16 training.
Training
| Property | Value |
|---|---|
| Dataset | FineWeb-Edu (1B unique tokens, 10 shards) |
| Steps | 20,000 |
| Effective epochs | ~21 (data repeated) |
| Batch size | 512 sequences × 2048 tokens |
| LR schedule | Cosine with warmup |
| Precision | BF16 |
| Hardware | 1× A100-80GB |
| Final loss | 2.22 |
| Gradient norm (final) | 0.094 |
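The "cosine with warmup" schedule in the table can be sketched as follows. Only the 20,000-step total comes from this card; the warmup length, peak learning rate, and floor are hypothetical placeholders chosen for illustration, and the linear warmup shape is also an assumption.

```python
import math

TOTAL_STEPS = 20_000  # from the Training table
WARMUP_STEPS = 500    # hypothetical
PEAK_LR = 3e-4        # hypothetical
MIN_LR = 3e-5         # hypothetical


def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay down to MIN_LR."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


for step in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, f"{lr_at(step):.2e}")
```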
Training logs: WandB run
Note: Training uses 1B unique tokens repeated ~21× over 20k steps (no annealing mix). This limits long-range coherence; benchmarks that reward it (HellaSwag, LAMBADA) are most affected.
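The ~21 epoch figure follows directly from the batch size and step count in the table. A quick check, treating "1B" as exactly 10^9 tokens (an assumption; the exact shard total is not given here):

```python
seqs_per_batch = 512
tokens_per_seq = 2048
steps = 20_000
unique_tokens = 1_000_000_000  # assumption: "1B" taken as exactly 1e9

tokens_per_step = seqs_per_batch * tokens_per_seq  # 1,048,576
total_tokens = tokens_per_step * steps             # 20,971,520,000
epochs = total_tokens / unique_tokens

print(f"tokens per step: {tokens_per_step:,}")
print(f"total tokens:    {total_tokens:,}")
print(f"epochs:          {epochs:.2f}")
```

So the model sees roughly 21B tokens in total, of which 1B are unique.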
Benchmark Results (0-shot)
Evaluated with lm-eval v0.4.12.
Baseline is TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T (3T tokens, 4× the parameters).
| Benchmark | TinyLM 275M | TinyLlama 1.1B | Δ (pp) |
|---|---|---|---|
| HellaSwag (acc) | 32.4% | 59.1% | −26.7 |
| ARC-Easy (acc) | 53.8% | 55.7% | −1.9 |
| LAMBADA (acc) | 29.2% | 58.9% | −29.7 |
| Winogrande (acc) | 50.0% | 58.9% | −8.9 |
| Average | 41.3% | 58.2% | −16.9 |
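ARC-Easy is within 2 percentage points of a model with 4× more parameters that saw roughly 140× more training tokens (3T vs. about 21B, of which only 1B are unique here). HellaSwag and LAMBADA are weak, which is expected for a model trained on heavily repeated data, since both tasks reward long-range coherence. Winogrande at 50.0% is at chance level for a two-choice task.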
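The per-task gaps and averages can be recomputed from the table values. The snippet below only re-derives numbers already reported above; it does not run any evaluation.

```python
# Accuracies (%) copied from the benchmark table: (TinyLM 275M, TinyLlama 1.1B).
scores = {
    "HellaSwag": (32.4, 59.1),
    "ARC-Easy": (53.8, 55.7),
    "LAMBADA": (29.2, 58.9),
    "Winogrande": (50.0, 58.9),
}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


tinylm_avg = mean(t for t, _ in scores.values())
baseline_avg = mean(b for _, b in scores.values())

for task, (tinylm, baseline) in scores.items():
    print(f"{task:<11} {tinylm - baseline:+.1f} pp")
print(f"{'Average':<11} {tinylm_avg - baseline_avg:+.1f} pp")
```

The unrounded averages are 41.35 and 58.15, a gap of 16.8 points; the 16.9 in the table comes from subtracting the already-rounded averages.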
Checkpoint
The model checkpoint (step_19999.pt, 2.33 GB) is hosted in a separate repo, Shiv-22/tinylm-checkpoints.
generate.py downloads it automatically on first run.
Usage
Requirements
pip install torch transformers huggingface_hub
Quick start
git clone https://huggingface.co/Shiv-22/tinylm
cd tinylm
python generate.py --prompt "The theory of relativity states that"
Interactive mode:
python generate.py
Greedy decoding:
python generate.py --prompt "Once upon a time" --temperature 0
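How generate.py treats `--temperature 0` internally is not shown in this card. A common convention, assumed here, is to take the argmax at temperature 0 and otherwise sample from a softmax over logits divided by the temperature. `sample_next_token` is a hypothetical helper written for illustration, not a function from the repo.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, rng=random):
    """Pick a token index from raw logits (hypothetical helper)."""
    if temperature <= 0:
        # Greedy decoding: always take the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


logits = [0.1, 2.5, -1.0]
greedy = sample_next_token(logits, temperature=0)
sampled = sample_next_token(logits, temperature=0.8, rng=random.Random(0))
```

Lower temperatures concentrate probability on the top token, so greedy decoding is the limit as the temperature approaches 0.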
Programmatic loading
import torch
from huggingface_hub import hf_hub_download
from tinylm.model import ModelConfig, TinyLM
# Fetch the checkpoint from the separate checkpoint repo (cached after the first call).
ckpt_path = hf_hub_download("Shiv-22/tinylm-checkpoints", "step_19999.pt")
ckpt = torch.load(ckpt_path, map_location="cpu", weights_only=True)
# Rebuild the architecture from the config stored alongside the weights.
c = ckpt["config"]
model = TinyLM(ModelConfig(
    n_layers=c["n_layers"], d_model=c["d_model"], n_heads=c["n_heads"],
    d_latent=c["d_latent"], d_rope=c["d_rope"], ffn_hidden=c["ffn_hidden"],
    ctx=c["ctx"], vocab_size=c["vocab_size"], tie_weights=c["tie_weights"],
    attention=c["attention"],
))
# Weights saved from a torch.compile-wrapped model carry an "_orig_mod." key prefix.
state = ckpt["model"]
if any(k.startswith("_orig_mod.") for k in state):
    state = {k.removeprefix("_orig_mod."): v for k, v in state.items()}
model.load_state_dict(state)
model.eval()
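The `_orig_mod.` handling in the loading code exists because a module wrapped by `torch.compile` reports its parameters under that prefix, so a checkpoint saved from the compiled model will not load into an uncompiled one without renaming. The standalone sketch below shows the renaming on a plain dictionary; the key names and the helper name are hypothetical, and `str.removeprefix` requires Python 3.9 or later.

```python
def strip_compile_prefix(state: dict, prefix: str = "_orig_mod.") -> dict:
    """Return a copy of a state dict with the torch.compile key prefix removed."""
    # removeprefix is a no-op for keys without the prefix, so mixed or
    # already-clean state dicts pass through unchanged.
    return {key.removeprefix(prefix): value for key, value in state.items()}


# Hypothetical key names, with placeholder values instead of tensors.
compiled_state = {
    "_orig_mod.embed.weight": "tensor-0",
    "_orig_mod.layers.0.attn.w_o.weight": "tensor-1",
}
clean_state = strip_compile_prefix(compiled_state)
print(sorted(clean_state))
```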
Limitations
- Base model only: no instruction tuning or RLHF; it outputs raw continuations.
- English educational text: trained on FineWeb-Edu; other domains and languages will be poor.
- Repeated data: 1B unique tokens × ~21 epochs limits long-range coherence.
- Single run: only the MLA+Muon configuration (Run D) was trained to completion; no ablation comparison is available.
Acknowledgments
Architecture is based on modded-nanogpt by Keller Jordan. MLA adapted from the DeepSeek-V2 HuggingFace implementation (MIT licensed). Muon optimizer by Keller Jordan.