TinyLM: 275M Parameter SLM

A 275M-parameter causal language model trained from scratch on 1B unique tokens of FineWeb-Edu, using Multi-head Latent Attention (MLA, from DeepSeek-V2) and the Muon optimizer.

Built as a research portfolio piece to study the effect of modern KV-compression (MLA) and second-order optimizer improvements (Muon) at 275M scale.


Model Details

| Property | Value |
|---|---|
| Parameters | ~275M |
| Architecture | Transformer (MLA) |
| Layers | 18 |
| d_model | 1024 |
| Attention heads | 16 |
| KV latent dim (MLA) | 512 |
| Decoupled RoPE dim | 64 |
| FFN hidden | 2816 (SwiGLU) |
| Context length | 2048 |
| Vocab size | 32,000 |
| Tokenizer | Llama-2 |

Attention (MLA): The KV cache stores d_latent + d_rope = 576 values per token per layer instead of n_heads × head_dim × 2 = 2048, giving a ~3.6× KV cache reduction at inference time. Positional information is carried only by the d_rope = 64 decoupled RoPE branch; the full latent path has no positional bias.
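
The cache arithmetic can be checked with a few lines of Python. This is a sketch using the values from the Model Details table; the BF16 cache dtype (2 bytes per value) and the helper name `cache_mib` are assumptions made for illustration, not something stated by the project.

```python
# Values taken from the Model Details table.
n_layers, d_model, n_heads = 18, 1024, 16
d_latent, d_rope = 512, 64
ctx = 2048
bytes_per_value = 2  # assumption: the cache is held in BF16

head_dim = d_model // n_heads  # 1024 / 16 = 64

# MLA caches one compressed latent plus one shared RoPE key per token.
mla_per_token = d_latent + d_rope
# Standard multi-head attention caches full keys and values for every head.
mha_per_token = n_heads * head_dim * 2

reduction = mha_per_token / mla_per_token


def cache_mib(values_per_token: int) -> float:
    """Cache size in MiB for one full-context sequence across all layers."""
    return values_per_token * n_layers * ctx * bytes_per_value / 2**20


print(f"MLA: {mla_per_token} values/token/layer, {cache_mib(mla_per_token):.1f} MiB")
print(f"MHA: {mha_per_token} values/token/layer, {cache_mib(mha_per_token):.1f} MiB")
print(f"reduction: {reduction:.2f}x")
```

Under these assumptions, the per-sequence saving grows linearly with context length and batch size, which is where the compressed cache matters most.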

Optimizer (Muon): Newton-Schulz orthogonalization is applied to weight gradients before the AdamW update step, providing approximate second-order curvature correction without the cost of a full Hessian.
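
To make the orthogonalization step concrete, here is a minimal dependency-free sketch. It uses the classic cubic Newton-Schulz iteration, not the tuned quintic variant used in the reference Muon implementation, and it operates on plain Python lists rather than BF16 tensors. The function names and the example matrix are illustrative assumptions; this is not the project's training code.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def newton_schulz_orthogonalize(grad, steps=12):
    """Approximate the semi-orthogonal polar factor of `grad`.

    Classic cubic iteration: X <- 1.5 * X - 0.5 * (X X^T) X.
    Each step pushes every singular value of X toward 1.
    """
    # Dividing by the Frobenius norm bounds all singular values by 1,
    # which keeps the iteration inside its convergence region.
    fro = math.sqrt(sum(v * v for row in grad for v in row))
    x = [[v / (fro + 1e-12) for v in row] for row in grad]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [
            [1.5 * xv - 0.5 * yv for xv, yv in zip(x_row, y_row)]
            for x_row, y_row in zip(x, xxt_x)
        ]
    return x


# Hypothetical 2x3 "gradient" used only to exercise the function.
grad = [[2.0, 0.0, 1.0], [1.0, 3.0, 0.0]]
update = newton_schulz_orthogonalize(grad)
gram = matmul(update, transpose(update))  # close to the 2x2 identity
print(gram)
```

The result keeps the row and column directions of the input but flattens its singular values, so no single direction dominates the update. In the reference Muon implementation the iteration is applied to the momentum buffer of 2D weight matrices; how this project wires it into the update step is not shown here.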

Other: RMSNorm, SwiGLU activations, tied input/output embeddings, BF16 training.


Training

| Property | Value |
|---|---|
| Dataset | FineWeb-Edu (1B unique tokens, 10 shards) |
| Steps | 20,000 |
| Effective epochs | ~21 (data repeated) |
| Batch size | 512 sequences × 2048 tokens |
| LR schedule | Cosine with warmup |
| Precision | BF16 |
| Hardware | 1× A100-80GB |
| Final loss | 2.22 |
| Gradient norm (final) | 0.094 |
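
For readers unfamiliar with the schedule, a cosine schedule with linear warmup can be sketched as follows. Only the 20,000-step total comes from this card; `warmup_steps`, `peak_lr`, and `min_lr` are placeholder values chosen for illustration and are not the hyperparameters used for this run.

```python
import math


def lr_at(step, total_steps=20_000, warmup_steps=500, peak_lr=3e-4, min_lr=3e-5):
    """Learning rate at a given step: linear warmup, then cosine decay.

    warmup_steps, peak_lr, and min_lr are hypothetical placeholders.
    """
    if step < warmup_steps:
        # Ramp linearly from peak_lr / warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, from 0.0 to 1.0.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 499, 500, 10_250, 20_000):
    print(step, lr_at(step))
```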

Training logs: WandB run

Note: Training uses 1B unique tokens repeated ~21× over 20k steps (no annealing mix). This limits long-range coherence; benchmarks that reward it (HellaSwag, LAMBADA) are most affected.
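
The ~21 epoch figure follows directly from the batch size and step count. A short check, taking "1B unique tokens" as exactly 10^9 (an assumption; the exact shard token count is not given here):

```python
sequences_per_step = 512
tokens_per_sequence = 2048
steps = 20_000
unique_tokens = 1_000_000_000  # assumption: "1B" taken as exactly 1e9

tokens_per_step = sequences_per_step * tokens_per_sequence
total_tokens = tokens_per_step * steps
effective_epochs = total_tokens / unique_tokens

print(f"tokens per step:  {tokens_per_step:,}")
print(f"total tokens:     {total_tokens:,}")
print(f"effective epochs: {effective_epochs:.2f}")
```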


Benchmark Results (0-shot)

Evaluated with lm-eval v0.4.12. Baseline is TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T (trained on 3T tokens, 4× larger).

| Benchmark | TinyLM 275M | TinyLlama 1.1B | Δ (pp) |
|---|---|---|---|
| HellaSwag (acc) | 32.4% | 59.1% | −26.7 |
| ARC-Easy (acc) | 53.8% | 55.7% | −1.9 |
| LAMBADA (acc) | 29.2% | 58.9% | −29.7 |
| Winogrande (acc) | 50.0% | 58.9% | −8.9 |
| Average | 41.3% | 58.2% | −16.9 |

ARC-Easy is within 2 percentage points of a model with 4× more parameters trained on roughly 140× more tokens (3T versus about 21B here, counting repeats). HellaSwag and LAMBADA are weak, which is expected from a model trained on heavily repeated data: both tasks reward long-range coherence.
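
The per-task gaps and averages can be recomputed from the reported scores. This sketch works from the one-decimal figures shown on this card, so the last digit of an average can differ from one taken over unrounded scores.

```python
# (TinyLM 275M, TinyLlama 1.1B) accuracy in percent, as reported on this card.
scores = {
    "HellaSwag": (32.4, 59.1),
    "ARC-Easy": (53.8, 55.7),
    "LAMBADA": (29.2, 58.9),
    "Winogrande": (50.0, 58.9),
}

# Gap in percentage points (negative means TinyLM is behind the baseline).
deltas = {task: tiny - base for task, (tiny, base) in scores.items()}

tiny_avg = sum(tiny for tiny, _ in scores.values()) / len(scores)
base_avg = sum(base for _, base in scores.values()) / len(scores)

for task, delta in deltas.items():
    print(f"{task:<12} {delta:+.1f} pp")
print(f"{'Average':<12} {tiny_avg:.2f} vs {base_avg:.2f}")
```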


Checkpoint

The model checkpoint (step_19999.pt, 2.33 GB) is in a separate repo:

Shiv-22/tinylm-checkpoints

generate.py downloads it automatically on first run.


Usage

Requirements

pip install torch transformers huggingface_hub

Quick start

git clone https://huggingface.co/Shiv-22/tinylm
cd tinylm
python generate.py --prompt "The theory of relativity states that"

Interactive mode:

python generate.py

Greedy decoding:

python generate.py --prompt "Once upon a time" --temperature 0
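
For context on why a temperature of 0 gives greedy decoding, here is a minimal sketch of temperature sampling over raw logits. It is an illustration under stated assumptions: `sample_token` is a hypothetical helper, and the actual logic in generate.py may differ.

```python
import math
import random


def sample_token(logits, temperature=1.0, rng=random):
    """Pick a token index from raw logits.

    A temperature of 0 (or below) is treated as greedy decoding, since
    dividing by zero is undefined and the limit of the softmax as the
    temperature approaches 0 puts all probability on the largest logit.
    """
    if temperature <= 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    # random.choices normalizes the weights internally.
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


logits = [1.0, 3.0, 2.0]  # toy example, not real model output
greedy = sample_token(logits, temperature=0)
sampled = sample_token(logits, temperature=0.8, rng=random.Random(0))
print(greedy, sampled)
```

Lower temperatures sharpen the distribution toward the top logit, and higher ones flatten it.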

Programmatic loading

import torch
from huggingface_hub import hf_hub_download
from tinylm.model import ModelConfig, TinyLM

# Download the checkpoint from the Hub (cached locally after the first call).
ckpt_path = hf_hub_download("Shiv-22/tinylm-checkpoints", "step_19999.pt")
ckpt = torch.load(ckpt_path, map_location="cpu", weights_only=True)
c = ckpt["config"]

# Rebuild the architecture from the hyperparameters stored in the checkpoint.
model = TinyLM(ModelConfig(
    n_layers=c["n_layers"], d_model=c["d_model"], n_heads=c["n_heads"],
    d_latent=c["d_latent"], d_rope=c["d_rope"], ffn_hidden=c["ffn_hidden"],
    ctx=c["ctx"], vocab_size=c["vocab_size"], tie_weights=c["tie_weights"],
    attention=c["attention"],
))
state = ckpt["model"]
# Weights saved from a torch.compile-wrapped model carry an "_orig_mod." key
# prefix; strip it so the keys match the plain module (needs Python 3.9+).
if any(k.startswith("_orig_mod.") for k in state):
    state = {k.removeprefix("_orig_mod."): v for k, v in state.items()}
model.load_state_dict(state)
model.eval()  # switch to inference mode

Limitations

  • Base model only: no instruction tuning or RLHF. Outputs raw continuations.
  • English educational text: trained on FineWeb-Edu; other domains/languages will be poor.
  • Repeated data: 1B unique tokens × ~21 epochs limits long-range coherence.
  • Single run: only the MLA+Muon configuration (Run D) was trained to completion; no ablation comparison is available.

Acknowledgments

Architecture is based on modded-nanogpt by Keller Jordan. MLA adapted from the DeepSeek-V2 HuggingFace implementation (MIT licensed). Muon optimizer by Keller Jordan.
