modernbert-tiny

The smallest ModernBERT in this family, produced by TinyBERT-style distillation, for edge and low-latency deployment.

A compressed, fine-tunable base encoder derived from answerdotai/ModernBERT-base: 15.3% of the teacher's size while retaining 80.4% of its GLUE quality. Use it as a general-purpose base and fine-tune it on your downstream task, just as you would ModernBERT-base.

The family (one exercise)

All three models were produced in a single ModernBERT compression exercise that compared different compression methods, using the same teacher (answerdotai/ModernBERT-base), the same FineWeb-Edu corpus, and the same GLUE evaluation. Pick the tier that fits your size/quality budget:

codechrl/modernbert-tiny ← you are here — 22.1M params, 15.3% of base size, 80.4% GLUE retained · TinyBERT-style attention+hidden distillation
codechrl/modernbert-mini — 69.4M params, 46.7% of base size, 92.9% GLUE retained · DistilBERT-style depth distillation
codechrl/modernbert-lite — 149.7M params, 50.3% of base size, 99.3% GLUE retained · fp16 half-precision quantization
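
As a rough illustration of picking a tier, the sketch below filters the three models by a size budget and a minimum GLUE retention. The sizes and retention figures are the ones reported on this card; the `TIER_SPECS` list and the `pick_tier` helper are hypothetical names introduced only for this example.

```python
# Sizes (MB) and GLUE retention (%) as reported on this card.
TIER_SPECS = [
    {"name": "codechrl/modernbert-tiny", "size_mb": 92.0, "glue_retained": 80.4},
    {"name": "codechrl/modernbert-mini", "size_mb": 281.2, "glue_retained": 92.9},
    {"name": "codechrl/modernbert-lite", "size_mb": 302.9, "glue_retained": 99.3},
]


def pick_tier(max_size_mb, min_glue_retained=0.0):
    """Return the highest-quality tier within the budget, or None if nothing fits."""
    candidates = [
        t for t in TIER_SPECS
        if t["size_mb"] <= max_size_mb and t["glue_retained"] >= min_glue_retained
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda t: t["glue_retained"])["name"]


print(pick_tier(max_size_mb=100))   # only tiny fits under 100 MB
print(pick_tier(max_size_mb=300))   # mini is the best fit under 300 MB
print(pick_tier(max_size_mb=50))    # nothing fits, so this prints None
```

Ranking by retained GLUE quality is only one possible policy; a latency-bound deployment might rank by size instead.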

How it was made (general process)

Teacher: answerdotai/ModernBERT-base (149.7M params), the distillation target.
General-corpus distillation: the student learns from the teacher on FineWeb-Edu (general English web text) using the TinyBERT recipe. No task- or domain-specific data is used, so the model stays a general base.
Evaluation: quality is measured on four GLUE tasks (SST-2, MRPC, STS-B, RTE), with each model fine-tuned identically, and reported only as the percentage retained relative to the teacher.
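
One way to read "% retained" is the student's average task score divided by the teacher's average, times 100. The card does not spell out the aggregation, so that choice is an assumption, and the per-task numbers below are made-up placeholders, not measured results. The `glue_retention` helper is a hypothetical name.

```python
# Placeholder per-task scores for illustration only, NOT results from this card.
teacher_scores = {"SST-2": 0.90, "MRPC": 0.85, "STS-B": 0.88, "RTE": 0.60}
student_scores = {"SST-2": 0.81, "MRPC": 0.68, "STS-B": 0.66, "RTE": 0.48}


def glue_retention(student, teacher):
    """Percent of the teacher's mean task score kept by the student.

    Assumed aggregation: ratio of the two means over the teacher's task set.
    """
    tasks = sorted(teacher)
    student_mean = sum(student[t] for t in tasks) / len(tasks)
    teacher_mean = sum(teacher[t] for t in tasks) / len(tasks)
    return 100.0 * student_mean / teacher_mean


print(f"{glue_retention(student_scores, teacher_scores):.1f}% retained")
```

Averaging per-task ratios instead of taking the ratio of averages would give a slightly different number, so the convention should be fixed before comparing models.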

Scores (% against the ModernBERT-base teacher)

Size: 92.0 MB → 15.3% of baseline (params 22.1M)
GLUE quality retained: 80.4%
eff_score: 82.6 out of 100, where eff_score = 0.5 · GLUE_retention% + 0.5 · size_reduction%, with size_reduction% taken as 100% minus the size-vs-base percentage (higher is better)

Full tier comparison

model	params (M)	size (MB)	size vs base	GLUE vs base	eff_score
`ModernBERT-base` (teacher)	149.7	602.2	100%	100%	50.0
modernbert-tiny ⭐	22.1	92.0	15.3%	80.4%	82.6
`modernbert-mini`	69.4	281.2	46.7%	92.9%	73.1
`modernbert-lite`	149.7	302.9	50.3%	99.3%	74.5
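
The eff_score column can be recomputed from the size and GLUE columns. The sketch below assumes size_reduction% means 100 × (1 − size / base size), which is not stated explicitly on this card but is consistent with all four rows; `TIER_TABLE` and `eff_score` are names chosen for this example.

```python
# (size in MB, GLUE retained in %) as listed in the tier comparison.
BASE_SIZE_MB = 602.2
TIER_TABLE = {
    "ModernBERT-base": (602.2, 100.0),
    "modernbert-tiny": (92.0, 80.4),
    "modernbert-mini": (281.2, 92.9),
    "modernbert-lite": (302.9, 99.3),
}


def eff_score(size_mb, glue_retained, base_size_mb=BASE_SIZE_MB):
    """0.5 * GLUE retention % + 0.5 * size reduction %, both on a 0-100 scale."""
    size_reduction = 100.0 * (1.0 - size_mb / base_size_mb)  # assumed definition
    return 0.5 * glue_retained + 0.5 * size_reduction


for name, (size_mb, glue) in TIER_TABLE.items():
    print(f"{name:16s} eff_score = {eff_score(size_mb, glue):.1f}")
```

Because the two terms are weighted equally, an uncompressed model with full quality scores exactly 50, which is why the teacher sits at the bottom of the eff_score column.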

Methods & architecture (each tier)

Every tier derives from the same teacher but uses a different compression method:

`modernbert-tiny` ⭐

4 transformer layers, hidden size 312, 12 heads (~22M params)

TinyBERT-style distillation. A small student mimics several internal signals of the teacher: token embeddings, per-layer hidden states (compared after L2 normalization for stability), attention probability maps, and output logits (via KL divergence). This deep, multi-signal supervision lets a much narrower and shallower network recover usable quality.
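
To make the four signals concrete, here is a framework-free toy sketch of such a combined loss for a single token. It is not the training code behind this model: the equal weighting, the temperature, the use of MSE for embeddings and attention maps, and the assumption that the student's vectors have already been projected to the teacher's width are all choices made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions, assuming every q_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def tinybert_style_loss(student, teacher, temperature=2.0):
    """Equal-weight sum of embedding, hidden-state, attention, and logit terms.

    Both arguments are dicts with keys "embedding", "hidden", "attention",
    and "logits" (hypothetical layout for this sketch).
    """
    embedding_term = mse(student["embedding"], teacher["embedding"])
    hidden_term = mse(l2_normalize(student["hidden"]), l2_normalize(teacher["hidden"]))
    attention_term = mse(student["attention"], teacher["attention"])
    logit_term = kl_divergence(
        softmax(teacher["logits"], temperature),
        softmax(student["logits"], temperature),
    )
    return embedding_term + hidden_term + attention_term + logit_term


teacher_token = {
    "embedding": [0.2, -0.1, 0.4],
    "hidden": [1.0, 2.0, 2.0],
    "attention": [0.7, 0.2, 0.1],
    "logits": [2.0, 0.5, -1.0],
}
student_token = {
    "embedding": [0.1, -0.1, 0.3],
    "hidden": [0.5, 1.0, 1.2],
    "attention": [0.6, 0.3, 0.1],
    "logits": [1.5, 0.7, -0.8],
}
print(tinybert_style_loss(student_token, teacher_token))
```

Normalizing the hidden states before comparing them makes that term insensitive to overall scale, so the student is rewarded for matching direction rather than magnitude.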

`modernbert-mini`

6 transformer layers, hidden size 768 (~69M params)

DistilBERT-style distillation. The 6-layer student is initialized from evenly spaced teacher layers, then trained with a masked-LM loss, soft-logit KL divergence, and a cosine loss on the last hidden state. Depth-only reduction (full width kept) retains more quality than tiny (92.9% vs 80.4% of the teacher's GLUE score) at roughly three times the size.
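
"Evenly spaced" can be implemented in more than one way. The sketch below shows one plausible rule, keeping the first and last teacher layers and spreading the rest uniformly; the exact indices used for this model are not given on the card, and `evenly_spaced_layers` is a hypothetical helper.

```python
def evenly_spaced_layers(n_teacher_layers, n_student_layers):
    """Pick 0-indexed teacher layers spread evenly across the teacher's depth."""
    if n_student_layers == 1:
        return [n_teacher_layers - 1]
    step = (n_teacher_layers - 1) / (n_student_layers - 1)
    return [round(i * step) for i in range(n_student_layers)]


# 22 teacher layers and 6 student layers, as described on this card.
print(evenly_spaced_layers(22, 6))
```

Other conventions, such as taking every k-th layer from the bottom, are equally valid starting points; the subsequent distillation is what adapts the copied layers to the shallower stack.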

`modernbert-lite`

full ModernBERT (22 layers, hidden 768, ~150M params), weights stored in float16

Half-precision (fp16) quantization. No retraining — weights are cast to 16-bit, roughly halving storage and memory with near-zero quality loss. Re-load in fp32 (or bf16) to fine-tune.
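
The halving follows from simple arithmetic: 4 bytes per fp32 weight versus 2 bytes per fp16 weight. The sketch below uses only the Python standard library to estimate raw weight storage from the 149.7M parameter count and to show the small rounding error a value picks up in half precision. `storage_mb` and `to_fp16_and_back` are names made up for this example, and MB is taken as 10^6 bytes, which is an assumption about how the card measures size.

```python
import struct

PARAMS = 149.7e6  # teacher parameter count reported on this card


def storage_mb(n_params, bytes_per_param):
    """Raw weight storage in MB (10^6 bytes), ignoring file metadata."""
    return n_params * bytes_per_param / 1e6


def to_fp16_and_back(x):
    """Round-trip a Python float through IEEE 754 half precision."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


print(f"fp32: {storage_mb(PARAMS, 4):.1f} MB")
print(f"fp16: {storage_mb(PARAMS, 2):.1f} MB")
print(to_fp16_and_back(0.1))  # close to 0.1, but not exactly equal
```

The estimates land a few MB below the 602.2 MB and 302.9 MB listed in the comparison table, presumably because the parameter count is rounded and the files carry some overhead.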

Usage

from transformers import AutoModelForMaskedLM, AutoTokenizer
tok = AutoTokenizer.from_pretrained("codechrl/modernbert-tiny")
model = AutoModelForMaskedLM.from_pretrained("codechrl/modernbert-tiny")

# fine-tune for your task:
# from transformers import AutoModelForSequenceClassification
# clf = AutoModelForSequenceClassification.from_pretrained("codechrl/modernbert-tiny", num_labels=N)

Intended use & limitations

A base to fine-tune, not a finished classifier.
Distilled on a small compute budget (demo-grade); for production use, re-distill with more training steps and a larger corpus.
tiny trades the most quality for the smallest size; mini/lite retain more.

Citation

Built on ModernBERT (Warner et al., 2024). Distillation recipes: DistilBERT (Sanh et al., 2019) and TinyBERT (Jiao et al., 2020).

Safetensors · Model size: 22.1M params · Tensor type: F32
