modernbert-tiny: compressed ModernBERT base (tinybert)

75da445 verified 6 days ago

4.78 kB

	---
	license: apache-2.0
	base_model: answerdotai/ModernBERT-base
	base_model_relation: finetune
	library_name: transformers
	pipeline_tag: fill-mask
	language:
	- en
	datasets:
	- HuggingFaceFW/fineweb-edu
	tags:
	- modernbert
	- distillation
	- knowledge-distillation
	- model-compression
	- fill-mask
	---
	# modernbert-tiny

	Smallest ModernBERT (TinyBERT-style distillation) — for edge / low-latency.

	A compressed, fine-tunable base encoder derived from [`answerdotai/ModernBERT-base`](https://huggingface.co/answerdotai/ModernBERT-base) — the fork/derivative:
	15.3% of the teacher's size while keeping 80.4% of its GLUE quality. Use it as a general base and
	fine-tune on your downstream task, exactly like ModernBERT-base.

	## The family (one exercise)

	All three were produced in one ModernBERT compression exercise — same teacher ([`answerdotai/ModernBERT-base`](https://huggingface.co/answerdotai/ModernBERT-base)), same FineWeb-Edu corpus, same GLUE eval — comparing different compression methods. Pick the tier that fits your size/quality budget:

	- [`codechrl/modernbert-tiny`](https://huggingface.co/codechrl/modernbert-tiny) ← you are here — 22.1M params, 15.3% of base size, 80.4% GLUE retained · TinyBERT-style attention+hidden distillation
	- [`codechrl/modernbert-mini`](https://huggingface.co/codechrl/modernbert-mini) — 69.4M params, 46.7% of base size, 92.9% GLUE retained · DistilBERT-style depth distillation
	- [`codechrl/modernbert-lite`](https://huggingface.co/codechrl/modernbert-lite) — 149.7M params, 50.3% of base size, 99.3% GLUE retained · fp16 half-precision quantization

	## How it was made (general process)

	1. Teacher — `answerdotai/ModernBERT-base` (149.7M params), the distillation target.
	2. General-corpus distillation — the student learns from the teacher on FineWeb-Edu (general English web
	text) using the `tinybert` recipe. No task-/domain-specific data, so it stays a general base.
	3. Evaluation — quality measured on GLUE (SST-2, MRPC, STS-B, RTE; each model fine-tuned identically),
	reported purely as % retained vs the teacher.

	## Scores (% against the ModernBERT-base teacher)

	- Size: 92.0 MB → 15.3% of baseline (params 22.1M)
	- GLUE quality retained: 80.4%
	- eff_score: 82.6 / 100 = `0.5 · GLUE_retention% + 0.5 · size_reduction%` (higher is better)

	### Full tier comparison

	\| model \| params (M) \| size (MB) \| size vs base \| GLUE vs base \| eff_score \|
	\|---\|---\|---\|---\|---\|---\|
	\| `ModernBERT-base` (teacher) \| 149.7 \| 602.2 \| 100% \| 100% \| 50.0 \|
	\| modernbert-tiny ⭐ \| 22.1 \| 92.0 \| 15.3% \| 80.4% \| 82.6 \|
	\| `modernbert-mini` \| 69.4 \| 281.2 \| 46.7% \| 92.9% \| 73.1 \|
	\| `modernbert-lite` \| 149.7 \| 302.9 \| 50.3% \| 99.3% \| 74.5 \|

	## Methods & architecture (each tier)

	Every tier derives from the same teacher but uses a different compression method:

	### `modernbert-tiny` ⭐
	4 transformer layers, hidden size 312, 12 heads (~22M params)

	TinyBERT-style distillation. A small student mimics multiple internal signals of the teacher: token embeddings, per-layer hidden states (compared L2-normalized for stability), attention probability maps, and output-logit KL. This deep multi-signal supervision lets a much narrower/shallower network recover usable quality.

	### `modernbert-mini`
	6 transformer layers, hidden size 768 (~69M params)

	DistilBERT-style distillation. The 6-layer student is initialized from evenly-spaced teacher layers, then trained with masked-LM loss + soft-logit KL divergence + last-hidden cosine. Depth-only reduction (full width kept) is the best quality-per-byte recipe here.

	### `modernbert-lite`
	full ModernBERT (22 layers, hidden 768, ~150M params), weights stored in float16

	Half-precision (fp16) quantization. No retraining — weights are cast to 16-bit, roughly halving storage and memory with near-zero quality loss. Re-load in fp32 (or bf16) to fine-tune.


	## Usage

	```python
	from transformers import AutoModelForMaskedLM, AutoTokenizer
	tok = AutoTokenizer.from_pretrained("codechrl/modernbert-tiny")
	model = AutoModelForMaskedLM.from_pretrained("codechrl/modernbert-tiny")

	# fine-tune for your task:
	# from transformers import AutoModelForSequenceClassification
	# clf = AutoModelForSequenceClassification.from_pretrained("codechrl/modernbert-tiny", num_labels=N)
	```

	## Intended use & limitations

	- A base to fine-tune, not a finished classifier.
	- Distilled on a small compute budget (demo-grade); for production, redistill with more steps/corpus.
	- `tiny` trades the most quality for the smallest size; `mini`/`lite` retain more.

	## Citation

	Built on ModernBERT (Warner et al., 2024). Distillation recipes: DistilBERT (Sanh 2019), TinyBERT (Jiao 2020).