Instructions to use codechrl/modernbert-tiny with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use codechrl/modernbert-tiny with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("fill-mask", model="codechrl/modernbert-tiny")# Load model directly from transformers import AutoTokenizer, AutoModelForMaskedLM tokenizer = AutoTokenizer.from_pretrained("codechrl/modernbert-tiny") model = AutoModelForMaskedLM.from_pretrained("codechrl/modernbert-tiny") - Notebooks
- Google Colab
- Kaggle
modernbert-tiny
Smallest ModernBERT (TinyBERT-style distillation) — for edge / low-latency.
A compressed, fine-tunable base encoder derived from answerdotai/ModernBERT-base — the fork/derivative:
15.3% of the teacher's size while keeping 80.4% of its GLUE quality. Use it as a general base and
fine-tune on your downstream task, exactly like ModernBERT-base.
The family (one exercise)
All three were produced in one ModernBERT compression exercise — same teacher (answerdotai/ModernBERT-base), same FineWeb-Edu corpus, same GLUE eval — comparing different compression methods. Pick the tier that fits your size/quality budget:
codechrl/modernbert-tiny← you are here — 22.1M params, 15.3% of base size, 80.4% GLUE retained · TinyBERT-style attention+hidden distillationcodechrl/modernbert-mini— 69.4M params, 46.7% of base size, 92.9% GLUE retained · DistilBERT-style depth distillationcodechrl/modernbert-lite— 149.7M params, 50.3% of base size, 99.3% GLUE retained · fp16 half-precision quantization
How it was made (general process)
- Teacher —
answerdotai/ModernBERT-base(149.7M params), the distillation target. - General-corpus distillation — the student learns from the teacher on FineWeb-Edu (general English web
text) using the
tinybertrecipe. No task-/domain-specific data, so it stays a general base. - Evaluation — quality measured on GLUE (SST-2, MRPC, STS-B, RTE; each model fine-tuned identically), reported purely as % retained vs the teacher.
Scores (% against the ModernBERT-base teacher)
- Size: 92.0 MB → 15.3% of baseline (params 22.1M)
- GLUE quality retained: 80.4%
- eff_score: 82.6 / 100 =
0.5 · GLUE_retention% + 0.5 · size_reduction%(higher is better)
Full tier comparison
| model | params (M) | size (MB) | size vs base | GLUE vs base | eff_score |
|---|---|---|---|---|---|
ModernBERT-base (teacher) |
149.7 | 602.2 | 100% | 100% | 50.0 |
| modernbert-tiny ⭐ | 22.1 | 92.0 | 15.3% | 80.4% | 82.6 |
modernbert-mini |
69.4 | 281.2 | 46.7% | 92.9% | 73.1 |
modernbert-lite |
149.7 | 302.9 | 50.3% | 99.3% | 74.5 |
Methods & architecture (each tier)
Every tier derives from the same teacher but uses a different compression method:
modernbert-tiny ⭐
4 transformer layers, hidden size 312, 12 heads (~22M params)
TinyBERT-style distillation. A small student mimics multiple internal signals of the teacher: token embeddings, per-layer hidden states (compared L2-normalized for stability), attention probability maps, and output-logit KL. This deep multi-signal supervision lets a much narrower/shallower network recover usable quality.
modernbert-mini
6 transformer layers, hidden size 768 (~69M params)
DistilBERT-style distillation. The 6-layer student is initialized from evenly-spaced teacher layers, then trained with masked-LM loss + soft-logit KL divergence + last-hidden cosine. Depth-only reduction (full width kept) is the best quality-per-byte recipe here.
modernbert-lite
full ModernBERT (22 layers, hidden 768, ~150M params), weights stored in float16
Half-precision (fp16) quantization. No retraining — weights are cast to 16-bit, roughly halving storage and memory with near-zero quality loss. Re-load in fp32 (or bf16) to fine-tune.
Usage
from transformers import AutoModelForMaskedLM, AutoTokenizer
tok = AutoTokenizer.from_pretrained("codechrl/modernbert-tiny")
model = AutoModelForMaskedLM.from_pretrained("codechrl/modernbert-tiny")
# fine-tune for your task:
# from transformers import AutoModelForSequenceClassification
# clf = AutoModelForSequenceClassification.from_pretrained("codechrl/modernbert-tiny", num_labels=N)
Intended use & limitations
- A base to fine-tune, not a finished classifier.
- Distilled on a small compute budget (demo-grade); for production, redistill with more steps/corpus.
tinytrades the most quality for the smallest size;mini/literetain more.
Citation
Built on ModernBERT (Warner et al., 2024). Distillation recipes: DistilBERT (Sanh 2019), TinyBERT (Jiao 2020).
- Downloads last month
- 16
Model tree for codechrl/modernbert-tiny
Base model
answerdotai/ModernBERT-base