modernbert-lite

Full ModernBERT in half precision — near-baseline quality at ~half the storage.

A compressed, fine-tunable base encoder derived from answerdotai/ModernBERT-base. It is 50.3% of the teacher's size while keeping 99.3% of its GLUE quality. Use it as a general base and fine-tune it on your downstream task, exactly like ModernBERT-base.

The family (one exercise)

All three were produced in one ModernBERT compression exercise — same teacher (answerdotai/ModernBERT-base), same FineWeb-Edu corpus, same GLUE eval — comparing different compression methods. Pick the tier that fits your size/quality budget:

  • codechrl/modernbert-tiny: 22.1M params, 15.3% of base size, 80.4% GLUE retained · TinyBERT-style attention+hidden distillation
  • codechrl/modernbert-mini: 69.4M params, 46.7% of base size, 92.9% GLUE retained · DistilBERT-style depth distillation
  • codechrl/modernbert-lite (you are here): 149.7M params, 50.3% of base size, 99.3% GLUE retained · fp16 half-precision quantization

How it was made (general process)

  1. Teacher: answerdotai/ModernBERT-base (149.7M params), the reference model for every tier.
  2. General-corpus compression: the distilled tiers learn from the teacher on FineWeb-Edu (general English web text); this tier uses the fp16 recipe, a weight cast with no retraining (see modernbert-lite under Methods & architecture). No task- or domain-specific data is involved, so it stays a general base.
  3. Evaluation: quality is measured on four GLUE tasks (SST-2, MRPC, STS-B, RTE), with each model fine-tuned identically, and reported as % retained relative to the teacher.

Scores (% against the ModernBERT-base teacher)

  • Size: 302.9 MB → 50.3% of baseline (params 149.7M)
  • GLUE quality retained: 99.3%
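  • eff_score: 74.5 out of 100, where eff_score = 0.5 · GLUE_retention% + 0.5 · size_reduction% and size_reduction% = 100 − size vs base (higher is better). Here: 0.5 · 99.3 + 0.5 · 49.7 = 74.5.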
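The eff_score arithmetic can be checked against the tier comparison with a few lines of Python. This is a minimal sketch: the function name `eff_score` and the `tiers` dictionary are illustrative, and it assumes size_reduction% means 100 minus the "size vs base" percentage, which is what the published numbers imply.

```python
def eff_score(glue_retained_pct: float, size_vs_base_pct: float) -> float:
    """Equal-weight blend of quality retained and size saved, on a 0-100 scale."""
    # Assumption: size reduction is the complement of "size vs base".
    size_reduction_pct = 100.0 - size_vs_base_pct
    return 0.5 * glue_retained_pct + 0.5 * size_reduction_pct


# (GLUE vs base %, size vs base %) copied from the tier comparison table.
tiers = {
    "ModernBERT-base (teacher)": (100.0, 100.0),
    "modernbert-tiny": (80.4, 15.3),
    "modernbert-mini": (92.9, 46.7),
    "modernbert-lite": (99.3, 50.3),
}

for name, (glue_pct, size_pct) in tiers.items():
    print(f"{name}: {eff_score(glue_pct, size_pct):.2f}")
```

Because the inputs here are already rounded to one decimal, a recomputed score can differ from the published one in the last digit (the tiny tier comes out near 82.55 against the listed 82.6).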

Full tier comparison

| model | params (M) | size (MB) | size vs base | GLUE vs base | eff_score |
|---|---|---|---|---|---|
| ModernBERT-base (teacher) | 149.7 | 602.2 | 100% | 100% | 50.0 |
| modernbert-tiny | 22.1 | 92.0 | 15.3% | 80.4% | 82.6 |
| modernbert-mini | 69.4 | 281.2 | 46.7% | 92.9% | 73.1 |
| modernbert-lite | 149.7 | 302.9 | 50.3% | 99.3% | 74.5 |

Methods & architecture (each tier)

Every tier derives from the same teacher but uses a different compression method:

modernbert-tiny

4 transformer layers, hidden size 312, 12 heads (~22M params)

TinyBERT-style distillation. A small student mimics multiple internal signals of the teacher: token embeddings, per-layer hidden states (compared L2-normalized for stability), attention probability maps, and output-logit KL. This deep multi-signal supervision lets a much narrower/shallower network recover usable quality.
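Two of the signals named above, the L2-normalized hidden-state comparison and the output-logit KL, can be illustrated on toy vectors. This is a sketch under stated assumptions, not the training code used for this tier: all function names are hypothetical, the vectors are assumed to be already projected to a common dimension (the student's hidden size of 312 differs from the teacher's 768), and the embedding and attention-map terms are omitted.

```python
import math


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit L2 norm; eps guards against division by zero."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / (norm + eps) for x in vec]


def normalized_mse(student_hidden, teacher_hidden):
    """MSE between L2-normalized vectors, so overall magnitude is ignored."""
    s = l2_normalize(student_hidden)
    t = l2_normalize(teacher_hidden)
    return sum((a - b) ** 2 for a, b in zip(s, t)) / len(s)


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def logit_kl(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) between softened output distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# A student vector that points the same way as the teacher's but is 10x
# smaller incurs (almost) no hidden-state loss after normalization.
print(normalized_mse([0.1, 0.2, 0.3], [1.0, 2.0, 3.0]))
print(logit_kl([2.0, 0.5, -1.0], [1.5, 0.7, -0.5], temperature=2.0))
```

Normalizing before comparison is one way to keep the loss from being dominated by differences in activation scale between a narrow student and a wide teacher, which is the stability concern the description mentions.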

modernbert-mini

6 transformer layers, hidden size 768 (~69M params)

DistilBERT-style distillation. The 6-layer student is initialized from evenly-spaced teacher layers, then trained with three losses: masked-LM, soft-logit KL divergence, and cosine similarity on the last hidden state. This is a depth-only reduction (full width kept); in the tier comparison it retains 92.9% of GLUE at 46.7% of base size.
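One plausible reading of "evenly-spaced teacher layers" is sketched below. The card does not list which teacher layers were copied, so the helper `evenly_spaced_layers` and its choice to include both the first and the last teacher layer are assumptions made for illustration.

```python
def evenly_spaced_layers(num_teacher_layers: int, num_student_layers: int) -> list[int]:
    """Return 0-based teacher layer indices spread evenly from first to last."""
    if not 2 <= num_student_layers <= num_teacher_layers:
        raise ValueError("need 2 <= num_student_layers <= num_teacher_layers")
    # Fractional stride between consecutive picks, rounded to the nearest layer.
    step = (num_teacher_layers - 1) / (num_student_layers - 1)
    return [round(i * step) for i in range(num_student_layers)]


# 22 teacher layers down to 6 student layers, as in this tier.
print(evenly_spaced_layers(22, 6))  # -> [0, 4, 8, 13, 17, 21]
```

Student layer i would then start from the weights of the teacher layer at position i in that list. Other spacings (for example, every fourth layer starting from the top) are equally consistent with the description.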

modernbert-lite

full ModernBERT (22 layers, hidden 768, ~150M params), weights stored in float16

Half-precision (fp16) quantization. No retraining — weights are cast to 16-bit, roughly halving storage and memory with near-zero quality loss. Re-load in fp32 (or bf16) to fine-tune.
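The storage claim follows from bytes per parameter, and the "near-zero quality loss" from how small the rounding error of a 16-bit cast is. The sketch below uses only the standard library; the helper names are hypothetical, 1 MB is assumed to mean 10^6 bytes, and the estimate counts raw weights only, so it lands slightly under the reported file sizes.

```python
import struct

PARAMS = 149.7e6  # parameter count reported on this card


def size_mb(num_params: float, bytes_per_param: int) -> float:
    """Raw weight storage in MB (assuming 1 MB = 1e6 bytes, no file overhead)."""
    return num_params * bytes_per_param / 1e6


def fp16_round_trip(x: float) -> float:
    """Cast a value to IEEE 754 half precision and back to a Python float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


fp32_mb = size_mb(PARAMS, 4)  # about 598.8 MB, versus 602.2 MB reported
fp16_mb = size_mb(PARAMS, 2)  # about 299.4 MB, versus 302.9 MB reported
print(f"fp32: {fp32_mb:.1f} MB, fp16: {fp16_mb:.1f} MB")

# Rounding error introduced by the cast on a few weight-sized values.
for w in (0.5, 0.1, -0.0123):
    print(w, fp16_round_trip(w), abs(w - fp16_round_trip(w)))
```

fp16 keeps about three significant decimal digits and tops out at 65504, which is ample for stored weights but tight for gradient updates. That is a common reason to fine-tune in fp32 or bf16, as recommended above.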

Usage

from transformers import AutoModelForMaskedLM, AutoTokenizer
tok = AutoTokenizer.from_pretrained("codechrl/modernbert-lite")
model = AutoModelForMaskedLM.from_pretrained("codechrl/modernbert-lite")

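# Fine-tune for your task. The stored weights are fp16, so request fp32 explicitly.
# num_labels=2 is a placeholder; set it to your task's label count.
import torch
from transformers import AutoModelForSequenceClassification
clf = AutoModelForSequenceClassification.from_pretrained(
    "codechrl/modernbert-lite", num_labels=2, torch_dtype=torch.float32
)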

Intended use & limitations

  • A base to fine-tune, not a finished classifier.
  • The distilled tiers (tiny, mini) were trained on a small compute budget (demo-grade); for production, redistill with more steps and a larger corpus.
  • tiny trades the most quality for the smallest size; mini/lite retain more.

Citation

Built on ModernBERT (Warner et al., 2024). Distillation recipes: DistilBERT (Sanh et al., 2019), TinyBERT (Jiao et al., 2020).
