---
license: apache-2.0
base_model: answerdotai/ModernBERT-base
base_model_relation: finetune
library_name: transformers
pipeline_tag: fill-mask
language:
- en
datasets:
- HuggingFaceFW/fineweb-edu
tags:
- modernbert
- distillation
- knowledge-distillation
- model-compression
- fill-mask
---
# modernbert-lite

Full ModernBERT in half precision: near-baseline quality at roughly half the storage.

A **compressed, fine-tunable base encoder** derived from [`answerdotai/ModernBERT-base`](https://huggingface.co/answerdotai/ModernBERT-base). It is **50.3% of the teacher's size** while keeping **99.3% of its GLUE quality**. Use it as a general base and fine-tune it on your downstream task, exactly as you would ModernBERT-base.
## The family (one exercise)

All three models were produced in **one ModernBERT compression exercise** (same teacher, [`answerdotai/ModernBERT-base`](https://huggingface.co/answerdotai/ModernBERT-base); same FineWeb-Edu corpus; same GLUE eval) that compares different compression methods. **Pick the tier that fits your size/quality budget:**

- [`codechrl/modernbert-tiny`](https://huggingface.co/codechrl/modernbert-tiny): 22.1M params, 15.3% of base size, 80.4% GLUE retained · TinyBERT-style attention+hidden distillation
- [`codechrl/modernbert-mini`](https://huggingface.co/codechrl/modernbert-mini): 69.4M params, 46.7% of base size, 92.9% GLUE retained · DistilBERT-style depth distillation
- [`codechrl/modernbert-lite`](https://huggingface.co/codechrl/modernbert-lite) (**you are here**): 149.7M params, 50.3% of base size, 99.3% GLUE retained · fp16 half-precision quantization
## How it was made (general process)

1. **Teacher:** `answerdotai/ModernBERT-base` (149.7M params), the model every tier is compressed from.
2. **General-corpus distillation:** for the distilled tiers, the student learns from the teacher on **FineWeb-Edu** (general English web text). This model uses the `fp16` recipe, which casts the teacher's weights to half precision without retraining. No task- or domain-specific data is involved, so it stays a general base.
3. **Evaluation:** quality is measured on **GLUE** (SST-2, MRPC, STS-B, RTE; each model fine-tuned identically) and reported purely as **% retained vs the teacher**.
## Scores (% against the ModernBERT-base teacher)

- **Size:** 302.9 MB, **50.3% of baseline** (params 149.7M)
- **GLUE quality retained:** **99.3%**
- **eff_score:** 74.5 out of 100, where `eff_score = 0.5 · GLUE_retention% + 0.5 · size_reduction%` and `size_reduction% = 100 - size%` (higher is better)
### Full tier comparison

| model | params (M) | size (MB) | size vs base | GLUE vs base | eff_score |
|---|---|---|---|---|---|
| `ModernBERT-base` (teacher) | 149.7 | 602.2 | 100% | 100% | 50.0 |
| `modernbert-tiny` | 22.1 | 92.0 | 15.3% | 80.4% | 82.6 |
| `modernbert-mini` | 69.4 | 281.2 | 46.7% | 92.9% | 73.1 |
| **modernbert-lite** (this model) | 149.7 | 302.9 | 50.3% | 99.3% | 74.5 |
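
The `eff_score` column can be recomputed from the size and GLUE columns. The sketch below assumes that `size_reduction%` means `100 - size%` and that `size%` is the checkpoint size divided by the teacher's 602.2 MB; under those assumptions it reproduces the table values to within rounding. The names `TIERS`, `size_pct`, and `eff_score` are illustrative, not part of any released tooling.

```python
# (size in MB, GLUE retained in %) copied from the tier comparison table.
TIERS = {
    "ModernBERT-base": (602.2, 100.0),
    "modernbert-tiny": (92.0, 80.4),
    "modernbert-mini": (281.2, 92.9),
    "modernbert-lite": (302.9, 99.3),
}
BASE_SIZE_MB = TIERS["ModernBERT-base"][0]


def size_pct(size_mb: float) -> float:
    """Checkpoint size as a percentage of the teacher's size."""
    return 100.0 * size_mb / BASE_SIZE_MB


def eff_score(size_mb: float, glue_retained_pct: float) -> float:
    """Equal-weight blend of quality retained and size saved (0 to 100)."""
    # Assumption: size_reduction% = 100 - size%.
    size_reduction_pct = 100.0 - size_pct(size_mb)
    return 0.5 * glue_retained_pct + 0.5 * size_reduction_pct


for name, (size_mb, glue) in TIERS.items():
    print(f"{name:16s} size={size_pct(size_mb):5.1f}%  "
          f"eff_score={eff_score(size_mb, glue):.1f}")
```

Because the two terms are weighted equally, the uncompressed teacher scores 50.0 by construction: full quality, zero size reduction.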
## Methods & architecture (each tier)

Every tier derives from the **same teacher** but uses a different compression method:
### `modernbert-tiny`

*4 transformer layers, hidden size 312, 12 heads (~22M params)*

**TinyBERT-style distillation.** A small student mimics multiple internal signals of the teacher: token embeddings, per-layer hidden states (compared L2-normalized for stability), attention probability maps, and output-logit KL. This deep multi-signal supervision lets a much narrower and shallower network recover usable quality.
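
Two of the signals named above, the output-logit KL and the L2-normalized hidden-state comparison, can be sketched in plain Python. This is a minimal illustration, not the training code used for this model: the `temperature` parameter, the mean-squared-error form of the hidden-state term, and all function names are assumptions. It also assumes the teacher and student vectors already share a dimension; how the 312-wide student is aligned with the 768-wide teacher is not described in this card.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_logit_kl(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) between softened output distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def normalized_hidden_mse(teacher_hidden, student_hidden):
    """MSE between L2-normalized hidden vectors of equal dimension."""
    t = l2_normalize(teacher_hidden)
    s = l2_normalize(student_hidden)
    return sum((a - b) ** 2 for a, b in zip(t, s)) / len(t)


# Toy values, chosen only for illustration.
print(soft_logit_kl([2.0, 0.5, -1.0], [1.5, 0.7, -0.8], temperature=2.0))
print(normalized_hidden_mse([1.0, 2.0, 2.0], [0.9, 2.1, 1.8]))
```

Normalizing before comparing makes the hidden-state term depend on direction only, so a student whose activations differ in scale from the teacher's is not penalized for that alone.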
### `modernbert-mini`

*6 transformer layers, hidden size 768 (~69M params)*

**DistilBERT-style distillation.** The 6-layer student is initialized from evenly spaced teacher layers, then trained with masked-LM loss + soft-logit KL divergence + last-hidden cosine. Depth-only reduction (full width kept) retains more quality than `tiny` (92.9% vs 80.4% of GLUE) at about three times the size.
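
The initialization and the three-part objective can be sketched as follows. This is an illustration under stated assumptions, not the recipe's actual code: the card does not say which teacher layers were copied, so the endpoint-inclusive spacing below is one plausible reading, and the loss weights `w_mlm`, `w_kl`, and `w_cos` are hypothetical placeholders.

```python
import math


def evenly_spaced_layers(num_teacher_layers, num_student_layers):
    """Teacher layer indices to copy, first and last layer included (assumed)."""
    if num_student_layers == 1:
        return [num_teacher_layers - 1]
    step = (num_teacher_layers - 1) / (num_student_layers - 1)
    return [round(i * step) for i in range(num_student_layers)]


def cosine_loss(teacher_hidden, student_hidden):
    """1 - cosine similarity between two non-zero last-hidden-state vectors."""
    dot = sum(a * b for a, b in zip(teacher_hidden, student_hidden))
    norm_t = math.sqrt(sum(a * a for a in teacher_hidden))
    norm_s = math.sqrt(sum(b * b for b in student_hidden))
    return 1.0 - dot / (norm_t * norm_s)


def combined_loss(mlm_loss, kl_loss, cos_loss, w_mlm=1.0, w_kl=1.0, w_cos=1.0):
    """Weighted sum of the three terms; the weights are illustrative."""
    return w_mlm * mlm_loss + w_kl * kl_loss + w_cos * cos_loss


# ModernBERT-base has 22 layers; the mini student has 6.
print(evenly_spaced_layers(22, 6))  # [0, 4, 8, 13, 17, 21]
```

Keeping the hidden size at 768 is what makes this initialization possible: each student layer has the same shape as the teacher layer it is copied from.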
### `modernbert-lite` (this model)

*full ModernBERT (22 layers, hidden 768, ~150M params), weights stored in float16*

**Half-precision (fp16) quantization.** No retraining: weights are cast to 16-bit, roughly halving storage and memory with near-zero quality loss. Re-load in fp32 (or bf16) to fine-tune.
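
The effect of the cast can be seen on a handful of numbers using only the standard library, which supports IEEE 754 half precision through the `struct` format code `e`. This is a toy sketch with made-up weight values, not the conversion script used for this checkpoint; a real model would be cast with the framework's own tools.

```python
import struct


def to_fp16_and_back(x: float) -> float:
    """Round-trip a value through IEEE 754 half precision."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


# Hypothetical weight values, chosen only for illustration.
weights = [0.0123, -0.5, 1.0, 3.14159, -0.000731]

# 'f' packs 4-byte floats, 'e' packs 2-byte half-precision floats.
fp32_bytes = len(struct.pack(f"<{len(weights)}f", *weights))
fp16_bytes = len(struct.pack(f"<{len(weights)}e", *weights))

# Largest relative rounding error introduced by the fp16 round trip.
max_rel_err = max(abs(to_fp16_and_back(w) - w) / abs(w) for w in weights)

print(f"fp32: {fp32_bytes} bytes, fp16: {fp16_bytes} bytes")
print(f"max relative error after the round trip: {max_rel_err:.2e}")
```

Half precision keeps 11 significant bits, so values in its normal range lose at most about 0.05% relative precision. Its largest finite value is 65504, a range limit that fp32 and bf16 do not share.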
## Usage

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("codechrl/modernbert-lite")
model = AutoModelForMaskedLM.from_pretrained("codechrl/modernbert-lite")

# Fine-tune for your task (N = number of labels). The checkpoint is stored in
# float16, so request float32 (or bfloat16) explicitly when loading for training:
# import torch
# from transformers import AutoModelForSequenceClassification
# clf = AutoModelForSequenceClassification.from_pretrained(
#     "codechrl/modernbert-lite", num_labels=N, torch_dtype=torch.float32
# )
```
## Intended use & limitations

- **A base to fine-tune**, not a finished classifier.
- The distilled tiers (`tiny`, `mini`) were trained on a **small compute budget** (demo-grade); for production, redistill with more steps and a larger corpus. `lite` involves no training, only the fp16 cast.
- `tiny` trades the most quality for the smallest size; `mini` and `lite` retain more.
## Citation

Built on ModernBERT (Warner et al., 2024). Distillation recipes: DistilBERT (Sanh et al., 2019), TinyBERT (Jiao et al., 2020).