modernbert-mini: compressed ModernBERT base (distilbert)

Files changed (5)

README.md +97 -0
config.json +62 -0
model.safetensors +3 -0
tokenizer.json +0 -0
tokenizer_config.json +17 -0

README.md ADDED

	@@ -0,0 +1,97 @@

+---
+license: apache-2.0
+base_model: answerdotai/ModernBERT-base
+base_model_relation: finetune
+library_name: transformers
+pipeline_tag: fill-mask
+language:
+- en
+datasets:
+- HuggingFaceFW/fineweb-edu
+tags:
+- modernbert
+- distillation
+- knowledge-distillation
+- model-compression
+- fill-mask
+---
+# modernbert-mini
+DistilBERT-style distillation: the balanced middle tier, recommended as a general-purpose base.
+A **compressed, fine-tunable base encoder** derived from [`answerdotai/ModernBERT-base`](https://huggingface.co/answerdotai/ModernBERT-base). It is
+**46.7% of the teacher's size** while keeping **92.9% of its GLUE quality**. Use it as a general base and
+fine-tune it on your downstream task, exactly as you would ModernBERT-base.
+## The family (one exercise)
+All three models in this family were produced in **one ModernBERT compression exercise** that compares different compression methods, using the same teacher ([`answerdotai/ModernBERT-base`](https://huggingface.co/answerdotai/ModernBERT-base)), the same FineWeb-Edu corpus, and the same GLUE evaluation. **Pick the tier that fits your size/quality budget:**
+- [`codechrl/modernbert-tiny`](https://huggingface.co/codechrl/modernbert-tiny) — 22.1M params, 15.3% of base size, 80.4% GLUE retained · TinyBERT-style attention+hidden distillation
+- [`codechrl/modernbert-mini`](https://huggingface.co/codechrl/modernbert-mini) ← **you are here** — 69.4M params, 46.7% of base size, 92.9% GLUE retained · DistilBERT-style depth distillation
+- [`codechrl/modernbert-lite`](https://huggingface.co/codechrl/modernbert-lite) — 149.7M params, 50.3% of base size, 99.3% GLUE retained · fp16 half-precision quantization
+## How it was made (general process)
+1. **Teacher:** `answerdotai/ModernBERT-base` (149.7M params), the model the student is trained to imitate.
+2. **General-corpus distillation:** the student learns from the teacher on **FineWeb-Edu** (general English web
+   text) using the `distilbert` recipe. No task- or domain-specific data is used, so it stays a general base.
+3. **Evaluation:** quality is measured on four **GLUE** tasks (SST-2, MRPC, STS-B, RTE; each model fine-tuned identically)
+   and reported purely as **% retained vs the teacher**.
+## Scores (% against the ModernBERT-base teacher)
+- **Size:** 281.2 MB → **46.7% of baseline**  (params 69.4M)
+- **GLUE quality retained:** **92.9%**
+- **eff_score:** 73.1 / 100  =  `0.5 · GLUE_retention% + 0.5 · size_reduction%`, where `size_reduction% = 100 − size-vs-base%` (here `0.5 · 92.9 + 0.5 · 53.3`; higher is better)
+### Full tier comparison
+| model | params (M) | size (MB) | size vs base | GLUE vs base | eff_score |
+|---|---|---|---|---|---|
+| `ModernBERT-base` (teacher) | 149.7 | 602.2 | 100% | 100% | 50.0 |
+| `modernbert-tiny` | 22.1 | 92.0 | 15.3% | 80.4% | 82.6 |
+| **modernbert-mini** ⭐ | 69.4 | 281.2 | 46.7% | 92.9% | 73.1 |
+| `modernbert-lite` | 149.7 | 302.9 | 50.3% | 99.3% | 74.5 |
+## Methods & architecture (each tier)
+Every tier derives from the **same teacher** but uses a different compression method:
+### `modernbert-tiny`
+*4 transformer layers, hidden size 312, 12 heads (~22M params)*
+**TinyBERT-style distillation.** A small student mimics multiple internal signals of the teacher: token embeddings, per-layer hidden states (compared L2-normalized for stability), attention probability maps, and output-logit KL. This deep multi-signal supervision lets a much narrower/shallower network recover usable quality.
+### `modernbert-mini` ⭐
+*6 transformer layers, hidden size 768 (~69M params)*
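+**DistilBERT-style distillation.** The 6-layer student is initialized from evenly spaced teacher layers, then trained with a masked-LM loss, a soft-logit KL-divergence loss, and a cosine loss on the last hidden states. Depth-only reduction (full width kept) gives the middle trade-off in this family: it retains more quality than `tiny` and is smaller than `lite`, although `tiny` scores higher on eff_score.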
+### `modernbert-lite`
+*full ModernBERT (22 layers, hidden 768, ~150M params), weights stored in float16*
+**Half-precision (fp16) quantization.** No retraining — weights are cast to 16-bit, roughly halving storage and memory with near-zero quality loss. Re-load in fp32 (or bf16) to fine-tune.
+## Usage
+```python
+from transformers import AutoModelForMaskedLM, AutoTokenizer
+tok = AutoTokenizer.from_pretrained("codechrl/modernbert-mini")
+model = AutoModelForMaskedLM.from_pretrained("codechrl/modernbert-mini")
+# fine-tune for your task:
+# from transformers import AutoModelForSequenceClassification
+# clf = AutoModelForSequenceClassification.from_pretrained("codechrl/modernbert-mini", num_labels=N)
+```
+## Intended use & limitations
+- **A base to fine-tune**, not a finished classifier.
+- Distilled on a **small compute budget** (demo-grade); for production use, re-distill with more training steps and a larger corpus.
+- `tiny` trades the most quality for the smallest size; `mini`/`lite` retain more.
+## Citation
+Built on ModernBERT (Warner et al., 2024). Distillation recipes: DistilBERT (Sanh et al., 2019), TinyBERT (Jiao et al., 2020).
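
Review note on the `modernbert-mini` method paragraph: the README names the ingredients of the recipe (evenly spaced teacher layers for initialization, then masked-LM loss plus soft-logit KL plus last-hidden cosine) but not the exact layer indices, temperature, or loss weights. The sketch below is one plausible reading in plain Python, not the author's training code. The index rule in `evenly_spaced_layers`, the `temperature=2.0` default, and the equal weights `w_mlm`, `w_kl`, `w_cos` are all assumptions made for illustration.

```python
import math


def evenly_spaced_layers(num_teacher_layers: int, num_student_layers: int) -> list[int]:
    """Pick teacher layer indices spread evenly from the first to the last layer.

    Assumed rule: the README only says "evenly-spaced", not which indices.
    """
    if num_student_layers == 1:
        return [num_teacher_layers - 1]
    step = (num_teacher_layers - 1) / (num_student_layers - 1)
    return [round(i * step) for i in range(num_student_layers)]


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def cosine_distance(a, b):
    """1 - cosine similarity between two hidden-state vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def distillation_loss(student_logits, teacher_logits, target_id,
                      student_hidden, teacher_hidden,
                      temperature=2.0, w_mlm=1.0, w_kl=1.0, w_cos=1.0):
    """Combined loss for one masked position (weights and temperature are assumed)."""
    # Masked-LM term: cross-entropy of the student against the true token.
    mlm = -math.log(softmax(student_logits)[target_id])
    # Soft-logit term: KL from teacher to student at a raised temperature,
    # scaled by temperature**2 as is conventional in knowledge distillation.
    kl = temperature ** 2 * kl_divergence(
        softmax(teacher_logits, temperature),
        softmax(student_logits, temperature),
    )
    # Last-hidden term: pull the student's final hidden state toward the teacher's.
    cos = cosine_distance(student_hidden, teacher_hidden)
    return w_mlm * mlm + w_kl * kl + w_cos * cos


# Teacher has 22 layers and the student 6, per the README.
print(evenly_spaced_layers(22, 6))
```

When the student reproduces the teacher's logits and hidden state exactly, the KL and cosine terms vanish and only the masked-LM term remains, which is a quick sanity check for any real implementation of the same objective.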

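Review note on the tier comparison table: the eff_score column can be recomputed from the two percentage columns. This is a minimal check, assuming `size_reduction%` means `100 − size vs base`, which the README formula implies but does not spell out. The function name `eff_score` and the `tiers` dictionary are illustrative; the numbers are copied from the table.

```python
def eff_score(glue_retained_pct: float, size_vs_base_pct: float) -> float:
    """0.5 * GLUE retention + 0.5 * size reduction, both in percent."""
    size_reduction_pct = 100.0 - size_vs_base_pct  # assumed definition
    return 0.5 * glue_retained_pct + 0.5 * size_reduction_pct


# (GLUE vs base %, size vs base %, eff_score as reported in the README table)
tiers = {
    "ModernBERT-base": (100.0, 100.0, 50.0),
    "modernbert-tiny": (80.4, 15.3, 82.6),
    "modernbert-mini": (92.9, 46.7, 73.1),
    "modernbert-lite": (99.3, 50.3, 74.5),
}

for name, (glue, size, reported) in tiers.items():
    computed = eff_score(glue, size)
    print(f"{name}: computed {computed:.2f}, reported {reported}")
```

All four rows agree with the reported column to within rounding of the one-decimal inputs.
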
config.json ADDED

	@@ -0,0 +1,62 @@

+{
+  "architectures": [
+    "ModernBertForMaskedLM"
+  ],
+  "attention_bias": false,
+  "attention_dropout": 0.0,
+  "bos_token_id": 50281,
+  "classifier_activation": "gelu",
+  "classifier_bias": false,
+  "classifier_dropout": 0.0,
+  "classifier_pooling": "mean",
+  "cls_token_id": 50281,
+  "decoder_bias": true,
+  "deterministic_flash_attn": false,
+  "dtype": "float32",
+  "embedding_dropout": 0.0,
+  "eos_token_id": 50282,
+  "global_attn_every_n_layers": 3,
+  "gradient_checkpointing": false,
+  "hidden_activation": "gelu",
+  "hidden_size": 768,
+  "initializer_cutoff_factor": 2.0,
+  "initializer_range": 0.02,
+  "intermediate_size": 1152,
+  "layer_norm_eps": 1e-05,
+  "layer_types": [
+    "full_attention",
+    "sliding_attention",
+    "sliding_attention",
+    "full_attention",
+    "sliding_attention",
+    "sliding_attention"
+  ],
+  "local_attention": 128,
+  "max_position_embeddings": 8192,
+  "mlp_bias": false,
+  "mlp_dropout": 0.0,
+  "model_type": "modernbert",
+  "norm_bias": false,
+  "norm_eps": 1e-05,
+  "num_attention_heads": 12,
+  "num_hidden_layers": 6,
+  "pad_token_id": 50283,
+  "position_embedding_type": "absolute",
+  "rope_parameters": {
+    "full_attention": {
+      "rope_theta": 160000.0,
+      "rope_type": "default"
+    },
+    "sliding_attention": {
+      "rope_theta": 10000.0,
+      "rope_type": "default"
+    }
+  },
+  "sep_token_id": 50282,
+  "sparse_pred_ignore_index": -100,
+  "sparse_prediction": false,
+  "tie_word_embeddings": true,
+  "transformers_version": "5.12.1",
+  "use_cache": false,
+  "vocab_size": 50368
+}
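
Review note on `config.json`: the parameter count quoted in the README can be roughly cross-checked from four of the values above. The sketch assumes the layer layout of the ModernBERT implementation in `transformers` (a fused QKV projection, a gated MLP whose input projection is twice `intermediate_size`, and a decoder tied to the token embeddings) and counts weight matrices only, so norm weights and the decoder bias are left out. The teacher estimate further assumes the teacher shares `hidden_size` and `intermediate_size` with this config and differs only in having 22 layers.

```python
config = {
    "hidden_size": 768,
    "intermediate_size": 1152,
    "num_hidden_layers": 6,
    "vocab_size": 50368,
}


def approx_param_count(cfg: dict) -> int:
    """Count weight-matrix parameters only (norms and biases are omitted)."""
    h, ffn = cfg["hidden_size"], cfg["intermediate_size"]
    embeddings = cfg["vocab_size"] * h   # token embeddings, shared with the MLM decoder
    attention = h * 3 * h + h * h        # fused QKV projection + output projection
    mlp = h * 2 * ffn + ffn * h          # gated input projection + output projection
    head = h * h                         # dense layer of the MLM prediction head
    return embeddings + cfg["num_hidden_layers"] * (attention + mlp) + head


student = approx_param_count(config)
teacher = approx_param_count({**config, "num_hidden_layers": 22})
print(f"student ~ {student / 1e6:.1f}M, teacher ~ {teacher / 1e6:.1f}M")
```

Both estimates land within about 0.1M of the 69.4M and 149.7M figures in the README, which suggests the student differs from the teacher only in depth, as the model card says.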

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:462e202ed275c99b8d5c5380d20d7dde621496f6fb57330b3a68b0371106394b
+size 277662544
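
Review note on the LFS pointer: dividing the `size` field by the parameter count stated in the README gives the storage cost per parameter, which is a quick way to confirm the checkpoint precision. The variable names are illustrative; both numbers come from this diff.

```python
lfs_size_bytes = 277_662_544   # "size" field of the Git LFS pointer
reported_params = 69.4e6       # parameter count stated in the README

bytes_per_param = lfs_size_bytes / reported_params
print(f"{bytes_per_param:.2f} bytes per parameter")
```

The result is very close to 4 bytes per parameter, consistent with the `"dtype": "float32"` entry in `config.json`; the small excess is expected because a safetensors file also carries a header.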

tokenizer.json ADDED

The diff for this file is too large to render.

tokenizer_config.json ADDED

	@@ -0,0 +1,17 @@

+{
+  "backend": "tokenizers",
+  "clean_up_tokenization_spaces": true,
+  "cls_token": "[CLS]",
+  "is_local": false,
+  "local_files_only": false,
+  "mask_token": "[MASK]",
+  "model_input_names": [
+    "input_ids",
+    "attention_mask"
+  ],
+  "model_max_length": 8192,
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "tokenizer_class": "TokenizersBackend",
+  "unk_token": "[UNK]"
+}