codechrl committed on
Commit 0cf3229 · verified · 1 Parent(s): da616de

modernbert-mini: compressed ModernBERT base (distilbert)

Files changed (5)
  1. README.md +97 -0
  2. config.json +62 -0
  3. model.safetensors +3 -0
  4. tokenizer.json +0 -0
  5. tokenizer_config.json +17 -0
README.md ADDED
@@ -0,0 +1,97 @@
+ ---
+ license: apache-2.0
+ base_model: answerdotai/ModernBERT-base
+ base_model_relation: finetune
+ library_name: transformers
+ pipeline_tag: fill-mask
+ language:
+ - en
+ datasets:
+ - HuggingFaceFW/fineweb-edu
+ tags:
+ - modernbert
+ - distillation
+ - knowledge-distillation
+ - model-compression
+ - fill-mask
+ ---
+ # modernbert-mini
+
+ DistilBERT-style distillation: the balanced, recommended general base.
+
+ A **compressed, fine-tunable base encoder** derived from [`answerdotai/ModernBERT-base`](https://huggingface.co/answerdotai/ModernBERT-base). It is
+ **46.7% of the teacher's size** while keeping **92.9% of its GLUE quality**. Use it as a general base and
+ fine-tune it on your downstream task, exactly as you would ModernBERT-base.
+
+ ## The family (one exercise)
+
+ All three models were produced in **one ModernBERT compression exercise** that compares different compression methods under the same teacher ([`answerdotai/ModernBERT-base`](https://huggingface.co/answerdotai/ModernBERT-base)), the same FineWeb-Edu corpus, and the same GLUE evaluation. **Pick the tier that fits your size/quality budget:**
+
+ - [`codechrl/modernbert-tiny`](https://huggingface.co/codechrl/modernbert-tiny): 22.1M params, 15.3% of base size, 80.4% GLUE retained · TinyBERT-style attention+hidden distillation
+ - [`codechrl/modernbert-mini`](https://huggingface.co/codechrl/modernbert-mini) ← **you are here**: 69.4M params, 46.7% of base size, 92.9% GLUE retained · DistilBERT-style depth distillation
+ - [`codechrl/modernbert-lite`](https://huggingface.co/codechrl/modernbert-lite): 149.7M params, 50.3% of base size, 99.3% GLUE retained · fp16 half-precision quantization
+
+ ## How it was made (general process)
+
+ 1. **Teacher**: `answerdotai/ModernBERT-base` (149.7M params), the model whose behavior the student is trained to reproduce.
+ 2. **General-corpus distillation**: the student learns from the teacher on **FineWeb-Edu** (general English web
+ text) using the `distilbert` recipe. No task- or domain-specific data is used, so it stays a general base.
+ 3. **Evaluation**: quality is measured on four **GLUE** tasks (SST-2, MRPC, STS-B, RTE; each model fine-tuned identically)
+ and reported purely as **% retained vs the teacher**.
+
+ ## Scores (% against the ModernBERT-base teacher)
+
+ - **Size:** 281.2 MB → **46.7% of baseline** (params 69.4M)
+ - **GLUE quality retained:** **92.9%**
+ - **eff_score:** 73.1 / 100, where `eff_score = 0.5 · GLUE_retention% + 0.5 · size_reduction%` and `size_reduction% = 100% - (size vs base)` (higher is better)
+
+ ### Full tier comparison
+
+ | model | params (M) | size (MB) | size vs base | GLUE vs base | eff_score |
+ |---|---|---|---|---|---|
+ | `ModernBERT-base` (teacher) | 149.7 | 602.2 | 100% | 100% | 50.0 |
+ | `modernbert-tiny` | 22.1 | 92.0 | 15.3% | 80.4% | 82.6 |
+ | **modernbert-mini** ⭐ | 69.4 | 281.2 | 46.7% | 92.9% | 73.1 |
+ | `modernbert-lite` | 149.7 | 302.9 | 50.3% | 99.3% | 74.5 |
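
The `eff_score` column can be reproduced from the other two percentage columns. The sketch below assumes that `size_reduction%` means `100 - size vs base`, which the card does not state explicitly but which matches all four rows; the function and variable names are illustrative only.

```python
def eff_score(glue_retained_pct: float, size_vs_base_pct: float) -> float:
    """Equal-weight blend of GLUE retention and size reduction (both in percent)."""
    size_reduction_pct = 100.0 - size_vs_base_pct  # assumption: reduction = 100 - relative size
    return 0.5 * glue_retained_pct + 0.5 * size_reduction_pct


# (size vs base %, GLUE vs base %, eff_score reported in the table)
TIERS = {
    "ModernBERT-base": (100.0, 100.0, 50.0),
    "modernbert-tiny": (15.3, 80.4, 82.6),
    "modernbert-mini": (46.7, 92.9, 73.1),
    "modernbert-lite": (50.3, 99.3, 74.5),
}

for name, (size_pct, glue_pct, reported) in TIERS.items():
    computed = eff_score(glue_pct, size_pct)
    # The table rounds to one decimal, so allow a small tolerance.
    status = "matches" if abs(computed - reported) < 0.06 else "differs"
    print(f"{name}: computed {computed:.2f}, reported {reported} ({status})")
```

Because the score weights size reduction as heavily as quality, the unmodified teacher scores 50.0 and `tiny` scores highest despite retaining the least GLUE quality.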
+
+ ## Methods & architecture (each tier)
+
+ Every tier derives from the **same teacher** but uses a different compression method:
+
+ ### `modernbert-tiny`
+ *4 transformer layers, hidden size 312, 12 heads (~22M params)*
+
+ **TinyBERT-style distillation.** A small student mimics several internal signals of the teacher: token embeddings, per-layer hidden states (L2-normalized before comparison, for stability), attention probability maps, and output logits (via KL divergence). This deep multi-signal supervision lets a much narrower and shallower network recover usable quality.
+
+ ### `modernbert-mini` ⭐
+ *6 transformer layers, hidden size 768 (~69M params)*
+
+ **DistilBERT-style distillation.** The 6-layer student is initialized from evenly spaced teacher layers, then trained with a masked-LM loss, a soft-logit KL-divergence loss, and a cosine loss on the last hidden states. Depth-only reduction (full width kept) retains the most GLUE quality of the two distilled tiers (92.9% vs 80.4% for `tiny`) at about three times the size.
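
The recipe described for `modernbert-mini` can be sketched in plain Python. This is not the training code from this commit: the layer-selection rule, the temperature, and the loss weights are all assumptions chosen for illustration, and the function names are hypothetical. It only shows the shape of the three-term objective (masked-LM loss, temperature-softened KL, last-hidden cosine) and one way to pick 6 evenly spaced layers out of 22.

```python
import math


def evenly_spaced_layers(num_teacher: int, num_student: int) -> list[int]:
    """One possible even spacing of teacher layer indices (assumed rule)."""
    return [(i * num_teacher) // num_student for i in range(num_student)]


def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_logit_kl(teacher_logits, student_logits, temperature: float = 2.0) -> float:
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return kl * temperature ** 2


def cosine_loss(teacher_hidden, student_hidden) -> float:
    """1 - cosine similarity between two hidden-state vectors."""
    dot = sum(t * s for t, s in zip(teacher_hidden, student_hidden))
    norm_t = math.sqrt(sum(t * t for t in teacher_hidden))
    norm_s = math.sqrt(sum(s * s for s in student_hidden))
    return 1.0 - dot / (norm_t * norm_s)


def distil_loss(mlm_loss, teacher_logits, student_logits,
                teacher_hidden, student_hidden,
                w_mlm=1.0, w_kl=1.0, w_cos=1.0, temperature=2.0) -> float:
    """Weighted sum of the three terms; the weights here are placeholders."""
    return (w_mlm * mlm_loss
            + w_kl * soft_logit_kl(teacher_logits, student_logits, temperature)
            + w_cos * cosine_loss(teacher_hidden, student_hidden))


print(evenly_spaced_layers(22, 6))  # [0, 3, 7, 11, 14, 18]
```

In a real run these terms would be computed on batched tensors over the masked positions; the scalar version above is only meant to make the objective concrete.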
+
+ ### `modernbert-lite`
+ *full ModernBERT (22 layers, hidden 768, ~150M params), weights stored in float16*
+
+ **Half-precision (fp16) quantization.** No retraining: the weights are cast to 16-bit floats, roughly halving storage and memory with near-zero quality loss. Reload in fp32 (or bf16) to fine-tune.
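
The storage halving and the small rounding error behind the `lite` tier can be illustrated with the standard library alone. The sketch below uses `struct`'s half-precision format code `e`; the helper names are illustrative, and the parameter count is the rounded 149.7M from the card, so the sizes come out close to, but not identical to, the 602.2 MB and 302.9 MB in the table.

```python
import struct


def to_fp16_and_back(value: float) -> float:
    """Round-trip a Python float through IEEE 754 half precision."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def storage_mb(num_params: int, bytes_per_param: int) -> float:
    """Raw weight storage in megabytes (10^6 bytes), ignoring file headers."""
    return num_params * bytes_per_param / 1e6


NUM_PARAMS = 149_700_000  # rounded teacher parameter count from the card

print(storage_mb(NUM_PARAMS, 4))  # fp32: 4 bytes per weight
print(storage_mb(NUM_PARAMS, 2))  # fp16: 2 bytes per weight

# A typical small weight value loses a little precision when cast to fp16.
w = 0.1
print(w, to_fp16_and_back(w), abs(w - to_fp16_and_back(w)))
```

Values that are exactly representable in 16 bits, such as 0.5, survive the round trip unchanged; most others shift slightly, which is the source of the small quality loss reported for this tier.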
+
+
+ ## Usage
+
+ ```python
+ from transformers import AutoModelForMaskedLM, AutoTokenizer
+
+ # Load the distilled checkpoint together with its tokenizer.
+ tok = AutoTokenizer.from_pretrained("codechrl/modernbert-mini")
+ model = AutoModelForMaskedLM.from_pretrained("codechrl/modernbert-mini")
+
+ # To fine-tune for your task, load a task head instead (N = number of labels):
+ # from transformers import AutoModelForSequenceClassification
+ # clf = AutoModelForSequenceClassification.from_pretrained("codechrl/modernbert-mini", num_labels=N)
+ ```
+
+ ## Intended use & limitations
+
+ - **A base to fine-tune**, not a finished classifier.
+ - Distilled on a **small compute budget** (demo-grade); for production, redistill with more training steps and a larger corpus.
+ - `tiny` trades the most quality for the smallest size; `mini` and `lite` retain more.
+
+ ## Citation
+
+ Built on ModernBERT (Warner et al., 2024). Distillation recipes: DistilBERT (Sanh et al., 2019), TinyBERT (Jiao et al., 2020).
config.json ADDED
@@ -0,0 +1,62 @@
+ {
+   "architectures": [
+     "ModernBertForMaskedLM"
+   ],
+   "attention_bias": false,
+   "attention_dropout": 0.0,
+   "bos_token_id": 50281,
+   "classifier_activation": "gelu",
+   "classifier_bias": false,
+   "classifier_dropout": 0.0,
+   "classifier_pooling": "mean",
+   "cls_token_id": 50281,
+   "decoder_bias": true,
+   "deterministic_flash_attn": false,
+   "dtype": "float32",
+   "embedding_dropout": 0.0,
+   "eos_token_id": 50282,
+   "global_attn_every_n_layers": 3,
+   "gradient_checkpointing": false,
+   "hidden_activation": "gelu",
+   "hidden_size": 768,
+   "initializer_cutoff_factor": 2.0,
+   "initializer_range": 0.02,
+   "intermediate_size": 1152,
+   "layer_norm_eps": 1e-05,
+   "layer_types": [
+     "full_attention",
+     "sliding_attention",
+     "sliding_attention",
+     "full_attention",
+     "sliding_attention",
+     "sliding_attention"
+   ],
+   "local_attention": 128,
+   "max_position_embeddings": 8192,
+   "mlp_bias": false,
+   "mlp_dropout": 0.0,
+   "model_type": "modernbert",
+   "norm_bias": false,
+   "norm_eps": 1e-05,
+   "num_attention_heads": 12,
+   "num_hidden_layers": 6,
+   "pad_token_id": 50283,
+   "position_embedding_type": "absolute",
+   "rope_parameters": {
+     "full_attention": {
+       "rope_theta": 160000.0,
+       "rope_type": "default"
+     },
+     "sliding_attention": {
+       "rope_theta": 10000.0,
+       "rope_type": "default"
+     }
+   },
+   "sep_token_id": 50282,
+   "sparse_pred_ignore_index": -100,
+   "sparse_prediction": false,
+   "tie_word_embeddings": true,
+   "transformers_version": "5.12.1",
+   "use_cache": false,
+   "vocab_size": 50368
+ }
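
The `layer_types` list in this config is consistent with `global_attn_every_n_layers: 3` and `num_hidden_layers: 6`. A minimal sketch of that relationship follows, assuming the rule is "layer `i` uses full attention when `i` is a multiple of the interval, sliding-window attention otherwise". The helper name is hypothetical and this is not the library's own code.

```python
def derive_layer_types(num_hidden_layers: int, global_attn_every_n_layers: int) -> list[str]:
    """Alternate global and local attention: every n-th layer (starting at 0) is global."""
    return [
        "full_attention" if i % global_attn_every_n_layers == 0 else "sliding_attention"
        for i in range(num_hidden_layers)
    ]


# Values copied from the config.json above.
CONFIG_LAYER_TYPES = [
    "full_attention",
    "sliding_attention",
    "sliding_attention",
    "full_attention",
    "sliding_attention",
    "sliding_attention",
]

derived = derive_layer_types(num_hidden_layers=6, global_attn_every_n_layers=3)
print(derived == CONFIG_LAYER_TYPES)  # True
```

Under this reading, two of the six student layers attend globally (with `rope_theta` 160000.0) and the other four use the 128-token local window (with `rope_theta` 10000.0).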
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:462e202ed275c99b8d5c5380d20d7dde621496f6fb57330b3a68b0371106394b
+ size 277662544
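
The LFS pointer size can be cross-checked against the config as a rough sanity test. The sketch below counts float32 parameters under an assumed ModernBERT masked-LM layout: bias-free linear layers, a fused QKV projection, a gated MLP whose input projection is twice `intermediate_size` wide, weight-only norms, no attention pre-norm in the first layer, and a decoder tied to the embeddings with its own bias. None of that layout is stated in this commit, so treat the breakdown as an assumption that happens to agree with the reported 69.4M.

```python
def modernbert_mlm_param_count(vocab_size: int, hidden_size: int,
                               intermediate_size: int, num_hidden_layers: int,
                               decoder_bias: bool = True) -> int:
    """Approximate parameter count under the assumed layout described above."""
    embeddings = vocab_size * hidden_size + hidden_size       # token table + embedding norm
    attn = hidden_size * 3 * hidden_size + hidden_size ** 2   # fused QKV + output projection
    mlp = hidden_size * 2 * intermediate_size + intermediate_size * hidden_size  # gated MLP
    norms = 2 * hidden_size                                   # attention norm + MLP norm
    layers = num_hidden_layers * (attn + mlp + norms) - hidden_size  # layer 0: no attention norm
    final_norm = hidden_size
    head = hidden_size ** 2 + hidden_size                     # prediction-head dense + norm
    decoder = vocab_size if decoder_bias else 0               # tied weight, separate bias
    return embeddings + layers + final_norm + head + decoder


total = modernbert_mlm_param_count(vocab_size=50368, hidden_size=768,
                                   intermediate_size=1152, num_hidden_layers=6)
lfs_size_bytes = 277_662_544  # from the pointer file above

print(total)                       # 69414592
print(total * 4)                   # 277658368 bytes of float32 weights
print(lfs_size_bytes - total * 4)  # 4176 bytes left over
```

The few kilobytes left over would be consistent with a safetensors metadata header, though that attribution is also an assumption.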
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,17 @@
+ {
+   "backend": "tokenizers",
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "[CLS]",
+   "is_local": false,
+   "local_files_only": false,
+   "mask_token": "[MASK]",
+   "model_input_names": [
+     "input_ids",
+     "attention_mask"
+   ],
+   "model_max_length": 8192,
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "tokenizer_class": "TokenizersBackend",
+   "unk_token": "[UNK]"
+ }