Commit 75da445 (verified) by codechrl · 1 parent: f96e737

modernbert-tiny: compressed ModernBERT base (tinybert)

Files changed (5)
  1. README.md +97 -0
  2. config.json +60 -0
  3. model.safetensors +3 -0
  4. tokenizer.json +0 -0
  5. tokenizer_config.json +17 -0
README.md ADDED
@@ -0,0 +1,97 @@
+ ---
+ license: apache-2.0
+ base_model: answerdotai/ModernBERT-base
+ base_model_relation: finetune
+ library_name: transformers
+ pipeline_tag: fill-mask
+ language:
+ - en
+ datasets:
+ - HuggingFaceFW/fineweb-edu
+ tags:
+ - modernbert
+ - distillation
+ - knowledge-distillation
+ - model-compression
+ - fill-mask
+ ---
+ # modernbert-tiny
+
+ The smallest model in this ModernBERT family, produced by TinyBERT-style distillation and aimed at edge and low-latency deployments.
+
+ A **compressed, fine-tunable base encoder** derived from [`answerdotai/ModernBERT-base`](https://huggingface.co/answerdotai/ModernBERT-base):
+ **15.3% of the teacher's size** while keeping **80.4% of its GLUE quality** (measured on SST-2, MRPC, STS-B and RTE). Use it as a general base and
+ fine-tune it on your downstream task, exactly like ModernBERT-base.
+
+ ## The family (one exercise)
+
+ All three models come from **one ModernBERT compression exercise** that compares different compression methods using the same teacher ([`answerdotai/ModernBERT-base`](https://huggingface.co/answerdotai/ModernBERT-base)), the same FineWeb-Edu corpus for the two distilled tiers, and the same GLUE evaluation. **Pick the tier that fits your size/quality budget:**
+
+ - [`codechrl/modernbert-tiny`](https://huggingface.co/codechrl/modernbert-tiny) ← **you are here**: 22.1M params, 15.3% of base size, 80.4% GLUE retained · TinyBERT-style attention+hidden distillation
+ - [`codechrl/modernbert-mini`](https://huggingface.co/codechrl/modernbert-mini): 69.4M params, 46.7% of base size, 92.9% GLUE retained · DistilBERT-style depth distillation
+ - [`codechrl/modernbert-lite`](https://huggingface.co/codechrl/modernbert-lite): 149.7M params, 50.3% of base size, 99.3% GLUE retained · fp16 half-precision quantization
+
+ ## How it was made (general process)
+
+ 1. **Teacher**: `answerdotai/ModernBERT-base` (149.7M params), the model the student is distilled from.
+ 2. **General-corpus distillation**: the student learns from the teacher on **FineWeb-Edu** (general English web
+ text) using the `tinybert` recipe. No task- or domain-specific data is used, so it stays a general base.
+ 3. **Evaluation**: quality is measured on four **GLUE** tasks (SST-2, MRPC, STS-B, RTE), with each model fine-tuned identically,
+ and reported purely as **% retained vs the teacher**.
+
+ ## Scores (% against the ModernBERT-base teacher)
+
+ - **Size:** 92.0 MB → **15.3% of baseline** (22.1M params)
+ - **GLUE quality retained:** **80.4%**
+ - **eff_score:** 82.6 out of 100, computed as `0.5 · GLUE_retention% + 0.5 · size_reduction%`, where `size_reduction% = 100 − size vs base` (higher is better)
+
+ ### Full tier comparison
+
+ | model | params (M) | size (MB) | size vs base | GLUE vs base | eff_score |
+ |---|---|---|---|---|---|
+ | `ModernBERT-base` (teacher) | 149.7 | 602.2 | 100% | 100% | 50.0 |
+ | **modernbert-tiny** ⭐ | 22.1 | 92.0 | 15.3% | 80.4% | 82.6 |
+ | `modernbert-mini` | 69.4 | 281.2 | 46.7% | 92.9% | 73.1 |
+ | `modernbert-lite` | 149.7 | 302.9 | 50.3% | 99.3% | 74.5 |
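As a check on the table, the `eff_score` column can be recomputed from the two percentage columns. The sketch below assumes that `size_reduction%` means `100 − size vs base`; the card does not spell this out, but the assumption reproduces every row to within rounding. The function and variable names are illustrative only.

```python
def eff_score(glue_retained_pct, size_pct_of_base):
    """0.5 * GLUE retention + 0.5 * size reduction, both in percent."""
    size_reduction_pct = 100.0 - size_pct_of_base  # assumed definition
    return 0.5 * glue_retained_pct + 0.5 * size_reduction_pct

# (GLUE vs base %, size vs base %) copied from the comparison table
tiers = {
    "ModernBERT-base": (100.0, 100.0),
    "modernbert-tiny": (80.4, 15.3),
    "modernbert-mini": (92.9, 46.7),
    "modernbert-lite": (99.3, 50.3),
}

scores = {name: eff_score(glue, size) for name, (glue, size) in tiers.items()}
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

Under this definition the uncompressed teacher scores 50.0, and a score above 50 means the size reduction (in percentage points) exceeds the quality loss.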
+
+ ## Methods & architecture (each tier)
+
+ Every tier derives from the **same teacher** but uses a different compression method:
+
+ ### `modernbert-tiny` ⭐
+ *4 transformer layers, hidden size 312, 12 heads (~22M params)*
+
+ **TinyBERT-style distillation.** A small student mimics several internal signals of the teacher: token embeddings, per-layer hidden states (compared after L2 normalization for stability), attention probability maps, and output logits (via KL divergence). This deep multi-signal supervision lets a much narrower and shallower network recover usable quality.
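To make two of these signals concrete, here is a dependency-free sketch of an L2-normalized hidden-state loss and a logit KL term. It is not the training code used for this model: the helper names and toy vectors are hypothetical, a real setup would operate on tensors, and comparing 312-dimensional student states with 768-dimensional teacher states would need some projection that the card does not describe. The attention-map term is omitted here.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (returned unchanged if all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def normalized_mse(student_vec, teacher_vec):
    """Mean squared error between the L2-normalized vectors."""
    s, t = l2_normalize(student_vec), l2_normalize(teacher_vec)
    return sum((a - b) ** 2 for a, b in zip(s, t)) / len(s)

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)
    exps = [math.exp(z - top) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) between the softened output distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy inputs: the hidden vectors point the same way but differ in scale.
hidden_loss = normalized_mse([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
logit_loss = kl_divergence([2.0, 0.5, -1.0], [1.5, 0.7, -0.8])
print(hidden_loss, logit_loss)
```

Normalizing first makes the hidden-state term depend only on direction, so a student whose activations differ from the teacher's in magnitude alone is not penalized, which is one plausible reading of "for stability".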
+
+ ### `modernbert-mini`
+ *6 transformer layers, hidden size 768 (~69M params)*
+
+ **DistilBERT-style distillation.** The 6-layer student is initialized from evenly spaced teacher layers, then trained with a masked-LM loss, a soft-logit KL divergence, and a cosine loss on the last hidden state. This depth-only reduction keeps the full width and retains more quality than `tiny` (92.9% vs 80.4% of teacher GLUE) at about three times the size.
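The card does not say which teacher layers seed the student. One common way to pick evenly spaced layers, shown here purely as an assumption, is to spread the student's layers from the teacher's first layer to its last; the function name is hypothetical.

```python
def evenly_spaced_layers(num_teacher_layers, num_student_layers):
    """Pick teacher layer indices spread evenly from the first to the last layer."""
    if num_student_layers == 1:
        return [num_teacher_layers - 1]
    step = (num_teacher_layers - 1) / (num_student_layers - 1)
    return [round(i * step) for i in range(num_student_layers)]

# 22 teacher layers mapped onto a 6-layer student
teacher_layer_ids = evenly_spaced_layers(22, 6)
print(teacher_layer_ids)  # [0, 4, 8, 13, 17, 21]
```

Other schemes, such as taking every k-th layer or only the top layers, are equally consistent with the description.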
+
+ ### `modernbert-lite`
+ *full ModernBERT (22 layers, hidden 768, ~150M params), weights stored in float16*
+
+ **Half-precision (fp16) quantization.** No retraining: the weights are cast to 16-bit floats, roughly halving storage and memory with near-zero quality loss (99.3% of teacher GLUE retained). Reload in fp32 (or bf16) to fine-tune.
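The "roughly halving" claim follows from bytes per parameter. The sketch below computes raw weight storage for the 149.7M-parameter teacher at both precisions; the names are illustrative, and the calculation ignores file headers and any non-weight files.

```python
PARAMS = 149.7e6  # parameter count from the comparison table

BYTES_PER_PARAM = {"float32": 4, "float16": 2}

def weight_megabytes(num_params, dtype):
    """Raw weight storage in MB (10^6 bytes) for a given dtype."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6

fp32_mb = weight_megabytes(PARAMS, "float32")
fp16_mb = weight_megabytes(PARAMS, "float16")
print(f"fp32: {fp32_mb:.1f} MB, fp16: {fp16_mb:.1f} MB, ratio: {fp16_mb / fp32_mb:.2f}")
```

This gives about 598.8 MB and 299.4 MB, each roughly 3.5 MB below the sizes listed in the table (602.2 MB and 302.9 MB). The card does not break down that remainder; if it is a fixed overhead that does not shrink with precision, that would also explain why `lite` comes out at 50.3% of base size rather than exactly 50%.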
+
+
+ ## Usage
+
+ ```python
+ from transformers import AutoModelForMaskedLM, AutoTokenizer
+ tok = AutoTokenizer.from_pretrained("codechrl/modernbert-tiny")
+ model = AutoModelForMaskedLM.from_pretrained("codechrl/modernbert-tiny")
+
+ # Fine-tune for your task (replace N with your number of labels):
+ # from transformers import AutoModelForSequenceClassification
+ # clf = AutoModelForSequenceClassification.from_pretrained("codechrl/modernbert-tiny", num_labels=N)
+ ```
+
+ ## Intended use & limitations
+
+ - **A base to fine-tune**, not a finished classifier.
+ - Distilled on a **small compute budget** (demo-grade); for production, re-distill with more training steps and a larger corpus.
+ - `tiny` trades the most quality for the smallest size; `mini` and `lite` retain more quality at larger sizes.
+
+ ## Citation
+
+ Built on ModernBERT (Warner et al., 2024). Distillation recipes: DistilBERT (Sanh et al., 2019), TinyBERT (Jiao et al., 2020).
config.json ADDED
@@ -0,0 +1,60 @@
+ {
+   "architectures": [
+     "ModernBertForMaskedLM"
+   ],
+   "attention_bias": false,
+   "attention_dropout": 0.0,
+   "bos_token_id": 50281,
+   "classifier_activation": "gelu",
+   "classifier_bias": false,
+   "classifier_dropout": 0.0,
+   "classifier_pooling": "mean",
+   "cls_token_id": 50281,
+   "decoder_bias": true,
+   "deterministic_flash_attn": false,
+   "dtype": "float32",
+   "embedding_dropout": 0.0,
+   "eos_token_id": 50282,
+   "global_attn_every_n_layers": 3,
+   "gradient_checkpointing": false,
+   "hidden_activation": "gelu",
+   "hidden_size": 312,
+   "initializer_cutoff_factor": 2.0,
+   "initializer_range": 0.02,
+   "intermediate_size": 1248,
+   "layer_norm_eps": 1e-05,
+   "layer_types": [
+     "full_attention",
+     "sliding_attention",
+     "sliding_attention",
+     "full_attention"
+   ],
+   "local_attention": 128,
+   "max_position_embeddings": 8192,
+   "mlp_bias": false,
+   "mlp_dropout": 0.0,
+   "model_type": "modernbert",
+   "norm_bias": false,
+   "norm_eps": 1e-05,
+   "num_attention_heads": 12,
+   "num_hidden_layers": 4,
+   "pad_token_id": 50283,
+   "position_embedding_type": "absolute",
+   "rope_parameters": {
+     "full_attention": {
+       "rope_theta": 160000.0,
+       "rope_type": "default"
+     },
+     "sliding_attention": {
+       "rope_theta": 10000.0,
+       "rope_type": "default"
+     }
+   },
+   "sep_token_id": 50282,
+   "sparse_pred_ignore_index": -100,
+   "sparse_prediction": false,
+   "tie_word_embeddings": true,
+   "transformers_version": "5.12.1",
+   "use_cache": false,
+   "vocab_size": 50368
+ }
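The config values are enough to sanity-check the stated ~22.1M parameters. The sketch below is an estimate under assumptions about the ModernBERT layer layout that the config itself does not spell out: a fused QKV projection, a gated MLP whose input projection is twice `intermediate_size` wide, bias-free linear and norm layers (per the `*_bias: false` flags), and an output decoder whose weight is tied to the embeddings. The function name is hypothetical.

```python
# Values copied from config.json
config = {
    "vocab_size": 50368,
    "hidden_size": 312,
    "intermediate_size": 1248,
    "num_hidden_layers": 4,
}

def estimate_params(cfg):
    """Approximate parameter count under the layout assumptions stated above."""
    h, ff = cfg["hidden_size"], cfg["intermediate_size"]
    embeddings = cfg["vocab_size"] * h + h     # token table + embedding norm
    attention = h * 3 * h + h * h              # fused QKV + output projection
    mlp = h * 2 * ff + ff * h                  # gated input projection + output projection
    norms = 2 * h                              # two norm weights per layer (approximation)
    encoder = cfg["num_hidden_layers"] * (attention + mlp + norms) + h  # + final norm
    mlm_head = h * h + h + cfg["vocab_size"]   # dense + norm + decoder bias (weight tied)
    return embeddings + encoder + mlm_head

total = estimate_params(config)
print(f"{total / 1e6:.2f}M parameters, about {total * 4 / 1e6:.1f} MB in float32")
```

The estimate lands at roughly 22.1M parameters, and four bytes per parameter comes within a few kilobytes of the `size 88385544` recorded for `model.safetensors`. Note that the token embedding table alone (50368 × 312) accounts for about 15.7M of those parameters, so most of this model's size is vocabulary rather than transformer layers.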
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6bb9fe8cd2515af89a7b02f5703ea72b5daa63ba55f11216e208671966328665
+ size 88385544
tokenizer.json ADDED
The diff for this file is too large to render.
 
tokenizer_config.json ADDED
@@ -0,0 +1,17 @@
+ {
+   "backend": "tokenizers",
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "[CLS]",
+   "is_local": false,
+   "local_files_only": false,
+   "mask_token": "[MASK]",
+   "model_input_names": [
+     "input_ids",
+     "attention_mask"
+   ],
+   "model_max_length": 8192,
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "tokenizer_class": "TokenizersBackend",
+   "unk_token": "[UNK]"
+ }