# modernbert_L6_uniform_distilled (Distilled)

A lightweight sentence encoder created from answerdotai/ModernBERT-base via layer pruning, vocabulary pruning, and knowledge distillation.
## Model Details

| Property | Value |
|----------|-------|
| Teacher | answerdotai/ModernBERT-base |
| Architecture | ModernBERT (pruned) |
| Hidden dim | 768 |
| Layers | 6 / 22 |
| Layer indices | [0, 4, 8, 13, 17, 21] |
| Strategy | 6 layers, evenly spaced across the 22-layer teacher |
| Parameters | 63,870,720 |
| Model size (FP32) | 176.0 MB |
| Distilled | Yes |
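The FP32 size follows directly from the parameter count at 4 bytes per parameter. A quick check, using the 46,138,368 student parameters reported in the Architecture diagram below (the count consistent with 176.0 MB):

```python
# FP32 model size = parameter count * 4 bytes per parameter.
# 46,138,368 is the student parameter count from the Architecture diagram.
print(round(46_138_368 * 4 / 2**20, 1))  # 176.0 (MiB)
```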
## Architecture

```
==============================================================
  TEACHER: ModernBERT  →  STUDENT: 6L / 27,279 vocab
==============================================================

          TEACHER                           STUDENT
┌─────────────────────────┐      ┌─────────────────────────┐
│       Input Tokens      │      │       Input Tokens      │
└────────────┬────────────┘      └────────────┬────────────┘
             │                                │
┌────────────┴────────────┐      ┌────────────┴────────────┐
│        Embeddings       │      │   Embeddings (pruned)   │
│      vocab: 50,368      │      │      vocab: 27,279      │
│        dim: 768         │      │        dim: 768         │
└────────────┬────────────┘      └────────────┬────────────┘
             │                                │
┌─────────────────────────┐      ┌─────────────────────────┐
│         Layer 0         │ ──▶  │     Layer 0   (L0)      │
├─────────────────────────┤      ├─────────────────────────┤
│         Layer 1         │  ✂   │                         │
│         Layer 2         │  ✂   │                         │
│         Layer 3         │  ✂   │                         │
├─────────────────────────┤      ├─────────────────────────┤
│         Layer 4         │ ──▶  │     Layer 1   (L4)      │
├─────────────────────────┤      ├─────────────────────────┤
│         Layer 5         │  ✂   │                         │
│         Layer 6         │  ✂   │                         │
│         Layer 7         │  ✂   │                         │
├─────────────────────────┤      ├─────────────────────────┤
│         Layer 8         │ ──▶  │     Layer 2   (L8)      │
├─────────────────────────┤      ├─────────────────────────┤
│         Layer 9         │  ✂   │                         │
│         Layer 10        │  ✂   │                         │
│         Layer 11        │  ✂   │                         │
│         Layer 12        │  ✂   │                         │
├─────────────────────────┤      ├─────────────────────────┤
│         Layer 13        │ ──▶  │     Layer 3   (L13)     │
├─────────────────────────┤      ├─────────────────────────┤
│         Layer 14        │  ✂   │                         │
│         Layer 15        │  ✂   │                         │
│         Layer 16        │  ✂   │                         │
├─────────────────────────┤      ├─────────────────────────┤
│         Layer 17        │ ──▶  │     Layer 4   (L17)     │
├─────────────────────────┤      ├─────────────────────────┤
│         Layer 18        │  ✂   │                         │
│         Layer 19        │  ✂   │                         │
│         Layer 20        │  ✂   │                         │
├─────────────────────────┤      ├─────────────────────────┤
│         Layer 21        │ ──▶  │     Layer 5   (L21)     │
└────────────┬────────────┘      └────────────┬────────────┘
             │                                │
┌────────────┴────────────┐      ┌────────────┴────────────┐
│       Mean Pooling      │      │       Mean Pooling      │
│     → 768d embedding    │      │     → 768d embedding    │
└─────────────────────────┘      └─────────────────────────┘

Size:   495.8 MB (FP32)  →  176.0 MB (FP32)
Params: 129,980,160      →  46,138,368
Reduction: 64.5%
==============================================================
```
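The student keeps only the six selected teacher blocks. Below is a minimal sketch of that extraction step, assuming the Hugging Face `transformers` `ModernBertModel` layout (transformer blocks in `model.layers`); the output directory name is illustrative, and the actual pruning code used for this model is not published here.

```python
import torch.nn as nn
from transformers import AutoModel

KEEP = [0, 4, 8, 13, 17, 21]  # evenly spaced indices into the 22-layer teacher

model = AutoModel.from_pretrained("answerdotai/ModernBERT-base")

# Keep only the selected transformer blocks and record the new depth
# so the pruned checkpoint reloads with the correct configuration.
model.layers = nn.ModuleList(model.layers[i] for i in KEEP)
model.config.num_hidden_layers = len(KEEP)

model.save_pretrained("modernbert_L6_uniform")  # pruned student backbone
```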
## Quick Start

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("modernbert_L6_uniform_distilled", trust_remote_code=True)

sentences = [
    "Hello, how are you?",
    "안녕하세요",
    "Bonjour, comment allez-vous?",
]

embeddings = model.encode(sentences)
print(embeddings.shape)  # (3, 768)
```
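For similarity-based use cases, recent `sentence-transformers` releases (v3.0+) also expose a `similarity` helper on the model; a short usage sketch continuing from the code above:

```python
# Pairwise cosine similarity between the embeddings computed above.
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)  # torch.Size([3, 3]); diagonal entries ~ 1.0
```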
## MTEB Evaluation Results

Overall Average: 40.35%

| Task Group | Average |
|------------|---------|
| Classification | 50.57% |
| Clustering | 29.91% |
| STS | 40.56% |
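For reference, the overall figure matches the unweighted mean of the three task-group averages (an assumption about how it is computed, but the numbers agree):

```python
# Unweighted mean of the task-group averages reproduces the overall average.
print(round((50.57 + 29.91 + 40.56) / 3, 2))  # 40.35
```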
### Classification

| Task | Average | Details |
|------|---------|---------|
| AmazonCounterfactualClassification | 66.11% | en: 69.96%, en-ext: 68.7%, de: 63.61% |
| Banking77Classification | 61.69% | default: 61.69% |
| ImdbClassification | 53.73% | default: 53.73% |
| MTOPDomainClassification | 54.47% | en: 71.35%, es: 57.77%, de: 54.3% |
| MassiveIntentClassification | 31.3% | en: 53.06%, zh-CN: 47.36%, zh-TW: 41.65% |
| MassiveScenarioClassification | 32.3% | en: 57.38%, zh-CN: 50.25%, zh-TW: 43.81% |
| ToxicConversationsClassification | 57.77% | default: 57.77% |
| TweetSentimentExtractionClassification | 47.19% | default: 47.19% |
### Clustering

| Task | Average | Details |
|------|---------|---------|
| ArXivHierarchicalClusteringP2P | 49.8% | default: 49.8% |
| ArXivHierarchicalClusteringS2S | 48.45% | default: 48.45% |
| BiorxivClusteringP2P.v2 | 11.05% | default: 11.05% |
| MedrxivClusteringP2P.v2 | 21.71% | default: 21.71% |
| MedrxivClusteringS2S.v2 | 21.75% | default: 21.75% |
| StackExchangeClustering.v2 | 42.7% | default: 42.7% |
| StackExchangeClusteringP2P.v2 | 32.54% | default: 32.54% |
| TwentyNewsgroupsClustering.v2 | 11.24% | default: 11.24% |
### STS

| Task | Average | Details |
|------|---------|---------|
| BIOSSES | 42.43% | default: 42.43% |
| SICK-R | 53.89% | default: 53.89% |
| STS12 | 43.95% | default: 43.95% |
| STS13 | 42.51% | default: 42.51% |
| STS14 | 40.74% | default: 40.74% |
| STS15 | 53.89% | default: 53.89% |
| STS17 | 27.51% | en-en: 60.2%, es-es: 58.08%, ko-ko: 44.39% |
| STS22.v2 | 18.53% | zh: 47.25%, es: 39.83%, fr: 35.19% |
| STSBenchmark | 41.58% | default: 41.58% |
## Distillation Impact

Before/After refer to the pruned student before and after knowledge distillation (Stage 2 of Training); deltas are in percentage points.

| Task | Before | After | Delta |
|------|--------|-------|-------|
| AmazonCounterfactualClassification | 59.33% | 66.11% | +6.78%p |
| ArXivHierarchicalClusteringP2P | 50.19% | 49.8% | -0.39%p |
| ArXivHierarchicalClusteringS2S | 46.96% | 48.45% | +1.49%p |
| Banking77Classification | 35.01% | 61.69% | +26.68%p |
| BiorxivClusteringP2P.v2 | 12.62% | 11.05% | -1.57%p |
| BIOSSES | 33.84% | 42.43% | +8.59%p |
| ImdbClassification | 55.05% | 53.73% | -1.32%p |
| MassiveIntentClassification | 25.86% | 31.3% | +5.44%p |
| MassiveScenarioClassification | 26.28% | 32.3% | +6.02%p |
| MedrxivClusteringP2P.v2 | 22.13% | 21.71% | -0.42%p |
| MedrxivClusteringS2S.v2 | 19.43% | 21.75% | +2.32%p |
| MTOPDomainClassification | 43.24% | 54.47% | +11.23%p |
| SICK-R | 46.99% | 53.89% | +6.9%p |
| StackExchangeClustering.v2 | 34.26% | 42.7% | +8.44%p |
| StackExchangeClusteringP2P.v2 | 31.01% | 32.54% | +1.53%p |
| STS12 | 35.32% | 43.95% | +8.63%p |
| STS13 | 33.7% | 42.51% | +8.81%p |
| STS14 | 37.07% | 40.74% | +3.67%p |
| STS15 | 49.85% | 53.89% | +4.04%p |
| STS17 | 23.34% | 27.51% | +4.17%p |
| STS22.v2 | 24.05% | 18.53% | -5.52%p |
| STSBenchmark | 39.82% | 41.58% | +1.76%p |
| ToxicConversationsClassification | 52.6% | 57.77% | +5.17%p |
| TweetSentimentExtractionClassification | 38.42% | 47.19% | +8.77%p |
| TwentyNewsgroupsClustering.v2 | 9.11% | 11.24% | +2.13%p |
## Training

### Stage 1: Layer Pruning

- Teacher: answerdotai/ModernBERT-base (22 layers, 768d)
- Selected layers: [0, 4, 8, 13, 17, 21] (6 layers, evenly spaced across the 22-layer teacher)
- Vocabulary pruning: 50,368 → 27,279 tokens (see the sketch below)
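A minimal sketch of the vocabulary-pruning step, assuming it keeps only the embedding rows of the retained token ids; how the 27,279-token keep-list was built for this model is not documented here, so `keep_ids` is a placeholder.

```python
import torch
import torch.nn as nn

def prune_embedding(emb: nn.Embedding, keep_ids: list[int]) -> nn.Embedding:
    """Return a smaller embedding holding only the rows in keep_ids.

    The tokenizer must be rebuilt so each kept token maps to its new
    position in keep_ids; that remapping is omitted here.
    """
    idx = torch.tensor(keep_ids, dtype=torch.long)
    pruned = nn.Embedding(len(keep_ids), emb.embedding_dim)
    with torch.no_grad():
        pruned.weight.copy_(emb.weight[idx])
    return pruned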
### Stage 2: Knowledge Distillation

- Method: MSE + cosine-similarity loss (sketched below)
- Data: MTEB Classification/Clustering/STS task datasets
- Optimizer: AdamW (lr=2e-5, weight_decay=0.01)
- Schedule: cosine annealing over 3 epochs
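A sketch of what an MSE + cosine-similarity distillation objective looks like when applied to the pooled sentence embeddings; the weighting `alpha` and the exact formulation used to train this model are assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_emb: torch.Tensor,
                      teacher_emb: torch.Tensor,
                      alpha: float = 0.5) -> torch.Tensor:
    """Combine MSE and (1 - cosine similarity) between embeddings.

    alpha is a hypothetical weighting; the model card does not specify it.
    """
    mse = F.mse_loss(student_emb, teacher_emb)
    cos = 1.0 - F.cosine_similarity(student_emb, teacher_emb, dim=-1).mean()
    return alpha * mse + (1.0 - alpha) * cos
```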
## Supported Languages (18)

ko, en, ja, zh, es, fr, de, pt, it, ru, ar, hi, th, vi, id, tr, nl, pl