Upload L6_bottom with MTEB results
- README.md +52 -43
- config.json +1 -1
- config_sentence_transformers.json +1 -1
- id_map.json +0 -0
- model.safetensors +2 -2
- tokenizer.json +2 -2
- tokenizer_config.json +1 -1
README.md CHANGED
@@ -4,18 +4,17 @@ tags:
 - sentence-transformers
 - intent-classification
 - multilingual
-- distillation
 - layer-pruning
+- vocab-pruning
 library_name: sentence-transformers
 pipeline_tag: sentence-similarity
 license: apache-2.0
 ---

-#
+# L6_bottom

-
-
-Created by **layer pruning** from `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`.
+Lightweight multilingual sentence encoder optimized for intent classification.
+Created from `paraphrase-multilingual-MiniLM-L12-v2` via layer pruning + corpus-based vocabulary pruning.

 ## Model Details

@@ -24,76 +23,86 @@ Created by **layer pruning** from `sentence-transformers/paraphrase-multilingual
 | Teacher | paraphrase-multilingual-MiniLM-L12-v2 |
 | Architecture | XLM-RoBERTa (pruned) |
 | Hidden dim | 384 |
-| Layers | 6
+| Layers | 6 / 12 |
 | Layer indices | [0, 1, 2, 3, 4, 5] |
 | Strategy | 6 layers, bottom half (syntactic-focused) |
-
-
-
-
+| Vocab size | ~38,330 (pruned from 250K) |
+| Parameters | 26,184,576 |
+| Safetensors size | 98.1MB |
+| Distilled | No |

 ## Supported Languages (18)

 ko, en, ja, zh, es, fr, de, pt, it, ru, ar, hi, th, vi, id, tr, nl, pl

-##
-
-This is a **student encoder** designed to be used as the backbone for a lightweight
-3-class intent classifier (Action / Recall / Other) in multilingual dialogue systems.
-
-- **Action**: User requests an action (book, order, change settings, etc.)
-- **Recall**: User asks about past events or stored information
-- **Other**: Greetings, chitchat, emotions, etc.
-
-## Usage
+## Quick Start

 ```python
 from sentence_transformers import SentenceTransformer

 model = SentenceTransformer("L6_bottom")
-
-
+
+sentences = [
+    "예약 좀 해줘",          # Korean
+    "What did I order?",    # English
+    "今日はいい天気ですね",   # Japanese
+    "Reserva una mesa",     # Spanish
+]
+
+embeddings = model.encode(sentences)
+print(embeddings.shape)  # (4, 384)
 ```

-## MTEB Results
+## MTEB Evaluation Results
+
+**Overall Average: 57.05%**

 ### MassiveIntentClassification

-**Average:
+**Average: 54.7%**

 | Language | Score |
 |----------|-------|
-| ar |
-| en |
-| es | 56.
-| ko |
+| ar | 46.36% |
+| en | 59.84% |
+| es | 56.11% |
+| ko | 56.49% |

 ### MassiveScenarioClassification

-**Average:
+**Average: 59.39%**

 | Language | Score |
 |----------|-------|
-| ar |
-| en |
-| es | 60.
-| ko |
+| ar | 50.55% |
+| en | 64.52% |
+| es | 60.31% |
+| ko | 62.19% |

-## Training / Distillation
+## Training

-
-1. Load teacher: `paraphrase-multilingual-MiniLM-L12-v2` (12 layers, 384 hidden)
-2. Select layers: `[0, 1, 2, 3, 4, 5]`
-3. Copy embedding weights + selected layer weights
-4. Wrap with mean pooling for sentence embeddings
+This model was created via **layer pruning + vocabulary pruning**:
+
+1. **Teacher**: `paraphrase-multilingual-MiniLM-L12-v2` (12 layers, 384 hidden dim)
+2. **Layer selection**: `[0, 1, 2, 3, 4, 5]` - 6 layers, bottom half (syntactic-focused)
+3. **Vocab pruning**: 250K -> ~38K tokens (corpus-based filtering for 18 target languages)
+4. **No additional training** - weights are directly copied from the teacher
+
+A distilled version of this model is also available with improved performance.
+
+## Compression Summary
+
+| Stage | Vocab | Layers | Size |
+|-------|-------|--------|------|
+| Teacher (original) | 250,002 | 12 | ~480MB |
+| + Layer pruning | 250,002 | 6 | ~407MB |
+| + Vocab pruning | ~38,330 | 6 | ~98MB |

 ## Limitations

-
-- Vocabulary pruning limits the model to the target 18 languages
+- Vocabulary pruning restricts the model to the 18 target languages
 - Designed for short dialogue utterances, not long documents
+- Layer pruning may reduce performance on complex semantic tasks
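The removed "student encoder" description above still states the model's intended role: a frozen backbone for a lightweight 3-class intent classifier (Action / Recall / Other). A minimal sketch of that setup, using scikit-learn's `LogisticRegression` as a hypothetical head; the labeled utterances are invented placeholders, not training data from this repo:

```python
# Hypothetical sketch: the encoder as a frozen backbone for the 3-class
# intent classifier (Action / Recall / Other) described in the old README.
# The labeled examples below are invented placeholders.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

encoder = SentenceTransformer("L6_bottom")

texts = [
    "Book a table for two tonight",   # Action
    "What did I order last week?",    # Recall
    "Good morning!",                  # Other
]
labels = ["Action", "Recall", "Other"]

# Train a lightweight head on frozen sentence embeddings.
clf = LogisticRegression(max_iter=1000)
clf.fit(encoder.encode(texts), labels)

print(clf.predict(encoder.encode(["예약 좀 해줘"])))  # e.g. ["Action"]
```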
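The Training steps in the new README (load the teacher, keep layers [0..5], reuse mean pooling, no further training) map directly onto a few lines of code. Below is a minimal sketch under those assumptions, using the public `sentence-transformers` and `torch` APIs; it is illustrative, not the actual script behind this commit, and the vocabulary-pruning stage is sketched separately under id_map.json below:

```python
# Minimal sketch of the layer-pruning stage from the README's Training section.
# Illustrative only -- not the actual script used to produce this commit.
import torch.nn as nn
from sentence_transformers import SentenceTransformer

teacher = SentenceTransformer(
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
)
encoder = teacher[0].auto_model  # underlying 12-layer XLM-RoBERTa encoder

# Keep the bottom half of the encoder stack: layers [0, 1, 2, 3, 4, 5].
keep = [0, 1, 2, 3, 4, 5]
encoder.encoder.layer = nn.ModuleList(encoder.encoder.layer[i] for i in keep)
encoder.config.num_hidden_layers = len(keep)

# The teacher's mean-pooling module (teacher[1]) is reused unchanged, so the
# truncated model still produces 384-dim sentence embeddings.
teacher.save("L6_bottom_layers_only")  # hypothetical output path
```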
config.json CHANGED
@@ -21,5 +21,5 @@
   "transformers_version": "4.56.2",
   "type_vocab_size": 2,
   "use_cache": true,
-  "vocab_size":
+  "vocab_size": 38330
 }
config_sentence_transformers.json CHANGED
@@ -3,7 +3,7 @@
   "__version__": {
     "sentence_transformers": "5.3.0",
     "transformers": "4.56.2",
-    "pytorch": "2.10.0+
+    "pytorch": "2.10.0+cu128"
   },
   "prompts": {
     "query": "",
id_map.json ADDED
The diff for this file is too large to render. See raw diff.
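The diff for id_map.json is not rendered, but given the README's description of corpus-based vocabulary pruning (250K -> ~38K tokens) it plausibly records the mapping from original XLM-RoBERTa token ids to the pruned ids. A hedged sketch of how such a map and the pruned embedding matrix could be built; the corpus path, the filtering loop, and the old-id -> new-id schema are assumptions, not taken from this commit:

```python
# Hypothetical sketch of corpus-based vocabulary pruning. The id_map.json
# schema (old id -> new id) is an assumption; the real script is not in this diff.
import json
from collections import Counter

import torch
from transformers import AutoModel, AutoTokenizer

name = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

# 1. Count token ids over a corpus covering the 18 target languages.
counts = Counter()
for line in open("corpus.txt", encoding="utf-8"):  # hypothetical corpus file
    counts.update(tokenizer(line.rstrip("\n"))["input_ids"])

# 2. Keep special tokens plus every token seen in the corpus, in old-id order.
keep = sorted(set(tokenizer.all_special_ids) | set(counts))
id_map = {old: new for new, old in enumerate(keep)}  # old id -> new id

# 3. Gather the surviving rows of the input embedding matrix.
old_emb = model.get_input_embeddings().weight
new_emb = torch.nn.Embedding(len(keep), old_emb.shape[1])
new_emb.weight.data = old_emb.data[keep]
model.set_input_embeddings(new_emb)
model.config.vocab_size = len(keep)  # 38,330 in this commit

json.dump(id_map, open("id_map.json", "w"))
```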
model.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:75aade5a2325bfa6346cc282b70cbad0525ffc5add0ef159448f2df61b1260e7
+size 102857288
tokenizer.json CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:0ab1d8ad18d647b10254a627ba87f4f8dac8aea96ca026510f5f883fe2e6532e
+size 2816831
tokenizer_config.json CHANGED
@@ -32,7 +32,7 @@
     "single_word": false,
     "special": true
   },
-  "
+  "38329": {
     "content": "<mask>",
     "lstrip": true,
     "normalized": false,
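The remapped `<mask>` entry lines up with the new `vocab_size` in config.json: 38,330 tokens, with `<mask>` as the final id, 38329. A quick sanity check, assuming the updated files are loadable from a local `L6_bottom` directory as in the README's Quick Start:

```python
# Sanity check: pruned vocab size and remapped <mask> id agree across files.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("L6_bottom")  # local path, as in the README
model = AutoModel.from_pretrained("L6_bottom")

assert model.config.vocab_size == 38330   # from config.json
assert tokenizer.mask_token_id == 38329   # from tokenizer_config.json
print(model.get_input_embeddings().weight.shape)  # torch.Size([38330, 384])
```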