# L4_uniform_distilled

Lightweight multilingual sentence encoder optimized for intent classification. Created from paraphrase-multilingual-MiniLM-L12-v2 via layer pruning, corpus-based vocabulary pruning, and knowledge distillation.
## Model Details
| Property | Value |
|---|---|
| Teacher | paraphrase-multilingual-MiniLM-L12-v2 |
| Architecture | XLM-RoBERTa (pruned) |
| Hidden dim | 384 |
| Layers | 4 / 12 |
| Layer indices | [0, 4, 7, 11] |
| Strategy | 4 layers, evenly spaced (compact) |
| Vocab size | ~38,330 (pruned from 250K) |
| Parameters | 22,642,560 |
| Safetensors size | 84.6MB |
| Distilled | Yes |
## Supported Languages (18)
ko, en, ja, zh, es, fr, de, pt, it, ru, ar, hi, th, vi, id, tr, nl, pl
## Quick Start

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("L4_uniform_distilled")

sentences = [
    "예약 좀 해줘",          # Korean: "Make a reservation"
    "What did I order?",     # English
    "今日はいい天気ですね",    # Japanese: "Nice weather today"
    "Reserva una mesa",      # Spanish: "Reserve a table"
]

embeddings = model.encode(sentences)
print(embeddings.shape)  # (4, 384)
```
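For intent classification, the embeddings above are typically matched against embeddings of candidate intent labels by cosine similarity. A minimal sketch with NumPy (the intent labels in the commented usage are illustrative, not part of this model card):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between two batches of row vectors."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

# Hypothetical usage with the model loaded as above:
# query = model.encode(["예약 좀 해줘"])  # "Make a reservation" (Korean)
# intents = model.encode(["book a table", "check order status"])
# scores = cosine_similarity(query, intents)
# best = int(scores.argmax())  # index of the closest intent
```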
## MTEB Evaluation Results

Overall Average: 54.61%

### MassiveIntentClassification

Average: 50.88%
| Language | Score |
|---|---|
| ar | 41.25% |
| en | 62.02% |
| es | 53.15% |
| ko | 47.08% |
### MassiveScenarioClassification

Average: 58.34%
| Language | Score |
|---|---|
| ar | 47.4% |
| en | 69.96% |
| es | 61.24% |
| ko | 54.74% |
## Distillation Impact
| Task | Before Distillation | After Distillation | Delta |
|---|---|---|---|
| MassiveIntentClassification | 50.25% | 50.88% | +0.63 pp |
| MassiveScenarioClassification | 53.82% | 58.34% | +4.52 pp |
## Training

This model was created in two stages:

### Stage 1: Layer Pruning

- Teacher model: paraphrase-multilingual-MiniLM-L12-v2 (12 layers, 384 hidden dim)
- Selected layers: [0, 4, 7, 11] (4 layers, evenly spaced, compact)
- Vocabulary pruning: 250K -> ~38K tokens (corpus-based, 18 target languages)
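The model card does not include the pruning code, but keeping a subset of encoder layers comes down to replacing the encoder's `ModuleList`. A sketch with PyTorch (`prune_layers` is a hypothetical helper; the `encoder.layer` attribute path follows Hugging Face's XLM-RoBERTa implementation):

```python
import torch.nn as nn

def prune_layers(layers: nn.ModuleList, keep: list[int]) -> nn.ModuleList:
    """Return a new ModuleList containing only the layers at `keep` indices."""
    return nn.ModuleList(layers[i] for i in keep)

# Hypothetical usage on a loaded XLM-RoBERTa encoder:
# model.encoder.layer = prune_layers(model.encoder.layer, [0, 4, 7, 11])
# model.config.num_hidden_layers = 4
```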
### Stage 2: Knowledge Distillation

- Method: MSE + cosine similarity loss between teacher and student embeddings
- Training data: MASSIVE dataset (90K multilingual sentences, 18 languages)
- Optimizer: AdamW (lr=2e-5, weight_decay=0.01)
- Schedule: cosine annealing over 3 epochs
- Batch size: 64
- Base model: L4_uniform (layer-pruned only)
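The combined objective can be sketched as follows; the equal weighting between the two terms (`alpha=0.5`) is an assumption, since the card does not state how they were balanced:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student: torch.Tensor, teacher: torch.Tensor,
                      alpha: float = 0.5) -> torch.Tensor:
    """MSE plus (1 - cosine similarity) between embedding batches.

    The alpha weighting is an assumption, not stated in the model card.
    """
    mse = F.mse_loss(student, teacher)
    cos = 1.0 - F.cosine_similarity(student, teacher, dim=-1).mean()
    return alpha * mse + (1.0 - alpha) * cos
```

When the student embeddings exactly match the teacher's, both terms vanish and the loss is zero.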
## Compression Summary
| Stage | Vocab | Layers | Size |
|---|---|---|---|
| Teacher (original) | 250,002 | 12 | ~480MB |
| + Layer pruning | 250,002 | 4 | ~393MB |
| + Vocab pruning | ~38,330 | 4 | ~85MB |
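As a sanity check on the table, the fp32 savings from vocabulary pruning alone can be computed directly from the token-embedding matrix dimensions (the sizes listed above are approximate):

```python
HIDDEN = 384
VOCAB_FULL = 250_002
VOCAB_PRUNED = 38_330

params_saved = (VOCAB_FULL - VOCAB_PRUNED) * HIDDEN  # 81,282,048 params
mib_saved = params_saved * 4 / 1024**2               # fp32 bytes -> MiB
print(f"{mib_saved:.0f} MiB")  # ~310 MiB, close to the ~308 MB drop in the table
```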
## Limitations
- Vocabulary pruning restricts the model to the 18 target languages
- Designed for short dialogue utterances, not long documents
- Layer pruning may reduce performance on complex semantic tasks