# L6_bottom_distilled_v2

Lightweight multilingual sentence encoder optimized for intent classification. Created from paraphrase-multilingual-MiniLM-L12-v2 via layer pruning, vocabulary pruning, and an improved knowledge-distillation recipe.

## What's Different (v2)

This model uses an improved distillation strategy compared to v1:

| Hyperparameter | v1 | v2 |
|---|---|---|
| Learning rate | 2e-5 | 5e-6 (gentler) |
| Epochs | 3 | 10 (longer training) |
| MSE weight | 1.0 | 0.3 (less magnitude forcing) |
| Cosine weight | 0.5 | 2.0 (focus on direction alignment) |

**Key insight:** The original L6_bottom model already has strong syntactic representations from the teacher's bottom layers, and aggressive MSE loss destroys them. The v2 strategy uses a cosine-dominant loss to align embedding direction without forcing magnitude, preserving the model's existing strengths.
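The v2 objective can be sketched in PyTorch as follows; `distill_loss_v2` and its argument names are illustrative, not the card author's actual code:

```python
import torch
import torch.nn.functional as F

def distill_loss_v2(student_emb, teacher_emb, mse_weight=0.3, cos_weight=2.0):
    """Cosine-dominant distillation loss: 0.3 * MSE + 2.0 * (1 - cos)."""
    # MSE term forces magnitudes to match; deliberately down-weighted in v2.
    mse = F.mse_loss(student_emb, teacher_emb)
    # Cosine term aligns direction only, averaged over the batch.
    cos = 1.0 - F.cosine_similarity(student_emb, teacher_emb, dim=-1).mean()
    return mse_weight * mse + cos_weight * cos
```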

## Results Comparison

| Model | MassiveIntent | MassiveScenario | Average |
|---|---|---|---|
| Teacher (12L, 480MB) | 55.52% | 61.01% | 58.27% |
| L6_bottom (no distill) | 54.70% | 59.39% | 57.05% |
| L6_bottom_distilled v1 | 51.63% | 59.03% | 55.33% |
| L6_bottom_distilled v2 | 53.06% | 59.86% | 56.46% |

v2 surpasses the undistilled L6_bottom on MassiveScenario (59.86% vs. 59.39%) and narrows v1's gap on MassiveIntent (53.06% vs. 51.63%).

## Detailed Scores (v2)

| Language | MassiveIntent | MassiveScenario |
|---|---|---|
| ar | 42.95% | 48.90% |
| en | 61.44% | 69.26% |
| es | 56.33% | 62.55% |
| ko | 51.53% | 58.74% |

## Model Details

| Property | Value |
|---|---|
| Teacher | paraphrase-multilingual-MiniLM-L12-v2 |
| Architecture | XLM-RoBERTa (pruned) |
| Hidden dim | 384 |
| Layers | 6 / 12 |
| Layer indices | [0, 1, 2, 3, 4, 5] |
| Strategy | Bottom half (syntactic-focused) |
| Vocab size | ~38,330 (pruned from 250K) |
| Parameters | 25.7M (F32) |
| Safetensors | 98.1 MB |

## Quick Start
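
A minimal usage sketch with the sentence-transformers library; the repo id below is a placeholder, so substitute the model's actual published path:

```python
from sentence_transformers import SentenceTransformer

# Placeholder repo id -- replace with the actual model path.
model = SentenceTransformer("L6_bottom_distilled_v2")

# Encode short intent utterances into 384-dim sentence embeddings.
embeddings = model.encode(["turn on the kitchen lights", "책 좀 추천해 줘"])
print(embeddings.shape)
```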

## Training

### Stage 1: Layer Pruning

- Teacher: paraphrase-multilingual-MiniLM-L12-v2 (12 layers)
- Selected layers: [0, 1, 2, 3, 4, 5] (bottom half, syntactic-focused)
- Vocab pruning: 250K -> ~38K tokens (corpus-based, 18 target languages)
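The bottom-half layer selection can be sketched with `transformers`; the toy XLM-RoBERTa config below (random weights, tiny vocab) stands in for loading the real teacher checkpoint:

```python
import torch
from transformers import XLMRobertaConfig, XLMRobertaModel

# Toy config standing in for the real 12-layer teacher (hidden dim 384);
# a real run would load paraphrase-multilingual-MiniLM-L12-v2 instead.
config = XLMRobertaConfig(
    vocab_size=1000, hidden_size=384,
    num_hidden_layers=12, num_attention_heads=12, intermediate_size=1536,
)
teacher = XLMRobertaModel(config)

# Keep only the bottom half of the encoder stack: layers [0..5].
keep = [0, 1, 2, 3, 4, 5]
teacher.encoder.layer = torch.nn.ModuleList(teacher.encoder.layer[i] for i in keep)
teacher.config.num_hidden_layers = len(keep)
```

Vocabulary pruning (not shown) would additionally shrink the embedding matrix to the tokens observed in the target-language corpus.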

### Stage 2: Knowledge Distillation (v2)

- Loss: `0.3 * MSE + 2.0 * (1 - CosineSimilarity)`
- Training data: MASSIVE dataset (90K multilingual sentences)
- Optimizer: AdamW (lr=5e-6, weight_decay=0.01)
- Schedule: cosine annealing over 10 epochs
- Batch size: 64
- Hardware: NVIDIA RTX 5090
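Putting the Stage 2 hyperparameters together, a minimal optimizer/schedule sketch; the `student` module is a stand-in, and data loading plus the loss computation are elided:

```python
import torch

# Stand-in for the pruned 6-layer student model.
student = torch.nn.Linear(384, 384)
optimizer = torch.optim.AdamW(student.parameters(), lr=5e-6, weight_decay=0.01)

epochs, steps_per_epoch = 10, 1406  # ~90K sentences / batch size 64
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=epochs * steps_per_epoch
)

for _ in range(epochs * steps_per_epoch):
    # forward pass, distillation loss, and backward() would go here
    optimizer.step()
    scheduler.step()
```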

## Limitations

- Vocabulary pruning restricts the model to the 18 target languages
- Designed for short dialogue utterances, not long documents