# Voxtral Emotion Speech

- Dataset: `MrlolDev/voxtral-emotion-speech`
- Model: `MrlolDev/voxtral-emotion-speech`

## Benchmark Results
| Model | UA% | WA% | F1% | WF1% | Data |
|---|---|---|---|---|---|
| Ours (Frozen + MLP) | 16.3 | 25.4 | 14.2 | 21.9 | 500 synthetic (ElevenLabs) |
| Ours (LoRA + MLP) | - | - | - | - | end-to-end |
| SenseVoice-S | 70.5 | 65.7 | 67.9 | 67.8 | zero-shot |
| emotion2vec+ large | ~80 | ~80 | - | - | IEMOCAP |
Tested on 477 of 1,004 IEMOCAP test samples (4 classes: neutral, happy, sad, angry). Our model was trained only on synthetic ElevenLabs clips, so the low score is expected. Baselines whose Data column reads "IEMOCAP" were fine-tuned on it.
## Benchmark Methodology

- Dataset: IEMOCAP test set from `AudioLLMs/iemocap_emotion_recognition`
- Features: 1280-dim vectors from the Voxtral encoder via `audio_tower()` + mean pooling
- Classifier: MLP (1280 → 512 → 256 → 6), two variants:
  - Frozen encoder + MLP (trained on 500 synthetic clips)
  - LoRA-finetuned encoder + MLP (end-to-end training)
- Metrics: UA (macro recall), WA (accuracy), F1 (macro), WF1 (weighted)
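These four metrics map directly onto standard scikit-learn calls; a minimal sketch on toy labels (not benchmark data), assuming the usual definitions listed above:

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Toy 4-class example, purely illustrative
y_true = [0, 0, 1, 1, 2, 3]
y_pred = [0, 1, 1, 1, 2, 0]

ua = recall_score(y_true, y_pred, average="macro")                    # UA: unweighted (macro) recall
wa = accuracy_score(y_true, y_pred)                                   # WA: plain accuracy
f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)       # F1: macro-averaged
wf1 = f1_score(y_true, y_pred, average="weighted", zero_division=0)   # WF1: support-weighted
```

UA and WA diverge exactly when the class distribution is skewed, which is why both are reported.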
## What We Did

- Loaded audio from the dataset
- Extracted 1280-dim features from Voxtral encoder hidden states using mean pooling
- Trained a classification head (MLP: 1280 → 512 → 256 → 6) with class weights to handle imbalance
- LoRA finetuning: added LoRA adapters to the last 6 encoder layers for end-to-end training
- Benchmarked against SenseVoice on IEMOCAP emotion recognition
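The pooling step above is a single reduction over the time axis. A minimal sketch; the commented model-loading lines use class and attribute names assumed from this README, not verified API:

```python
import torch

def mean_pool(hidden_states: torch.Tensor) -> torch.Tensor:
    """Collapse (batch, time, 1280) encoder states to one (batch, 1280) vector per clip."""
    return hidden_states.mean(dim=1)

# The extraction itself requires the Voxtral checkpoint (names are assumptions):
# model = VoxtralForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.float16)
# hidden = model.audio_tower(input_features).last_hidden_state  # (batch, time, 1280)
# feats = mean_pool(hidden).float().cpu().numpy()               # (batch, 1280)
```

Mean pooling makes the clip-level feature length-invariant, so the MLP head sees a fixed 1280-dim input regardless of audio duration.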
## LoRA Scripts

### 1. finetune_lora.py

End-to-end LoRA finetuning of the Voxtral encoder + emotion head:

```bash
python finetune_lora.py
```

Output: `emotion_head_lora_best.pt` plus the LoRA adapter in `lora_adapter/`
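LoRA leaves the pretrained weight frozen and trains a low-rank update ΔW = (α/r)·B·A alongside it. A minimal numeric sketch of that idea (illustrative sizes; not the actual `finetune_lora.py` implementation, which uses adapter layers inside the encoder):

```python
import torch

torch.manual_seed(0)
d, r, alpha = 8, 2, 4            # hidden size, rank, scaling factor (illustrative values)
W = torch.randn(d, d)            # frozen pretrained weight
A = torch.randn(r, d)            # LoRA "down" projection (trainable)
B = torch.zeros(d, r)            # LoRA "up" projection, zero-init so ΔW starts at 0

def lora_forward(x: torch.Tensor) -> torch.Tensor:
    # Frozen base path + scaled low-rank update; only A and B would get gradients.
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

x = torch.randn(1, d)
assert torch.allclose(lora_forward(x), x @ W.T)  # zero-init B => identical to base at start
```

Because B starts at zero, training begins from exactly the pretrained behavior, and only 2·d·r parameters per adapted matrix are trained instead of d².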
### 2. benchmark_lora.py

Compares the frozen and LoRA-finetuned encoders on IEMOCAP:

```bash
python benchmark_lora.py
```
## Emotions
- neutral
- happy
- sad
- angry
- fear
- surprise
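The scripts encode these six classes as integer labels 0-5. The index assignment below follows the list order above; the exact mapping used by the scripts is an assumption:

```python
# Hypothetical label mapping, ordered as in the README's emotion list
EMOTIONS = ["neutral", "happy", "sad", "angry", "fear", "surprise"]
LABEL2ID = {name: i for i, name in enumerate(EMOTIONS)}
ID2LABEL = {i: name for i, name in enumerate(EMOTIONS)}
```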
## Pipeline Scripts

### 1. setup.sh

Installs dependencies with uv and logs into Hugging Face:

```bash
bash setup.sh
```
### 2. extract_features.py

- Loads the dataset from Hugging Face
- Loads the Voxtral model (float16)
- Extracts 1280-dim features from encoder hidden states (mean pooling)
- Saves features to `features.pkl`
- Uploads `features.pkl` and `README.md` to the model repo

```bash
python extract_features.py
```
Output: `features.pkl`, a list of records with keys:

- `features`: numpy array of shape (1280,)
- `label`: int (0-5)
- `emotion`: string
- `split`: `"train"` / `"validation"` / `"test"`
- `sensevoice_score`: float
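A record with that layout can be consumed with plain `pickle`; a hypothetical loader sketch (the helper name and toy record are illustrations, not from the repo):

```python
import pickle
import numpy as np

def load_split(path, split):
    """Read features.pkl (layout as documented above) and return X, y for one split."""
    with open(path, "rb") as f:
        records = pickle.load(f)
    keep = [r for r in records if r["split"] == split]
    return [r["features"] for r in keep], [r["label"] for r in keep]

# A toy record matching the documented schema:
record = {"features": np.zeros(1280, dtype=np.float32), "label": 3,
          "emotion": "angry", "split": "train", "sensevoice_score": 0.91}
```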
### 3. train.py

- Loads features from `features.pkl`
- Splits 70/15/15 if the data carries no split field
- Trains the EmotionHead MLP:
  - 1280 → 512 → 256 → 6
  - BatchNorm + ReLU + Dropout(0.3)
  - Class weights for imbalance
- Trains for 150 epochs with AdamW + ReduceLROnPlateau
- Saves the best model by validation accuracy
- Uploads model weights and plots to the model repo

```bash
python train.py
```
Outputs:

- `emotion_head_best.pt` - best model weights
- `confusion_matrix.png` - test confusion matrix
- `training_curve.png` - loss curves
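The class weights mentioned above are typically inverse-frequency weights passed to the loss; a sketch under that assumption (the exact scheme in `train.py` is not stated):

```python
from collections import Counter

import torch

def class_weights(labels, num_classes=6):
    """Inverse-frequency weights, suitable for nn.CrossEntropyLoss(weight=...)."""
    counts = Counter(labels)
    total = len(labels)
    return torch.tensor(
        [total / (num_classes * counts.get(c, 1)) for c in range(num_classes)],
        dtype=torch.float32,
    )

# Rare classes get proportionally larger weight in the loss:
loss_fn = torch.nn.CrossEntropyLoss(weight=class_weights([0, 0, 0, 1, 2, 3, 4, 5]))
```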
### 4. benchmark.py

Benchmarks the trained model:

Bench 1: Emotion F1 vs SenseVoice
- Uses the RAVDESS test set
- Maps the 8 RAVDESS emotions to the 6 classes
- Compares against the SenseVoice baseline

Bench 2: Transcription WER
- Uses LibriSpeech test-clean (100 samples)
- Verifies that freezing the encoder doesn't affect the decoder

```bash
python benchmark.py
```

Output: `benchmark_results.json`
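WER in Bench 2 is word-level edit distance divided by reference length; a self-contained sketch of that metric (`benchmark.py` may well use a library such as jiwer instead):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

assert wer("the cat sat", "the cat sat") == 0.0
```

If the frozen-encoder claim holds, WER on LibriSpeech should match the unmodified model's.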
## Running on RunPod

### Pod Setup

- GPU: RTX 4090 (~$0.48/hr)
- Template: RunPod PyTorch 2.1
- Container Disk: 30 GB

### Execution Order

```bash
# 1. Setup
bash setup.sh

# 2. Extract features (~20 min)
python extract_features.py

# 3. Train (~10 min)
python train.py

# 4. Benchmark (~20 min)
python benchmark.py

# 5. Download results
tar -czf results.tar.gz emotion_head_best.pt features.pkl \
    confusion_matrix.png training_curve.png benchmark_results.json
```

Then download `results.tar.gz` from the RunPod Files tab.
## Model Architecture

```
Voxtral Encoder (frozen)
        ↓
Mean Pooling (1280 dims)
        ↓
EmotionHead MLP
```

- Linear(1280, 512) + BatchNorm + ReLU + Dropout(0.3)
- Linear(512, 256) + BatchNorm + ReLU + Dropout(0.3)
- Linear(256, 6)
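The head described above, sketched as a PyTorch module; the layer sizes, norms, activations, and dropout rate come from this README, everything else is an assumption:

```python
import torch
import torch.nn as nn

class EmotionHead(nn.Module):
    """MLP classifier on top of mean-pooled 1280-dim Voxtral encoder features."""

    def __init__(self, in_dim=1280, num_classes=6, p_drop=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512), nn.BatchNorm1d(512), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(512, 256), nn.BatchNorm1d(256), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(256, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

head = EmotionHead()
logits = head(torch.randn(4, 1280))  # one 6-way logit vector per clip
```

Note that BatchNorm requires batch size > 1 in training mode; call `head.eval()` for single-clip inference.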
## Model Tree

mistralai/Ministral-3-3B-Base-2512 → mistralai/Voxtral-Mini-4B-Realtime-2602 → MrlolDev/voxtral-emotion-speech