Voxtral Emotion Speech

Dataset: MrlolDev/voxtral-emotion-speech

Model: MrlolDev/voxtral-emotion-speech

Benchmark Results

Model                 UA%    WA%    F1%    WF1%   Data
Ours (Frozen + MLP)   16.3   25.4   14.2   21.9   500 synthetic clips (ElevenLabs)
Ours (LoRA + MLP)     -      -      -      -      end-to-end
SenseVoice-S          70.5   65.7   67.9   67.8   zero-shot
emotion2vec+ large    ~80    ~80    -      -      IEMOCAP (fine-tuned)

Tested on 477 of 1,004 IEMOCAP test samples (4-class: neutral, happy, sad, angry). Our model was trained only on synthetic ElevenLabs clips, so the low score is expected. emotion2vec+ large was fine-tuned on IEMOCAP and is not a zero-shot comparison.

Benchmark Methodology

  1. Dataset: IEMOCAP test set from AudioLLMs/iemocap_emotion_recognition
  2. Features: 1280-dim from Voxtral encoder via audio_tower() + mean pooling
  3. Classifier: MLP (1280 → 512 → 256 → 6), two variants:
    • Frozen encoder + MLP (trained on 500 synthetic clips)
    • LoRA finetuned encoder + MLP (end-to-end training)
  4. Metrics: UA (macro recall), WA (accuracy), F1 (macro), WF1 (weighted)
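
The four metrics above can be reproduced with scikit-learn; a minimal sketch (the labels below are illustrative, not real benchmark data):

```python
# Illustrative sketch: computing UA/WA/F1/WF1 as defined above with scikit-learn.
from sklearn.metrics import accuracy_score, f1_score, recall_score

y_true = [0, 0, 1, 1, 2, 3]  # toy ground-truth labels (4-class IEMOCAP subset)
y_pred = [0, 1, 1, 1, 2, 0]  # toy predictions

ua = recall_score(y_true, y_pred, average="macro")               # UA: unweighted (macro) recall
wa = accuracy_score(y_true, y_pred)                              # WA: plain accuracy
f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)  # F1: macro-averaged
wf1 = f1_score(y_true, y_pred, average="weighted", zero_division=0)  # WF1: support-weighted

print(f"UA={ua:.3f} WA={wa:.3f} F1={f1:.3f} WF1={wf1:.3f}")
```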

What We Did

  1. Loaded audio from the dataset
  2. Extracted 1280-dim features from Voxtral encoder hidden states using mean pooling
  3. Trained a classification head (MLP: 1280 → 512 → 256 → 6) with class weights for imbalance
  4. LoRA finetuning: Added LoRA adapters to last 6 encoder layers for end-to-end training
  5. Benchmarked against SenseVoice on IEMOCAP emotion recognition
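
Step 2 (mean pooling over encoder hidden states) collapses a variable-length sequence of 1280-dim frames into one clip-level vector; a minimal PyTorch sketch (the tensor shapes are assumptions based on the description above):

```python
import torch

def mean_pool(hidden_states: torch.Tensor) -> torch.Tensor:
    """Average encoder frames into one clip-level feature vector.

    hidden_states: (batch, time, 1280) hidden states from the Voxtral
    encoder (e.g. the output of audio_tower(), as described above).
    Returns: (batch, 1280) pooled features for the MLP head.
    """
    return hidden_states.mean(dim=1)

# Toy example: 1 clip, 50 encoder frames, 1280 dims.
feats = mean_pool(torch.randn(1, 50, 1280))
print(feats.shape)  # torch.Size([1, 1280])
```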

LoRA Scripts

1. finetune_lora.py

End-to-end LoRA finetuning on Voxtral encoder + emotion head:

python finetune_lora.py

Output: emotion_head_lora_best.pt + LoRA adapter in lora_adapter/
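
Targeting the "last 6 encoder layers" typically means listing their attention projections as LoRA target modules; a sketch of building that list (the module-name pattern, encoder depth, and projection choice are assumptions, not taken from finetune_lora.py):

```python
# Hypothetical module-name pattern for the Voxtral encoder; the real names
# depend on the model implementation. The resulting list would be passed as
# target_modules to peft.LoraConfig for end-to-end finetuning.
NUM_LAYERS = 32      # assumed encoder depth
LORA_LAYERS = 6      # last 6 layers, per the description above

target_modules = [
    f"layers.{i}.self_attn.{proj}"
    for i in range((NUM_LAYERS - LORA_LAYERS), NUM_LAYERS)
    for proj in ("q_proj", "v_proj")  # assumed projection choice
]
print(len(target_modules))  # 6 layers x 2 projections = 12 modules
```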

2. benchmark_lora.py

Compare frozen vs LoRA encoder on IEMOCAP:

python benchmark_lora.py

Emotions

  • neutral
  • happy
  • sad
  • angry
  • fear
  • surprise

Scripts

1. setup.sh

Installs dependencies using uv and logs in to Hugging Face.

bash setup.sh

2. extract_features.py

  1. Loads dataset from HuggingFace
  2. Loads Voxtral model (float16)
  3. Extracts 1280-dim features from encoder hidden states (mean pooling)
  4. Saves features to features.pkl
  5. Uploads features.pkl and README.md to model repo

python extract_features.py

Output: features.pkl - list of records with keys:

  • features: numpy array (1280,)
  • label: int (0-5)
  • emotion: string
  • split: "train"/"validation"/"test"
  • sensevoice_score: float
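
A features.pkl record can be consumed like this (the field values below are made up for illustration; only the schema comes from the list above):

```python
import pickle

import numpy as np

# Build a record with the schema listed above (values are illustrative).
record = {
    "features": np.zeros(1280, dtype=np.float32),
    "label": 2,
    "emotion": "sad",
    "split": "train",
    "sensevoice_score": 0.87,
}

# Round-trip through pickle the way extract_features.py / train.py would.
blob = pickle.dumps([record])
records = pickle.loads(blob)
train = [r for r in records if r["split"] == "train"]
print(train[0]["emotion"], train[0]["features"].shape)
```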

3. train.py

  1. Loads features from features.pkl
  2. Splits 70/15/15 if no split in data
  3. Trains EmotionHead MLP:
    • 1280 → 512 → 256 → 6
    • BatchNorm + ReLU + Dropout(0.3)
  4. Uses class weights for imbalance
  5. Trains 150 epochs with AdamW + ReduceLROnPlateau
  6. Saves best model by validation accuracy
  7. Uploads model weights and plots to model repo

python train.py

Outputs:

  • emotion_head_best.pt - Best model weights
  • confusion_matrix.png - Test confusion matrix
  • training_curve.png - Loss curves
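
Step 4's class weights counteract label imbalance; a common inverse-frequency scheme (train.py's exact weighting may differ) looks like:

```python
from collections import Counter

import torch

labels = [0, 0, 0, 0, 1, 1, 2, 3]  # toy imbalanced labels (classes 0..3 of 6)
counts = Counter(labels)
n, k = len(labels), 6  # total samples, number of emotion classes

# weight_c = n / (k * count_c); classes absent from the data get weight 0 here.
weights = torch.tensor(
    [n / (k * counts[c]) if counts[c] else 0.0 for c in range(k)]
)
loss_fn = torch.nn.CrossEntropyLoss(weight=weights)  # rare classes count more
print(weights)
```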

4. benchmark.py

Benchmarks the trained model:

Bench 1: Emotion F1 vs SenseVoice

  • Uses RAVDESS test set
  • Maps 8 RAVDESS emotions to 6 classes
  • Compares against SenseVoice baseline
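
The 8 → 6 mapping might look like the following (this particular mapping is an assumption; RAVDESS also includes calm and disgust, which must be folded in or dropped, and benchmark.py may do this differently):

```python
# Hypothetical RAVDESS -> 6-class mapping; benchmark.py's actual mapping
# may merge or drop classes differently.
RAVDESS_TO_SIX = {
    "neutral": "neutral",
    "calm": "neutral",      # folded into neutral (assumption)
    "happy": "happy",
    "sad": "sad",
    "angry": "angry",
    "fearful": "fear",
    "surprised": "surprise",
    "disgust": None,        # dropped: no matching class (assumption)
}

kept = {k: v for k, v in RAVDESS_TO_SIX.items() if v is not None}
print(sorted(set(kept.values())))
```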

Bench 2: Transcription WER

  • Uses LibriSpeech test-clean (100 samples)
  • Verifies encoder freezing doesn't affect decoder

python benchmark.py

Output: benchmark_results.json
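
Bench 2's WER is word-level edit distance normalized by the number of reference words; a self-contained reference implementation for clarity (benchmark.py may use a library such as jiwer instead):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat", "the cat sat"))  # 0.0
print(wer("the cat sat", "the bat sat"))  # one substitution over 3 words
```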


Running on RunPod

Pod Setup

  • GPU: RTX 4090 (~$0.48/hr)
  • Template: RunPod PyTorch 2.1
  • Container Disk: 30GB

Execution Order

# 1. Setup
bash setup.sh

# 2. Extract features (~20 min)
python extract_features.py

# 3. Train (~10 min)
python train.py

# 4. Benchmark (~20 min)
python benchmark.py

# 5. Download results
tar -czf results.tar.gz emotion_head_best.pt features.pkl \
    confusion_matrix.png training_curve.png benchmark_results.json

Then download results.tar.gz from RunPod Files tab.

Model Architecture

Voxtral Encoder (frozen)
    ↓
Mean Pooling (1280 dims)
    ↓
EmotionHead MLP
    - Linear(1280, 512) + BatchNorm + ReLU + Dropout(0.3)
    - Linear(512, 256) + BatchNorm + ReLU + Dropout(0.3)
    - Linear(256, 6)
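
The diagram above corresponds to a straightforward PyTorch module; a sketch (the BatchNorm → ReLU → Dropout ordering follows the description, other details are assumptions about train.py):

```python
import torch
import torch.nn as nn

class EmotionHead(nn.Module):
    """MLP head from the diagram: 1280 -> 512 -> 256 -> 6."""

    def __init__(self, in_dim: int = 1280, num_classes: int = 6, p: float = 0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512), nn.BatchNorm1d(512), nn.ReLU(), nn.Dropout(p),
            nn.Linear(512, 256), nn.BatchNorm1d(256), nn.ReLU(), nn.Dropout(p),
            nn.Linear(256, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)  # (batch, 6) logits

head = EmotionHead()
logits = head(torch.randn(4, 1280))  # 4 mean-pooled clip features
print(logits.shape)  # torch.Size([4, 6])
```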