# Voxtral Emotion Speech

- Dataset: `MrlolDev/voxtral-emotion-speech`
- Model: `MrlolDev/voxtral-emotion-speech`

## Benchmark Results
| Model | UA% | WA% | F1% | WF1% | Data |
|---|---|---|---|---|---|
| Ours (Frozen + MLP) | 16.3 | 25.4 | 14.2 | 21.9 | 500 synthetic (ElevenLabs) |
| Ours (LoRA + MLP) | - | - | - | - | end-to-end |
| SenseVoice-S | 70.5 | 65.7 | 67.9 | 67.8 | zero-shot |
| emotion2vec+ large | ~80 | ~80 | - | - | IEMOCAP |
Tested on 477 of 1,004 IEMOCAP test samples (4 classes: neutral, happy, sad, angry). Our model was trained only on synthetic ElevenLabs clips, so the low score is expected. Baselines whose Data column reads "IEMOCAP" were fine-tuned on it.
## Benchmark Methodology

- Dataset: IEMOCAP test set from `AudioLLMs/iemocap_emotion_recognition`
- Features: 1280-dim vectors from the Voxtral encoder via `audio_tower()` + mean pooling
- Classifier: MLP (1280 → 512 → 256 → 6), two variants:
  - Frozen encoder + MLP (trained on 500 synthetic clips)
  - LoRA-finetuned encoder + MLP (end-to-end training)
- Metrics: UA (macro recall), WA (accuracy), F1 (macro), WF1 (weighted)
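These four metrics map directly onto standard scikit-learn calls; a minimal sketch on toy labels (not benchmark data), assuming the usual definitions listed above:

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Toy 4-class example, purely illustrative
y_true = [0, 0, 1, 1, 2, 3]
y_pred = [0, 1, 1, 1, 2, 0]

ua = recall_score(y_true, y_pred, average="macro")                    # UA: unweighted (macro) recall
wa = accuracy_score(y_true, y_pred)                                   # WA: plain accuracy
f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)       # F1: macro-averaged
wf1 = f1_score(y_true, y_pred, average="weighted", zero_division=0)   # WF1: support-weighted
```

UA and WA diverge exactly when the class distribution is skewed, which is why both are reported.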
## What We Did

- Loaded audio from the dataset
- Extracted 1280-dim features from Voxtral encoder hidden states using mean pooling
- Trained a classification head (MLP: 1280 → 512 → 256 → 6) with class weights to handle imbalance
- LoRA finetuning: added LoRA adapters to the last 6 encoder layers for end-to-end training
- Benchmarked against SenseVoice on IEMOCAP emotion recognition
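The pooling step above is a single reduction over the time axis. A minimal sketch; the commented model-loading lines use class and attribute names assumed from this README, not verified API:

```python
import torch

def mean_pool(hidden_states: torch.Tensor) -> torch.Tensor:
    """Collapse (batch, time, 1280) encoder states to one (batch, 1280) vector per clip."""
    return hidden_states.mean(dim=1)

# The extraction itself requires the Voxtral checkpoint (names are assumptions):
# model = VoxtralForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.float16)
# hidden = model.audio_tower(input_features).last_hidden_state  # (batch, time, 1280)
# feats = mean_pool(hidden).float().cpu().numpy()               # (batch, 1280)
```

Mean pooling makes the clip-level feature length-invariant, so the MLP head sees a fixed 1280-dim input regardless of audio duration.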
## LoRA Scripts

### 1. finetune_lora.py

End-to-end LoRA finetuning of the Voxtral encoder + emotion head:

```bash
python finetune_lora.py
```

Output: `emotion_head_lora_best.pt` plus the LoRA adapter in `lora_adapter/`
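LoRA leaves the pretrained weight frozen and trains a low-rank update ΔW = (α/r)·B·A alongside it. A minimal numeric sketch of that idea (illustrative sizes; not the actual `finetune_lora.py` implementation, which uses adapter layers inside the encoder):

```python
import torch

torch.manual_seed(0)
d, r, alpha = 8, 2, 4            # hidden size, rank, scaling factor (illustrative values)
W = torch.randn(d, d)            # frozen pretrained weight
A = torch.randn(r, d)            # LoRA "down" projection (trainable)
B = torch.zeros(d, r)            # LoRA "up" projection, zero-init so ΔW starts at 0

def lora_forward(x: torch.Tensor) -> torch.Tensor:
    # Frozen base path + scaled low-rank update; only A and B would get gradients.
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

x = torch.randn(1, d)
assert torch.allclose(lora_forward(x), x @ W.T)  # zero-init B => identical to base at start
```

Because B starts at zero, training begins from exactly the pretrained behavior, and only 2·d·r parameters per adapted matrix are trained instead of d².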
### 2. benchmark_lora.py

Compares the frozen and LoRA-finetuned encoders on IEMOCAP:

```bash
python benchmark_lora.py
```
## Emotions
- neutral
- happy
- sad
- angry
- fear
- surprise
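The scripts encode these six classes as integer labels 0-5. The index assignment below follows the list order above; the exact mapping used by the scripts is an assumption:

```python
# Hypothetical label mapping, ordered as in the README's emotion list
EMOTIONS = ["neutral", "happy", "sad", "angry", "fear", "surprise"]
LABEL2ID = {name: i for i, name in enumerate(EMOTIONS)}
ID2LABEL = {i: name for i, name in enumerate(EMOTIONS)}
```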
## Pipeline Scripts

### 1. setup.sh

Installs dependencies with uv and logs into Hugging Face:

```bash
bash setup.sh
```
### 2. extract_features.py

- Loads the dataset from Hugging Face
- Loads the Voxtral model (float16)
- Extracts 1280-dim features from encoder hidden states (mean pooling)
- Saves features to `features.pkl`
- Uploads `features.pkl` and `README.md` to the model repo

```bash
python extract_features.py
```
Output: `features.pkl`, a list of records with keys:

- `features`: numpy array of shape (1280,)
- `label`: int (0-5)
- `emotion`: string
- `split`: `"train"` / `"validation"` / `"test"`
- `sensevoice_score`: float
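A record with that layout can be consumed with plain `pickle`; a hypothetical loader sketch (the helper name and toy record are illustrations, not from the repo):

```python
import pickle
import numpy as np

def load_split(path, split):
    """Read features.pkl (layout as documented above) and return X, y for one split."""
    with open(path, "rb") as f:
        records = pickle.load(f)
    keep = [r for r in records if r["split"] == split]
    return [r["features"] for r in keep], [r["label"] for r in keep]

# A toy record matching the documented schema:
record = {"features": np.zeros(1280, dtype=np.float32), "label": 3,
          "emotion": "angry", "split": "train", "sensevoice_score": 0.91}
```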
### 3. train.py

- Loads features from `features.pkl`
- Splits 70/15/15 if the data carries no split field
- Trains the EmotionHead MLP:
  - 1280 → 512 → 256 → 6
  - BatchNorm + ReLU + Dropout(0.3)
  - Class weights for imbalance
- Trains for 150 epochs with AdamW + ReduceLROnPlateau
- Saves the best model by validation accuracy
- Uploads model weights and plots to the model repo

```bash
python train.py
```
Outputs:

- `emotion_head_best.pt` - best model weights
- `confusion_matrix.png` - test confusion matrix
- `training_curve.png` - loss curves
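The class weights mentioned above are typically inverse-frequency weights passed to the loss; a sketch under that assumption (the exact scheme in `train.py` is not stated):

```python
from collections import Counter

import torch

def class_weights(labels, num_classes=6):
    """Inverse-frequency weights, suitable for nn.CrossEntropyLoss(weight=...)."""
    counts = Counter(labels)
    total = len(labels)
    return torch.tensor(
        [total / (num_classes * counts.get(c, 1)) for c in range(num_classes)],
        dtype=torch.float32,
    )

# Rare classes get proportionally larger weight in the loss:
loss_fn = torch.nn.CrossEntropyLoss(weight=class_weights([0, 0, 0, 1, 2, 3, 4, 5]))
```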
### 4. benchmark.py

Benchmarks the trained model:

Bench 1: Emotion F1 vs SenseVoice
- Uses the RAVDESS test set
- Maps the 8 RAVDESS emotions to the 6 classes
- Compares against the SenseVoice baseline

Bench 2: Transcription WER
- Uses LibriSpeech test-clean (100 samples)
- Verifies that freezing the encoder doesn't affect the decoder

```bash
python benchmark.py
```

Output: `benchmark_results.json`
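WER in Bench 2 is word-level edit distance divided by reference length; a self-contained sketch of that metric (`benchmark.py` may well use a library such as jiwer instead):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

assert wer("the cat sat", "the cat sat") == 0.0
```

If the frozen-encoder claim holds, WER on LibriSpeech should match the unmodified model's.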
## Running on RunPod

### Pod Setup

- GPU: RTX 4090 (~$0.48/hr)
- Template: RunPod PyTorch 2.1
- Container Disk: 30 GB

### Execution Order

```bash
# 1. Setup
bash setup.sh

# 2. Extract features (~20 min)
python extract_features.py

# 3. Train (~10 min)
python train.py

# 4. Benchmark (~20 min)
python benchmark.py

# 5. Download results
tar -czf results.tar.gz emotion_head_best.pt features.pkl \
    confusion_matrix.png training_curve.png benchmark_results.json
```

Then download `results.tar.gz` from the RunPod Files tab.
## Model Architecture

```
Voxtral Encoder (frozen)
        ↓
Mean Pooling (1280 dims)
        ↓
EmotionHead MLP
```

- Linear(1280, 512) + BatchNorm + ReLU + Dropout(0.3)
- Linear(512, 256) + BatchNorm + ReLU + Dropout(0.3)
- Linear(256, 6)
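The head described above, sketched as a PyTorch module; the layer sizes, norms, activations, and dropout rate come from this README, everything else is an assumption:

```python
import torch
import torch.nn as nn

class EmotionHead(nn.Module):
    """MLP classifier on top of mean-pooled 1280-dim Voxtral encoder features."""

    def __init__(self, in_dim=1280, num_classes=6, p_drop=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512), nn.BatchNorm1d(512), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(512, 256), nn.BatchNorm1d(256), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(256, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

head = EmotionHead()
logits = head(torch.randn(4, 1280))  # one 6-way logit vector per clip
```

Note that BatchNorm requires batch size > 1 in training mode; call `head.eval()` for single-clip inference.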
## Model Tree

mistralai/Ministral-3-3B-Base-2512 → mistralai/Voxtral-Mini-4B-Realtime-2602 → MrlolDev/voxtral-emotion-speech