vocaliz-wav2vec2-speech-emotion-recognition-finetuned

This model is a fine-tuned checkpoint built on top of:

Base model: Wiam/wav2vec2-lg-xlsr-en-speech-emotion-recognition-finetuned-ravdess-v8 (Wav2Vec2 / XLSR-style encoder with an 8-class speech-emotion head, originally associated with RAVDESS-oriented training).

Training used the Hugging Face Trainer with AutoModelForAudioClassification, keeping the same label space and classification head structure as the base checkpoint.


Intended uses

  • Research and prototyping for speech emotion recognition (SER) on English, short utterance-level clips.
  • Offline or interactive demos (e.g. file-based or microphone pipelines) where approximate emotion labels are sufficient.
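Wav2Vec2-family feature extractors expect 16 kHz mono float input. The helper below is a minimal preprocessing sketch (naive linear-interpolation resampling, suitable for demos only; use torchaudio or librosa for production-quality resampling):

```python
import numpy as np

TARGET_SR = 16_000  # wav2vec2 feature extractors expect 16 kHz mono input

def to_model_input(audio: np.ndarray, sr: int) -> np.ndarray:
    """Mix down to mono and resample to TARGET_SR via linear interpolation.

    A rough sketch for demo pipelines; quality-sensitive applications
    should use a proper resampler (torchaudio, librosa, soxr).
    """
    if audio.ndim == 2:            # (channels, samples) -> mono
        audio = audio.mean(axis=0)
    if sr != TARGET_SR:
        duration = audio.shape[0] / sr
        n_out = int(round(duration * TARGET_SR))
        old_t = np.linspace(0.0, duration, num=audio.shape[0], endpoint=False)
        new_t = np.linspace(0.0, duration, num=n_out, endpoint=False)
        audio = np.interp(new_t, old_t, audio)
    return audio.astype(np.float32)
```

The resulting array can be passed directly to the feature extractor with `sampling_rate=16_000`.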

Training data

Source mix (by file count)

Combined emotions training pool (all classes, all files):

| Source  | Files  | Share (approx.) |
|---------|--------|-----------------|
| CREMA-D | 7,442  | 58.15%          |
| TESS    | 2,800  | 21.88%          |
| RAVDESS | 2,076  | 16.22%          |
| SAVEE   | 480    | 3.75%           |
| Total   | 12,798 | 100%            |

Training procedure (defaults from training script)

| Setting | Value |
|---------|-------|
| Train / validation split | Stratified ~85% / ~15% (`test_size=0.15`, seed 42) → ~10,878 train / ~1,920 validation samples (exact counts depend on per-class stratification) |
| Epochs | 5 |
| Optimizer | AdamW, learning rate 3e-5, warmup ratio 0.1, weight decay 0.01 |
| Batch size | 4 per device |
| Max audio length | 8 s after resampling (longer clips truncated) |
| Validation metrics | Accuracy, macro F1 |
| Model selection | Best checkpoint by validation macro F1 (`load_best_model_at_end=True`) |
| Precision | FP16 only when CUDA is available and `--fp16` is passed (default runs often execute on CPU) |
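The stratified split in the table can be sketched in pure Python. This is a sketch only; the actual training script likely uses `sklearn.model_selection.train_test_split(..., stratify=labels)`, whose shuffling and rounding details may differ:

```python
import random
from collections import defaultdict

def stratified_split(paths, labels, test_size=0.15, seed=42):
    """Per-class shuffle-and-slice split: each class contributes
    ~test_size of its files to the validation set."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for path, label in zip(paths, labels):
        by_label[label].append(path)
    train, val = [], []
    for label, items in by_label.items():
        rng.shuffle(items)
        n_val = max(1, round(len(items) * test_size))
        val += [(p, label) for p in items[:n_val]]
        train += [(p, label) for p in items[n_val:]]
    return train, val
```

Because the slice is taken per class, the validation set preserves the class balance of the full pool, which keeps macro F1 on validation comparable across checkpoints.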

Evaluation results (external benchmark)

Metrics are reported over six displayed classes: neutral, happy, angry, fearful, disgust, surprised.

| Model | Accuracy | Macro F1 | Weighted F1 | Micro F1 |
|-------|----------|----------|-------------|----------|
| Base: Wiam/wav2vec2-lg-xlsr-en-speech-emotion-recognition-finetuned-ravdess-v8 | 0.9417 | 0.9367 | 0.9414 | 0.9417 |
| This fine-tune (local export `wav2vec2-emotions-finetuned`) | 0.9868 | 0.9850 | 0.9868 | 0.9868 |
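For reference on the columns: macro F1 weights every class equally, weighted F1 weights each class by its support, and micro F1 aggregates counts globally (for single-label classification it equals accuracy, which is why the Accuracy and Micro F1 columns match). A minimal pure-Python sketch of the three averages:

```python
def f1_scores(y_true, y_pred, labels):
    """Return (macro, weighted, micro) F1 for single-label predictions."""
    per_class, support = {}, {}
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        per_class[c] = 2 * tp / denom if denom else 0.0
        support[c] = sum(t == c for t in y_true)
    n = len(y_true)
    macro = sum(per_class.values()) / len(labels)
    weighted = sum(per_class[c] * support[c] / n for c in labels)
    micro = sum(t == p for t, p in zip(y_true, y_pred)) / n  # == accuracy here
    return macro, weighted, micro
```

Macro F1 is the headline metric for model selection here because it penalizes weak performance on under-represented classes (e.g. SAVEE-only conditions) that accuracy alone can hide.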

How to load

```python
from transformers import AutoModelForAudioClassification, AutoFeatureExtractor

MODEL_ID = "amykon/vocaliz-wav2vec2-speech-emotion-recognition-finetuned"

processor = AutoFeatureExtractor.from_pretrained(MODEL_ID)
model = AutoModelForAudioClassification.from_pretrained(MODEL_ID)
```

Public repo: no token is needed for `from_pretrained`. Private repo: set `HF_TOKEN` or run `huggingface-cli login`.
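After a forward pass, the raw logits map to an emotion label via softmax and argmax through the model's `id2label` mapping. The mapping below is hypothetical and for illustration only; the authoritative label order lives in this repo's `config.json` (read it via `model.config.id2label`):

```python
import math

# Hypothetical 8-class mapping for illustration only; the real ordering
# is defined by id2label in this repository's config.json.
ID2LABEL = {0: "angry", 1: "calm", 2: "disgust", 3: "fearful",
            4: "happy", 5: "neutral", 6: "sad", 7: "surprised"}

def decode(logits):
    """Softmax the raw logits and return (label, probability)."""
    m = max(logits)                              # shift for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]
```

In a real pipeline, `logits` would come from `model(**processor(audio, sampling_rate=16_000, return_tensors="pt")).logits[0].tolist()`.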


Install dependencies

```shell
pip install -r https://huggingface.co/amykon/vocaliz-wav2vec2-speech-emotion-recognition-finetuned/resolve/main/requirements.txt
```

requirements.txt lists torch, transformers, safetensors, huggingface_hub, numpy, and sounddevice (the last is optional, needed only for microphone-based scripts).


Files in this repository

  • README.md – this model card
  • requirements.txt – pip dependencies for inference
  • config.json – model config and label mapping
  • preprocessor_config.json – feature extractor settings (sampling rate, etc.)
  • model.safetensors – model weights
Model size: ~0.3B parameters, stored as F32 tensors in model.safetensors.