---
library_name: transformers
license: bsd-3-clause
base_model: Simon-Kotchou/ssast-tiny-patch-audioset-16-16
tags:
- audio-classification
- speech-emotion-recognition
- yoruba
- code-switching
- multilingual
- ssast
- nigeria
datasets:
- Professor/YORENG100
metrics:
- accuracy
model-index:
- name: YORENG100-SSAST-Emotion-Recognition
  results:
  - task:
      type: audio-classification
      name: Speech Emotion Recognition
    dataset:
      name: YORENG100 (Yoruba-English Code-Switching)
      type: custom
    metrics:
    - type: accuracy
      value: 0.9645
      name: Test Accuracy
pipeline_tag: audio-classification
---

# YORENG100-SSAST-Emotion-Recognition

This model is a specialized Speech Emotion Recognition (SER) system fine-tuned from the **SSAST (Self-Supervised Audio Spectrogram Transformer)**. It is specifically designed to handle the linguistic complexities of **Yoruba-English code-switching**, common in Nigerian urban centers.

## Model Highlights

- **Architecture:** SSAST-Tiny (16x16 patches)
- **Dataset:** YORENG100 (100 hours of curated Yoruba-English bilingual speech)
- **Top Accuracy:** 96.45% on the validation set
- **Innovation:** Captures emotional prosody across language boundaries where traditional monolingual models often fail

## How to Use (Manual Inference)

Since SSAST requires specific tensor-dimension handling for its patch-based architecture, we recommend the following manual inference approach over the standard `pipeline`:

```python
import torch
import librosa
from transformers import AutoFeatureExtractor, ASTForAudioClassification

# 1. Load model & feature extractor
model_id = "Professor/YORENG100-SSAST-Emotion-Recognition"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
model = ASTForAudioClassification.from_pretrained(model_id).to(device)
model.eval()

# 2. Prepare audio (resample to 16 kHz, the rate the model expects)
audio_path = "path_to_your_audio.wav"
speech, _ = librosa.load(audio_path, sr=16000)

# 3. Preprocess & predict
inputs = feature_extractor(speech, sampling_rate=16000, return_tensors="pt")
# SSAST fix: ensure the input is shaped [batch, time, freq]
input_values = inputs.input_values.squeeze().unsqueeze(0).to(device)

with torch.no_grad():
    logits = model(input_values).logits
prediction = torch.argmax(logits, dim=-1).item()

print(f"Predicted Emotion: {model.config.id2label[prediction]}")
```
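
Beyond the top-1 label, you may want the full probability distribution over emotions, e.g. to set a confidence threshold. A minimal sketch: apply softmax to the logits and map indices through `id2label`. The helper and the 4-class dummy logits below are illustrative, not part of the model's API.

```python
import torch
import torch.nn.functional as F

def logits_to_probs(logits: torch.Tensor, id2label: dict) -> dict:
    """Map raw classifier logits [1, num_labels] to a {label: probability} dict."""
    probs = F.softmax(logits, dim=-1).squeeze(0)
    return {id2label[i]: probs[i].item() for i in range(probs.numel())}

# Illustration with dummy logits for a hypothetical 4-class head;
# in practice, pass the `logits` returned by the model above.
dummy_logits = torch.tensor([[2.0, 0.5, -1.0, 0.1]])
id2label = {0: "angry", 1: "happy", 2: "sad", 3: "neutral"}
probs = logits_to_probs(dummy_logits, id2label)
print(max(probs, key=probs.get))  # highest-probability label
```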

## Training Results

The model was trained for 3 epochs on a TPU v3-8. Note that validation accuracy remained stable across epochs even as the loss continued to decrease.

| Epoch | Step | Training Loss | Validation Loss | Accuracy |
| ----- | ---- | ------------- | --------------- | -------- |
| 1.0   | 2501 | 0.2156        | 0.2080          | 0.9645   |
| 2.0   | 5002 | 0.1943        | 0.2044          | 0.9645   |
| 3.0   | 7503 | 0.1842        | 0.2023          | 0.9645   |

## Background: Yoruba-English Emotion Recognition

Emotion recognition in tonal languages like **Yoruba** is uniquely challenging because pitch (F0) is used to distinguish word meanings (lexical tone) as well as emotional state. When speakers code-switch into English, the "emotional acoustic profile" shifts mid-sentence.

This model uses the patch-based attention mechanism of SSAST to focus on local time-frequency features that are invariant to these language shifts.
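
To make the patch mechanism concrete, here is a rough numpy sketch of how a (time, freq) log-mel spectrogram is tiled into the 16x16 patches that SSAST attends over. The dimensions are made up for illustration, and the real model additionally applies a learned patch embedding and positional encoding before attention.

```python
import numpy as np

def tile_into_patches(spec: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split a (time, freq) spectrogram into non-overlapping patch x patch tiles."""
    t, f = spec.shape
    t_trim, f_trim = t - t % patch, f - f % patch  # drop ragged edges
    spec = spec[:t_trim, :f_trim]
    tiles = spec.reshape(t_trim // patch, patch, f_trim // patch, patch)
    return tiles.transpose(0, 2, 1, 3).reshape(-1, patch, patch)

# Example: a 100-frame, 128-bin log-mel spectrogram -> 6 x 8 = 48 patches
spec = np.random.randn(100, 128)
patches = tile_into_patches(spec)
print(patches.shape)  # (48, 16, 16)
```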

## Limitations

* **Background Noise:** Not yet robust to high-noise environments (e.g., market or traffic noise).
* **Sampling Rate:** Only supports 16 kHz audio input.
* **Labels:** Performance is optimized for the emotion classes it was trained on [Angry, Happy, Sad, Neutral, etc.].
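
Because only 16 kHz input is supported, audio recorded at other rates must be resampled before inference. `librosa.load(..., sr=16000)`, as in the usage example above, does this automatically; as an alternative, here is a minimal sketch using `scipy` (the helper name is ours):

```python
import numpy as np
from scipy.signal import resample_poly

def to_16k(audio: np.ndarray, orig_sr: int) -> np.ndarray:
    """Resample a mono waveform to 16 kHz via polyphase filtering."""
    g = np.gcd(orig_sr, 16000)
    return resample_poly(audio, 16000 // g, orig_sr // g)

# Example: one second of 44.1 kHz audio -> 16000 samples
one_sec = np.random.randn(44100)
print(to_16k(one_sec, 44100).shape)  # (16000,)
```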

## Citation

If you use this model or the YORENG100 dataset, please cite us.