---
library_name: transformers
license: bsd-3-clause
base_model: Simon-Kotchou/ssast-tiny-patch-audioset-16-16
tags:
- audio-classification
- speech-emotion-recognition
- yoruba
- code-switching
- multilingual
- ssast
- nigeria
datasets:
- Professor/YORENG100
metrics:
- accuracy
model-index:
- name: YORENG100-SSAST-Emotion-Recognition
results:
- task:
type: audio-classification
name: Speech Emotion Recognition
dataset:
name: YORENG100 (Yoruba-English Code-Switching)
type: custom
metrics:
- type: accuracy
value: 0.9645
name: Test Accuracy
pipeline_tag: audio-classification
---
# YORENG100-SSAST-Emotion-Recognition
This model is a specialized Speech Emotion Recognition (SER) system fine-tuned from the **SSAST (Self-Supervised Audio Spectrogram Transformer)**. It is specifically designed to handle the linguistic complexities of **Yoruba-English code-switching**, common in Nigerian urban centers.
## 🌟 Model Highlights
- **Architecture:** SSAST-Tiny (16x16 Patches)
- **Dataset:** YORENG100 (100 hours of curated Yoruba-English bilingual speech)
- **Top Accuracy:** 96.45% on the validation set.
- **Innovation:** Captures emotional prosody across language boundaries where traditional monolingual models often fail.
## πŸš€ How to Use (Manual Inference)
Since SSAST requires specific tensor dimension handling for its patch-based architecture, we recommend the following manual inference style over the standard `pipeline`.
```python
import torch
import librosa
from transformers import AutoFeatureExtractor, ASTForAudioClassification

# 1. Load Model & Processor
model_id = "Professor/YORENG100-SSAST-Emotion-Recognition"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
model = ASTForAudioClassification.from_pretrained(model_id).to(device)
model.eval()

# 2. Prepare Audio (resample to 16 kHz on load)
audio_path = "path_to_your_audio.wav"
speech, _ = librosa.load(audio_path, sr=16000)

# 3. Preprocess & Predict
inputs = feature_extractor(speech, sampling_rate=16000, return_tensors="pt")

# SSAST fix: ensure a [Batch, Time, Freq] shape
input_values = inputs.input_values.squeeze().unsqueeze(0).to(device)

with torch.no_grad():
    logits = model(input_values).logits

prediction = torch.argmax(logits, dim=-1).item()
print(f"Predicted Emotion: {model.config.id2label[prediction]}")
```
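Beyond the top-1 label, downstream applications often want a confidence score per emotion. A softmax over the logits yields a probability distribution across classes. The sketch below uses dummy logits standing in for `model(input_values).logits` and assumes a hypothetical 4-class head; in practice you would replace the class index with `model.config.id2label[idx]`:

```python
import torch

# Dummy logits standing in for `model(input_values).logits`
# (shape [batch, num_labels]; a hypothetical 4-class head is assumed here)
logits = torch.tensor([[2.1, -0.3, 0.4, -1.2]])

# Softmax turns raw logits into probabilities that sum to 1
probs = torch.softmax(logits, dim=-1).squeeze()

# Rank classes from most to least probable
ranked = sorted(enumerate(probs.tolist()), key=lambda kv: -kv[1])
for idx, p in ranked:
    print(f"class {idx}: {p:.3f}")
```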
## πŸ“Š Training Results
The model was trained for 3 epochs on a TPU-v3-8. Note the high stability in accuracy even as loss continued to decrease.
| Epoch | Step | Training Loss | Validation Loss | Accuracy |
| --- | --- | --- | --- | --- |
| 1.0 | 2501 | 0.2156 | 0.2080 | 0.9645 |
| 2.0 | 5002 | 0.1943 | 0.2044 | 0.9645 |
| 3.0 | 7503 | 0.1842 | 0.2023 | 0.9645 |
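The full training hyperparameters are not published here; the fragment below is a hypothetical `TrainingArguments` sketch consistent only with what the table reports (3 epochs, per-epoch evaluation, accuracy as the tracked metric). Batch size, learning rate, and output directory are illustrative assumptions, not the values actually used.

```python
from transformers import TrainingArguments

# Hypothetical configuration sketch: only num_train_epochs=3 and per-epoch
# evaluation follow from the table above; every other value is an assumption.
training_args = TrainingArguments(
    output_dir="./yoreng100-ssast",      # assumption
    num_train_epochs=3,                  # matches the 3 epochs reported
    per_device_train_batch_size=16,      # assumption
    learning_rate=3e-5,                  # assumption
    eval_strategy="epoch",               # evaluate once per epoch, as in the table
    metric_for_best_model="accuracy",
    load_best_model_at_end=True,
)
```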
## 🌍 Background: Yoruba-English Emotion Recognition
Emotion recognition in tonal languages like **Yoruba** is uniquely challenging because pitch (F0) is used to distinguish word meanings (lexical tone) as well as emotional state. When speakers code-switch into English, the "emotional acoustic profile" shifts mid-sentence.
This model uses the patch-based attention mechanism of SSAST to focus on local time-frequency features that are invariant to these language shifts.
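To make the patch mechanism concrete: the feature extractor produces a log-mel spectrogram (time frames × mel bins), which the 16×16 variant tiles into patches before attention. The arithmetic below is a sketch assuming the common AST defaults of 1024 frames and 128 mel bins with non-overlapping 16×16 patches; the actual input length for this model is set by its feature-extractor config:

```python
# Assumed spectrogram geometry (common AST defaults, not read from this model's config)
time_frames = 1024   # ~10.24 s of audio at a 10 ms hop
mel_bins = 128
patch_size = 16      # the "16-16" in the base model name: 16x16 patches

# Non-overlapping tiling: each axis is divided by the patch edge
patches_per_time = time_frames // patch_size   # 64
patches_per_freq = mel_bins // patch_size      # 8
num_patches = patches_per_time * patches_per_freq

print(num_patches)  # 512 patch tokens enter the transformer
```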
## πŸ›  Limitations
* **Background Noise:** Not yet robust to high-noise environments (e.g., market or traffic noise).
* **Sampling Rate:** Only supports 16kHz audio input.
* **Labels:** Output is restricted to the fixed emotion label set the model was trained on (Angry, Happy, Sad, Neutral, etc.); emotions outside this set cannot be predicted.
## πŸ“œ Citation
If you use this model or the YORENG100 dataset, please cite us.