---
library_name: transformers
license: bsd-3-clause
base_model: Simon-Kotchou/ssast-tiny-patch-audioset-16-16
tags:
- audio-classification
- speech-emotion-recognition
- yoruba
- code-switching
- multilingual
- ssast
- nigeria
datasets:
- Professor/YORENG100
metrics:
- accuracy
model-index:
- name: YORENG100-SSAST-Emotion-Recognition
  results:
  - task:
      type: audio-classification
      name: Speech Emotion Recognition
    dataset:
      name: YORENG100 (Yoruba-English Code-Switching)
      type: custom
    metrics:
    - type: accuracy
      value: 0.9645
      name: Test Accuracy
pipeline_tag: audio-classification
---

# YORENG100-SSAST-Emotion-Recognition

This model is a specialized Speech Emotion Recognition (SER) system fine-tuned from the **SSAST (Self-Supervised Audio Spectrogram Transformer)**. It is specifically designed to handle the linguistic complexities of **Yoruba-English code-switching**, common in Nigerian urban centers.

## 🌟 Model Highlights

- **Architecture:** SSAST-Tiny (16×16 patches)
- **Dataset:** YORENG100 (100 hours of curated Yoruba-English bilingual speech)
- **Top Accuracy:** 96.45% on the validation set
- **Innovation:** Captures emotional prosody across language boundaries where traditional monolingual models often fail

## 🚀 How to Use (Manual Inference)

Since SSAST requires specific tensor dimension handling for its patch-based architecture, we recommend the following manual inference style over the standard `pipeline`.

```python
import torch
import librosa
from transformers import AutoFeatureExtractor, ASTForAudioClassification

# 1. Load the model and feature extractor
model_id = "Professor/YORENG100-SSAST-Emotion-Recognition"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
model = ASTForAudioClassification.from_pretrained(model_id).to(device)
model.eval()

# 2. Prepare audio (resample to 16 kHz)
audio_path = "path_to_your_audio.wav"
speech, _ = librosa.load(audio_path, sr=16000)

# 3. Preprocess and predict
inputs = feature_extractor(speech, sampling_rate=16000, return_tensors="pt")

# SSAST fix: ensure the input has shape [batch, time, frequency]
input_values = inputs.input_values.squeeze().unsqueeze(0).to(device)

with torch.no_grad():
    logits = model(input_values).logits

prediction = torch.argmax(logits, dim=-1).item()
print(f"Predicted Emotion: {model.config.id2label[prediction]}")
```

## 📊 Training Results

The model was trained for 3 epochs on a TPU v3-8. Note the high stability in accuracy even as the loss continued to decrease.

| Epoch | Step | Training Loss | Validation Loss | Accuracy |
| --- | --- | --- | --- | --- |
| 1.0 | 2501 | 0.2156 | 0.2080 | 0.9645 |
| 2.0 | 5002 | 0.1943 | 0.2044 | 0.9645 |
| 3.0 | 7503 | 0.1842 | 0.2023 | 0.9645 |

## 🌍 Background: Yoruba-English Emotion Recognition

Emotion recognition in tonal languages like **Yoruba** is uniquely challenging because pitch (F0) is used to distinguish word meanings (lexical tone) as well as emotional state. When speakers code-switch into English, the "emotional acoustic profile" shifts mid-sentence. This model uses the patch-based attention mechanism of SSAST to focus on local time-frequency features that are invariant to these language shifts.

## 🛠 Limitations

* **Background Noise:** Not yet robust to high-noise environments (e.g., market or traffic noise).
* **Sampling Rate:** Only supports 16 kHz audio input.
* **Labels:** Performance is optimized for [Angry, Happy, Sad, Neutral, etc.].

## 📜 Citation

If you use this model or the YORENG100 dataset, please cite us.
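## 🔎 Tip: Ranking Emotions by Confidence

The inference snippet above returns only the top prediction. If you also want a confidence score per emotion, you can apply a softmax to the logits and sort the classes. The sketch below uses plain Python on a toy logits list so it runs standalone; in practice you would pass `logits.squeeze().tolist()` from the model output and `model.config.id2label`. The label set shown here is an illustrative assumption, not the model's actual config.

```python
import math

def rank_emotions(logits, id2label):
    """Convert raw logits to softmax probabilities, sorted best-first."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return sorted(
        ((id2label[i], e / total) for i, e in enumerate(exps)),
        key=lambda pair: pair[1],
        reverse=True,
    )

# Toy example (these labels are assumptions for illustration only)
id2label = {0: "angry", 1: "happy", 2: "sad", 3: "neutral"}
logits = [0.2, 3.1, -1.0, 0.5]

for label, prob in rank_emotions(logits, id2label):
    print(f"{label}: {prob:.3f}")
```

With real model output, replace `logits` with `model(input_values).logits.squeeze().tolist()` and `id2label` with `model.config.id2label`.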