---
library_name: transformers
license: bsd-3-clause
base_model: Simon-Kotchou/ssast-tiny-patch-audioset-16-16
tags:
- audio-classification
- speech-emotion-recognition
- yoruba
- code-switching
- multilingual
- ssast
- nigeria
datasets:
- Professor/YORENG100
metrics:
- accuracy
model-index:
- name: YORENG100-SSAST-Emotion-Recognition
  results:
  - task:
      type: audio-classification
      name: Speech Emotion Recognition
    dataset:
      name: YORENG100 (Yoruba-English Code-Switching)
      type: custom
    metrics:
    - type: accuracy
      value: 0.9645
      name: Test Accuracy
pipeline_tag: audio-classification
---

# YORENG100-SSAST-Emotion-Recognition

This model is a specialized Speech Emotion Recognition (SER) system fine-tuned from the **SSAST (Self-Supervised Audio Spectrogram Transformer)**. It is specifically designed to handle the linguistic complexities of **Yoruba-English code-switching**, common in Nigerian urban centers.

## 🌟 Model Highlights
- **Architecture:** SSAST-Tiny (16x16 Patches)
- **Dataset:** YORENG100 (100 hours of curated Yoruba-English bilingual speech)
- **Top Accuracy:** 96.45% on the validation set.
- **Innovation:** Captures emotional prosody across language boundaries where traditional monolingual models often fail.



## πŸš€ How to Use (Manual Inference)

Because SSAST requires specific tensor-dimension handling for its patch-based architecture, we recommend the manual inference approach below rather than the standard `pipeline`.

```python
import torch
import librosa
from transformers import AutoFeatureExtractor, ASTForAudioClassification

# 1. Load Model & Processor
model_id = "Professor/YORENG100-SSAST-Emotion-Recognition"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
model = ASTForAudioClassification.from_pretrained(model_id).to(device)
model.eval()

# 2. Prepare Audio (Resample to 16kHz)
audio_path = "path_to_your_audio.wav"
speech, _ = librosa.load(audio_path, sr=16000)

# 3. Preprocess & Predict
inputs = feature_extractor(speech, sampling_rate=16000, return_tensors="pt")
# SSAST Fix: Ensure [Batch, Time, Freq] shape
input_values = inputs.input_values.squeeze().unsqueeze(0).to(device)

with torch.no_grad():
    logits = model(input_values).logits
    prediction = torch.argmax(logits, dim=-1).item()

print(f"Predicted Emotion: {model.config.id2label[prediction]}")
```
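
To report calibrated scores rather than a single top label, the logits can be passed through a softmax. The helper below is a hypothetical convenience function (not part of this repository), and the label names in the usage comment are illustrative:

```python
import torch

def rank_emotions(logits: torch.Tensor, id2label: dict) -> list:
    """Return (label, probability) pairs sorted from most to least likely."""
    probs = torch.softmax(logits.squeeze(), dim=-1)
    order = torch.argsort(probs, descending=True)
    return [(id2label[i.item()], probs[i].item()) for i in order]

# Usage with `logits` and `model` from the snippet above:
# for label, p in rank_emotions(logits.cpu(), model.config.id2label):
#     print(f"{label}: {p:.3f}")
```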

## πŸ“Š Training Results

The model was trained for 3 epochs on a TPU v3-8. Note that accuracy remained stable even as the loss continued to decrease.

| Epoch | Step | Training Loss | Validation Loss | Accuracy |
| --- | --- | --- | --- | --- |
| 1.0 | 2501 | 0.2156 | 0.2080 | 0.9645 |
| 2.0 | 5002 | 0.1943 | 0.2044 | 0.9645 |
| 3.0 | 7503 | 0.1842 | 0.2023 | 0.9645 |
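
A quick sanity check on the table: the step counts grow by a constant 2,501 per epoch, i.e. one pass over the training set corresponds to 2,501 optimizer steps:

```python
# Step counts from the table above
steps_at_epoch = {1: 2501, 2: 5002, 3: 7503}

steps_per_epoch = steps_at_epoch[1]
assert all(step == epoch * steps_per_epoch for epoch, step in steps_at_epoch.items())
print(steps_per_epoch)  # 2501
```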

## 🌍 Background: Yoruba-English Emotion Recognition

Emotion recognition in tonal languages like **Yoruba** is uniquely challenging because pitch (F0) is used to distinguish word meanings (lexical tone) as well as emotional state. When speakers code-switch into English, the "emotional acoustic profile" shifts mid-sentence.

This model uses the patch-based attention mechanism of SSAST to focus on local time-frequency features that are invariant to these language shifts.
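
As a rough illustration of that mechanism, the sketch below splits a spectrogram into non-overlapping 16Γ—16 time-frequency patches. The input size (1,024 frames Γ— 128 mel bins) is an assumption based on common AST defaults; the real feature extractor handles patching internally:

```python
import torch

# Hypothetical log-mel spectrogram: (batch, time_frames, mel_bins)
spec = torch.randn(1, 1024, 128)

# Cut into non-overlapping 16x16 time-frequency patches
patches = spec.unfold(1, 16, 16).unfold(2, 16, 16)  # (1, 64, 8, 16, 16)
tokens = patches.reshape(1, -1, 16 * 16)            # (1, 512, 256): 512 patch tokens

print(tokens.shape)  # torch.Size([1, 512, 256])
```

Each 256-dimensional patch token covers a small local region of the spectrogram, which is what lets attention operate on time-frequency detail rather than whole-utterance statistics.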

## πŸ›  Limitations

* **Background Noise:** Not yet robust to high-noise environments (e.g., market or traffic noise).
* **Sampling Rate:** Only supports 16kHz audio input.
* **Labels:** Performance is optimized for [Angry, Happy, Sad, Neutral, etc.].

## πŸ“œ Citation

If you use this model or the YORENG100 dataset, please cite us.