---
library_name: transformers
license: bsd-3-clause
base_model: Simon-Kotchou/ssast-tiny-patch-audioset-16-16
tags:
- audio-classification
- speech-emotion-recognition
- yoruba
- code-switching
- multilingual
- ssast
- nigeria
datasets:
- Professor/YORENG100
metrics:
- accuracy
model-index:
- name: YORENG100-SSAST-Emotion-Recognition
  results:
  - task:
      type: audio-classification
      name: Speech Emotion Recognition
    dataset:
      name: YORENG100 (Yoruba-English Code-Switching)
      type: custom
    metrics:
    - type: accuracy
      value: 0.9645
      name: Test Accuracy
pipeline_tag: audio-classification
---

# YORENG100-SSAST-Emotion-Recognition

This model is a specialized Speech Emotion Recognition (SER) system fine-tuned from the **SSAST (Self-Supervised Audio Spectrogram Transformer)**. It is specifically designed to handle the linguistic complexities of **Yoruba-English code-switching**, common in Nigerian urban centers.

## Model Highlights

- **Architecture:** SSAST-Tiny (16x16 patches)
- **Dataset:** YORENG100 (100 hours of curated Yoruba-English bilingual speech)
- **Top Accuracy:** 96.45% on the validation set
- **Innovation:** Captures emotional prosody across language boundaries where traditional monolingual models often fail

## How to Use (Manual Inference)

Since SSAST requires specific tensor-dimension handling for its patch-based architecture, we recommend the following manual inference flow over the standard `pipeline`.

```python
import torch
import librosa
from transformers import AutoFeatureExtractor, ASTForAudioClassification

# 1. Load model & processor
model_id = "Professor/YORENG100-SSAST-Emotion-Recognition"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
model = ASTForAudioClassification.from_pretrained(model_id).to(device)
model.eval()

# 2. Prepare audio (resample to 16 kHz)
audio_path = "path_to_your_audio.wav"
speech, _ = librosa.load(audio_path, sr=16000)

# 3. Preprocess & predict
inputs = feature_extractor(speech, sampling_rate=16000, return_tensors="pt")
# SSAST fix: ensure [batch, time, freq] shape
input_values = inputs.input_values.squeeze().unsqueeze(0).to(device)

with torch.no_grad():
    logits = model(input_values).logits
    prediction = torch.argmax(logits, dim=-1).item()

print(f"Predicted Emotion: {model.config.id2label[prediction]}")
```
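Beyond the arg-max label, the full emotion distribution can be inspected by applying a softmax to the logits. A minimal sketch, assuming a hypothetical 4-class head and label map (the model's real mapping lives in `model.config.id2label`):

```python
import torch

# Hypothetical logits standing in for model(input_values).logits
logits = torch.tensor([[2.1, -0.3, 0.4, -1.2]])

# Convert raw scores into per-class probabilities
probs = torch.softmax(logits, dim=-1).squeeze()

# Hypothetical label map mirroring model.config.id2label
id2label = {0: "Angry", 1: "Happy", 2: "Sad", 3: "Neutral"}

# Rank emotions by confidence, most likely first
ranked = sorted(
    ((id2label[i], p.item()) for i, p in enumerate(probs)),
    key=lambda pair: pair[1],
    reverse=True,
)
for label, score in ranked:
    print(f"{label}: {score:.3f}")
```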

## Training Results

The model was trained for 3 epochs on a TPU v3-8 (learning rate 3e-05, train/eval batch size 32, AdamW optimizer, linear schedule with 2,500 warmup steps, seed 42). Note the high stability in accuracy even as the validation loss continued to decrease.

| Epoch | Step | Training Loss | Validation Loss | Accuracy |
|:-----:|:----:|:-------------:|:---------------:|:--------:|
| 1.0   | 2501 | 0.2156        | 0.2080          | 0.9645   |
| 2.0   | 5002 | 0.1943        | 0.2044          | 0.9645   |
| 3.0   | 7503 | 0.1842        | 0.2023          | 0.9645   |
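For reference, the reported hyperparameters map onto `transformers.TrainingArguments` roughly as follows. This is a sketch of the configuration only: the `output_dir` is a placeholder, and dataset loading and `Trainer` wiring are omitted.

```python
from transformers import TrainingArguments

# Sketch of the reported fine-tuning configuration
training_args = TrainingArguments(
    output_dir="yoreng100-ssast-emotion",  # placeholder path
    learning_rate=3e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    seed=42,
    optim="adamw_torch",
    lr_scheduler_type="linear",
    warmup_steps=2500,
    num_train_epochs=3,
)
```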

## Background: Yoruba-English Emotion Recognition

Emotion recognition in tonal languages like **Yoruba** is uniquely challenging because pitch (F0) is used to distinguish word meanings (lexical tone) as well as emotional state. When speakers code-switch into English, the "emotional acoustic profile" shifts mid-sentence.

This model uses the patch-based attention mechanism of SSAST to focus on local time-frequency features that are invariant to these language shifts.

## Limitations

* **Background Noise:** Not yet robust to high-noise environments (e.g., market or traffic noise).
* **Sampling Rate:** Only 16 kHz audio input is supported.
* **Labels:** Performance is optimized for [Angry, Happy, Sad, Neutral, etc.].

## Citation

If you use this model or the YORENG100 dataset, please cite us.