---
library_name: transformers
license: bsd-3-clause
base_model: Simon-Kotchou/ssast-tiny-patch-audioset-16-16
tags:
- audio-classification
- speech-emotion-recognition
- yoruba
- code-switching
- multilingual
- ssast
- nigeria
datasets:
- Professor/YORENG100
metrics:
- accuracy
model-index:
- name: YORENG100-SSAST-Emotion-Recognition
  results:
  - task:
      type: audio-classification
      name: Speech Emotion Recognition
    dataset:
      name: YORENG100 (Yoruba-English Code-Switching)
      type: custom
    metrics:
    - type: accuracy
      value: 0.9645
      name: Test Accuracy
pipeline_tag: audio-classification
---

# YORENG100-SSAST-Emotion-Recognition

This model is a specialized Speech Emotion Recognition (SER) system fine-tuned from the **SSAST (Self-Supervised Audio Spectrogram Transformer)**. It is specifically designed to handle the linguistic complexities of **Yoruba-English code-switching**, common in Nigerian urban centers.

## 🌟 Model Highlights

- **Architecture:** SSAST-Tiny (16×16 patches)
- **Dataset:** YORENG100 (100 hours of curated Yoruba-English bilingual speech)
- **Top Accuracy:** 96.45% on the validation set
- **Innovation:** Captures emotional prosody across language boundaries where traditional monolingual models often fail

## 🚀 How to Use (Manual Inference)

Since SSAST requires specific tensor dimension handling for its patch-based architecture, we recommend the following manual inference style over the standard `pipeline`.

```python
import torch
import librosa
from transformers import AutoFeatureExtractor, ASTForAudioClassification

# 1. Load the model and feature extractor
model_id = "Professor/YORENG100-SSAST-Emotion-Recognition"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
model = ASTForAudioClassification.from_pretrained(model_id).to(device)
model.eval()

# 2. Prepare audio (resample to 16 kHz)
audio_path = "path_to_your_audio.wav"
speech, _ = librosa.load(audio_path, sr=16000)

# 3. Preprocess and predict
inputs = feature_extractor(speech, sampling_rate=16000, return_tensors="pt")

# SSAST fix: ensure the input has shape [batch, time, frequency]
input_values = inputs.input_values.squeeze().unsqueeze(0).to(device)

with torch.no_grad():
    logits = model(input_values).logits

prediction = torch.argmax(logits, dim=-1).item()
print(f"Predicted Emotion: {model.config.id2label[prediction]}")
```

## 📊 Training Results

The model was trained for 3 epochs on a TPU v3-8. Note the high stability in accuracy even as the loss continued to decrease.

| Epoch | Step | Training Loss | Validation Loss | Accuracy |
| --- | --- | --- | --- | --- |
| 1.0 | 2501 | 0.2156 | 0.2080 | 0.9645 |
| 2.0 | 5002 | 0.1943 | 0.2044 | 0.9645 |
| 3.0 | 7503 | 0.1842 | 0.2023 | 0.9645 |

## 🌍 Background: Yoruba-English Emotion Recognition

Emotion recognition in tonal languages like **Yoruba** is uniquely challenging because pitch (F0) is used to distinguish word meanings (lexical tone) as well as emotional state. When speakers code-switch into English, the "emotional acoustic profile" shifts mid-sentence. This model uses the patch-based attention mechanism of SSAST to focus on local time-frequency features that are invariant to these language shifts.

## 🛠 Limitations

* **Background Noise:** Not yet robust to high-noise environments (e.g., market or traffic noise).
* **Sampling Rate:** Only supports 16 kHz audio input.
* **Labels:** Performance is optimized for [Angry, Happy, Sad, Neutral, etc.].

## 📜 Citation

If you use this model or the YORENG100 dataset, please cite us.
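## 🔎 Tip: Ranking Emotions by Confidence

The inference snippet above returns only the top prediction. If you also want a confidence score per emotion, you can apply a softmax to the logits and sort the classes. The sketch below uses plain Python on a toy logits list so it runs standalone; in practice you would pass `logits.squeeze().tolist()` from the model output and `model.config.id2label`. The label set shown here is an illustrative assumption, not the model's actual config.

```python
import math

def rank_emotions(logits, id2label):
    """Convert raw logits to softmax probabilities, sorted best-first."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return sorted(
        ((id2label[i], e / total) for i, e in enumerate(exps)),
        key=lambda pair: pair[1],
        reverse=True,
    )

# Toy example (these labels are assumptions for illustration only)
id2label = {0: "angry", 1: "happy", 2: "sad", 3: "neutral"}
logits = [0.2, 3.1, -1.0, 0.5]

for label, prob in rank_emotions(logits, id2label):
    print(f"{label}: {prob:.3f}")
```

With real model output, replace `logits` with `model(input_values).logits.squeeze().tolist()` and `id2label` with `model.config.id2label`.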