Update README.md
README.md
@@ -38,24 +38,14 @@ license: cc-by-4.0

This work is based on the research paper:
**"EMONET-VOICE: A Fine-Grained, Expert-Verified Benchmark for Speech Emotion Detection"**
-*Authors: Anonymous Author(s) (as per the provided OCR text)*
-*(Please refer to the full paper when published for complete author list and affiliations).*
-*Paper link: (To be added when EMONET-VOICE paper is available, e.g., ArXiv/Conference link)*

-The models and datasets (LAION'S GOT TALENT, EMONET-VOICE benchmark) are intended to be released under a permissive license (e.g., CC-BY-4.0, Apache 2.0, as mentioned in the NeurIPS checklist from OCR).

## Model Description

-The Empathic-Insight-Voice-Small suite consists of over
+The Empathic-Insight-Voice-Small suite consists of over 54 individual MLP models (40 for primary emotions, plus others for attributes such as valence, arousal, and gender). Each model takes a Whisper audio embedding as input and outputs a continuous score for one of the emotion/attribute categories defined in the EMONET-VOICE taxonomy and extended attribute set.

-The models were trained on a large dataset of synthetic
+The models were trained on a large dataset of synthetic and "in the wild" speech (roughly 5,000 hours of each).

-**Key Features:**
-* **Fine-grained Emotions & Attributes:** Covers a 40-category emotion taxonomy plus additional vocal attributes.
-* **Synthetic Data Foundation:** Trained on LAION'S GOT TALENT, a large-scale (5,000+ hours) synthetic voice-acting dataset across 11 voices, 40 emotions, and 4 languages.
-* **Expert-Verified Benchmark:** The EMONET-VOICE subset features rigorous validation by human experts with psychology degrees.
-* **Multilingual Potential:** The foundation dataset includes English, German, Spanish, and French.
-* **Open:** Publicly released models, datasets, and taxonomy are planned.

## Intended Use
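
As a rough illustration of the model description added in the hunk above (one small MLP head per emotion/attribute dimension on top of a Whisper audio embedding), here is a minimal sketch; the Whisper checkpoint, mean pooling, and head size are assumptions for illustration, not the released implementation.

```python
# Hypothetical sketch only: score a single emotion/attribute dimension from a
# Whisper embedding with a small MLP head. Checkpoint, pooling, and head shape
# are assumptions; the released Empathic-Insight-Voice-Small heads may differ.
import torch
import torch.nn as nn
import librosa
from transformers import WhisperFeatureExtractor, WhisperModel

WHISPER_ID = "openai/whisper-base"  # assumed backbone size

feature_extractor = WhisperFeatureExtractor.from_pretrained(WHISPER_ID)
whisper = WhisperModel.from_pretrained(WHISPER_ID).eval()

class EmotionHead(nn.Module):
    """One small MLP probe per emotion/attribute dimension (hypothetical shape)."""
    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

def embed_audio(path: str) -> torch.Tensor:
    """Mean-pool Whisper encoder states into one utterance-level embedding."""
    audio, sr = librosa.load(path, sr=16000)
    inputs = feature_extractor(audio, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        hidden = whisper.encoder(inputs.input_features).last_hidden_state
    return hidden.mean(dim=1)  # shape: (1, d_model)

# Untrained placeholder head; in practice there is one trained head per taxonomy dimension.
joy_head = EmotionHead(whisper.config.d_model)
print(float(joy_head(embed_audio("sample.mp3"))))
```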
@@ -306,14 +296,6 @@ def process_audio_file(audio_file_path_str: str) -> Dict[str, float]:
# print("Skipping example usage: 'sample.mp3' not found or maps are not fully populated.")
```

-Batch Processing and Reporting:
-The Google Colab Notebook provides a complete pipeline (Cells 3 and 4) for:
-
-Processing all audio files in a specified input folder.
-
-Generating a detailed HTML report summarizing predictions for all files, including waveforms, audio players, and scores for all dimensions.
-
-Saving per-file JSON outputs containing all raw prediction scores.

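The removed notebook description above amounts to a simple folder-level loop around the README's `process_audio_file(audio_file_path_str: str) -> Dict[str, float]` helper; a minimal sketch under that assumption follows (folder names are placeholders, not paths from the notebook).

```python
# Hypothetical batch loop: call the README's process_audio_file(...) on every
# audio file in a folder and save the raw scores as per-file JSON.
# INPUT_DIR / OUTPUT_DIR are placeholder names; process_audio_file is assumed
# to be defined as in the README's usage example above.
import json
from pathlib import Path

INPUT_DIR = Path("input_audio")
OUTPUT_DIR = Path("predictions")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

for audio_path in sorted(INPUT_DIR.glob("*.mp3")):
    scores = process_audio_file(str(audio_path))  # Dict[str, float] over all dimensions
    out_file = OUTPUT_DIR / f"{audio_path.stem}.json"
    out_file.write_text(json.dumps(scores, indent=2))
    print(f"{audio_path.name}: saved {len(scores)} scores to {out_file}")
```
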
## Taxonomy