ChristophSchuhmann committed
Commit d3949db (verified) · 1 Parent(s): 67074fe

Update README.md

Files changed (1)
  1. README.md +2 -20
README.md CHANGED
@@ -38,24 +38,14 @@ license: cc-by-4.0
 
 This work is based on the research paper:
 **"EMONET-VOICE: A Fine-Grained, Expert-Verified Benchmark for Speech Emotion Detection"**
-*Authors: Anonymous Author(s) (as per the provided OCR text)*
-*(Please refer to the full paper when published for the complete author list and affiliations.)*
-*Paper link: (To be added when the EMONET-VOICE paper is available, e.g., an ArXiv/conference link)*
 
-The models and datasets (LAION'S GOT TALENT, EMONET-VOICE benchmark) are intended to be released under a permissive license (e.g., CC-BY-4.0, Apache 2.0, as mentioned in the NeurIPS checklist from OCR).
 
 ## Model Description
 
-The Empathic-Insight-Voice-Small suite consists of over 50 individual MLP models (40 for primary emotions, plus others for attributes like valence, arousal, gender, etc.). Each model takes a Whisper audio embedding as input and outputs a continuous score for one of the emotion/attribute categories defined in the EMONET-VOICE taxonomy and extended attribute set.
+The Empathic-Insight-Voice-Small suite consists of over 54 individual MLP models (40 for primary emotions, plus others for attributes like valence, arousal, gender, etc.). Each model takes a Whisper audio embedding as input and outputs a continuous score for one of the emotion/attribute categories defined in the EMONET-VOICE taxonomy and extended attribute set.
 
-The models were trained on a large dataset of synthetic speech, with labels generated via a sophisticated process involving LLMs (Gemini Flash 2.0) for initial 40-dimensional emotion scores and subsequent expert verification for the EMONET-VOICE benchmark subset.
+The models were trained on a large dataset of synthetic and "in the wild" speech (approximately 5,000 hours of each).
 
-**Key Features:**
-* **Fine-grained Emotions & Attributes:** Covers a 40-category emotion taxonomy plus additional vocal attributes.
-* **Synthetic Data Foundation:** Trained on LAION'S GOT TALENT, a large-scale (5,000+ hours) synthetic voice-acting dataset across 11 voices, 40 emotions, and 4 languages.
-* **Expert-Verified Benchmark:** The EMONET-VOICE subset features rigorous validation by human experts with psychology degrees.
-* **Multilingual Potential:** The foundation dataset includes English, German, Spanish, and French.
-* **Open:** Publicly released models, datasets, and taxonomy are planned.
 
 ## Intended Use
 
@@ -306,14 +296,6 @@ def process_audio_file(audio_file_path_str: str) -> Dict[str, float]:
 # print("Skipping example usage: 'sample.mp3' not found or maps are not fully populated.")
 ```
 
-Batch Processing and Reporting:
-The Google Colab Notebook provides a complete pipeline (Cells 3 and 4) for:
-
-Processing all audio files in a specified input folder.
-
-Generating a detailed HTML report summarizing predictions for all files, including waveforms, audio players, and scores for all dimensions.
-
-Saving per-file JSON outputs containing all raw prediction scores.
 
 ## Taxonomy
 
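For context, the architecture described in the Model Description above (one small MLP per emotion/attribute category, each scoring a pooled Whisper audio embedding) can be sketched roughly as follows. This is an illustrative assumption, not the released implementation: the class name `EmotionScoreHead`, the layer sizes, the 768-dimensional embedding, and the category names are all hypothetical stand-ins.

```python
import torch
import torch.nn as nn


class EmotionScoreHead(nn.Module):
    """Hypothetical MLP head: maps one pooled Whisper embedding to one scalar score."""

    def __init__(self, embed_dim: int = 768, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),  # one continuous score per category
        )

    def forward(self, pooled_embedding: torch.Tensor) -> torch.Tensor:
        # pooled_embedding: (batch, embed_dim), e.g. mean-pooled Whisper encoder states
        return self.net(pooled_embedding).squeeze(-1)


# One independent head per category, as the Model Description states
heads = {name: EmotionScoreHead() for name in ["joy", "sadness", "valence"]}

pooled = torch.randn(1, 768)  # stand-in for a real Whisper embedding
scores = {name: head(pooled).item() for name, head in heads.items()}
print(scores)
```

In real use, `pooled` would come from a Whisper encoder run over the input audio; each category's head is loaded and applied separately, which is why the suite ships as many individual model files.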