ChristophSchuhmann committed
Commit d3949db (verified) · 1 Parent(s): 67074fe

Update README.md

Files changed (1)
  1. README.md +2 -20
README.md CHANGED
@@ -38,24 +38,14 @@ license: cc-by-4.0
 
 This work is based on the research paper:
 **"EMONET-VOICE: A Fine-Grained, Expert-Verified Benchmark for Speech Emotion Detection"**
-*Authors: Anonymous Author(s) (as per the provided OCR text)*
-*(Please refer to the full paper when published for the complete author list and affiliations.)*
-*Paper link: (To be added when the EMONET-VOICE paper is available, e.g., an ArXiv/conference link)*
 
-The models and datasets (LAION'S GOT TALENT, EMONET-VOICE benchmark) are intended to be released under a permissive license (e.g., CC-BY-4.0, Apache 2.0, as mentioned in the NeurIPS checklist from OCR).
 
 ## Model Description
 
-The Empathic-Insight-Voice-Small suite consists of over 50 individual MLP models (40 for primary emotions, plus others for attributes like valence, arousal, gender, etc.). Each model takes a Whisper audio embedding as input and outputs a continuous score for one of the emotion/attribute categories defined in the EMONET-VOICE taxonomy and extended attribute set.
+The Empathic-Insight-Voice-Small suite consists of over 54 individual MLP models (40 for primary emotions, plus others for attributes like valence, arousal, gender, etc.). Each model takes a Whisper audio embedding as input and outputs a continuous score for one of the emotion/attribute categories defined in the EMONET-VOICE taxonomy and extended attribute set.
 
-The models were trained on a large dataset of synthetic speech, with labels generated via a sophisticated process involving LLMs (Gemini Flash 2.0) for initial 40-dimensional emotion scores and subsequent expert verification for the EMONET-VOICE benchmark subset.
+The models were trained on a large dataset of synthetic and "in the wild" speech (approximately 5,000 hours of each).
 
-**Key Features:**
-* **Fine-grained Emotions & Attributes:** Covers a 40-category emotion taxonomy plus additional vocal attributes.
-* **Synthetic Data Foundation:** Trained on LAION'S GOT TALENT, a large-scale (5,000+ hours) synthetic voice-acting dataset across 11 voices, 40 emotions, and 4 languages.
-* **Expert-Verified Benchmark:** The EMONET-VOICE subset features rigorous validation by human experts with psychology degrees.
-* **Multilingual Potential:** The foundation dataset includes English, German, Spanish, and French.
-* **Open:** Publicly released models, datasets, and taxonomy are planned.
 
 ## Intended Use
 
@@ -306,14 +296,6 @@ def process_audio_file(audio_file_path_str: str) -> Dict[str, float]:
 # print("Skipping example usage: 'sample.mp3' not found or maps are not fully populated.")
 ```
 
-Batch Processing and Reporting:
-The Google Colab Notebook provides a complete pipeline (Cells 3 and 4) for:
-
-Processing all audio files in a specified input folder.
-
-Generating a detailed HTML report summarizing predictions for all files, including waveforms, audio players, and scores for all dimensions.
-
-Saving per-file JSON outputs containing all raw prediction scores.
 
 ## Taxonomy
 
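For context, the architecture described in the Model Description above (one small MLP per emotion/attribute category, each scoring a pooled Whisper audio embedding) can be sketched roughly as follows. This is an illustrative assumption, not the released implementation: the class name `EmotionScoreHead`, the layer sizes, the 768-dimensional embedding, and the category names are all hypothetical stand-ins.

```python
import torch
import torch.nn as nn


class EmotionScoreHead(nn.Module):
    """Hypothetical MLP head: maps one pooled Whisper embedding to one scalar score."""

    def __init__(self, embed_dim: int = 768, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),  # one continuous score per category
        )

    def forward(self, pooled_embedding: torch.Tensor) -> torch.Tensor:
        # pooled_embedding: (batch, embed_dim), e.g. mean-pooled Whisper encoder states
        return self.net(pooled_embedding).squeeze(-1)


# One independent head per category, as the Model Description states
heads = {name: EmotionScoreHead() for name in ["joy", "sadness", "valence"]}

pooled = torch.randn(1, 768)  # stand-in for a real Whisper embedding
scores = {name: head(pooled).item() for name, head in heads.items()}
print(scores)
```

In real use, `pooled` would come from a Whisper encoder run over the input audio; each category's head is loaded and applied separately, which is why the suite ships as many individual model files.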