ChristophSchuhmann committed (verified)
Commit 08b1dc5 · Parent: 99f5eb4 · Update README.md
Files changed: README.md (+323 −3)

---
license: cc-by-4.0
---

# Empathic-Insight-Voice-Small
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1WR-B6j--Y5RdhIyRGF_tJ3YdFF8BkUA2)

## Example Video Analyses (Top 3 Emotions)
<!-- This section is populated by the HTML generated in Cell 0 of the Colab -->
{{YOUTUBE_PREVIEWS_HTML}}

**Empathic-Insight-Voice-Small** is a suite of 40+ emotion and attribute regression models trained on the EMONET-VOICE benchmark dataset, which is derived from LAION'S GOT TALENT, a large-scale, multilingual synthetic voice-acting dataset. Each model predicts the intensity of a specific fine-grained emotion or attribute from speech audio. The models use embeddings from a fine-tuned Whisper model (mkrausio/EmoWhisper-AnS-Small-v0.1), followed by a dedicated MLP regression head for each dimension.

This work is based on the research paper:
**"EMONET-VOICE: A Fine-Grained, Expert-Verified Benchmark for Speech Emotion Detection"**
*Authors: Anonymous Author(s)*
*(Please refer to the full paper when published for the complete author list and affiliations.)*
*Paper link: (to be added when the EMONET-VOICE paper is available, e.g., an arXiv or conference link)*

The models and the accompanying datasets (LAION'S GOT TALENT and the EMONET-VOICE benchmark) are intended for release under permissive licenses (e.g., CC-BY-4.0 or Apache 2.0).

## Model Description

The Empathic-Insight-Voice-Small suite consists of over 50 individual MLP models (40 for the primary emotions, plus others for attributes like valence, arousal, gender, etc.). Each model takes a Whisper audio embedding as input and outputs a continuous score for one of the emotion/attribute categories defined in the EMONET-VOICE taxonomy and extended attribute set.
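
As a quick orientation (not part of the released code), here is a minimal shape sketch of what each "Small" expert computes; the constants match the example code further below:

```python
import torch
import torch.nn as nn

SEQ_LEN, EMBED_DIM, PROJ_DIM = 1500, 768, 64     # "Small" configuration

emb = torch.randn(1, SEQ_LEN, EMBED_DIM)         # one clip's Whisper encoder output
flat = emb.flatten(start_dim=1)                  # (1, 1500 * 768) = (1, 1152000)
proj = nn.Linear(SEQ_LEN * EMBED_DIM, PROJ_DIM)  # learned projection to 64 dims
print(proj(flat).shape)                          # torch.Size([1, 64]) -> hidden MLP layers -> 1 scalar score
```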

The models were trained on a large dataset of synthetic speech, with labels generated via a sophisticated process involving LLMs (Gemini Flash 2.0) for initial 40-dimensional emotion scores and subsequent expert verification for the EMONET-VOICE benchmark subset.

**Key Features:**
* **Fine-grained Emotions & Attributes:** Covers a 40-category emotion taxonomy plus additional vocal attributes.
* **Synthetic Data Foundation:** Trained on LAION'S GOT TALENT, a large-scale (5,000+ hours) synthetic voice-acting dataset across 11 voices, 40 emotions, and 4 languages.
* **Expert-Verified Benchmark:** The EMONET-VOICE subset features rigorous validation by human experts with psychology degrees.
* **Multilingual Potential:** The foundation dataset includes English, German, Spanish, and French.
* **Open:** Public release of the models, datasets, and taxonomy is planned.

## Intended Use

These models are intended for research purposes in affective computing, speech emotion recognition (SER), human-AI interaction, and voice AI development. They can be used to:
* Analyze and predict fine-grained emotional states and vocal attributes from speech.
* Serve as a baseline for developing more advanced SER systems.
* Facilitate research into nuanced emotional understanding in voice AI.
* Explore multilingual and cross-cultural aspects of speech emotion (given the foundation dataset).

**Out-of-Scope Use:**
These models are trained on synthetic speech, and their generalization to spontaneous real-world speech needs further evaluation. They should not be used for making critical decisions about individuals, for surveillance, or in any manner that could lead to discriminatory outcomes or infringe on privacy without due diligence and ethical review.

## How to Use

The primary way to use these models is through the provided [Google Colab Notebook](https://colab.research.google.com/drive/1WR-B6j--Y5RdhIyRGF_tJ3YdFF8BkUA2). The notebook handles dependencies, model loading, and audio processing, and provides examples for:
* Batch processing a folder of audio files.
* Generating a comprehensive HTML report with per-file emotion scores, waveforms, and audio players.
* Generating individual JSON files with all predicted scores for each audio file.

Below is a conceptual example of how to perform inference on a single audio file, extracting all emotion and attribute scores. For the full, runnable version, please refer to the Colab notebook.

**Conceptual Python Example for Single Audio File Inference:**

```python
import torch
import torch.nn as nn
import librosa
import numpy as np
from pathlib import Path
from typing import Dict  # For the type hints used below
from transformers import WhisperProcessor, WhisperForConditionalGeneration
from huggingface_hub import snapshot_download  # For downloading MLP checkpoints
import gc  # For memory management

# --- Configuration (should match Cell 2 of the Colab) ---
SAMPLING_RATE = 16000
MAX_AUDIO_SECONDS = 30.0
WHISPER_MODEL_ID = "mkrausio/EmoWhisper-AnS-Small-v0.1"
HF_MLP_REPO_ID = "laion/Empathic-Insight-Voice-Small"  # Or -Large if using those
LOCAL_MLP_MODELS_DOWNLOAD_DIR = Path("./empathic_insight_voice_small_models_downloaded")

WHISPER_SEQ_LEN = 1500
WHISPER_EMBED_DIM = 768
PROJECTION_DIM_FOR_FULL_EMBED = 64  # For 'Small' models
MLP_HIDDEN_DIMS = [64, 32, 16]  # For 'Small' models
MLP_DROPOUTS = [0.0, 0.1, 0.1, 0.1]  # For 'Small' models

# Mapping from .pth file name parts to human-readable dimension keys
# (abridged; the full map is in Colab Cell 2)
FILENAME_PART_TO_TARGET_KEY_MAP = {
    "Affection": "Affection", "Amusement": "Amusement", "Anger": "Anger",
    "Arousal": "Arousal", "Valence": "Valence",  # ... and many more
    # Add all 40 emotions and other attributes as per Colab Cell 2
}
TARGET_EMOTION_KEYS_FOR_REPORT = [  # The 40 primary emotions
    "Amusement", "Elation",  # ... (full list from Colab Cell 2)
]


# --- MLP Model Definition (from Colab Cell 2) ---
class FullEmbeddingMLP(nn.Module):
    def __init__(self, seq_len, embed_dim, projection_dim, mlp_hidden_dims, mlp_dropout_rates):
        super().__init__()
        if len(mlp_dropout_rates) != len(mlp_hidden_dims) + 1:
            raise ValueError("Need one dropout rate per hidden layer plus one for the projection.")
        self.flatten = nn.Flatten()
        self.proj = nn.Linear(seq_len * embed_dim, projection_dim)
        layers = [nn.ReLU(), nn.Dropout(mlp_dropout_rates[0])]
        current_dim = projection_dim
        for i, h_dim in enumerate(mlp_hidden_dims):
            layers.extend([nn.Linear(current_dim, h_dim), nn.ReLU(), nn.Dropout(mlp_dropout_rates[i + 1])])
            current_dim = h_dim
        layers.append(nn.Linear(current_dim, 1))  # Single regression output per dimension
        self.mlp = nn.Sequential(*layers)

    def forward(self, x):
        if x.ndim == 4 and x.shape[1] == 1:  # Drop a singleton channel dim if present
            x = x.squeeze(1)
        return self.mlp(self.proj(self.flatten(x)))


# --- Global Model Placeholders ---
whisper_model_global = None
whisper_processor_global = None
all_mlp_model_paths_dict = {}  # Populated by initialize_models()
WHISPER_DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
MLP_DEVICE = torch.device("cpu")  # As per USE_CPU_OFFLOADING_FOR_MLPS in the Colab


def initialize_models():
    global whisper_model_global, whisper_processor_global, all_mlp_model_paths_dict

    print(f"Whisper will run on: {WHISPER_DEVICE}")
    print(f"MLPs will run on: {MLP_DEVICE}")

    # Load Whisper once
    if whisper_model_global is None:
        print(f"Loading Whisper model '{WHISPER_MODEL_ID}'...")
        whisper_processor_global = WhisperProcessor.from_pretrained(WHISPER_MODEL_ID)
        whisper_model_global = WhisperForConditionalGeneration.from_pretrained(WHISPER_MODEL_ID).to(WHISPER_DEVICE).eval()
        print("Whisper model loaded.")

    # Download and map MLP checkpoints (paths only; each model is loaded on demand)
    if not all_mlp_model_paths_dict:
        print(f"Downloading MLP checkpoints from {HF_MLP_REPO_ID} to {LOCAL_MLP_MODELS_DOWNLOAD_DIR}...")
        LOCAL_MLP_MODELS_DOWNLOAD_DIR.mkdir(parents=True, exist_ok=True)
        snapshot_download(
            repo_id=HF_MLP_REPO_ID,
            local_dir=LOCAL_MLP_MODELS_DOWNLOAD_DIR,
            local_dir_use_symlinks=False,
            allow_patterns=["*.pth"],
            repo_type="model",
        )
        print("MLP checkpoints downloaded.")

        # Map .pth files to target keys (simplified from Colab Cell 2)
        for pth_file in LOCAL_MLP_MODELS_DOWNLOAD_DIR.glob("model_*_best.pth"):
            try:
                filename_part = pth_file.name.split("model_")[1].split("_best.pth")[0]
                if filename_part in FILENAME_PART_TO_TARGET_KEY_MAP:
                    target_key = FILENAME_PART_TO_TARGET_KEY_MAP[filename_part]
                    all_mlp_model_paths_dict[target_key] = pth_file
            except IndexError:
                print(f"Warning: Could not parse filename part from {pth_file.name}")
        print(f"Mapped {len(all_mlp_model_paths_dict)} MLP model paths.")
        if not all_mlp_model_paths_dict:
            raise RuntimeError("No MLP model paths could be mapped. Check FILENAME_PART_TO_TARGET_KEY_MAP and downloaded files.")


@torch.no_grad()
def get_whisper_embedding(audio_waveform_np):
    if whisper_model_global is None or whisper_processor_global is None:
        raise RuntimeError("Whisper model not initialized. Call initialize_models() first.")

    input_features = whisper_processor_global(
        audio_waveform_np, sampling_rate=SAMPLING_RATE, return_tensors="pt"
    ).input_features.to(WHISPER_DEVICE).to(whisper_model_global.dtype)

    encoder_outputs = whisper_model_global.get_encoder()(input_features=input_features)
    embedding = encoder_outputs.last_hidden_state

    # Pad or truncate to a fixed sequence length so the MLP input size is constant
    current_seq_len = embedding.shape[1]
    if current_seq_len < WHISPER_SEQ_LEN:
        padding = torch.zeros((1, WHISPER_SEQ_LEN - current_seq_len, WHISPER_EMBED_DIM),
                              device=WHISPER_DEVICE, dtype=embedding.dtype)
        embedding = torch.cat((embedding, padding), dim=1)
    elif current_seq_len > WHISPER_SEQ_LEN:
        embedding = embedding[:, :WHISPER_SEQ_LEN, :]
    return embedding


def load_single_mlp(model_path, target_key):
    # Simplified loading for this example (Colab Cell 2 has more robust loading).
    # Assumes USE_HALF_PRECISION_FOR_MLPS=False and USE_TORCH_COMPILE_FOR_MLPS=False.
    print(f"  Loading MLP for '{target_key}'...")
    model_instance = FullEmbeddingMLP(
        WHISPER_SEQ_LEN, WHISPER_EMBED_DIM, PROJECTION_DIM_FOR_FULL_EMBED,
        MLP_HIDDEN_DIMS, MLP_DROPOUTS
    )
    state_dict = torch.load(model_path, map_location="cpu")
    # Strip the '_orig_mod.' prefix left behind if the model was torch.compile'd during training
    if any(k.startswith("_orig_mod.") for k in state_dict.keys()):
        state_dict = {k.replace("_orig_mod.", ""): v for k, v in state_dict.items()}
    model_instance.load_state_dict(state_dict)
    model_instance = model_instance.to(MLP_DEVICE).eval()
    return model_instance


@torch.no_grad()
def predict_with_mlp(embedding, mlp_model):
    embedding_for_mlp = embedding.to(MLP_DEVICE)
    # Match the MLP's parameter dtype (simplified)
    mlp_dtype = next(mlp_model.parameters()).dtype
    prediction = mlp_model(embedding_for_mlp.to(mlp_dtype))
    return prediction.item()


def process_audio_file(audio_file_path_str: str) -> Dict[str, float]:
    if not all_mlp_model_paths_dict:
        initialize_models()  # Ensure models are ready

    print(f"Processing audio file: {audio_file_path_str}")
    try:
        waveform, sr = librosa.load(audio_file_path_str, sr=SAMPLING_RATE, mono=True)
        max_samples = int(MAX_AUDIO_SECONDS * SAMPLING_RATE)
        if len(waveform) > max_samples:
            waveform = waveform[:max_samples]
        print(f"Audio loaded. Duration: {len(waveform) / SAMPLING_RATE:.2f}s")
    except Exception as e:
        print(f"Error loading audio {audio_file_path_str}: {e}")
        return {}

    embedding = get_whisper_embedding(waveform)
    del waveform
    gc.collect()
    if WHISPER_DEVICE.type == "cuda":
        torch.cuda.empty_cache()

    all_scores: Dict[str, float] = {}
    for target_key, mlp_model_path in all_mlp_model_paths_dict.items():
        if target_key not in FILENAME_PART_TO_TARGET_KEY_MAP.values():  # Only process mapped keys
            continue

        current_mlp_model = load_single_mlp(mlp_model_path, target_key)
        if current_mlp_model:
            score = predict_with_mlp(embedding, current_mlp_model)
            all_scores[target_key] = score
            print(f"  {target_key}: {score:.4f}")
            del current_mlp_model  # Unload after use to keep memory low
            gc.collect()
            if MLP_DEVICE.type == "cuda":
                torch.cuda.empty_cache()
        else:
            all_scores[target_key] = float("nan")

    del embedding
    gc.collect()
    if WHISPER_DEVICE.type == "cuda":
        torch.cuda.empty_cache()

    # Optional: softmax over the 40 primary emotions
    emotion_keys_present = [k for k in TARGET_EMOTION_KEYS_FOR_REPORT if k in all_scores]
    emotion_raw_scores = [all_scores[k] for k in emotion_keys_present]
    if emotion_raw_scores:
        softmax_probs = torch.softmax(torch.tensor(emotion_raw_scores, dtype=torch.float32), dim=0)
        print("\nTop 3 Emotions (Softmax Probabilities):")
        # Dictionary of {emotion_key: softmax_prob}
        emotion_softmax_dict = {
            key: prob.item() for key, prob in zip(emotion_keys_present, softmax_probs)
        }
        sorted_emotions = sorted(emotion_softmax_dict.items(), key=lambda item: item[1], reverse=True)
        for i, (emotion, prob) in enumerate(sorted_emotions[:3]):
            print(f"  {i + 1}. {emotion}: {prob:.4f} (Raw: {all_scores.get(emotion, float('nan')):.4f})")
    return all_scores


# --- Example Usage (run after defining the functions above and initializing models) ---
# Make sure an audio file (e.g., "sample.mp3") exists in the current directory or provide a full path,
# and ensure FILENAME_PART_TO_TARGET_KEY_MAP and TARGET_EMOTION_KEYS_FOR_REPORT are fully populated.
#
# initialize_models()  # Call this once
#
# # Create a dummy sample.mp3 for testing if it doesn't exist
# if not Path("sample.mp3").exists():
#     print("Creating dummy sample.mp3 for testing...")
#     dummy_sr = 16000
#     dummy_duration = 5  # seconds
#     dummy_tone_freq = 440  # A4 note
#     t = np.linspace(0, dummy_duration, int(dummy_sr * dummy_duration), endpoint=False)
#     dummy_waveform = 0.5 * np.sin(2 * np.pi * dummy_tone_freq * t)
#     import soundfile as sf
#     sf.write("sample.mp3", dummy_waveform, dummy_sr)
#     print("Dummy sample.mp3 created.")
#
# if Path("sample.mp3").exists() and FILENAME_PART_TO_TARGET_KEY_MAP and TARGET_EMOTION_KEYS_FOR_REPORT:
#     results = process_audio_file("sample.mp3")
#     # print("\nFull Scores Dictionary:", results)
# else:
#     print("Skipping example usage: 'sample.mp3' not found or the maps are not fully populated.")
```

**Batch Processing and Reporting:**
The Google Colab Notebook provides a complete pipeline (Cells 3 and 4) for:
* Processing all audio files in a specified input folder.
* Generating a detailed HTML report summarizing predictions for all files, including waveforms, audio players, and scores for all dimensions.
* Saving per-file JSON outputs containing all raw prediction scores.

A minimal version of the JSON step is sketched after this list.
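
For orientation only, a batch loop over a folder might look like the sketch below; the folder names are placeholders, and `process_audio_file` is the function from the single-file example above (the real pipeline in Cells 3 and 4 additionally renders waveforms and the HTML report):

```python
import json
from pathlib import Path

# Hypothetical input/output folder names; adjust to your setup.
INPUT_DIR = Path("./audio_inputs")
OUTPUT_DIR = Path("./emotion_json_outputs")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

# process_audio_file() is defined in the single-file example above.
for audio_path in sorted(INPUT_DIR.glob("*.mp3")):
    scores = process_audio_file(str(audio_path))  # {dimension: raw score}
    out_path = OUTPUT_DIR / f"{audio_path.stem}.json"
    out_path.write_text(json.dumps(scores, indent=2))
    print(f"Wrote {out_path}")
```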

## Taxonomy

The core 40 emotion categories (from EMONET-VOICE, Appendix A.1) are:
Affection, Amusement, Anger, Astonishment/Surprise, Awe, Bitterness, Concentration, Confusion, Contemplation, Contempt, Contentment, Disappointment, Disgust, Distress, Doubt, Elation, Embarrassment, Emotional Numbness, Fatigue/Exhaustion, Fear, Helplessness, Hope/Enthusiasm/Optimism, Impatience and Irritability, Infatuation, Interest, Intoxication/Altered States of Consciousness, Jealousy & Envy, Longing, Malevolence/Malice, Pain, Pleasure/Ecstasy, Pride, Relief, Sadness, Sexual Lust, Shame, Sourness, Teasing, Thankfulness/Gratitude, Triumph.

Additional vocal attributes (e.g., Valence, Arousal, Gender, Age, Pitch characteristics) are also predicted by corresponding MLP models in the suite. The full list of predictable dimensions can be inferred from the `FILENAME_PART_TO_TARGET_KEY_MAP` in the Colab notebook (Cell 2).
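
To check which dimensions your local download actually provides, a small hypothetical helper (relying only on the `model_<Dimension>_best.pth` naming convention used in the loading code above) could look like this:

```python
from pathlib import Path

# List the dimension names found in a local checkpoint folder.
def list_available_dimensions(models_dir: str) -> list:
    return sorted(
        p.name[len("model_"):-len("_best.pth")]
        for p in Path(models_dir).glob("model_*_best.pth")
    )

print(list_available_dimensions("./empathic_insight_voice_small_models_downloaded"))
```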

## Ethical Considerations

The EMONET-VOICE suite was developed with ethical considerations as a priority:

* **Privacy Preservation:** The use of synthetic voice generation circumvents the privacy concerns associated with collecting real human emotional expressions, especially for sensitive states.
* **Responsible Use:** These models are released for research. Users are urged to consider the ethical implications of their applications and to avoid misuse, such as emotional manipulation, surveillance, or uses that could lead to unfair, biased, or harmful outcomes. The broader societal implications of SER technology and the mitigation of its potential misuse remain important ongoing considerations.