---
license: cc-by-4.0
---

# Empathic-Insight-Voice-Small
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1WR-B6j--Y5RdhIyRGF_tJ3YdFF8BkUA2)

**Empathic-Insight-Voice-Small** is a suite of 40+ emotion and attribute regression models trained on the large-scale, multilingual synthetic voice-acting dataset LAION'S GOT TALENT (~5,000 hours) and an "in the wild" dataset of voice snippets (also ~5,000 hours). Each model predicts the intensity of a specific fine-grained emotion or attribute from speech audio. The models leverage embeddings from a fine-tuned Whisper model (laion/BUD-E-Whisper), followed by a dedicated MLP regression head per dimension.

This work is based on the research paper:
**"EMONET-VOICE: A Fine-Grained, Expert-Verified Benchmark for Speech Emotion Detection"**


## Example Video Analyses (Top 3 Emotions)
<!-- This section will be populated by the HTML from Cell 0 -->
<div style='display: flex; flex-wrap: wrap; justify-content: flex-start; gap: 15px;'>
            <div style='flex: 0 1 auto; margin-bottom: 15px; text-align: center; width: 480px; max-width: 480px;'>
                <a href='https://www.youtube.com/watch?v=TsTVKCmqHhk' target='_blank' title='Watch video TsTVKCmqHhk'>
                    <img src='https://img.youtube.com/vi/TsTVKCmqHhk/hqdefault.jpg' alt='YouTube Thumbnail for TsTVKCmqHhk' style='width: 100%; height: auto; border: 1px solid #ccc; border-radius: 4px; display: block;'>
                </a>
                <p style='font-size: 0.8em; margin-top: 5px; word-break: break-all;'>ID: TsTVKCmqHhk</p>
            </div>
            <div style='flex: 0 1 auto; margin-bottom: 15px; text-align: center; width: 480px; max-width: 480px;'>
                <a href='https://www.youtube.com/watch?v=sErqFgL4vA8' target='_blank' title='Watch video sErqFgL4vA8'>
                    <img src='https://img.youtube.com/vi/sErqFgL4vA8/hqdefault.jpg' alt='YouTube Thumbnail for sErqFgL4vA8' style='width: 100%; height: auto; border: 1px solid #ccc; border-radius: 4px; display: block;'>
                </a>
                <p style='font-size: 0.8em; margin-top: 5px; word-break: break-all;'>ID: sErqFgL4vA8</p>
            </div>
            <div style='flex: 0 1 auto; margin-bottom: 15px; text-align: center; width: 480px; max-width: 480px;'>
                <a href='https://www.youtube.com/watch?v=BUnfuiwE_IM' target='_blank' title='Watch video BUnfuiwE_IM'>
                    <img src='https://img.youtube.com/vi/BUnfuiwE_IM/hqdefault.jpg' alt='YouTube Thumbnail for BUnfuiwE_IM' style='width: 100%; height: auto; border: 1px solid #ccc; border-radius: 4px; display: block;'>
                </a>
                <p style='font-size: 0.8em; margin-top: 5px; word-break: break-all;'>ID: BUnfuiwE_IM</p>
            </div>
            <div style='flex: 0 1 auto; margin-bottom: 15px; text-align: center; width: 480px; max-width: 480px;'>
                <a href='https://www.youtube.com/watch?v=dDrmjcUq8W4' target='_blank' title='Watch video dDrmjcUq8W4'>
                    <img src='https://img.youtube.com/vi/dDrmjcUq8W4/hqdefault.jpg' alt='YouTube Thumbnail for dDrmjcUq8W4' style='width: 100%; height: auto; border: 1px solid #ccc; border-radius: 4px; display: block;'>
                </a>
                <p style='font-size: 0.8em; margin-top: 5px; word-break: break-all;'>ID: dDrmjcUq8W4</p>
            </div>
            </div>

## Model Description

The Empathic-Insight-Voice-Small suite consists of more than 50 individual MLP models (40 for primary emotions, plus additional models for attributes such as valence, arousal, gender, and age). Each model takes a Whisper audio embedding as input and outputs a continuous score for one of the emotion/attribute categories defined in the EMONET-VOICE taxonomy and its extended attribute set.

The models were trained on a large dataset of synthetic and "in the wild" speech (approximately 5,000 hours of each).


## Intended Use

These models are intended for research purposes in affective computing, speech emotion recognition (SER), human-AI interaction, and voice AI development. They can be used to:
*   Analyze and predict fine-grained emotional states and vocal attributes from speech.
*   Serve as a baseline for developing more advanced SER systems.
*   Facilitate research into nuanced emotional understanding in voice AI.
*   Explore multilingual and cross-cultural aspects of speech emotion (given the foundation dataset).

**Out-of-Scope Use:**
These models are trained largely on synthetic speech, and their generalization to spontaneous real-world speech requires further evaluation. They should not be used to make critical decisions about individuals, for surveillance, or in any manner that could lead to discriminatory outcomes or infringe on privacy without due diligence and ethical review.

## How to Use

The primary way to use these models is through the provided [Google Colab Notebook](https://colab.research.google.com/drive/1WR-B6j--Y5RdhIyRGF_tJ3YdFF8BkUA2). The notebook handles dependencies, model loading, audio processing, and provides examples for:
*   Batch processing a folder of audio files (a minimal sketch of this workflow follows the conceptual example below).
*   Generating a comprehensive HTML report with per-file emotion scores, waveforms, and audio players.
*   Generating individual JSON files with all predicted scores for each audio file.

Below is a conceptual example of how to perform inference for a single audio file, extracting all emotion and attribute scores. For the full, runnable version, please refer to the Colab notebook.

**Conceptual Python Example for Single Audio File Inference:**

```python
import torch
import torch.nn as nn
import librosa
import numpy as np
from pathlib import Path
from typing import Dict, List  # Needed for the type annotations used below
from transformers import WhisperProcessor, WhisperForConditionalGeneration
from huggingface_hub import snapshot_download # For downloading MLP models
import gc # For memory management

# --- Configuration (should match Cell 2 of the Colab) ---
SAMPLING_RATE = 16000
MAX_AUDIO_SECONDS = 30.0
WHISPER_MODEL_ID = "mkrausio/EmoWhisper-AnS-Small-v0.1"
HF_MLP_REPO_ID = "laion/Empathic-Insight-Voice-Small" # Or -Large if using those
LOCAL_MLP_MODELS_DOWNLOAD_DIR = Path("./empathic_insight_voice_small_models_downloaded")

WHISPER_SEQ_LEN = 1500
WHISPER_EMBED_DIM = 768
PROJECTION_DIM_FOR_FULL_EMBED = 64 # For 'Small' models
MLP_HIDDEN_DIMS = [64, 32, 16]    # For 'Small' models
MLP_DROPOUTS = [0.0, 0.1, 0.1, 0.1] # For 'Small' models

# Mapping from .pth file name parts to human-readable dimension keys
# (Abridged, full map in Colab Cell 2)

FILENAME_PART_TO_TARGET_KEY_MAP: Dict[str, str] = {
    "Affection": "Affection", "Age": "Age", "Amusement": "Amusement", "Anger": "Anger",
    "Arousal": "Arousal", "Astonishment_Surprise": "Astonishment/Surprise",
    "Authenticity": "Authenticity", "Awe": "Awe", "Background_Noise": "Background_Noise",
    "Bitterness": "Bitterness", "Concentration": "Concentration",
    "Confident_vs._Hesitant": "Confident_vs._Hesitant", "Confusion": "Confusion",
    "Contemplation": "Contemplation", "Contempt": "Contempt", "Contentment": "Contentment",
    "Disappointment": "Disappointment", "Disgust": "Disgust", "Distress": "Distress",
    "Doubt": "Doubt", "Elation": "Elation", "Embarrassment": "Embarrassment",
    "Emotional_Numbness": "Emotional Numbness", "Fatigue_Exhaustion": "Fatigue/Exhaustion",
    "Fear": "Fear", "Gender": "Gender", "Helplessness": "Helplessness",
    "High-Pitched_vs._Low-Pitched": "High-Pitched_vs._Low-Pitched",
    "Hope_Enthusiasm_Optimism": "Hope/Enthusiasm/Optimism",
    "Impatience_and_Irritability": "Impatience and Irritability",
    "Infatuation": "Infatuation", "Interest": "Interest",
    "Intoxication_Altered_States_of_Consciousness": "Intoxication/Altered States of Consciousness",
    "Jealousy_&_Envy": "Jealousy / Envy", "Longing": "Longing",
    "Malevolence_Malice": "Malevolence/Malice",
    "Monotone_vs._Expressive": "Monotone_vs._Expressive", "Pain": "Pain",
    "Pleasure_Ecstasy": "Pleasure/Ecstasy", "Pride": "Pride",
    "Recording_Quality": "Recording_Quality", "Relief": "Relief", "Sadness": "Sadness",
    "Serious_vs._Humorous": "Serious_vs._Humorous", "Sexual_Lust": "Sexual Lust",
    "Shame": "Shame", "Soft_vs._Harsh": "Soft_vs._Harsh", "Sourness": "Sourness",
    "Submissive_vs._Dominant": "Submissive_vs._Dominant", "Teasing": "Teasing",
    "Thankfulness_Gratitude": "Thankfulness/Gratitude", "Triumph": "Triumph",
    "Valence": "Valence",
    "Vulnerable_vs._Emotionally_Detached": "Vulnerable_vs._Emotionally_Detached",
    "Warm_vs._Cold": "Warm_vs._Cold"
}

TARGET_EMOTION_KEYS_FOR_REPORT: List[str] = [
    "Amusement", "Elation", "Pleasure/Ecstasy", "Contentment", "Thankfulness/Gratitude",
    "Affection", "Infatuation", "Hope/Enthusiasm/Optimism", "Triumph", "Pride",
    "Interest", "Awe", "Astonishment/Surprise", "Concentration", "Contemplation",
    "Relief", "Longing", "Teasing", "Impatience and Irritability",
    "Sexual Lust", "Doubt", "Fear", "Distress", "Confusion", "Embarrassment", "Shame",
    "Disappointment", "Sadness", "Bitterness", "Contempt", "Disgust", "Anger",
    "Malevolence/Malice", "Sourness", "Pain", "Helplessness", "Fatigue/Exhaustion",
    "Emotional Numbness", "Intoxication/Altered States of Consciousness", "Jealousy / Envy"
]

# --- MLP Model Definition (from Colab Cell 2) ---
class FullEmbeddingMLP(nn.Module):
    def __init__(self, seq_len, embed_dim, projection_dim, mlp_hidden_dims, mlp_dropout_rates):
        super().__init__()
        if len(mlp_dropout_rates) != len(mlp_hidden_dims) + 1:
            raise ValueError("Dropout rates length error.")
        self.flatten = nn.Flatten()
        self.proj = nn.Linear(seq_len * embed_dim, projection_dim)
        layers = [nn.ReLU(), nn.Dropout(mlp_dropout_rates[0])]
        current_dim = projection_dim
        for i, h_dim in enumerate(mlp_hidden_dims):
            layers.extend([nn.Linear(current_dim, h_dim), nn.ReLU(), nn.Dropout(mlp_dropout_rates[i+1])])
            current_dim = h_dim
        layers.append(nn.Linear(current_dim, 1))
        self.mlp = nn.Sequential(*layers)
    def forward(self, x):
        if x.ndim == 4 and x.shape[1] == 1: x = x.squeeze(1)
        return self.mlp(self.proj(self.flatten(x)))

# --- Global Model Placeholders ---
whisper_model_global = None
whisper_processor_global = None
all_mlp_model_paths_dict = {} # To be populated
WHISPER_DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
MLP_DEVICE = torch.device("cpu") # As per USE_CPU_OFFLOADING_FOR_MLPS in Colab

def initialize_models():
    global whisper_model_global, whisper_processor_global, all_mlp_model_paths_dict

    print(f"Whisper will run on: {WHISPER_DEVICE}")
    print(f"MLPs will run on: {MLP_DEVICE}")

    # Load Whisper
    if whisper_model_global is None:
        print(f"Loading Whisper model '{WHISPER_MODEL_ID}'...")
        whisper_processor_global = WhisperProcessor.from_pretrained(WHISPER_MODEL_ID)
        whisper_model_global = WhisperForConditionalGeneration.from_pretrained(WHISPER_MODEL_ID).to(WHISPER_DEVICE).eval()
        print("Whisper model loaded.")

    # Download and map MLPs (paths only, models loaded on-demand)
    if not all_mlp_model_paths_dict:
        print(f"Downloading MLP checkpoints from {HF_MLP_REPO_ID} to {LOCAL_MLP_MODELS_DOWNLOAD_DIR}...")
        LOCAL_MLP_MODELS_DOWNLOAD_DIR.mkdir(parents=True, exist_ok=True)
        snapshot_download(
            repo_id=HF_MLP_REPO_ID,
            local_dir=LOCAL_MLP_MODELS_DOWNLOAD_DIR,
            local_dir_use_symlinks=False,
            allow_patterns=["*.pth"],
            repo_type="model"
        )
        print("MLP checkpoints downloaded.")

        # Map .pth files to target keys (simplified from Colab Cell 2)
        for pth_file in LOCAL_MLP_MODELS_DOWNLOAD_DIR.glob("model_*_best.pth"):
            try:
                filename_part = pth_file.name.split("model_")[1].split("_best.pth")[0]
                if filename_part in FILENAME_PART_TO_TARGET_KEY_MAP:
                    target_key = FILENAME_PART_TO_TARGET_KEY_MAP[filename_part]
                    all_mlp_model_paths_dict[target_key] = pth_file
            except IndexError:
                print(f"Warning: Could not parse filename part from {pth_file.name}")
        print(f"Mapped {len(all_mlp_model_paths_dict)} MLP model paths.")
        if not all_mlp_model_paths_dict:
             raise RuntimeError("No MLP model paths could be mapped. Check FILENAME_PART_TO_TARGET_KEY_MAP and downloaded files.")


@torch.no_grad()
def get_whisper_embedding(audio_waveform_np):
    if whisper_model_global is None or whisper_processor_global is None:
        raise RuntimeError("Whisper model not initialized. Call initialize_models() first.")

    input_features = whisper_processor_global(
        audio_waveform_np, sampling_rate=SAMPLING_RATE, return_tensors="pt"
    ).input_features.to(WHISPER_DEVICE).to(whisper_model_global.dtype)

    encoder_outputs = whisper_model_global.get_encoder()(input_features=input_features)
    embedding = encoder_outputs.last_hidden_state

    current_seq_len = embedding.shape[1]
    if current_seq_len < WHISPER_SEQ_LEN:
        padding = torch.zeros((1, WHISPER_SEQ_LEN - current_seq_len, WHISPER_EMBED_DIM),
                              device=WHISPER_DEVICE, dtype=embedding.dtype)
        embedding = torch.cat((embedding, padding), dim=1)
    elif current_seq_len > WHISPER_SEQ_LEN:
        embedding = embedding[:, :WHISPER_SEQ_LEN, :]
    return embedding

def load_single_mlp(model_path, target_key):
    # Simplified loading for example (Colab Cell 2 has more robust loading)
    # For this example, assumes USE_HALF_PRECISION_FOR_MLPS=False, USE_TORCH_COMPILE_FOR_MLPS=False
    print(f"  Loading MLP for '{target_key}'...")
    model_instance = FullEmbeddingMLP(
        WHISPER_SEQ_LEN, WHISPER_EMBED_DIM, PROJECTION_DIM_FOR_FULL_EMBED,
        MLP_HIDDEN_DIMS, MLP_DROPOUTS
    )
    state_dict = torch.load(model_path, map_location='cpu')
    # Handle potential '_orig_mod.' prefix if model was torch.compile'd during training
    if any(k.startswith("_orig_mod.") for k in state_dict.keys()):
        state_dict = {k.replace("_orig_mod.", ""): v for k, v in state_dict.items()}
    model_instance.load_state_dict(state_dict)
    model_instance = model_instance.to(MLP_DEVICE).eval()
    return model_instance

@torch.no_grad()
def predict_with_mlp(embedding, mlp_model):
    embedding_for_mlp = embedding.to(MLP_DEVICE)
    # Ensure dtype matches (simplified)
    mlp_dtype = next(mlp_model.parameters()).dtype
    prediction = mlp_model(embedding_for_mlp.to(mlp_dtype))
    return prediction.item()

def process_audio_file(audio_file_path_str: str) -> Dict[str, float]:
    if not all_mlp_model_paths_dict:
        initialize_models() # Ensure models are ready

    print(f"Processing audio file: {audio_file_path_str}")
    try:
        waveform, sr = librosa.load(audio_file_path_str, sr=SAMPLING_RATE, mono=True)
        max_samples = int(MAX_AUDIO_SECONDS * SAMPLING_RATE)
        if len(waveform) > max_samples:
            waveform = waveform[:max_samples]
        print(f"Audio loaded. Duration: {len(waveform)/SAMPLING_RATE:.2f}s")
    except Exception as e:
        print(f"Error loading audio {audio_file_path_str}: {e}")
        return {}

    embedding = get_whisper_embedding(waveform)
    del waveform; gc.collect();
    if WHISPER_DEVICE.type == 'cuda': torch.cuda.empty_cache()

    all_scores: Dict[str, float] = {}
    for target_key, mlp_model_path in all_mlp_model_paths_dict.items():
        if target_key not in FILENAME_PART_TO_TARGET_KEY_MAP.values(): # Only process mapped keys
            continue

        current_mlp_model = load_single_mlp(mlp_model_path, target_key)
        if current_mlp_model:
            score = predict_with_mlp(embedding, current_mlp_model)
            all_scores[target_key] = score
            print(f"    {target_key}: {score:.4f}")
            del current_mlp_model # Unload after use
            gc.collect()
            if MLP_DEVICE.type == 'cuda': torch.cuda.empty_cache()
        else:
            all_scores[target_key] = float('nan')

    del embedding; gc.collect();
    if WHISPER_DEVICE.type == 'cuda': torch.cuda.empty_cache()

    # Optional: Calculate Softmax for the 40 primary emotions
    emotion_raw_scores = [all_scores.get(k, -float('inf')) for k in TARGET_EMOTION_KEYS_FOR_REPORT if k in all_scores]
    if emotion_raw_scores:
        softmax_probs = torch.softmax(torch.tensor(emotion_raw_scores, dtype=torch.float32), dim=0)
        print("\nTop 3 Emotions (Softmax Probabilities):")
        # Create a dictionary of {emotion_key: softmax_prob}
        emotion_softmax_dict = {
            key: prob.item()
            for key, prob in zip(
                [k for k in TARGET_EMOTION_KEYS_FOR_REPORT if k in all_scores], # only keys that had scores
                softmax_probs
            )
        }
        sorted_emotions = sorted(emotion_softmax_dict.items(), key=lambda item: item[1], reverse=True)
        for i, (emotion, prob) in enumerate(sorted_emotions[:3]):
            print(f"  {i+1}. {emotion}: {prob:.4f} (Raw: {all_scores.get(emotion, float('nan')):.4f})")
    return all_scores

# --- Example Usage (run after defining the functions above and initializing the models) ---
# Make sure an audio file (e.g., "sample.wav") exists in your current directory, or provide a full path.
# Also ensure FILENAME_PART_TO_TARGET_KEY_MAP and TARGET_EMOTION_KEYS_FOR_REPORT are fully populated.
#
# initialize_models() # Call this once
#
# # Create a dummy sample.wav for testing if it doesn't exist
# # (WAV is used here because soundfile cannot reliably write MP3 on all systems.)
# if not Path("sample.wav").exists():
#     print("Creating dummy sample.wav for testing...")
#     dummy_sr = 16000
#     dummy_duration = 5 # seconds
#     dummy_tone_freq = 440 # A4 note
#     t = np.linspace(0, dummy_duration, int(dummy_sr * dummy_duration), endpoint=False)
#     dummy_waveform = 0.5 * np.sin(2 * np.pi * dummy_tone_freq * t)
#     import soundfile as sf
#     sf.write("sample.wav", dummy_waveform, dummy_sr)
#     print("Dummy sample.wav created.")
#
# if Path("sample.wav").exists() and FILENAME_PART_TO_TARGET_KEY_MAP and TARGET_EMOTION_KEYS_FOR_REPORT:
#    results = process_audio_file("sample.wav")
#    # print("\nFull Scores Dictionary:", results)
# else:
#    print("Skipping example usage: 'sample.wav' not found or maps are not fully populated.")
```
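
For the batch-processing and per-file JSON export mentioned in the "How to Use" list above, here is a minimal sketch. It reuses `initialize_models` and `process_audio_file` from the conceptual example; the directory names (`./audio_files`, `./emotion_scores`) are illustrative placeholders, not part of the released code.

```python
import json
from pathlib import Path

# Minimal batch-processing sketch; assumes the functions from the conceptual
# example above are already defined in the same session.
AUDIO_DIR = Path("./audio_files")       # folder containing audio files (placeholder path)
OUTPUT_DIR = Path("./emotion_scores")   # per-file JSON results are written here (placeholder path)
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

initialize_models()  # load Whisper once and map the MLP checkpoint paths

audio_paths = sorted(
    p for p in AUDIO_DIR.iterdir()
    if p.suffix.lower() in {".wav", ".mp3", ".flac", ".ogg"}
)

for audio_path in audio_paths:
    scores = process_audio_file(str(audio_path))  # dict of {dimension: raw score}
    if not scores:
        print(f"Skipping {audio_path.name}: no scores returned.")
        continue
    out_path = OUTPUT_DIR / f"{audio_path.stem}.json"
    with out_path.open("w", encoding="utf-8") as f:
        json.dump(scores, f, indent=2, ensure_ascii=False)
    print(f"Wrote {out_path}")
```

The Colab notebook additionally renders an HTML report with waveforms and audio players; the sketch above only covers the per-file JSON output.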


## Taxonomy

The core 40 emotion categories are (from EMONET-VOICE, Appendix A.1):
Affection, Amusement, Anger, Astonishment/Surprise, Awe, Bitterness, Concentration, Confusion, Contemplation, Contempt, Contentment, Disappointment, Disgust, Distress, Doubt, Elation, Embarrassment, Emotional Numbness, Fatigue/Exhaustion, Fear, Helplessness, Hope/Enthusiasm/Optimism, Impatience and Irritability, Infatuation, Interest, Intoxication/Altered States of Consciousness, Jealousy & Envy, Longing, Malevolence/Malice, Pain, Pleasure/Ecstasy, Pride, Relief, Sadness, Sexual Lust, Shame, Sourness, Teasing, Thankfulness/Gratitude, Triumph.

Additional vocal attributes (e.g., Valence, Arousal, Gender, Age, Pitch characteristics) are also predicted by corresponding MLP models in the suite. The full list of predictable dimensions can be inferred from the FILENAME_PART_TO_TARGET_KEY_MAP in the Colab notebook (Cell 2).
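
As a small, hedged illustration of how the returned scores can be separated into the 40 primary emotions and the remaining attribute dimensions, the snippet below uses `TARGET_EMOTION_KEYS_FOR_REPORT` from the conceptual example; the helper name `split_scores` is ours, not part of the released code.

```python
from typing import Dict, Tuple

def split_scores(all_scores: Dict[str, float]) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Separate the 40 primary-emotion scores from attribute scores (Valence, Arousal, Gender, ...)."""
    emotions = {k: v for k, v in all_scores.items() if k in TARGET_EMOTION_KEYS_FOR_REPORT}
    attributes = {k: v for k, v in all_scores.items() if k not in TARGET_EMOTION_KEYS_FOR_REPORT}
    return emotions, attributes

# Example usage after running the conceptual pipeline:
# results = process_audio_file("sample.wav")
# emotions, attributes = split_scores(results)
# print(sorted(emotions.items(), key=lambda kv: kv[1], reverse=True)[:5])  # top-5 raw emotion scores
# print(attributes.get("Valence"), attributes.get("Arousal"))
```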


## Ethical Considerations

The EMONET-VOICE suite was developed with ethical considerations as a priority:

**Privacy Preservation:** The use of synthetic voice generation avoids many of the privacy concerns associated with collecting real human emotional expressions, especially for sensitive states.

**Responsible Use:** These models are released for research purposes. Users are urged to consider the ethical implications of their applications and to avoid misuse, such as emotional manipulation, surveillance, or uses that could lead to unfair, biased, or harmful outcomes. The broader societal implications and mitigation of potential misuse of SER technology remain important ongoing considerations.