Committed by st192011
Commit 2987b4b · verified · 1 Parent(s): 2e1a0bf

Update app.py

Files changed (1)
  1. app.py +5 -18
@@ -11,8 +11,7 @@ df_cached = pd.read_csv("emodb_full_zeroshot_predictions.csv")
  X_embeddings = np.load("emodb_full_embeddings.npy")
 
  print("🧠 Phase 2: Dynamically Training Both Linear Classification Heads...")
-
- # 1. Cleanse PyArrow strings into native NumPy string arrays to avoid Python 3.13 indexing crashes
+ # Cleanse PyArrow strings into native NumPy string arrays to avoid Python 3.13 indexing crashes
  labels = df_cached['True_Emotion'].to_numpy().astype(str)
  indices = np.arange(len(labels))
 
@@ -25,21 +24,19 @@ global_head.fit(X_train, y_train)
 
  # --- Head B: The Cross-Speaker Head (Train on Speaker 31 & 34) ---
  train_speakers = ['Speaker_31.0', 'Speaker_34.0']
-
- # Convert the boolean mask to a native NumPy boolean array
  cross_train_mask = df_cached['Speaker_Info'].isin(train_speakers).to_numpy()
 
  X_train_cross = X_embeddings[cross_train_mask]
- # Index the native NumPy labels array directly using our mask
  y_train_cross = labels[cross_train_mask]
 
  cross_head = LogisticRegression(max_iter=1000, class_weight='balanced', random_state=42)
  cross_head.fit(X_train_cross, y_train_cross)
-
  print("✅ Classification heads successfully trained with native NumPy types!")
 
  print("🌍 Phase 3: Attaching to EmoDB on Hugging Face Hub for Audio Streaming...")
- hf_dataset = load_dataset("harritaylor/er_emodb", split="train")
+ # Using the correct original dataset path
+ hf_dataset = load_dataset("renumics/emodb", split="train")
+ print("✅ Dataset streaming connected successfully!")
 
  # --- UI Functions ---
  def process_sample(index):
@@ -120,26 +117,18 @@ with gr.Blocks(theme=gr.themes.Default(primary_hue="orange", secondary_hue="neut
  When forcing a large multimodal model to output speech interpretations as text tokens, a massive **information bottleneck** occurs. This dashboard showcases that extracting the raw mathematical embeddings hidden behind the model's text decoder unlocks an entirely new layer of granular acoustic intelligence.
 
  ### 📊 Comparative Performance Summary
- """)
 
- # Main comparison table
- gr.Markdown("""
  | Evaluation Architecture | Test Method | Dataset Coverage | Accuracy |
  | :--- | :--- | :--- | :--- |
  | **Zero-Shot Text Prompting** | Direct Generation | Full Dataset (535 files) | **67.3%** |
  | **Linear Embedding Classifier** | Stratified 80/20 Split | Unseen 20% Subset | **97.2%** |
  | **Linear Embedding Classifier** | Cross-Speaker Generalization | 6 Unseen Speakers (Blind) | **92.2%** |
- """)
 
- gr.Markdown("""
  ### 🌍 Cross-Speaker Generalization Breakdown
  To determine if the internal representation generalizes across unique human vocal anatomies, accents, and pitches, we trained a linear classifier **strictly on 2 speakers** (Speaker 31 and 34) and evaluated blindly on the remaining **6 unseen speakers**.
 
  The results confirm a highly robust, universal acoustic map:
- """)
 
- # Speaker breakdown table
- gr.Markdown("""
  | Unseen Test Speaker ID | Extracted Audio Samples | Downstream Classification Accuracy |
  | :--- | :--- | :--- |
  | **Speaker_21.0** | 43 samples | **88.4%** |
@@ -149,11 +138,9 @@ with gr.Blocks(theme=gr.themes.Default(primary_hue="orange", secondary_hue="neut
  | **Speaker_35.0** | 69 samples | **97.1%** |
  | **Speaker_25.0** | 56 samples | **96.4%** |
  | **COMBINED BLIND AVERAGE** | **357 samples** | **92.2%** |
- """)
 
- gr.Markdown("""
  ### 🔑 Primary Insights & Observations
- 1. **The Linear Advantage:** Complex non-linear architectures (XGBoost, Random Forests) easily fall prey to overfitting due to high dimensionality ($4096\\text{D}$) and low sample sizes. Simple `LogisticRegression` bounds generalize beautifully.
+ 1. **The Linear Advantage:** Complex non-linear architectures (XGBoost, Random Forests) easily fall prey to overfitting due to high dimensionality (**4096D**) and low sample sizes. Simple `LogisticRegression` bounds generalize beautifully.
  2. **Acoustic Edge Cases:** Misclassifications are bounded tightly by the physics of sound. The embedding head's rare failures occur strictly between acoustic "twins" like *Boredom/Neutral* (shared low-energy profiles) or *Anger/Fear* (shared high-energy profiles).
  3. **The Synergistic Save:** In rare instances where raw audio signals blur high-arousal acoustics, the textual deep reasoning layers of Qwen occasionally navigate structural nuances to succeed where raw vectors misalign.
  """)
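
The cross-speaker head in this diff trains on two speakers and is reported on the remaining ones. A minimal sketch of that split is shown here, assuming small synthetic data in place of `emodb_full_embeddings.npy` and the cached CSV. The 8-dimensional vectors, the two emotion names, the four-speaker roster, and every variable other than the `Speaker_Info` and `True_Emotion` column names are hypothetical. The pooled accuracy is computed as a sample-weighted average; the commit does not state whether the table's combined row was computed this way.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in data (hypothetical): 4 speakers x 2 emotions x 20 clips,
# with 8-dimensional "embeddings" whose mean depends only on the emotion.
speakers = ["Speaker_31.0", "Speaker_34.0", "Speaker_21.0", "Speaker_35.0"]
emotions = ["Anger", "Neutral"]
rows, vectors = [], []
for speaker in speakers:
    for emotion_id, emotion in enumerate(emotions):
        for _ in range(20):
            rows.append({"Speaker_Info": speaker, "True_Emotion": emotion})
            centre = np.full(8, 3.0 * emotion_id)
            vectors.append(centre + rng.normal(size=8))
df = pd.DataFrame(rows)
X = np.vstack(vectors)

# Convert labels and the training mask to plain NumPy arrays before indexing,
# mirroring the .to_numpy() calls in app.py.
labels = df["True_Emotion"].to_numpy().astype(str)
speaker_ids = df["Speaker_Info"].to_numpy().astype(str)
train_mask = df["Speaker_Info"].isin(["Speaker_31.0", "Speaker_34.0"]).to_numpy()

# Same estimator settings as the cross-speaker head in the diff.
head = LogisticRegression(max_iter=1000, class_weight="balanced", random_state=42)
head.fit(X[train_mask], labels[train_mask])

# Evaluate only on speakers that were never seen during training.
test_mask = ~train_mask
predictions = head.predict(X[test_mask])
correct = predictions == labels[test_mask]

# Per-speaker (correct, total) counts, then a sample-weighted pooled accuracy.
per_speaker = {}
for speaker in np.unique(speaker_ids[test_mask]):
    selected = speaker_ids[test_mask] == speaker
    per_speaker[speaker] = (int(correct[selected].sum()), int(selected.sum()))

total_correct = sum(c for c, _ in per_speaker.values())
total_samples = sum(n for _, n in per_speaker.values())
combined_accuracy = total_correct / total_samples

for speaker, (c, n) in per_speaker.items():
    print(f"{speaker}: {c}/{n} = {c / n:.1%}")
print(f"Combined: {total_correct}/{total_samples} = {combined_accuracy:.1%}")
```

Pooling the correct counts weights each speaker by its number of clips, so a speaker with 69 samples influences the combined figure more than one with 43. An unweighted mean of the per-speaker percentages would generally give a slightly different number.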