Space: st192011/EmoDB-ALM-Protocol

st192011 committed 10 days ago

Commit 2987b4b (verified) · 1 Parent(s): 2e1a0bf

Update app.py


Files changed (1)

app.py +5 -18


@@ -11,8 +11,7 @@ df_cached = pd.read_csv("emodb_full_zeroshot_predictions.csv")
 X_embeddings = np.load("emodb_full_embeddings.npy")
 print("🧠 Phase 2: Dynamically Training Both Linear Classification Heads...")
-# 1. Cleanse PyArrow strings into native NumPy string arrays to avoid Python 3.13 indexing crashes
 labels = df_cached['True_Emotion'].to_numpy().astype(str)
 indices = np.arange(len(labels))
@@ -25,21 +24,19 @@ global_head.fit(X_train, y_train)
 # --- Head B: The Cross-Speaker Head (Train on Speaker 31 & 34) ---
 train_speakers = ['Speaker_31.0', 'Speaker_34.0']
-# Convert the boolean mask to a native NumPy boolean array
 cross_train_mask = df_cached['Speaker_Info'].isin(train_speakers).to_numpy()
 X_train_cross = X_embeddings[cross_train_mask]
-# Index the native NumPy labels array directly using our mask
 y_train_cross = labels[cross_train_mask]
 cross_head = LogisticRegression(max_iter=1000, class_weight='balanced', random_state=42)
 cross_head.fit(X_train_cross, y_train_cross)
 print("✅ Classification heads successfully trained with native NumPy types!")
 print("🌍 Phase 3: Attaching to EmoDB on Hugging Face Hub for Audio Streaming...")
-hf_dataset = load_dataset("harritaylor/er_emodb", split="train")
 # --- UI Functions ---
 def process_sample(index):
@@ -120,26 +117,18 @@ with gr.Blocks(theme=gr.themes.Default(primary_hue="orange", secondary_hue="neut
             When forcing a large multimodal model to output speech interpretations as text tokens, a massive **information bottleneck** occurs. This dashboard showcases that extracting the raw mathematical embeddings hidden behind the model's text decoder unlocks an entirely new layer of granular acoustic intelligence.
             ### 📊 Comparative Performance Summary
-            """)
-            # Main comparison table
-            gr.Markdown("""
             | Evaluation Architecture | Test Method | Dataset Coverage | Accuracy |
             | :--- | :--- | :--- | :--- |
             | **Zero-Shot Text Prompting** | Direct Generation | Full Dataset (535 files) | **67.3%** |
             | **Linear Embedding Classifier** | Stratified 80/20 Split | Unseen 20% Subset | **97.2%** |
             | **Linear Embedding Classifier** | Cross-Speaker Generalization | 6 Unseen Speakers (Blind) | **92.2%** |
-            """)
-            gr.Markdown("""
             ### 🌍 Cross-Speaker Generalization Breakdown
             To determine if the internal representation generalizes across unique human vocal anatomies, accents, and pitches, we trained a linear classifier **strictly on 2 speakers** (Speaker 31 and 34) and evaluated blindly on the remaining **6 unseen speakers**.
             The results confirm a highly robust, universal acoustic map:
-            """)
-            # Speaker breakdown table
-            gr.Markdown("""
             | Unseen Test Speaker ID | Extracted Audio Samples | Downstream Classification Accuracy |
             | :--- | :--- | :--- |
             | **Speaker_21.0** | 43 samples | **88.4%** |
@@ -149,11 +138,9 @@ with gr.Blocks(theme=gr.themes.Default(primary_hue="orange", secondary_hue="neut
             | **Speaker_35.0** | 69 samples | **97.1%** |
             | **Speaker_25.0** | 56 samples | **96.4%** |
             | **COMBINED BLIND AVERAGE** | **357 samples** | **92.2%** |
-            """)
-            gr.Markdown("""
             ### 🔑 Primary Insights & Observations
-            1. **The Linear Advantage:** Complex non-linear architectures (XGBoost, Random Forests) easily fall prey to overfitting due to high dimensionality ($4096\\text{D}$) and low sample sizes. Simple `LogisticRegression` bounds generalize beautifully.
             2. **Acoustic Edge Cases:** Misclassifications are bounded tightly by the physics of sound. The embedding head's rare failures occur strictly between acoustic "twins" like *Boredom/Neutral* (shared low-energy profiles) or *Anger/Fear* (shared high-energy profiles).
             3. **The Synergistic Save:** In rare instances where raw audio signals blur high-arousal acoustics, the textual deep reasoning layers of Qwen occasionally navigate structural nuances to succeed where raw vectors misalign.
             """)

 X_embeddings = np.load("emodb_full_embeddings.npy")
 print("🧠 Phase 2: Dynamically Training Both Linear Classification Heads...")
+# Cleanse PyArrow strings into native NumPy string arrays to avoid Python 3.13 indexing crashes
 labels = df_cached['True_Emotion'].to_numpy().astype(str)
 indices = np.arange(len(labels))
 # --- Head B: The Cross-Speaker Head (Train on Speaker 31 & 34) ---
 train_speakers = ['Speaker_31.0', 'Speaker_34.0']
 cross_train_mask = df_cached['Speaker_Info'].isin(train_speakers).to_numpy()
 X_train_cross = X_embeddings[cross_train_mask]
 y_train_cross = labels[cross_train_mask]
 cross_head = LogisticRegression(max_iter=1000, class_weight='balanced', random_state=42)
 cross_head.fit(X_train_cross, y_train_cross)
 print("✅ Classification heads successfully trained with native NumPy types!")
 print("🌍 Phase 3: Attaching to EmoDB on Hugging Face Hub for Audio Streaming...")
+# Using the correct original dataset path
+hf_dataset = load_dataset("renumics/emodb", split="train")
+print("✅ Dataset streaming connected successfully!")
 # --- UI Functions ---
 def process_sample(index):
             When forcing a large multimodal model to output speech interpretations as text tokens, a massive **information bottleneck** occurs. This dashboard showcases that extracting the raw mathematical embeddings hidden behind the model's text decoder unlocks an entirely new layer of granular acoustic intelligence.
             ### 📊 Comparative Performance Summary
             | Evaluation Architecture | Test Method | Dataset Coverage | Accuracy |
             | :--- | :--- | :--- | :--- |
             | **Zero-Shot Text Prompting** | Direct Generation | Full Dataset (535 files) | **67.3%** |
             | **Linear Embedding Classifier** | Stratified 80/20 Split | Unseen 20% Subset | **97.2%** |
             | **Linear Embedding Classifier** | Cross-Speaker Generalization | 6 Unseen Speakers (Blind) | **92.2%** |
             ### 🌍 Cross-Speaker Generalization Breakdown
             To determine if the internal representation generalizes across unique human vocal anatomies, accents, and pitches, we trained a linear classifier **strictly on 2 speakers** (Speaker 31 and 34) and evaluated blindly on the remaining **6 unseen speakers**.
             The results confirm a highly robust, universal acoustic map:
             | Unseen Test Speaker ID | Extracted Audio Samples | Downstream Classification Accuracy |
             | :--- | :--- | :--- |
             | **Speaker_21.0** | 43 samples | **88.4%** |
             | **Speaker_35.0** | 69 samples | **97.1%** |
             | **Speaker_25.0** | 56 samples | **96.4%** |
             | **COMBINED BLIND AVERAGE** | **357 samples** | **92.2%** |
             ### 🔑 Primary Insights & Observations
+            1. **The Linear Advantage:** Complex non-linear architectures (XGBoost, Random Forests) easily fall prey to overfitting due to high dimensionality (**4096D**) and low sample sizes. Simple `LogisticRegression` bounds generalize beautifully.
             2. **Acoustic Edge Cases:** Misclassifications are bounded tightly by the physics of sound. The embedding head's rare failures occur strictly between acoustic "twins" like *Boredom/Neutral* (shared low-energy profiles) or *Anger/Fear* (shared high-energy profiles).
             3. **The Synergistic Save:** In rare instances where raw audio signals blur high-arousal acoustics, the textual deep reasoning layers of Qwen occasionally navigate structural nuances to succeed where raw vectors misalign.
             """)