Update app.py
app.py
CHANGED
@@ -11,8 +11,7 @@ df_cached = pd.read_csv("emodb_full_zeroshot_predictions.csv")
 X_embeddings = np.load("emodb_full_embeddings.npy")

 print("🧠 Phase 2: Dynamically Training Both Linear Classification Heads...")
-
-# 1. Cleanse PyArrow strings into native NumPy string arrays to avoid Python 3.13 indexing crashes
 labels = df_cached['True_Emotion'].to_numpy().astype(str)
 indices = np.arange(len(labels))

@@ -25,21 +24,19 @@ global_head.fit(X_train, y_train)

 # --- Head B: The Cross-Speaker Head (Train on Speaker 31 & 34) ---
 train_speakers = ['Speaker_31.0', 'Speaker_34.0']
-
-# Convert the boolean mask to a native NumPy boolean array
 cross_train_mask = df_cached['Speaker_Info'].isin(train_speakers).to_numpy()

 X_train_cross = X_embeddings[cross_train_mask]
-# Index the native NumPy labels array directly using our mask
 y_train_cross = labels[cross_train_mask]

 cross_head = LogisticRegression(max_iter=1000, class_weight='balanced', random_state=42)
 cross_head.fit(X_train_cross, y_train_cross)
-
 print("✅ Classification heads successfully trained with native NumPy types!")

 print("Phase 3: Attaching to EmoDB on Hugging Face Hub for Audio Streaming...")
-

 # --- UI Functions ---
 def process_sample(index):
@@ -120,26 +117,18 @@ with gr.Blocks(theme=gr.themes.Default(primary_hue="orange", secondary_hue="neut
 When forcing a large multimodal model to output speech interpretations as text tokens, a massive **information bottleneck** occurs. This dashboard showcases that extracting the raw mathematical embeddings hidden behind the model's text decoder unlocks an entirely new layer of granular acoustic intelligence.

 ### Comparative Performance Summary
-""")

-# Main comparison table
-gr.Markdown("""
 | Evaluation Architecture | Test Method | Dataset Coverage | Accuracy |
 | :--- | :--- | :--- | :--- |
 | **Zero-Shot Text Prompting** | Direct Generation | Full Dataset (535 files) | **67.3%** |
 | **Linear Embedding Classifier** | Stratified 80/20 Split | Unseen 20% Subset | **97.2%** |
 | **Linear Embedding Classifier** | Cross-Speaker Generalization | 6 Unseen Speakers (Blind) | **92.2%** |
-""")

-gr.Markdown("""
 ### Cross-Speaker Generalization Breakdown
 To determine if the internal representation generalizes across unique human vocal anatomies, accents, and pitches, we trained a linear classifier **strictly on 2 speakers** (Speaker 31 and 34) and evaluated blindly on the remaining **6 unseen speakers**.

 The results confirm a highly robust, universal acoustic map:
-""")

-# Speaker breakdown table
-gr.Markdown("""
 | Unseen Test Speaker ID | Extracted Audio Samples | Downstream Classification Accuracy |
 | :--- | :--- | :--- |
 | **Speaker_21.0** | 43 samples | **88.4%** |
@@ -149,11 +138,9 @@ with gr.Blocks(theme=gr.themes.Default(primary_hue="orange", secondary_hue="neut
 | **Speaker_35.0** | 69 samples | **97.1%** |
 | **Speaker_25.0** | 56 samples | **96.4%** |
 | **COMBINED BLIND AVERAGE** | **357 samples** | **92.2%** |
-""")

-gr.Markdown("""
 ### Primary Insights & Observations
-1. **The Linear Advantage:** Complex non-linear architectures (XGBoost, Random Forests) easily fall prey to overfitting due to high dimensionality (
 2. **Acoustic Edge Cases:** Misclassifications are bounded tightly by the physics of sound. The embedding head's rare failures occur strictly between acoustic "twins" like *Boredom/Neutral* (shared low-energy profiles) or *Anger/Fear* (shared high-energy profiles).
 3. **The Synergistic Save:** In rare instances where raw audio signals blur high-arousal acoustics, the textual deep reasoning layers of Qwen occasionally navigate structural nuances to succeed where raw vectors misalign.
 """)
@@ -11,8 +11,7 @@ df_cached = pd.read_csv("emodb_full_zeroshot_predictions.csv")
 X_embeddings = np.load("emodb_full_embeddings.npy")

 print("🧠 Phase 2: Dynamically Training Both Linear Classification Heads...")
+# Cleanse PyArrow strings into native NumPy string arrays to avoid Python 3.13 indexing crashes
 labels = df_cached['True_Emotion'].to_numpy().astype(str)
 indices = np.arange(len(labels))

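Reviewer note: the hunk above converts the label column to a native NumPy string array and builds a row-index array, and the dashboard table mentions a stratified 80/20 split. The lines that perform the split are not shown in this diff, so the following is only a minimal sketch of how such a split could look. All data is synthetic and every `*_demo` name, along with the use of `train_test_split`, is an assumption and is not taken from `app.py`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumptions): 200 samples, 16-dim embeddings, 4 emotions.
emotions = np.array(["anger", "boredom", "fear", "neutral"])
y_ids = rng.integers(0, len(emotions), size=200)
centers = rng.normal(size=(len(emotions), 16)) * 3.0
X_demo = centers[y_ids] + rng.normal(size=(200, 16))

# Store the labels with a pandas extension string dtype to mimic the cached CSV column.
df_demo = pd.DataFrame({"True_Emotion": pd.Series(emotions[y_ids], dtype="string")})

# Same conversion as in the diff: extension-typed strings -> plain NumPy str array.
labels_demo = df_demo["True_Emotion"].to_numpy().astype(str)
indices_demo = np.arange(len(labels_demo))

# Stratified 80/20 split over row indices so embeddings and labels stay aligned.
train_idx, test_idx = train_test_split(
    indices_demo, test_size=0.2, stratify=labels_demo, random_state=42
)

head_demo = LogisticRegression(max_iter=1000, class_weight="balanced", random_state=42)
head_demo.fit(X_demo[train_idx], labels_demo[train_idx])
accuracy_demo = head_demo.score(X_demo[test_idx], labels_demo[test_idx])
print(f"held-out accuracy on synthetic data: {accuracy_demo:.3f}")
```

Splitting the index array rather than the data itself keeps one source of truth for row alignment, which is presumably why the diff builds `indices` next to `labels`.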
@@ -25,21 +24,19 @@ global_head.fit(X_train, y_train)

 # --- Head B: The Cross-Speaker Head (Train on Speaker 31 & 34) ---
 train_speakers = ['Speaker_31.0', 'Speaker_34.0']
 cross_train_mask = df_cached['Speaker_Info'].isin(train_speakers).to_numpy()

 X_train_cross = X_embeddings[cross_train_mask]
 y_train_cross = labels[cross_train_mask]

 cross_head = LogisticRegression(max_iter=1000, class_weight='balanced', random_state=42)
 cross_head.fit(X_train_cross, y_train_cross)
 print("✅ Classification heads successfully trained with native NumPy types!")

 print("Phase 3: Attaching to EmoDB on Hugging Face Hub for Audio Streaming...")
+# Using the correct original dataset path
+hf_dataset = load_dataset("renumics/emodb", split="train")
+print("✅ Dataset streaming connected successfully!")

 # --- UI Functions ---
 def process_sample(index):
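Reviewer note: the cross-speaker head above is trained through a boolean mask over `Speaker_Info`. The sketch below shows the same masking pattern end to end on synthetic data, including a per-speaker breakdown on the held-out speakers. The speaker names, the `*_toy` variables and the evaluation loop are hypothetical. The diff does not show how the dashboard's per-speaker or combined figures were computed, so treating the combined figure as pooled accuracy is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Synthetic stand-ins (assumptions): 4 speakers, 3 emotions, 8-dim embeddings.
n = 240
speaker_ids = rng.choice(["Speaker_A", "Speaker_B", "Speaker_C", "Speaker_D"], size=n)
emotion_ids = rng.integers(0, 3, size=n)
centers = rng.normal(size=(3, 8)) * 3.0
X_toy = centers[emotion_ids] + rng.normal(size=(n, 8))
df_toy = pd.DataFrame({
    "Speaker_Info": speaker_ids,
    "True_Emotion": np.array(["anger", "fear", "neutral"])[emotion_ids],
})
labels_toy = df_toy["True_Emotion"].to_numpy().astype(str)

# Train only on the chosen speakers; every other speaker is held out.
train_speakers_toy = ["Speaker_A", "Speaker_B"]
train_mask = df_toy["Speaker_Info"].isin(train_speakers_toy).to_numpy()
test_mask = ~train_mask

head_toy = LogisticRegression(max_iter=1000, class_weight="balanced", random_state=42)
head_toy.fit(X_toy[train_mask], labels_toy[train_mask])

# Per-speaker accuracy on unseen speakers, plus accuracy pooled over all held-out samples.
predictions = head_toy.predict(X_toy[test_mask])
correct = predictions == labels_toy[test_mask]
test_speakers = df_toy.loc[test_mask, "Speaker_Info"].to_numpy()
per_speaker = {
    str(s): float(correct[test_speakers == s].mean()) for s in np.unique(test_speakers)
}
pooled_accuracy = float(correct.mean())
print(per_speaker, f"pooled: {pooled_accuracy:.3f}")
```

The pooled figure equals the sample-weighted mean of the per-speaker accuracies, so speakers with more recordings contribute more to it.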
@@ -120,26 +117,18 @@ with gr.Blocks(theme=gr.themes.Default(primary_hue="orange", secondary_hue="neut
 When forcing a large multimodal model to output speech interpretations as text tokens, a massive **information bottleneck** occurs. This dashboard showcases that extracting the raw mathematical embeddings hidden behind the model's text decoder unlocks an entirely new layer of granular acoustic intelligence.

 ### Comparative Performance Summary

 | Evaluation Architecture | Test Method | Dataset Coverage | Accuracy |
 | :--- | :--- | :--- | :--- |
 | **Zero-Shot Text Prompting** | Direct Generation | Full Dataset (535 files) | **67.3%** |
 | **Linear Embedding Classifier** | Stratified 80/20 Split | Unseen 20% Subset | **97.2%** |
 | **Linear Embedding Classifier** | Cross-Speaker Generalization | 6 Unseen Speakers (Blind) | **92.2%** |

 ### Cross-Speaker Generalization Breakdown
 To determine if the internal representation generalizes across unique human vocal anatomies, accents, and pitches, we trained a linear classifier **strictly on 2 speakers** (Speaker 31 and 34) and evaluated blindly on the remaining **6 unseen speakers**.

 The results confirm a highly robust, universal acoustic map:

 | Unseen Test Speaker ID | Extracted Audio Samples | Downstream Classification Accuracy |
 | :--- | :--- | :--- |
 | **Speaker_21.0** | 43 samples | **88.4%** |
@@ -149,11 +138,9 @@ with gr.Blocks(theme=gr.themes.Default(primary_hue="orange", secondary_hue="neut
 | **Speaker_35.0** | 69 samples | **97.1%** |
 | **Speaker_25.0** | 56 samples | **96.4%** |
 | **COMBINED BLIND AVERAGE** | **357 samples** | **92.2%** |

 ### Primary Insights & Observations
+1. **The Linear Advantage:** Complex non-linear architectures (XGBoost, Random Forests) easily fall prey to overfitting due to high dimensionality (**4096D**) and low sample sizes. Simple `LogisticRegression` bounds generalize beautifully.
 2. **Acoustic Edge Cases:** Misclassifications are bounded tightly by the physics of sound. The embedding head's rare failures occur strictly between acoustic "twins" like *Boredom/Neutral* (shared low-energy profiles) or *Anger/Fear* (shared high-energy profiles).
 3. **The Synergistic Save:** In rare instances where raw audio signals blur high-arousal acoustics, the textual deep reasoning layers of Qwen occasionally navigate structural nuances to succeed where raw vectors misalign.
 """)
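Reviewer note: the "Linear Advantage" item compares linear and non-linear heads in a high-dimension, low-sample regime. One way to check that kind of claim is to cross-validate both model families on the same embeddings. The sketch below does this on synthetic data. The dimensions, the `RandomForestClassifier` settings and all variable names are assumptions chosen to keep the run fast, and its numbers say nothing about the EmoDB results reported in the dashboard.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)

# Synthetic wide-and-short problem (assumption): far more dimensions than samples.
n_samples, n_dims, n_classes = 120, 512, 4
class_ids = rng.integers(0, n_classes, size=n_samples)
class_centers = rng.normal(size=(n_classes, n_dims)) * 0.3
X_wide = class_centers[class_ids] + rng.normal(size=(n_samples, n_dims))

candidates = {
    "logistic_regression": LogisticRegression(
        max_iter=1000, class_weight="balanced", random_state=42
    ),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Mean 5-fold accuracy for each candidate on identical folds of the same data.
cv_scores = {
    name: float(cross_val_score(model, X_wide, class_ids, cv=5).mean())
    for name, model in candidates.items()
}
for name, score in cv_scores.items():
    print(f"{name}: mean 5-fold accuracy = {score:.3f}")
```

Running the same comparison on the real `X_embeddings` and `labels` arrays would be the way to back the claim with numbers from this dataset.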