Spaces: st192011/EmoDB-ALM-Protocol

st192011 committed 12 days ago

Commit 75c2596 (verified) · 1 parent: 97430fd

Update app.py

Files changed (1)

app.py +3 -3

@@ -124,15 +124,15 @@ with gr.Blocks() as demo:
             **A Comparative Study of Zero-Shot Text Generation vs. Downstream Embedding Classification on EmoDB**
             ### Abstract
-            This report evaluates the capacity of the **Qwen2-Audio-7B-Instruct** model to interpret human vocal emotion across 535 audio samples from the Berlin Emotional Speech Database (EmoDB). We contrast direct zero-shot text-prompting against a downstream machine learning layer trained on the model's final hidden-state embeddings ($4096\text{D}$). Our experiments demonstrate that while the text-generation layer suffers from an informational bottleneck, the internal embedding space contains an incredibly robust, speaker-independent acoustic map of human emotion.
+            This report evaluates the capacity of the **Qwen2-Audio-7B-Instruct** model to interpret human vocal emotion across 535 audio samples from the Berlin Emotional Speech Database (EmoDB). We contrast direct zero-shot text-prompting against a downstream machine learning layer trained on the model's final hidden-state embeddings (**4096D**). Our experiments demonstrate that while the text-generation layer suffers from an informational bottleneck, the internal embedding space contains an incredibly robust, speaker-independent acoustic map of human emotion.
             ---
             ### Core Findings
-            1. **The Representation Is Superior to the Output:** Direct zero-shot text generation yields an overall accuracy of **67.3%**. Extracting the raw $4096\text{D}$ mathematical vectors and fitting a simple linear classification head achieves **97.2%** accuracy—a **+29.9% absolute performance leap**.
+            1. **The Representation Is Superior to the Output:** Direct zero-shot text generation yields an overall accuracy of **67.3%**. Extracting the raw **4096D** mathematical vectors and fitting a simple linear classification head achieves **97.2%** accuracy—a **+29.9% absolute performance leap**.
             2. **Universal Cross-Speaker Generalization:** When trained on only two speakers (178 samples) and evaluated blindly on six entirely unseen speakers (357 samples), the embedding head maintains a remarkable **92.2% accuracy**. This proves the model identifies universal acoustic physics of emotion rather than speaker-specific identities.
-            3. **The Power of Linear Restraint:** Due to the extreme high-dimensionality low-sample size nature of the data ($N \ll D$), simple **Logistic Regression** completely outperforms flexible non-linear algorithms (Random Forest, SVM, XGBoost) by resisting overfitting.
+            3. **The Power of Linear Restraint:** Due to the extreme high-dimensionality low-sample size nature of the data (`N << D`), simple **Logistic Regression** completely outperforms flexible non-linear algorithms (Random Forest, SVM, XGBoost) by resisting overfitting.
             4. **Complimentary Cognitive Profiles:** In the rare instances where the embedding head fails on acoustic "twins" (e.g., mistaking a high-arousal *Anger* sample for *Fear*), the deep reasoning layers of the full text-generation pipeline occasionally correct the mistake.
             ---
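
Review note: the figures quoted in the Markdown text of this hunk can be cross-checked with a few lines of plain Python. This is only a sketch. The constants are copied from the text in the diff, while the helper names (`split_is_exhaustive`, `gain_in_points`, `relative_gain_percent`, `dims_per_sample`) are hypothetical and do not appear in `app.py`. It assumes that "+29.9% absolute" means percentage points.

```python
# Figures copied from the Markdown text in this diff (not measured here).
TOTAL_SAMPLES = 535      # EmoDB samples evaluated
TRAIN_SAMPLES = 178      # samples from the two training speakers
TEST_SAMPLES = 357       # samples from the held-out speakers
ZERO_SHOT_ACC = 67.3     # zero-shot text generation accuracy, in percent
EMBEDDING_ACC = 97.2     # linear head on embeddings accuracy, in percent
EMBED_DIM = 4096         # embedding dimensionality D


def split_is_exhaustive(train: int, test: int, total: int) -> bool:
    """True if the train and test counts add up to the full dataset."""
    return train + test == total


def gain_in_points(baseline: float, improved: float) -> float:
    """Accuracy gain in percentage points, rounded to one decimal."""
    return round(improved - baseline, 1)


def relative_gain_percent(baseline: float, improved: float) -> float:
    """Accuracy gain relative to the baseline, in percent, one decimal."""
    return round((improved - baseline) / baseline * 100, 1)


def dims_per_sample(n_samples: int, n_dims: int) -> float:
    """How many feature dimensions there are per sample (D / N)."""
    return n_dims / n_samples


if __name__ == "__main__":
    print(split_is_exhaustive(TRAIN_SAMPLES, TEST_SAMPLES, TOTAL_SAMPLES))
    print(gain_in_points(ZERO_SHOT_ACC, EMBEDDING_ACC))
    print(relative_gain_percent(ZERO_SHOT_ACC, EMBEDDING_ACC))
    print(dims_per_sample(TOTAL_SAMPLES, EMBED_DIM))
```

The distinction between `gain_in_points` and `relative_gain_percent` is the reason for the check: the text writes the gap with a percent sign, and the two readings give different numbers.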