st192011 commited on
Commit
75c2596
·
verified ·
1 Parent(s): 97430fd

Update app.py

Browse files
Files changed (1) hide show
  1. app.py +3 -3
app.py CHANGED
@@ -124,15 +124,15 @@ with gr.Blocks() as demo:
124
  **A Comparative Study of Zero-Shot Text Generation vs. Downstream Embedding Classification on EmoDB**
125
 
126
  ### Abstract
127
- This report evaluates the capacity of the **Qwen2-Audio-7B-Instruct** model to interpret human vocal emotion across 535 audio samples from the Berlin Emotional Speech Database (EmoDB). We contrast direct zero-shot text-prompting against a downstream machine learning layer trained on the model's final hidden-state embeddings ($4096\text{D}$). Our experiments demonstrate that while the text-generation layer suffers from an informational bottleneck, the internal embedding space contains an incredibly robust, speaker-independent acoustic map of human emotion.
128
 
129
  ---
130
 
131
  ### Core Findings
132
 
133
- 1. **The Representation Is Superior to the Output:** Direct zero-shot text generation yields an overall accuracy of **67.3%**. Extracting the raw $4096\text{D}$ mathematical vectors and fitting a simple linear classification head achieves **97.2%** accuracy—a **+29.9% absolute performance leap**.
134
  2. **Universal Cross-Speaker Generalization:** When trained on only two speakers (178 samples) and evaluated blindly on six entirely unseen speakers (357 samples), the embedding head maintains a remarkable **92.2% accuracy**. This proves the model identifies universal acoustic physics of emotion rather than speaker-specific identities.
135
- 3. **The Power of Linear Restraint:** Due to the extreme high-dimensionality low-sample size nature of the data ($N \ll D$), simple **Logistic Regression** completely outperforms flexible non-linear algorithms (Random Forest, SVM, XGBoost) by resisting overfitting.
136
  4. **Complimentary Cognitive Profiles:** In the rare instances where the embedding head fails on acoustic "twins" (e.g., mistaking a high-arousal *Anger* sample for *Fear*), the deep reasoning layers of the full text-generation pipeline occasionally correct the mistake.
137
 
138
  ---
 
124
  **A Comparative Study of Zero-Shot Text Generation vs. Downstream Embedding Classification on EmoDB**
125
 
126
  ### Abstract
127
+ This report evaluates the capacity of the **Qwen2-Audio-7B-Instruct** model to interpret human vocal emotion across 535 audio samples from the Berlin Emotional Speech Database (EmoDB). We contrast direct zero-shot text-prompting against a downstream machine learning layer trained on the model's final hidden-state embeddings (**4096D**). Our experiments demonstrate that while the text-generation layer suffers from an informational bottleneck, the internal embedding space contains an incredibly robust, speaker-independent acoustic map of human emotion.
128
 
129
  ---
130
 
131
  ### Core Findings
132
 
133
+ 1. **The Representation Is Superior to the Output:** Direct zero-shot text generation yields an overall accuracy of **67.3%**. Extracting the raw **4096D** mathematical vectors and fitting a simple linear classification head achieves **97.2%** accuracy—a **+29.9% absolute performance leap**.
134
  2. **Universal Cross-Speaker Generalization:** When trained on only two speakers (178 samples) and evaluated blindly on six entirely unseen speakers (357 samples), the embedding head maintains a remarkable **92.2% accuracy**. This proves the model identifies universal acoustic physics of emotion rather than speaker-specific identities.
135
+ 3. **The Power of Linear Restraint:** Due to the extreme high-dimensionality low-sample size nature of the data (`N << D`), simple **Logistic Regression** completely outperforms flexible non-linear algorithms (Random Forest, SVM, XGBoost) by resisting overfitting.
136
  4. **Complimentary Cognitive Profiles:** In the rare instances where the embedding head fails on acoustic "twins" (e.g., mistaking a high-arousal *Anger* sample for *Fear*), the deep reasoning layers of the full text-generation pipeline occasionally correct the mistake.
137
 
138
  ---