Update app.py
app.py
CHANGED
|
@@ -124,15 +124,15 @@ with gr.Blocks() as demo:
|
|
| 124 |
**A Comparative Study of Zero-Shot Text Generation vs. Downstream Embedding Classification on EmoDB**
|
| 125 |
|
| 126 |
### Abstract
|
| 127 |
-
This report evaluates the capacity of the **Qwen2-Audio-7B-Instruct** model to interpret human vocal emotion across 535 audio samples from the Berlin Emotional Speech Database (EmoDB). We contrast direct zero-shot text-prompting against a downstream machine learning layer trained on the model's final hidden-state embeddings (
|
| 128 |
|
| 129 |
---
|
| 130 |
|
| 131 |
### Core Findings
|
| 132 |
|
| 133 |
-
1. **The Representation Is Superior to the Output:** Direct zero-shot text generation yields an overall accuracy of **67.3%**. Extracting the raw
|
| 134 |
2. **Universal Cross-Speaker Generalization:** When trained on only two speakers (178 samples) and evaluated on six entirely unseen speakers (357 samples), the embedding head maintains **92.2% accuracy**. This suggests the embeddings capture speaker-independent acoustic cues of emotion rather than speaker-specific identity.
|
| 135 |
-
3. **The Power of Linear Restraint:** Due to the extreme high-dimensionality low-sample size nature of the data (
|
| 136 |
4. **Complementary Cognitive Profiles:** In the rare instances where the embedding head fails on acoustic "twins" (e.g., mistaking a high-arousal *Anger* sample for *Fear*), the deep reasoning layers of the full text-generation pipeline occasionally correct the mistake.
|
| 137 |
|
| 138 |
---
|
|
|
|
| 124 |
**A Comparative Study of Zero-Shot Text Generation vs. Downstream Embedding Classification on EmoDB**
|
| 125 |
|
| 126 |
### Abstract
|
| 127 |
+
This report evaluates the capacity of the **Qwen2-Audio-7B-Instruct** model to interpret human vocal emotion across 535 audio samples from the Berlin Emotional Speech Database (EmoDB). We contrast direct zero-shot text prompting against a downstream machine learning layer trained on the model's final hidden-state embeddings (**4096D**). Our experiments indicate that while the text-generation layer suffers from an informational bottleneck, the internal embedding space contains a robust, speaker-independent acoustic map of human emotion.
|
| 128 |
|
| 129 |
---
|
| 130 |
|
| 131 |
### Core Findings
|
| 132 |
|
| 133 |
+
1. **The Representation Is Superior to the Output:** Direct zero-shot text generation yields an overall accuracy of **67.3%**. Extracting the raw **4096D** embedding vectors and fitting a simple linear classification head achieves **97.2%** accuracy, an absolute gain of **29.9 percentage points**.
|
| 134 |
2. **Universal Cross-Speaker Generalization:** When trained on only two speakers (178 samples) and evaluated on six entirely unseen speakers (357 samples), the embedding head maintains **92.2% accuracy**. This suggests the embeddings capture speaker-independent acoustic cues of emotion rather than speaker-specific identity.
|
| 135 |
+
3. **The Power of Linear Restraint:** Because the data are high-dimensional with few samples (`N << D`: 535 samples against 4096 dimensions), simple **Logistic Regression** outperforms more flexible non-linear algorithms (Random Forest, SVM, XGBoost) by resisting overfitting.
|
| 136 |
4. **Complementary Cognitive Profiles:** In the rare instances where the embedding head fails on acoustic "twins" (e.g., mistaking a high-arousal *Anger* sample for *Fear*), the deep reasoning layers of the full text-generation pipeline occasionally correct the mistake.
|
| 137 |
|
| 138 |
---
|
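The cross-speaker evaluation in finding 2 and the accuracy gain in finding 1 can be sketched in a few lines of plain Python. This is a minimal illustration, not the code in `app.py`: the function names (`speaker_disjoint_split`, `accuracy`, `absolute_gain_points`), the record layout `(speaker_id, embedding, label)`, and the speaker IDs in the usage note are all hypothetical.

```python
def speaker_disjoint_split(samples, train_speakers):
    """Split records so that no speaker appears in both train and test.

    `samples` is assumed to be a list of (speaker_id, embedding, label)
    tuples; this layout is an assumption made for illustration only.
    """
    train_speakers = set(train_speakers)
    train = [s for s in samples if s[0] in train_speakers]
    test = [s for s in samples if s[0] not in train_speakers]
    return train, test


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def absolute_gain_points(baseline_pct, improved_pct):
    """Difference between two accuracies, in percentage points."""
    return round(improved_pct - baseline_pct, 1)
```

For example, with made-up speaker IDs, `speaker_disjoint_split(samples, {"s1", "s2"})` returns a training set containing only speakers `s1` and `s2` and a test set containing everyone else, which is the property that makes a held-out accuracy a speaker-independent measurement. `absolute_gain_points(67.3, 97.2)` reproduces the 29.9-point gap reported between the zero-shot and embedding-head accuracies.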