PetraMicanovic committed on
Commit
3a14dfb
·
1 Parent(s): b5f1536

Change README.md

Files changed (1)
  1. README.md +27 -50
README.md CHANGED
@@ -7,72 +7,49 @@ tags:
  - speech
  - meld
  - wav2vec2
- - pooling
- ---
- # Audio Emotion Recognition Model
-
- This repository contains an audio-based emotion recognition model trained on the MELD dataset.
- The model is designed as a strong unimodal baseline and is later used as a component in a
- multimodal emotion recognition system.
-
+ - temporal-pooling
  ---
 
- ## Model Overview
+ # Audio Emotion Recognition Model
 
- - **Task:** Speech Emotion Recognition
- - **Dataset:** MELD
- - **Backbone:** Wav2Vec 2.0 (pretrained)
- - **Pooling Strategy:** Temporal mean pooling
- - **Classifier:** MLP with class-weighted loss
+ An audio-based emotion recognition model trained on the **MELD** dataset.
+ It serves as a strong **unimodal audio baseline** and as the audio encoder in a multimodal emotion recognition system.
 
- The model processes raw audio signals and predicts one of the seven emotion classes defined
- in the MELD dataset.
+ ## Model Summary
 
- ---
+ - **Task:** Speech Emotion Recognition
+ - **Dataset:** MELD
+ - **Backbone:** `facebook/wav2vec2-base`
+ - **Pooling:** Temporal pooling (mean + std over time)
+ - **Classifier:** MLP with class-weighted loss
+ - **Classes:** 7 emotion categories
 
  ## Architecture
 
- 1. **Feature Extraction**
- - Pretrained `facebook/wav2vec2-base`
- - Outputs frame-level hidden representations
+ 1. **Wav2Vec 2.0 Encoder**
+ Extracts frame-level representations from raw audio.
 
- 2. **Temporal Pooling**
- - Mean pooling across the time dimension
- - Converts variable-length sequences into fixed-size utterance embeddings
+ 2. **Temporal Pooling**
+ Mean and standard deviation pooling over the time dimension to obtain a fixed-size utterance embedding.
 
- 3. **Classification Head**
- - Fully connected MLP
- - Dropout regularization
- - ReLU activation
- - Softmax output layer
+ 3. **MLP Classifier**
+ Fully connected layers with ReLU and dropout, followed by a softmax output layer.
 
- ---
+ ## Class Imbalance Handling
 
- ## Handling Class Imbalance
-
- Due to significant class imbalance in the MELD dataset, **class weights** are applied in the
- cross-entropy loss function.
- This improves performance on underrepresented emotion classes and leads to better macro-F1
- scores.
-
- ---
+ Class imbalance in MELD is addressed using **class weights** in the cross-entropy loss, improving macro-level performance on underrepresented emotions.
 
  ## Training Details
 
- - Sampling rate: 16 kHz
- - Maximum utterance duration: 6 seconds
- - Optimizer: AdamW
- - Loss function: CrossEntropyLoss with class weights
- - Evaluation metrics:
- - Accuracy
- - Macro F1-score
- - Weighted F1-score
-
- ---
+ - Sampling rate: 16 kHz
+ - Max utterance length: 6 seconds
+ - Optimizer: Adam
+ - Loss: CrossEntropyLoss (with class weights)
+ - Metrics: Accuracy, Macro F1, Weighted F1
 
  ## Usage
 
- This model serves as:
- - A standalone audio emotion classifier
- - An audio encoder in a multimodal emotion recognition framework
+ - Standalone audio emotion classifier
+ - Audio branch for early and late fusion in multimodal emotion recognition
+
 
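
The main functional change in this README revision is the move from mean pooling to mean + std pooling, alongside the class-weighted loss. Roughly, those two pieces could look like the sketch below. This is a minimal NumPy illustration, not the repository's code: the function names (`mean_std_pool`, `inverse_frequency_weights`, `weighted_cross_entropy`), the inverse-frequency weighting scheme, and the toy shapes are all assumptions, and the actual model presumably operates on `facebook/wav2vec2-base` hidden states inside a deep learning framework.

```python
import numpy as np


def mean_std_pool(frames, lengths):
    """Pool (batch, time, dim) frame features into (batch, 2 * dim) embeddings.

    `lengths` gives the number of valid (non-padded) frames per utterance.
    """
    time = frames.shape[1]
    # mask[b, t, 0] is 1.0 for real frames and 0.0 for padding
    mask = (np.arange(time)[None, :] < lengths[:, None]).astype(frames.dtype)[:, :, None]
    count = mask.sum(axis=1)                               # (batch, 1)
    mean = (frames * mask).sum(axis=1) / count             # (batch, dim)
    var = (((frames - mean[:, None, :]) ** 2) * mask).sum(axis=1) / count
    std = np.sqrt(var + 1e-8)                              # epsilon keeps sqrt stable
    return np.concatenate([mean, std], axis=1)


def inverse_frequency_weights(labels, num_classes):
    """Assumed weighting scheme: n_samples / (n_classes * class_count)."""
    counts = np.bincount(labels, minlength=num_classes).astype(np.float64)
    counts = np.maximum(counts, 1.0)  # avoid division by zero for absent classes
    return len(labels) / (num_classes * counts)


def weighted_cross_entropy(logits, labels, class_weights):
    """Cross-entropy averaged with per-class weights (weighted mean over the batch)."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # stable log-softmax
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(labels)), labels]
    w = class_weights[labels]
    return float((w * nll).sum() / w.sum())


# Toy data: 2 utterances, up to 5 frames each, 4-dim features (hypothetical sizes).
rng = np.random.default_rng(0)
frames = rng.normal(size=(2, 5, 4))
lengths = np.array([5, 3])                     # second utterance has 2 padded frames
embeddings = mean_std_pool(frames, lengths)    # shape (2, 8)

train_labels = np.array([0, 0, 0, 0, 1, 2, 2])  # made-up, imbalanced labels
weights = inverse_frequency_weights(train_labels, num_classes=7)
logits = rng.normal(size=(2, 7))                # stand-in for MLP outputs
loss = weighted_cross_entropy(logits, np.array([0, 1]), weights)
```

Concatenating mean and standard deviation doubles the utterance embedding size (768 to 1536 for `facebook/wav2vec2-base`), so the first MLP layer would need to accept the larger input.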