yezdata committed on
Commit e717fdd · verified · 1 parent: 326e148

update V1.5 README

Files changed (1)
  1. README.md +58 -94
README.md CHANGED
@@ -30,20 +30,20 @@ model-index:
  metrics:
  - name: Macro F1
    type: f1
-   value: 0.447
  - name: Macro Precision
    type: precision
-   value: 0.464
  - name: Macro Recall
    type: recall
-   value: 0.478
  ---

  # EmCoder
  <blockquote>
  <b>Probabilistic Emotion Recognition & Uncertainty Quantification</b><br>
- <b>28 Emotion multi-label Transformer-based classifier trained with MC Dropout methodology</b>
- </blockquote>


  Unlike standard classifiers, EmCoder quantifies what it doesn't know using Monte Carlo Dropout, making it suitable for high-stakes AI pipelines.<br>
@@ -56,7 +56,7 @@ EmCoder is optimized for **MC Dropout inference**.
  EmCoder achieves competitive F1-score with its compact size (~35% smaller than RoBERTa-base and ~45% smaller than ModernBERT), while providing per-class epistemic uncertainty quantification.
  | Model | Precision | Recall | F1-Score | Params |
  | :--- | :--- | :--- | :--- | :--- |
- | **EmCoder** | **0.464** | **0.478** | **0.447** | **82.1M** |
  | Google BERT (Original) | 0.400 | 0.630 | 0.460 | 110M |
  | RoBERTa-base | 0.575 | 0.396 | 0.450 | 125M |
  | ModernBERT-base | 0.583 | 0.535 | 0.550 | 149M |
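The size comparison can be checked directly from the parameter counts listed in the table (82.1M for EmCoder versus 125M and 149M):

$$
1 - \frac{82.1}{125} \approx 0.343, \qquad 1 - \frac{82.1}{149} \approx 0.449
$$

That is about 34% fewer parameters than RoBERTa-base and about 45% fewer than ModernBERT-base, in line with the stated ~35% and ~45%.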
@@ -83,12 +83,13 @@ To obtain probabilistic outputs and uncertainty metrics, use the `mc_forward` me
  ```python
  # Perform 50 stochastic passes
  N_SAMPLES = 50

  inputs = tokenizer("I am so happy you are here!", return_tensors="pt")

  model.eval()
- with torch.inference_mode():
-     mc_logits = model.mc_forward(inputs['input_ids'], inputs['attention_mask'], n_samples=N_SAMPLES) # Automatically keeps Dropout active, even when in model.eval

  # Bayesian Post-processing
  all_probs = torch.sigmoid(mc_logits) # (n_samples, B, 28)
@@ -120,13 +121,8 @@ for idx in sorted_indices:


  ### Optimization
- The model is trained using a Weighted Bayesian Binary Cross Entropy loss:
-
- $$
- \mathcal{L}_{Bayesian} = \frac{1}{T} \sum_{t=1}^{T} \text{BCEWithLogits}(z^{(t)}, y; w)
- $$
-
- Where weights $w$ are calculated using a logarithmic class-balancing scale to handle extreme label imbalance:

  $$
  w_{c} = \max\left( 0.1, \min\left( 20, 1 + \ln \left( \frac{N_{neg,c} + \epsilon}{N_{pos,c} + \epsilon} \right) \right) \right)
@@ -135,94 +131,62 @@ $$


  ## Performance on test set
- **Using `thresholds.json` optimization from val set (both probability and uncertainty thresholds) for binarizing predictions**
- | | precision | recall | f1-score | support |
- |:---------------|------------:|---------:|-----------:|----------:|
- | micro avg | 0.476 | 0.611 | 0.535 | 6329 |
- | macro avg | 0.464 | 0.478 | 0.447 | 6329 |
- | weighted avg | 0.511 | 0.611 | 0.542 | 6329 |
- | samples avg | 0.524 | 0.637 | 0.55 | 6329 |
- |----------------|-------------|----------|------------|-----------|
- | admiration | 0.635 | 0.565 | 0.598 | 504 |
- | amusement | 0.713 | 0.894 | 0.793 | 264 |
- | anger | 0.367 | 0.525 | 0.432 | 198 |
- | annoyance | 0.215 | 0.406 | 0.281 | 320 |
- | approval | 0.226 | 0.396 | 0.288 | 351 |
- | caring | 0.199 | 0.304 | 0.24 | 135 |
- | confusion | 0.268 | 0.412 | 0.325 | 153 |
- | curiosity | 0.423 | 0.704 | 0.528 | 284 |
- | desire | 0.585 | 0.373 | 0.456 | 83 |
- | disappointment | 0.176 | 0.146 | 0.159 | 151 |
- | disapproval | 0.222 | 0.506 | 0.309 | 267 |
- | disgust | 0.56 | 0.382 | 0.454 | 123 |
- | embarrassment | 0.423 | 0.297 | 0.349 | 37 |
- | excitement | 0.423 | 0.398 | 0.41 | 103 |
- | fear | 0.538 | 0.641 | 0.585 | 78 |
- | gratitude | 0.943 | 0.886 | 0.914 | 352 |
- | grief | 0.111 | 0.333 | 0.167 | 6 |
- | joy | 0.503 | 0.602 | 0.548 | 161 |
- | love | 0.75 | 0.832 | 0.789 | 238 |
- | nervousness | 0.429 | 0.13 | 0.2 | 23 |
- | optimism | 0.681 | 0.505 | 0.58 | 186 |
- | pride | 0.75 | 0.375 | 0.5 | 16 |
- | realization | 0.4 | 0.097 | 0.156 | 145 |
- | relief | 0.2 | 0.182 | 0.19 | 11 |
- | remorse | 0.527 | 0.857 | 0.653 | 56 |
- | sadness | 0.624 | 0.372 | 0.466 | 156 |
- | surprise | 0.534 | 0.447 | 0.486 | 141 |
- | neutral | 0.567 | 0.804 | 0.665 | 1787 |
-
-
-
- **Using default threshold of 0.5 for binarizing predictions**
  | | precision | recall | f1-score | support |
  |:---------------|------------:|---------:|-----------:|----------:|
- | micro avg | 0.494 | 0.596 | 0.54 | 6329 |
- | macro avg | 0.408 | 0.495 | 0.44 | 6329 |
- | weighted avg | 0.492 | 0.596 | 0.535 | 6329 |
- | samples avg | 0.525 | 0.616 | 0.544 | 6329 |
  |----------------|-------------|----------|------------|-----------|
- | admiration | 0.541 | 0.673 | 0.599 | 504 |
- | amusement | 0.688 | 0.909 | 0.783 | 264 |
- | anger | 0.419 | 0.47 | 0.443 | 198 |
- | annoyance | 0.31 | 0.25 | 0.277 | 320 |
- | approval | 0.304 | 0.271 | 0.287 | 351 |
- | caring | 0.229 | 0.281 | 0.252 | 135 |
- | confusion | 0.26 | 0.497 | 0.342 | 153 |
- | curiosity | 0.432 | 0.764 | 0.552 | 284 |
- | desire | 0.453 | 0.518 | 0.483 | 83 |
- | disappointment | 0.176 | 0.152 | 0.163 | 151 |
- | disapproval | 0.279 | 0.404 | 0.33 | 267 |
- | disgust | 0.447 | 0.545 | 0.491 | 123 |
- | embarrassment | 0.325 | 0.351 | 0.338 | 37 |
- | excitement | 0.288 | 0.427 | 0.344 | 103 |
- | fear | 0.47 | 0.692 | 0.56 | 78 |
- | gratitude | 0.834 | 0.943 | 0.885 | 352 |
- | grief | 0 | 0 | 0 | 6 |
- | joy | 0.445 | 0.652 | 0.529 | 161 |
- | love | 0.724 | 0.895 | 0.801 | 238 |
- | nervousness | 0.24 | 0.261 | 0.25 | 23 |
- | optimism | 0.483 | 0.543 | 0.511 | 186 |
- | pride | 0.667 | 0.375 | 0.48 | 16 |
- | realization | 0.226 | 0.166 | 0.191 | 145 |
- | relief | 0.222 | 0.182 | 0.2 | 11 |
- | remorse | 0.516 | 0.857 | 0.644 | 56 |
- | sadness | 0.405 | 0.545 | 0.464 | 156 |
- | surprise | 0.429 | 0.539 | 0.478 | 141 |
- | neutral | 0.602 | 0.695 | 0.645 | 1787 |
-
-
-

  **Model uncertainty quantification on GoEmotions test set**
- The distribution demonstrates strong calibration, as the highest error density correlates with increased epistemic uncertainty. While most high-probability predictions are correct, a small fraction of overconfident incorrect predictions remains, likely due to dataset bias or linguistic nuances such as sarcasm. These outliers point to a clear opportunity for further refinement using **temperature scaling**.
- ![epistemic_unc](outputs/epistemic_unc_scatter.png)


- **Confusion matrix**
- ![multi_label_confusion_matrix](outputs/confusion_matrix.png)



  ## Workflow
  ![EmCoder Workflow](outputs/workflow.png)
 
  metrics:
  - name: Macro F1
    type: f1
+   value: 0.463
  - name: Macro Precision
    type: precision
+   value: 0.469
  - name: Macro Recall
    type: recall
+   value: 0.486
  ---

  # EmCoder
  <blockquote>
  <b>Probabilistic Emotion Recognition & Uncertainty Quantification</b><br>
+ <b>28 Emotion multi-label Transformer classifier</b>
+ </blockquote>


  Unlike standard classifiers, EmCoder quantifies what it doesn't know using Monte Carlo Dropout, making it suitable for high-stakes AI pipelines.<br>
 
  EmCoder achieves competitive F1-score with its compact size (~35% smaller than RoBERTa-base and ~45% smaller than ModernBERT), while providing per-class epistemic uncertainty quantification.
  | Model | Precision | Recall | F1-Score | Params |
  | :--- | :--- | :--- | :--- | :--- |
+ | **EmCoder** | **0.469** | **0.486** | **0.463** | **82.1M** |
  | Google BERT (Original) | 0.400 | 0.630 | 0.460 | 110M |
  | RoBERTa-base | 0.575 | 0.396 | 0.450 | 125M |
  | ModernBERT-base | 0.583 | 0.535 | 0.550 | 149M |
 
  ```python
  # Perform 50 stochastic passes
  N_SAMPLES = 50
+ MAX_BATCH_SIZE = 10 # optional sub-batching of N_SAMPLES

  inputs = tokenizer("I am so happy you are here!", return_tensors="pt")

  model.eval()
+ with torch.no_grad():
+     mc_logits = model.mc_forward(inputs['input_ids'], inputs['attention_mask'], n_samples=N_SAMPLES, max_batch_size=MAX_BATCH_SIZE) # Automatically keeps Dropout active, even when in model.eval

  # Bayesian Post-processing
  all_probs = torch.sigmoid(mc_logits) # (n_samples, B, 28)
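The new `max_batch_size` argument is described only as optional sub-batching of `N_SAMPLES`. Assuming it simply splits the stochastic passes into chunks of at most `max_batch_size` and concatenates the results (which bounds peak memory), the chunking logic could look like the plain-Python sketch below. `mc_sample_in_chunks` and `sample_fn` are hypothetical names for illustration, not part of EmCoder's `mc_forward`.

```python
from typing import Callable, List

# Hypothetical: sample_fn(k) stands in for one batched call that runs k
# stochastic (dropout-active) forward passes and returns k outputs.
SampleFn = Callable[[int], List[List[float]]]


def mc_sample_in_chunks(sample_fn: SampleFn, n_samples: int, max_batch_size: int) -> List[List[float]]:
    """Collect n_samples stochastic outputs, requesting at most max_batch_size per call."""
    if n_samples <= 0 or max_batch_size <= 0:
        raise ValueError("n_samples and max_batch_size must be positive")
    outputs: List[List[float]] = []
    remaining = n_samples
    while remaining > 0:
        k = min(max_batch_size, remaining)  # the last chunk may be smaller
        outputs.extend(sample_fn(k))
        remaining -= k
    return outputs


if __name__ == "__main__":
    chunk_sizes: List[int] = []

    def fake_sample_fn(k: int) -> List[List[float]]:
        chunk_sizes.append(k)  # record how many passes each call requested
        return [[0.5] * 28 for _ in range(k)]  # k dummy 28-class outputs

    samples = mc_sample_in_chunks(fake_sample_fn, n_samples=50, max_batch_size=10)
    print(len(samples), chunk_sizes)  # 50 [10, 10, 10, 10, 10]
```

In a PyTorch version, each chunk would plausibly come from one forward pass over the inputs repeated `k` times, with chunks joined by `torch.cat` along the sample dimension; that is an assumption about the internals, not documented behavior.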
 


  ### Optimization
+ The model is trained using a **Weighted Binary Cross Entropy loss**,
+ where the per-class weights $w_c$ are calculated using a logarithmic class-balancing scale to handle extreme label imbalance:

  $$
  w_{c} = \max\left( 0.1, \min\left( 20, 1 + \ln \left( \frac{N_{neg,c} + \epsilon}{N_{pos,c} + \epsilon} \right) \right) \right)
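The class-weight formula translates directly into code. This sketch assumes ε = `1e-6` (the README does not give its value) and takes `n_pos`/`n_neg` as the number of training examples in which class c is present/absent. How `w_c` then enters the loss (for example as the `pos_weight` of PyTorch's `BCEWithLogitsLoss`) is not stated here, so the sketch stops at the weights.

```python
import math

EPS = 1e-6  # assumed value of epsilon; not specified in the README


def class_weight(n_pos: int, n_neg: int, eps: float = EPS) -> float:
    """w_c = max(0.1, min(20, 1 + ln((N_neg,c + eps) / (N_pos,c + eps))))."""
    raw = 1.0 + math.log((n_neg + eps) / (n_pos + eps))
    return max(0.1, min(20.0, raw))  # clamp to [0.1, 20]


if __name__ == "__main__":
    print(round(class_weight(1000, 9000), 3))  # 1 + ln 9 -> 3.197
    print(class_weight(0, 5000))               # class never positive -> capped at 20.0
    print(class_weight(9000, 100))             # over-represented class -> floored at 0.1
```

A balanced class gets weight 1, a class with nine negatives per positive gets 1 + ln 9 ≈ 3.2, and the clamps keep extremely rare or extremely common classes within [0.1, 20].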
 


  ## Performance on test set
+ **Using probability thresholds optimized on the validation set (`thresholds.json`) for binarizing predictions**
  | | precision | recall | f1-score | support |
  |:---------------|------------:|---------:|-----------:|----------:|
+ | micro avg | 0.482 | 0.627 | 0.545 | 6329 |
+ | **macro avg** | **0.469** | **0.486** | **0.463** | 6329 |
+ | weighted avg | 0.508 | 0.627 | 0.550 | 6329 |
+ | samples avg | 0.532 | 0.651 | 0.560 | 6329 |
  |----------------|-------------|----------|------------|-----------|
+ | admiration | 0.613 | 0.607 | 0.610 | 504 |
+ | amusement | 0.724 | 0.886 | 0.797 | 264 |
+ | anger | 0.384 | 0.535 | 0.447 | 198 |
+ | annoyance | 0.230 | 0.431 | 0.300 | 320 |
+ | approval | 0.229 | 0.436 | 0.300 | 351 |
+ | caring | 0.262 | 0.281 | 0.271 | 135 |
+ | confusion | 0.395 | 0.320 | 0.354 | 153 |
+ | curiosity | 0.441 | 0.736 | 0.551 | 284 |
+ | desire | 0.538 | 0.422 | 0.473 | 83 |
+ | disappointment | 0.221 | 0.152 | 0.180 | 151 |
+ | disapproval | 0.242 | 0.536 | 0.333 | 267 |
+ | disgust | 0.595 | 0.407 | 0.483 | 123 |
+ | embarrassment | 0.556 | 0.405 | 0.469 | 37 |
+ | excitement | 0.375 | 0.379 | 0.377 | 103 |
+ | fear | 0.575 | 0.538 | 0.556 | 78 |
+ | gratitude | 0.948 | 0.886 | 0.916 | 352 |
+ | grief | 0.200 | 0.167 | 0.182 | 6 |
+ | joy | 0.566 | 0.559 | 0.562 | 161 |
+ | love | 0.762 | 0.861 | 0.809 | 238 |
+ | nervousness | 0.333 | 0.174 | 0.229 | 23 |
+ | optimism | 0.632 | 0.516 | 0.568 | 186 |
+ | pride | 0.750 | 0.375 | 0.500 | 16 |
+ | realization | 0.250 | 0.159 | 0.194 | 145 |
+ | relief | 0.286 | 0.182 | 0.222 | 11 |
+ | remorse | 0.547 | 0.839 | 0.662 | 56 |
+ | sadness | 0.432 | 0.513 | 0.469 | 156 |
+ | surprise | 0.483 | 0.504 | 0.493 | 141 |
+ | neutral | 0.555 | 0.811 | 0.659 | 1787 |
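The table above binarizes predictions with probability thresholds tuned on the validation set and stored in `thresholds.json`; the search procedure and the file layout are not described in this diff. A common approach, assumed here, is a per-class grid search that keeps the threshold with the best validation F1. The names `f1_at` and `tune_threshold`, the 0.05-step grid, and the toy data are illustrative.

```python
import json
from typing import Dict, List, Optional, Sequence


def f1_at(probs: Sequence[float], labels: Sequence[int], t: float) -> float:
    """Binary F1 for one class when predicting positive iff prob >= t."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < t and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def tune_threshold(probs: Sequence[float], labels: Sequence[int],
                   grid: Optional[List[float]] = None) -> float:
    """Return the grid threshold with the highest validation F1 (the first one on ties)."""
    grid = grid if grid is not None else [i / 100 for i in range(5, 100, 5)]
    return max(grid, key=lambda t: f1_at(probs, labels, t))


if __name__ == "__main__":
    # Toy validation data for two classes (hypothetical values).
    val = {
        "joy": ([0.9, 0.8, 0.35, 0.3, 0.2, 0.1], [1, 1, 1, 0, 0, 0]),
        "grief": ([0.6, 0.4, 0.1], [1, 0, 0]),
    }
    thresholds: Dict[str, float] = {name: tune_threshold(p, y) for name, (p, y) in val.items()}
    print(json.dumps(thresholds))
```

Tuning per class matters for GoEmotions because rare labels such as `grief` or `relief` often need thresholds far from 0.5 to be predicted at all.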
+
+
+
+ ### Entropy-based uncertainty quantification

  **Model uncertainty quantification on GoEmotions test set**
+ | Mean probability vs Epistemic | Mean probability vs Aleatoric |
+ | :---: | :---: |
+ | ![Epistemic Scatter](outputs/epistemic_unc_scatter.png) | ![Aleatoric Scatter](outputs/aleatoric_unc_scatter.png) |
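The section title says the uncertainty is entropy-based, and the plots separate epistemic from aleatoric uncertainty, but the exact formulas are not given in this diff. A standard decomposition, assumed in this sketch, treats each emotion as a Bernoulli variable over the MC Dropout passes: total uncertainty is the entropy of the mean probability, aleatoric uncertainty is the mean entropy of the individual passes, and epistemic uncertainty is their difference (the mutual information). The sketch works on one class's `N_SAMPLES` probabilities, i.e. one slice of `all_probs`; the function names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def binary_entropy(p: float, eps: float = 1e-12) -> float:
    """Entropy in nats of a Bernoulli(p) variable, clipped to avoid log(0)."""
    p = min(max(p, eps), 1.0 - eps)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def decompose_uncertainty(mc_probs: Sequence[float]) -> Tuple[float, float, float, float]:
    """Split one class's MC Dropout probabilities into (mean, total, aleatoric, epistemic)."""
    n = len(mc_probs)
    mean_p = sum(mc_probs) / n
    total = binary_entropy(mean_p)                            # entropy of the mean prediction
    aleatoric = sum(binary_entropy(p) for p in mc_probs) / n  # mean entropy of each pass
    epistemic = max(0.0, total - aleatoric)                   # mutual information; clamp rounding noise
    return mean_p, total, aleatoric, epistemic


if __name__ == "__main__":
    agree = [0.7] * 50              # passes agree: uncertainty is purely aleatoric
    disagree = [0.01, 0.99] * 25    # passes disagree: large epistemic uncertainty
    for name, probs in (("agree", agree), ("disagree", disagree)):
        mean_p, total, alea, epi = decompose_uncertainty(probs)
        print(f"{name}: mean={mean_p:.2f} total={total:.3f} aleatoric={alea:.3f} epistemic={epi:.3f}")
```

Both toy inputs have a mean probability that alone says little; only the spread across passes separates "the input is ambiguous" (aleatoric) from "the model has not seen enough data like this" (epistemic).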


+ **Demonstration of model uncertainty utilization**
+ F1 score computed after removing the x % of positively and negatively classified test samples with the highest epistemic uncertainty.
+ ![F1 Rejection curve](outputs/f1_rejection_epistemic.png)
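The rejection curve can be reproduced in outline: rank decisions by epistemic uncertainty, discard the most uncertain fraction, and recompute F1 on what remains. The README does not spell out the exact procedure; this sketch assumes the fraction is removed separately from the predicted-positive and the predicted-negative decisions (one decision per sample-class pair) and reports F1 pooled over the kept decisions. `f1_after_rejection` and the toy data are hypothetical.

```python
from typing import List, Sequence


def f1_after_rejection(preds: Sequence[int], labels: Sequence[int],
                       uncert: Sequence[float], frac: float) -> float:
    """F1 after dropping the `frac` most uncertain predicted positives and predicted negatives."""
    kept: List[int] = []
    for predicted in (1, 0):
        group = [i for i, p in enumerate(preds) if p == predicted]
        group.sort(key=lambda i: uncert[i], reverse=True)  # most uncertain first
        n_drop = int(frac * len(group))
        kept.extend(group[n_drop:])
    tp = sum(1 for i in kept if preds[i] == 1 and labels[i] == 1)
    fp = sum(1 for i in kept if preds[i] == 1 and labels[i] == 0)
    fn = sum(1 for i in kept if preds[i] == 0 and labels[i] == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


if __name__ == "__main__":
    preds = [1, 1, 1, 0, 0, 0]
    labels = [1, 0, 1, 0, 1, 0]
    uncert = [0.1, 0.9, 0.2, 0.1, 0.8, 0.2]  # the two errors are the most uncertain
    for frac in (0.0, 0.34):
        print(frac, round(f1_after_rejection(preds, labels, uncert, frac), 3))
```

Sweeping `frac` over a range such as 0 to 0.5 gives a curve of the kind plotted above; if the uncertainty estimates are informative, F1 should rise as more uncertain decisions are rejected.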


+ **Emotion uncertainty distribution**
+ | Epistemic | Aleatoric |
+ | :---: | :---: |
+ | ![Epistemic Ridge](outputs/ridge_epistemic.png) | ![Aleatoric Ridge](outputs/ridge_aleatoric.png) |

  ## Workflow
  ![EmCoder Workflow](outputs/workflow.png)