---
language:
- en
license: cc-by-nc-nd-4.0
library_name: generic
tags:
- emotion-recognition
- bayesian-deep-learning
- mc-dropout
- uncertainty-quantification
- multi-label-classification
datasets:
- Skylion007/openwebtext
- google-research-datasets/go_emotions
metrics:
- precision
- recall
- f1
model-index:
- name: EmCoder (v1)
  results:
  - task:
      type: text-classification
      name: Multi-label Emotion Classification
    dataset:
      name: GoEmotions
      type: go_emotions
      split: test
    metrics:
    - name: Macro F1
      type: f1
      value: 0.44
    - name: Macro Precision
      type: precision
      value: 0.408
    - name: Macro Recall
      type: recall
      value: 0.495
---

# EmCoder
<blockquote>
<b>Probabilistic Emotion Recognition & Uncertainty Quantification</b><br>
<b>28-emotion multi-label classifier trained with MC Dropout methodology</b>
</blockquote>

Unlike standard classifiers, EmCoder quantifies what it doesn't know using Monte Carlo Dropout, making it suitable for high-stakes AI pipelines.<br>
EmCoder is optimized for **MC Dropout inference**.

## SOTA benchmark
### Evaluation on the GoEmotions test split (macro avg metrics)
EmCoder achieves competitive F1-scores while being ~35% smaller than RoBERTa-base and ~45% smaller than ModernBERT, trading a few points of F1 for a much smaller parameter budget and built-in uncertainty estimates.

| Model | Precision | Recall | F1-Score | Params |
| :--- | :--- | :--- | :--- | :--- |
| **EmCoder (v1)** | **0.408** | **0.495** | **0.440** | **82.1M** |
| Google BERT (Original) | 0.400 | 0.630 | 0.460 | 110M |
| RoBERTa-base | 0.575 | 0.396 | 0.450 | 125M |
| ModernBERT-base | 0.652 | 0.443 | 0.500 | 149M |

## How to use
Since `.safetensors` files only store model weights and not the class logic, you need to use the provided `emcoder.py` to enable **MC Dropout inference**.<br>
EmCoder v1.0 requires the `roberta-base` tokenizer for correct token-to-embedding mapping.

### 1. Setup & Tokenization
```python
from transformers import AutoTokenizer
from emcoder import EmCoder  # Ensure emcoder.py is in your directory

# Load the same tokenizer used during training
tokenizer = AutoTokenizer.from_pretrained("roberta-base")

EMCODER_PATH = "path/to/emcoder"

# Initialize with the same config as training
model = EmCoder.from_pretrained(EMCODER_PATH)
```
### 2. Bayesian inference
To obtain probabilistic outputs and uncertainty metrics, use the `mc_forward` method:
```python
import torch

# Perform 50 stochastic forward passes
N_SAMPLES = 50
model.eval()

inputs = tokenizer("I am so happy you are here!", return_tensors="pt")

# mc_forward keeps Dropout active even in model.eval() mode
logits_mc = model.mc_forward(
    inputs["input_ids"],
    inputs["attention_mask"],
    n_samples=N_SAMPLES,
)

# Bayesian post-processing
# logits_mc shape: (n_samples, batch_size, 28)
probs_all = torch.sigmoid(logits_mc)

mean_probs = probs_all.mean(dim=0)    # Mean predicted probability
uncertainty = probs_all.std(dim=0)    # Epistemic uncertainty (standard deviation)
```
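
From here you can binarize `mean_probs` into label predictions and pair each one with its uncertainty. A minimal sketch, assuming the 0.5 threshold used in the Performance section below and an `EMOTION_LABELS` list (hypothetical name) that matches the label order used at training time, shown here in the standard GoEmotions order:
```python
THRESHOLD = 0.5

# Hypothetical label list; must match the label order the model was trained with
EMOTION_LABELS = [
    "admiration", "amusement", "anger", "annoyance", "approval", "caring",
    "confusion", "curiosity", "desire", "disappointment", "disapproval",
    "disgust", "embarrassment", "excitement", "fear", "gratitude", "grief",
    "joy", "love", "nervousness", "optimism", "pride", "realization",
    "relief", "remorse", "sadness", "surprise", "neutral",
]

# Binarize the mean probabilities and report each active label with its uncertainty
active = (mean_probs >= THRESHOLD).squeeze(0)
for idx in active.nonzero(as_tuple=True)[0]:
    p = mean_probs[0, idx].item()
    sigma = uncertainty[0, idx].item()
    print(f"{EMOTION_LABELS[idx]}: p={p:.3f} ± {sigma:.3f}")
```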


## Model Architecture
![Architecture]()

### Optimization
The model is trained using a Weighted Bayesian Binary Cross-Entropy loss, averaged over $T$ stochastic forward passes:

$$
\mathcal{L}_{Bayesian} = \frac{1}{T} \sum_{t=1}^{T} \text{BCEWithLogits}(z^{(t)}, y; w)
$$

where the weights $w$ are calculated using a logarithmic class-balancing scale to handle extreme label imbalance:

$$
w_{c} = \max\left( 0.1, \min\left( 20, 1 + \ln \left( \frac{N_{neg,c} + \epsilon}{N_{pos,c} + \epsilon} \right) \right) \right)
$$
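
For intuition, here is a minimal sketch of this weighting rule and the MC-averaged loss in PyTorch. All counts and tensor shapes are illustrative assumptions, not the actual training statistics, and wiring `w` in as `pos_weight` is one plausible reading of $\text{BCEWithLogits}(z, y; w)$:
```python
import torch

def class_weights(pos_counts: torch.Tensor, n_examples: int, eps: float = 1e-6) -> torch.Tensor:
    """Log-scaled class-balancing weights w_c, clamped to [0.1, 20]."""
    neg_counts = n_examples - pos_counts
    return (1 + torch.log((neg_counts + eps) / (pos_counts + eps))).clamp(0.1, 20.0)

# Illustrative counts: the frequent class gets w close to 1, the rare class is boosted
w = class_weights(torch.tensor([1500.0, 20.0]), n_examples=5000)  # ≈ [1.85, 6.52]

# Bayesian BCE: average BCEWithLogits over T stochastic forward passes
criterion = torch.nn.BCEWithLogitsLoss(pos_weight=w)
logits_mc = torch.randn(8, 4, 2)               # (T, batch, n_classes) stand-in
targets = torch.randint(0, 2, (4, 2)).float()  # stand-in multi-hot labels
loss = torch.stack([criterion(z_t, targets) for z_t in logits_mc]).mean()
```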


## Performance
**Using a threshold of 0.5 for binarizing predictions**
| | precision | recall | f1-score | support |
|:---------------|------------:|---------:|-----------:|----------:|
| micro avg | 0.494 | 0.596 | 0.540 | 6329 |
| macro avg | 0.408 | 0.495 | 0.440 | 6329 |
| weighted avg | 0.492 | 0.596 | 0.535 | 6329 |
| samples avg | 0.525 | 0.616 | 0.544 | 6329 |

| | precision | recall | f1-score | support |
|:---------------|------------:|---------:|-----------:|----------:|
| admiration | 0.541 | 0.673 | 0.599 | 504 |
| amusement | 0.688 | 0.909 | 0.783 | 264 |
| anger | 0.419 | 0.470 | 0.443 | 198 |
| annoyance | 0.310 | 0.250 | 0.277 | 320 |
| approval | 0.304 | 0.271 | 0.287 | 351 |
| caring | 0.229 | 0.281 | 0.252 | 135 |
| confusion | 0.260 | 0.497 | 0.342 | 153 |
| curiosity | 0.432 | 0.764 | 0.552 | 284 |
| desire | 0.453 | 0.518 | 0.483 | 83 |
| disappointment | 0.176 | 0.152 | 0.163 | 151 |
| disapproval | 0.279 | 0.404 | 0.330 | 267 |
| disgust | 0.447 | 0.545 | 0.491 | 123 |
| embarrassment | 0.325 | 0.351 | 0.338 | 37 |
| excitement | 0.288 | 0.427 | 0.344 | 103 |
| fear | 0.470 | 0.692 | 0.560 | 78 |
| gratitude | 0.834 | 0.943 | 0.885 | 352 |
| grief | 0.000 | 0.000 | 0.000 | 6 |
| joy | 0.445 | 0.652 | 0.529 | 161 |
| love | 0.724 | 0.895 | 0.801 | 238 |
| nervousness | 0.240 | 0.261 | 0.250 | 23 |
| optimism | 0.483 | 0.543 | 0.511 | 186 |
| pride | 0.667 | 0.375 | 0.480 | 16 |
| realization | 0.226 | 0.166 | 0.191 | 145 |
| relief | 0.222 | 0.182 | 0.200 | 11 |
| remorse | 0.516 | 0.857 | 0.644 | 56 |
| sadness | 0.405 | 0.545 | 0.464 | 156 |
| surprise | 0.429 | 0.539 | 0.478 | 141 |
| neutral | 0.602 | 0.695 | 0.645 | 1787 |
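
The aggregate and per-label numbers above have the shape of a scikit-learn classification report. A sketch of how to reproduce one, assuming you have gathered gold labels and thresholded MC-mean predictions over the test split as `(n_examples, 28)` binary arrays (the variable names and random stand-ins are placeholders):
```python
import numpy as np
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(100, 28))   # stand-in for gold labels
y_pred = rng.integers(0, 2, size=(100, 28))   # stand-in for (mean_probs >= 0.5)

# EMOTION_LABELS as defined in the inference example above
print(classification_report(y_true, y_pred, target_names=EMOTION_LABELS, zero_division=0))
```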


**Model uncertainty estimation**
![Uncertainty]()

**Confusion matrix**
![Confusion matrix]()



## Workflow
![Workflow]()


### Note
This model was trained on the GoEmotions dataset (social-media domain) and may not generalize well to other domains.

## Citation
If you use this model, please cite it as follows:

```bibtex
@software{jez2026emcoder,
  author       = {Václav Jež},
  title        = {EmCoder: Probabilistic Emotion Recognition \& Uncertainty Quantification},
  year         = {2026},
  publisher    = {GitHub},
  howpublished = {\url{https://github.com/yezdata/emcoder}},
  version      = {1.0.0}
}
```