yezdata committed on May 22

Commit

a10898b

verified ·

1 Parent(s): 56c0146

UPDATE EmCoder TO V2


Files changed (18)

.gitattributes +3 -0
README.md +237 -0
configuration_emcoder.py +32 -0
model.safetensors +3 -0
modeling_emcoder.py +301 -0
outputs/admiration_scatters.png +3 -0
outputs/confusion_matrix.png +0 -0
outputs/f1_rejection_epistemic.png +0 -0
outputs/fear_scatters.png +3 -0
outputs/neutral_scatters.png +3 -0
outputs/ridge_aleatoric.png +3 -0
outputs/ridge_epistemic.png +3 -0
rope_embeddings.py +270 -0
thresholds.json +114 -0
tokenizer.json +0 -0
tokenizer_config.json +17 -0
train_config.json +11 -0
train_state.json +4 -0

.gitattributes CHANGED Viewed

@@ -4,3 +4,6 @@ outputs/epistemic_unc_scatter.png filter=lfs diff=lfs merge=lfs -text
 outputs/aleatoric_unc_scatter.png filter=lfs diff=lfs merge=lfs -text
 outputs/ridge_aleatoric.png filter=lfs diff=lfs merge=lfs -text
 outputs/ridge_epistemic.png filter=lfs diff=lfs merge=lfs -text

+outputs/admiration_scatters.png filter=lfs diff=lfs merge=lfs -text
+outputs/fear_scatters.png filter=lfs diff=lfs merge=lfs -text
+outputs/neutral_scatters.png filter=lfs diff=lfs merge=lfs -text

README.md ADDED Viewed

	@@ -0,0 +1,237 @@

+---
+language:
+- en
+license: cc-by-4.0
+library_name: transformers
+pipeline_tag: text-classification
+tags:
+- emotion-recognition
+- bayesian-deep-learning
+- mc-dropout
+- uncertainty-quantification
+- multi-label-classification
+datasets:
+- Skylion007/openwebtext
+- google-research-datasets/go_emotions
+metrics:
+- precision
+- recall
+- f1
+model-index:
+- name: EmCoder
+  results:
+  - task:
+      type: text-classification
+      name: Multi-label Emotion Classification
+    dataset:
+      name: GoEmotions
+      type: go_emotions
+      split: test
+    metrics:
+    - name: Macro F1
+      type: f1
+      value: 0.488
+    - name: Macro Precision
+      type: precision
+      value: 0.503
+    - name: Macro Recall
+      type: recall
+      value: 0.503
+---
+# EmCoder
+<blockquote>
+  <b>Probabilistic Emotion Recognition & Uncertainty Quantification</b><br>
+  <b>28 Emotion multi-label Transformer classifier</b>
+</blockquote>
+Unlike standard classifiers, EmCoder quantifies what it doesn't know using Monte Carlo Dropout, making it suitable for high-stakes AI pipelines.<br>
+EmCoder is optimized for **MC Dropout inference**.
+## SOTA benchmark
+### Evaluation on the GoEmotions test split (macro avg metrics)
+<!-- TODO: UPDATE % SIZE-->
+EmCoder achieves a competitive macro F1 score for its compact size (~35% smaller than RoBERTa-base and ~45% smaller than ModernBERT-base), while also providing per-class epistemic uncertainty estimates.
+<!-- TODO: UPDATE PARAM COUNT -->
+| Model | Precision | Recall | F1-Score | Params |
+| :--- | :--- | :--- | :--- | :--- |
+| **EmCoder** | **0.503** | **0.503** | **0.488** | **82.1M** |
+| Google BERT (Original) | 0.400 | 0.630 | 0.460 | 110M |
+| RoBERTa-base | 0.575 | 0.396 | 0.450 | 125M |
+| ModernBERT-base | 0.583 | 0.535 | 0.550 | 149M |
+## How to use
+### 1. Setup & Tokenization
+EmCoder uses the `ModernBERT` tokenizer, so token IDs map to the embeddings the model was trained with.
+Because EmCoder is a custom architecture, loading it requires `trust_remote_code=True`.
+```python
+import torch
+from transformers import AutoModel, AutoTokenizer
+repo_id = "yezdata/EmCoder"
+# Load the same tokenizer used during training
+tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
+# Initialize with same config as training
+model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)
+```
+### 2. Bayesian inference
+To obtain probabilistic outputs and uncertainty metrics, use the `mc_forward` method:
+```python
+# Perform 50 stochastic passes
+N_SAMPLES = 50
+MAX_BATCH_SIZE = 10 # optional sub-batching of N_SAMPLES
+inputs = tokenizer("I am so happy you are here!", return_tensors="pt")
+model.eval()
+with torch.no_grad():
+    # Automatically keeps Dropout active, even when in model.eval
+    mc_logits = model.mc_forward(
+        **inputs,
+        n_samples=N_SAMPLES,
+        max_batch_size=MAX_BATCH_SIZE
+    ).logits  # mc_forward returns a SequenceClassifierOutput by default; keep only the logits tensor
+# Bayesian Post-processing
+all_probs = torch.sigmoid(mc_logits) # (n_samples, B, 28)
+mean_probs = all_probs.mean(dim=0) # Mean Predicted Probability
+# Standard deviation across passes as a simple estimate of epistemic uncertainty
+uncertainty = all_probs.std(dim=0)
+# Formatted Output
+m_probs = mean_probs.squeeze(0)
+u_vals = uncertainty.squeeze(0)
+print(f"{'Emotion':<15} | {'Prob':<10} | {'Uncertainty':<10}")
+print("-" * 40)
+sorted_indices = torch.argsort(m_probs, descending=True)
+for idx in sorted_indices:
+    prob, unc = m_probs[idx].item(), u_vals[idx].item()
+    label = model.config.id2label[idx.item()]
+    if prob > 0.05: # Print only emotions with prob > 5%
+        print(f"{label:<15} | {prob:>8.2%} | ±{unc:>8.4f}")
+```
+## Model Architecture
+![EmCoder Architecture](outputs/architecture.png)
+### Optimization
+The model is trained using a **Weighted Binary Cross Entropy loss**,
+where the per-class weights **w** are calculated on a logarithmic class-balancing scale, clipped to [0.1, 20], to handle extreme label imbalance:
+$$
+w_{c} = \max\left( 0.1, \min\left( 20, 1 + \ln \left( \frac{N_{neg,c} + \epsilon}{N_{pos,c} + \epsilon} \right) \right) \right)
+$$
+## Performance on test set
+**Using `thresholds.json`: per-class probability thresholds for binarizing predictions, optimized on the validation set**
+|                | precision |   recall | f1-score |   support |
+|:---------------|----------:|---------:|---------:|----------:|
+| micro avg      |     0.524 |    0.635 |    0.574 |      6329 |
+| **macro avg** | **0.503** |**0.503** |**0.488** |      6329 |
+| weighted avg   |     0.537 |    0.635 |    0.573 |      6329 |
+| samples avg    |     0.562 |    0.661 |    0.584 |      6329 |
+|----------------|-----------|----------|----------|-----------|
+| admiration     |     0.642 |    0.681 |    0.661 |       504 |
+| amusement      |     0.731 |    0.898 |    0.806 |       264 |
+| anger          |     0.491 |    0.434 |    0.461 |       198 |
+| annoyance      |     0.352 |    0.316 |    0.333 |       320 |
+| approval       |     0.273 |    0.501 |    0.354 |       351 |
+| caring         |     0.271 |    0.415 |    0.327 |       135 |
+| confusion      |     0.377 |    0.392 |    0.385 |       153 |
+| curiosity      |     0.496 |    0.648 |    0.562 |       284 |
+| desire         |     0.525 |    0.373 |    0.437 |        83 |
+| disappointment |     0.272 |    0.305 |    0.288 |       151 |
+| disapproval    |     0.333 |    0.461 |    0.387 |       267 |
+| disgust        |     0.422 |    0.528 |    0.469 |       123 |
+| embarrassment  |     0.545 |    0.324 |    0.407 |        37 |
+| excitement     |     0.467 |    0.340 |    0.393 |       103 |
+| fear           |     0.565 |    0.667 |    0.612 |        78 |
+| gratitude      |     0.946 |    0.889 |    0.917 |       352 |
+| grief          |     0.667 |    0.333 |    0.444 |         6 |
+| joy            |     0.603 |    0.584 |    0.593 |       161 |
+| love           |     0.809 |    0.782 |    0.795 |       238 |
+| nervousness    |     0.500 |    0.174 |    0.258 |        23 |
+| optimism       |     0.614 |    0.478 |    0.538 |       186 |
+| pride          |     0.583 |    0.438 |    0.500 |        16 |
+| realization    |     0.270 |    0.214 |    0.238 |       145 |
+| relief         |     0.118 |    0.364 |    0.178 |        11 |
+| remorse        |     0.551 |    0.768 |    0.642 |        56 |
+| sadness        |     0.576 |    0.462 |    0.512 |       156 |
+| surprise       |     0.511 |    0.482 |    0.496 |       141 |
+| neutral        |     0.564 |    0.838 |    0.674 |      1787 |
+### Entropy-based Uncertainty Decomposition
+EmCoder computes probabilistic uncertainty using information-theoretic metrics over $N$ stochastic forward passes.
+**Demonstration of model uncertainty utilization**
+To validate the uncertainty quantification, reject the top **X%** most uncertain (epistemic) classifications. The model's Macro F1 rises from 0.488 to above 0.70, indicating that the model's self-reported uncertainty is strongly correlated with its actual error rate.
+![F1 Rejection curve](outputs/f1_rejection_epistemic.png)
+**Uncertainty quantification on GoEmotions test set for selected emotions**
+- `admiration`: medium frequency
+- `fear`: minority class
+- `neutral`: the most frequent class
+| Admiration | Fear |
+| :---: | :---: |
+| ![Admiration Scatter](outputs/admiration_scatters.png) | ![Fear Scatter](outputs/fear_scatters.png) |
+**Neutral**
+![Neutral Scatter](outputs/neutral_scatters.png)
+**Emotion uncertainty distribution**
+| Epistemic | Aleatoric |
+| :---: | :---: |
+| ![Epistemic Ridge](outputs/ridge_epistemic.png) | ![Aleatoric Ridge](outputs/ridge_aleatoric.png) |
+**Co-occurrence Confusion Matrix (normalized to Recall %)**
+![Confusion Matrix](outputs/confusion_matrix.png)
+## Workflow
+![EmCoder Workflow](outputs/workflow.png)
+## Concrete Dropout Experiment
+An experimental branch of EmCoder integrated Concrete Dropout (Gal et al., 2017) to dynamically learn optimal dropout probabilities. While this marginally sharpened the isolation of extreme edge-cases (yielding a slightly steeper first part on the F1-Rejection curve with an optimized p=0.15), the resulting heavier regularization constrained the capacity of compact EmCoder. This caused a slight degradation in standard macro metrics. Consequently, the production EmCoder model utilizes a fixed **p=0.1** to maintain optimal encoder-classifier synergy.
+## Note
+This model was trained on the GoEmotions dataset (Reddit comments), so it may not generalize well to other domains.
+## Citation
+If you use this model, please cite it as follows:
+```bibtex
+@misc{jez2026emcoder,
+  author = {Václav Jež},
+  title = {EmCoder},
+  year = {2026},
+  publisher = {Hugging Face},
+  howpublished = {\url{https://huggingface.co/yezdata/EmCoder}},
+  version = {1.0.0}
+}
+```
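The class-weight formula in the README's Optimization section can be sketched in plain Python. This is only an illustration of the clipped logarithmic scale: the value of `eps` and the label counts below are assumptions, since the README gives neither, and the sketch does not show how the weights enter the loss.

```python
import math


def class_weight(n_pos: int, n_neg: int, eps: float = 1e-6,
                 w_min: float = 0.1, w_max: float = 20.0) -> float:
    """w_c = max(w_min, min(w_max, 1 + ln((n_neg + eps) / (n_pos + eps))))."""
    raw = 1.0 + math.log((n_neg + eps) / (n_pos + eps))
    return max(w_min, min(w_max, raw))


# Hypothetical label counts (positives, negatives), NOT the GoEmotions statistics.
counts = {"common": (20_000, 23_000), "rare": (80, 42_920), "absent": (0, 43_000)}
weights = {name: class_weight(pos, neg) for name, (pos, neg) in counts.items()}

for name, w in weights.items():
    print(f"{name:<8} weight = {w:.3f}")
```

A balanced label gets a weight near 1, rarer labels get larger weights, and the clipping keeps labels with almost no positives from dominating the loss.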

configuration_emcoder.py ADDED Viewed

	@@ -0,0 +1,32 @@

+from transformers import PretrainedConfig
+class EmCoderConfig(PretrainedConfig):
+    model_type = "emcoder"
+    def __init__(
+        self,
+        vocab_size=50368,
+        d_model=768,
+        n_head=12,
+        n_layers=6,
+        d_ffn=2048,
+        dropout=0.1,
+        num_labels=28,
+        base_encoder_path="",
+        id2label=None,
+        label2id=None,
+        **kwargs,
+    ):
+        if id2label is not None:
+            id2label = {int(k): v for k, v in id2label.items()}
+        super().__init__(id2label=id2label, label2id=label2id, **kwargs)
+        self.vocab_size = vocab_size
+        self.d_model = d_model
+        self.n_head = n_head
+        self.n_layers = n_layers
+        self.d_ffn = d_ffn
+        self.dropout = dropout
+        self.num_labels = num_labels
+        self.base_encoder_path = base_encoder_path
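As a sanity check on the config defaults above, the parameter count can be estimated by hand from the layer definitions in `modeling_emcoder.py`. This is a rough sketch that assumes the released checkpoint uses exactly these defaults and stores weights in float32; the helper name `emcoder_param_count` is hypothetical.

```python
def emcoder_param_count(vocab_size: int = 50368, d_model: int = 768, n_head: int = 12,
                        n_layers: int = 6, d_ffn: int = 2048, num_labels: int = 28) -> int:
    """Estimate trainable-tensor sizes from the module definitions in the diff."""
    embedding = vocab_size * d_model
    attention = 4 * d_model * d_model                # q, k, v, out projections (no bias)
    swiglu = d_model * 2 * d_ffn + d_ffn * d_model   # wi and wo (no bias)
    layer_norms = 2 * d_model                        # ln1 and ln2 RMSNorm weights
    layer = attention + swiglu + layer_norms
    extra_norms = 2 * d_model                        # embed_norm and final_norm
    rope_freqs = (d_model // n_head) // 2            # shared RotaryEmbedding.freqs parameter
    head = (d_model * d_model + d_model) + (d_model * num_labels + num_labels)
    return embedding + n_layers * layer + extra_norms + rope_freqs + head


total = emcoder_param_count()
print(f"{total:,} parameters, about {total * 4 / 1e6:.1f} MB in float32")
```

Under these assumptions the estimate is 81,772,860 parameters, or 327,091,440 bytes in float32, which is within a few kilobytes of the 327,097,416-byte `model.safetensors` listed in this commit.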

model.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:5013a3b32923fa719eea0597d593d64f0e824d611531d1259d8bf81ae13aa5be
+size 327097416

modeling_emcoder.py ADDED Viewed

	@@ -0,0 +1,301 @@

+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from .rope_embeddings import RotaryEmbedding
+from transformers import PreTrainedModel, AutoConfig, AutoModel
+from transformers.modeling_outputs import SequenceClassifierOutput
+from .configuration_emcoder import EmCoderConfig
+class RMSNorm(nn.Module):
+    def __init__(self, dim: int, eps: float = 1e-6):
+        super().__init__()
+        self.eps = eps
+        self.weight = nn.Parameter(torch.ones(dim))
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        variance = x.pow(2).mean(-1, keepdim=True)
+        return x * torch.rsqrt(variance + self.eps) * self.weight
+class SwiGLU(nn.Module):
+    def __init__(self, d_model: int, d_ffn: int):
+        super().__init__()
+        self.wi = nn.Linear(d_model, 2 * d_ffn, bias=False)
+        self.wo = nn.Linear(d_ffn, d_model, bias=False)
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        x1, x2 = self.wi(x).chunk(2, dim=-1)
+        return self.wo(x1 * F.silu(x2))
+class EmCoderEncoderLayer(nn.Module):
+    """Custom Pre-LN Transformer Encoder Layer with RoPE and PyTorch scaled-dot-product attention."""
+    def __init__(self, config: EmCoderConfig, rope: RotaryEmbedding):
+        super().__init__()
+        self.n_head = config.n_head
+        self.d_head = config.d_model // config.n_head
+        self.rope = rope
+        # Attention projections
+        self.q_proj = nn.Linear(config.d_model, config.d_model, bias=False)
+        self.k_proj = nn.Linear(config.d_model, config.d_model, bias=False)
+        self.v_proj = nn.Linear(config.d_model, config.d_model, bias=False)
+        self.out_proj = nn.Linear(config.d_model, config.d_model, bias=False)
+        self.ln1 = RMSNorm(config.d_model)
+        self.ln2 = RMSNorm(config.d_model)
+        self.ffn = SwiGLU(config.d_model, config.d_ffn)
+        self.dropout = nn.Dropout(config.dropout)
+        # mark for initialization
+        self.out_proj._is_residual = True
+        self.ffn.wo._is_residual = True
+    def forward(self, x: torch.Tensor, attn_mask: torch.Tensor) -> torch.Tensor:
+        # MULTI-HEAD ATTENTION
+        residual = x
+        nx = self.ln1(x)
+        B, S, _ = nx.shape
+        # Projections -> (B, H, S, D_head)
+        q = self.q_proj(nx).view(B, S, self.n_head, self.d_head).transpose(1, 2)
+        k = self.k_proj(nx).view(B, S, self.n_head, self.d_head).transpose(1, 2)
+        v = self.v_proj(nx).view(B, S, self.n_head, self.d_head).transpose(1, 2)
+        q = self.rope.rotate_queries_or_keys(q)
+        k = self.rope.rotate_queries_or_keys(k)
+        attn_out = F.scaled_dot_product_attention(
+            q,
+            k,
+            v,
+            attn_mask=attn_mask,
+            dropout_p=self.dropout.p if self.dropout.training else 0.0,
+        )
+        # Join heads -> (B, S, D_model)
+        attn_out = attn_out.transpose(1, 2).contiguous().view(B, S, -1)
+        x = residual + self.dropout(self.out_proj(attn_out))
+        x = x + self.dropout(self.ffn(self.ln2(x)))
+        return x
+class EmCoderEncoder(nn.Module):
+    """The core encoder architecture of EmCoder Transformer."""
+    def __init__(self, config: EmCoderConfig):
+        super().__init__()
+        self.token_embedding = nn.Embedding(config.vocab_size, config.d_model)
+        self.embed_norm = RMSNorm(config.d_model)
+        self.dropout = nn.Dropout(config.dropout)
+        self.rope = RotaryEmbedding(dim=config.d_model // config.n_head)
+        self.layers = nn.ModuleList(
+            [EmCoderEncoderLayer(config, self.rope) for _ in range(config.n_layers)]
+        )
+        self.final_norm = RMSNorm(config.d_model)
+    def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
+        """Standard forward pass through the encoder."""
+        x = self.token_embedding(x)
+        x = self.embed_norm(x)
+        x = self.dropout(x)
+        B, S = mask.shape
+        attn_mask = mask.view(B, 1, 1, S).to(dtype=torch.bool)
+        for layer in self.layers:
+            x = layer(x, attn_mask)
+        return self.final_norm(x)
+class EmCoder(PreTrainedModel):
+    """The full EmCoder model, including the backbone encoder and the classification head."""
+    config_class = EmCoderConfig
+    def __init__(self, config: EmCoderConfig):
+        super().__init__(config)
+        self.encoder = EmCoderEncoder(config)
+        self.classifier = nn.Sequential(
+            nn.Linear(config.d_model, config.d_model),
+            nn.GELU(),
+            nn.Dropout(config.dropout),
+            nn.Linear(config.d_model, config.num_labels),
+        )
+        self.post_init()
+    def _init_weights(self, module: nn.Module) -> None:
+        if isinstance(module, nn.Linear):
+            # scale down the init for residual connections
+            if getattr(module, "_is_residual", False):
+                std = 0.02 / ((2 * self.config.n_layers) ** 0.5)
+            else:
+                std = 0.02
+            nn.init.trunc_normal_(module.weight, std=std)
+            if module.bias is not None:
+                nn.init.zeros_(module.bias)
+        elif isinstance(module, nn.Embedding):
+            nn.init.trunc_normal_(module.weight, std=0.02)
+        elif isinstance(module, RMSNorm):
+            nn.init.ones_(module.weight)
+    def _set_mc_dropout(self, active: bool = True):
+        for m in self.modules():
+            if isinstance(m, nn.Dropout):
+                m.train(active)
+    @staticmethod
+    def _masked_mean_pooling(
+        features: torch.Tensor, mask: torch.Tensor
+    ) -> torch.Tensor:
+        mask = mask.unsqueeze(-1)  # (B, S, 1)
+        masked_features = features * mask  # (B, S, D)
+        sum_masked_features = masked_features.sum(dim=1)  # (B, D)
+        count_tokens = torch.clamp(mask.sum(dim=1), min=1e-9)  # (B, 1)
+        return sum_masked_features / count_tokens  # (B, D)
+    def mc_forward(
+        self,
+        input_ids: torch.Tensor | None = None,
+        attention_mask: torch.Tensor | None = None,
+        labels: torch.Tensor | None = None,
+        n_samples: int = 10,
+        max_batch_size: int | None = None,
+        return_dict: bool | None = None,
+        **kwargs,
+    ) -> tuple[torch.Tensor, ...] | SequenceClassifierOutput:
+        """
+        Performs Monte Carlo Dropout inference to quantify uncertainty.
+        Args:
+            input_ids: Input token IDs of shape (B, S).
+            attention_mask: Attention mask of shape (B, S).
+            n_samples: Total number of Monte Carlo samples.
+            max_batch_size: Maximum number of samples in one forward pass.
+        Returns:
+            A `SequenceClassifierOutput` (or a tuple if `return_dict=False`) whose
+            logits have shape (n_samples, B, num_labels).
+        """
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+        x = input_ids if input_ids is not None else kwargs.get("x")
+        mask = attention_mask if attention_mask is not None else kwargs.get("mask")
+        if x is None or mask is None:
+            raise ValueError("input_ids (x) and attention_mask (mask) must be provided")
+        if max_batch_size is None:
+            max_batch_size = n_samples
+        B, S = x.shape
+        num_labels = self.classifier[-1].out_features
+        all_logits = torch.empty((n_samples, B, num_labels), device=x.device)
+        is_training = self.training
+        self._set_mc_dropout(active=True)
+        try:
+            with torch.no_grad():
+                for i in range(0, n_samples, max_batch_size):
+                    batch_samples = min(max_batch_size, n_samples - i)
+                    x_stacked = x.repeat(batch_samples, 1) # (batch_samples * B, S)
+                    mask_stacked = mask.repeat(batch_samples, 1) # (batch_samples * B, S)
+                    features = self.encoder(
+                        x_stacked, mask_stacked
+                    )  # (batch_samples * B, S, D)
+                    pooled = self._masked_mean_pooling(features, mask_stacked)
+                    logits = self.classifier(pooled)  # (batch_samples * B, num_labels)
+                    all_logits[i : i + batch_samples] = logits.view(batch_samples, B, -1)
+        finally:
+            self._set_mc_dropout(active=is_training)
+        loss = None
+        if labels is not None:
+            loss_fct = nn.BCEWithLogitsLoss()
+            loss = loss_fct(all_logits.mean(dim=0), labels.to(all_logits.dtype))
+        if not return_dict:
+            output = (all_logits,)
+            return ((loss,) + output) if loss is not None else output
+        return SequenceClassifierOutput(
+            loss=loss,
+            logits=all_logits,
+            hidden_states=None,
+            attentions=None,
+        )
+    def forward(
+        self,
+        input_ids: torch.Tensor | None = None,
+        attention_mask: torch.Tensor | None = None,
+        labels: torch.Tensor | None = None,
+        return_dict: bool | None = None,
+        **kwargs,
+    ) -> tuple[torch.Tensor, ...] | SequenceClassifierOutput:
+        """Standard forward pass without MC Dropout."""
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+        x = input_ids if input_ids is not None else kwargs.get("x")
+        mask = attention_mask if attention_mask is not None else kwargs.get("mask")
+        if x is None or mask is None:
+            raise ValueError("input_ids (x) and attention_mask (mask) must be provided")
+        features = self.encoder(x, mask)
+        pooled = self._masked_mean_pooling(features, mask)
+        logits = self.classifier(pooled)
+        loss = None
+        if labels is not None:
+            loss_fct = nn.BCEWithLogitsLoss()
+            loss = loss_fct(logits, labels.to(logits.dtype))
+        if not return_dict:
+            output = (logits,)
+            return ((loss,) + output) if loss is not None else output
+        return SequenceClassifierOutput(
+            loss=loss,
+            logits=logits,
+            hidden_states=None,
+            attentions=None,
+        )
+try:
+    AutoConfig.register("emcoder", EmCoderConfig)
+    AutoModel.register(EmCoderConfig, EmCoder)
+except ValueError:
+    pass
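The README mentions an entropy-based uncertainty decomposition over the `mc_forward` samples but does not spell out the formulas. A minimal sketch of the standard decomposition for a single label (predictive entropy = expected entropy + mutual information) is shown below. It is an assumption that EmCoder's plots use this exact form, and the function names are hypothetical.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in nats of a Bernoulli(p) variable, treating 0 * log(0) as 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def decompose_uncertainty(sample_probs: list[float]) -> dict[str, float]:
    """Split predictive entropy for one label, given per-pass sigmoid probabilities."""
    n = len(sample_probs)
    mean_p = sum(sample_probs) / n
    total = binary_entropy(mean_p)                             # entropy of the mean prediction
    aleatoric = sum(binary_entropy(p) for p in sample_probs) / n  # mean per-pass entropy
    epistemic = max(total - aleatoric, 0.0)                    # mutual information
    return {"mean_prob": mean_p, "total": total,
            "aleatoric": aleatoric, "epistemic": epistemic}


# Passes agree that the input is ambiguous: uncertainty is all aleatoric.
agree = decompose_uncertainty([0.5, 0.5, 0.5, 0.5])
# Passes contradict each other: most of the uncertainty is epistemic.
disagree = decompose_uncertainty([0.02, 0.98, 0.03, 0.97])
print(agree)
print(disagree)
```

Both toy inputs have the same mean probability of 0.5, so a plain mean would not distinguish them; the decomposition separates data ambiguity from disagreement between dropout passes.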

outputs/admiration_scatters.png ADDED Viewed

Git LFS Details

SHA256: 5cab43562862ea40bd700109f8dacf96ca5bd47598c5d13fda358659ec0304c9
Pointer size: 131 Bytes
Size of remote file: 249 kB

outputs/confusion_matrix.png ADDED Viewed

outputs/f1_rejection_epistemic.png ADDED Viewed
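The rejection experiment behind this plot can be sketched for a single binary label as follows. This is a simplified illustration with toy data: the README reports macro F1 over 28 labels, and the exact rejection protocol is not given in the commit, so the helper names and the rounding rule are assumptions.

```python
def f1_score(pairs: list[tuple[int, int]]) -> float:
    """F1 for (y_true, y_pred) pairs of a single binary label."""
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_after_rejection(y_true: list[int], y_pred: list[int],
                       uncertainty: list[float], reject_fraction: float) -> float:
    """Drop the most uncertain predictions, then score the remainder."""
    order = sorted(range(len(y_true)), key=lambda i: uncertainty[i])  # most certain first
    n_keep = len(order) - int(round(reject_fraction * len(order)))
    kept = [(y_true[i], y_pred[i]) for i in order[:n_keep]]
    return f1_score(kept)


# Toy data: the two wrong predictions (indices 4 and 5) carry the highest uncertainty.
y_true = [1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
unc = [0.01, 0.02, 0.01, 0.03, 0.30, 0.25, 0.02, 0.04]

for frac in (0.0, 0.25):
    print(f"reject {frac:.0%}: F1 = {f1_after_rejection(y_true, y_pred, unc, frac):.2f}")
```

If uncertainty tracks errors, the curve rises as the rejection fraction grows; if it were uninformative, F1 would stay roughly flat.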

outputs/fear_scatters.png ADDED Viewed

Git LFS Details

SHA256: 1eb5e47f1b3366d93daf60ada82070330ddb520d68441b7978c982d1fcfba06e
Pointer size: 131 Bytes
Size of remote file: 130 kB

outputs/neutral_scatters.png ADDED Viewed

Git LFS Details

SHA256: bdfeab47c87920893ebf91eba8eca21ac5c0939e106b929cb478ca71322fb7b1
Pointer size: 131 Bytes
Size of remote file: 309 kB

outputs/ridge_aleatoric.png ADDED Viewed

Git LFS Details

SHA256: 3fa0a3aeb52fae0fbd585eda43262db6586c5fd0a84b11dd9b2d9077bb2c6ce8
Pointer size: 131 Bytes
Size of remote file: 168 kB

outputs/ridge_epistemic.png ADDED Viewed

Git LFS Details

SHA256: d208742be0a9e25c026230d7657ef3f1c9e7d8668c3b3a798168cac134457575
Pointer size: 131 Bytes
Size of remote file: 111 kB

rope_embeddings.py ADDED Viewed

	@@ -0,0 +1,270 @@

+from __future__ import annotations
+from math import pi, log
+import torch
+from torch.amp import autocast
+from torch.nn import Module
+from torch import nn, broadcast_tensors, is_tensor, tensor, Tensor
+from typing import Literal
+def exists(val):
+    return val is not None
+def default(val, d):
+    return val if exists(val) else d
+def broadcat(tensors, dim=-1):
+    broadcasted_tensors = broadcast_tensors(*tensors)
+    return torch.cat(broadcasted_tensors, dim=dim)
+def slice_at_dim(t, dim_slice: slice, *, dim):
+    dim += (t.ndim if dim < 0 else 0)
+    colons = [slice(None)] * t.ndim
+    colons[dim] = dim_slice
+    return t[tuple(colons)]
+def rotate_half(x):
+    orig_shape = x.shape
+    d_head = orig_shape[-1]
+    x = x.view(*orig_shape[:-1], d_head // 2, 2)
+    x1 = x[..., 0]
+    x2 = x[..., 1]
+    res = torch.stack((-x2, x1), dim=-1)
+    return res.view(*orig_shape)
+@autocast('cuda', enabled=False)
+def apply_rotary_emb(
+    freqs,
+    t,
+    start_index=0,
+    scale=1.,
+    seq_dim=-2,
+    freqs_seq_dim=None
+):
+    dtype = t.dtype
+    if not exists(freqs_seq_dim):
+        if freqs.ndim == 2 or t.ndim == 3:
+            freqs_seq_dim = 0
+    if t.ndim == 3 or exists(freqs_seq_dim):
+        seq_len = t.shape[seq_dim]
+        freqs = slice_at_dim(freqs, slice(-seq_len, None), dim=freqs_seq_dim)
+    rot_dim = freqs.shape[-1]
+    end_index = start_index + rot_dim
+    assert rot_dim <= t.shape[-1], f'feature dimension {t.shape[-1]} is not of sufficient size to rotate in all the positions {rot_dim}'
+    t_left = t[..., :start_index]
+    t_middle = t[..., start_index:end_index]
+    t_right = t[..., end_index:]
+    t_transformed = (t_middle * freqs.cos() * scale) + (rotate_half(t_middle) * freqs.sin() * scale)
+    out = torch.cat((t_left, t_transformed, t_right), dim=-1)
+    return out.type(dtype)
+def apply_learned_rotations(rotations, t, start_index=0, freq_ranges=None):
+    if exists(freq_ranges):
+        rotations = torch.einsum('..., f -> ... f', rotations, freq_ranges)
+        rotations = rotations.reshape(*rotations.shape[:-2], -1)
+    rotations = rotations.repeat_interleave(2, dim=-1)
+    return apply_rotary_emb(rotations, t, start_index=start_index)
+class RotaryEmbedding(Module):
+    def __init__(
+        self,
+        dim,
+        custom_freqs: Tensor | None = None,
+        freqs_for: Literal['lang', 'pixel', 'constant'] = 'lang',
+        theta = 10000,
+        max_freq = 10,
+        num_freqs = 1,
+        learned_freq = False,
+        use_xpos = False,
+        xpos_scale_base = 512,
+        interpolate_factor = 1.,
+        theta_rescale_factor = 1.,
+        seq_before_head_dim = False,
+        cache_if_possible = True,
+        cache_max_seq_len = 8192
+    ):
+        super().__init__()
+        theta *= theta_rescale_factor ** (dim / (dim - 2))
+        self.freqs_for = freqs_for
+        if exists(custom_freqs):
+            freqs = custom_freqs
+        elif freqs_for == 'lang':
+            freqs = 1. / (theta ** (torch.arange(0, dim, 2)[:(dim // 2)].float() / dim))
+        elif freqs_for == 'pixel':
+            freqs = torch.linspace(1., max_freq / 2, dim // 2) * pi
+        elif freqs_for == 'constant':
+            freqs = torch.ones(num_freqs).float()
+        self.cache_if_possible = cache_if_possible
+        self.cache_max_seq_len = cache_max_seq_len
+        self.register_buffer('cached_freqs', torch.zeros(cache_max_seq_len, dim), persistent=False)
+        self.cached_freqs_seq_len = 0
+        self.freqs = nn.Parameter(freqs, requires_grad=learned_freq)
+        self.learned_freq = learned_freq
+        self.register_buffer('dummy', torch.tensor(0), persistent=False)
+        self.seq_before_head_dim = seq_before_head_dim
+        self.default_seq_dim = -3 if seq_before_head_dim else -2
+        assert interpolate_factor >= 1.
+        self.interpolate_factor = interpolate_factor
+        self.use_xpos = use_xpos
+        if not use_xpos:
+            return
+        scale = (torch.arange(0, dim, 2) + 0.4 * dim) / (1.4 * dim)
+        self.scale_base = xpos_scale_base
+        self.register_buffer('scale', scale, persistent=False)
+        self.register_buffer('cached_scales', torch.zeros(cache_max_seq_len, dim), persistent=False)
+        self.cached_scales_seq_len = 0
+        self.apply_rotary_emb = staticmethod(apply_rotary_emb)
+    @property
+    def device(self):
+        return self.dummy.device
+    def get_seq_pos(self, seq_len, device=None, dtype=None, offset=0):
+        device = default(device, self.device)
+        dtype = default(dtype, self.cached_freqs.dtype)
+        return (torch.arange(seq_len, device=device, dtype=dtype) + offset) / self.interpolate_factor
+    def rotate_queries_or_keys(self, t, seq_dim=None, offset=0, scale=None):
+        seq_dim = default(seq_dim, self.default_seq_dim)
+        assert not self.use_xpos or exists(scale), 'you must use `.rotate_queries_and_keys` method instead'
+        device, dtype, seq_len = t.device, t.dtype, t.shape[seq_dim]
+        seq = self.get_seq_pos(seq_len, device=device, dtype=dtype, offset=offset)
+        freqs = self.forward(seq, seq_len=seq_len, offset=offset)
+        if seq_dim == -3:
+            freqs = freqs.unsqueeze(1)
+        return apply_rotary_emb(freqs, t, scale=default(scale, 1.), seq_dim=seq_dim)
+    def rotate_queries_with_cached_keys(self, q, k, seq_dim=None, offset=0):
+        dtype, device, seq_dim = q.dtype, q.device, default(seq_dim, self.default_seq_dim)
+        q_len, k_len = q.shape[seq_dim], k.shape[seq_dim]
+        assert q_len <= k_len
+        q_scale = k_scale = 1.
+        if self.use_xpos:
+            seq = self.get_seq_pos(k_len, dtype=dtype, device=device)
+            q_scale = self.get_scale(seq[-q_len:]).type(dtype)
+            k_scale = self.get_scale(seq).type(dtype)
+        rotated_q = self.rotate_queries_or_keys(q, seq_dim=seq_dim, scale=q_scale, offset=k_len - q_len + offset)
+        rotated_k = self.rotate_queries_or_keys(k, seq_dim=seq_dim, scale=k_scale ** -1)
+        return rotated_q.type(q.dtype), rotated_k.type(k.dtype)
+    def rotate_queries_and_keys(self, q, k, seq_dim=None):
+        seq_dim = default(seq_dim, self.default_seq_dim)
+        assert self.use_xpos
+        device, dtype, seq_len = q.device, q.dtype, q.shape[seq_dim]
+        seq = self.get_seq_pos(seq_len, dtype=dtype, device=device)
+        freqs = self.forward(seq, seq_len=seq_len)
+        scale = self.get_scale(seq, seq_len=seq_len).to(dtype)
+        if seq_dim == -3:
+            freqs = freqs.unsqueeze(1)
+            scale = scale.unsqueeze(1)
+        rotated_q = apply_rotary_emb(freqs, q, scale=scale, seq_dim=seq_dim)
+        rotated_k = apply_rotary_emb(freqs, k, scale=scale ** -1, seq_dim=seq_dim)
+        return rotated_q.type(q.dtype), rotated_k.type(k.dtype)
+    def get_scale(self, t: Tensor, seq_len: int | None = None, offset=0):
+        assert self.use_xpos
+        should_cache = self.cache_if_possible and exists(seq_len) and (offset + seq_len) <= self.cache_max_seq_len
+        if should_cache and (seq_len + offset) <= self.cached_scales_seq_len:
+            return self.cached_scales[offset:(offset + seq_len)]
+        scale = 1.
+        if self.use_xpos:
+            power = (t - len(t) // 2) / self.scale_base
+            scale = self.scale ** power.unsqueeze(-1)
+            scale = scale.repeat_interleave(2, dim=-1)
+        if should_cache and offset == 0:
+            self.cached_scales[:seq_len] = scale.detach()
+            self.cached_scales_seq_len = seq_len
+        return scale
+    def get_axial_freqs(self, *dims, offsets: tuple[int | float, ...] | Tensor | None = None):
+        Colon = slice(None)
+        all_freqs = []
+        if exists(offsets):
+            if not is_tensor(offsets):
+                offsets = tensor(offsets)
+            assert len(offsets) == len(dims)
+        for ind, dim in enumerate(dims):
+            offset = 0
+            if exists(offsets):
+                offset = offsets[ind]
+            if self.freqs_for == 'pixel':
+                pos = torch.linspace(-1, 1, steps=dim, device=self.device)
+            else:
+                pos = torch.arange(dim, device=self.device)
+            pos = pos + offset
+            freqs = self.forward(pos, seq_len=dim)
+            all_axis = [None] * len(dims)
+            all_axis[ind] = Colon
+            new_axis_slice = (Ellipsis, *all_axis, Colon)
+            all_freqs.append(freqs[new_axis_slice])
+        all_freqs = broadcast_tensors(*all_freqs)
+        return torch.cat(all_freqs, dim=-1)
+    @autocast('cuda', enabled=False)
+    def forward(self, t: Tensor, seq_len: int | None = None, offset=0):
+        should_cache = (
+            self.cache_if_possible and not self.learned_freq and
+            exists(seq_len) and self.freqs_for != 'pixel' and
+            (offset + seq_len) <= self.cache_max_seq_len
+        )
+        if should_cache and (offset + seq_len) <= self.cached_freqs_seq_len:
+            return self.cached_freqs[offset:(offset + seq_len)].detach()
+        freqs = self.freqs
+        freqs = torch.einsum('..., f -> ... f', t.type(freqs.dtype), freqs)
+        freqs = freqs.repeat_interleave(2, dim=-1)
+        if should_cache and offset == 0:
+            self.cached_freqs[:seq_len] = freqs.detach()
+            self.cached_freqs_seq_len = seq_len
+        return freqs
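The interleaved-pair rotation performed by `rotate_half` and `apply_rotary_emb` with the default `'lang'` frequencies can be checked in plain Python. The sketch below is a scalar re-derivation for one vector, not the module's code path. It illustrates the property that makes RoPE useful: the query-key dot product depends only on the relative position.

```python
import math


def rope_rotate(vec: list[float], position: float, theta: float = 10000.0) -> list[float]:
    """Rotate consecutive pairs (x[2k], x[2k+1]) by position * theta^(-2k/dim)."""
    dim = len(vec)
    out: list[float] = []
    for k in range(dim // 2):
        angle = position * theta ** (-2 * k / dim)
        c, s = math.cos(angle), math.sin(angle)
        a, b = vec[2 * k], vec[2 * k + 1]
        # Matches t * cos + rotate_half(t) * sin, where rotate_half maps (a, b) to (-b, a).
        out.extend((a * c - b * s, b * c + a * s))
    return out


def dot(u: list[float], v: list[float]) -> float:
    return sum(x * y for x, y in zip(u, v))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.25]

# Same relative offset (4 positions) at two different absolute positions.
score_a = dot(rope_rotate(q, 3), rope_rotate(k, 7))
score_b = dot(rope_rotate(q, 10), rope_rotate(k, 14))
print(score_a, score_b)
```

The two scores agree up to floating-point error, and the rotation leaves each vector's norm unchanged.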

thresholds.json ADDED Viewed

	@@ -0,0 +1,114 @@

+{
+    "admiration": {
+        "p": 0.6142857142857143,
+        "f1": 0.7186574531095755
+    },
+    "amusement": {
+        "p": 0.5,
+        "f1": 0.7870778267254038
+    },
+    "anger": {
+        "p": 0.6714285714285715,
+        "f1": 0.42744063324538256
+    },
+    "annoyance": {
+        "p": 0.5571428571428572,
+        "f1": 0.3525423728813559
+    },
+    "approval": {
+        "p": 0.3857142857142858,
+        "f1": 0.36084452975047987
+    },
+    "caring": {
+        "p": 0.44285714285714284,
+        "f1": 0.4715909090909091
+    },
+    "confusion": {
+        "p": 0.6142857142857143,
+        "f1": 0.4217252396166134
+    },
+    "curiosity": {
+        "p": 0.6714285714285715,
+        "f1": 0.5331125827814569
+    },
+    "desire": {
+        "p": 0.6142857142857143,
+        "f1": 0.5324675324675324
+    },
+    "disappointment": {
+        "p": 0.5,
+        "f1": 0.36416184971098264
+    },
+    "disapproval": {
+        "p": 0.5,
+        "f1": 0.41025641025641024
+    },
+    "disgust": {
+        "p": 0.5,
+        "f1": 0.425531914893617
+    },
+    "embarrassment": {
+        "p": 0.5,
+        "f1": 0.5294117647058824
+    },
+    "excitement": {
+        "p": 0.7857142857142857,
+        "f1": 0.33986928104575165
+    },
+    "fear": {
+        "p": 0.6142857142857143,
+        "f1": 0.632183908045977
+    },
+    "gratitude": {
+        "p": 0.7857142857142857,
+        "f1": 0.9131075110456554
+    },
+    "grief": {
+        "p": 0.6714285714285715,
+        "f1": 0.45454545454545453
+    },
+    "joy": {
+        "p": 0.6142857142857143,
+        "f1": 0.5688622754491018
+    },
+    "love": {
+        "p": 0.7285714285714286,
+        "f1": 0.8052930056710775
+    },
+    "nervousness": {
+        "p": 0.7857142857142857,
+        "f1": 0.375
+    },
+    "optimism": {
+        "p": 0.6714285714285715,
+        "f1": 0.6054054054054054
+    },
+    "pride": {
+        "p": 0.5,
+        "f1": 0.56
+    },
+    "realization": {
+        "p": 0.5,
+        "f1": 0.24892703862660945
+    },
+    "relief": {
+        "p": 0.3285714285714286,
+        "f1": 0.1935483870967742
+    },
+    "remorse": {
+        "p": 0.7285714285714286,
+        "f1": 0.7916666666666666
+    },
+    "sadness": {
+        "p": 0.6714285714285715,
+        "f1": 0.5255474452554745
+    },
+    "surprise": {
+        "p": 0.5,
+        "f1": 0.5128205128205128
+    },
+    "neutral": {
+        "p": 0.3857142857142858,
+        "f1": 0.6646788990825688
+    }
+}
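A minimal sketch of how these per-label thresholds could be applied to the mean MC Dropout probabilities is shown below. The `binarize` helper, the `>=` comparison, and the 0.5 fallback are assumptions; the commit does not include the author's binarization code. The JSON excerpt copies three entries from `thresholds.json`.

```python
import json

# Three entries copied from thresholds.json; the real file covers all 28 labels.
thresholds_json = """
{
    "admiration": {"p": 0.6142857142857143, "f1": 0.7186574531095755},
    "anger": {"p": 0.6714285714285715, "f1": 0.42744063324538256},
    "neutral": {"p": 0.3857142857142858, "f1": 0.6646788990825688}
}
"""


def binarize(mean_probs: dict[str, float], thresholds: dict[str, dict[str, float]],
             default: float = 0.5) -> list[str]:
    """Return labels whose mean probability reaches that label's threshold."""
    return [label for label, prob in mean_probs.items()
            if prob >= thresholds.get(label, {}).get("p", default)]


thresholds = json.loads(thresholds_json)
# Hypothetical mean probabilities for one input.
predicted = binarize({"admiration": 0.62, "anger": 0.60, "neutral": 0.40}, thresholds)
print(predicted)
```

Here `anger` is rejected at 0.60 even though it would pass a flat 0.5 cutoff, while `neutral` is accepted at 0.40 because its tuned threshold is lower.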

tokenizer.json ADDED Viewed

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED Viewed

	@@ -0,0 +1,17 @@

+{
+  "backend": "tokenizers",
+  "clean_up_tokenization_spaces": true,
+  "cls_token": "[CLS]",
+  "is_local": false,
+  "local_files_only": false,
+  "mask_token": "[MASK]",
+  "model_input_names": [
+    "input_ids",
+    "attention_mask"
+  ],
+  "model_max_length": 8192,
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "tokenizer_class": "TokenizersBackend",
+  "unk_token": "[UNK]"
+}

train_config.json ADDED Viewed

	@@ -0,0 +1,11 @@

+{
+    "n_samples": 50,
+    "tokenized_ds_dir": "data/goemotions_v2_no_trunc",
+    "encoder_lr": 0.00001,
+    "head_lr": 0.0002,
+    "lr_warmup": 0.02,
+    "weight_decay": 0.01,
+    "batch_size": 64,
+    "gradient_accumulation_steps": 1,
+    "num_epochs": 10
+}

train_state.json ADDED Viewed

	@@ -0,0 +1,4 @@

+{
+    "train_loss": 0.16548924763660894,
+    "eval_loss": 0.21261409854187685
+}