Upload folder using huggingface_hub

Files changed (9)

README.md +122 -0
config.json +22 -0
configuration_utrlm.py +53 -0
modeling_utrlm.py +413 -0
pytorch_model.bin +3 -0
special_tokens_map.json +8 -0
tokenization_utrlm.py +128 -0
tokenizer_config.json +62 -0
vocab.json +12 -0

README.md ADDED

	@@ -0,0 +1,122 @@

+---
+language:
+- rna
+library_name: transformers
+tags:
+- RNA
+- language-model
+- UTR
+- genomics
+- biology
+license: gpl-3.0
+---
+# UTR-LM-MLMSS
+UTR-LM is a 5' UTR RNA language model based on ESM2, pretrained on endogenous 5' UTRs from five species and a large synthetic library. This checkpoint (`UTR-LM-MLMSS`) was trained with masked language modeling (MLM) plus **secondary structure prediction** as a supervised auxiliary objective.
+## Architecture
+| Parameter | Value |
+|---|---|
+| Layers | 6 |
+| Attention heads | 16 |
+| Embedding dimension | 128 |
+| Vocabulary size | 10 |
+| Positional encoding | Rotary (RoPE) |
+| Architecture | ESM2-style pre-LN Transformer |
+**Vocabulary:** `<pad>` (0), `<eos>` (1), `<unk>` (2), `A` (3), `G` (4), `C` (5), `T` (6), `<cls>` (7), `<mask>` (8), `<sep>` (9)
+## Pretraining
+- **Objective:** Masked language modeling + per-token secondary structure prediction (3-class: unpaired, stem, loop)
+- **Data:** Endogenous 5' UTRs from five species (human, mouse, zebrafish, *Drosophila*, yeast) combined with the Cao et al. random 5' UTR synthetic library
+- **Source checkpoint:** `ESM2SS_FS4.1_fiveSpeciesCao_6layers_16heads_128embedsize_4096batchToks_lr1e-05_structureweight1.0_MLMLossMin_epoch200.pkl`
+Only one `ESM2SS` checkpoint (secondary structure only, no MFE regression) was available, so no checkpoint selection was needed.
+## Parity Verification
+Hidden-state representations produced by this HF model are verified to be **exactly identical** (max absolute difference = 0.00) to the original ESM2-based implementation at all 7 representation levels (initial embedding + 6 transformer layers). Verified on GPU with PyTorch 2.8 / CUDA 12.6.
+## Related Models
+| Model | Pretraining Objective | Notes |
+|---|---|---|
+| [UTR-LM-MLM](https://huggingface.co/Taykhoom/UTR-LM-MLM) | MLM | Base model |
+| [UTR-LM-MLMSI](https://huggingface.co/Taykhoom/UTR-LM-MLMSI) | MLM + MFE regression | Recommended for TE / EL tasks |
+| **[UTR-LM-MLMSS](https://huggingface.co/Taykhoom/UTR-LM-MLMSS)** | MLM + secondary structure | This model |
+| [UTR-LM-MLMSISS](https://huggingface.co/Taykhoom/UTR-LM-MLMSISS) | MLM + MFE + secondary structure | Recommended for MRL tasks |
+## Usage
+### Embedding generation
+```python
+import torch
+from transformers import AutoTokenizer, AutoModel
+tokenizer = AutoTokenizer.from_pretrained("Taykhoom/UTR-LM-MLMSS", trust_remote_code=True)
+model = AutoModel.from_pretrained("Taykhoom/UTR-LM-MLMSS", trust_remote_code=True)
+model.eval()
+sequences = ["ATGCATGCATGC", "GCTAGCTAGCTAGCTA"]
+enc = tokenizer(sequences, return_tensors="pt", padding=True)
+with torch.no_grad():
+    out = model(**enc)
+# CLS token embedding (position 0) - recommended for sequence-level tasks
+cls_emb = out.last_hidden_state[:, 0, :]   # (batch, 128)
+# All-token embeddings
+token_emb = out.last_hidden_state           # (batch, seq_len, 128)
+# Intermediate layer representations (hidden_states[0] is the embedding output)
+with torch.no_grad():
+    out_all = model(**enc, output_hidden_states=True)
+layer3_emb = out_all.hidden_states[3]       # after layer 3, shape (batch, seq_len, 128)
+```
+### MLM logits
+```python
+import torch
+from transformers import AutoTokenizer, AutoModelForMaskedLM
+tokenizer = AutoTokenizer.from_pretrained("Taykhoom/UTR-LM-MLMSS", trust_remote_code=True)
+model = AutoModelForMaskedLM.from_pretrained("Taykhoom/UTR-LM-MLMSS", trust_remote_code=True)
+model.eval()
+enc = tokenizer(["ATGC<mask>ATGC"], return_tensors="pt")
+with torch.no_grad():
+    logits = model(**enc).logits   # (1, seq_len, 10)
+```
+### Fine-tuning
+The model follows standard HF conventions and can be fine-tuned with any Trainer-compatible setup. For sequence regression tasks, use the CLS token embedding as input to a prediction head (as done in the original UTR-LM paper).
+## Citation
+```bibtex
+@article{chu2023utrlm,
+  title   = {A 5'UTR Language Model for Decoding Untranslated Regions of mRNA and Function Predictions},
+  author  = {Chu, Yanyi and others},
+  journal = {bioRxiv},
+  year    = {2023},
+  doi     = {10.1101/2023.10.11.561938}
+}
+```
+## Implementation Notes
+The original UTR-LM implementation uses standard scaled dot-product attention. This HF port adds support for `attn_implementation="sdpa"` (PyTorch `F.scaled_dot_product_attention`) and `attn_implementation="flash_attention_2"` (requires `pip install flash-attn --no-build-isolation`), which were not part of the original codebase.
+## Credits
+Original model and code by Yanyi Chu et al. (Stanford). Source code: [UTR-LM GitHub repository](https://github.com/a96123155/UTR-LM). The HF conversion code was authored primarily by [Claude Code](https://claude.ai/code) and reviewed manually by Taykhoom Dalal.
+## License
+GPL-3.0, following the original UTR-LM repository.
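
Note on the README's Fine-tuning section: it recommends feeding the CLS token embedding into a prediction head for sequence regression. A minimal sketch of such a head is given here. The class name `ClsRegressionHead`, the hidden layer, and the dropout rate are illustrative assumptions, not the head used in the UTR-LM paper, and a random tensor stands in for `out.last_hidden_state` so the snippet runs without downloading the checkpoint.

```python
import torch
import torch.nn as nn


class ClsRegressionHead(nn.Module):
    """Hypothetical head: maps the <cls> embedding to one scalar per sequence."""

    def __init__(self, hidden_size: int = 128, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, last_hidden_state: torch.Tensor) -> torch.Tensor:
        # Position 0 holds <cls> because the tokenizer prepends it.
        cls_emb = last_hidden_state[:, 0, :]      # (batch, hidden_size)
        return self.net(cls_emb).squeeze(-1)      # (batch,)


torch.manual_seed(0)
# Stand-in for out.last_hidden_state: (batch=2, seq_len=18, hidden=128).
fake_hidden = torch.randn(2, 18, 128)
targets = torch.tensor([0.5, 1.5])                # made-up regression targets

head = ClsRegressionHead()
preds = head(fake_hidden)
loss = nn.functional.mse_loss(preds, targets)
loss.backward()                                   # gradients reach the head
```

In a real setup the backbone output would replace `fake_hidden`, and the backbone could be trained jointly with the head or kept frozen.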

config.json ADDED

	@@ -0,0 +1,22 @@

+{
+  "alphabet_size": 10,
+  "append_eos": true,
+  "attention_heads": 16,
+  "auto_map": {
+    "AutoConfig": "configuration_utrlm.UtrLmConfig",
+    "AutoModel": "modeling_utrlm.UtrLmModel",
+    "AutoModelForMaskedLM": "modeling_utrlm.UtrLmForMaskedLM",
+    "AutoTokenizer": "tokenization_utrlm.UtrLmTokenizer"
+  },
+  "cls_idx": 7,
+  "embed_dim": 128,
+  "eos_idx": 1,
+  "mask_idx": 8,
+  "model_type": "utrlm",
+  "num_layers": 6,
+  "pad_token_id": 0,
+  "padding_idx": 0,
+  "prepend_bos": true,
+  "token_dropout": true,
+  "transformers_version": "4.57.6"
+}
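
The sizes in `config.json` determine a few derived quantities that are useful when checking the checkpoint. The sketch below computes the per-head dimension, the feed-forward width, and a backbone parameter count. It assumes the layer layout visible in `modeling_utrlm.py` (four `embed_dim x embed_dim` projections, a 4x feed-forward block, two LayerNorms per layer, and one final LayerNorm); the helper `linear_params` is a name introduced here for illustration, and the MLM head is not counted.

```python
import json

# Subset of the config.json fields from this commit.
config = json.loads("""
{"alphabet_size": 10, "attention_heads": 16, "embed_dim": 128, "num_layers": 6}
""")

embed_dim = config["embed_dim"]
heads = config["attention_heads"]
assert embed_dim % heads == 0, "embed_dim must be divisible by attention_heads"

head_dim = embed_dim // heads   # width of each attention head
ffn_dim = 4 * embed_dim         # fc1 output width in UtrLmLayer


def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus bias of one nn.Linear (illustrative helper)."""
    return n_in * n_out + n_out


layer_norm_params = 2 * embed_dim  # weight and bias
per_layer = (
    4 * linear_params(embed_dim, embed_dim)   # q, k, v, out projections
    + linear_params(embed_dim, ffn_dim)       # fc1
    + linear_params(ffn_dim, embed_dim)       # fc2
    + 2 * layer_norm_params                   # two LayerNorms per layer
)
backbone_params = (
    config["alphabet_size"] * embed_dim       # token embedding table
    + config["num_layers"] * per_layer
    + layer_norm_params                       # emb_layer_norm_after
)

print(head_dim, ffn_dim, backbone_params)     # 8 512 1191168
```

At roughly 1.19 M parameters the backbone is small, which matches the few-megabyte weight file in this commit.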

configuration_utrlm.py ADDED

	@@ -0,0 +1,53 @@

+from transformers import PretrainedConfig
+class UtrLmConfig(PretrainedConfig):
+    """
+    Configuration for UTR-LM (ESM2-based RNA language model).
+    Vocab (10 tokens):
+        <pad>:0  <eos>:1  <unk>:2  A:3  G:4  C:5  T:6  <cls>:7  <mask>:8  <sep>:9
+    """
+    model_type = "utrlm"
+    def __init__(
+        self,
+        num_layers: int = 6,
+        embed_dim: int = 128,
+        attention_heads: int = 16,
+        alphabet_size: int = 10,
+        padding_idx: int = 0,
+        mask_idx: int = 8,
+        cls_idx: int = 7,
+        eos_idx: int = 1,
+        prepend_bos: bool = True,
+        append_eos: bool = True,
+        token_dropout: bool = True,
+        **kwargs,
+    ):
+        kwargs.setdefault("pad_token_id", padding_idx)
+        super().__init__(**kwargs)
+        # Written into config.json so AutoModel / AutoModelForMaskedLM resolve
+        # the correct classes when loading from the Hub with trust_remote_code=True.
+        self.auto_map = {
+            "AutoConfig": "configuration_utrlm.UtrLmConfig",
+            "AutoTokenizer": "tokenization_utrlm.UtrLmTokenizer",
+            "AutoModel": "modeling_utrlm.UtrLmModel",
+            "AutoModelForMaskedLM": "modeling_utrlm.UtrLmForMaskedLM",
+        }
+        self.num_layers = num_layers
+        self.embed_dim = embed_dim
+        self.attention_heads = attention_heads
+        self.alphabet_size = alphabet_size
+        self.padding_idx = padding_idx
+        self.mask_idx = mask_idx
+        self.cls_idx = cls_idx
+        self.eos_idx = eos_idx
+        self.prepend_bos = prepend_bos
+        self.append_eos = append_eos
+        self.token_dropout = token_dropout
+    @property
+    def hidden_size(self) -> int:
+        return self.embed_dim

modeling_utrlm.py ADDED

	@@ -0,0 +1,413 @@

+"""UTR-LM ported to Hugging Face PreTrainedModel."""
+import math
+from typing import Optional, Tuple, Union
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from transformers import PreTrainedModel
+from transformers.modeling_outputs import BaseModelOutput, MaskedLMOutput
+from .configuration_utrlm import UtrLmConfig
+# ---------------------------------------------------------------------------
+# Rotary embeddings
+# ---------------------------------------------------------------------------
+def _rotate_half(x: torch.Tensor) -> torch.Tensor:
+    x1, x2 = x.chunk(2, dim=-1)
+    return torch.cat((-x2, x1), dim=-1)
+def _apply_rotary_pos_emb(x, cos, sin):
+    cos = cos[:, : x.shape[-2], :].to(x.dtype)
+    sin = sin[:, : x.shape[-2], :].to(x.dtype)
+    return (x * cos) + (_rotate_half(x) * sin)
+class RotaryEmbedding(nn.Module):
+    def __init__(self, dim: int):
+        super().__init__()
+        inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2).float() / dim))
+        self.register_buffer("inv_freq", inv_freq)
+        self._seq_len_cached: Optional[int] = None
+        self._cos_cached: Optional[torch.Tensor] = None
+        self._sin_cached: Optional[torch.Tensor] = None
+    def _update_cos_sin_tables(self, x: torch.Tensor, seq_dimension: int = 1):
+        seq_len = x.shape[seq_dimension]
+        if seq_len != self._seq_len_cached or self._cos_cached.device != x.device:
+            self._seq_len_cached = seq_len
+            t = torch.arange(x.shape[seq_dimension], device=x.device).type_as(self.inv_freq)
+            freqs = torch.einsum("i,j->ij", t, self.inv_freq)
+            emb = torch.cat((freqs, freqs), dim=-1).to(x.device)
+            self._cos_cached = emb.cos()[None, :, :]
+            self._sin_cached = emb.sin()[None, :, :]
+        return self._cos_cached, self._sin_cached
+    def forward(self, q, k):
+        self._cos_cached, self._sin_cached = self._update_cos_sin_tables(k, seq_dimension=-2)
+        return (
+            _apply_rotary_pos_emb(q, self._cos_cached, self._sin_cached),
+            _apply_rotary_pos_emb(k, self._cos_cached, self._sin_cached),
+        )
+# ---------------------------------------------------------------------------
+# Attention variants
+# ---------------------------------------------------------------------------
+class UtrLmAttention(nn.Module):
+    """Eager (standard) attention."""
+    def __init__(self, embed_dim: int, num_heads: int):
+        super().__init__()
+        self.embed_dim = embed_dim
+        self.num_heads = num_heads
+        self.head_dim = embed_dim // num_heads
+        self.scaling = self.head_dim ** -0.5
+        self.k_proj = nn.Linear(embed_dim, embed_dim)
+        self.v_proj = nn.Linear(embed_dim, embed_dim)
+        self.q_proj = nn.Linear(embed_dim, embed_dim)
+        self.out_proj = nn.Linear(embed_dim, embed_dim)
+        self.rot_emb = RotaryEmbedding(dim=self.head_dim)
+    def _project(self, x):
+        """Project and reshape x (T, B, E) -> q/k/v in (B*H, T, head_dim)."""
+        tgt_len, bsz, _ = x.size()
+        q = (self.q_proj(x) * self.scaling).contiguous().view(tgt_len, bsz * self.num_heads, self.head_dim).transpose(0, 1)
+        k = self.k_proj(x).contiguous().view(tgt_len, bsz * self.num_heads, self.head_dim).transpose(0, 1)
+        v = self.v_proj(x).contiguous().view(tgt_len, bsz * self.num_heads, self.head_dim).transpose(0, 1)
+        q, k = self.rot_emb(q, k)
+        return q, k, v
+    def forward(self, x, key_padding_mask, output_attentions: bool = False):
+        tgt_len, bsz, _ = x.size()
+        q, k, v = self._project(x)
+        attn_weights = torch.bmm(q, k.transpose(1, 2))
+        attn_weights = attn_weights.view(bsz, self.num_heads, tgt_len, tgt_len)
+        if key_padding_mask is not None:
+            attn_weights = attn_weights.masked_fill(
+                key_padding_mask.unsqueeze(1).unsqueeze(2).to(torch.bool), float("-inf")
+            )
+        attn_weights = attn_weights.view(bsz * self.num_heads, tgt_len, tgt_len)
+        attn_probs = F.softmax(attn_weights, dim=-1, dtype=torch.float32).type_as(attn_weights)
+        attn = torch.bmm(attn_probs, v)
+        attn = attn.transpose(0, 1).contiguous().view(tgt_len, bsz, self.embed_dim)
+        out = self.out_proj(attn)
+        if output_attentions:
+            return out, attn_probs.view(bsz, self.num_heads, tgt_len, tgt_len)
+        return out, None
+class UtrLmSdpaAttention(UtrLmAttention):
+    """SDPA attention via torch.nn.functional.scaled_dot_product_attention."""
+    def forward(self, x, key_padding_mask, output_attentions: bool = False):
+        if output_attentions:
+            # SDPA doesn't expose attention weights; fall back to eager.
+            return super().forward(x, key_padding_mask, output_attentions=True)
+        tgt_len, bsz, _ = x.size()
+        q, k, v = self._project(x)  # (B*H, T, head_dim)
+        # Reshape to (B, H, T, head_dim) for SDPA
+        q = q.view(bsz, self.num_heads, tgt_len, self.head_dim)
+        k = k.view(bsz, self.num_heads, tgt_len, self.head_dim)
+        v = v.view(bsz, self.num_heads, tgt_len, self.head_dim)
+        # Convert bool padding mask -> additive float mask (B, 1, 1, T)
+        attn_mask = None
+        if key_padding_mask is not None:
+            attn_mask = torch.zeros(bsz, 1, 1, tgt_len, dtype=q.dtype, device=q.device)
+            attn_mask = attn_mask.masked_fill(key_padding_mask[:, None, None, :], float("-inf"))
+        # scale=1.0 because q is already pre-scaled by self.scaling
+        out = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask, scale=1.0)
+        out = out.permute(2, 0, 1, 3).contiguous().view(tgt_len, bsz, self.embed_dim)
+        return self.out_proj(out), None
+class UtrLmFlashAttention2(UtrLmAttention):
+    """Flash Attention 2 via flash_attn (must be installed separately)."""
+    def forward(self, x, key_padding_mask, output_attentions: bool = False):
+        if output_attentions:
+            # Flash attention doesn't expose attention weights; fall back to eager.
+            return super().forward(x, key_padding_mask, output_attentions=True)
+        try:
+            from flash_attn import flash_attn_func
+            from flash_attn.bert_padding import pad_input, unpad_input
+        except ImportError as e:
+            raise ImportError("flash_attn is required for attn_implementation='flash_attention_2'. "
+                              "Install with: pip install flash-attn --no-build-isolation") from e
+        tgt_len, bsz, _ = x.size()
+        q, k, v = self._project(x)  # (B*H, T, head_dim)
+        # Reshape to (B, T, H, head_dim) - flash_attn's expected layout
+        q = q.view(bsz, self.num_heads, tgt_len, self.head_dim).permute(0, 2, 1, 3)
+        k = k.view(bsz, self.num_heads, tgt_len, self.head_dim).permute(0, 2, 1, 3)
+        v = v.view(bsz, self.num_heads, tgt_len, self.head_dim).permute(0, 2, 1, 3)
+        # Flash attention requires fp16 or bf16
+        orig_dtype = q.dtype
+        if orig_dtype not in (torch.float16, torch.bfloat16):
+            q, k, v = q.to(torch.bfloat16), k.to(torch.bfloat16), v.to(torch.bfloat16)
+        if key_padding_mask is not None:
+            # Unpad, run varlen flash attention, repad
+            from flash_attn import flash_attn_varlen_func
+            attention_mask = ~key_padding_mask  # True = valid token
+            q_unpad, indices, cu_seqlens, max_seqlen, _ = unpad_input(q, attention_mask)
+            k_unpad, _, _, _, _ = unpad_input(k, attention_mask)
+            v_unpad, _, _, _, _ = unpad_input(v, attention_mask)
+            out_unpad = flash_attn_varlen_func(
+                q_unpad, k_unpad, v_unpad,
+                cu_seqlens_q=cu_seqlens,
+                cu_seqlens_k=cu_seqlens,
+                max_seqlen_q=max_seqlen,
+                max_seqlen_k=max_seqlen,
+                softmax_scale=1.0,  # q already pre-scaled
+                causal=False,
+            )
+            out = pad_input(out_unpad, indices, bsz, tgt_len)
+        else:
+            out = flash_attn_func(q, k, v, softmax_scale=1.0, causal=False)
+        out = out.to(orig_dtype).permute(1, 0, 2, 3).contiguous().view(tgt_len, bsz, self.embed_dim)
+        return self.out_proj(out), None
+UTRLM_ATTENTION_CLASSES = {
+    "eager": UtrLmAttention,
+    "sdpa": UtrLmSdpaAttention,
+    "flash_attention_2": UtrLmFlashAttention2,
+}
+# ---------------------------------------------------------------------------
+# Transformer layer (pre-LN)
+# ---------------------------------------------------------------------------
+def _gelu(x):
+    return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))
+class UtrLmLayer(nn.Module):
+    def __init__(self, embed_dim: int, attention_heads: int, config: UtrLmConfig):
+        super().__init__()
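+        # Fall back to eager attention when the implementation is unset (None).
+        attn_impl = getattr(config, "_attn_implementation", None) or "eager"
+        attn_cls = UTRLM_ATTENTION_CLASSES[attn_impl]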
+        self.self_attn = attn_cls(embed_dim, attention_heads)
+        self.self_attn_layer_norm = nn.LayerNorm(embed_dim)
+        self.fc1 = nn.Linear(embed_dim, 4 * embed_dim)
+        self.fc2 = nn.Linear(4 * embed_dim, embed_dim)
+        self.final_layer_norm = nn.LayerNorm(embed_dim)
+    def forward(self, x, padding_mask, output_attentions: bool = False):
+        residual = x
+        x = self.self_attn_layer_norm(x)
+        x, attn_weights = self.self_attn(x, key_padding_mask=padding_mask, output_attentions=output_attentions)
+        x = residual + x
+        residual = x
+        x = self.final_layer_norm(x)
+        x = _gelu(self.fc1(x))
+        x = self.fc2(x)
+        return residual + x, attn_weights
+# ---------------------------------------------------------------------------
+# Backbone
+# ---------------------------------------------------------------------------
+class UtrLmModel(PreTrainedModel):
+    """
+    UTR-LM encoder backbone. Returns last_hidden_state (B, T, E).
+    The [CLS] token sits at position 0 (prepend_bos=True by default).
+    """
+    config_class = UtrLmConfig
+    base_model_prefix = "utrlm"
+    _supports_sdpa = True
+    _supports_flash_attn_2 = True
+    def __init__(self, config: UtrLmConfig):
+        super().__init__(config)
+        self.embed_scale = 1
+        self.embed_tokens = nn.Embedding(
+            config.alphabet_size, config.embed_dim, padding_idx=config.padding_idx
+        )
+        self.layers = nn.ModuleList(
+            [UtrLmLayer(config.embed_dim, config.attention_heads, config) for _ in range(config.num_layers)]
+        )
+        self.emb_layer_norm_after = nn.LayerNorm(config.embed_dim)
+        self.post_init()
+    def get_input_embeddings(self):
+        return self.embed_tokens
+    def set_input_embeddings(self, value):
+        self.embed_tokens = value
+    def forward(
+        self,
+        input_ids: torch.LongTensor,
+        attention_mask: Optional[torch.BoolTensor] = None,
+        output_hidden_states: Optional[bool] = None,
+        output_attentions: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> Union[Tuple, BaseModelOutput]:
+        output_hidden_states = (
+            output_hidden_states if output_hidden_states is not None
+            else self.config.output_hidden_states
+        )
+        output_attentions = (
+            output_attentions if output_attentions is not None else self.config.output_attentions
+        )
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+        cfg = self.config
+        # HF convention: attention_mask is 1=attend, 0=pad.
+        # Convert to bool padding_mask (True = ignore) or derive from input_ids.
+        if attention_mask is not None:
+            padding_mask = attention_mask.eq(0)
+        else:
+            padding_mask = input_ids.eq(cfg.padding_idx)
+        x = self.embed_scale * self.embed_tokens(input_ids)
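+        if cfg.token_dropout:
+            # ESM-style token dropout: zero the <mask> embeddings, then rescale the
+            # remaining embeddings so their expected magnitude matches training,
+            # where 15% * 80% = 12% of tokens were replaced by <mask>.
+            x.masked_fill_((input_ids == cfg.mask_idx).unsqueeze(-1), 0.0)
+            mask_ratio_train = 0.15 * 0.8
+            src_lengths = (~padding_mask).sum(-1)  # non-padding tokens per sequence
+            mask_ratio_observed = (input_ids == cfg.mask_idx).sum(-1).to(x.dtype) / src_lengths.to(x.dtype)
+            x = x * (1 - mask_ratio_train) / (1 - mask_ratio_observed)[:, None, None]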
+        if padding_mask is not None:
+            x = x * (1 - padding_mask.unsqueeze(-1).type_as(x))
+        all_hidden_states = () if output_hidden_states else None
+        all_attentions = () if output_attentions else None
+        if output_hidden_states:
+            all_hidden_states += (x,)
+        x = x.transpose(0, 1)  # (B, T, E) -> (T, B, E)
+        effective_padding = padding_mask if padding_mask.any() else None
+        for layer in self.layers:
+            x, attn_weights = layer(x, padding_mask=effective_padding, output_attentions=output_attentions)
+            if output_hidden_states:
+                all_hidden_states += (x.transpose(0, 1),)
+            if output_attentions:
+                all_attentions += (attn_weights,)
+        x = self.emb_layer_norm_after(x)
+        x = x.transpose(0, 1)  # (T, B, E) -> (B, T, E)
+        if output_hidden_states:
+            all_hidden_states = all_hidden_states[:-1] + (x,)
+        if not return_dict:
+            return tuple(v for v in [x, all_hidden_states, all_attentions] if v is not None)
+        return BaseModelOutput(
+            last_hidden_state=x,
+            hidden_states=all_hidden_states,
+            attentions=all_attentions,
+        )
+# ---------------------------------------------------------------------------
+# MLM head
+# ---------------------------------------------------------------------------
+class UtrLmForMaskedLM(PreTrainedModel):
+    """
+    UTR-LM with a masked-language-modelling head.
+    Returns MaskedLMOutput with logits (B, T, vocab_size).
+    """
+    config_class = UtrLmConfig
+    base_model_prefix = "utrlm"
+    _supports_sdpa = True
+    _supports_flash_attn_2 = True
+    def __init__(self, config: UtrLmConfig):
+        super().__init__(config)
+        self.utrlm = UtrLmModel(config)
+        embed_dim = config.embed_dim
+        vocab_size = config.alphabet_size
+        self.lm_head = nn.ModuleDict({
+            "dense": nn.Linear(embed_dim, embed_dim),
+            "layer_norm": nn.LayerNorm(embed_dim),
+        })
+        self.lm_head_bias = nn.Parameter(torch.zeros(vocab_size))
+        self.post_init()
+    def get_input_embeddings(self):
+        return self.utrlm.embed_tokens
+    def set_input_embeddings(self, value):
+        self.utrlm.embed_tokens = value
+    def get_output_embeddings(self):
+        return self.utrlm.embed_tokens
+    def set_output_embeddings(self, new_embeddings):
+        self.utrlm.embed_tokens = new_embeddings
+    def _lm_head_forward(self, x: torch.Tensor) -> torch.Tensor:
+        x = self.lm_head["dense"](x)
+        x = _gelu(x)
+        x = self.lm_head["layer_norm"](x)
+        return F.linear(x, self.utrlm.embed_tokens.weight) + self.lm_head_bias
+    def forward(
+        self,
+        input_ids: torch.LongTensor,
+        attention_mask: Optional[torch.BoolTensor] = None,
+        labels: Optional[torch.LongTensor] = None,
+        output_hidden_states: Optional[bool] = None,
+        output_attentions: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> Union[Tuple, MaskedLMOutput]:
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+        outputs = self.utrlm(
+            input_ids,
+            attention_mask=attention_mask,
+            output_hidden_states=output_hidden_states,
+            output_attentions=output_attentions,
+            return_dict=True,
+        )
+        logits = self._lm_head_forward(outputs.last_hidden_state)
+        loss = None
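+        if labels is not None:
+            # Ignore both HF-style -100 labels (e.g. from DataCollatorForLanguageModeling)
+            # and labels equal to the padding index.
+            labels = labels.masked_fill(labels == self.config.padding_idx, -100)
+            loss = F.cross_entropy(
+                logits.view(-1, self.config.alphabet_size),
+                labels.view(-1),
+                ignore_index=-100,
+            )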
+        if not return_dict:
+            output = (logits,) + outputs[1:]
+            return (loss,) + output if loss is not None else output
+        return MaskedLMOutput(
+            loss=loss,
+            logits=logits,
+            hidden_states=outputs.hidden_states,
+            attentions=outputs.attentions,
+        )
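
The token-dropout branch in `UtrLmModel.forward` rescales every embedding by `(1 - 0.12) / (1 - observed_mask_ratio)`, which means inputs without any `<mask>` token are still scaled by 0.88. The following pure-Python sketch reproduces only that scalar factor for one sequence; the function name `token_dropout_scale` and the example id lists are introduced here for illustration.

```python
MASK_IDX = 8                     # <mask> id from the vocabulary
MASK_RATIO_TRAIN = 0.15 * 0.8    # same constant as in UtrLmModel.forward


def token_dropout_scale(input_ids, attention_mask):
    """Scalar applied to all embeddings of one sequence (illustrative helper)."""
    n_real = sum(attention_mask)                        # non-padding tokens
    n_masked = sum(1 for tok in input_ids if tok == MASK_IDX)
    observed = n_masked / n_real
    return (1 - MASK_RATIO_TRAIN) / (1 - observed)


# <cls> A T G C <eos>: no masked position
plain_ids = [7, 3, 6, 4, 5, 1]
# <cls> A T <mask> C <eos>: one of six tokens masked
masked_ids = [7, 3, 6, 8, 5, 1]
attention = [1, 1, 1, 1, 1, 1]

plain_scale = token_dropout_scale(plain_ids, attention)    # 0.88 / 1.0
masked_scale = token_dropout_scale(masked_ids, attention)  # 0.88 / (5/6)
print(round(plain_scale, 4), round(masked_scale, 4))
```

Note that `<cls>` and `<eos>` count toward the sequence length, so the observed ratio for a single masked nucleotide is slightly lower than one over the number of nucleotides.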

pytorch_model.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:33e7cfeb0d8b44636ee45e87de2c8af59f114abc963378eabe40c41073654e63
+size 4866715

special_tokens_map.json ADDED

	@@ -0,0 +1,8 @@

+{
+  "cls_token": "<cls>",
+  "eos_token": "<eos>",
+  "mask_token": "<mask>",
+  "pad_token": "<pad>",
+  "sep_token": "<sep>",
+  "unk_token": "<unk>"
+}

tokenization_utrlm.py ADDED

	@@ -0,0 +1,128 @@

+"""Character-level RNA tokenizer for UTR-LM."""
+import json
+import os
+from typing import Dict, List, Optional, Tuple
+from transformers import PreTrainedTokenizer
+# Canonical vocab - fixed; never changes across checkpoints.
+_VOCAB: Dict[str, int] = {
+    "<pad>": 0,
+    "<eos>": 1,
+    "<unk>": 2,
+    "A": 3,
+    "G": 4,
+    "C": 5,
+    "T": 6,
+    "<cls>": 7,
+    "<mask>": 8,
+    "<sep>": 9,
+}
+_IDS_TO_TOKENS: Dict[int, str] = {v: k for k, v in _VOCAB.items()}
+class UtrLmTokenizer(PreTrainedTokenizer):
+    """
+    Character-level tokenizer for UTR-LM RNA sequences.
+    Each nucleotide (A / G / C / T) maps to a single token.
+    Sequences are automatically wrapped with <cls> ... <eos> on encoding.
+    Example::
+        tok = UtrLmTokenizer()
+        enc = tok("ATGCATG", return_tensors="pt")
+        # enc.input_ids: [[7, 3, 6, 4, 5, 3, 6, 4, 1]]
+        #                  CLS A T G C A T G EOS
+    """
+    vocab_files_names = {"vocab_file": "vocab.json"}
+    model_input_names = ["input_ids", "attention_mask"]
+    def __init__(
+        self,
+        vocab_file: Optional[str] = None,
+        cls_token: str = "<cls>",
+        pad_token: str = "<pad>",
+        mask_token: str = "<mask>",
+        eos_token: str = "<eos>",
+        unk_token: str = "<unk>",
+        sep_token: str = "<sep>",
+        **kwargs,
+    ):
+        # Build vocab from file if provided (allows future extension), else use default
+        if vocab_file is not None and os.path.isfile(vocab_file):
+            with open(vocab_file) as f:
+                self._vocab = json.load(f)
+        else:
+            self._vocab = dict(_VOCAB)
+        self._ids_to_tokens = {v: k for k, v in self._vocab.items()}
+        super().__init__(
+            cls_token=cls_token,
+            pad_token=pad_token,
+            mask_token=mask_token,
+            eos_token=eos_token,
+            unk_token=unk_token,
+            sep_token=sep_token,
+            **kwargs,
+        )
+    # ------------------------------------------------------------------
+    # Required overrides
+    # ------------------------------------------------------------------
+    @property
+    def vocab_size(self) -> int:
+        return len(self._vocab)
+    def get_vocab(self) -> Dict[str, int]:
+        return dict(self._vocab)
+    def _tokenize(self, text: str) -> List[str]:
+        """Split sequence into individual characters."""
+        return list(text)
+    def _convert_token_to_id(self, token: str) -> int:
+        return self._vocab.get(token, self._vocab["<unk>"])
+    def _convert_id_to_token(self, index: int) -> str:
+        return self._ids_to_tokens.get(index, "<unk>")
+    def save_vocabulary(
+        self, save_directory: str, filename_prefix: Optional[str] = None
+    ) -> Tuple[str]:
+        os.makedirs(save_directory, exist_ok=True)
+        fname = (filename_prefix + "-" if filename_prefix else "") + "vocab.json"
+        path = os.path.join(save_directory, fname)
+        with open(path, "w") as f:
+            json.dump(self._vocab, f, indent=2)
+        return (path,)
+    # ------------------------------------------------------------------
+    # Special-token wrapping: prepend [CLS], append [EOS]
+    # ------------------------------------------------------------------
+    def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
+        cls = [self.cls_token_id]
+        eos = [self.eos_token_id]
+        if token_ids_1 is None:
+            return cls + token_ids_0 + eos
+        return cls + token_ids_0 + eos + cls + token_ids_1 + eos
+    def get_special_tokens_mask(self, token_ids_0, token_ids_1=None,
+                                already_has_special_tokens=False):
+        if already_has_special_tokens:
+            return super().get_special_tokens_mask(
+                token_ids_0, token_ids_1, already_has_special_tokens=True
+            )
+        mask = [1] + [0] * len(token_ids_0) + [1]
+        if token_ids_1 is not None:
+            mask += [1] + [0] * len(token_ids_1) + [1]
+        return mask
+    def create_token_type_ids_from_sequences(self, token_ids_0, token_ids_1=None):
+        # Single segment type: one 0 per token, including the added <cls>/<eos>.
+        if token_ids_1 is None:
+            return [0] * (len(token_ids_0) + 2)
+        return [0] * (len(token_ids_0) + len(token_ids_1) + 4)
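
The tokenizer's behaviour can be summarised without `transformers`: each character is looked up in the vocabulary, unknown characters fall back to `<unk>`, and the result is wrapped in `<cls>` ... `<eos>`. The sketch below is a standalone re-implementation for illustration only. `encode`, `normalize`, and `pad_batch` are names introduced here, and unlike the real class it does not recognise special-token strings such as `<mask>` inside the input. It also shows why RNA written with `U` or in lowercase should be converted first, since the vocabulary only contains uppercase `A`, `G`, `C`, `T`.

```python
VOCAB = {
    "<pad>": 0, "<eos>": 1, "<unk>": 2,
    "A": 3, "G": 4, "C": 5, "T": 6,
    "<cls>": 7, "<mask>": 8, "<sep>": 9,
}


def encode(seq):
    """Character-level lookup wrapped in <cls> ... <eos>."""
    ids = [VOCAB.get(ch, VOCAB["<unk>"]) for ch in seq]
    return [VOCAB["<cls>"]] + ids + [VOCAB["<eos>"]]


def normalize(seq):
    """Hypothetical pre-processing: uppercase and map U to T."""
    return seq.upper().replace("U", "T")


def pad_batch(batch):
    """Right-pad id lists with <pad> and build the matching attention mask."""
    width = max(len(ids) for ids in batch)
    input_ids = [ids + [VOCAB["<pad>"]] * (width - len(ids)) for ids in batch]
    attention_mask = [[1] * len(ids) + [0] * (width - len(ids)) for ids in batch]
    return input_ids, attention_mask


print(encode("ATGCATG"))            # [7, 3, 6, 4, 5, 3, 6, 4, 1]
print(encode("AUG"))                # [7, 3, 2, 4, 1]  (U is unknown)
print(encode(normalize("aug")))     # [7, 3, 6, 4, 1]

input_ids, attention_mask = pad_batch([encode("AT"), encode("ATGC")])
```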

tokenizer_config.json ADDED

	@@ -0,0 +1,62 @@

+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "<pad>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "<eos>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "<unk>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "7": {
+      "content": "<cls>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "8": {
+      "content": "<mask>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "9": {
+      "content": "<sep>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "clean_up_tokenization_spaces": false,
+  "cls_token": "<cls>",
+  "eos_token": "<eos>",
+  "extra_special_tokens": {},
+  "mask_token": "<mask>",
+  "model_max_length": 1024,
+  "pad_token": "<pad>",
+  "sep_token": "<sep>",
+  "tokenizer_class": "UtrLmTokenizer",
+  "unk_token": "<unk>"
+}

vocab.json ADDED

	@@ -0,0 +1,12 @@

+{
+  "<pad>": 0,
+  "<eos>": 1,
+  "<unk>": 2,
+  "A": 3,
+  "G": 4,
+  "C": 5,
+  "T": 6,
+  "<cls>": 7,
+  "<mask>": 8,
+  "<sep>": 9
+}
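
Because the 10-entry vocabulary mixes nucleotides with special tokens, an unrestricted argmax over MLM logits can return a special token. One possible way to read out a masked position is to restrict the argmax to the nucleotide ids 3 to 6, sketched here in plain Python. The logits are made-up numbers rather than model output, and `best_nucleotide` is a hypothetical helper.

```python
ID_TO_TOKEN = {
    0: "<pad>", 1: "<eos>", 2: "<unk>",
    3: "A", 4: "G", 5: "C", 6: "T",
    7: "<cls>", 8: "<mask>", 9: "<sep>",
}
NUCLEOTIDE_IDS = [3, 4, 5, 6]


def best_nucleotide(logits_row):
    """Return the highest-scoring nucleotide, ignoring special tokens."""
    best_id = max(NUCLEOTIDE_IDS, key=lambda i: logits_row[i])
    return ID_TO_TOKEN[best_id]


# Made-up logits for one masked position (length 10, one value per vocab id).
made_up_logits = [-4.0, 2.5, -3.0, 0.1, 1.7, 0.3, -0.2, -1.0, -5.0, -6.0]

unrestricted = ID_TO_TOKEN[max(range(10), key=lambda i: made_up_logits[i])]
predicted = best_nucleotide(made_up_logits)
print(unrestricted, predicted)      # <eos> G
```

With real model output, `logits_row` would be the slice of `logits` at the masked position, converted to a Python list.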