Upload folder using huggingface_hub

Files changed (9)

README.md +150 -0
config.json +25 -0
configuration_rnaernie2.py +40 -0
model.safetensors +3 -0
modeling_rnaernie2.py +429 -0
special_tokens_map.json +9 -0
tokenization_rnaernie2.py +107 -0
tokenizer_config.json +17 -0
vocab.txt +11 -0

README.md ADDED

	@@ -0,0 +1,150 @@

+---
+language:
+- rna
+library_name: transformers
+tags:
+- RNA
+- language-model
+license: apache-2.0
+---
+# RNAErnie2
+RNAErnie2 is a BERT-based RNA language model trained from scratch on a large-scale RNA
+sequence dataset with up to 2048-nucleotide context length. It is a retrained successor
+to RNAErnie that replaces the PaddlePaddle-based ERNIE backbone with a standard PyTorch
+BERT architecture, extends the pretraining corpus to RNACentral v22 (~31M sequences,
+length <= 2048), and switches to an RNA-native vocabulary (U instead of T).
+## Architecture
+| Parameter | Value |
+|---|---|
+| Layers | 12 |
+| Attention heads | 12 |
+| Embedding dimension | 768 |
+| Intermediate size | 3072 |
+| Vocabulary size | 11 |
+| Positional encoding | Absolute learned |
+| Architecture | Post-LN BERT / BertForMaskedLM |
+| Max sequence length | 2048 |
+**Vocabulary:** `[PAD]=0, [UNK]=1, [CLS]=2, [EOS]=3, [SEP]=4, [MASK]=5, A=6, U=7, C=8, G=9, N=10`
+## Pretraining
+- **Objective:** Masked language modelling (MLM)
+- **Data:** RNACentral v22, ~31 million RNA sequences with length <= 2048
+- **Source checkpoint:** [`LLM-EDA/RNAErnie`](https://huggingface.co/LLM-EDA/RNAErnie) on HuggingFace Hub
+- **Tokenisation note:** Sequences use U (not T). The tokenizer uppercases its input and silently converts T to U.
+### Checkpoint selection
+There is a single publicly released RNAErnie2 checkpoint. The weights are taken from
+[`LLM-EDA/RNAErnie`](https://huggingface.co/LLM-EDA/RNAErnie) with one minor
+adjustment: `cls.predictions.decoder.bias` is stored explicitly (it was implicitly
+tied to `cls.predictions.bias` in the original save and was absent from the file).
+## Parity Verification
+Hidden-state representations and MLM logits match the original `BertForMaskedLM`
+(max abs diff < 2e-5) at all 13 representation levels (embedding output + 12 encoder layers).
+Verified on GPU with PyTorch 2.7 / CUDA 12.
+## Implementation Notes
+Custom BERT implementation (`modeling_rnaernie2.py`) with eager, SDPA, and Flash
+Attention 2 backends, following the architecture of
+[`Taykhoom/BERT-updated`](https://huggingface.co/Taykhoom/BERT-updated).
+The original [`LLM-EDA/RNAErnie`](https://huggingface.co/LLM-EDA/RNAErnie) used
+standard HF BERT with no custom attention backends.
+## Related Models
+See the full [RNAErnie collection](<COLLECTION_URL>).
+| Model | Context | Training data | Notes |
+|---|---|---|---|
+| [RNAErnie](../RNAErnie) | 512 | RNACentral (nts<=512) | Original; PaddlePaddle backbone |
+| **[RNAErnie2](./)** | **2048** | **RNACentral v22 (~31M seqs)** | **This model; PyTorch BERT** |
+## Usage
+### Embedding generation
+```python
+import torch
+from transformers import AutoTokenizer, AutoModel
+tokenizer = AutoTokenizer.from_pretrained("Taykhoom/RNAErnie2", trust_remote_code=True)
+model = AutoModel.from_pretrained("Taykhoom/RNAErnie2", trust_remote_code=True)
+model.eval()
+sequences = ["AUGCAUGCAUGC", "GCUGCAUGCUAGC"]
+enc = tokenizer(sequences, return_tensors="pt", padding=True)
+with torch.no_grad():
+    out = model(**enc)
+cls_emb   = out.last_hidden_state[:, 0, :]  # (batch, 768) -- CLS token
+token_emb = out.last_hidden_state           # (batch, seq_len, 768)
+with torch.no_grad():  # intermediate layers; hidden_states[0] is the embedding output
+    out_all = model(**enc, output_hidden_states=True)
+layer6_emb = out_all.hidden_states[6]       # (batch, seq_len, 768)
+```
+### MLM logits
+```python
+import torch
+from transformers import AutoTokenizer, AutoModelForMaskedLM
+tokenizer = AutoTokenizer.from_pretrained("Taykhoom/RNAErnie2", trust_remote_code=True)
+model = AutoModelForMaskedLM.from_pretrained("Taykhoom/RNAErnie2", trust_remote_code=True)
+model.eval()
+enc = tokenizer(["AUG[MASK]AUG"], return_tensors="pt")
+with torch.no_grad():
+    logits = model(**enc).logits  # (1, seq_len, 11)
+```
+### SDPA / Flash Attention 2
+```python
+model = AutoModel.from_pretrained(
+    "Taykhoom/RNAErnie2",
+    attn_implementation="sdpa",   # or "flash_attention_2"
+    trust_remote_code=True,
+)
+```
+### Fine-tuning
+Fine-tuning follows standard HF conventions. For sequence-level tasks, use the CLS token embedding
+(`last_hidden_state[:, 0, :]`) as input to a classification head.
+## Citation
+```bibtex
+@article{wang2024_rnaernie,
+  title   = {Multi-purpose {RNA} language modelling with motif-aware pretraining and type-guided fine-tuning},
+  author  = {Wang, Ning and Bian, Jiang and Li, Yuchen and Li, Xuhong and Mumtaz, Shahid and Kong, Linghe and Xiong, Haoyi},
+  journal = {Nature Machine Intelligence},
+  volume  = {6},
+  pages   = {548--557},
+  year    = {2024},
+  doi     = {10.1038/s42256-024-00836-4}
+}
+```
+## Credits
+Original model and code by Wang et al. Source: [GitHub](https://github.com/CatIIIIIIII/RNAErnie) /
+[HuggingFace](https://huggingface.co/LLM-EDA/RNAErnie).
+The HF conversion code was authored primarily by [Claude Code](https://claude.ai/code)
+and reviewed manually by Taykhoom Dalal.
+## License
+Apache 2.0, following the original repository.
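The README's MLM example stops at the raw `(1, seq_len, 11)` logits. As a rough illustration of how one position's logits could be mapped back to a nucleotide, here is a pure-Python sketch. `decode_masked_position` is a hypothetical helper that is not part of this commit, the logit values are made up, and restricting the softmax to A/U/C/G is a design assumption of the sketch rather than something the model card prescribes.

```python
import math

# Vocabulary order copied from the README / vocab.txt in this commit.
VOCAB = ["[PAD]", "[UNK]", "[CLS]", "[EOS]", "[SEP]", "[MASK]", "A", "U", "C", "G", "N"]
NUCLEOTIDES = ("A", "U", "C", "G")


def decode_masked_position(logits_row):
    """Return (best_nucleotide, probabilities) for one position's 11 logits.

    Hypothetical helper: the softmax is taken over A/U/C/G only, so special
    tokens and N can never be selected (an assumption of this sketch).
    """
    if len(logits_row) != len(VOCAB):
        raise ValueError("expected one logit per vocabulary entry")
    picked = {nt: logits_row[VOCAB.index(nt)] for nt in NUCLEOTIDES}
    peak = max(picked.values())  # subtract the max for numerical stability
    exps = {nt: math.exp(value - peak) for nt, value in picked.items()}
    total = sum(exps.values())
    probs = {nt: e / total for nt, e in exps.items()}
    return max(probs, key=probs.get), probs


# Made-up logits for a single [MASK] position (index order follows VOCAB).
example_logits = [-9.0, -9.0, -9.0, -9.0, -9.0, -9.0, 0.5, 0.1, 2.0, 0.3, -3.0]
best, probs = decode_masked_position(example_logits)
```

With the real model, the input row would presumably come from `logits[0, mask_index].tolist()`, where `mask_index` is the position whose input id is 5 (`[MASK]`).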

config.json ADDED

	@@ -0,0 +1,25 @@

+{
+  "architectures": [
+    "RNAErnie2ForMaskedLM"
+  ],
+  "model_type": "rnaernie2",
+  "auto_map": {
+    "AutoConfig": "configuration_rnaernie2.RNAErnie2Config",
+    "AutoModel": "modeling_rnaernie2.RNAErnie2Model",
+    "AutoModelForMaskedLM": "modeling_rnaernie2.RNAErnie2ForMaskedLM"
+  },
+  "vocab_size": 11,
+  "hidden_size": 768,
+  "num_hidden_layers": 12,
+  "num_attention_heads": 12,
+  "intermediate_size": 3072,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "attention_probs_dropout_prob": 0.1,
+  "max_position_embeddings": 2048,
+  "type_vocab_size": 2,
+  "layer_norm_eps": 1e-05,
+  "pad_token_id": 0,
+  "initializer_range": 0.02,
+  "transformers_version": "4.57.6"
+}
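The values in this config determine the model's parameter count. The sketch below works through that arithmetic for a standard BERT layout (learned position embeddings, biased linear layers, decoder weight tied to the word embeddings). `count_parameters` is a hypothetical helper, and the layout it assumes is read off the modeling file in this commit, not stated by the config itself.

```python
def count_parameters(cfg, include_pooler=False):
    """Count trainable parameters implied by a BERT-style config dict."""
    h = cfg["hidden_size"]
    inter = cfg["intermediate_size"]

    def linear(n_in, n_out):
        return n_in * n_out + n_out  # weight matrix + bias vector

    layer_norm = 2 * h  # weight + bias

    embeddings = (
        cfg["vocab_size"] * h
        + cfg["max_position_embeddings"] * h
        + cfg["type_vocab_size"] * h
        + layer_norm
    )
    per_layer = (
        3 * linear(h, h)                # query, key, value projections
        + linear(h, h) + layer_norm     # attention output dense + LayerNorm
        + linear(h, inter)              # feed-forward expansion
        + linear(inter, h) + layer_norm # feed-forward projection + LayerNorm
    )
    encoder = cfg["num_hidden_layers"] * per_layer
    pooler = linear(h, h) if include_pooler else 0
    # MLM head: transform dense + LayerNorm + vocab-sized bias (decoder weight is tied).
    mlm_head = linear(h, h) + layer_norm + cfg["vocab_size"]
    return {
        "embeddings": embeddings,
        "per_layer": per_layer,
        "encoder": encoder,
        "pooler": pooler,
        "mlm_head": mlm_head,
        "total": embeddings + encoder + pooler + mlm_head,
    }


config = {
    "vocab_size": 11,
    "hidden_size": 768,
    "num_hidden_layers": 12,
    "intermediate_size": 3072,
    "max_position_embeddings": 2048,
    "type_vocab_size": 2,
}
counts = count_parameters(config)
```

Under these assumptions the total is about 87.2M parameters without the pooler, which at 4 bytes per value is on the order of 349 MB. That is close to the 348,947,640-byte size in the LFS pointer for `model.safetensors`, assuming float32 storage, which the commit does not state.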

configuration_rnaernie2.py ADDED

	@@ -0,0 +1,40 @@

+from transformers import PretrainedConfig
+class RNAErnie2Config(PretrainedConfig):
+    model_type = "rnaernie2"
+    auto_map = {
+        "AutoConfig": "configuration_rnaernie2.RNAErnie2Config",
+        "AutoModel": "modeling_rnaernie2.RNAErnie2Model",
+        "AutoModelForMaskedLM": "modeling_rnaernie2.RNAErnie2ForMaskedLM",
+    }
+    def __init__(
+        self,
+        vocab_size: int = 11,
+        hidden_size: int = 768,
+        num_hidden_layers: int = 12,
+        num_attention_heads: int = 12,
+        intermediate_size: int = 3072,
+        hidden_act: str = "gelu",
+        hidden_dropout_prob: float = 0.1,
+        attention_probs_dropout_prob: float = 0.1,
+        max_position_embeddings: int = 2048,
+        type_vocab_size: int = 2,
+        layer_norm_eps: float = 1e-5,
+        pad_token_id: int = 0,
+        **kwargs,
+    ):
+        super().__init__(pad_token_id=pad_token_id, **kwargs)
+        self.vocab_size = vocab_size
+        self.hidden_size = hidden_size
+        self.num_hidden_layers = num_hidden_layers
+        self.num_attention_heads = num_attention_heads
+        self.intermediate_size = intermediate_size
+        self.hidden_act = hidden_act
+        self.hidden_dropout_prob = hidden_dropout_prob
+        self.attention_probs_dropout_prob = attention_probs_dropout_prob
+        self.max_position_embeddings = max_position_embeddings
+        self.type_vocab_size = type_vocab_size
+        self.layer_norm_eps = layer_norm_eps

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:5642f83f12205dcd729145578b1ab0d78e3124335fe9d13450ab363295456b33
+size 348947640
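The three lines above are a Git LFS pointer, not the weights themselves. As a minimal sketch of how a downloaded `model.safetensors` could be checked against that pointer, the code below parses the pointer text and compares size and digest. `parse_lfs_pointer` and `file_matches_pointer` are hypothetical helper names, and the local file path is left to the caller.

```python
import hashlib

# Pointer contents copied from the diff above.
POINTER_TEXT = """version https://git-lfs.github.com/spec/v1
oid sha256:5642f83f12205dcd729145578b1ab0d78e3124335fe9d13450ab363295456b33
size 348947640
"""


def parse_lfs_pointer(text):
    """Parse 'key value' lines of a Git LFS pointer into a small dict."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, _, value = line.partition(" ")
        fields[key] = value
    algorithm, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algorithm": algorithm,
        "digest": digest,
        "size": int(fields["size"]),
    }


def file_matches_pointer(path, pointer, chunk_size=1 << 20):
    """Return True if the file at `path` has the pointer's size and digest."""
    hasher = hashlib.new(pointer["algorithm"])
    size = 0
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
            size += len(chunk)
    return size == pointer["size"] and hasher.hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(POINTER_TEXT)
```

Reading in chunks keeps memory use flat for a file of this size.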

modeling_rnaernie2.py ADDED

	@@ -0,0 +1,429 @@

+import math
+from typing import Optional, Tuple, Union
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from transformers import PreTrainedModel
+from transformers.modeling_outputs import BaseModelOutputWithPooling, MaskedLMOutput
+try:
+    from .configuration_rnaernie2 import RNAErnie2Config
+except ImportError:
+    from configuration_rnaernie2 import RNAErnie2Config
+# ---------------------------------------------------------------------------
+# Attention variants
+# ---------------------------------------------------------------------------
+class RNAErnie2SelfAttention(nn.Module):
+    def __init__(self, config: RNAErnie2Config):
+        super().__init__()
+        self.num_attention_heads = config.num_attention_heads
+        self.attention_head_size = config.hidden_size // config.num_attention_heads
+        self.all_head_size = self.num_attention_heads * self.attention_head_size
+        self.query = nn.Linear(config.hidden_size, self.all_head_size)
+        self.key = nn.Linear(config.hidden_size, self.all_head_size)
+        self.value = nn.Linear(config.hidden_size, self.all_head_size)
+        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
+    def _split_heads(self, x: torch.Tensor) -> torch.Tensor:
+        B, T, _ = x.shape
+        return x.view(B, T, self.num_attention_heads, self.attention_head_size).permute(0, 2, 1, 3)
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        key_padding_mask: Optional[torch.Tensor] = None,
+        output_attentions: bool = False,
+    ) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
+        q = self._split_heads(self.query(hidden_states))
+        k = self._split_heads(self.key(hidden_states))
+        v = self._split_heads(self.value(hidden_states))
+        scale = math.sqrt(self.attention_head_size)
+        scores = torch.matmul(q, k.transpose(-1, -2)) / scale
+        if key_padding_mask is not None:
+            scores = scores.masked_fill(key_padding_mask[:, None, None, :], float("-inf"))
+        probs = F.softmax(scores, dim=-1)
+        probs = self.dropout(probs)
+        context = torch.matmul(probs, v)
+        B, _, T, _ = context.shape
+        context = context.permute(0, 2, 1, 3).contiguous().view(B, T, self.all_head_size)
+        if output_attentions:
+            return context, probs
+        return context, None
+class RNAErnie2SdpaSelfAttention(RNAErnie2SelfAttention):
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        key_padding_mask: Optional[torch.Tensor] = None,
+        output_attentions: bool = False,
+    ) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
+        if output_attentions:
+            return super().forward(hidden_states, key_padding_mask, output_attentions=True)
+        B, T, _ = hidden_states.shape
+        q = self._split_heads(self.query(hidden_states))
+        k = self._split_heads(self.key(hidden_states))
+        v = self._split_heads(self.value(hidden_states))
+        attn_mask = None
+        if key_padding_mask is not None:
+            attn_mask = torch.zeros(B, 1, 1, T, dtype=q.dtype, device=q.device)
+            attn_mask = attn_mask.masked_fill(key_padding_mask[:, None, None, :], float("-inf"))
+        context = F.scaled_dot_product_attention(
+            q, k, v, attn_mask=attn_mask,
+            dropout_p=self.dropout.p if self.training else 0.0,  # match eager-path attention dropout
+        )
+        context = context.permute(0, 2, 1, 3).contiguous().view(B, T, self.all_head_size)
+        return context, None
+class RNAErnie2FlashSelfAttention(RNAErnie2SelfAttention):
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        key_padding_mask: Optional[torch.Tensor] = None,
+        output_attentions: bool = False,
+    ) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
+        if output_attentions:
+            return super().forward(hidden_states, key_padding_mask, output_attentions=True)
+        try:
+            from flash_attn import flash_attn_func, flash_attn_varlen_func
+            from flash_attn.bert_padding import pad_input, unpad_input
+        except ImportError as e:
+            raise ImportError(
+                "flash_attn is required for attn_implementation='flash_attention_2'. "
+                "Install with: pip install flash-attn --no-build-isolation"
+            ) from e
+        B, T, _ = hidden_states.shape
+        q = self._split_heads(self.query(hidden_states))
+        k = self._split_heads(self.key(hidden_states))
+        v = self._split_heads(self.value(hidden_states))
+        q = q.permute(0, 2, 1, 3)
+        k = k.permute(0, 2, 1, 3)
+        v = v.permute(0, 2, 1, 3)
+        orig_dtype = q.dtype
+        if orig_dtype not in (torch.float16, torch.bfloat16):
+            q, k, v = q.to(torch.bfloat16), k.to(torch.bfloat16), v.to(torch.bfloat16)
+        if key_padding_mask is not None and key_padding_mask.any():
+            attend = ~key_padding_mask
+            q_u, indices, cu_seqlens, max_seqlen, _ = unpad_input(q, attend)
+            k_u, _, _, _, _ = unpad_input(k, attend)
+            v_u, _, _, _, _ = unpad_input(v, attend)
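+            out_u = flash_attn_varlen_func(
+                q_u, k_u, v_u,
+                cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
+                max_seqlen_q=max_seqlen, max_seqlen_k=max_seqlen,
+                dropout_p=self.dropout.p if self.training else 0.0,  # match eager-path attention dropout
+                causal=False,
+            )
+            out = pad_input(out_u, indices, B, T)
+        else:
+            out = flash_attn_func(
+                q, k, v,
+                dropout_p=self.dropout.p if self.training else 0.0,
+                causal=False,
+            )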
+        out = out.to(orig_dtype).reshape(B, T, self.all_head_size)
+        return out, None
+RNAERNIE2_SELF_ATTENTION_CLASSES = {
+    "eager": RNAErnie2SelfAttention,
+    "sdpa": RNAErnie2SdpaSelfAttention,
+    "flash_attention_2": RNAErnie2FlashSelfAttention,
+}
+# ---------------------------------------------------------------------------
+# Layer components -- attribute names match BertForMaskedLM weight keys exactly
+# ---------------------------------------------------------------------------
+class RNAErnie2SelfOutput(nn.Module):
+    def __init__(self, config: RNAErnie2Config):
+        super().__init__()
+        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
+        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
+        self.dropout = nn.Dropout(config.hidden_dropout_prob)
+    def forward(self, hidden_states: torch.Tensor, input_tensor: torch.Tensor) -> torch.Tensor:
+        hidden_states = self.dropout(self.dense(hidden_states))
+        return self.LayerNorm(hidden_states + input_tensor)
+class RNAErnie2Attention(nn.Module):
+    def __init__(self, config: RNAErnie2Config):
+        super().__init__()
+        attn_cls = RNAERNIE2_SELF_ATTENTION_CLASSES[getattr(config, "_attn_implementation", "eager")]
+        self.self = attn_cls(config)
+        self.output = RNAErnie2SelfOutput(config)
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        key_padding_mask: Optional[torch.Tensor],
+        output_attentions: bool = False,
+    ) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
+        self_out, attn_weights = self.self(hidden_states, key_padding_mask, output_attentions)
+        return self.output(self_out, hidden_states), attn_weights
+class RNAErnie2Intermediate(nn.Module):
+    def __init__(self, config: RNAErnie2Config):
+        super().__init__()
+        self.dense = nn.Linear(config.hidden_size, config.intermediate_size)
+        self.act = nn.GELU() if config.hidden_act == "gelu" else nn.ReLU()
+    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
+        return self.act(self.dense(hidden_states))
+class RNAErnie2Output(nn.Module):
+    def __init__(self, config: RNAErnie2Config):
+        super().__init__()
+        self.dense = nn.Linear(config.intermediate_size, config.hidden_size)
+        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
+        self.dropout = nn.Dropout(config.hidden_dropout_prob)
+    def forward(self, hidden_states: torch.Tensor, input_tensor: torch.Tensor) -> torch.Tensor:
+        hidden_states = self.dropout(self.dense(hidden_states))
+        return self.LayerNorm(hidden_states + input_tensor)
+class RNAErnie2Layer(nn.Module):
+    def __init__(self, config: RNAErnie2Config):
+        super().__init__()
+        self.attention = RNAErnie2Attention(config)
+        self.intermediate = RNAErnie2Intermediate(config)
+        self.output = RNAErnie2Output(config)
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        key_padding_mask: Optional[torch.Tensor],
+        output_attentions: bool = False,
+    ) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
+        attn_out, attn_weights = self.attention(hidden_states, key_padding_mask, output_attentions)
+        return self.output(self.intermediate(attn_out), attn_out), attn_weights
+class RNAErnie2Encoder(nn.Module):
+    def __init__(self, config: RNAErnie2Config):
+        super().__init__()
+        self.layer = nn.ModuleList([RNAErnie2Layer(config) for _ in range(config.num_hidden_layers)])
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        key_padding_mask: Optional[torch.Tensor],
+        output_hidden_states: bool = False,
+        output_attentions: bool = False,
+    ) -> Tuple:
+        all_hidden_states = (hidden_states,) if output_hidden_states else None
+        all_attentions = () if output_attentions else None
+        for layer in self.layer:
+            hidden_states, attn_weights = layer(hidden_states, key_padding_mask, output_attentions)
+            if output_hidden_states:
+                all_hidden_states = all_hidden_states + (hidden_states,)
+            if output_attentions:
+                all_attentions = all_attentions + (attn_weights,)
+        return hidden_states, all_hidden_states, all_attentions
+# ---------------------------------------------------------------------------
+# Embeddings and pooler
+# ---------------------------------------------------------------------------
+class RNAErnie2Embeddings(nn.Module):
+    def __init__(self, config: RNAErnie2Config):
+        super().__init__()
+        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id)
+        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
+        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)
+        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
+        self.dropout = nn.Dropout(config.hidden_dropout_prob)
+        self.register_buffer("position_ids", torch.arange(config.max_position_embeddings).expand((1, -1)), persistent=False)
+    def forward(self, input_ids: torch.LongTensor, token_type_ids: Optional[torch.LongTensor] = None) -> torch.Tensor:
+        B, T = input_ids.shape
+        if token_type_ids is None:
+            token_type_ids = torch.zeros_like(input_ids)
+        x = self.word_embeddings(input_ids)
+        x = x + self.position_embeddings(self.position_ids[:, :T])
+        x = x + self.token_type_embeddings(token_type_ids)
+        return self.dropout(self.LayerNorm(x))
+class RNAErnie2Pooler(nn.Module):
+    def __init__(self, config: RNAErnie2Config):
+        super().__init__()
+        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
+        self.activation = nn.Tanh()
+    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
+        return self.activation(self.dense(hidden_states[:, 0]))
+# ---------------------------------------------------------------------------
+# MLM prediction head -- key names match original BertForMaskedLM exactly:
+#   cls.predictions.bias
+#   cls.predictions.transform.dense.{weight,bias}
+#   cls.predictions.transform.LayerNorm.{weight,bias}
+#   cls.predictions.decoder.weight  (tied to word_embeddings)
+# ---------------------------------------------------------------------------
+class RNAErnie2PredictionHeadTransform(nn.Module):
+    def __init__(self, config: RNAErnie2Config):
+        super().__init__()
+        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
+        self.act = nn.GELU() if config.hidden_act == "gelu" else nn.ReLU()
+        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
+    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
+        return self.LayerNorm(self.act(self.dense(hidden_states)))
+class RNAErnie2LMPredictionHead(nn.Module):
+    def __init__(self, config: RNAErnie2Config):
+        super().__init__()
+        self.transform = RNAErnie2PredictionHeadTransform(config)
+        self.decoder = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
+        self.bias = nn.Parameter(torch.zeros(config.vocab_size))
+        self.decoder.bias = self.bias
+    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
+        return self.decoder(self.transform(hidden_states))
+class RNAErnie2OnlyMLMHead(nn.Module):
+    def __init__(self, config: RNAErnie2Config):
+        super().__init__()
+        self.predictions = RNAErnie2LMPredictionHead(config)
+    def forward(self, sequence_output: torch.Tensor) -> torch.Tensor:
+        return self.predictions(sequence_output)
+# ---------------------------------------------------------------------------
+# Top-level models
+# ---------------------------------------------------------------------------
+class RNAErnie2Model(PreTrainedModel):
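+    config_class = RNAErnie2Config
+    base_model_prefix = "bert"  # lets from_pretrained strip the "bert." key prefix of the MLM checkpoint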
+    _supports_sdpa = True
+    _supports_flash_attn_2 = True
+    def __init__(self, config: RNAErnie2Config):
+        super().__init__(config)
+        self.embeddings = RNAErnie2Embeddings(config)
+        self.encoder = RNAErnie2Encoder(config)
+        self.pooler = RNAErnie2Pooler(config)
+        self.post_init()
+    def get_input_embeddings(self):
+        return self.embeddings.word_embeddings
+    def set_input_embeddings(self, value):
+        self.embeddings.word_embeddings = value
+    def forward(
+        self,
+        input_ids: torch.LongTensor,
+        attention_mask: Optional[torch.Tensor] = None,
+        token_type_ids: Optional[torch.LongTensor] = None,
+        output_hidden_states: Optional[bool] = None,
+        output_attentions: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> Union[Tuple, BaseModelOutputWithPooling]:
+        output_hidden_states = output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+        if attention_mask is None:
+            attention_mask = torch.ones_like(input_ids)
+        key_padding_mask = attention_mask.eq(0)  # True where a position is padding
+        if not key_padding_mask.any():
+            key_padding_mask = None
+        x = self.embeddings(input_ids, token_type_ids)
+        last_hidden_state, all_hidden_states, all_attentions = self.encoder(
+            x, key_padding_mask,
+            output_hidden_states=output_hidden_states,
+            output_attentions=output_attentions,
+        )
+        pooled = self.pooler(last_hidden_state)
+        if not return_dict:
+            return tuple(v for v in [last_hidden_state, pooled, all_hidden_states, all_attentions] if v is not None)
+        return BaseModelOutputWithPooling(
+            last_hidden_state=last_hidden_state,
+            pooler_output=pooled,
+            hidden_states=all_hidden_states,
+            attentions=all_attentions,
+        )
+class RNAErnie2ForMaskedLM(PreTrainedModel):
+    config_class = RNAErnie2Config
+    base_model_prefix = "bert"  # name of the attribute that holds the base RNAErnie2Model
+    _supports_sdpa = True
+    _supports_flash_attn_2 = True
+    def __init__(self, config: RNAErnie2Config):
+        super().__init__(config)
+        self.bert = RNAErnie2Model(config)
+        self.cls = RNAErnie2OnlyMLMHead(config)
+        self.post_init()
+    def get_input_embeddings(self):
+        return self.bert.embeddings.word_embeddings
+    def get_output_embeddings(self):
+        return self.cls.predictions.decoder
+    def set_output_embeddings(self, new_embeddings):
+        self.cls.predictions.decoder = new_embeddings
+    def forward(
+        self,
+        input_ids: torch.LongTensor,
+        attention_mask: Optional[torch.Tensor] = None,
+        token_type_ids: Optional[torch.LongTensor] = None,
+        labels: Optional[torch.LongTensor] = None,
+        output_hidden_states: Optional[bool] = None,
+        output_attentions: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> Union[Tuple, MaskedLMOutput]:
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+        outputs = self.bert(
+            input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids,
+            output_hidden_states=output_hidden_states, output_attentions=output_attentions,
+            return_dict=True,
+        )
+        logits = self.cls(outputs.last_hidden_state)
+        loss = None
+        if labels is not None:
+            loss = F.cross_entropy(logits.view(-1, self.config.vocab_size), labels.view(-1), ignore_index=-100)
+        if not return_dict:
+            output = (logits,) + outputs[2:]
+            return (loss,) + output if loss is not None else output
+        return MaskedLMOutput(
+            loss=loss, logits=logits,
+            hidden_states=outputs.hidden_states, attentions=outputs.attentions,
+        )
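The eager attention path in this file turns `attention_mask` (1 = real token) into a boolean `key_padding_mask` (True = padding) and fills the padded keys' scores with `-inf` before the softmax. The sketch below reproduces that step for one query row in pure Python, without torch. `masked_softmax` is a hypothetical name and the scores are made up.

```python
import math


def masked_softmax(scores, key_padding_mask):
    """Softmax over one row of attention scores, ignoring padded keys.

    key_padding_mask[i] is True where key i is padding, mirroring the
    convention used in the modeling file (attention_mask == 0).
    """
    if len(scores) != len(key_padding_mask):
        raise ValueError("scores and mask must have the same length")
    masked = [float("-inf") if pad else s for s, pad in zip(scores, key_padding_mask)]
    peak = max(masked)
    if peak == float("-inf"):
        raise ValueError("every key is masked; softmax is undefined")
    exps = [math.exp(s - peak) for s in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


attention_mask = [1, 1, 1, 0]                        # 1 = real token, 0 = padding
key_padding_mask = [m == 0 for m in attention_mask]  # True = padding
probs = masked_softmax([2.0, 1.0, 0.0, 5.0], key_padding_mask)
```

The padded key gets exactly zero weight even though its raw score is the largest. A row with every key masked would have no valid softmax, so the sketch raises in that case. The file avoids it in practice because `[CLS]` is never padding.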

special_tokens_map.json ADDED

	@@ -0,0 +1,9 @@

+{
+  "bos_token": "[CLS]",
+  "cls_token": "[CLS]",
+  "eos_token": "[EOS]",
+  "mask_token": "[MASK]",
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "unk_token": "[UNK]"
+}

tokenization_rnaernie2.py ADDED

	@@ -0,0 +1,107 @@

+import os
+from transformers import PreTrainedTokenizer
+_VOCAB = {
+    "[PAD]": 0,
+    "[UNK]": 1,
+    "[CLS]": 2,
+    "[EOS]": 3,
+    "[SEP]": 4,
+    "[MASK]": 5,
+    "A": 6,
+    "U": 7,
+    "C": 8,
+    "G": 9,
+    "N": 10,
+}
+class RNAErnie2Tokenizer(PreTrainedTokenizer):
+    """Character-level RNA tokenizer for RNAErnie2.
+    Vocab (11 tokens): [PAD]=0, [UNK]=1, [CLS]=2, [EOS]=3, [SEP]=4, [MASK]=5,
+    A=6, U=7, C=8, G=9, N=10.
+    Sequences are wrapped [CLS] + tokens + [SEP].
+    T is silently converted to U (RNA convention).
+    """
+    vocab_files_names = {"vocab_file": "vocab.txt"}
+    model_input_names = ["input_ids", "attention_mask"]
+    def __init__(
+        self,
+        vocab_file=None,
+        pad_token="[PAD]",
+        unk_token="[UNK]",
+        cls_token="[CLS]",
+        eos_token="[EOS]",
+        sep_token="[SEP]",
+        mask_token="[MASK]",
+        **kwargs,
+    ):
+        self._vocab = {}
+        if vocab_file and os.path.isfile(vocab_file):
+            with open(vocab_file, encoding="utf-8") as f:
+                for idx, line in enumerate(f):
+                    token = line.rstrip("\n")
+                    self._vocab[token] = idx
+        else:
+            self._vocab = dict(_VOCAB)
+        self._ids_to_tokens = {v: k for k, v in self._vocab.items()}
+        super().__init__(
+            pad_token=pad_token,
+            unk_token=unk_token,
+            cls_token=cls_token,
+            eos_token=eos_token,
+            sep_token=sep_token,
+            mask_token=mask_token,
+            **kwargs,
+        )
+    @property
+    def vocab_size(self):
+        return len(self._vocab)
+    def get_vocab(self):
+        return dict(self._vocab)
+    def _tokenize(self, text):
+        return list(text.upper().replace("T", "U"))
+    def _convert_token_to_id(self, token):
+        return self._vocab.get(token, self._vocab.get("[UNK]", 1))
+    def _convert_id_to_token(self, index):
+        return self._ids_to_tokens.get(index, "[UNK]")
+    def save_vocabulary(self, save_directory, filename_prefix=None):
+        os.makedirs(save_directory, exist_ok=True)
+        fname = (filename_prefix + "-" if filename_prefix else "") + "vocab.txt"
+        path = os.path.join(save_directory, fname)
+        with open(path, "w", encoding="utf-8") as f:
+            for token, _ in sorted(self._vocab.items(), key=lambda x: x[1]):
+                f.write(token + "\n")
+        return (path,)
+    def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
+        cls = [self.cls_token_id]
+        sep = [self.sep_token_id]
+        if token_ids_1 is None:
+            return cls + token_ids_0 + sep
+        return cls + token_ids_0 + sep + token_ids_1 + sep
+    def get_special_tokens_mask(self, token_ids_0, token_ids_1=None, already_has_special_tokens=False):
+        if already_has_special_tokens:
+            return super().get_special_tokens_mask(token_ids_0, token_ids_1, True)
+        mask = [1] + [0] * len(token_ids_0) + [1]
+        if token_ids_1 is not None:
+            mask += [0] * len(token_ids_1) + [1]
+        return mask
+    def create_token_type_ids_from_sequences(self, token_ids_0, token_ids_1=None):
+        cls_sep = [0]
+        if token_ids_1 is None:
+            return cls_sep + [0] * len(token_ids_0) + cls_sep
+        return cls_sep + [0] * len(token_ids_0) + cls_sep + [0] * len(token_ids_1) + cls_sep

tokenizer_config.json ADDED

	@@ -0,0 +1,17 @@

+{
+  "auto_map": {
+    "AutoTokenizer": [
+      "tokenization_rnaernie2.RNAErnie2Tokenizer",
+      null
+    ]
+  },
+  "tokenizer_class": "RNAErnie2Tokenizer",
+  "model_max_length": 2048,
+  "pad_token": "[PAD]",
+  "unk_token": "[UNK]",
+  "cls_token": "[CLS]",
+  "eos_token": "[EOS]",
+  "sep_token": "[SEP]",
+  "mask_token": "[MASK]",
+  "padding_side": "right"
+}

vocab.txt ADDED

	@@ -0,0 +1,11 @@

+[PAD]
+[UNK]
+[CLS]
+[EOS]
+[SEP]
+[MASK]
+A
+U
+C
+G
+N