Upload folder using huggingface_hub

Files changed (12)

README.md +206 -0
__pycache__/configuration_rnamsm.cpython-39.pyc +0 -0
__pycache__/modeling_rnamsm.cpython-39.pyc +0 -0
__pycache__/tokenization_rnamsm.cpython-39.pyc +0 -0
config.json +28 -0
configuration_rnamsm.py +48 -0
model.safetensors +3 -0
modeling_rnamsm.py +420 -0
special_tokens_map.json +7 -0
tokenization_rnamsm.py +241 -0
tokenizer_config.json +56 -0
vocab.json +14 -0

README.md ADDED

	@@ -0,0 +1,206 @@

+---
+language:
+- rna
+library_name: transformers
+tags:
+- RNA
+- language-model
+- MSA
+license: mit
+---
+# RNA-MSM
+Multiple sequence alignment-based RNA language model trained on homologous RNA
+sequence alignments from the RNAcmap pipeline.
+## Architecture
+| Parameter | Value |
+|---|---|
+| Layers | 10 |
+| Attention heads | 12 |
+| Embedding dimension | 768 |
+| FFN dimension | 3072 |
+| Vocabulary size | 12 |
+| Positional encoding | Learned (sequence) + learned scalar (alignment row) |
+| Architecture | Axial MSA Transformer (row + column self-attention) |
+| Max sequence length | 1024 |
+| Max alignment depth | 1024 |
+**Input format:** RNA-MSM takes 3D input `(batch, num_alignments, seqlen)`. Each
+alignment is a set of homologous RNA sequences of equal length (an MSA). The model
+applies row self-attention (across sequence positions) and column self-attention
+(across alignment rows) at each of the 10 transformer layers.
+### Vocabulary
+| Token | ID | Token | ID |
+|---|---|---|---|
+| `<cls>` | 0 | `U` | 7 |
+| `<pad>` | 1 | `X` | 8 |
+| `<eos>` | 2 | `N` | 9 |
+| `<unk>` | 3 | `-` | 10 |
+| `A` | 4 | `<mask>` | 11 |
+| `G` | 5 | | |
+| `C` | 6 | | |
+Each sequence is prepended with `<cls>` (id 0). No `<eos>` token is appended.
+## Pretraining
+- **Objective:** Masked language modeling on RNA MSAs (masking ~15% of tokens)
+- **Data:** RNA homologous sequences searched by RNAcmap from non-redundant RNA
+  databases
+- **Source checkpoint:** `RNA_MSM_pretrained.ckpt`
+  ([original Google Drive link](https://drive.google.com/file/d/11A-S13qAb5wiBi1YLs3EOrnixSDq7Q0q/view))
+### Checkpoint selection
+Only one pretrained RNA-MSM checkpoint has been publicly released. This model is that
+checkpoint, converted from the original PyTorch Lightning `.ckpt` format.
+## Parity Verification
+Hidden-state representations were verified to be identical (max abs diff = 0.00) to
+the reference implementation at all 11 representation levels (embedding + 10 transformer
+layers), on both padded and unpadded batches. Verification ran on GPU with PyTorch 2.7 /
+CUDA 12.6.
+## Related Models
+See the full [RNA-MSM collection](https://huggingface.co/collections/Taykhoom/rna-msm).
+## Usage
+RNA-MSM is an **MSA model** -- it performs best when given multiple homologous
+sequences as input. For single-sequence embedding, each sequence is treated as a
+1-row MSA.
+### Single-sequence embedding
+```python
+import torch
+from transformers import AutoTokenizer, AutoModel
+tokenizer = AutoTokenizer.from_pretrained("Taykhoom/RNA-MSM", trust_remote_code=True)
+model = AutoModel.from_pretrained("Taykhoom/RNA-MSM", trust_remote_code=True)
+model.eval()
+sequences = ["AGCUAGCUAGCU", "GCUAGCUA"]
+enc = tokenizer(sequences, return_tensors="pt", padding=True)
+# enc["input_ids"]: (2, 1, seqlen)  -- each sequence treated as 1-row MSA
+with torch.no_grad():
+    out = model(**enc)
+# last_hidden_state: (batch, num_alignments, seqlen, 768)
+lhs = out.last_hidden_state   # (2, 1, seqlen, 768)
+# Per-token embeddings for the query sequence (row 0), excluding CLS
+token_emb = lhs[:, 0, 1:, :]  # (2, seqlen-1, 768)
+# Mean-pool over non-padding positions for sequence-level embedding
+mask = enc["attention_mask"][:, 0, 1:].unsqueeze(-1).float()  # (2, seqlen-1, 1)
+seq_emb = (token_emb * mask).sum(1) / mask.sum(1).clamp(min=1)  # (2, 768)
+```
+### MSA embedding
+```python
+import torch
+from transformers import AutoTokenizer, AutoModel
+tokenizer = AutoTokenizer.from_pretrained("Taykhoom/RNA-MSM", trust_remote_code=True)
+model = AutoModel.from_pretrained("Taykhoom/RNA-MSM", trust_remote_code=True)
+model.eval()
+# One MSA: 3 aligned homologous sequences of equal length
+msa = [
+    "AGCUAGCUAGCU",
+    "AGCUAGCUAGC-",
+    "AGCU--CUAGCU",
+]
+enc = tokenizer.encode_msa([msa], return_tensors="pt", padding=True)
+# enc["input_ids"]: (1, 3, seqlen)
+with torch.no_grad():
+    out = model(**enc)
+# last_hidden_state: (1, 3, seqlen, 768)
+# Use row 0 (query sequence) for downstream tasks
+query_emb = out.last_hidden_state[:, 0, 1:, :]  # (1, seqlen-1, 768)
+```
+### Intermediate layers
+```python
+with torch.no_grad():
+    out = model(**enc, output_hidden_states=True)
+# hidden_states: tuple of 11 tensors, each (batch, num_alignments, seqlen, 768)
+# Index 0 = embedding, 1..10 = transformer layer outputs
+layer5_emb = out.hidden_states[5][:, 0, :, :]  # (batch, seqlen, 768)
+```
+### MLM logits
+```python
+from transformers import AutoModelForMaskedLM
+mlm = AutoModelForMaskedLM.from_pretrained("Taykhoom/RNA-MSM", trust_remote_code=True)
+mlm.eval()
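+enc = tokenizer(["AGCUAAGCU"], return_tensors="pt", padding=True)
+# The tokenizer splits input strings per character, so set <mask> by id:
+enc["input_ids"][0, 0, 5] = tokenizer.mask_token_id  # 5th residue (index 0 is <cls>)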
+with torch.no_grad():
+    logits = mlm(**enc).logits  # (1, 1, seqlen, 12)
+```
+### Fine-tuning
+For per-nucleotide downstream tasks (e.g., solvent accessibility), extract the
+per-token embeddings from the query row (row 0) of the last hidden state, then apply a
+prediction head. The model's row-attention maps are also useful for
+2D structural tasks (e.g., secondary structure prediction).
+## Implementation Notes
+RNA-MSM uses **axial attention**: each transformer layer applies row self-attention
+(attending across sequence positions, summed over alignment rows) followed by column
+self-attention (attending across alignment rows per position). This custom attention
+pattern is not compatible with `attn_implementation="sdpa"` or
+`attn_implementation="flash_attention_2"` -- only `"eager"` is supported.
+`last_hidden_state` has shape `(batch, num_alignments, seqlen, embed_dim)` -- note
+the 4D output, reflecting the MSA structure. For single-sequence use (1-row MSA),
+this is `(batch, 1, seqlen, embed_dim)`.
+## Citation
+```bibtex
+@article{zhang2024rnamsm,
+    author = {Zhang, Yikun and Lang, Mei and Jiang, Jiuhong and Gao, Zhiqiang
+              and Xu, Fan and Litfin, Thomas and Chen, Ke and Singh, Jaswinder
+              and Huang, Xiansong and Song, Guoli and Tian, Yonghong and Zhan, Jian
+              and Chen, Jie and Zhou, Yaoqi},
+    title = {Multiple sequence alignment-based RNA language model and its application
+             to structural inference},
+    journal = {Nucleic Acids Research},
+    volume = {52},
+    number = {1},
+    pages = {e3},
+    year = {2024},
+    doi = {10.1093/nar/gkad1031},
+    pmid = {37941140},
+}
+```
+## Credits
+Original model and code by Zhang et al. Source: [GitHub](https://github.com/yikunpku/RNA-MSM).
+The HF conversion code was authored primarily by [Claude Code](https://claude.ai/code)
+and reviewed manually by Taykhoom Dalal.
+## License
+MIT, following the original repository.
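
The input constraints stated in the README (rows of equal length, at most 1024 alignment rows, at most 1024 positions) can be checked before calling `encode_msa`. The sketch below is one way to do that; `find_msa_problems` is a hypothetical helper that is not part of this commit, and it assumes the prepended `<cls>` token counts toward the 1024-position limit.

```python
from typing import List

# Limits from the README's architecture table (also present in config.json).
MAX_ALIGNMENTS = 1024
MAX_POSITIONS = 1024
# Residue and gap characters from the vocabulary table.
ALPHABET = frozenset("AGCUXN-")


def find_msa_problems(msa: List[str]) -> List[str]:
    """Return human-readable problems; an empty list means the MSA looks usable.

    Hypothetical helper for illustration, not part of the released tokenizer.
    """
    if not msa:
        return ["MSA is empty; at least the query sequence is required"]
    problems = []
    if len(msa) > MAX_ALIGNMENTS:
        problems.append(f"depth {len(msa)} exceeds {MAX_ALIGNMENTS} rows")
    width = len(msa[0])
    # Assumption: the prepended <cls> token uses one of the 1024 positions.
    if width + 1 > MAX_POSITIONS:
        problems.append(f"{width} columns plus <cls> exceed {MAX_POSITIONS} positions")
    for i, row in enumerate(msa):
        if len(row) != width:
            problems.append(f"row {i} has length {len(row)}, expected {width}")
        unexpected = sorted(set(row) - ALPHABET)
        if unexpected:
            # Such characters would be tokenized as <unk> (id 3).
            problems.append(f"row {i} has characters outside the vocabulary: {unexpected}")
    return problems
```

Calling it on the three-row MSA from the README example returns an empty list, while a row of a different length or a DNA-style `T` is reported as a problem.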

__pycache__/configuration_rnamsm.cpython-39.pyc ADDED

Binary file (1.4 kB).

__pycache__/modeling_rnamsm.cpython-39.pyc ADDED

Binary file (14.9 kB).

__pycache__/tokenization_rnamsm.cpython-39.pyc ADDED

Binary file (8.51 kB).

config.json ADDED

	@@ -0,0 +1,28 @@

+{
+  "auto_map": {
+    "AutoConfig": "configuration_rnamsm.RNAMSMConfig",
+    "AutoModel": "modeling_rnamsm.RNAMSMModel",
+    "AutoModelForMaskedLM": "modeling_rnamsm.RNAMSMForMaskedLM"
+  },
+  "activation_dropout": 0.1,
+  "architectures": [
+    "RNAMSMForMaskedLM"
+  ],
+  "attention_dropout": 0.1,
+  "cls_idx": 0,
+  "dropout": 0.1,
+  "embed_dim": 768,
+  "embed_positions_msa": true,
+  "eos_idx": 2,
+  "ffn_embed_dim": 3072,
+  "mask_idx": 11,
+  "max_alignments": 1024,
+  "max_positions": 1024,
+  "max_tokens_per_msa": 16384,
+  "model_type": "rnamsm",
+  "num_attention_heads": 12,
+  "num_layers": 10,
+  "padding_idx": 1,
+  "transformers_version": "4.57.6",
+  "vocab_size": 12
+}
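
The `max_tokens_per_msa` value in this config controls chunked inference. In `modeling_rnamsm.py` from this commit, row attention switches to a chunked path when `num_rows * num_cols` exceeds the limit and gradients are disabled, processing `max(1, max_tokens_per_msa // num_cols)` rows at a time. The sketch below reproduces only that arithmetic; `row_attention_chunks` is a hypothetical name, and `num_cols` is assumed to include the `<cls>` position.

```python
from typing import List, Tuple

MAX_TOKENS_PER_MSA = 16384  # value from config.json


def row_attention_chunks(
    num_rows: int, num_cols: int, max_tokens: int = MAX_TOKENS_PER_MSA
) -> List[Tuple[int, int]]:
    """Return the (start, stop) row ranges handled per chunk when grad is disabled."""
    if num_rows * num_cols <= max_tokens:
        # Small enough: the whole MSA is processed in one pass.
        return [(0, num_rows)]
    max_rows = max(1, max_tokens // num_cols)
    return [(start, min(start + max_rows, num_rows))
            for start in range(0, num_rows, max_rows)]


# 256 rows x 129 columns = 33024 tokens, above the 16384 limit.
print(row_attention_chunks(256, 129))  # [(0, 127), (127, 254), (254, 256)]
```

Column attention in the same file chunks along the other axis, using `max(1, max_tokens_per_msa // num_rows)` columns per step.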

configuration_rnamsm.py ADDED

	@@ -0,0 +1,48 @@

+from transformers import PretrainedConfig
+class RNAMSMConfig(PretrainedConfig):
+    model_type = "rnamsm"
+    auto_map = {
+        "AutoConfig": "configuration_rnamsm.RNAMSMConfig",
+        "AutoModel": "modeling_rnamsm.RNAMSMModel",
+        "AutoModelForMaskedLM": "modeling_rnamsm.RNAMSMForMaskedLM",
+    }
+    def __init__(
+        self,
+        vocab_size=12,
+        num_layers=10,
+        embed_dim=768,
+        num_attention_heads=12,
+        ffn_embed_dim=3072,
+        padding_idx=1,
+        mask_idx=11,
+        cls_idx=0,
+        eos_idx=2,
+        dropout=0.1,
+        attention_dropout=0.1,
+        activation_dropout=0.1,
+        max_positions=1024,
+        max_alignments=1024,
+        max_tokens_per_msa=16384,
+        embed_positions_msa=True,
+        **kwargs,
+    ):
+        super().__init__(padding_idx=padding_idx, **kwargs)
+        self.vocab_size = vocab_size
+        self.num_layers = num_layers
+        self.embed_dim = embed_dim
+        self.num_attention_heads = num_attention_heads
+        self.ffn_embed_dim = ffn_embed_dim
+        self.mask_idx = mask_idx
+        self.cls_idx = cls_idx
+        self.eos_idx = eos_idx
+        self.dropout = dropout
+        self.attention_dropout = attention_dropout
+        self.activation_dropout = activation_dropout
+        self.max_positions = max_positions
+        self.max_alignments = max_alignments
+        self.max_tokens_per_msa = max_tokens_per_msa
+        self.embed_positions_msa = embed_positions_msa

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:3998fb8289d98cf53944fc14da4157a40c03dffecf0efefd7e76044ed16a0095
+size 383678288
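
As a sanity check on the file size above, the parameter count can be estimated from `config.json` and the module definitions in `modeling_rnamsm.py`. This is a rough sketch under two assumptions that the commit does not state explicitly: weights are stored as 4-byte floats, and the LM-head output weight (tied to the token embedding) is stored only once.

```python
# Values from config.json in this commit.
VOCAB, LAYERS, D, FFN = 12, 10, 768, 3072
MAX_POS, MAX_ALN, PAD_IDX = 1024, 1024, 1


def linear(n_in: int, n_out: int) -> int:
    """Parameters of nn.Linear with bias."""
    return n_in * n_out + n_out


def layer_norm(dim: int) -> int:
    """Parameters of nn.LayerNorm (weight and bias)."""
    return 2 * dim


attention = 4 * linear(D, D)             # q, k, v and output projections
ffn = linear(D, FFN) + linear(FFN, D)    # fc1 and fc2
per_layer = 2 * attention + ffn + 3 * layer_norm(D)  # row attn, column attn, FFN

embeddings = (
    VOCAB * D                        # token embedding
    + (MAX_POS + PAD_IDX + 1) * D    # learned positions, offset by padding_idx + 1
    + MAX_ALN                        # one scalar per alignment row
    + 2 * layer_norm(D)              # emb_layer_norm_before / emb_layer_norm_after
)
# Dense layer, layer norm and bias; the output weight is assumed tied, not counted.
lm_head = linear(D, D) + layer_norm(D) + VOCAB

total = embeddings + LAYERS * per_layer + lm_head
print(total, total * 4)  # 95911180 383644720
```

Under these assumptions the estimate is about 33 kB below the 383,678,288-byte LFS object. That gap would be consistent with file metadata, though this is not verified here.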

modeling_rnamsm.py ADDED

	@@ -0,0 +1,420 @@

+import math
+from typing import Optional, Tuple
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from transformers import PreTrainedModel
+from transformers.modeling_outputs import BaseModelOutput, MaskedLMOutput
+try:
+    from .configuration_rnamsm import RNAMSMConfig
+except ImportError:
+    from configuration_rnamsm import RNAMSMConfig
+def gelu(x):
+    return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))
+class RNAMSMLMHead(nn.Module):
+    def __init__(self, config: RNAMSMConfig, embed_tokens_weight: nn.Parameter):
+        super().__init__()
+        self.dense = nn.Linear(config.embed_dim, config.embed_dim)
+        self.layer_norm = nn.LayerNorm(config.embed_dim)
+        self.weight = embed_tokens_weight
+        self.bias = nn.Parameter(torch.zeros(config.vocab_size))
+    def forward(self, x):
+        x = self.dense(x)
+        x = gelu(x)
+        x = self.layer_norm(x)
+        return F.linear(x, self.weight) + self.bias
+class LearnedPositionalEmbedding(nn.Embedding):
+    def __init__(self, num_embeddings: int, embedding_dim: int, padding_idx: int):
+        num_embeddings_ = num_embeddings + padding_idx + 1
+        super().__init__(num_embeddings_, embedding_dim, padding_idx)
+        self.max_positions = num_embeddings
+    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
+        mask = tokens.ne(self.padding_idx).int()
+        positions = (torch.cumsum(mask, dim=1).type_as(mask) * mask).long() + self.padding_idx
+        return F.embedding(positions, self.weight, self.padding_idx,
+                           self.max_norm, self.norm_type, self.scale_grad_by_freq, self.sparse)
+class NormalizedResidualBlock(nn.Module):
+    def __init__(self, layer: nn.Module, embedding_dim: int, dropout: float):
+        super().__init__()
+        self.layer = layer
+        self.layer_norm = nn.LayerNorm(embedding_dim)
+        self.dropout_module = nn.Dropout(dropout)
+    def forward(self, x, *args, **kwargs):
+        residual = x
+        x = self.layer_norm(x)
+        outputs = self.layer(x, *args, **kwargs)
+        if isinstance(outputs, tuple):
+            x, *out = outputs
+        else:
+            x, out = outputs, None
+        x = self.dropout_module(x)
+        x = residual + x
+        if out is not None:
+            return (x,) + tuple(out)
+        return x
+class FeedForwardNetwork(nn.Module):
+    def __init__(self, embedding_dim: int, ffn_embedding_dim: int,
+                 activation_dropout: float, max_tokens_per_msa: int):
+        super().__init__()
+        self.fc1 = nn.Linear(embedding_dim, ffn_embedding_dim)
+        self.fc2 = nn.Linear(ffn_embedding_dim, embedding_dim)
+        self.activation_fn = nn.GELU()
+        self.activation_dropout_module = nn.Dropout(activation_dropout)
+        self.max_tokens_per_msa = max_tokens_per_msa
+    def forward(self, x):
+        x = self.activation_fn(self.fc1(x))
+        x = self.activation_dropout_module(x)
+        return self.fc2(x)
+class RowSelfAttention(nn.Module):
+    """Self-attention across columns (sequence positions), summed over MSA rows."""
+    def __init__(self, embed_dim: int, num_heads: int, dropout: float, max_tokens_per_msa: int):
+        super().__init__()
+        self.num_heads = num_heads
+        self.dropout = dropout
+        self.head_dim = embed_dim // num_heads
+        self.scaling = self.head_dim ** -0.5
+        self.max_tokens_per_msa = max_tokens_per_msa
+        self.attn_shape = "hnij"
+        self.q_proj = nn.Linear(embed_dim, embed_dim)
+        self.k_proj = nn.Linear(embed_dim, embed_dim)
+        self.v_proj = nn.Linear(embed_dim, embed_dim)
+        self.out_proj = nn.Linear(embed_dim, embed_dim)
+        self.dropout_module = nn.Dropout(dropout)
+    def align_scaling(self, q):
+        return self.scaling / math.sqrt(q.size(0))
+    def compute_attention_weights(self, x, scaling, padding_mask=None):
+        num_rows, num_cols, batch_size, embed_dim = x.size()
+        q = self.q_proj(x).view(num_rows, num_cols, batch_size, self.num_heads, self.head_dim)
+        k = self.k_proj(x).view(num_rows, num_cols, batch_size, self.num_heads, self.head_dim)
+        q = q * scaling
+        if padding_mask is not None:
+            q = q * (1 - padding_mask.permute(1, 2, 0).unsqueeze(3).unsqueeze(4).to(q))
+        attn_weights = torch.einsum(f"rinhd,rjnhd->{self.attn_shape}", q, k)
+        if padding_mask is not None:
+            attn_weights = attn_weights.masked_fill(
+                padding_mask[:, 0].unsqueeze(0).unsqueeze(2), -10000.0)
+        return attn_weights
+    def compute_attention_update(self, x, attn_probs):
+        num_rows, num_cols, batch_size, embed_dim = x.size()
+        v = self.v_proj(x).view(num_rows, num_cols, batch_size, self.num_heads, self.head_dim)
+        context = torch.einsum(f"{self.attn_shape},rjnhd->rinhd", attn_probs, v)
+        context = context.contiguous().view(num_rows, num_cols, batch_size, embed_dim)
+        return self.out_proj(context)
+    def _batched_forward(self, x, padding_mask=None):
+        num_rows, num_cols, batch_size, _ = x.size()
+        max_rows = max(1, self.max_tokens_per_msa // num_cols)
+        scaling = self.align_scaling(x)
+        attns = 0
+        for start in range(0, num_rows, max_rows):
+            pm = padding_mask[:, start:start + max_rows] if padding_mask is not None else None
+            attns = attns + self.compute_attention_weights(x[start:start + max_rows], scaling, pm)
+        attn_probs = attns.softmax(-1)
+        attn_probs = self.dropout_module(attn_probs)
+        outputs = [self.compute_attention_update(x[start:start + max_rows], attn_probs)
+                   for start in range(0, num_rows, max_rows)]
+        return torch.cat(outputs, 0), attn_probs
+    def forward(self, x, self_attn_mask=None, self_attn_padding_mask=None):
+        num_rows, num_cols, batch_size, _ = x.size()
+        if num_rows * num_cols > self.max_tokens_per_msa and not torch.is_grad_enabled():
+            return self._batched_forward(x, self_attn_padding_mask)
+        scaling = self.align_scaling(x)
+        attn_weights = self.compute_attention_weights(x, scaling, self_attn_padding_mask)
+        attn_probs = attn_weights.softmax(-1)
+        attn_probs = self.dropout_module(attn_probs)
+        output = self.compute_attention_update(x, attn_probs)
+        return output, attn_probs
+class ColumnSelfAttention(nn.Module):
+    """Self-attention across MSA rows (alignment depth) per sequence position."""
+    def __init__(self, embed_dim: int, num_heads: int, dropout: float, max_tokens_per_msa: int):
+        super().__init__()
+        self.num_heads = num_heads
+        self.dropout = dropout
+        self.head_dim = embed_dim // num_heads
+        self.scaling = self.head_dim ** -0.5
+        self.max_tokens_per_msa = max_tokens_per_msa
+        self.q_proj = nn.Linear(embed_dim, embed_dim)
+        self.k_proj = nn.Linear(embed_dim, embed_dim)
+        self.v_proj = nn.Linear(embed_dim, embed_dim)
+        self.out_proj = nn.Linear(embed_dim, embed_dim)
+        self.dropout_module = nn.Dropout(dropout)
+    def compute_attention_update(self, x, self_attn_padding_mask=None):
+        num_rows, num_cols, batch_size, embed_dim = x.size()
+        if num_rows == 1:
+            attn_probs = torch.ones(self.num_heads, num_cols, batch_size, 1, 1,
+                                    device=x.device, dtype=x.dtype)
+            output = self.out_proj(self.v_proj(x))
+        else:
+            q = self.q_proj(x).view(num_rows, num_cols, batch_size, self.num_heads, self.head_dim)
+            k = self.k_proj(x).view(num_rows, num_cols, batch_size, self.num_heads, self.head_dim)
+            v = self.v_proj(x).view(num_rows, num_cols, batch_size, self.num_heads, self.head_dim)
+            q = q * self.scaling
+            attn_weights = torch.einsum("icnhd,jcnhd->hcnij", q, k)
+            if self_attn_padding_mask is not None:
+                attn_weights = attn_weights.masked_fill(
+                    self_attn_padding_mask.permute(2, 0, 1).unsqueeze(0).unsqueeze(3), -10000.0)
+            attn_probs = attn_weights.softmax(-1)
+            attn_probs = self.dropout_module(attn_probs)
+            context = torch.einsum("hcnij,jcnhd->icnhd", attn_probs, v)
+            context = context.contiguous().view(num_rows, num_cols, batch_size, embed_dim)
+            output = self.out_proj(context)
+        return output, attn_probs
+    def _batched_forward(self, x, self_attn_padding_mask=None):
+        num_rows, num_cols, batch_size, _ = x.size()
+        max_cols = max(1, self.max_tokens_per_msa // num_rows)
+        outputs, attns = [], []
+        for start in range(0, num_cols, max_cols):
+            pm = (self_attn_padding_mask[:, :, start:start + max_cols]
+                  if self_attn_padding_mask is not None else None)
+            out, attn = self.compute_attention_update(x[:, start:start + max_cols], pm)
+            outputs.append(out)
+            attns.append(attn)
+        return torch.cat(outputs, 1), torch.cat(attns, 1)
+    def forward(self, x, self_attn_mask=None, self_attn_padding_mask=None):
+        num_rows, num_cols, batch_size, _ = x.size()
+        if num_rows * num_cols > self.max_tokens_per_msa and not torch.is_grad_enabled():
+            return self._batched_forward(x, self_attn_padding_mask)
+        return self.compute_attention_update(x, self_attn_padding_mask)
+class AxialTransformerLayer(nn.Module):
+    def __init__(self, config: RNAMSMConfig):
+        super().__init__()
+        self.row_self_attention = NormalizedResidualBlock(
+            RowSelfAttention(config.embed_dim, config.num_attention_heads,
+                             config.attention_dropout, config.max_tokens_per_msa),
+            config.embed_dim, config.dropout,
+        )
+        self.column_self_attention = NormalizedResidualBlock(
+            ColumnSelfAttention(config.embed_dim, config.num_attention_heads,
+                                config.attention_dropout, config.max_tokens_per_msa),
+            config.embed_dim, config.dropout,
+        )
+        self.feed_forward_layer = NormalizedResidualBlock(
+            FeedForwardNetwork(config.embed_dim, config.ffn_embed_dim,
+                               config.activation_dropout, config.max_tokens_per_msa),
+            config.embed_dim, config.dropout,
+        )
+    def forward(self, x, padding_mask=None, output_attentions=False):
+        x, row_attn = self.row_self_attention(x, self_attn_padding_mask=padding_mask)
+        x, col_attn = self.column_self_attention(x, self_attn_padding_mask=padding_mask)
+        x = self.feed_forward_layer(x)
+        return x, row_attn, col_attn
+class RNAMSMPreTrainedModel(PreTrainedModel):
+    config_class = RNAMSMConfig
+    base_model_prefix = "rnamsm"
+    def _init_weights(self, module):
+        if isinstance(module, nn.Linear):
+            nn.init.normal_(module.weight, std=0.02)
+            if module.bias is not None:
+                nn.init.zeros_(module.bias)
+        elif isinstance(module, nn.Embedding):
+            nn.init.normal_(module.weight, std=0.02)
+            if module.padding_idx is not None:
+                module.weight.data[module.padding_idx].zero_()
+        elif isinstance(module, nn.LayerNorm):
+            nn.init.ones_(module.weight)
+            nn.init.zeros_(module.bias)
+class RNAMSMModel(RNAMSMPreTrainedModel):
+    """
+    RNA-MSM backbone: MSA Transformer that processes multiple-sequence-aligned RNA
+    sequences and produces per-position embeddings for each alignment row.
+    Input: input_ids of shape (batch, num_alignments, seqlen)
+    Output: last_hidden_state of shape (batch, num_alignments, seqlen, embed_dim)
+    """
+    def __init__(self, config: RNAMSMConfig):
+        super().__init__(config)
+        self.embed_tokens = nn.Embedding(config.vocab_size, config.embed_dim,
+                                         padding_idx=config.padding_idx)
+        self.embed_positions = LearnedPositionalEmbedding(
+            config.max_positions, config.embed_dim, config.padding_idx)
+        if config.embed_positions_msa:
+            self.msa_position_embedding = nn.Parameter(
+                0.01 * torch.randn(1, config.max_alignments, 1, 1))
+        else:
+            self.register_parameter("msa_position_embedding", None)
+        self.dropout_module = nn.Dropout(config.dropout)
+        self.emb_layer_norm_before = nn.LayerNorm(config.embed_dim)
+        self.emb_layer_norm_after = nn.LayerNorm(config.embed_dim)
+        self.layers = nn.ModuleList([AxialTransformerLayer(config)
+                                     for _ in range(config.num_layers)])
+        self.post_init()
+    def forward(
+        self,
+        input_ids: torch.Tensor,
+        attention_mask: Optional[torch.Tensor] = None,
+        output_hidden_states: Optional[bool] = None,
+        output_attentions: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ):
+        output_hidden_states = (output_hidden_states if output_hidden_states is not None
+                                else self.config.output_hidden_states)
+        output_attentions = (output_attentions if output_attentions is not None
+                             else self.config.output_attentions)
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+        if input_ids.ndim != 3:
+            raise ValueError(
+                "RNA-MSM expects 3D input_ids of shape (batch, num_alignments, seqlen). "
+                "For single sequences, use the tokenizer, which produces (batch, 1, seqlen).")
+        batch_size, num_alignments, seqlen = input_ids.size()
+        # HF convention: attention_mask 1=attend, 0=pad -> padding_mask True=padding
+        if attention_mask is not None:
+            padding_mask = attention_mask.eq(0)
+        else:
+            padding_mask = input_ids.eq(self.config.padding_idx)
+        if not padding_mask.any():
+            padding_mask = None
+        # (B, R, C) -> embed: (B, R, C, D)
+        x = self.embed_tokens(input_ids)
+        x = x + self.embed_positions(
+            input_ids.view(batch_size * num_alignments, seqlen)
+        ).view(batch_size, num_alignments, seqlen, self.config.embed_dim)
+        if self.msa_position_embedding is not None:
+            if num_alignments > self.config.max_alignments:
+                raise RuntimeError(
+                    f"MSA depth {num_alignments} exceeds max_alignments "
+                    f"{self.config.max_alignments}.")
+            x = x + self.msa_position_embedding[:, :num_alignments]
+        x = self.emb_layer_norm_before(x)
+        x = self.dropout_module(x)
+        if padding_mask is not None:
+            x = x * (1 - padding_mask.unsqueeze(-1).to(x))
+        all_hidden_states = []
+        all_row_attentions = []
+        all_col_attentions = []
+        if output_hidden_states:
+            all_hidden_states.append(x)
+        # (B, R, C, D) -> (R, C, B, D) for axial attention
+        x = x.permute(1, 2, 0, 3)
+        for layer in self.layers:
+            x, row_attn, col_attn = layer(x, padding_mask=padding_mask,
+                                           output_attentions=output_attentions)
+            if output_hidden_states:
+                all_hidden_states.append(x.permute(2, 0, 1, 3))
+            if output_attentions:
+                all_row_attentions.append(row_attn)
+                all_col_attentions.append(col_attn)
+        x = self.emb_layer_norm_after(x)
+        x = x.permute(2, 0, 1, 3)  # (R, C, B, D) -> (B, R, C, D)
+        if output_hidden_states:
+            all_hidden_states[-1] = x
+        if not return_dict:
+            return tuple(v for v in [
+                x,
+                tuple(all_hidden_states) if output_hidden_states else None,
+                tuple(all_row_attentions) if output_attentions else None,
+            ] if v is not None)
+        return BaseModelOutput(
+            last_hidden_state=x,
+            hidden_states=tuple(all_hidden_states) if output_hidden_states else None,
+            attentions=tuple(all_row_attentions) if output_attentions else None,
+        )
+class RNAMSMForMaskedLM(RNAMSMPreTrainedModel):
+    _tied_weights_keys = ["lm_head.weight"]
+    def __init__(self, config: RNAMSMConfig):
+        super().__init__(config)
+        self.rnamsm = RNAMSMModel(config)
+        self.lm_head = RNAMSMLMHead(config, self.rnamsm.embed_tokens.weight)
+        self.post_init()
+    def get_output_embeddings(self):
+        return self.lm_head
+    def set_output_embeddings(self, new_embeddings):
+        self.lm_head = new_embeddings
+    def forward(
+        self,
+        input_ids: torch.Tensor,
+        attention_mask: Optional[torch.Tensor] = None,
+        labels: Optional[torch.Tensor] = None,
+        output_hidden_states: Optional[bool] = None,
+        output_attentions: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ):
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+        out = self.rnamsm(
+            input_ids,
+            attention_mask=attention_mask,
+            output_hidden_states=output_hidden_states,
+            output_attentions=output_attentions,
+            return_dict=return_dict,
+        )
+        logits = self.lm_head(out[0] if not return_dict else out.last_hidden_state)
+        loss = None
+        if labels is not None:
+            loss = F.cross_entropy(
+                logits.view(-1, self.config.vocab_size),
+                labels.view(-1),
+                ignore_index=-100,
+            )
+        if not return_dict:
+            output = (logits,) + out[1:]
+            return ((loss,) + output) if loss is not None else output
+        return MaskedLMOutput(
+            loss=loss,
+            logits=logits,
+            hidden_states=out.hidden_states,
+            attentions=out.attentions,
+        )
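
The position indexing in `LearnedPositionalEmbedding.forward` above is easy to misread, so here is a plain-Python sketch of the same arithmetic for one row. Non-padding tokens receive a running count offset by `padding_idx`, and padding tokens map to `padding_idx` itself. `position_indices` is a hypothetical name used only for this illustration.

```python
from itertools import accumulate
from typing import List

PADDING_IDX = 1  # <pad> id from config.json


def position_indices(token_ids: List[int], padding_idx: int = PADDING_IDX) -> List[int]:
    """Mirror cumsum(mask) * mask + padding_idx for a single row of token ids."""
    mask = [int(t != padding_idx) for t in token_ids]
    running = list(accumulate(mask))  # cumulative count of non-padding tokens
    return [count * m + padding_idx for count, m in zip(running, mask)]


# <cls> A G C U <pad> <pad>
print(position_indices([0, 4, 5, 6, 7, 1, 1]))  # [2, 3, 4, 5, 6, 1, 1]
```

Because the first real token gets index `padding_idx + 1`, a row of 1024 non-padding tokens reaches index 1025, which is why the embedding table is allocated with `num_embeddings + padding_idx + 1` rows.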

special_tokens_map.json ADDED

	@@ -0,0 +1,7 @@

+{
+  "cls_token": "<cls>",
+  "eos_token": "<eos>",
+  "mask_token": "<mask>",
+  "pad_token": "<pad>",
+  "unk_token": "<unk>"
+}
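
`special_tokens_map.json` registers `<mask>` and the other special tokens, but `_tokenize_single` in `tokenization_rnamsm.py` splits its input with `list(sequence)`. A literal `<mask>` inside a string passed to `__call__` would therefore be broken into single characters and mapped to `<unk>`. The sketch below shows one way to split a string while keeping special tokens whole; `split_keeping_specials` is a hypothetical helper, not part of the released tokenizer.

```python
from typing import List

# Special tokens from special_tokens_map.json and ids from vocab.json.
SPECIAL_TOKENS = ("<cls>", "<eos>", "<mask>", "<pad>", "<unk>")
VOCAB = {"<cls>": 0, "<pad>": 1, "<eos>": 2, "<unk>": 3, "A": 4, "G": 5,
         "C": 6, "U": 7, "X": 8, "N": 9, "-": 10, "<mask>": 11}


def split_keeping_specials(text: str) -> List[str]:
    """Split per character, except that special tokens stay intact."""
    tokens, i = [], 0
    while i < len(text):
        # Take a special token if one starts here, otherwise a single character.
        match = next((s for s in SPECIAL_TOKENS if text.startswith(s, i)), text[i])
        tokens.append(match)
        i += len(match)
    return tokens


def to_ids(tokens: List[str]) -> List[int]:
    """Map tokens to ids, falling back to <unk> as the tokenizer does."""
    return [VOCAB.get(t, VOCAB["<unk>"]) for t in tokens]


print(to_ids(split_keeping_specials("AGCU<mask>AGCU")))  # [4, 5, 6, 7, 11, 4, 5, 6, 7]
print(to_ids(list("AGCU<mask>AGCU")))  # per-character split: <mask> becomes six <unk> ids
```

An alternative that needs no string parsing is to tokenize the unmasked sequence and overwrite the chosen positions of `input_ids` with the mask id (11).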

tokenization_rnamsm.py ADDED

	@@ -0,0 +1,241 @@

+import json
+import os
+from typing import Dict, List, Optional, Union
+import torch
+from transformers import PreTrainedTokenizer
+_VOCAB = {
+    "<cls>":  0,
+    "<pad>":  1,
+    "<eos>":  2,
+    "<unk>":  3,
+    "A":      4,
+    "G":      5,
+    "C":      6,
+    "U":      7,
+    "X":      8,
+    "N":      9,
+    "-":      10,
+    "<mask>": 11,
+}
+class RNAMSMTokenizer(PreTrainedTokenizer):
+    """
+    Tokenizer for RNA-MSM.
+    Vocabulary: <cls>(0) <pad>(1) <eos>(2) <unk>(3) A(4) G(5) C(6) U(7) X(8) N(9) -(10) <mask>(11)
+    RNA-MSM is an MSA Transformer: it always expects 3D input
+    (batch, num_alignments, seqlen). This tokenizer treats each input string
+    as a single-sequence MSA (1 alignment row), so the standard __call__ API returns 3D input_ids:
+        enc = tokenizer(["AGCU", "GAUC"], return_tensors="pt", padding=True)
+        # enc.input_ids: (2, 1, T)  -- batch of 2 single-sequence MSAs
+    For real MSAs (multiple aligned sequences), use encode_msa():
+        enc = tokenizer.encode_msa([["AGCU--", "AGCUUU"]], return_tensors="pt")
+        # enc["input_ids"]: (1, 2, T)  -- 1 MSA with 2 alignment rows
+    """
+    vocab_files_names = {"vocab_file": "vocab.json"}
+    model_input_names = ["input_ids", "attention_mask"]
+    def __init__(
+        self,
+        vocab_file=None,
+        cls_token="<cls>",
+        pad_token="<pad>",
+        eos_token="<eos>",
+        unk_token="<unk>",
+        mask_token="<mask>",
+        **kwargs,
+    ):
+        if vocab_file and os.path.isfile(vocab_file):
+            with open(vocab_file) as f:
+                self._vocab = json.load(f)
+        else:
+            self._vocab = dict(_VOCAB)
+        self._ids_to_tokens = {v: k for k, v in self._vocab.items()}
+        super().__init__(
+            cls_token=cls_token,
+            pad_token=pad_token,
+            eos_token=eos_token,
+            unk_token=unk_token,
+            mask_token=mask_token,
+            **kwargs,
+        )
+    @property
+    def vocab_size(self):
+        return len(self._vocab)
+    def get_vocab(self):
+        return dict(self._vocab)
+    def _tokenize(self, text):
+        return list(text)
+    def _convert_token_to_id(self, token):
+        return self._vocab.get(token, self._vocab["<unk>"])
+    def _convert_id_to_token(self, index):
+        return self._ids_to_tokens.get(index, "<unk>")
+    def save_vocabulary(self, save_directory, filename_prefix=None):
+        os.makedirs(save_directory, exist_ok=True)
+        fname = (filename_prefix + "-" if filename_prefix else "") + "vocab.json"
+        path = os.path.join(save_directory, fname)
+        with open(path, "w") as f:
+            json.dump(self._vocab, f, indent=2)
+        return (path,)
+    def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
+        cls = [self.cls_token_id]
+        if token_ids_1 is None:
+            return cls + token_ids_0
+        return cls + token_ids_0 + cls + token_ids_1
+    def get_special_tokens_mask(self, token_ids_0, token_ids_1=None,
+                                already_has_special_tokens=False):
+        if already_has_special_tokens:
+            return super().get_special_tokens_mask(
+                token_ids_0, token_ids_1, already_has_special_tokens=True)
+        mask = [1] + [0] * len(token_ids_0)
+        if token_ids_1 is not None:
+            mask += [1] + [0] * len(token_ids_1)
+        return mask
+    def create_token_type_ids_from_sequences(self, token_ids_0, token_ids_1=None):
+        # Single segment type: one 0 per token, including each leading <cls>.
+        if token_ids_1 is None:
+            return [0] * (1 + len(token_ids_0))
+        return [0] * (2 + len(token_ids_0) + len(token_ids_1))
+    def __call__(
+        self,
+        text,
+        text_pair=None,
+        add_special_tokens=True,
+        padding=False,
+        truncation=False,
+        max_length=None,
+        return_tensors=None,
+        **kwargs,
+    ):
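+        """
+        Tokenize one or more sequences, each treated as a 1-row MSA.
+        text: str or List[str]
+        text_pair, truncation, max_length and **kwargs are accepted but
+        currently ignored. With padding enabled, sequences are padded to the
+        longest sequence in the batch.
+        Returns dict with input_ids of shape (batch, 1, seqlen) and
+        attention_mask of shape (batch, 1, seqlen).
+        """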
+        if isinstance(text, str):
+            sequences = [text]
+        else:
+            sequences = list(text)
+        encoded = []
+        for seq in sequences:
+            ids = self._tokenize_single(seq, add_special_tokens)
+            encoded.append(ids)
+        if padding and len(encoded) > 1:
+            max_len = max(len(ids) for ids in encoded)
+            pad_id = self.pad_token_id
+            encoded = [ids + [pad_id] * (max_len - len(ids)) for ids in encoded]
+        input_ids = [[ids] for ids in encoded]
+        attention_mask = [[[1 if t != self.pad_token_id else 0 for t in ids]]
+                          for ids in encoded]
+        if return_tensors == "pt":
+            input_ids = torch.tensor(input_ids, dtype=torch.long)
+            attention_mask = torch.tensor(attention_mask, dtype=torch.long)
+        return {"input_ids": input_ids, "attention_mask": attention_mask}
+    def _tokenize_single(self, sequence, add_special_tokens=True):
+        tokens = list(sequence)
+        ids = [self._convert_token_to_id(t) for t in tokens]
+        if add_special_tokens:
+            ids = [self.cls_token_id] + ids
+        return ids
+    def encode_msa(
+        self,
+        msas,
+        add_special_tokens=True,
+        padding=False,
+        return_tensors=None,
+    ):
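+        """
+        Tokenize a batch of MSAs.
+        msas: List[List[str]], or a single MSA given as List[str]
+            Each inner list is one MSA: aligned sequences of equal length.
+        With padding=True, rows are padded to max_seqlen and each MSA is padded
+        with all-<pad> rows up to max_alignments. return_tensors="pt" requires
+        padding unless all MSAs already share the same shape.
+        Returns dict with:
+            input_ids: (batch, max_alignments, max_seqlen)
+            attention_mask: (batch, max_alignments, max_seqlen)
+        """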
+        if isinstance(msas[0], str):
+            msas = [msas]
+        max_rows = max(len(msa) for msa in msas)
+        max_seqlen = max(
+            len(self._tokenize_single(seq, add_special_tokens))
+            for msa in msas for seq in msa
+        )
+        pad_id = self.pad_token_id
+        batch_ids = []
+        batch_mask = []
+        for msa in msas:
+            msa_ids = []
+            msa_mask = []
+            for seq in msa:
+                ids = self._tokenize_single(seq, add_special_tokens)
+                if padding:
+                    pad_len = max_seqlen - len(ids)
+                    mask = [1] * len(ids) + [0] * pad_len
+                    ids = ids + [pad_id] * pad_len
+                else:
+                    mask = [1] * len(ids)
+                msa_ids.append(ids)
+                msa_mask.append(mask)
+            if padding:
+                pad_row = [pad_id] * max_seqlen
+                pad_mask_row = [0] * max_seqlen
+                while len(msa_ids) < max_rows:
+                    # Append copies so padded rows do not share one list object.
+                    msa_ids.append(list(pad_row))
+                    msa_mask.append(list(pad_mask_row))
+            batch_ids.append(msa_ids)
+            batch_mask.append(msa_mask)
+        if return_tensors == "pt":
+            batch_ids = torch.tensor(batch_ids, dtype=torch.long)
+            batch_mask = torch.tensor(batch_mask, dtype=torch.long)
+        return {"input_ids": batch_ids, "attention_mask": batch_mask}
+    def decode(self, token_ids, skip_special_tokens=False, **kwargs):
+        # Expects a flat (1D) sequence of ids, i.e. a single MSA row.
+        if isinstance(token_ids, torch.Tensor):
+            token_ids = token_ids.tolist()
+        tokens = [self._convert_id_to_token(i) for i in token_ids]
+        if skip_special_tokens:
+            special = {self.cls_token, self.pad_token, self.eos_token,
+                       self.unk_token, self.mask_token}
+            tokens = [t for t in tokens if t not in special]
+        return "".join(tokens)
+    def num_special_tokens_to_add(self, pair=False):
+        # build_inputs_with_special_tokens prepends one <cls> per sequence.
+        return 2 if pair else 1
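
The padding scheme in `encode_msa` above can be checked in isolation. The following is a minimal sketch, not the repository's code: it inlines the vocabulary from `vocab.json`, and the helper names `encode_row` and `pad_msa_batch` are hypothetical. It assumes `padding=True` and `add_special_tokens=True`.

```python
# Standalone sketch of the encode_msa padding scheme (hypothetical helpers).
# VOCAB mirrors vocab.json from this commit.
VOCAB = {"<cls>": 0, "<pad>": 1, "<eos>": 2, "<unk>": 3, "A": 4, "G": 5,
         "C": 6, "U": 7, "X": 8, "N": 9, "-": 10, "<mask>": 11}


def encode_row(seq):
    """Prepend <cls> and map each character to its id (<unk> if missing)."""
    return [VOCAB["<cls>"]] + [VOCAB.get(ch, VOCAB["<unk>"]) for ch in seq]


def pad_msa_batch(msas):
    """Pad every MSA to (max_rows, max_seqlen) and build a matching mask."""
    rows = [[encode_row(seq) for seq in msa] for msa in msas]
    max_rows = max(len(msa) for msa in rows)
    max_len = max(len(ids) for msa in rows for ids in msa)
    pad_id = VOCAB["<pad>"]
    batch_ids, batch_mask = [], []
    for msa in rows:
        ids = [r + [pad_id] * (max_len - len(r)) for r in msa]
        mask = [[1] * len(r) + [0] * (max_len - len(r)) for r in msa]
        # Fill missing alignment rows with fresh all-pad rows.
        for _ in range(max_rows - len(msa)):
            ids.append([pad_id] * max_len)
            mask.append([0] * max_len)
        batch_ids.append(ids)
        batch_mask.append(mask)
    return batch_ids, batch_mask


# Two MSAs of different depth (2 rows vs 1 row) and different length.
ids, mask = pad_msa_batch([["AUGC", "A-GC"], ["GGU"]])
print(ids[1])   # [[0, 5, 5, 7, 1], [1, 1, 1, 1, 1]]
print(mask[1])  # [[1, 1, 1, 1, 0], [0, 0, 0, 0, 0]]
```

The second MSA gains one trailing `<pad>` column and one all-`<pad>` row, both masked out, so every MSA in the batch ends up with shape `(2, 5)`.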

tokenizer_config.json ADDED

	@@ -0,0 +1,56 @@

+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "<cls>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "<pad>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "<eos>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "3": {
+      "content": "<unk>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "11": {
+      "content": "<mask>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "auto_map": {
+    "AutoTokenizer": ["tokenization_rnamsm.RNAMSMTokenizer", null]
+  },
+  "clean_up_tokenization_spaces": false,
+  "cls_token": "<cls>",
+  "eos_token": "<eos>",
+  "extra_special_tokens": {},
+  "mask_token": "<mask>",
+  "model_max_length": 1024,
+  "pad_token": "<pad>",
+  "tokenizer_class": "RNAMSMTokenizer",
+  "unk_token": "<unk>"
+}

vocab.json ADDED

	@@ -0,0 +1,14 @@

+{
+  "<cls>": 0,
+  "<pad>": 1,
+  "<eos>": 2,
+  "<unk>": 3,
+  "A": 4,
+  "G": 5,
+  "C": 6,
+  "U": 7,
+  "X": 8,
+  "N": 9,
+  "-": 10,
+  "<mask>": 11
+}
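
Because the tokenizer splits input into single characters and looks each one up in this vocabulary, any character outside it (lowercase letters, DNA-style `T`) becomes `<unk>` (id 3). The sketch below illustrates that with the vocabulary inlined. The `normalize` step is a hypothetical preprocessing helper, not something this commit provides, and whether mapping `T` to `U` suits your data is an assumption to verify.

```python
import json

# Contents of vocab.json from this commit, inlined to keep the sketch
# self-contained.
VOCAB_JSON = """
{"<cls>": 0, "<pad>": 1, "<eos>": 2, "<unk>": 3, "A": 4, "G": 5, "C": 6,
 "U": 7, "X": 8, "N": 9, "-": 10, "<mask>": 11}
"""
vocab = json.loads(VOCAB_JSON)


def normalize(seq):
    """Hypothetical preprocessing: uppercase and map DNA-style T to U."""
    return seq.upper().replace("T", "U")


def encode(seq):
    """Character-level lookup with <unk> fallback, as in _convert_token_to_id."""
    return [vocab.get(ch, vocab["<unk>"]) for ch in seq]


raw = "acgt-N"
print(encode(raw))             # [3, 3, 3, 3, 10, 9]
print(encode(normalize(raw)))  # [4, 6, 5, 7, 10, 9]
```

Without normalization the four lowercase characters collapse to `<unk>`, while the gap `-` and `N` are looked up directly.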