Note:

This model is copied version of mRNABERT which removes the FlashAttention integration with Trition. This allows the model to be installed off HuggingFace without having to uninstall Triton. Running the below example code yields identical output compared to the original verison.

import torch
from transformers import AutoTokenizer, AutoModel
from transformers.models.bert.configuration_bert import BertConfig

config = BertConfig.from_pretrained("Taykhoom/mRNABERT-no-flashattention")
tokenizer = AutoTokenizer.from_pretrained("Taykhoom/mRNABERT-no-flashattention")
model = AutoModel.from_pretrained("Taykhoom/mRNABERT-no-flashattention", trust_remote_code=True, config=config)

seq = ["A T C G G A GGG CCC TTT", 
       "A T C G", 
       "TTT CCC GAC ATG"]  #Separate the sequences with spaces.

encoding = tokenizer.batch_encode_plus(seq, add_special_tokens=True, padding='longest', return_tensors="pt")

input_ids = encoding['input_ids']
attention_mask = encoding['attention_mask'] 

output = model(input_ids=input_ids, attention_mask=attention_mask)
last_hidden_state = output[0]

attention_mask = attention_mask.unsqueeze(-1).expand_as(last_hidden_state)  # Shape : [batch_size, seq_length, hidden_size]

# Sum embeddings along the batch dimension
sum_embeddings = torch.sum(last_hidden_state * attention_mask, dim=1)  

# Also sum the masks along the batch dimension
sum_masks = attention_mask.sum(1)  

# Compute mean embedding.
mean_embedding = sum_embeddings / sum_masks  #Shape:[batch_size, hidden_size]

print(torch.mean(mean_embedding, dim=1))
# Should output: tensor([-0.0209, -0.0156, -0.0201], device='cuda:0', grad_fn=<MeanBackward1>)
# this is the same as the original version of mRNABERT (checked using original installation instructions)

Original README:

"""

mRNABERT

A robust language model pre-trained on over 18 million high-quality mRNA sequences, incorporating contrastive learning to integrate the semantic features of amino acids.

This is the official pre-trained model introduced in mRNABERT: advancing mRNA sequence design with a universal language model and comprehensive dataset.

The repository of mRNABERT is at yyly6/mRNABERT.

Intended uses & limitations

The model could be used for mRNA sequences feature extraction or to be fine-tuned on downstream tasks. **Before inputting the model, you need to preprocess the data: use single-letter separation for the UTR regions and three-character separation for the CDS regions.**For full examples, please see our code on data processing.

Training data

The mRNABERT model was pretrained on a comprehensive mRNA dataset, which originally consisted of approximately 36 million complete CDS or mRNA sequences. After cleaning, this number was reduced to 18 million.

Usage

To load the model from huggingface:

import torch
from transformers import AutoTokenizer, AutoModel
from transformers.models.bert.configuration_bert import BertConfig

config = BertConfig.from_pretrained("YYLY66/mRNABERT")
tokenizer = AutoTokenizer.from_pretrained("YYLY66/mRNABERT")
model = AutoModel.from_pretrained("YYLY66/mRNABERT", trust_remote_code=True, config=config)

To extract the embeddings of mRNA sequences:

seq = ["A T C G G A GGG CCC TTT", 
       "A T C G", 
       "TTT CCC GAC ATG"]  #Separate the sequences with spaces.

encoding = tokenizer.batch_encode_plus(seq, add_special_tokens=True, padding='longest', return_tensors="pt")

input_ids = encoding['input_ids']
attention_mask = encoding['attention_mask'] 

output = model(input_ids=input_ids, attention_mask=attention_mask)
last_hidden_state = output[0]

attention_mask = attention_mask.unsqueeze(-1).expand_as(last_hidden_state)  # Shape : [batch_size, seq_length, hidden_size]

# Sum embeddings along the batch dimension
sum_embeddings = torch.sum(last_hidden_state * attention_mask, dim=1)  

# Also sum the masks along the batch dimension
sum_masks = attention_mask.sum(1)  

# Compute mean embedding.
mean_embedding = sum_embeddings / sum_masks  #Shape:[batch_size, hidden_size]

The extracted embeddings can be used for contrastive learning pretraining or as a feature extractor for protein-related downstream tasks.

Citation

BibTeX:

@article{xiong2025mrnabert,
  title={mRNABERT: advancing mRNA sequence design with a universal language model and comprehensive dataset},
  author={Xiong, Ying and Wang, Aowen and Kang, Yu and Shen, Chao and Hsieh, Chang-Yu and Hou, Tingjun},
  journal={Nature Communications},
  volume={16},
  number={1},
  pages={10371},
  year={2025},
  publisher={Nature Publishing Group UK London},
}

Contact

If you have any question, please feel free to email us (xiongying@zju.edu.cn).

"""

Downloads last month: 102