---
library_name: transformers
license: apache-2.0
tags:
- biology
- medical
- mRNA
- rna
- mrna
---

# mRNABERT

A robust language model pre-trained on over 18 million high-quality mRNA sequences, incorporating contrastive learning to integrate the semantic features of amino acids.

This is the official pre-trained model introduced in [mRNABERT: advancing mRNA sequence design with a universal language model and comprehensive dataset](https://www.nature.com/articles/s41467-025-65340-8#citeas). The repository of mRNABERT is at [yyly6/mRNABERT](https://github.com/yyly6/mRNABERT).

## Intended uses & limitations

The model can be used for mRNA sequence feature extraction or fine-tuned on downstream tasks.

**Before feeding sequences to the model, you need to preprocess the data: use single-letter separation for the UTR regions and three-character (codon) separation for the CDS regions.** For full examples, please see [our code on data processing](https://github.com/yyly6/mRNABERT).

## Training data

The mRNABERT model was pretrained on [a comprehensive mRNA dataset](https://zenodo.org/records/12516160), which originally consisted of approximately 36 million complete CDS or mRNA sequences. After cleaning, this number was reduced to 18 million.

## Usage

To load the model from Hugging Face:

```python
import torch
from transformers import AutoTokenizer, AutoModel
from transformers.models.bert.configuration_bert import BertConfig

config = BertConfig.from_pretrained("YYLY66/mRNABERT")
tokenizer = AutoTokenizer.from_pretrained("YYLY66/mRNABERT")
model = AutoModel.from_pretrained("YYLY66/mRNABERT", trust_remote_code=True, config=config)
```

To extract the embeddings of mRNA sequences:

```python
# Separate the tokens with spaces: single bases in the UTRs, codons in the CDS.
seq = ["A T C G G A GGG CCC TTT", "A T C G", "TTT CCC GAC ATG"]
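# The sequences above assume the preprocessing described earlier. A minimal
# sketch (hypothetical helper, not from the official repo): split the UTRs into
# single bases and the CDS into codons, then join everything with spaces.
def preprocess_mrna(utr5, cds, utr3):
    # Assumes plain A/C/G/T(U) strings and a CDS whose length is a multiple of 3.
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    return " ".join(list(utr5) + codons + list(utr3))

# e.g. preprocess_mrna("AT", "ATGGCC", "G") -> "A T ATG GCC G"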
encoding = tokenizer.batch_encode_plus(seq, add_special_tokens=True, padding='longest', return_tensors="pt")
input_ids = encoding['input_ids']
attention_mask = encoding['attention_mask']

output = model(input_ids=input_ids, attention_mask=attention_mask)
last_hidden_state = output[0]  # Shape: [batch_size, seq_length, hidden_size]

# Expand the mask to the hidden dimension: [batch_size, seq_length, hidden_size]
attention_mask = attention_mask.unsqueeze(-1).expand_as(last_hidden_state)

# Sum embeddings along the sequence dimension, masking out padding tokens
sum_embeddings = torch.sum(last_hidden_state * attention_mask, dim=1)
# Count the non-padding positions along the sequence dimension
sum_masks = attention_mask.sum(1)

# Compute the mean embedding. Shape: [batch_size, hidden_size]
mean_embedding = sum_embeddings / sum_masks
```

The extracted embeddings can be used for contrastive-learning pretraining or as features for protein-related downstream tasks.

## Citation

**BibTeX**:

```
@article{xiong2025mrnabert,
  title={mRNABERT: advancing mRNA sequence design with a universal language model and comprehensive dataset},
  author={Xiong, Ying and Wang, Aowen and Kang, Yu and Shen, Chao and Hsieh, Chang-Yu and Hou, Tingjun},
  journal={Nature Communications},
  volume={16},
  number={1},
  pages={10371},
  year={2025},
  publisher={Nature Publishing Group UK London},
}
```

## Contact

If you have any questions, please feel free to email us (xiongying@zju.edu.cn).