---
license: apache-2.0
---

# mRNABERT

A robust language model pre-trained on over 18 million high-quality mRNA sequences, incorporating contrastive learning to integrate the semantic features of amino acids.

This is the official pre-trained model introduced in [A Universal Model Integrating Multimodal Data for Comprehensive mRNA Property Prediction](https://doi.org/10.1101/2022.08.06.5).

The repository of mRNABERT is at [yyly6/mRNABERT](https://github.com/yyly6/mRNABERT).

## Intended uses & limitations
The model can be used for mRNA sequence feature extraction or fine-tuned on downstream tasks. **Before passing sequences to the model, you need to preprocess the data: use single-letter separation for the UTR regions and three-letter (codon) separation for the CDS regions.** For full examples, please see [our code on data processing](https://github.com/yyly6/mRNABERT).
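As a rough illustration of this preprocessing, assuming the UTR/CDS boundaries are already known (the helper below is a hypothetical sketch, not part of the official pipeline — see the repository for the actual processing code):

```python
def tokenize_mrna(utr5: str, cds: str, utr3: str) -> str:
    """Hypothetical preprocessing sketch: single-letter tokens for the UTRs,
    three-letter (codon) tokens for the CDS, all joined by spaces."""
    utr5_tokens = list(utr5)                                    # one nucleotide per token
    cds_tokens = [cds[i:i + 3] for i in range(0, len(cds), 3)]  # one codon per token
    utr3_tokens = list(utr3)
    return " ".join(utr5_tokens + cds_tokens + utr3_tokens)

print(tokenize_mrna("ATCGGA", "GGGCCCTTT", ""))
# → "A T C G G A GGG CCC TTT"
```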

## Training data

The mRNABERT model was pretrained on [a comprehensive mRNA dataset](https://zenodo.org/records/12516160), which originally consisted of approximately 36 million complete CDS or mRNA sequences. After cleaning, this number was reduced to 18 million.

## Usage
To load the model from Hugging Face:
```python
import torch
from transformers import AutoTokenizer, AutoModel
from transformers.models.bert.configuration_bert import BertConfig

config = BertConfig.from_pretrained("YYLY66/mRNABERT")
tokenizer = AutoTokenizer.from_pretrained("YYLY66/mRNABERT")
model = AutoModel.from_pretrained("YYLY66/mRNABERT", trust_remote_code=True, config=config)
```

To extract the embeddings of mRNA sequences:

```python
seqs = ["A T C G G A GGG CCC TTT",
        "A T C G",
        "TTT CCC GAC ATG"]  # Separate the tokens within each sequence with spaces.

encoding = tokenizer.batch_encode_plus(seqs, add_special_tokens=True, padding='longest', return_tensors="pt")

input_ids = encoding['input_ids']
attention_mask = encoding['attention_mask']

output = model(input_ids=input_ids, attention_mask=attention_mask)
last_hidden_state = output[0]  # Shape: [batch_size, seq_length, hidden_size]

# Expand the mask to match the hidden states: [batch_size, seq_length, hidden_size]
mask = attention_mask.unsqueeze(-1).expand_as(last_hidden_state).float()

# Sum the embeddings along the sequence dimension, ignoring padding positions
sum_embeddings = torch.sum(last_hidden_state * mask, dim=1)

# Count the non-padding positions for each sequence
sum_masks = mask.sum(1)

# Compute the mean-pooled embedding. Shape: [batch_size, hidden_size]
mean_embedding = sum_embeddings / sum_masks
```
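The masked mean pooling above can be sanity-checked in plain Python on toy numbers (no model required; the values below are illustrative, not real embeddings):

```python
# Toy example: one sequence of 3 positions (the last is padding), hidden size 2.
hidden = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]  # per-position embeddings
mask = [1, 1, 0]                               # 0 marks the padding position

# Sum only the unmasked positions, then divide by their count.
summed = [sum(h[d] * m for h, m in zip(hidden, mask)) for d in range(2)]
count = sum(mask)
mean_embedding = [s / count for s in summed]
print(mean_embedding)  # → [2.0, 3.0]; the padded position is ignored
```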

The extracted embeddings can be used for contrastive learning pretraining or as a feature extractor for protein-related downstream tasks.
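For instance, pooled embeddings can be compared with cosine similarity. A plain-Python sketch on toy vectors (in practice you would apply it to rows of the `mean_embedding` tensor, e.g. via `torch.nn.functional.cosine_similarity`):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings" standing in for real model outputs.
emb1 = [0.2, 0.1, -0.3, 0.5]
emb2 = [0.2, 0.1, -0.3, 0.5]
print(cosine_similarity(emb1, emb2))  # identical vectors → similarity of 1.0
```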

## Citation

**BibTeX**:

```bibtex

```

## Contact
If you have any questions, please feel free to email us (22360244@zju.edu.cn).