---
license: apache-2.0
---

# mRNABERT

A robust language model pre-trained on over 18 million high-quality mRNA sequences, incorporating contrastive learning to integrate the semantic features of amino acids.

This is the official pre-trained model introduced in [A Universal Model Integrating Multimodal Data for Comprehensive mRNA Property Prediction](https://doi.org/10.1101/2022.08.06.5).

The repository of mRNABERT is at [yyly6/mRNABERT](https://github.com/yyly6/mRNABERT).

## Intended uses & limitations
The model can be used for mRNA sequence feature extraction or fine-tuned on downstream tasks. **Before passing sequences to the model, you need to preprocess the data: use single-letter separation for the UTR regions and three-letter (codon) separation for the CDS regions.** For full examples, please see [our code on data processing](https://github.com/yyly6/mRNABERT).
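As a rough illustration of this preprocessing, assuming the UTR/CDS boundaries are already known (the helper below is a hypothetical sketch, not part of the official pipeline — see the repository for the actual processing code):

```python
def tokenize_mrna(utr5: str, cds: str, utr3: str) -> str:
    """Hypothetical preprocessing sketch: single-letter tokens for the UTRs,
    three-letter (codon) tokens for the CDS, all joined by spaces."""
    utr5_tokens = list(utr5)                                    # one nucleotide per token
    cds_tokens = [cds[i:i + 3] for i in range(0, len(cds), 3)]  # one codon per token
    utr3_tokens = list(utr3)
    return " ".join(utr5_tokens + cds_tokens + utr3_tokens)

print(tokenize_mrna("ATCGGA", "GGGCCCTTT", ""))
# → "A T C G G A GGG CCC TTT"
```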

## Training data

The mRNABERT model was pretrained on [a comprehensive mRNA dataset](https://zenodo.org/records/12516160), which originally consisted of approximately 36 million complete CDS or mRNA sequences. After cleaning, this number was reduced to 18 million.

## Usage
To load the model from Hugging Face:
```python
import torch
from transformers import AutoTokenizer, AutoModel
from transformers.models.bert.configuration_bert import BertConfig

config = BertConfig.from_pretrained("YYLY66/mRNABERT")
tokenizer = AutoTokenizer.from_pretrained("YYLY66/mRNABERT")
model = AutoModel.from_pretrained("YYLY66/mRNABERT", trust_remote_code=True, config=config)
```

To extract the embeddings of mRNA sequences:

```python
seqs = ["A T C G G A GGG CCC TTT",
        "A T C G",
        "TTT CCC GAC ATG"]  # Separate the tokens within each sequence with spaces.

encoding = tokenizer.batch_encode_plus(seqs, add_special_tokens=True, padding='longest', return_tensors="pt")

input_ids = encoding['input_ids']
attention_mask = encoding['attention_mask']

output = model(input_ids=input_ids, attention_mask=attention_mask)
last_hidden_state = output[0]  # Shape: [batch_size, seq_length, hidden_size]

# Expand the mask to match the hidden states: [batch_size, seq_length, hidden_size]
mask = attention_mask.unsqueeze(-1).expand_as(last_hidden_state).float()

# Sum the embeddings along the sequence dimension, ignoring padding positions
sum_embeddings = torch.sum(last_hidden_state * mask, dim=1)

# Count the non-padding positions for each sequence
sum_masks = mask.sum(1)

# Compute the mean-pooled embedding. Shape: [batch_size, hidden_size]
mean_embedding = sum_embeddings / sum_masks
```
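The masked mean pooling above can be sanity-checked in plain Python on toy numbers (no model required; the values below are illustrative, not real embeddings):

```python
# Toy example: one sequence of 3 positions (the last is padding), hidden size 2.
hidden = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]  # per-position embeddings
mask = [1, 1, 0]                               # 0 marks the padding position

# Sum only the unmasked positions, then divide by their count.
summed = [sum(h[d] * m for h, m in zip(hidden, mask)) for d in range(2)]
count = sum(mask)
mean_embedding = [s / count for s in summed]
print(mean_embedding)  # → [2.0, 3.0]; the padded position is ignored
```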

The extracted embeddings can be used for contrastive learning pretraining or as a feature extractor for protein-related downstream tasks.
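For instance, pooled embeddings can be compared with cosine similarity. A plain-Python sketch on toy vectors (in practice you would apply it to rows of the `mean_embedding` tensor, e.g. via `torch.nn.functional.cosine_similarity`):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings" standing in for real model outputs.
emb1 = [0.2, 0.1, -0.3, 0.5]
emb2 = [0.2, 0.1, -0.3, 0.5]
print(cosine_similarity(emb1, emb2))  # identical vectors → similarity of 1.0
```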

## Citation

**BibTeX**:

```bibtex

```

## Contact
If you have any questions, please feel free to email us (22360244@zju.edu.cn).