---
license: mit
tags:
- biology
- transformers
- Feature Extraction
- bioRxiv 2025.01.23.634452
---

## Introduction

This is the official pre-trained model introduced in *MutBERT: Probabilistic Genome Representation Improves Genomics Foundation Models*.

We sincerely appreciate the Tochka-AI team for the ruRoPEBert implementation, which serves as the base of MutBERT's development.

MutBERT is a transformer-based genome foundation model trained solely on the human genome.

## Model Source

- Repository: [MutBERT](https://github.com/ai4nucleome/mutBERT)
- Paper: [MutBERT: Probabilistic Genome Representation Improves Genomics Foundation Models](https://www.biorxiv.org/content/10.1101/2025.01.23.634452v1)

## Usage

### Load tokenizer and model

```python
from transformers import AutoTokenizer, AutoModel

model_name = "JadenLong/MutBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
```

The default attention implementation is `"sdpa"` (PyTorch's scaled dot-product attention, which can dispatch to flash-attention kernels). If you want to use basic attention, replace it with `"eager"`. Please refer to [here](https://huggingface.co/JadenLong/MutBERT/blob/main/modeling_mutbert.py#L438).
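
For example, the implementation can be selected at load time with the standard `transformers` keyword `attn_implementation` (the RoPE scaling example below passes the same keyword):

```python
from transformers import AutoModel

# "eager" = basic attention; "sdpa" (default) = PyTorch scaled dot-product attention
model = AutoModel.from_pretrained(
    "JadenLong/MutBERT",
    trust_remote_code=True,
    attn_implementation="eager",
)
```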

### Get embeddings

```python
import torch
import torch.nn.functional as F

from transformers import AutoTokenizer, AutoModel

model_name = "JadenLong/MutBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

dna = "ATCGGGGCCCATTA"
inputs = tokenizer(dna, return_tensors='pt')["input_ids"]

mut_inputs = F.one_hot(inputs, num_classes=len(tokenizer)).float().to("cpu")  # len(tokenizer) is the vocab size
last_hidden_state = model(inputs).last_hidden_state  # [1, sequence_length, 768]
# or: last_hidden_state = model(mut_inputs)[0]  # [1, sequence_length, 768]

# embedding with mean pooling
embedding_mean = torch.mean(last_hidden_state[0], dim=0)
print(embedding_mean.shape)  # expect torch.Size([768])

# embedding with max pooling
embedding_max = torch.max(last_hidden_state[0], dim=0)[0]
print(embedding_max.shape)  # expect torch.Size([768])
```
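
The float input accepted above is what makes MutBERT "probabilistic": each row of `mut_inputs` is a distribution over the vocabulary, so it does not have to be a hard one-hot vector. A minimal sketch continuing from the snippet above (the single-nucleotide tokens, the SNP position, and the 70/30 frequencies are illustrative assumptions, not from the original README):

```python
# Replace the one-hot row at one position with population allele frequencies,
# e.g. a SNP that is 70% G / 30% A. Token ids are looked up, not hardcoded.
g_id = tokenizer.convert_tokens_to_ids("G")
a_id = tokenizer.convert_tokens_to_ids("A")

probs = F.one_hot(inputs, num_classes=len(tokenizer)).float()
probs[0, 3] = 0.0        # clear the hard one-hot row at position 3
probs[0, 3, g_id] = 0.7  # P(G) = 0.7
probs[0, 3, a_id] = 0.3  # P(A) = 0.3

snp_hidden = model(probs).last_hidden_state  # same [1, sequence_length, 768] output
```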

### Using as a Classifier

```python
from transformers import AutoModelForSequenceClassification

model_name = "JadenLong/MutBERT"
model = AutoModelForSequenceClassification.from_pretrained(model_name, trust_remote_code=True, num_labels=2)
```
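
The classification head is newly initialized, so fine-tune on a labeled downstream task before trusting its predictions. A brief inference sketch (standard `transformers` usage; the DNA string is a placeholder):

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer("ATCGGGGCCCATTA", return_tensors='pt')["input_ids"]

with torch.no_grad():
    logits = model(inputs).logits      # [1, num_labels]
probs = torch.softmax(logits, dim=-1)  # class probabilities
```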

### With RoPE scaling

The allowed RoPE scaling types are `linear` and `dynamic`. To extend the model's context window, add the `rope_scaling` parameter when loading the model.

If you want to scale your model context by 2x:

```python
model = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,
    attn_implementation='sdpa',
    rope_scaling={'type': 'dynamic', 'factor': 2.0},  # 2.0 for x2 scaling, 4.0 for x4, etc.
)
```
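
As general background on the two modes in `transformers` (not MutBERT-specific): `linear` divides the position indices uniformly by `factor`, while `dynamic` (NTK-style) adjusts the rotary base as inputs grow, which tends to degrade short-context quality less. A usage sketch with the scaled model loaded above (the repeated sequence is dummy data; substitute real input of a length matching your scaling factor):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)

long_dna = "ATCG" * 1024  # dummy long sequence, beyond the original context window
inputs = tokenizer(long_dna, return_tensors='pt')["input_ids"]
last_hidden_state = model(inputs).last_hidden_state  # [1, long_sequence_length, 768]
```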