JadenLong committed on
Commit 7f3c4b0 · verified · 1 Parent(s): 2a1237a

Update README.md

Files changed (1): README.md (+87 -3)
---
license: mit
tags:
- biology
- transformers
- Feature Extraction
- bioRxiv 2025.01.23.634452
---

**This is the repository for MutBERT-Human-Ref (no mutation data).**

## Introduction

This is the official pre-trained model introduced in [MutBERT: Probabilistic Genome Representation Improves Genomics Foundation Models](https://www.biorxiv.org/content/10.1101/2025.01.23.634452v1).

We sincerely appreciate the Tochka-AI team for their ruRoPEBert implementation, which serves as the base of MutBERT's development.

MutBERT is a transformer-based genome foundation model trained only on the human genome.

## Model Source

- Repository: [MutBERT](https://github.com/ai4nucleome/mutBERT)
- Paper: [MutBERT: Probabilistic Genome Representation Improves Genomics Foundation Models](https://www.biorxiv.org/content/10.1101/2025.01.23.634452v1)

## Usage

### Load tokenizer and model

```python
from transformers import AutoTokenizer, AutoModel

model_name = "JadenLong/MutBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
```

The default attention implementation is "sdpa" (PyTorch's scaled dot-product attention, which can dispatch to FlashAttention kernels). If you want to use basic attention, you can replace it with "eager". Please refer to [here](https://huggingface.co/JadenLong/MutBERT/blob/main/modeling_mutbert.py#L438).
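
A minimal sketch of switching implementations, assuming the remote MutBERT code follows the standard `transformers` convention of accepting an `attn_implementation` argument in `from_pretrained` (otherwise, edit the modeling file linked above):

```python
from transformers import AutoModel

# Assumption: the custom remote code honors the standard
# `attn_implementation` keyword. "sdpa" is the default;
# "eager" selects the basic attention implementation.
model = AutoModel.from_pretrained(
    "JadenLong/MutBERT",
    trust_remote_code=True,
    attn_implementation="eager",
)
```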

### Get embeddings

```python
import torch
import torch.nn.functional as F

from transformers import AutoTokenizer, AutoModel

model_name = "JadenLong/MutBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

dna = "ATCGGGGCCCATTA"
inputs = tokenizer(dna, return_tensors="pt")["input_ids"]

# One-hot distribution over the vocabulary; len(tokenizer) is the vocab size
mut_inputs = F.one_hot(inputs, num_classes=len(tokenizer)).float().to("cpu")

last_hidden_state = model(inputs).last_hidden_state  # [1, sequence_length, 768]
# or: last_hidden_state = model(mut_inputs)[0]       # [1, sequence_length, 768]

# embedding with mean pooling
embedding_mean = torch.mean(last_hidden_state[0], dim=0)
print(embedding_mean.shape)  # torch.Size([768])

# embedding with max pooling
embedding_max = torch.max(last_hidden_state[0], dim=0)[0]
print(embedding_max.shape)  # torch.Size([768])
```
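
Because the model consumes a probability distribution over the vocabulary at each position (`mut_inputs` above is just the one-hot special case), uncertain sites can be encoded with fractional probabilities. A hypothetical sketch reusing the objects above; the SNP position and allele frequencies are made up for illustration, and single-character base tokens ("A", "G") are assumed:

```python
prob_inputs = mut_inputs.clone()

# Hypothetical SNP: position 3 observed as 60% "A" / 40% "G".
# Assumes the tokenizer has single-nucleotide tokens.
a_id = tokenizer.convert_tokens_to_ids("A")
g_id = tokenizer.convert_tokens_to_ids("G")
prob_inputs[0, 3, :] = 0.0
prob_inputs[0, 3, a_id] = 0.6
prob_inputs[0, 3, g_id] = 0.4

last_hidden_state = model(prob_inputs)[0]  # still [1, sequence_length, 768]
```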

### Using as a Classifier

```python
from transformers import AutoModelForSequenceClassification

model_name = "JadenLong/MutBERT"
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, trust_remote_code=True, num_labels=2
)
```
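
A short usage sketch, assuming the classification head follows the standard `transformers` interface (note the head is freshly initialized, so fine-tune before trusting the outputs):

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer("ATCGGGGCCCATTA", return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits       # [1, num_labels]
probs = torch.softmax(logits, dim=-1)     # class probabilities
```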

### With RoPE scaling

Allowed types for RoPE scaling are `linear` and `dynamic`. To extend the model's context window, add the `rope_scaling` parameter.

If you want to scale your model's context by 2x:

```python
model = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,
    rope_scaling={"type": "dynamic", "factor": 2.0},  # 2.0 for 2x scaling, 4.0 for 4x, etc.
)
```
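
The `linear` type is requested the same way; which of the two works better is an empirical question for your downstream sequence lengths. A sketch with an assumed 4x factor:

```python
model = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,
    rope_scaling={"type": "linear", "factor": 4.0},  # linear position interpolation, 4x context
)
```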