JadenLong committed on
Commit 7f3c4b0 · verified · 1 Parent(s): 2a1237a

Update README.md

Files changed (1): README.md (+87 -3)
---
license: mit
tags:
- biology
- transformers
- Feature Extraction
- bioRxiv 2025.01.23.634452
---

**This is the repository for MutBERT-Human-Ref (no mutation data).**

## Introduction

This is the official pre-trained model introduced in [MutBERT: Probabilistic Genome Representation Improves Genomics Foundation Models](https://www.biorxiv.org/content/10.1101/2025.01.23.634452v1).

We sincerely appreciate the Tochka-AI team for their ruRoPEBert implementation, which serves as the base of MutBERT's development.

MutBERT is a transformer-based genome foundation model trained only on the human genome.

## Model Source

- Repository: [MutBERT](https://github.com/ai4nucleome/mutBERT)
- Paper: [MutBERT: Probabilistic Genome Representation Improves Genomics Foundation Models](https://www.biorxiv.org/content/10.1101/2025.01.23.634452v1)

## Usage

### Load tokenizer and model

```python
from transformers import AutoTokenizer, AutoModel

model_name = "JadenLong/MutBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
```

The default attention implementation is "sdpa" (PyTorch's scaled dot-product attention, which can dispatch to FlashAttention kernels). If you want to use basic attention, you can replace it with "eager". Please refer to [here](https://huggingface.co/JadenLong/MutBERT/blob/main/modeling_mutbert.py#L438).
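
A minimal sketch of switching implementations, assuming the remote MutBERT code follows the standard `transformers` convention of accepting an `attn_implementation` argument in `from_pretrained` (otherwise, edit the modeling file linked above):

```python
from transformers import AutoModel

# Assumption: the custom remote code honors the standard
# `attn_implementation` keyword. "sdpa" is the default;
# "eager" selects the basic attention implementation.
model = AutoModel.from_pretrained(
    "JadenLong/MutBERT",
    trust_remote_code=True,
    attn_implementation="eager",
)
```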

### Get embeddings

```python
import torch
import torch.nn.functional as F

from transformers import AutoTokenizer, AutoModel

model_name = "JadenLong/MutBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

dna = "ATCGGGGCCCATTA"
inputs = tokenizer(dna, return_tensors="pt")["input_ids"]

# One-hot distribution over the vocabulary; len(tokenizer) is the vocab size
mut_inputs = F.one_hot(inputs, num_classes=len(tokenizer)).float().to("cpu")

last_hidden_state = model(inputs).last_hidden_state  # [1, sequence_length, 768]
# or: last_hidden_state = model(mut_inputs)[0]       # [1, sequence_length, 768]

# embedding with mean pooling
embedding_mean = torch.mean(last_hidden_state[0], dim=0)
print(embedding_mean.shape)  # torch.Size([768])

# embedding with max pooling
embedding_max = torch.max(last_hidden_state[0], dim=0)[0]
print(embedding_max.shape)  # torch.Size([768])
```
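
Because the model consumes a probability distribution over the vocabulary at each position (`mut_inputs` above is just the one-hot special case), uncertain sites can be encoded with fractional probabilities. A hypothetical sketch reusing the objects above; the SNP position and allele frequencies are made up for illustration, and single-character base tokens ("A", "G") are assumed:

```python
prob_inputs = mut_inputs.clone()

# Hypothetical SNP: position 3 observed as 60% "A" / 40% "G".
# Assumes the tokenizer has single-nucleotide tokens.
a_id = tokenizer.convert_tokens_to_ids("A")
g_id = tokenizer.convert_tokens_to_ids("G")
prob_inputs[0, 3, :] = 0.0
prob_inputs[0, 3, a_id] = 0.6
prob_inputs[0, 3, g_id] = 0.4

last_hidden_state = model(prob_inputs)[0]  # still [1, sequence_length, 768]
```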

### Using as a Classifier

```python
from transformers import AutoModelForSequenceClassification

model_name = "JadenLong/MutBERT"
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, trust_remote_code=True, num_labels=2
)
```
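
A short usage sketch, assuming the classification head follows the standard `transformers` interface (note the head is freshly initialized, so fine-tune before trusting the outputs):

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer("ATCGGGGCCCATTA", return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits       # [1, num_labels]
probs = torch.softmax(logits, dim=-1)     # class probabilities
```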

### With RoPE scaling

Allowed types for RoPE scaling are `linear` and `dynamic`. To extend the model's context window, add the `rope_scaling` parameter.

If you want to scale your model's context by 2x:

```python
model = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,
    rope_scaling={"type": "dynamic", "factor": 2.0},  # 2.0 for 2x scaling, 4.0 for 4x, etc.
)
```
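
The `linear` type is requested the same way; which of the two works better is an empirical question for your downstream sequence lengths. A sketch with an assumed 4x factor:

```python
model = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,
    rope_scaling={"type": "linear", "factor": 4.0},  # linear position interpolation, 4x context
)
```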