quietflamingo's picture
Update README.md
813031b verified
---
metrics:
- matthews_correlation
- f1
tags:
- biology
- medical
license: apache-2.0
---
### Note:
This model is copied version of DNABERT-2 which removes the FlashAttention integration with Trition. This allows the model to be installed off HuggingFace without having to uninstall Triton. Running the below example code yields identical output compared to the original verison.
```
import torch
from transformers import AutoTokenizer, AutoModel
from transformers.models.bert.configuration_bert import BertConfig
device = torch.device("cuda")
tokenizer = AutoTokenizer.from_pretrained(
"quietflamingo/dnabert2-fixed",
trust_remote_code=True,
)
config = BertConfig.from_pretrained(
"quietflamingo/dnabert2-fixed",
)
self.model = AutoModel.from_pretrained(
"quietflamingo/dnabert2-fixed",
config=config
)
dna = "ACGTAGCATCGGATCTATCTATCGACACTTGGTTATCGATCTACGAGCATCTCGTTAGC"
inputs = tokenizer(dna, return_tensors = 'pt')["input_ids"]
inputs = inputs.to(device)
model = model.to(device)
hidden_states = model(inputs)[0] # [1, sequence_length, 768]
embedding_mean = torch.mean(hidden_states[0], dim=0)
print(torch.mean(embedding_mean) # Outputs 0.0045, matches DNABERT2
embedding_max = torch.max(hidden_states[0], dim=0)[0]
print(torch.mean(embedding_max) # Outputs 0.2840, matches DNABERT2
```
If you use this model please give full attribution to the original authors below:
https://huggingface.co/zhihan1996/DNABERT-2-117M
```
@misc{zhou2023dnabert2,
title={DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome},
author={Zhihan Zhou and Yanrong Ji and Weijian Li and Pratik Dutta and Ramana Davuluri and Han Liu},
year={2023},
eprint={2306.15006},
archivePrefix={arXiv},
primaryClass={q-bio.GN}
}
```
### Original README:
"""
This is the official pre-trained model introduced in [DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome
](https://arxiv.org/pdf/2306.15006.pdf).
We sincerely appreciate the MosaicML team for the [MosaicBERT](https://openreview.net/forum?id=5zipcfLC2Z) implementation, which serves as the base of DNABERT-2 development.
DNABERT-2 is a transformer-based genome foundation model trained on multi-species genome.
To load the model from huggingface:
```
import torch
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)
model = AutoModel.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)
```
To calculate the embedding of a dna sequence
```
dna = "ACGTAGCATCGGATCTATCTATCGACACTTGGTTATCGATCTACGAGCATCTCGTTAGC"
inputs = tokenizer(dna, return_tensors = 'pt')["input_ids"]
hidden_states = model(inputs)[0] # [1, sequence_length, 768]
# embedding with mean pooling
embedding_mean = torch.mean(hidden_states[0], dim=0)
print(embedding_mean.shape) # expect to be 768
# embedding with max pooling
embedding_max = torch.max(hidden_states[0], dim=0)[0]
print(embedding_max.shape) # expect to be 768
```
"""