---
metrics:
- matthews_correlation
- f1
tags:
- biology
- medical
license: apache-2.0
---

### Note:

This model is a copy of DNABERT-2 with the Triton-based FlashAttention integration removed. This allows the model to be loaded from Hugging Face without first uninstalling Triton. Running the example code below yields output identical to that of the original version.

```
import torch
from transformers import AutoTokenizer, AutoModel
from transformers.models.bert.configuration_bert import BertConfig

# Use the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained(
    "quietflamingo/dnabert2-fixed",
    trust_remote_code=True,
)

config = BertConfig.from_pretrained(
    "quietflamingo/dnabert2-fixed",
)

model = AutoModel.from_pretrained(
    "quietflamingo/dnabert2-fixed",
    config=config,
)

dna = "ACGTAGCATCGGATCTATCTATCGACACTTGGTTATCGATCTACGAGCATCTCGTTAGC"
inputs = tokenizer(dna, return_tensors="pt")["input_ids"]

inputs = inputs.to(device)
model = model.to(device)

hidden_states = model(inputs)[0]  # [1, sequence_length, 768]

# Mean pooling over the sequence dimension
embedding_mean = torch.mean(hidden_states[0], dim=0)
print(torch.mean(embedding_mean))  # Outputs 0.0045, matches DNABERT2

# Max pooling over the sequence dimension
embedding_max = torch.max(hidden_states[0], dim=0)[0]
print(torch.mean(embedding_max))  # Outputs 0.2840, matches DNABERT2
```
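
One way to check the "identical output" claim yourself is to compute the same pooled embeddings for both checkpoints and compare them numerically. The sketch below is illustrative only: `pooled_summary` and `summaries_match` are hypothetical helper names, the tolerance is an assumed value, and random tensors stand in for the `hidden_states` of the two models so that the snippet runs without downloading anything.

```python
import torch


def pooled_summary(hidden_states: torch.Tensor) -> dict:
    """Mean- and max-pool a [1, sequence_length, hidden_size] tensor, as in the example above."""
    embedding_mean = torch.mean(hidden_states[0], dim=0)
    embedding_max = torch.max(hidden_states[0], dim=0)[0]
    return {"mean": embedding_mean, "max": embedding_max}


def summaries_match(a: dict, b: dict, atol: float = 1e-4) -> bool:
    """Return True when both pooled embeddings agree within an absolute tolerance (assumed value)."""
    return all(torch.allclose(a[key], b[key], atol=atol) for key in ("mean", "max"))


# Stand-ins for the two models' outputs (assumption: replace with real hidden states)
torch.manual_seed(0)
hidden_fixed = torch.randn(1, 17, 768)
hidden_original = hidden_fixed.clone()

summary_fixed = pooled_summary(hidden_fixed)
summary_original = pooled_summary(hidden_original)

print(summary_fixed["mean"].shape)  # torch.Size([768])
print(summaries_match(summary_fixed, summary_original))  # True
```

In practice, `hidden_fixed` and `hidden_original` would be the first output of each model for the same tokenized input.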

If you use this model, please give full attribution to the original authors below:

https://huggingface.co/zhihan1996/DNABERT-2-117M

```
@misc{zhou2023dnabert2,
      title={DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome},
      author={Zhihan Zhou and Yanrong Ji and Weijian Li and Pratik Dutta and Ramana Davuluri and Han Liu},
      year={2023},
      eprint={2306.15006},
      archivePrefix={arXiv},
      primaryClass={q-bio.GN}
}
```

### Original README:

"""
This is the official pre-trained model introduced in [DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome](https://arxiv.org/pdf/2306.15006.pdf).

We sincerely appreciate the MosaicML team for the [MosaicBERT](https://openreview.net/forum?id=5zipcfLC2Z) implementation, which serves as the base of DNABERT-2 development.

DNABERT-2 is a transformer-based genome foundation model trained on multi-species genome.

To load the model from huggingface:
```
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)
model = AutoModel.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)
```

To calculate the embedding of a DNA sequence:
```
dna = "ACGTAGCATCGGATCTATCTATCGACACTTGGTTATCGATCTACGAGCATCTCGTTAGC"
inputs = tokenizer(dna, return_tensors = 'pt')["input_ids"]
hidden_states = model(inputs)[0] # [1, sequence_length, 768]

# embedding with mean pooling
embedding_mean = torch.mean(hidden_states[0], dim=0)
print(embedding_mean.shape) # expect to be 768

# embedding with max pooling
embedding_max = torch.max(hidden_states[0], dim=0)[0]
print(embedding_max.shape) # expect to be 768
```
"""
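
The pooling in the README above embeds one sequence at a time. When several sequences of different lengths are padded into one batch, a plain mean over the sequence dimension would also average the padding positions. The following is a minimal sketch of mask-aware mean pooling and is not part of the original README: `masked_mean_pool` is a hypothetical helper, and it assumes an `attention_mask` tensor with 1 for real tokens and 0 for padding, as Hugging Face tokenizers typically return when called with `padding=True`. Random tensors stand in for the model output.

```python
import torch


def masked_mean_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average hidden states over real tokens only.

    hidden_states:  [batch, sequence_length, hidden_size]
    attention_mask: [batch, sequence_length], 1 for tokens, 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)  # [batch, seq, 1]
    summed = (hidden_states * mask).sum(dim=1)   # padding positions contribute zero
    counts = mask.sum(dim=1).clamp(min=1.0)      # avoid division by zero
    return summed / counts


# Stand-in for a padded batch of two sequences (assumption: not real model output)
torch.manual_seed(0)
hidden_states = torch.randn(2, 5, 768)
attention_mask = torch.tensor([[1, 1, 1, 1, 1],
                               [1, 1, 1, 0, 0]])

embeddings = masked_mean_pool(hidden_states, attention_mask)
print(embeddings.shape)  # torch.Size([2, 768])
```

For a single unpadded sequence this reduces to the `torch.mean(hidden_states[0], dim=0)` call used above.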