quietflamingo/dnabert2-no-flashattention

	---
	metrics:
	- matthews_correlation
	- f1
	tags:
	- biology
	- medical
	license: apache-2.0
	---
	### Note:
	This model is a copy of DNABERT-2 with the Triton-based FlashAttention integration removed. This allows the model to be loaded from Hugging Face without having to uninstall Triton. Running the example code below yields output identical to that of the original version.
	```
	import torch
	from transformers import AutoTokenizer, AutoModel
	from transformers.models.bert.configuration_bert import BertConfig

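	# Use the GPU when one is available, otherwise fall back to the CPU
	device = torch.device("cuda" if torch.cuda.is_available() else "cpu")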

	tokenizer = AutoTokenizer.from_pretrained(
	"quietflamingo/dnabert2-fixed",
	trust_remote_code=True,
	)

	config = BertConfig.from_pretrained(
	"quietflamingo/dnabert2-fixed",
	)

	model = AutoModel.from_pretrained(
	"quietflamingo/dnabert2-fixed",
	config=config
	)

	dna = "ACGTAGCATCGGATCTATCTATCGACACTTGGTTATCGATCTACGAGCATCTCGTTAGC"
	inputs = tokenizer(dna, return_tensors = 'pt')["input_ids"]

	inputs = inputs.to(device)
	model = model.to(device)

	hidden_states = model(inputs)[0] # [1, sequence_length, 768]

	embedding_mean = torch.mean(hidden_states[0], dim=0)
	print(torch.mean(embedding_mean)) # Outputs 0.0045, matches DNABERT2

	embedding_max = torch.max(hidden_states[0], dim=0)[0]
	print(torch.mean(embedding_max)) # Outputs 0.2840, matches DNABERT2
	```
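
The two printed values above are scalar summaries of the mean-pooled and max-pooled embeddings. As a minimal sketch of how such a comparison could be automated, the snippet below computes the same summaries on a toy tensor and compares two outputs within a tolerance. The helper names `pooled_summary` and `outputs_match` and the tolerance `atol=1e-5` are illustrative assumptions, not part of this repository or of DNABERT-2, and the toy tensor only stands in for `model(inputs)[0]`.

```python
from typing import Tuple

import torch


def pooled_summary(hidden_states: torch.Tensor) -> Tuple[float, float]:
    """Return the scalar means of the mean-pooled and max-pooled embeddings.

    hidden_states is expected to have shape [1, sequence_length, hidden_size],
    the same layout as model(inputs)[0] in the example above.
    """
    tokens = hidden_states[0]                    # [sequence_length, hidden_size]
    embedding_mean = torch.mean(tokens, dim=0)   # [hidden_size]
    embedding_max = torch.max(tokens, dim=0)[0]  # [hidden_size]
    return torch.mean(embedding_mean).item(), torch.mean(embedding_max).item()


def outputs_match(a: torch.Tensor, b: torch.Tensor, atol: float = 1e-5) -> bool:
    """Hypothetical helper: True if two output tensors agree within atol."""
    return a.shape == b.shape and torch.allclose(a, b, atol=atol)


# Toy stand-in for a model output: 1 sequence, 3 tokens, hidden size 2.
toy = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]]])
mean_summary, max_summary = pooled_summary(toy)
print(mean_summary, max_summary)  # 2.5 4.5
```

To compare real outputs, one would pass the `hidden_states` tensors produced by this model and by the original DNABERT-2 for the same input to `outputs_match`; an appropriate tolerance depends on hardware and dtype.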
	If you use this model, please give full attribution to the original authors:
	https://huggingface.co/zhihan1996/DNABERT-2-117M

	```
	@misc{zhou2023dnabert2,
	title={DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome},
	author={Zhihan Zhou and Yanrong Ji and Weijian Li and Pratik Dutta and Ramana Davuluri and Han Liu},
	year={2023},
	eprint={2306.15006},
	archivePrefix={arXiv},
	primaryClass={q-bio.GN}
	}
	```

	### Original README:

	"""
	This is the official pre-trained model introduced in [DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome](https://arxiv.org/pdf/2306.15006.pdf).

	We sincerely appreciate the MosaicML team for the [MosaicBERT](https://openreview.net/forum?id=5zipcfLC2Z) implementation, which serves as the base of DNABERT-2 development.

	DNABERT-2 is a transformer-based genome foundation model trained on multi-species genomes.

	To load the model from Hugging Face:
	```
	import torch
	from transformers import AutoTokenizer, AutoModel

	tokenizer = AutoTokenizer.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)
	model = AutoModel.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)
	```

	To calculate the embedding of a DNA sequence:
	```
	dna = "ACGTAGCATCGGATCTATCTATCGACACTTGGTTATCGATCTACGAGCATCTCGTTAGC"
	inputs = tokenizer(dna, return_tensors = 'pt')["input_ids"]
	hidden_states = model(inputs)[0] # [1, sequence_length, 768]

	# embedding with mean pooling
	embedding_mean = torch.mean(hidden_states[0], dim=0)
	print(embedding_mean.shape) # expect to be 768

	# embedding with max pooling
	embedding_max = torch.max(hidden_states[0], dim=0)[0]
	print(embedding_max.shape) # expect to be 768
	```
	"""