	---
	language:
	- dna
	library_name: transformers
	tags:
	- DNA
	- BERT
	- language-model
	- genomics
	license: mit
	---

	# DNABERT-2

	Weights and tokenizer for [DNABERT-2](https://arxiv.org/abs/2306.15006)
	(Zhou et al., arXiv 2023), loaded with the shared MosaicBERT implementation
	from [Taykhoom/MosaicBERT-updated](https://huggingface.co/Taykhoom/MosaicBERT-updated).

	DNABERT-2 is a foundation model trained on large-scale multi-species genome data.
	It replaces k-mer tokenization with Byte Pair Encoding (BPE), uses ALiBi attention
	biases instead of learned positional embeddings, and uses a GLU-based feed-forward
	network (FFN) for improved efficiency.

	This repo contains only weights and tokenizer files. The model code is loaded
	automatically from `Taykhoom/MosaicBERT-updated` via `trust_remote_code=True`.

	## Architecture

	| Parameter | Value |
	|---|---|
	| Layers | 12 |
	| Attention heads | 12 |
	| Embedding dimension | 768 |
	| Intermediate size | 3072 |
	| Vocabulary size | 4096 (BPE) |
	| Positional encoding | ALiBi (no hard length limit) |
	| Max sequence length | ~10000 nt (practical; ALiBi resizes dynamically) |
	| Parameters | ~117M |
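
To make the ALiBi row concrete: rather than adding a position embedding to the input, ALiBi adds a per-head linear penalty to the attention scores that grows with the distance between query and key. The sketch below follows the slope recipe from the ALiBi paper and uses the symmetric distance `|i - j|`, which suits a bidirectional encoder. It is an illustration only; the function names are hypothetical, and whether the loaded MosaicBERT code builds its bias exactly this way is an assumption.

```python
import math


def alibi_slopes(n_heads: int) -> list[float]:
    """Per-head slopes following the geometric recipe from the ALiBi paper."""

    def power_of_two_slopes(n: int) -> list[float]:
        # For n = 8 this yields 1/2, 1/4, ..., 1/256.
        start = 2.0 ** (-8.0 / n)
        return [start ** (i + 1) for i in range(n)]

    if (n_heads & (n_heads - 1)) == 0:  # already a power of two
        return power_of_two_slopes(n_heads)
    # Otherwise: take the slopes for the closest smaller power of two, then
    # fill the remainder with every other slope from the next power of two.
    base = 2 ** math.floor(math.log2(n_heads))
    extra = power_of_two_slopes(2 * base)[0::2][: n_heads - base]
    return power_of_two_slopes(base) + extra


def alibi_bias(n_heads: int, seq_len: int) -> list[list[list[float]]]:
    """Bias of shape (n_heads, seq_len, seq_len), added to attention scores."""
    return [
        [[-slope * abs(i - j) for j in range(seq_len)] for i in range(seq_len)]
        for slope in alibi_slopes(n_heads)
    ]


bias = alibi_bias(n_heads=12, seq_len=4)
print(bias[0][0])  # penalties for head 0, query position 0
```

Because the bias depends only on the distance between positions, it can be rebuilt for any sequence length at inference time, which is why the table lists no hard length limit.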

	### Tokenization

	Sequences are tokenized with Byte Pair Encoding (BPE) via `PreTrainedTokenizerFast`.
	Raw nucleotide strings can be passed directly; no k-mer pre-processing is required.
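
For contrast, the k-mer DNABERT models listed under Related Models expect each sequence to be split into overlapping k-mers before tokenization. A minimal sketch of that pre-processing step, which DNABERT-2 does not need, is shown here. The helper name `seq_to_kmers` is hypothetical, and the space-separated output format is an assumption based on the original DNABERT code.

```python
def seq_to_kmers(sequence: str, k: int) -> str:
    """Split a DNA string into overlapping k-mers (stride 1), space-separated."""
    if k < 1 or k > len(sequence):
        raise ValueError("k must be between 1 and len(sequence)")
    return " ".join(sequence[i : i + k] for i in range(len(sequence) - k + 1))


seq = "ACGTAGCA"
print(seq_to_kmers(seq, k=3))  # ACG CGT GTA TAG AGC GCA
```

A sequence of length L yields L - k + 1 overlapping tokens under this scheme, whereas BPE merges frequent nucleotide runs into single variable-length tokens, so the tokenized sequence is typically shorter.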

	## Pretraining

	- Objective: Masked Language Modeling
	- Data: Large-scale multi-species genome data (GRCh38 and others)
	- Source checkpoint: `pytorch_model.bin` from [zhihan1996/DNABERT-2-117M](https://huggingface.co/zhihan1996/DNABERT-2-117M)

	## Parity Verification

	Hidden-state representations were verified to be identical (max abs diff = 0.00) to the
	original implementation at all 13 representation levels (embedding output + 12 transformer
	layers). The SDPA attention path was verified to match within max abs diff < 1e-4.
	Verification was run on GPU with PyTorch 2.7 / CUDA 12.9.
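
A check of this kind can be sketched as follows. The helper names are hypothetical and this is not the script used for the verification above; it assumes each model's per-layer hidden states have been converted to nested Python lists (for example with `tensor.tolist()`) so that the comparison needs no extra dependencies.

```python
def max_abs_diff(a, b) -> float:
    """Largest absolute element-wise difference between two nested lists."""
    if isinstance(a, (int, float)):
        return abs(a - b)
    if len(a) != len(b):
        raise ValueError("shape mismatch")
    return max((max_abs_diff(x, y) for x, y in zip(a, b)), default=0.0)


def check_parity(ref_layers, new_layers, tol: float = 0.0) -> list[float]:
    """Return the max abs diff per layer; raise if any layer exceeds tol."""
    if len(ref_layers) != len(new_layers):
        raise ValueError("layer count mismatch")
    diffs = [max_abs_diff(r, n) for r, n in zip(ref_layers, new_layers)]
    for idx, diff in enumerate(diffs):
        if diff > tol:
            raise AssertionError(f"layer {idx}: max abs diff {diff} > {tol}")
    return diffs


# Toy stand-ins for two layers of hidden states, shape (batch=1, seq=2, dim=2).
ref = [[[[0.1, 0.2], [0.3, 0.4]]], [[[1.0, 2.0], [3.0, 4.0]]]]
new = [[[[0.1, 0.2], [0.3, 0.4]]], [[[1.0, 2.0], [3.0, 4.0]]]]
print(check_parity(ref, new))  # [0.0, 0.0]
```

With real models, `ref_layers` and `new_layers` would be the 13 entries of `hidden_states` returned with `output_hidden_states=True`, and a tolerance such as `1e-4` would be used when comparing attention backends.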

	## Related Models

	See the full [DNABERT collection](https://huggingface.co/collections/Taykhoom/dnabert-6a20958f8ce004ea4e985e7b).

	| Model | Architecture | Notes |
	|---|---|---|
	| [DNABERT-3mer](https://huggingface.co/Taykhoom/DNABERT-3mer) | BERT + k-mer | k=3 |
	| [DNABERT-4mer](https://huggingface.co/Taykhoom/DNABERT-4mer) | BERT + k-mer | k=4 |
	| [DNABERT-5mer](https://huggingface.co/Taykhoom/DNABERT-5mer) | BERT + k-mer | k=5 |
	| [DNABERT-6mer](https://huggingface.co/Taykhoom/DNABERT-6mer) | BERT + k-mer | k=6 |
	| [DNABERT-2](https://huggingface.co/Taykhoom/DNABERT2) | MosaicBERT + BPE + ALiBi | This model |
	| [DNABERT-S](https://huggingface.co/Taykhoom/DNABERT-S) | MosaicBERT + BPE + ALiBi | Species-aware contrastive fine-tune |

	## Usage

	### Embedding generation

	```python
	import torch
	from transformers import AutoTokenizer, AutoModel

	tokenizer = AutoTokenizer.from_pretrained("Taykhoom/DNABERT2", trust_remote_code=True)
	model = AutoModel.from_pretrained("Taykhoom/DNABERT2", trust_remote_code=True)
	model.eval()

	sequences = ["ACGTAGCATCGGATCTATCTATCGACACTTGG", "ATCGATCGATCGATCG"]
	enc = tokenizer(sequences, return_tensors="pt", padding=True)

	with torch.no_grad():
	    out = model(**enc)
	    # Intermediate layers
	    out_all = model(**enc, output_hidden_states=True)

	cls_emb = out.last_hidden_state[:, 0, :]  # (batch, 768)

	# Mean pooling over real tokens only; padded positions are masked out
	mask = enc["attention_mask"].unsqueeze(-1).to(out.last_hidden_state.dtype)
	mean_emb = (out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)  # (batch, 768)

	layer6_emb = out_all.hidden_states[6]  # index 0 is the embedding output
	```

	### MLM logits

	```python
	import torch
	from transformers import AutoTokenizer, AutoModelForMaskedLM

	tokenizer = AutoTokenizer.from_pretrained("Taykhoom/DNABERT2", trust_remote_code=True)
	model = AutoModelForMaskedLM.from_pretrained("Taykhoom/DNABERT2", trust_remote_code=True)
	model.eval()

	enc = tokenizer(["ACGTAGCAT[MASK]GGATCTATC"], return_tensors="pt")
	with torch.no_grad():
	    logits = model(**enc).logits  # (1, seq_len, 4096)
	```
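
To turn the logits into predictions, a common approach is to take a softmax over the vocabulary at each `[MASK]` position and keep the most likely tokens. The sketch below works on plain Python lists so it can be read without loading the model; `top_k_at_mask` is a hypothetical helper, and the toy ids and logits are made up for illustration.

```python
import math


def top_k_at_mask(input_ids, logits, mask_token_id, k=3):
    """For each masked position, return the k most likely (token_id, prob) pairs.

    input_ids: list of token ids for one sequence, length seq_len
    logits:    list of seq_len rows, each with vocab_size scores
    """
    results = {}
    for pos, token_id in enumerate(input_ids):
        if token_id != mask_token_id:
            continue
        row = logits[pos]
        peak = max(row)  # subtract the max for a numerically stable softmax
        exps = [math.exp(score - peak) for score in row]
        total = sum(exps)
        ranked = sorted(range(len(row)), key=lambda i: exps[i], reverse=True)
        results[pos] = [(i, exps[i] / total) for i in ranked[:k]]
    return results


# Toy example: 3 tokens, vocabulary of 4, mask id 9 at position 1.
toy_ids = [5, 9, 7]
toy_logits = [[0.0] * 4, [0.1, 2.0, 0.3, 1.0], [0.0] * 4]
predictions = top_k_at_mask(toy_ids, toy_logits, mask_token_id=9, k=2)
print([token_id for token_id, _ in predictions[1]])  # [1, 3]
```

With a real model, the inputs would be `enc["input_ids"][0].tolist()`, `logits[0].tolist()` and `tokenizer.mask_token_id`, and the returned ids can be mapped back to BPE tokens with `tokenizer.convert_ids_to_tokens`.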

	### Attention implementation

	```python
	import torch
	from transformers import AutoModel

	# SDPA (default on PyTorch >= 2.0)
	model = AutoModel.from_pretrained("Taykhoom/DNABERT2", trust_remote_code=True,
	                                  attn_implementation="sdpa")

	# Flash Attention 2 (requires the flash-attn package and a CUDA GPU)
	model = AutoModel.from_pretrained("Taykhoom/DNABERT2", trust_remote_code=True,
	                                  attn_implementation="flash_attention_2",
	                                  torch_dtype=torch.bfloat16)
	```
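
Flash Attention 2 needs a CUDA GPU and the `flash-attn` package, so it can be convenient to fall back to SDPA when either is missing. A minimal sketch of that choice follows; `pick_attn_implementation` is a hypothetical helper, not part of `transformers` or of this repo.

```python
import importlib.util


def pick_attn_implementation(cuda_available: bool) -> str:
    """Return "flash_attention_2" only when it is likely to load, else "sdpa"."""
    # The flash-attn package is imported as `flash_attn`.
    has_flash_attn = importlib.util.find_spec("flash_attn") is not None
    if cuda_available and has_flash_attn:
        return "flash_attention_2"
    return "sdpa"


print(pick_attn_implementation(cuda_available=False))  # sdpa
```

In practice the flag would come from `torch.cuda.is_available()`, and the result would be passed as `attn_implementation=` to `from_pretrained`, together with `torch_dtype=torch.bfloat16` when Flash Attention 2 is selected.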

	## Implementation Notes

	The original DNABERT-2 codebase uses a Triton-based flash attention implementation
	(`flash_attn_triton.py`). This HF port uses
	[Taykhoom/MosaicBERT-updated](https://huggingface.co/Taykhoom/MosaicBERT-updated),
	which replaces it with the standard `flash-attn` package and adds
	`attn_implementation="sdpa"` support. Neither change is part of the original codebase.

	## Citation

	```bibtex
	@misc{zhou2023_dnabert2,
	  title = {{DNABERT}-2: Efficient Foundation Model and Benchmark For Multi-Species Genome},
	  author = {Zhou, Zhihan and Ji, Yanrong and Li, Weijian and Dutta, Pratik and
	            Davuluri, Ramana and Liu, Han},
	  year = {2023},
	  eprint = {2306.15006},
	  archivePrefix = {arXiv},
	  primaryClass = {q-bio.GN}
	}
	```

	## Credits

	Original DNABERT-2 model and code by Zhou et al.
	Source: [GitHub](https://github.com/MAGICS-LAB/DNABERT_2).
	The HF conversion code was authored primarily by [Claude Code](https://claude.ai/code)
	and reviewed manually by Taykhoom Dalal.

	## License

	MIT, following the original repository.