
	---
	language:
	- dna
	library_name: transformers
	tags:
	- DNA
	- BERT
	- language-model
	- genomics
	license: apache-2.0
	---

	# DNABERT-S

	Weights and tokenizer for [DNABERT-S](https://arxiv.org/abs/2402.08777)
	(Zhou et al., arXiv 2024), loaded with the shared MosaicBERT implementation
	from [Taykhoom/MosaicBERT-updated](https://huggingface.co/Taykhoom/MosaicBERT-updated).

	DNABERT-S is a species-aware DNA embedding model fine-tuned from DNABERT-2 using
	curriculum contrastive learning. It generates embeddings that naturally cluster and
	segregate genomes from different species, enabling species identification,
	metagenomics binning, and evolutionary analysis.

	This repo contains only weights and tokenizer files. The model code is loaded
	automatically from `Taykhoom/MosaicBERT-updated` via `trust_remote_code=True`.

	## Architecture

	| Parameter | Value |
	|---|---|
	| Layers | 12 |
	| Attention heads | 12 |
	| Embedding dimension | 768 |
	| Intermediate size | 3072 |
	| Vocabulary size | 4096 (BPE, identical to DNABERT-2) |
	| Positional encoding | ALiBi (no hard length limit) |
	| Max sequence length | ~10,000 nt (practical; ALiBi resizes dynamically) |
	| Parameters | ~110M (backbone only, no MLM head) |
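The ALiBi entry in the table above means that the model adds a distance-dependent penalty to the attention scores instead of learning position embeddings, which is why there is no fixed length cap. The sketch below follows the standard ALiBi slope recipe in its symmetric (bidirectional encoder) form. The function names are illustrative, and the exact code in `Taykhoom/MosaicBERT-updated` may differ.

```python
import math


def alibi_slopes(n_heads: int) -> list[float]:
    """Per-head slopes from the standard ALiBi recipe (a geometric sequence)."""

    def power_of_two_slopes(n: int) -> list[float]:
        start = 2.0 ** (-8.0 / n)
        return [start ** (i + 1) for i in range(n)]

    if math.log2(n_heads).is_integer():
        return power_of_two_slopes(n_heads)
    # Non-power-of-two head counts (e.g. 12): use the slopes of the closest
    # smaller power of two, then add every other slope of the next power of two.
    closest = 2 ** math.floor(math.log2(n_heads))
    extra = power_of_two_slopes(2 * closest)[0::2][: n_heads - closest]
    return power_of_two_slopes(closest) + extra


def alibi_bias(seq_len: int, slope: float) -> list[list[float]]:
    """Bias added to one head's attention logits: -slope * |i - j|."""
    return [[-slope * abs(i - j) for j in range(seq_len)] for i in range(seq_len)]


slopes = alibi_slopes(12)        # one slope per attention head
bias = alibi_bias(4, slopes[0])  # 4 x 4 bias matrix for the first head
```

Because the bias depends only on the distance between positions, it can be rebuilt for any sequence length at inference time.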

	### Tokenization

	The model uses Byte Pair Encoding (BPE) tokenization via `PreTrainedTokenizerFast`,
	with a vocabulary identical to that of DNABERT-2. No k-mer preprocessing is required.
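To illustrate how BPE differs from the k-mer splitting used by the k-mer DNABERT models, here is a toy comparison. The merge table is hypothetical and far smaller than the real 4096-entry vocabulary; in practice the tokenizer loaded with `AutoTokenizer` does this work.

```python
def kmer_tokens(seq: str, k: int) -> list[str]:
    """Overlapping k-mers, as used by the k-mer DNABERT models."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def bpe_tokens(seq: str, merges: list[tuple[str, str]]) -> list[str]:
    """Greedy BPE: repeatedly merge the adjacent pair with the lowest merge rank."""
    rank = {pair: r for r, pair in enumerate(merges)}
    tokens = list(seq)
    while True:
        pairs = [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]
        candidates = [p for p in pairs if p in rank]
        if not candidates:
            return tokens
        best = min(candidates, key=rank.get)
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged


toy_merges = [("A", "T"), ("C", "G"), ("AT", "CG")]  # hypothetical merge rules
kmers = kmer_tokens("ATCGATCG", 3)
bpe = bpe_tokens("ATCGATCG", toy_merges)
print(kmers)  # ['ATC', 'TCG', 'CGA', 'GAT', 'ATC', 'TCG']
print(bpe)    # ['ATCG', 'ATCG']
```

BPE produces non-overlapping, variable-length tokens, so the same sequence typically needs fewer tokens than overlapping k-mers.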

	## Pretraining

	- Objective: Curriculum contrastive learning on same-species sequence pairs, with Manifold Instance Mixup (MI-Mix)
	- Initialization: Fine-tuned from [zhihan1996/DNABERT-2-117M](https://huggingface.co/zhihan1996/DNABERT-2-117M)
	- Source checkpoint: `pytorch_model.bin` from [zhihan1996/DNABERT-S](https://huggingface.co/zhihan1996/DNABERT-S)
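As a rough illustration of what a contrastive objective rewards, the sketch below computes a generic SimCLR-style InfoNCE loss for one anchor from similarity scores: the same-species partner should score higher than sequences from other species. This is not the DNABERT-S training code. The curriculum and mixup components are omitted, and the temperature and similarity values are arbitrary assumptions.

```python
import math


def info_nce(pos_sim: float, neg_sims: list[float], temperature: float = 0.1) -> float:
    """InfoNCE loss for one anchor, given similarity scores.

    loss = -log( exp(pos/t) / (exp(pos/t) + sum_k exp(neg_k/t)) )
    """
    logits = [pos_sim / temperature] + [s / temperature for s in neg_sims]
    peak = max(logits)  # subtract the max before exponentiating, for stability
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denominator - logits[0]


# Made-up cosine similarities: a well-separated case and a confusable case
well_separated = info_nce(pos_sim=0.9, neg_sims=[0.1, -0.2])
confusable = info_nce(pos_sim=0.3, neg_sims=[0.4, 0.35])
print(well_separated < confusable)  # True
```

The loss is close to zero when the positive pair is much more similar than every negative, and it grows when negatives score as high as the positive.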

	## Parity Verification

	Hidden-state representations were verified to be identical (max abs diff = 0.00) to those
	of the original implementation at all 13 representation levels (embedding output plus 12
	transformer layers). The SDPA attention path was also verified (max abs diff < 1e-4).
	Verification was run on GPU with PyTorch 2.7 / CUDA 12.9.
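A check of this kind can be sketched as follows: collect the hidden states of both implementations for the same input and report the largest absolute difference at each level. The helper works on nested Python lists (for example tensors converted with `.tolist()`). Its names are hypothetical, and it is not the script used for the verification described above.

```python
def max_abs_diff(a, b) -> float:
    """Largest absolute element-wise difference between two equally shaped nested lists."""
    if isinstance(a, (list, tuple)):
        if len(a) != len(b):
            raise ValueError("shape mismatch")
        return max((max_abs_diff(x, y) for x, y in zip(a, b)), default=0.0)
    return abs(a - b)


def parity_report(reference_levels, port_levels, tol=0.0):
    """Per-level (max_diff, passed) pairs, e.g. embedding output plus each layer."""
    if len(reference_levels) != len(port_levels):
        raise ValueError("different number of representation levels")
    diffs = [max_abs_diff(r, p) for r, p in zip(reference_levels, port_levels)]
    return [(d, d <= tol) for d in diffs]


# Toy values standing in for two levels of hidden states
reference = [[[0.0, 1.0], [2.0, 3.0]], [[4.0, 5.0], [6.0, 7.0]]]
port = [[[0.0, 1.0], [2.0, 3.0]], [[4.0, 5.0], [6.0, 7.5]]]
report = parity_report(reference, port, tol=1e-4)
print(report)  # [(0.0, True), (0.5, False)]
```

A tolerance of `0.0` corresponds to an exact-match check, while a small positive tolerance such as `1e-4` suits attention kernels that reorder floating-point operations.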

	## Related Models

	See the full [DNABERT collection](https://huggingface.co/collections/Taykhoom/dnabert-6a20958f8ce004ea4e985e7b).

	| Model | Architecture | Notes |
	|---|---|---|
	| [DNABERT-3mer](https://huggingface.co/Taykhoom/DNABERT-3mer) | BERT + k-mer | k=3 |
	| [DNABERT-4mer](https://huggingface.co/Taykhoom/DNABERT-4mer) | BERT + k-mer | k=4 |
	| [DNABERT-5mer](https://huggingface.co/Taykhoom/DNABERT-5mer) | BERT + k-mer | k=5 |
	| [DNABERT-6mer](https://huggingface.co/Taykhoom/DNABERT-6mer) | BERT + k-mer | k=6 |
	| [DNABERT-2](https://huggingface.co/Taykhoom/DNABERT2) | MosaicBERT + BPE + ALiBi | Pre-trained |
	| [DNABERT-S](https://huggingface.co/Taykhoom/DNABERT-S) | MosaicBERT + BPE + ALiBi | This model |

	## Usage

	### Embedding generation

	```python
	import torch
	from transformers import AutoTokenizer, AutoModel

	tokenizer = AutoTokenizer.from_pretrained("Taykhoom/DNABERT-S", trust_remote_code=True)
	model = AutoModel.from_pretrained("Taykhoom/DNABERT-S", trust_remote_code=True)
	model.eval()

	sequences = ["ACGTAGCATCGGATCTATCTATCGACACTTGG", "ATCGATCGATCGATCG"]
	enc = tokenizer(sequences, return_tensors="pt", padding=True)

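	with torch.no_grad():
	    out = model(**enc)

	hidden = out.last_hidden_state  # (batch, seq_len, 768)
	cls_emb = hidden[:, 0, :]  # (batch, 768)

	# Mean pooling that ignores padding positions
	mask = enc["attention_mask"].unsqueeze(-1).to(hidden.dtype)  # (batch, seq_len, 1)
	mean_emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (batch, 768)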
	```
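One way to use such embeddings for species-level comparison is cosine similarity followed by nearest-reference assignment. The sketch below assumes the embeddings have been converted to plain lists (for example with `mean_emb.tolist()`). The species labels and vectors are made-up placeholders, not model outputs.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_reference(query: list[float], references: dict[str, list[float]]) -> str:
    """Label of the reference embedding most similar to the query."""
    return max(references, key=lambda name: cosine_similarity(query, references[name]))


# Placeholder 3-dimensional vectors; real embeddings have 768 dimensions
references = {"species_a": [1.0, 0.0, 0.0], "species_b": [0.0, 1.0, 0.0]}
query = [0.9, 0.1, 0.0]
label = nearest_reference(query, references)
print(label)  # species_a
```

For larger collections, the same similarity scores can feed a clustering or binning step instead of a single nearest-reference lookup.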

	### Attention implementation

	```python
	import torch
	from transformers import AutoModel

	# SDPA (default on PyTorch >= 2.0)
	model = AutoModel.from_pretrained("Taykhoom/DNABERT-S", trust_remote_code=True,
	                                  attn_implementation="sdpa")

	# Flash Attention 2 (needs the flash-attn package and a CUDA GPU)
	model = AutoModel.from_pretrained("Taykhoom/DNABERT-S", trust_remote_code=True,
	                                  attn_implementation="flash_attention_2",
	                                  torch_dtype=torch.bfloat16)
	```
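If the same script has to run on machines with and without Flash Attention, the choice between the two values shown above can be made at runtime. The helper names below are hypothetical; the sketch only decides which string to pass as `attn_implementation` and does not load the model.

```python
import importlib.util


def flash_attn_is_installed() -> bool:
    """True if the flash-attn package can be imported in this environment."""
    # find_spec returns None when the package is not installed.
    return importlib.util.find_spec("flash_attn") is not None


def pick_attn_implementation(cuda_available: bool, flash_attn_installed: bool) -> str:
    """Prefer Flash Attention 2 when it is usable, otherwise fall back to SDPA."""
    if cuda_available and flash_attn_installed:
        return "flash_attention_2"
    return "sdpa"


# In a real script, torch.cuda.is_available() would supply the first argument.
choice = pick_attn_implementation(
    cuda_available=False,
    flash_attn_installed=flash_attn_is_installed(),
)
print(choice)  # sdpa
```

When the result is `"flash_attention_2"`, the model should also be loaded in half precision, as in the `torch_dtype=torch.bfloat16` example.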

	## Implementation Notes

	The original DNABERT-S codebase uses a Triton-based flash attention implementation
	(`flash_attn_triton.py`). This HF port uses
	[Taykhoom/MosaicBERT-updated](https://huggingface.co/Taykhoom/MosaicBERT-updated),
	which replaces it with the standard `flash-attn` package and adds
	`attn_implementation="sdpa"` support. Neither change is part of the original codebase.

	## Citation

	```bibtex
	@misc{zhou2024_dnaberts,
	  title = {{DNABERT}-S: Learning Species-Aware {DNA} Embedding with Genome Foundation Models},
	  author = {Zhou, Zhihan and Wu, Weimin and Ho, Harrison and Wang, Jiayi and
	            Shi, Lizhen and Davuluri, Ramana V and Wang, Zhong and Liu, Han},
	  year = {2024},
	  eprint = {2402.08777},
	  archivePrefix = {arXiv},
	  primaryClass = {q-bio.GN}
	}
	```

	## Credits

	Original DNABERT-S model and code by Zhou et al.
	Source: [GitHub](https://github.com/MAGICS-LAB/DNABERT_S).
	The HF conversion code was authored primarily by [Claude Code](https://claude.ai/code)
	and reviewed manually by Taykhoom Dalal.

	## License

	Apache 2.0, following the original repository.