	---
	language:
	- dna
	library_name: transformers
	tags:
	- DNA
	- BERT
	- language-model
	- genomics
	license: mit
	---

	# DNABERT-5mer

	Weights and tokenizer for [DNABERT](https://github.com/jerryji1993/DNABERT)
	(Ji et al., Bioinformatics 2021), 5-mer variant, loaded with the shared
	BERT implementation from [Taykhoom/BERT-updated](https://huggingface.co/Taykhoom/BERT-updated).

	DNABERT is a BERT model pre-trained on the human reference genome using
	overlapping 5-mer tokenization.

	This repo contains only weights and tokenizer files. The model code is loaded
	automatically from `Taykhoom/BERT-updated` via `trust_remote_code=True`.

	## Architecture

	Standard BERT-base with a 5-mer DNA vocabulary.

	| Parameter | Value |
	|---|---|
	| Layers | 12 |
	| Attention heads | 12 |
	| Embedding dimension | 768 |
	| Vocabulary size | 1029 (5 special + 1024 DNA 5-mers) |
	| Positional encoding | Learned absolute |
	| Max sequence length | 512 tokens |
	| Parameters | ~92M |
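The vocabulary size follows from simple counting: four nucleotides give 4^5 = 1024 possible 5-mers, plus five special tokens. The sketch below reproduces that count. It assumes the special tokens are the standard BERT ones (`[PAD]`, `[UNK]`, `[CLS]`, `[SEP]`, `[MASK]`), and the helper `build_kmer_vocab` is hypothetical; the ordering it produces may not match the `vocab.txt` shipped with this repo.

```python
from itertools import product

# Assumption: the five special tokens are the standard BERT ones; the real
# vocab.txt may list them (and the k-mers) in a different order.
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]

def build_kmer_vocab(k=5, alphabet="ACGT"):
    """Return the special tokens followed by every k-mer over the alphabet."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    return SPECIAL_TOKENS + kmers

vocab = build_kmer_vocab()
print(len(vocab))  # 1029
```

Any 5-mer containing a character outside `ACGT` (for example `N`) is absent from such a vocabulary and would presumably be mapped to the unknown token.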

	### Tokenization

	Input sequences must be pre-split into overlapping 5-mers (stride 1) with spaces
	between tokens before calling the tokenizer. For example:

	```
	ATCGATG -> ATCGA TCGAT CGATG
	```

	```python
	def seq_to_kmers(seq, k=5):
	    # Slide a window of width k across the sequence with stride 1.
	    return " ".join(seq[i:i+k] for i in range(len(seq) - k + 1))
	```
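Because the model accepts at most 512 tokens, long sequences need to be split before tokenization. A sequence of L bases yields L - 4 overlapping 5-mers, so assuming the tokenizer adds one `[CLS]` and one `[SEP]` token (standard BERT behaviour, not confirmed here), each window can hold 510 k-mers, i.e. 514 bases. The helper `split_for_model` below is a hypothetical sketch of non-overlapping windowing, not part of this repo.

```python
def seq_to_kmers(seq, k=5):
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))

def split_for_model(seq, k=5, max_tokens=512, n_special=2):
    """Split seq into consecutive windows whose k-mer count fits the model.

    Assumption: the tokenizer adds n_special tokens ([CLS] and [SEP]).
    Windows do not overlap; a trailing window shorter than k is dropped
    because it cannot form a single k-mer.
    """
    max_kmers = max_tokens - n_special   # 510 k-mers per window
    window = max_kmers + k - 1           # 514 bases per window
    chunks = [seq[i:i + window] for i in range(0, len(seq), window)]
    return [seq_to_kmers(c, k) for c in chunks if len(c) >= k]

windows = split_for_model("ACGT" * 300)      # 1200 bases
print([len(w.split()) for w in windows])     # [510, 510, 168]
```

Non-overlapping windows lose the k-mers that span a window boundary; if that context matters, step the window start by less than the window length instead.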

	## Pretraining

	- Objective: Masked Language Modeling
	- Data: Human reference genome (GRCh38)
	- Source checkpoint: `pytorch_model.bin` from [zhihan1996/DNA_bert_5](https://huggingface.co/zhihan1996/DNA_bert_5)

	## Parity Verification

	Hidden-state representations match the source implementation to within a
	maximum absolute difference of 1.5e-4 at all 13 representation levels
	(embedding output + 12 transformer layers). The small differences come from
	float32 accumulation order in two independent implementations of identical
	mathematics; the source `dnabert_layer.BertModel` is a direct subclass of
	`transformers.BertModel` with no modifications.
	Verified on GPU with PyTorch 2.7 / CUDA 12.9.
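A check of this kind can be sketched as follows. This is not the script used for the verification above, only a minimal illustration of a per-level max-abs-diff comparison. It assumes the hidden states from both implementations have already been converted to NumPy arrays (for example via `.detach().cpu().numpy()`); the helper name `max_abs_diff_per_layer` and the toy data are hypothetical.

```python
import numpy as np

def max_abs_diff_per_layer(ref_states, new_states):
    """Return the max absolute difference for each pair of hidden states."""
    if len(ref_states) != len(new_states):
        raise ValueError("implementations returned different numbers of levels")
    return [float(np.max(np.abs(r - n))) for r, n in zip(ref_states, new_states)]

# Toy stand-in data: 13 levels (embedding + 12 layers), shape (batch, seq_len, hidden).
rng = np.random.default_rng(0)
ref = [rng.standard_normal((2, 10, 768)).astype(np.float32) for _ in range(13)]
new = [h + np.float32(1e-5) for h in ref]   # simulate small numerical drift

diffs = max_abs_diff_per_layer(ref, new)
assert len(diffs) == 13
assert max(diffs) < 1.5e-4
```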

	## Related Models

	See the full [DNABERT collection](https://huggingface.co/collections/Taykhoom/dnabert-6a20958f8ce004ea4e985e7b).

	\| Model \| Architecture \| Notes \|
	\|---\|---\|---\|
	\| [DNABERT-3mer](https://huggingface.co/Taykhoom/DNABERT-3mer) \| BERT + k-mer \| k=3 \|
	\| [DNABERT-4mer](https://huggingface.co/Taykhoom/DNABERT-4mer) \| BERT + k-mer \| k=4 \|
	\| [DNABERT-5mer](https://huggingface.co/Taykhoom/DNABERT-5mer) \| BERT + k-mer \| k=5 \|
	\| [DNABERT-6mer](https://huggingface.co/Taykhoom/DNABERT-6mer) \| BERT + k-mer \| k=6 \|
	\| [DNABERT-2](https://huggingface.co/Taykhoom/DNABERT2) \| MosaicBERT + BPE + ALiBi \| Multi-species pre-trained \|
	\| [DNABERT-S](https://huggingface.co/Taykhoom/DNABERT-S) \| MosaicBERT + BPE + ALiBi \| Species-aware \|


	## Usage

	### Embedding generation

	```python
	import torch
	from transformers import AutoTokenizer, AutoModel

	def seq_to_kmers(seq, k=5):
	    # Slide a window of width k across the sequence with stride 1.
	    return " ".join(seq[i:i+k] for i in range(len(seq) - k + 1))

	tokenizer = AutoTokenizer.from_pretrained("Taykhoom/DNABERT-5mer", trust_remote_code=True)
	model = AutoModel.from_pretrained("Taykhoom/DNABERT-5mer", trust_remote_code=True)
	model.eval()

	sequences = ["ATCGATCGATCG", "GCTAGCTAGCTA"]
	kmer_seqs = [seq_to_kmers(s) for s in sequences]
	enc = tokenizer(kmer_seqs, return_tensors="pt", padding=True)

	with torch.no_grad():
	    out = model(**enc)

	cls_emb = out.last_hidden_state[:, 0, :]  # (batch, 768)
	token_emb = out.last_hidden_state         # (batch, seq_len, 768)

	# Intermediate layers: hidden_states[0] is the embedding output,
	# hidden_states[i] is the output of transformer layer i.
	with torch.no_grad():
	    out_all = model(**enc, output_hidden_states=True)
	layer6_emb = out_all.hidden_states[6]
	```
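When a batch contains padded sequences, a common alternative to the `[CLS]` vector is a mean over the non-padding token embeddings. The sketch below shows one way to do that with the attention mask; `mean_pool` is a hypothetical helper, not part of this repo, and the toy tensors stand in for `out.last_hidden_state` and `enc["attention_mask"]`.

```python
import torch

def mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings over real (non-padding) positions."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)                          # avoid division by zero
    return summed / counts

# Toy tensors standing in for out.last_hidden_state and enc["attention_mask"].
hidden = torch.ones(2, 4, 768)
hidden[1, 2:] = 100.0                        # values at padded positions
mask = torch.tensor([[1, 1, 1, 1], [1, 1, 0, 0]])
pooled = mean_pool(hidden, mask)
print(pooled.shape)  # torch.Size([2, 768])
```

This average still includes the special-token positions marked as real by the attention mask; exclude them explicitly if only k-mer tokens should contribute.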

	### Attention implementation

	```python
	import torch
	from transformers import AutoModel

	# SDPA (default on PyTorch >= 2.0)
	model = AutoModel.from_pretrained(
	    "Taykhoom/DNABERT-5mer",
	    trust_remote_code=True,
	    attn_implementation="sdpa",
	)

	# Flash Attention 2 (needs the flash-attn package and a half-precision dtype)
	model = AutoModel.from_pretrained(
	    "Taykhoom/DNABERT-5mer",
	    trust_remote_code=True,
	    attn_implementation="flash_attention_2",
	    torch_dtype=torch.bfloat16,
	)
	```

	## Implementation Notes

	In the original DNABERT codebase, `BertModel` is a thin subclass of
	`transformers.BertModel` with no modifications. This HF port uses
	[Taykhoom/BERT-updated](https://huggingface.co/Taykhoom/BERT-updated), which adds
	`attn_implementation="sdpa"` and `attn_implementation="flash_attention_2"`
	support. Neither option was part of the original codebase.

	## Citation

	```bibtex
	@article{ji2021_dnabert,
	  title = {{DNABERT}: pre-trained Bidirectional Encoder Representations from Transformers model for {DNA}-language in genome},
	  author = {Ji, Yanrong and Zhou, Zhihan and Liu, Han and Davuluri, Ramana V},
	  journal = {Bioinformatics},
	  volume = {37},
	  number = {15},
	  pages = {2112--2120},
	  year = {2021},
	  doi = {10.1093/bioinformatics/btab083}
	}
	```

	## Credits

	Original DNABERT model and code by Ji et al. Source: [GitHub](https://github.com/jerryji1993/DNABERT).
	The HF conversion code was authored primarily by [Claude Code](https://claude.ai/code)
	and reviewed manually by Taykhoom Dalal.

	## License

	MIT, following the original repository.