---
language:
- dna
library_name: transformers
tags:
- DNA
- BERT
- language-model
- genomics
license: mit
---
# DNABERT-2
Weights and tokenizer for [DNABERT-2](https://arxiv.org/abs/2306.15006)
(Zhou et al., arXiv 2023), loaded with the shared MosaicBERT implementation
from [Taykhoom/MosaicBERT-updated](https://huggingface.co/Taykhoom/MosaicBERT-updated).
DNABERT-2 is a foundation model trained on large-scale multi-species genome data.
It replaces k-mer tokenization with Byte Pair Encoding (BPE), uses ALiBi positional
biases instead of learned embeddings, and incorporates a GLU-based FFN for improved
efficiency.
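
As a rough illustration of the ALiBi biases mentioned above, the sketch below computes per-head slopes with the recipe from the ALiBi paper and builds a symmetric distance penalty for one head. The function names are hypothetical, and the symmetric `-slope * |i - j|` form is an assumption for a bidirectional encoder; the code in `Taykhoom/MosaicBERT-updated` may differ in detail.

```python
import math


def alibi_slopes(n_heads):
    """Per-head slopes following the ALiBi paper's recipe (assumed here)."""
    def power_of_two_slopes(n):
        start = 2.0 ** (-8.0 / n)
        return [start ** (i + 1) for i in range(n)]

    if n_heads & (n_heads - 1) == 0:  # n_heads is an exact power of two
        return power_of_two_slopes(n_heads)
    # Otherwise: slopes for the closest lower power of two, plus every
    # second slope from the next power of two to fill the remaining heads.
    closest = 2 ** math.floor(math.log2(n_heads))
    extra = alibi_slopes(2 * closest)[0::2][: n_heads - closest]
    return power_of_two_slopes(closest) + extra


def alibi_bias(seq_len, slope):
    """Bias added to attention scores for one head: -slope * |i - j|."""
    return [[-slope * abs(i - j) for j in range(seq_len)]
            for i in range(seq_len)]


slopes = alibi_slopes(12)          # 12 attention heads, as in this model
bias = alibi_bias(4, slopes[0])    # 4 x 4 penalty matrix for the first head
```

Because the bias depends only on the distance between positions, it can be rebuilt for any sequence length at inference time, which is why there is no learned position table to outgrow.
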
**This repo contains only weights and tokenizer files.** The model code is loaded
automatically from `Taykhoom/MosaicBERT-updated` via `trust_remote_code=True`.
## Architecture
| Parameter | Value |
|---|---|
| Layers | 12 |
| Attention heads | 12 |
| Embedding dimension | 768 |
| Intermediate size | 3072 |
| Vocabulary size | 4096 (BPE) |
| Positional encoding | ALiBi (no hard length limit) |
| Max sequence length | ~10000 nt (practical; ALiBi resizes dynamically) |
| Parameters | ~117M |
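
The parameter count can be sanity-checked from the other rows of the table. The sketch below is a back-of-the-envelope estimate, not a dump of the real checkpoint: it assumes a fused QKV projection with bias, a GLU feed-forward block whose first projection is bias-free and twice the intermediate size, two LayerNorms per layer, and two token-type embedding rows. `estimate_parameters` is a hypothetical helper, and task heads (pooler, MLM head) are not counted.

```python
def estimate_parameters(layers=12, hidden=768, intermediate=3072, vocab=4096):
    """Rough encoder parameter count under the assumptions stated above."""
    # Token table, 2 token-type rows, and one LayerNorm (weight + bias).
    embeddings = vocab * hidden + 2 * hidden + 2 * hidden
    # Fused QKV projection and output projection, both with bias (assumed).
    attention = (hidden * 3 * hidden + 3 * hidden) + (hidden * hidden + hidden)
    # GLU FFN: bias-free projection to gate and value halves (assumed),
    # then a projection back to the hidden size with bias.
    ffn = hidden * 2 * intermediate + (intermediate * hidden + hidden)
    # Two LayerNorms per layer, each with weight and bias.
    layer_norms = 2 * (2 * hidden)
    return embeddings + layers * (attention + ffn + layer_norms)


total = estimate_parameters()
print(f"{total / 1e6:.1f}M parameters (estimate)")
```

Under these assumptions the encoder comes out at roughly 116.5M parameters, which is consistent with the ~117M figure once a small task head is added.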
### Tokenization
Uses Byte Pair Encoding (BPE) tokenization via `PreTrainedTokenizerFast`.
No k-mer pre-processing required.
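
To make the contrast with k-mer tokenization concrete, here is a toy sketch in plain Python. The merge list is hypothetical and only three entries long; the real tokenizer ships a 4096-entry vocabulary learned from genome data, so its actual token boundaries will differ.

```python
def kmer_tokens(seq, k):
    """Overlapping k-mers: one token per start position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def apply_bpe_merges(seq, merges):
    """Toy BPE: start from single nucleotides, apply merges in order."""
    tokens = list(seq)
    for left, right in merges:
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)  # fuse the adjacent pair
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens


# Hypothetical merge rules, for illustration only.
TOY_MERGES = [("A", "T"), ("C", "G"), ("AT", "CG")]

seq = "ATCGATCGATCGATCG"
kmers = kmer_tokens(seq, 6)                 # overlapping, len(seq) - 6 + 1 tokens
bpe = apply_bpe_merges(seq, TOY_MERGES)     # non-overlapping, variable length
```

Overlapping k-mers produce nearly one token per nucleotide, while BPE tokens tile the sequence without overlap, so the same input yields a much shorter token sequence.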
## Pretraining
- **Objective:** Masked Language Modeling
- **Data:** Large-scale multi-species genome data (human GRCh38 and others)
- **Source checkpoint:** `pytorch_model.bin` from [zhihan1996/DNABERT-2-117M](https://huggingface.co/zhihan1996/DNABERT-2-117M)
## Parity Verification
Hidden-state representations were verified to be identical (max abs diff = 0.00) to the
original implementation at all 13 representation levels (embedding + 12 transformer layers).
The SDPA path was also verified (max abs diff < 1e-4). Checks were run on GPU with
PyTorch 2.7 / CUDA 12.9.
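
The "max abs diff" metric used above can be sketched as follows. This is a dependency-free illustration on nested lists with made-up numbers, not the script used for the verification; `max_abs_diff` and `parity_report` are hypothetical names. With PyTorch tensors the same quantity is usually computed as `(a - b).abs().max().item()`.

```python
def max_abs_diff(a, b):
    """Largest elementwise |a - b| over two equally shaped nested lists."""
    if isinstance(a, (int, float)):
        return abs(a - b)
    if len(a) != len(b):
        raise ValueError("shape mismatch")
    return max((max_abs_diff(x, y) for x, y in zip(a, b)), default=0.0)


def parity_report(reference_levels, ported_levels, tol=0.0):
    """Compare hidden states level by level; returns (level, diff, ok) tuples."""
    if len(reference_levels) != len(ported_levels):
        raise ValueError("different number of representation levels")
    report = []
    for level, (ref, port) in enumerate(zip(reference_levels, ported_levels)):
        diff = max_abs_diff(ref, port)
        report.append((level, diff, diff <= tol))
    return report


# Toy data: two "levels", each a (tokens x features) matrix.
reference = [[[0.1, 0.2], [0.3, 0.4]], [[1.0, 2.0], [3.0, 4.0]]]
ported = [[[0.1, 0.2], [0.3, 0.4]], [[1.0, 2.0], [3.0, 4.5]]]
report = parity_report(reference, ported, tol=1e-4)
```

A tolerance of `0.0` corresponds to the exact-match check, while a small positive tolerance such as `1e-4` fits a path where a different attention kernel changes the floating-point accumulation order.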
## Related Models
See the full [DNABERT collection](https://huggingface.co/collections/Taykhoom/dnabert-6a20958f8ce004ea4e985e7b).
| Model | Architecture | Notes |
|---|---|---|
| [DNABERT-3mer](https://huggingface.co/Taykhoom/DNABERT-3mer) | BERT + k-mer | k=3 |
| [DNABERT-4mer](https://huggingface.co/Taykhoom/DNABERT-4mer) | BERT + k-mer | k=4 |
| [DNABERT-5mer](https://huggingface.co/Taykhoom/DNABERT-5mer) | BERT + k-mer | k=5 |
| [DNABERT-6mer](https://huggingface.co/Taykhoom/DNABERT-6mer) | BERT + k-mer | k=6 |
| **[DNABERT-2](https://huggingface.co/Taykhoom/DNABERT2)** | **MosaicBERT + BPE + ALiBi** | **This model** |
| [DNABERT-S](https://huggingface.co/Taykhoom/DNABERT-S) | MosaicBERT + BPE + ALiBi | Species-aware contrastive fine-tune |
## Usage
### Embedding generation
```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/DNABERT2", trust_remote_code=True)
model = AutoModel.from_pretrained("Taykhoom/DNABERT2", trust_remote_code=True)
model.eval()

sequences = ["ACGTAGCATCGGATCTATCTATCGACACTTGG", "ATCGATCGATCGATCG"]
enc = tokenizer(sequences, return_tensors="pt", padding=True)

with torch.no_grad():
    out = model(**enc)
    # Intermediate layers
    out_all = model(**enc, output_hidden_states=True)

cls_emb = out.last_hidden_state[:, 0, :]  # (batch, 768)

# Mean pooling over real tokens only, so padding does not skew the average
mask = enc["attention_mask"].unsqueeze(-1).to(out.last_hidden_state.dtype)
mean_emb = (out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)  # (batch, 768)

layer6_emb = out_all.hidden_states[6]
```
### MLM logits
```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/DNABERT2", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("Taykhoom/DNABERT2", trust_remote_code=True)
model.eval()

enc = tokenizer(["ACGTAGCAT[MASK]GGATCTATC"], return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits  # (1, seq_len, 4096)
```
### Attention implementation
```python
import torch
from transformers import AutoModel

# SDPA (default on PyTorch >= 2.0)
model = AutoModel.from_pretrained("Taykhoom/DNABERT2", trust_remote_code=True,
                                  attn_implementation="sdpa")

# Flash Attention 2 (requires the flash-attn package)
model = AutoModel.from_pretrained("Taykhoom/DNABERT2", trust_remote_code=True,
                                  attn_implementation="flash_attention_2",
                                  torch_dtype=torch.bfloat16)
```
## Implementation Notes
The original DNABERT-2 codebase uses a Triton-based flash attention implementation
(`flash_attn_triton.py`). This HF port uses
[Taykhoom/MosaicBERT-updated](https://huggingface.co/Taykhoom/MosaicBERT-updated)
which replaces it with the standard `flash-attn` package, and also adds
`attn_implementation="sdpa"` support. These were not part of the original codebase.
## Citation
```bibtex
@misc{zhou2023_dnabert2,
title = {{DNABERT}-2: Efficient Foundation Model and Benchmark For Multi-Species Genome},
author = {Zhou, Zhihan and Ji, Yanrong and Li, Weijian and Dutta, Pratik and
Davuluri, Ramana and Liu, Han},
year = {2023},
eprint = {2306.15006},
archivePrefix = {arXiv},
primaryClass = {q-bio.GN}
}
```
## Credits
Original DNABERT-2 model and code by Zhou et al.
Source: [GitHub](https://github.com/MAGICS-LAB/DNABERT_2).
The HF conversion code was authored primarily by [Claude Code](https://claude.ai/code)
and reviewed manually by Taykhoom Dalal.
## License
MIT, following the original repository.