---
language:
- dna
library_name: transformers
tags:
- DNA
- BERT
- language-model
- genomics
license: mit
---

# DNABERT-2

Weights and tokenizer for [DNABERT-2](https://arxiv.org/abs/2306.15006)
(Zhou et al., arXiv 2023), loaded with the shared MosaicBERT implementation
from [Taykhoom/MosaicBERT-updated](https://huggingface.co/Taykhoom/MosaicBERT-updated).

DNABERT-2 is a foundation model trained on large-scale multi-species genome data.
It replaces k-mer tokenization with Byte Pair Encoding (BPE), uses ALiBi positional
biases instead of learned positional embeddings, and incorporates a GLU-based
feed-forward network (FFN) for improved efficiency.

**This repo contains only weights and tokenizer files.** The model code is loaded
automatically from `Taykhoom/MosaicBERT-updated` via `trust_remote_code=True`.

## Architecture

| Parameter | Value |
|---|---|
| Layers | 12 |
| Attention heads | 12 |
| Embedding dimension | 768 |
| Intermediate size | 3072 |
| Vocabulary size | 4096 (BPE) |
| Positional encoding | ALiBi (no hard length limit) |
| Max sequence length | ~10000 nt (practical; ALiBi resizes dynamically) |
| Parameters | ~117M |
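
To make the ALiBi entry concrete, the following is a minimal pure-Python sketch of how per-head linear attention biases can be built. It uses the geometric slope schedule from the ALiBi paper and assumes a symmetric `-slope * |i - j|` penalty, as is common for bidirectional encoders; the function names `alibi_slopes` and `alibi_bias` are illustrative and not taken from `Taykhoom/MosaicBERT-updated`, whose code may differ in detail.

```python
import math


def alibi_slopes(n_heads):
    """Per-head slopes using the geometric schedule from the ALiBi paper."""

    def power_of_2_slopes(n):
        # For n = 8 this gives 1/2, 1/4, ..., 1/256.
        start = 2 ** (-(2 ** -(math.log2(n) - 3)))
        return [start * start ** i for i in range(n)]

    if math.log2(n_heads).is_integer():
        return power_of_2_slopes(n_heads)

    # Non-power-of-2 head counts (e.g. 12): take the slopes for the closest
    # lower power of 2, then fill up with every other slope of the next one.
    closest = 2 ** math.floor(math.log2(n_heads))
    extra = alibi_slopes(2 * closest)[0::2][: n_heads - closest]
    return power_of_2_slopes(closest) + extra


def alibi_bias(n_heads, seq_len):
    """Bias of shape (n_heads, seq_len, seq_len), added to attention scores.

    Assumption: symmetric distance penalty -slope * |i - j|.
    """
    return [
        [[-s * abs(i - j) for j in range(seq_len)] for i in range(seq_len)]
        for s in alibi_slopes(n_heads)
    ]


bias = alibi_bias(n_heads=12, seq_len=4)
print(len(bias), len(bias[0]), len(bias[0][0]))
```

Because the bias depends only on token distance, it can be recomputed for any `seq_len`, which is why the table lists no hard length limit.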

### Tokenization

Uses Byte Pair Encoding (BPE) tokenization via `PreTrainedTokenizerFast`.
No k-mer pre-processing required.
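
For contrast, the original DNABERT expects each sequence to be split into space-separated overlapping k-mers before tokenization. The sketch below shows that step, which DNABERT-2 skips; `seq_to_kmers` is a hypothetical helper name used only for illustration.

```python
def seq_to_kmers(seq, k=6):
    """Split a DNA string into space-separated overlapping k-mers (stride 1)."""
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))


# A sequence of length L yields L - k + 1 k-mers.
print(seq_to_kmers("ACGTAG", k=3))  # ACG CGT GTA TAG
```

With DNABERT-2, the raw string (here `"ACGTAG"`) is passed to the tokenizer directly.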

## Pretraining

- **Objective:** Masked Language Modeling
- **Data:** Large-scale multi-species genome data (GRCh38 and others)
- **Source checkpoint:** `pytorch_model.bin` from [zhihan1996/DNABERT-2-117M](https://huggingface.co/zhihan1996/DNABERT-2-117M)

## Parity Verification

Hidden-state representations were verified to be identical (max abs diff = 0.00) to
those of the original implementation at all 13 representation levels (embedding output
+ 12 transformer layers). The SDPA path was verified to match within max abs diff < 1e-4.
Verification was run on GPU with PyTorch 2.7 / CUDA 12.9.

## Related Models

See the full [DNABERT collection](https://huggingface.co/collections/Taykhoom/dnabert-6a20958f8ce004ea4e985e7b).

| Model | Architecture | Notes |
|---|---|---|
| [DNABERT-3mer](https://huggingface.co/Taykhoom/DNABERT-3mer) | BERT + k-mer | k=3 |
| [DNABERT-4mer](https://huggingface.co/Taykhoom/DNABERT-4mer) | BERT + k-mer | k=4 |
| [DNABERT-5mer](https://huggingface.co/Taykhoom/DNABERT-5mer) | BERT + k-mer | k=5 |
| [DNABERT-6mer](https://huggingface.co/Taykhoom/DNABERT-6mer) | BERT + k-mer | k=6 |
| **[DNABERT-2](https://huggingface.co/Taykhoom/DNABERT2)** | **MosaicBERT + BPE + ALiBi** | **This model** |
| [DNABERT-S](https://huggingface.co/Taykhoom/DNABERT-S) | MosaicBERT + BPE + ALiBi | Species-aware contrastive fine-tune |

## Usage

### Embedding generation

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/DNABERT2", trust_remote_code=True)
model = AutoModel.from_pretrained("Taykhoom/DNABERT2", trust_remote_code=True)
model.eval()

sequences = ["ACGTAGCATCGGATCTATCTATCGACACTTGG", "ATCGATCGATCGATCG"]
enc = tokenizer(sequences, return_tensors="pt", padding=True)

with torch.no_grad():
    out = model(**enc)
    # Intermediate layers (also computed without gradients)
    out_all = model(**enc, output_hidden_states=True)

cls_emb = out.last_hidden_state[:, 0, :]  # (batch, 768)

# Mean pooling over real tokens only; padding positions are masked out
mask = enc["attention_mask"].unsqueeze(-1).to(out.last_hidden_state.dtype)
mean_emb = (out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)  # (batch, 768)

layer6_emb = out_all.hidden_states[6]
```

### MLM logits

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/DNABERT2", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("Taykhoom/DNABERT2", trust_remote_code=True)
model.eval()

enc = tokenizer(["ACGTAGCAT[MASK]GGATCTATC"], return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits   # (1, seq_len, 4096)
```

### Attention implementation

```python
import torch
from transformers import AutoModel

# SDPA (default on PyTorch >= 2.0)
model = AutoModel.from_pretrained("Taykhoom/DNABERT2", trust_remote_code=True,
                                   attn_implementation="sdpa")

# Flash Attention 2 (requires the flash-attn package and half-precision weights)
model = AutoModel.from_pretrained("Taykhoom/DNABERT2", trust_remote_code=True,
                                   attn_implementation="flash_attention_2",
                                   torch_dtype=torch.bfloat16)
```

## Implementation Notes

The original DNABERT-2 codebase uses a Triton-based flash attention implementation
(`flash_attn_triton.py`). This HF port uses
[Taykhoom/MosaicBERT-updated](https://huggingface.co/Taykhoom/MosaicBERT-updated),
which replaces it with the standard `flash-attn` package and adds
`attn_implementation="sdpa"` support. Neither was part of the original codebase.

## Citation

```bibtex
@misc{zhou2023_dnabert2,
  title   = {{DNABERT}-2: Efficient Foundation Model and Benchmark For Multi-Species Genome},
  author  = {Zhou, Zhihan and Ji, Yanrong and Li, Weijian and Dutta, Pratik and
             Davuluri, Ramana and Liu, Han},
  year    = {2023},
  eprint  = {2306.15006},
  archivePrefix = {arXiv},
  primaryClass  = {q-bio.GN}
}
```

## Credits

Original DNABERT-2 model and code by Zhou et al.
Source: [GitHub](https://github.com/MAGICS-LAB/DNABERT_2).
The HF conversion code was authored primarily by [Claude Code](https://claude.ai/code)
and reviewed manually by Taykhoom Dalal.

## License

MIT, following the original repository.