---
language:
- dna
library_name: transformers
tags:
- DNA
- BERT
- language-model
- genomics
license: mit
---

# DNABERT-4mer

Weights and tokenizer for [DNABERT](https://github.com/jerryji1993/DNABERT)
(Ji et al., Bioinformatics 2021), 4-mer variant, loaded with the shared
BERT implementation from [Taykhoom/BERT-updated](https://huggingface.co/Taykhoom/BERT-updated).

DNABERT is a BERT model pre-trained on the human reference genome using
overlapping 4-mer tokenization.

**This repo contains only weights and tokenizer files.** The model code is loaded
automatically from `Taykhoom/BERT-updated` via `trust_remote_code=True`.

## Architecture

Standard BERT-base with a 4-mer DNA vocabulary.

| Parameter | Value |
|---|---|
| Layers | 12 |
| Attention heads | 12 |
| Embedding dimension | 768 |
| Vocabulary size | 261 (5 special + 256 DNA 4-mers) |
| Positional encoding | Learned absolute |
| Max sequence length | 512 tokens |
| Parameters | ~88M |
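
The vocabulary size in the table follows from simple counting: 4 nucleotides over 4 positions give 4^4 = 256 k-mers, plus 5 special tokens. The sketch below reproduces that arithmetic. The special-token names are the standard BERT set and are an assumption here (this card only states that there are five), and the enumeration order is illustrative, not the order in the shipped vocabulary file.

```python
from itertools import product

def dna_kmers(k=4, alphabet="ACGT"):
    """Enumerate every k-mer over the given alphabet."""
    return ["".join(p) for p in product(alphabet, repeat=k)]

# Assumed special tokens (standard BERT names); the card only gives the count.
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]

kmers = dna_kmers(4)
vocab_size = len(SPECIAL_TOKENS) + len(kmers)
print(len(kmers), vocab_size)
```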

### Tokenization

Input sequences must be pre-split into overlapping 4-mers (stride 1) with spaces
between tokens before calling the tokenizer. For example:

```
ATCGATG  ->  ATCG TCGA CGAT GATG
```

```python
def seq_to_kmers(seq, k=4):
    return " ".join(seq[i:i+k] for i in range(len(seq) - k + 1))
```
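
Because the model accepts at most 512 tokens, long sequences have to be split before k-mer conversion. The following is a minimal sketch, assuming the tokenizer adds one `[CLS]` and one `[SEP]` token per sequence (standard BERT behavior, not stated in this card). Under that assumption a window holds 510 k-mers, which corresponds to 513 nucleotides for k=4. `chunk_sequence` is a hypothetical helper, not part of this repo.

```python
def seq_to_kmers(seq, k=4):
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))

def chunk_sequence(seq, k=4, max_tokens=512, n_special=2):
    """Split seq into non-overlapping windows whose k-mer count fits the model."""
    max_kmers = max_tokens - n_special   # k-mers available per window (assumed 2 special tokens)
    window = max_kmers + k - 1           # nucleotides that produce exactly max_kmers k-mers
    chunks = [seq[i:i + window] for i in range(0, len(seq), window)]
    # A trailing fragment shorter than k produces no k-mer, so drop it.
    return [c for c in chunks if len(c) >= k]

long_seq = "ACGT" * 300                  # 1200 nucleotides
chunks = chunk_sequence(long_seq)
kmer_counts = [len(seq_to_kmers(c).split()) for c in chunks]
```

Non-overlapping windows discard the k-mers that span a window boundary. If those matter for a downstream task, overlapping the windows by at least k - 1 nucleotides is one option.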

## Pretraining

- **Objective:** Masked Language Modeling
- **Data:** Human reference genome (GRCh38)
- **Source checkpoint:** `pytorch_model.bin` from [zhihan1996/DNA_bert_4](https://huggingface.co/zhihan1996/DNA_bert_4)

## Parity Verification

Hidden-state representations were verified against the source implementation at
all 13 representation levels (embedding output plus 12 transformer layers), with
a maximum absolute difference below 1.5e-4. The small differences are consistent
with float32 accumulation effects between two independent implementations of
identical mathematics; the source `dnabert_layer.BertModel` is a direct subclass
of `transformers.BertModel` with no modifications.
Verification was run on GPU with PyTorch 2.7 / CUDA 12.9.
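
A check of this kind can be sketched as follows. This is not the script used for the verification above; `max_abs_diff_per_layer` is a hypothetical helper, and the tensors below are synthetic stand-ins for the `hidden_states` tuples that two models would return with `output_hidden_states=True`.

```python
import torch

def max_abs_diff_per_layer(hidden_a, hidden_b):
    """Return the maximum absolute difference for each pair of hidden states."""
    if len(hidden_a) != len(hidden_b):
        raise ValueError("both models must return the same number of hidden states")
    return [(a.float() - b.float()).abs().max().item()
            for a, b in zip(hidden_a, hidden_b)]

# Synthetic stand-ins: 13 levels (embedding output + 12 layers), shape (batch, seq_len, 768).
torch.manual_seed(0)
ref_states = [torch.randn(2, 11, 768) for _ in range(13)]
port_states = [h + 1e-5 * torch.randn_like(h) for h in ref_states]

diffs = max_abs_diff_per_layer(ref_states, port_states)
worst = max(diffs)
```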

## Related Models

See the full [DNABERT collection](https://huggingface.co/collections/Taykhoom/dnabert-6a20958f8ce004ea4e985e7b).

| Model | Architecture | Notes |
|---|---|---|
| [DNABERT-3mer](https://huggingface.co/Taykhoom/DNABERT-3mer) | BERT + k-mer | k=3 |
| **[DNABERT-4mer](https://huggingface.co/Taykhoom/DNABERT-4mer)** | **BERT + k-mer** | **k=4** |
| [DNABERT-5mer](https://huggingface.co/Taykhoom/DNABERT-5mer) | BERT + k-mer | k=5 |
| [DNABERT-6mer](https://huggingface.co/Taykhoom/DNABERT-6mer) | BERT + k-mer | k=6 |
| [DNABERT-2](https://huggingface.co/Taykhoom/DNABERT2) | MosaicBERT + BPE + ALiBi | Multi-species pre-trained |
| [DNABERT-S](https://huggingface.co/Taykhoom/DNABERT-S) | MosaicBERT + BPE + ALiBi | Species-aware |


## Usage

### Embedding generation

```python
import torch
from transformers import AutoTokenizer, AutoModel

def seq_to_kmers(seq, k=4):
    return " ".join(seq[i:i+k] for i in range(len(seq) - k + 1))

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/DNABERT-4mer", trust_remote_code=True)
model = AutoModel.from_pretrained("Taykhoom/DNABERT-4mer", trust_remote_code=True)
model.eval()

sequences = ["ATCGATCGATCG", "GCTAGCTAGCTA"]
kmer_seqs = [seq_to_kmers(s) for s in sequences]
enc = tokenizer(kmer_seqs, return_tensors="pt", padding=True)

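with torch.no_grad():
    # One forward pass returns both the final and the intermediate hidden states.
    out = model(**enc, output_hidden_states=True)

cls_emb   = out.last_hidden_state[:, 0, :]   # (batch, 768)
token_emb = out.last_hidden_state             # (batch, seq_len, 768)

# Intermediate layers: hidden_states[0] is the embedding output,
# hidden_states[i] is the output of transformer layer i (13 entries in total).
layer6_emb = out.hidden_states[6]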
```
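
When a single vector per sequence is needed and the batch is padded, averaging over real tokens only is a common alternative to the `[CLS]` embedding. The sketch below is one way to do that; `masked_mean_pool` is a hypothetical helper, not something this repo provides, and the tensors are synthetic so the block runs on its own.

```python
import torch

def masked_mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings over positions where attention_mask is 1."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)                          # avoid division by zero
    return summed / counts

# Synthetic example: one sequence, three positions, the last one is padding.
hidden = torch.tensor([[[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])
pooled = masked_mean_pool(hidden, mask)
```

With the tensors from the usage example, the call would be `masked_mean_pool(out.last_hidden_state, enc["attention_mask"])`. Special tokens that the attention mask marks as real positions are included in the average.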

### Attention implementation

```python
import torch
from transformers import AutoModel

# SDPA (default on PyTorch >= 2.0)
model = AutoModel.from_pretrained("Taykhoom/DNABERT-4mer", trust_remote_code=True,
                                   attn_implementation="sdpa")

# Flash Attention 2: needs the flash-attn package, a CUDA GPU, and fp16/bf16 weights
model = AutoModel.from_pretrained("Taykhoom/DNABERT-4mer", trust_remote_code=True,
                                   attn_implementation="flash_attention_2",
                                   torch_dtype=torch.bfloat16).to("cuda")
```
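
If the same script has to run on machines with and without Flash Attention, the backend can be chosen at runtime. The selection logic below is an assumption for illustration, not part of this repo: `pick_attn_implementation` is a hypothetical helper that only checks whether the `flash_attn` package is importable and whether the caller reports a CUDA device.

```python
import importlib.util

def pick_attn_implementation(cuda_available: bool) -> str:
    """Return "flash_attention_2" only when it can plausibly run, else "sdpa"."""
    has_flash_attn = importlib.util.find_spec("flash_attn") is not None
    if cuda_available and has_flash_attn:
        return "flash_attention_2"
    return "sdpa"

# In real use, pass torch.cuda.is_available() instead of a literal.
choice = pick_attn_implementation(cuda_available=False)
```

The returned string can be passed as `attn_implementation=` to `from_pretrained`.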

## Implementation Notes

The original DNABERT codebase defines `BertModel` as a thin subclass of
`transformers.BertModel` with no modifications. This HF port uses
[Taykhoom/BERT-updated](https://huggingface.co/Taykhoom/BERT-updated), which adds
support for `attn_implementation="sdpa"` and
`attn_implementation="flash_attention_2"`. Neither option was part of the
original codebase.

## Citation

```bibtex
@article{ji2021_dnabert,
  title   = {{DNABERT}: pre-trained Bidirectional Encoder Representations from Transformers model for {DNA}-language in genome},
  author  = {Ji, Yanrong and Zhou, Zhihan and Liu, Han and Davuluri, Ramana V},
  journal = {Bioinformatics},
  volume  = {37},
  number  = {15},
  pages   = {2112--2120},
  year    = {2021},
  doi     = {10.1093/bioinformatics/btab083}
}
```

## Credits

Original DNABERT model and code by Ji et al. Source: [GitHub](https://github.com/jerryji1993/DNABERT).
The HF conversion code was authored primarily by [Claude Code](https://claude.ai/code)
and reviewed manually by Taykhoom Dalal.

## License

MIT, following the original repository.