language:
  - dna
library_name: transformers
tags:
  - DNA
  - BERT
  - language-model
  - genomics
license: mit

DNABERT-2

Weights and tokenizer for DNABERT-2 (Zhou et al., arXiv 2023), loaded with the shared MosaicBERT implementation from Taykhoom/MosaicBERT-updated.

DNABERT-2 is a foundation model trained on large-scale multi-species genome data. It replaces k-mer tokenization with Byte Pair Encoding (BPE), uses ALiBi positional biases instead of learned positional embeddings, and uses a GLU-based feed-forward network (FFN) for improved efficiency.

This repo contains only weights and tokenizer files. The model code is loaded automatically from Taykhoom/MosaicBERT-updated via trust_remote_code=True.

Architecture

Parameter	Value
Layers	12
Attention heads	12
Embedding dimension	768
Intermediate size	3072
Vocabulary size	4096 (BPE)
Positional encoding	ALiBi (no hard length limit)
Max sequence length	~10000 nt (practical; ALiBi resizes dynamically)
Parameters	~117M
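
The ALiBi row above can be made concrete with a small sketch. It assumes the slope schedule from the original ALiBi paper and the symmetric distance typically used by bidirectional encoders, where each head adds a bias of -slope * |i - j| to its attention scores. The function names are illustrative and are not taken from Taykhoom/MosaicBERT-updated, whose exact slope computation may differ.

```python
import math


def alibi_slopes(n_heads: int) -> list[float]:
    """Per-head slopes following the schedule in the ALiBi paper (assumed here)."""

    def power_of_two_slopes(n: int) -> list[float]:
        # Geometric sequence: 2^(-8/n), 2^(-16/n), ...
        start = 2.0 ** (-8.0 / n)
        return [start ** (i + 1) for i in range(n)]

    if math.log2(n_heads).is_integer():
        return power_of_two_slopes(n_heads)

    # Non-power-of-two head counts (such as 12): use the slopes for the closest
    # lower power of two, then fill in with every other slope from the next one.
    closest = 2 ** math.floor(math.log2(n_heads))
    extra = power_of_two_slopes(2 * closest)[0::2][: n_heads - closest]
    return power_of_two_slopes(closest) + extra


def alibi_bias(seq_len: int, slope: float) -> list[list[float]]:
    """Symmetric bias matrix for one head: entry (i, j) is -slope * |i - j|."""
    return [[-slope * abs(i - j) for j in range(seq_len)] for i in range(seq_len)]


slopes = alibi_slopes(12)          # one slope per attention head
bias = alibi_bias(4, slopes[0])    # 4 x 4 bias for the first head
```

Because the bias depends only on the distance between positions, it can be computed for any sequence length at inference time, which is why the table lists no hard length limit.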

Tokenization

The model uses Byte Pair Encoding (BPE) tokenization via PreTrainedTokenizerFast. Raw nucleotide strings are passed to the tokenizer directly, with no k-mer pre-processing.
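
To illustrate how BPE differs from k-mer tokenization, here is a toy sketch in plain Python. It learns merges from a single short sequence, whereas the real tokenizer ships a trained 4096-token vocabulary in this repo; the function names `kmer_tokens` and `learn_bpe_merges` are hypothetical and exist only for this example.

```python
from collections import Counter


def kmer_tokens(seq: str, k: int) -> list[str]:
    """Overlapping k-mers, as used by the original DNABERT models."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def learn_bpe_merges(seq: str, num_merges: int):
    """Toy BPE: repeatedly merge the most frequent adjacent token pair."""
    tokens = list(seq)  # start from single nucleotides
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing worth merging
        merges.append((a, b))
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges, tokens


print(kmer_tokens("ACGTAC", 3))
print(learn_bpe_merges("ATATATAT", 2))
```

The k-mer tokens overlap and all have the same length, while the BPE tokens are non-overlapping and variable-length, so the same sequence is covered by fewer tokens.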

Pretraining

Objective: Masked Language Modeling
Data: Large-scale multi-species genome data (human GRCh38 and others)
Source checkpoint: pytorch_model.bin from zhihan1996/DNABERT-2-117M

Parity Verification

Hidden-state representations were verified to be identical (max abs diff = 0.00) to those of the original implementation at all 13 representation levels (embedding output + 12 transformer layers). The SDPA path was verified to within max abs diff < 1e-4. Verification was run on GPU with PyTorch 2.7 / CUDA 12.9.
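
A check of this kind can be sketched as follows. This is not the script used for the verification above; it is a minimal illustration that assumes both models were run on the same batch with output_hidden_states=True, and the helper name `max_abs_diff_per_level` is hypothetical.

```python
import torch


def max_abs_diff_per_level(ref_states, new_states):
    """Largest element-wise absolute difference at each representation level."""
    if len(ref_states) != len(new_states):
        raise ValueError("both models must return the same number of levels")
    return [
        (ref.float() - new.float()).abs().max().item()
        for ref, new in zip(ref_states, new_states)
    ]


# Synthetic stand-ins for two models' hidden states (2 levels here, 13 for DNABERT-2)
reference = [torch.zeros(2, 3), torch.ones(2, 3)]
candidate = [torch.zeros(2, 3), torch.ones(2, 3) + 0.5]
print(max_abs_diff_per_level(reference, candidate))
```

With real models, the two arguments would be the `hidden_states` tuples returned by the original and the ported implementation.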

Related Models

See the full DNABERT collection.

Model	Architecture	Notes
DNABERT-3mer	BERT + k-mer	k=3
DNABERT-4mer	BERT + k-mer	k=4
DNABERT-5mer	BERT + k-mer	k=5
DNABERT-6mer	BERT + k-mer	k=6
DNABERT-2	MosaicBERT + BPE + ALiBi	This model
DNABERT-S	MosaicBERT + BPE + ALiBi	Species-aware contrastive fine-tune

Usage

Embedding generation

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/DNABERT2", trust_remote_code=True)
model = AutoModel.from_pretrained("Taykhoom/DNABERT2", trust_remote_code=True)
model.eval()

sequences = ["ACGTAGCATCGGATCTATCTATCGACACTTGG", "ATCGATCGATCGATCG"]
enc = tokenizer(sequences, return_tensors="pt", padding=True)

with torch.no_grad():
    out = model(**enc)

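cls_emb = out.last_hidden_state[:, 0, :]   # (batch, 768)

# Mean pooling over real tokens only; padding positions are masked out
mask = enc["attention_mask"].unsqueeze(-1).to(out.last_hidden_state.dtype)  # (batch, seq_len, 1)
mean_emb = (out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)  # (batch, 768)

# Intermediate layers: hidden_states[0] is the embedding output, hidden_states[i] the output of layer i
with torch.no_grad():
    out_all = model(**enc, output_hidden_states=True)
layer6_emb = out_all.hidden_states[6]   # (batch, seq_len, 768)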

MLM logits

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/DNABERT2", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("Taykhoom/DNABERT2", trust_remote_code=True)
model.eval()

enc = tokenizer(["ACGTAGCAT[MASK]GGATCTATC"], return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits   # (1, seq_len, 4096)
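
To turn the logits into predictions for the masked position, something like the following could be used. It is a sketch that works on any logits tensor of shape (batch, seq_len, vocab); the helper name `top_k_at_mask` is hypothetical, and the demo uses synthetic tensors so it runs without downloading the model.

```python
import torch


def top_k_at_mask(logits, input_ids, mask_token_id, k=5):
    """Return (token_ids, probabilities) of the k most likely tokens at each [MASK]."""
    # Tuple of (batch_indices, position_indices) for every masked token
    mask_positions = (input_ids == mask_token_id).nonzero(as_tuple=True)
    masked_logits = logits[mask_positions]        # (num_masks, vocab)
    probs = masked_logits.softmax(dim=-1)
    top = probs.topk(k, dim=-1)
    return top.indices, top.values


# Synthetic demo: vocabulary of 10 tokens, mask id 4 at position 2
demo_logits = torch.zeros(1, 4, 10)
demo_logits[0, 2, 7] = 5.0
demo_ids = torch.tensor([[1, 5, 4, 2]])
ids, probs = top_k_at_mask(demo_logits, demo_ids, mask_token_id=4, k=3)
print(ids, probs)
```

With the real model, the call would be `top_k_at_mask(logits, enc["input_ids"], tokenizer.mask_token_id)`, and `tokenizer.convert_ids_to_tokens(ids[0].tolist())` would map the ids back to BPE tokens.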

Attention implementation

import torch
from transformers import AutoModel

# SDPA (default on PyTorch >= 2.0)
model = AutoModel.from_pretrained("Taykhoom/DNABERT2", trust_remote_code=True,
                                   attn_implementation="sdpa")

# Flash Attention 2 (requires the flash-attn package, a CUDA GPU, and fp16 or bf16 weights)
model = AutoModel.from_pretrained("Taykhoom/DNABERT2", trust_remote_code=True,
                                   attn_implementation="flash_attention_2",
                                   torch_dtype=torch.bfloat16)

Implementation Notes

The original DNABERT-2 codebase uses a Triton-based flash attention implementation (flash_attn_triton.py). This HF port uses Taykhoom/MosaicBERT-updated, which replaces it with the standard flash-attn package and also adds attn_implementation="sdpa" support. Neither the flash-attn backend nor SDPA support was part of the original codebase.
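
Since Flash Attention 2 depends on the optional flash-attn package and a CUDA device, one way to choose a backend at runtime is sketched below. The helper `pick_attn_implementation` is hypothetical and not part of this repo; it only checks whether the `flash_attn` module can be imported and falls back to SDPA otherwise.

```python
import importlib.util


def pick_attn_implementation(has_cuda: bool) -> str:
    """Prefer Flash Attention 2 when usable, otherwise fall back to SDPA."""
    has_flash = importlib.util.find_spec("flash_attn") is not None
    return "flash_attention_2" if (has_cuda and has_flash) else "sdpa"


print(pick_attn_implementation(has_cuda=False))
```

The result can be passed as `attn_implementation=pick_attn_implementation(torch.cuda.is_available())` when calling `from_pretrained`.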

Citation

@misc{zhou2023_dnabert2,
  title   = {{DNABERT}-2: Efficient Foundation Model and Benchmark For Multi-Species Genome},
  author  = {Zhou, Zhihan and Ji, Yanrong and Li, Weijian and Dutta, Pratik and
             Davuluri, Ramana and Liu, Han},
  year    = {2023},
  eprint  = {2306.15006},
  archivePrefix = {arXiv},
  primaryClass  = {q-bio.GN}
}

Credits

Original DNABERT-2 model and code by Zhou et al. Source: GitHub. The HF conversion code was authored primarily by Claude Code and reviewed manually by Taykhoom Dalal.

License

MIT, following the original repository.