language:
  - dna
library_name: transformers
tags:
  - DNA
  - BERT
  - language-model
  - genomics
license: mit

DNABERT-5mer

Weights and tokenizer for DNABERT (Ji et al., Bioinformatics 2021), 5-mer variant, loaded with the shared BERT implementation from Taykhoom/BERT-updated.

DNABERT is a BERT model pre-trained on the human reference genome using overlapping 5-mer tokenization.

This repo contains only the weights and tokenizer files. The model code is fetched from Taykhoom/BERT-updated at load time, which requires passing trust_remote_code=True to from_pretrained.

Architecture

Standard BERT-base with a 5-mer DNA vocabulary.

Parameter	Value
Layers	12
Attention heads	12
Embedding dimension	768
Vocabulary size	1029 (5 special + 1024 DNA 5-mers)
Positional encoding	Learned absolute
Max sequence length	512 tokens
Total parameters	~92M
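
The vocabulary size follows from the 4^5 = 1024 possible 5-mers over A, C, G, T plus five special tokens. The count can be reproduced with a minimal sketch like the one below. The special-token names and the ordering are assumptions based on standard BERT conventions, and build_kmer_vocab is a hypothetical helper; the vocabulary file shipped in this repo remains authoritative.

```python
from itertools import product

# Assumed special tokens (standard BERT names); the shipped vocab file is authoritative.
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]


def build_kmer_vocab(k=5, alphabet="ACGT"):
    """Return the special tokens followed by every k-mer over the alphabet."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    return SPECIAL_TOKENS + kmers


vocab = build_kmer_vocab()
print(len(vocab))  # 1029
```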

Tokenization

Input sequences must be pre-split into overlapping 5-mers (stride 1) with spaces between tokens before calling the tokenizer. For example:

ATCGATG  ->  ATCGA TCGAT CGATG

def seq_to_kmers(seq, k=5):
    return " ".join(seq[i:i+k] for i in range(len(seq) - k + 1))
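
A sequence of L nucleotides yields L - 5 + 1 overlapping 5-mers, so the 512-token limit constrains input length. Assuming the tokenizer adds one [CLS] and one [SEP] token (standard BERT behaviour, not confirmed here), 510 k-mers fit in one input, which corresponds to 514 nucleotides. The sketch below splits longer sequences into windows of that size; max_nt_per_window and split_for_model are hypothetical helpers, not part of this repo.

```python
def seq_to_kmers(seq, k=5):
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))


def max_nt_per_window(max_tokens=512, k=5, n_special=2):
    """Longest nucleotide window whose k-mers plus special tokens fit in max_tokens."""
    return (max_tokens - n_special) + k - 1


def split_for_model(seq, max_tokens=512, k=5, n_special=2):
    """Split seq into non-overlapping nucleotide windows and convert each to k-mers.

    Trailing windows shorter than k yield no k-mers and are dropped.
    """
    window = max_nt_per_window(max_tokens, k, n_special)
    chunks = [seq[i:i + window] for i in range(0, len(seq), window)]
    return [seq_to_kmers(c, k) for c in chunks if len(c) >= k]


print(max_nt_per_window())  # 514
```

Non-overlapping windows discard the k-mers that span a window boundary; overlapping the windows by k - 1 nucleotides would keep them, at the cost of slightly more inputs.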

Pretraining

Objective: Masked Language Modeling
Data: Human reference genome (GRCh38)
Source checkpoint: pytorch_model.bin from zhihan1996/DNA_bert_5

Parity Verification

Hidden-state representations were verified against the source implementation at all 13 representation levels (the embedding output plus the 12 transformer layers), with a maximum absolute difference below 1.5e-4. The small differences are consistent with float32 accumulation-order effects between two independent implementations of the same mathematics; the source dnabert_layer.BertModel is a direct subclass of transformers.BertModel with no modifications. Verification was run on GPU with PyTorch 2.7 / CUDA 12.9.
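
A check of this kind can be sketched as follows. This is not the script used for the verification above: max_abs_diff_per_level and check_parity are hypothetical helpers, and the inputs are assumed to be two equal-length sequences of hidden-state tensors, such as the hidden_states tuples returned by each model with output_hidden_states=True.

```python
import torch


def max_abs_diff_per_level(ref_states, new_states):
    """Return the maximum absolute difference at each representation level."""
    if len(ref_states) != len(new_states):
        raise ValueError("both models must return the same number of levels")
    return [
        (r.float() - n.float()).abs().max().item()
        for r, n in zip(ref_states, new_states)
    ]


def check_parity(ref_states, new_states, tol=1.5e-4):
    """Return (passed, diffs), where passed is True if every level is below tol."""
    diffs = max_abs_diff_per_level(ref_states, new_states)
    return all(d < tol for d in diffs), diffs
```

For a BERT-base model both sequences would hold 13 tensors of shape (batch, seq_len, 768), computed from the same tokenized batch.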

Related Models

See the full DNABERT collection.

Model	Architecture	Notes
DNABERT-3mer	BERT + k-mer	k=3
DNABERT-4mer	BERT + k-mer	k=4
DNABERT-5mer	BERT + k-mer	k=5
DNABERT-6mer	BERT + k-mer	k=6
DNABERT-2	MosaicBERT + BPE + ALiBi	Multi-species pre-trained
DNABERT-S	MosaicBERT + BPE + ALiBi	Species-aware

Usage

Embedding generation

import torch
from transformers import AutoTokenizer, AutoModel

def seq_to_kmers(seq, k=5):
    return " ".join(seq[i:i+k] for i in range(len(seq) - k + 1))

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/DNABERT-5mer", trust_remote_code=True)
model = AutoModel.from_pretrained("Taykhoom/DNABERT-5mer", trust_remote_code=True)
model.eval()

sequences = ["ATCGATCGATCG", "GCTAGCTAGCTA"]
kmer_seqs = [seq_to_kmers(s) for s in sequences]
enc = tokenizer(kmer_seqs, return_tensors="pt", padding=True)

with torch.no_grad():
    out = model(**enc)

cls_emb   = out.last_hidden_state[:, 0, :]   # (batch, 768)
token_emb = out.last_hidden_state             # (batch, seq_len, 768)

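# Intermediate layers: hidden_states[0] is the embedding output, hidden_states[i] is layer i
with torch.no_grad():
    out_all = model(**enc, output_hidden_states=True)
layer6_emb = out_all.hidden_states[6]         # (batch, seq_len, 768)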

Attention implementation

import torch
from transformers import AutoModel

# SDPA (default on PyTorch >= 2.0)
model = AutoModel.from_pretrained("Taykhoom/DNABERT-5mer", trust_remote_code=True,
                                   attn_implementation="sdpa")

# Flash Attention 2 (requires the flash-attn package and a CUDA GPU; runs in fp16/bf16 only)
model = AutoModel.from_pretrained("Taykhoom/DNABERT-5mer", trust_remote_code=True,
                                   attn_implementation="flash_attention_2",
                                   torch_dtype=torch.bfloat16)

Implementation Notes

In the original DNABERT codebase, BertModel is a thin subclass of transformers.BertModel with no modifications. This HF port uses Taykhoom/BERT-updated, which adds support for attn_implementation="sdpa" and attn_implementation="flash_attention_2"; neither option was part of the original codebase.

Citation

@article{ji2021_dnabert,
  title   = {{DNABERT}: pre-trained Bidirectional Encoder Representations from Transformers model for {DNA}-language in genome},
  author  = {Ji, Yanrong and Zhou, Zhihan and Liu, Han and Davuluri, Ramana V},
  journal = {Bioinformatics},
  volume  = {37},
  number  = {15},
  pages   = {2112--2120},
  year    = {2021},
  doi     = {10.1093/bioinformatics/btab083}
}

Credits

Original DNABERT model and code by Ji et al. Source: GitHub. The HF conversion code was authored primarily by Claude Code and reviewed manually by Taykhoom Dalal.

License

MIT, following the original repository.