Upload folder using huggingface_hub

Files changed (6)

README.md +155 -0
config.json +32 -0
model.safetensors +3 -0
special_tokens_map.json +7 -0
tokenizer.json +0 -0
tokenizer_config.json +9 -0

README.md ADDED

	@@ -0,0 +1,155 @@

+---
+language:
+- dna
+library_name: transformers
+tags:
+- DNA
+- BERT
+- language-model
+- genomics
+license: mit
+---
+# DNABERT-2
+Weights and tokenizer for [DNABERT-2](https://arxiv.org/abs/2306.15006)
+(Zhou et al., arXiv 2023), loaded with the shared MosaicBERT implementation
+from [Taykhoom/MosaicBERT-updated](https://huggingface.co/Taykhoom/MosaicBERT-updated).
+DNABERT-2 is a foundation model trained on large-scale multi-species genome data.
+Compared with the original DNABERT, it replaces k-mer tokenization with Byte Pair
+Encoding (BPE), uses ALiBi positional biases instead of learned position embeddings,
+and uses a GLU-based feed-forward network (FFN) for improved efficiency.
+**This repo contains only weights and tokenizer files.** The model code is loaded
+automatically from `Taykhoom/MosaicBERT-updated` via `trust_remote_code=True`.
+## Architecture
+| Parameter | Value |
+|---|---|
+| Layers | 12 |
+| Attention heads | 12 |
+| Embedding dimension | 768 |
+| Intermediate size | 3072 |
+| Vocabulary size | 4096 (BPE) |
+| Positional encoding | ALiBi (no hard length limit) |
+| Max sequence length | ~10000 nt (practical; ALiBi resizes dynamically) |
+| Parameters | ~117M |
+### Tokenization
+Uses Byte Pair Encoding (BPE) tokenization via `PreTrainedTokenizerFast`.
+No k-mer pre-processing required.
+## Pretraining
+- **Objective:** Masked Language Modeling
+- **Data:** Large-scale multi-species genome data (GRCh38 and others)
+- **Source checkpoint:** `pytorch_model.bin` from [zhihan1996/DNABERT-2-117M](https://huggingface.co/zhihan1996/DNABERT-2-117M)
+## Parity Verification
+Hidden-state representations were verified to be identical (max abs diff = 0.00) to
+those of the original implementation at all 13 representation levels (embedding
+output + 12 transformer layers). The SDPA path was also verified (max abs diff < 1e-4).
+Verification was run on GPU with PyTorch 2.7 / CUDA 12.9.
+## Related Models
+See the full [DNABERT collection](https://huggingface.co/collections/Taykhoom/dnabert-6a20958f8ce004ea4e985e7b).
+| Model | Architecture | Notes |
+|---|---|---|
+| [DNABERT-3mer](https://huggingface.co/Taykhoom/DNABERT-3mer) | BERT + k-mer | k=3 |
+| [DNABERT-4mer](https://huggingface.co/Taykhoom/DNABERT-4mer) | BERT + k-mer | k=4 |
+| [DNABERT-5mer](https://huggingface.co/Taykhoom/DNABERT-5mer) | BERT + k-mer | k=5 |
+| [DNABERT-6mer](https://huggingface.co/Taykhoom/DNABERT-6mer) | BERT + k-mer | k=6 |
+| **[DNABERT-2](https://huggingface.co/Taykhoom/DNABERT2)** | **MosaicBERT + BPE + ALiBi** | **This model** |
+| [DNABERT-S](https://huggingface.co/Taykhoom/DNABERT-S) | MosaicBERT + BPE + ALiBi | Species-aware contrastive fine-tune |
+## Usage
+### Embedding generation
+```python
+import torch
+from transformers import AutoTokenizer, AutoModel
+tokenizer = AutoTokenizer.from_pretrained("Taykhoom/DNABERT2", trust_remote_code=True)
+model = AutoModel.from_pretrained("Taykhoom/DNABERT2", trust_remote_code=True)
+model.eval()
+sequences = ["ACGTAGCATCGGATCTATCTATCGACACTTGG", "ATCGATCGATCGATCG"]
+enc = tokenizer(sequences, return_tensors="pt", padding=True)
+with torch.no_grad():
+    out = model(**enc)
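+cls_emb = out.last_hidden_state[:, 0, :]  # (batch, 768)
+# Mean pooling over real tokens only; padded positions are masked out
+mask = enc["attention_mask"].unsqueeze(-1).to(out.last_hidden_state.dtype)
+mean_emb = (out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)  # (batch, 768)
+# Intermediate layers (index 0 = embedding output, 1-12 = transformer layers)
+with torch.no_grad():
+    out_all = model(**enc, output_hidden_states=True)
+layer6_emb = out_all.hidden_states[6]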
+```
+### MLM logits
+```python
+import torch
+from transformers import AutoTokenizer, AutoModelForMaskedLM
+tokenizer = AutoTokenizer.from_pretrained("Taykhoom/DNABERT2", trust_remote_code=True)
+model = AutoModelForMaskedLM.from_pretrained("Taykhoom/DNABERT2", trust_remote_code=True)
+model.eval()
+enc = tokenizer(["ACGTAGCAT[MASK]GGATCTATC"], return_tensors="pt")
+with torch.no_grad():
+    logits = model(**enc).logits   # (1, seq_len, 4096)
+```
+### Attention implementation
+```python
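+import torch
+from transformers import AutoModel
+# SDPA (default on PyTorch >= 2.0)
+model = AutoModel.from_pretrained("Taykhoom/DNABERT2", trust_remote_code=True,
+                                   attn_implementation="sdpa")
+# Flash Attention 2 (requires the `flash-attn` package and a CUDA GPU)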
+model = AutoModel.from_pretrained("Taykhoom/DNABERT2", trust_remote_code=True,
+                                   attn_implementation="flash_attention_2",
+                                   torch_dtype=torch.bfloat16)
+```
+## Implementation Notes
+The original DNABERT-2 codebase uses a Triton-based flash attention implementation
+(`flash_attn_triton.py`). This HF port uses
+[Taykhoom/MosaicBERT-updated](https://huggingface.co/Taykhoom/MosaicBERT-updated)
+which replaces it with the standard `flash-attn` package, and also adds
+`attn_implementation="sdpa"` support. These were not part of the original codebase.
+## Citation
+```bibtex
+@misc{zhou2023_dnabert2,
+  title   = {{DNABERT}-2: Efficient Foundation Model and Benchmark For Multi-Species Genome},
+  author  = {Zhou, Zhihan and Ji, Yanrong and Li, Weijian and Dutta, Pratik and
+             Davuluri, Ramana and Liu, Han},
+  year    = {2023},
+  eprint  = {2306.15006},
+  archivePrefix = {arXiv},
+  primaryClass  = {q-bio.GN}
+}
+```
+## Credits
+Original DNABERT-2 model and code by Zhou et al.
+Source: [GitHub](https://github.com/MAGICS-LAB/DNABERT_2).
+The HF conversion code was authored primarily by [Claude Code](https://claude.ai/code)
+and reviewed manually by Taykhoom Dalal.
+## License
+MIT, following the original repository.
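
Note on the ALiBi positional biases mentioned in the README: the following pure-Python sketch shows how per-head slopes and a distance penalty can be computed, following the general recipe from the ALiBi paper and using a symmetric `|i - j|` distance as is common for bidirectional encoders. It is not taken from `Taykhoom/MosaicBERT-updated`; the helper names `alibi_slopes` and `alibi_bias` are hypothetical, and the actual implementation may differ in detail.

```python
import math


def alibi_slopes(n_heads):
    """Per-head ALiBi slopes (hypothetical helper, standard ALiBi recipe)."""

    def power_of_2_slopes(n):
        # Geometric sequence starting at 2^(-8/n) with the same ratio.
        start = 2.0 ** (-8.0 / n)
        return [start ** (i + 1) for i in range(n)]

    if math.log2(n_heads).is_integer():
        return power_of_2_slopes(n_heads)
    # For a head count that is not a power of two, take the slopes of the
    # closest lower power of two, then fill in with every other slope from
    # the next power of two.
    closest = 2 ** math.floor(math.log2(n_heads))
    extra = power_of_2_slopes(2 * closest)[0::2][: n_heads - closest]
    return power_of_2_slopes(closest) + extra


def alibi_bias(seq_len, slope):
    """Additive attention bias for one head: -slope * |i - j|."""
    return [[-slope * abs(i - j) for j in range(seq_len)] for i in range(seq_len)]


slopes = alibi_slopes(12)        # 12 attention heads, as in the architecture table
bias = alibi_bias(4, slopes[0])  # 4 x 4 bias matrix for the first head
```

Because the bias depends only on the distance between positions, it can be recomputed for any sequence length, which is why the README lists no hard length limit.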

config.json ADDED

	@@ -0,0 +1,32 @@

+{
+  "_name_or_path": "Taykhoom/DNABERT2",
+  "alibi_starting_size": 512,
+  "architectures": [
+    "BertForMaskedLM"
+  ],
+  "attention_probs_dropout_prob": 0.0,
+  "auto_map": {
+    "AutoConfig": "Taykhoom/MosaicBERT-updated--configuration_bert.BertConfig",
+    "AutoModel": "Taykhoom/MosaicBERT-updated--bert_layers.BertModel",
+    "AutoModelForMaskedLM": "Taykhoom/MosaicBERT-updated--bert_layers.BertForMaskedLM",
+    "AutoModelForSequenceClassification": "Taykhoom/MosaicBERT-updated--bert_layers.BertForSequenceClassification"
+  },
+  "classifier_dropout": null,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 768,
+  "initializer_range": 0.02,
+  "intermediate_size": 3072,
+  "layer_norm_eps": 1e-12,
+  "max_position_embeddings": 512,
+  "model_max_length": 10000,
+  "model_type": "bert",
+  "num_attention_heads": 12,
+  "num_hidden_layers": 12,
+  "pad_token_id": 3,
+  "position_embedding_type": "absolute",
+  "transformers_version": "4.57.6",
+  "type_vocab_size": 2,
+  "use_cache": true,
+  "vocab_size": 4096
+}
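
As a sanity check on the "~117M" parameter figure in the README, the sizes in this config can be turned into a back-of-the-envelope count. The sketch below assumes a MosaicBERT-style layout: no learned position embeddings (ALiBi), a fused QKV projection with bias, a GLU feed-forward block whose gate/value projection has no bias, and a pooler layer. These layout details are assumptions for illustration and are not read from the checkpoint; `estimate_params` is a hypothetical helper.

```python
# Values copied from config.json above.
cfg = {
    "vocab_size": 4096,
    "hidden_size": 768,
    "intermediate_size": 3072,
    "num_hidden_layers": 12,
    "type_vocab_size": 2,
}


def estimate_params(cfg):
    """Rough parameter count under the assumed layer layout (see lead-in)."""
    h, f = cfg["hidden_size"], cfg["intermediate_size"]
    layer_norm = 2 * h  # weight + bias
    # Token and token-type embeddings plus LayerNorm; no position table (ALiBi).
    embeddings = cfg["vocab_size"] * h + cfg["type_vocab_size"] * h + layer_norm
    # Fused QKV projection, output projection, LayerNorm.
    attention = (h * 3 * h + 3 * h) + (h * h + h) + layer_norm
    # GLU FFN: gate/value projection without bias, output projection, LayerNorm.
    glu_ffn = (h * 2 * f) + (f * h + h) + layer_norm
    pooler = h * h + h
    return embeddings + cfg["num_hidden_layers"] * (attention + glu_ffn) + pooler


total = estimate_params(cfg)
print(f"{total / 1e6:.1f}M parameters")
```

Under these assumptions the estimate lands close to the ~117M quoted in the architecture table; the MLM head is not included.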

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:92a85bb75c431027162899e60f80f26eff64cba9218cc2bdb42c881d9461e852
+size 480895944

special_tokens_map.json ADDED

	@@ -0,0 +1,7 @@

+{
+  "cls_token": "[CLS]",
+  "mask_token": "[MASK]",
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "unk_token": "[UNK]"
+}

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED

	@@ -0,0 +1,9 @@

+{
+  "cls_token": "[CLS]",
+  "mask_token": "[MASK]",
+  "model_max_length": 10000,
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "tokenizer_class": "PreTrainedTokenizerFast",
+  "unk_token": "[UNK]"
+}
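
To illustrate how BPE tokenization differs from fixed k-mer splitting, here is a toy sketch of BPE merge learning on DNA strings. It is a simplified illustration only: the real vocabulary (4096 tokens) lives in `tokenizer.json` and was not produced by this code, and the function names and the two-sequence corpus are hypothetical.

```python
from collections import Counter


def most_frequent_pair(seqs):
    """Return the most common adjacent token pair across all sequences."""
    counts = Counter()
    for toks in seqs:
        counts.update(zip(toks, toks[1:]))
    return counts.most_common(1)[0][0] if counts else None


def merge_pair(toks, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(toks):
        if i + 1 < len(toks) and (toks[i], toks[i + 1]) == pair:
            merged.append(toks[i] + toks[i + 1])
            i += 2
        else:
            merged.append(toks[i])
            i += 1
    return merged


def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges, starting from single nucleotides."""
    seqs = [list(s) for s in corpus]
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(seqs)
        if pair is None:
            break
        merges.append(pair)
        seqs = [merge_pair(toks, pair) for toks in seqs]
    return merges, seqs


corpus = ["ATATATCG", "ATATGG"]  # toy corpus, not real training data
merges, tokenized = learn_bpe(corpus, num_merges=2)
```

Unlike k-mer tokenization, the resulting tokens have variable length and do not overlap, so concatenating them reproduces the input sequence exactly.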