Upload folder using huggingface_hub

Files changed (6)

README.md +136 -0
config.json +32 -0
model.safetensors +3 -0
special_tokens_map.json +7 -0
tokenizer.json +0 -0
tokenizer_config.json +9 -0

README.md ADDED

	@@ -0,0 +1,136 @@

+---
+language:
+- dna
+library_name: transformers
+tags:
+- DNA
+- BERT
+- language-model
+- genomics
+license: apache-2.0
+---
+# DNABERT-S
+Weights and tokenizer for [DNABERT-S](https://arxiv.org/abs/2402.08777)
+(Zhou et al., arXiv 2024), loaded with the shared MosaicBERT implementation
+from [Taykhoom/MosaicBERT-updated](https://huggingface.co/Taykhoom/MosaicBERT-updated).
+DNABERT-S is a species-aware DNA embedding model fine-tuned from DNABERT-2 using
+curriculum contrastive learning. It generates embeddings that naturally cluster and
+segregate genomes from different species, enabling species identification,
+metagenomics binning, and evolutionary analysis.
+**This repo contains only weights and tokenizer files.** The model code is loaded
+automatically from `Taykhoom/MosaicBERT-updated` via `trust_remote_code=True`.
+## Architecture
+| Parameter | Value |
+|---|---|
+| Layers | 12 |
+| Attention heads | 12 |
+| Embedding dimension | 768 |
+| Intermediate size | 3072 |
+| Vocabulary size | 4096 (BPE, identical to DNABERT-2) |
+| Positional encoding | ALiBi (no hard length limit) |
+| Max sequence length | ~10000 nt (practical; ALiBi resizes dynamically) |
+| Parameters | ~117M (backbone only, no MLM head) |
+### Tokenization
+Uses Byte Pair Encoding (BPE) tokenization via `PreTrainedTokenizerFast`,
+identical vocabulary to DNABERT-2. No k-mer pre-processing required.
+## Pretraining
+- **Objective:** Curriculum contrastive learning (C²LR) on same-species sequence pairs: Weighted SimCLR followed by Manifold Instance Mixup (MI-Mix)
+- **Initialization:** Fine-tuned from [zhihan1996/DNABERT-2-117M](https://huggingface.co/zhihan1996/DNABERT-2-117M)
+- **Source checkpoint:** `pytorch_model.bin` from [zhihan1996/DNABERT-S](https://huggingface.co/zhihan1996/DNABERT-S)
+## Parity Verification
+Hidden-state representations verified identical (max abs diff = 0.00) to the original
+implementation at all 13 representation levels (embedding + 12 transformer layers).
+SDPA verified (max abs diff < 1e-4). Verified on GPU with PyTorch 2.7 / CUDA 12.9.
+## Related Models
+See the full [DNABERT collection](https://huggingface.co/collections/Taykhoom/dnabert-6a20958f8ce004ea4e985e7b).
+| Model | Architecture | Notes |
+|---|---|---|
+| [DNABERT-3mer](https://huggingface.co/Taykhoom/DNABERT-3mer) | BERT + k-mer | k=3 |
+| [DNABERT-4mer](https://huggingface.co/Taykhoom/DNABERT-4mer) | BERT + k-mer | k=4 |
+| [DNABERT-5mer](https://huggingface.co/Taykhoom/DNABERT-5mer) | BERT + k-mer | k=5 |
+| [DNABERT-6mer](https://huggingface.co/Taykhoom/DNABERT-6mer) | BERT + k-mer | k=6 |
+| [DNABERT-2](https://huggingface.co/Taykhoom/DNABERT2) | MosaicBERT + BPE + ALiBi | Pre-trained |
+| **[DNABERT-S](https://huggingface.co/Taykhoom/DNABERT-S)** | **MosaicBERT + BPE + ALiBi** | **This model** |
+## Usage
+### Embedding generation
+```python
+import torch
+from transformers import AutoTokenizer, AutoModel
+tokenizer = AutoTokenizer.from_pretrained("Taykhoom/DNABERT-S", trust_remote_code=True)
+model = AutoModel.from_pretrained("Taykhoom/DNABERT-S", trust_remote_code=True)
+model.eval()
+sequences = ["ACGTAGCATCGGATCTATCTATCGACACTTGG", "ATCGATCGATCGATCG"]
+enc = tokenizer(sequences, return_tensors="pt", padding=True)
+with torch.no_grad():
+    out = model(**enc)
+cls_emb  = out.last_hidden_state[:, 0, :]   # (batch, 768)
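+# Mean pooling over real tokens only; padded positions are masked out of the average
+mask = enc["attention_mask"].unsqueeze(-1).to(out.last_hidden_state.dtype)  # (batch, seq, 1)
+mean_emb = (out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)      # (batch, 768)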
+```
+### Attention implementation
+```python
+import torch
+from transformers import AutoModel
+# SDPA (default on PyTorch >= 2.0)
+model = AutoModel.from_pretrained("Taykhoom/DNABERT-S", trust_remote_code=True,
+                                   attn_implementation="sdpa")
+# Flash Attention 2 (requires the `flash-attn` package, a CUDA GPU, and fp16/bf16 weights)
+model = AutoModel.from_pretrained("Taykhoom/DNABERT-S", trust_remote_code=True,
+                                   attn_implementation="flash_attention_2",
+                                   torch_dtype=torch.bfloat16)
+```
+## Implementation Notes
+The original DNABERT-S codebase uses a Triton-based flash attention implementation
+(`flash_attn_triton.py`). This HF port uses
+[Taykhoom/MosaicBERT-updated](https://huggingface.co/Taykhoom/MosaicBERT-updated)
+which replaces it with the standard `flash-attn` package, and also adds
+`attn_implementation="sdpa"` support. These were not part of the original codebase.
+## Citation
+```bibtex
+@misc{zhou2024_dnaberts,
+  title   = {{DNABERT}-S: Learning Species-Aware {DNA} Embedding with Genome Foundation Models},
+  author  = {Zhou, Zhihan and Wu, Weimin and Ho, Harrison and Wang, Jiayi and
+             Shi, Lizhen and Davuluri, Ramana V and Wang, Zhong and Liu, Han},
+  year    = {2024},
+  eprint  = {2402.08777},
+  archivePrefix = {arXiv},
+  primaryClass  = {q-bio.GN}
+}
+```
+## Credits
+Original DNABERT-S model and code by Zhou et al.
+Source: [GitHub](https://github.com/MAGICS-LAB/DNABERT_S).
+The HF conversion code was authored primarily by [Claude Code](https://claude.ai/code)
+and reviewed manually by Taykhoom Dalal.
+## License
+Apache 2.0, following the original repository.
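
Note on the Architecture table in README.md: it lists ALiBi as the positional encoding with no hard length limit. The sketch below illustrates why, using the slope recipe from the ALiBi paper and a symmetric distance penalty as an encoder would use. The actual code in `Taykhoom/MosaicBERT-updated` is not shown in this commit, so the helper names `alibi_slopes` and `alibi_bias` are hypothetical and the details are an assumption.

```python
import math


def alibi_slopes(n_heads):
    """Per-head ALiBi slopes, following the recipe in the ALiBi paper."""

    def power_of_two_slopes(n):
        # Geometric sequence 2^(-8/n), 2^(-16/n), ..., 2^(-8)
        start = 2.0 ** (-8.0 / n)
        return [start ** (i + 1) for i in range(n)]

    if math.log2(n_heads).is_integer():
        return power_of_two_slopes(n_heads)
    # Not a power of two: use the closest lower power of two, then fill the
    # remaining heads with every other slope from the next power of two.
    closest = 2 ** math.floor(math.log2(n_heads))
    extra = power_of_two_slopes(2 * closest)[0::2][: n_heads - closest]
    return power_of_two_slopes(closest) + extra


def alibi_bias(slope, seq_len):
    """Symmetric bias -slope * |i - j|, added to the attention logits."""
    return [[-slope * abs(i - j) for j in range(seq_len)] for i in range(seq_len)]


slopes = alibi_slopes(12)        # 12 heads, matching num_attention_heads
bias = alibi_bias(slopes[0], 4)  # any seq_len works; no learned position table
```

Because the bias is computed from token distance rather than looked up in a learned table, it can be rebuilt for any sequence length at inference time.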

config.json ADDED

	@@ -0,0 +1,32 @@

+{
+  "_name_or_path": "Taykhoom/DNABERT-S",
+  "alibi_starting_size": 512,
+  "architectures": [
+    "BertModel"
+  ],
+  "attention_probs_dropout_prob": 0.0,
+  "auto_map": {
+    "AutoConfig": "Taykhoom/MosaicBERT-updated--configuration_bert.BertConfig",
+    "AutoModel": "Taykhoom/MosaicBERT-updated--bert_layers.BertModel",
+    "AutoModelForMaskedLM": "Taykhoom/MosaicBERT-updated--bert_layers.BertForMaskedLM",
+    "AutoModelForSequenceClassification": "Taykhoom/MosaicBERT-updated--bert_layers.BertForSequenceClassification"
+  },
+  "classifier_dropout": null,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 768,
+  "initializer_range": 0.02,
+  "intermediate_size": 3072,
+  "layer_norm_eps": 1e-12,
+  "max_position_embeddings": 512,
+  "model_max_length": 10000,
+  "model_type": "bert",
+  "num_attention_heads": 12,
+  "num_hidden_layers": 12,
+  "pad_token_id": 3,
+  "position_embedding_type": "absolute",
+  "transformers_version": "4.57.6",
+  "type_vocab_size": 2,
+  "use_cache": true,
+  "vocab_size": 4096
+}
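
Note on `auto_map`: each entry has the form `repo_id--module.ClassName`, which is how `transformers` locates model code hosted in a different repository when `trust_remote_code=True` is set. The sketch below only illustrates that string layout. `split_auto_map_entry` is a hypothetical helper, not a `transformers` function.

```python
def split_auto_map_entry(entry):
    """Split 'repo_id--module.ClassName' into (repo_id, module, class_name).

    repo_id is None when the entry has no 'repo_id--' prefix, i.e. the code
    lives in the same repository as the config.
    """
    repo_id, _, reference = entry.rpartition("--")
    module, _, class_name = reference.rpartition(".")
    return (repo_id or None), module, class_name


# Two entries copied from the config.json in this commit
AUTO_MAP = {
    "AutoConfig": "Taykhoom/MosaicBERT-updated--configuration_bert.BertConfig",
    "AutoModel": "Taykhoom/MosaicBERT-updated--bert_layers.BertModel",
}

for auto_class, entry in AUTO_MAP.items():
    repo_id, module, class_name = split_auto_map_entry(entry)
    print(f"{auto_class}: {class_name} from {module}.py in {repo_id}")
```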

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:7100412d0dc474f911d01c43003fc9bb1bcea9506ea04e969bba16015b2189b9
+size 468289360
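
Note on the pointer above: the three lines are a Git LFS pointer, not the weights themselves. A downloaded `model.safetensors` can be checked against it by comparing byte size and SHA-256 digest. The sketch below uses only the standard library. `parse_lfs_pointer` and `matches_pointer` are hypothetical helper names, and the local file path in the usage comment is an assumption.

```python
import hashlib
from pathlib import Path

# Pointer text copied from this commit
POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:7100412d0dc474f911d01c43003fc9bb1bcea9506ea04e969bba16015b2189b9
size 468289360
"""


def parse_lfs_pointer(text):
    """Parse 'key value' lines of a Git LFS pointer into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path, pointer, chunk_size=1 << 20):
    """Return True if the file at `path` has the pointer's size and digest."""
    path = Path(path)
    if path.stat().st_size != pointer["size"]:
        return False
    hasher = hashlib.new(pointer["algo"])
    with path.open("rb") as handle:
        # Hash in chunks so large weight files are never fully held in memory
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(POINTER)
# Usage, assuming the weights were downloaded to the working directory:
# matches_pointer("model.safetensors", pointer)
```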

special_tokens_map.json ADDED

	@@ -0,0 +1,7 @@

+{
+  "cls_token": "[CLS]",
+  "mask_token": "[MASK]",
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "unk_token": "[UNK]"
+}

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED

	@@ -0,0 +1,9 @@

+{
+  "cls_token": "[CLS]",
+  "mask_token": "[MASK]",
+  "model_max_length": 10000,
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "tokenizer_class": "PreTrainedTokenizerFast",
+  "unk_token": "[UNK]"
+}