---
language:
- dna
library_name: transformers
tags:
- DNA
- BERT
- language-model
- genomics
license: apache-2.0
---
# DNABERT-S
Weights and tokenizer for [DNABERT-S](https://arxiv.org/abs/2402.08777)
(Zhou et al., arXiv 2024), loaded with the shared MosaicBERT implementation
from [Taykhoom/MosaicBERT-updated](https://huggingface.co/Taykhoom/MosaicBERT-updated).

DNABERT-S is a species-aware DNA embedding model fine-tuned from DNABERT-2 using
curriculum contrastive learning. It generates embeddings that naturally cluster and
segregate genomes from different species, enabling species identification,
metagenomics binning, and evolutionary analysis.

**This repo contains only weights and tokenizer files.** The model code is loaded
automatically from `Taykhoom/MosaicBERT-updated` via `trust_remote_code=True`.
## Architecture
| Parameter | Value |
|---|---|
| Layers | 12 |
| Attention heads | 12 |
| Embedding dimension | 768 |
| Intermediate size | 3072 |
| Vocabulary size | 4096 (BPE, identical to DNABERT-2) |
| Positional encoding | ALiBi (no hard length limit) |
| Max sequence length | ~10000 nt (practical; ALiBi resizes dynamically) |
| Parameters | ~110M (backbone only, no MLM head) |
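
ALiBi encodes position as a distance-dependent penalty added to the attention logits rather than as learned position embeddings, which is why the table lists no hard length limit. The sketch below illustrates the idea in plain Python. It assumes the slope recipe from the ALiBi reference code and a symmetric (bidirectional) distance penalty; `alibi_slopes` and `alibi_bias` are hypothetical names, and the actual code in `Taykhoom/MosaicBERT-updated` may differ in detail.

```python
import math

def alibi_slopes(n_heads):
    """Per-head slopes following the recipe in the ALiBi reference code."""
    def power_of_two_slopes(n):
        start = 2.0 ** (-8.0 / n)  # geometric sequence: start, start^2, ...
        return [start ** (i + 1) for i in range(n)]

    if math.log2(n_heads).is_integer():
        return power_of_two_slopes(n_heads)
    # Non-power-of-two head counts (e.g. 12): take the slopes for the
    # closest lower power of two, then fill in from the next power of two.
    closest = 2 ** math.floor(math.log2(n_heads))
    extra = alibi_slopes(2 * closest)[0::2][: n_heads - closest]
    return power_of_two_slopes(closest) + extra

def alibi_bias(n_heads, seq_len):
    """bias[h][i][j] = -slope_h * |i - j|, to be added to attention logits."""
    return [
        [[-slope * abs(i - j) for j in range(seq_len)] for i in range(seq_len)]
        for slope in alibi_slopes(n_heads)
    ]

bias = alibi_bias(n_heads=12, seq_len=4)
print(len(bias), len(bias[0]), len(bias[0][0]))  # 12 4 4
```

Because the bias depends only on `|i - j|`, it can be recomputed for any sequence length at inference time.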
### Tokenization
Uses Byte Pair Encoding (BPE) tokenization via `PreTrainedTokenizerFast`,
with the same vocabulary as DNABERT-2. No k-mer pre-processing is required.
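
For contrast, the k-mer DNABERT models listed under Related Models expect the input to be split into overlapping k-mers before tokenization, whereas DNABERT-S takes the raw nucleotide string. The helper below is a hypothetical sketch of that k-mer step (stride 1, space-separated), shown only to illustrate what is not needed here.

```python
def seq_to_kmers(sequence, k):
    """Split a DNA string into overlapping k-mers (stride 1), space-separated."""
    return " ".join(sequence[i:i + k] for i in range(len(sequence) - k + 1))

raw = "ACGTAGCA"
print(seq_to_kmers(raw, k=3))  # ACG CGT GTA TAG AGC GCA
# DNABERT-S input: pass `raw` to the tokenizer unchanged.
```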
## Pretraining
- **Objective:** Curriculum contrastive learning (C²LR) on same-species sequence pairs, with Manifold Instance Mixup (MI-Mix)
- **Initialization:** Fine-tuned from [zhihan1996/DNABERT-2-117M](https://huggingface.co/zhihan1996/DNABERT-2-117M)
- **Source checkpoint:** `pytorch_model.bin` from [zhihan1996/DNABERT-S](https://huggingface.co/zhihan1996/DNABERT-S)
## Parity Verification
Hidden-state representations were verified to be identical (max abs diff = 0.00) to those
of the original implementation at all 13 representation levels (embedding output plus
12 transformer layers). The SDPA path was verified to within max abs diff < 1e-4.
Verification was run on GPU with PyTorch 2.7 / CUDA 12.9.
## Related Models
See the full [DNABERT collection](https://huggingface.co/collections/Taykhoom/dnabert-6a20958f8ce004ea4e985e7b).
| Model | Architecture | Notes |
|---|---|---|
| [DNABERT-3mer](https://huggingface.co/Taykhoom/DNABERT-3mer) | BERT + k-mer | k=3 |
| [DNABERT-4mer](https://huggingface.co/Taykhoom/DNABERT-4mer) | BERT + k-mer | k=4 |
| [DNABERT-5mer](https://huggingface.co/Taykhoom/DNABERT-5mer) | BERT + k-mer | k=5 |
| [DNABERT-6mer](https://huggingface.co/Taykhoom/DNABERT-6mer) | BERT + k-mer | k=6 |
| [DNABERT-2](https://huggingface.co/Taykhoom/DNABERT2) | MosaicBERT + BPE + ALiBi | Pre-trained |
| **[DNABERT-S](https://huggingface.co/Taykhoom/DNABERT-S)** | **MosaicBERT + BPE + ALiBi** | **This model** |
## Usage
### Embedding generation
```python
import torch
from transformers import AutoTokenizer, AutoModel

# trust_remote_code=True pulls the model code from Taykhoom/MosaicBERT-updated
tokenizer = AutoTokenizer.from_pretrained("Taykhoom/DNABERT-S", trust_remote_code=True)
model = AutoModel.from_pretrained("Taykhoom/DNABERT-S", trust_remote_code=True)
model.eval()

sequences = ["ACGTAGCATCGGATCTATCTATCGACACTTGG", "ATCGATCGATCGATCG"]
enc = tokenizer(sequences, return_tensors="pt", padding=True)

with torch.no_grad():
    out = model(**enc)

cls_emb = out.last_hidden_state[:, 0, :]  # (batch, 768)

# Mean pooling over real tokens only; padded positions are masked out
mask = enc["attention_mask"].unsqueeze(-1).to(out.last_hidden_state.dtype)  # (batch, seq, 1)
mean_emb = (out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)  # (batch, 768)
```
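
Since the embeddings are intended to cluster by species, one simple way to compare them is cosine similarity. The following is a minimal pure-Python sketch, assuming the embeddings have been converted to lists (for example with `mean_emb.tolist()`). `cosine_similarity` and `nearest_reference` are hypothetical helpers, the labels and vectors are toy values, and nearest-neighbor matching is one straightforward option rather than the evaluation method used in the paper.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_reference(query, references):
    """Return the label of the reference embedding most similar to `query`."""
    return max(references, key=lambda label: cosine_similarity(query, references[label]))

# Toy 3-dimensional vectors standing in for 768-dimensional embeddings
references = {"species_a": [1.0, 0.0, 0.0], "species_b": [0.0, 1.0, 0.0]}
print(nearest_reference([0.9, 0.1, 0.0], references))  # species_a
```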
### Attention implementation
```python
import torch
from transformers import AutoModel

# SDPA (default on PyTorch >= 2.0)
model = AutoModel.from_pretrained("Taykhoom/DNABERT-S", trust_remote_code=True,
                                  attn_implementation="sdpa")

# Flash Attention 2 (requires the `flash-attn` package)
model = AutoModel.from_pretrained("Taykhoom/DNABERT-S", trust_remote_code=True,
                                  attn_implementation="flash_attention_2",
                                  torch_dtype=torch.bfloat16)
```
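
Flash Attention 2 depends on the `flash-attn` package and a GPU, so it can be convenient to pick the backend at runtime. The helper below is a hypothetical sketch and not part of this repo: it only returns the string to pass as `attn_implementation`, falling back to SDPA when either requirement is missing.

```python
import importlib.util

def pick_attn_implementation(has_gpu, has_flash_attn=None):
    """Return "flash_attention_2" when usable, otherwise "sdpa"."""
    if has_flash_attn is None:
        # Detect the flash-attn package without importing it
        has_flash_attn = importlib.util.find_spec("flash_attn") is not None
    if has_gpu and has_flash_attn:
        return "flash_attention_2"
    return "sdpa"

print(pick_attn_implementation(has_gpu=False))  # sdpa
```

In practice `has_gpu` could come from `torch.cuda.is_available()`, and the Flash Attention 2 path would additionally be loaded in `torch.bfloat16` as in the example above.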
## Implementation Notes
The original DNABERT-S codebase uses a Triton-based flash attention implementation
(`flash_attn_triton.py`). This HF port uses
[Taykhoom/MosaicBERT-updated](https://huggingface.co/Taykhoom/MosaicBERT-updated),
which replaces it with the standard `flash-attn` package and adds
`attn_implementation="sdpa"` support. Neither change is part of the original codebase.
## Citation
```bibtex
@misc{zhou2024_dnaberts,
title = {{DNABERT}-S: Learning Species-Aware {DNA} Embedding with Genome Foundation Models},
author = {Zhou, Zhihan and Wu, Weimin and Ho, Harrison and Wang, Jiayi and
Shi, Lizhen and Davuluri, Ramana V and Wang, Zhong and Liu, Han},
year = {2024},
eprint = {2402.08777},
archivePrefix = {arXiv},
primaryClass = {q-bio.GN}
}
```
## Credits
Original DNABERT-S model and code by Zhou et al.
Source: [GitHub](https://github.com/MAGICS-LAB/DNABERT_S).
The HF conversion code was authored primarily by [Claude Code](https://claude.ai/code)
and reviewed manually by Taykhoom Dalal.
## License
Apache 2.0, following the original repository.