---
library_name: transformers
license: mit
pipeline_tag: sentence-similarity
---
# DNA2Vec: Transformer-Based DNA Sequence Embedding
This repository provides an implementation of `dna2vec`, a transformer-based model designed for DNA sequence embeddings. It includes both the Hugging Face (`hf_model`) and a locally trained model (`local_model`). The model can be used for DNA sequence alignment, classification, and other genomic applications.
## Model Overview
DNA sequence alignment is an essential genomic task that involves mapping short DNA reads to the most probable locations within a reference genome. Traditional methods rely on genome indexing and efficient search algorithms, while recent advances leverage transformer-based models to encode DNA sequences into vector representations.
The `dna2vec` framework introduces a **Reference-Free DNA Embedding (RDE) Transformer model**, which encodes DNA sequences into a shared vector space, allowing for efficient similarity search and sequence alignment.
### Key Features:
- **Transformer-based architecture** trained on genomic data.
- **Reference-free embeddings** that enable efficient sequence retrieval.
- **Contrastive loss for self-supervised training**, ensuring robust sequence similarity learning.
- **Support for Hugging Face and custom-trained local models**.
- **Efficient search through a DNA vector store**, reducing genome-wide alignment to a local search.
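The local-search idea behind the vector store can be illustrated with plain NumPy: embed reads and reference fragments into the same space, then rank fragments by cosine similarity to a query read. This is only a toy sketch with random vectors standing in for real `dna2vec` embeddings; the function name `cosine_top_k` is illustrative, not part of the repository.

```python
import numpy as np

def cosine_top_k(query, store, k=3):
    """Return indices of the k store vectors most similar to the query."""
    # Normalize rows so that dot products equal cosine similarities.
    q = query / np.linalg.norm(query)
    s = store / np.linalg.norm(store, axis=1, keepdims=True)
    sims = s @ q
    return np.argsort(-sims)[:k]

# Toy "vector store" of 5 random embeddings of dimension 8.
rng = np.random.default_rng(0)
store = rng.normal(size=(5, 8))
query = store[2] + 0.01 * rng.normal(size=8)  # near-duplicate of entry 2
best = cosine_top_k(query, store, k=1)
print(best)  # entry 2 ranks first
```

In practice the store would hold embeddings of overlapping reference-genome fragments, so retrieving the top-k fragments reduces genome-wide alignment to a local search.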
## Model Details
### Model Architecture
The transformer model consists of:
- **12 attention heads**
- **6 encoder layers**
- **Embedding dimension:** 1020
- **Vocabulary size:** 10,000
- **Cosine similarity-based sequence matching**
- **Dropout:** 0.1
- **Learning-rate schedule:** cosine annealing
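For orientation, the hyperparameters above can be assembled into a standard PyTorch encoder. This is a minimal sketch using `nn.TransformerEncoder`; the class name `ToyRDEEncoder` and the module layout are illustrative assumptions, not the repository's actual architecture (which ships via `trust_remote_code`).

```python
import torch
import torch.nn as nn

EMB_DIM, N_HEADS, N_LAYERS, VOCAB, DROPOUT = 1020, 12, 6, 10_000, 0.1

class ToyRDEEncoder(nn.Module):
    """Illustrative encoder built from the hyperparameters listed above."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMB_DIM)
        layer = nn.TransformerEncoderLayer(
            d_model=EMB_DIM, nhead=N_HEADS, dropout=DROPOUT, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=N_LAYERS)

    def forward(self, token_ids):
        return self.encoder(self.embed(token_ids))

model = ToyRDEEncoder().eval()
with torch.no_grad():
    out = model(torch.randint(0, VOCAB, (1, 16)))
print(out.shape)  # torch.Size([1, 16, 1020])
```

Note that 1020 is divisible by the 12 attention heads (85 dimensions per head), as required by multi-head attention.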
## Installation
To use the model, install the required dependencies:
```bash
pip install transformers torch
```
## Usage
### Load Hugging Face Model
```python
from transformers import AutoModel, AutoTokenizer
import torch
import torch.nn as nn
def load_hf_model():
    hf_model = AutoModel.from_pretrained("roychowdhuryresearch/dna2vec", trust_remote_code=True)
    hf_tokenizer = AutoTokenizer.from_pretrained("roychowdhuryresearch/dna2vec", trust_remote_code=True)

    class AveragePooler(nn.Module):
        """Mean-pool the last hidden states, ignoring padded positions."""
        def forward(self, last_hidden, attention_mask):
            return (last_hidden * attention_mask.unsqueeze(-1)).sum(1) / attention_mask.sum(-1).unsqueeze(-1)

    hf_model.pooler = AveragePooler()
    return hf_model, hf_tokenizer, hf_model.pooler
```
### Using the Model
Once the model is loaded, you can use it to obtain embeddings for DNA sequences:
```python
# Load once and reuse; downloading the model on every call would be wasteful.
model, tokenizer, pooler = load_hf_model()

def get_embedding(dna_sequence):
    tokenized_input = tokenizer(dna_sequence, return_tensors="pt", padding=True)
    with torch.no_grad():
        output = model(**tokenized_input)
    embedding = pooler(output.last_hidden_state, tokenized_input.attention_mask)
    return embedding.numpy()

# Example usage
dna_seq = "ATGCGTACGTAGCTAGCTAGC"
embedding = get_embedding(dna_seq)
print("Embedding shape:", embedding.shape)
```
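Since the model card is tagged for sentence similarity, a typical downstream step is to compare two sequence embeddings by cosine similarity. The sketch below uses placeholder tensors standing in for real `get_embedding(...)` outputs, so it runs without downloading the model.

```python
import torch
import torch.nn.functional as F

# Placeholder embeddings standing in for get_embedding(...) outputs.
emb_a = torch.tensor([[1.0, 0.0, 1.0]])
emb_b = torch.tensor([[1.0, 0.1, 0.9]])

sim = F.cosine_similarity(emb_a, emb_b).item()
print(f"cosine similarity: {sim:.3f}")  # close to 1.0 for similar sequences
```

Embeddings of reads drawn from the same genomic region should score close to 1.0, while unrelated sequences score lower.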
## Training Details
### Dataset
The training data consists of DNA sequences sampled from chromosomes across several species and covers **approximately 2% of the human genome**, which proves sufficient for generalization to unseen sequences. Reads are generated with the **ART MiSeq** simulator, with varying insertion and deletion rates.
### Training Procedure
- **Self-Supervised Learning:** Contrastive loss-based training.
- **Dynamic-Length Sequences:** DNA fragments of length 800-2,000 bp, with read lengths sampled from [150, 500].
- **Noise Augmentation:** 1-5% random base substitutions in 40% of training reads.
- **Batch Size:** 16 with gradient accumulation.
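The read sampling and noise augmentation described above can be sketched with the standard library alone. This is a simplified stand-in for the ART MiSeq simulation, not the actual training pipeline; the function `sample_noisy_read` and its parameters are illustrative.

```python
import random

def sample_noisy_read(fragment, read_len, sub_rate, rng):
    """Sample a read from a fragment and apply random base substitutions."""
    start = rng.randrange(len(fragment) - read_len + 1)
    read = list(fragment[start:start + read_len])
    for i in range(read_len):
        if rng.random() < sub_rate:  # substitute with a different base
            read[i] = rng.choice([b for b in "ACGT" if b != read[i]])
    return "".join(read)

rng = random.Random(0)
fragment = "".join(rng.choice("ACGT") for _ in range(800))  # 800 bp fragment
read = sample_noisy_read(fragment, read_len=150, sub_rate=0.03, rng=rng)
print(len(read))  # 150
```

In training, roughly 40% of reads would receive such substitutions at a 1-5% rate, forcing the encoder to map noisy reads near their clean source fragments.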
## Evaluation
The model was evaluated against a traditional aligner (Bowtie-2) and transformer-based baselines (DNABERT-2, HyenaDNA). Evaluation metrics include:
- **Alignment recall:** >99% for high-quality reads.
- **Cross-species transfer:** successfully aligns sequences from other species, including *Thermus aquaticus* and *Rattus norvegicus*.
## Citation
If you use this model, please cite:
```bibtex
@article{10.1093/bioinformatics/btaf041,
  author  = {Holur, Pavan and Enevoldsen, K C and Rajesh, Shreyas and Mboning, Lajoyce and Georgiou, Thalia and Bouchard, Louis-S and Pellegrini, Matteo and Roychowdhury, Vwani},
  title   = {Embed-Search-Align: DNA Sequence Alignment using Transformer models},
  journal = {Bioinformatics},
  pages   = {btaf041},
  year    = {2025},
  month   = {02},
  issn    = {1367-4811},
  doi     = {10.1093/bioinformatics/btaf041},
  url     = {https://doi.org/10.1093/bioinformatics/btaf041},
  eprint  = {https://academic.oup.com/bioinformatics/advance-article-pdf/doi/10.1093/bioinformatics/btaf041/61778456/btaf041.pdf},
}
```
For more details, check the [full paper](https://arxiv.org/abs/2309.11087v6).