|
|
--- |
|
|
library_name: transformers |
|
|
license: mit |
|
|
pipeline_tag: sentence-similarity |
|
|
--- |
|
|
|
|
|
# DNA2Vec: Transformer-Based DNA Sequence Embedding |
|
|
|
|
|
This repository provides an implementation of `dna2vec`, a transformer-based model for DNA sequence embeddings. It includes both a Hugging Face model (`hf_model`) and a locally trained model (`local_model`). The model can be used for DNA sequence alignment, classification, and other genomic applications.
|
|
|
|
|
## Model Overview |
|
|
|
|
|
DNA sequence alignment is an essential genomic task that involves mapping short DNA reads to the most probable locations within a reference genome. Traditional methods rely on genome indexing and efficient search algorithms, while recent advances leverage transformer-based models to encode DNA sequences into vector representations. |
|
|
|
|
|
The `dna2vec` framework introduces a **Reference-Free DNA Embedding (RDE) Transformer model**, which encodes DNA sequences into a shared vector space, allowing for efficient similarity search and sequence alignment. |
|
|
|
|
|
### Key Features
|
|
- **Transformer-based architecture** trained on genomic data. |
|
|
- **Reference-free embeddings** that enable efficient sequence retrieval. |
|
|
- **Contrastive loss for self-supervised training**, ensuring robust sequence similarity learning. |
|
|
- **Support for Hugging Face and custom-trained local models**. |
|
|
- **Efficient search through a DNA vector store**, reducing genome-wide alignment to a local search. |
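To make the last point concrete, the sketch below shows the retrieval step in isolation: reference fragments and a read are assumed to already live in the same embedding space (random vectors stand in for real `dna2vec` embeddings here), and the nearest fragments are found by cosine similarity with `torch.topk`. This is purely illustrative and is not the repository's own indexing code.

```python
import torch
import torch.nn.functional as F

# Placeholder embeddings: in practice these come from dna2vec
# (reference-genome fragments and a sequencing read, dim = 1020).
num_fragments, dim = 1000, 1020
fragment_store = F.normalize(torch.randn(num_fragments, dim), dim=-1)
read_embedding = F.normalize(torch.randn(1, dim), dim=-1)

# Cosine similarity against every stored fragment, then keep the top-k hits;
# genome-wide alignment reduces to a local search around these fragments.
similarities = read_embedding @ fragment_store.T   # shape: (1, num_fragments)
top_scores, top_indices = similarities.topk(k=5, dim=-1)
print(top_indices.tolist(), top_scores.tolist())
```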
|
|
|
|
|
## Model Details |
|
|
|
|
|
### Model Architecture |
|
|
The transformer model consists of: |
|
|
- **12 attention heads** |
|
|
- **6 encoder layers** |
|
|
- **Embedding dimension:** 1020 |
|
|
- **Vocabulary size:** 10,000 |
|
|
- **Cosine similarity-based sequence matching** |
|
|
- **Dropout:** 0.1 |
|
|
- **Training:** Cosine annealing learning-rate scheduling
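As a rough, self-contained illustration of these hyperparameters, a comparable encoder and training schedule could be declared in PyTorch as below. The actual model class ships with the checkpoint and is loaded via `trust_remote_code`, so this is a stand-in, not the repository's definition; the learning rate and `T_max` are placeholders.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in mirroring the hyperparameters listed above.
embedding_dim = 1020          # 12 heads x 85 dims per head
vocab_size = 10_000

token_embedding = nn.Embedding(vocab_size, embedding_dim)
encoder_layer = nn.TransformerEncoderLayer(
    d_model=embedding_dim, nhead=12, dropout=0.1, batch_first=True
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

# Cosine-annealed learning rate, as noted above; lr and T_max are placeholders.
params = list(token_embedding.parameters()) + list(encoder.parameters())
optimizer = torch.optim.AdamW(params, lr=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1_000)
```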
|
|
|
|
|
## Installation |
|
|
|
|
|
To use the model, install the required dependencies: |
|
|
|
|
|
```bash |
|
|
pip install transformers torch |
|
|
``` |
|
|
|
|
|
## Usage |
|
|
|
|
|
### Load Hugging Face Model |
|
|
|
|
|
```python |
|
|
from transformers import AutoModel, AutoTokenizer |
|
|
import torch |
|
|
import torch.nn as nn |
|
|
|
|
|
def load_hf_model():
    # Load the pretrained dna2vec encoder and tokenizer from the Hub.
    # trust_remote_code is required because the model class ships with the checkpoint.
    hf_model = AutoModel.from_pretrained("roychowdhuryresearch/dna2vec", trust_remote_code=True)
    hf_tokenizer = AutoTokenizer.from_pretrained("roychowdhuryresearch/dna2vec", trust_remote_code=True)

    class AveragePooler(nn.Module):
        """Mean-pools token embeddings, ignoring padding via the attention mask."""
        def forward(self, last_hidden, attention_mask):
            return (last_hidden * attention_mask.unsqueeze(-1)).sum(1) / attention_mask.sum(-1).unsqueeze(-1)

    hf_model.pooler = AveragePooler()
    return hf_model, hf_tokenizer, hf_model.pooler
|
|
``` |
|
|
### Using the Model
|
|
Once the model is loaded, you can use it to obtain embeddings for DNA sequences: |
|
|
|
|
|
```python |
|
|
def get_embedding(dna_sequence):
    # Note: for repeated calls, load the model once outside this function.
    model, tokenizer, pooler = load_hf_model()
    tokenized_input = tokenizer(dna_sequence, return_tensors="pt", padding=True)
    with torch.no_grad():
        output = model(**tokenized_input)
        # Average-pool the token embeddings into a single sequence embedding.
        embedding = pooler(output.last_hidden_state, tokenized_input.attention_mask)
    return embedding.numpy()
|
|
|
|
|
# Example usage |
|
|
dna_seq = "ATGCGTACGTAGCTAGCTAGC" |
|
|
embedding = get_embedding(dna_seq) |
|
|
print("Embedding shape:", embedding.shape) |
|
|
``` |
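Since the model targets the `sentence-similarity` task, embeddings are typically compared with cosine similarity. Below is a minimal example building on the `get_embedding` helper above; note that the helper reloads the model on every call, so load the model once for repeated use.

```python
import torch
import torch.nn.functional as F

# Compare two example reads via cosine similarity of their embeddings.
seq_a = "ATGCGTACGTAGCTAGCTAGC"
seq_b = "ATGCGTACGTAGCTAGCTTGC"

emb_a = torch.from_numpy(get_embedding(seq_a))
emb_b = torch.from_numpy(get_embedding(seq_b))

similarity = F.cosine_similarity(emb_a, emb_b).item()
print(f"Cosine similarity: {similarity:.4f}")
```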
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Dataset |
|
|
The training data consists of DNA sequences sampled from various chromosomes across species. The dataset covers **approximately 2% of the human genome**, which is intended to promote generalization across different sequence contexts. Reads are generated with the **ART MiSeq** simulator, with varying insertion and deletion rates.
|
|
|
|
|
### Training Procedure |
|
|
- **Self-Supervised Learning:** Contrastive loss-based training. |
|
|
- **Dynamic-Length Sequences:** DNA fragments of length 800-2000 bp, with read lengths sampled from [150, 500].
|
|
- **Noise Augmentation:** 1-5% random base substitutions applied to 40% of training reads (a small sketch follows this list).
|
|
- **Batch Size:** 16 with gradient accumulation. |
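The substitution-noise augmentation described above can be mimicked with a few lines of standard-library Python. This is an illustrative sketch of the stated scheme (1-5% random base substitutions applied to roughly 40% of reads), not the training pipeline's own code.

```python
import random

def add_substitution_noise(read: str, min_rate: float = 0.01,
                           max_rate: float = 0.05, apply_prob: float = 0.4) -> str:
    """Apply 1-5% random base substitutions to roughly 40% of reads."""
    if random.random() > apply_prob:
        return read  # the remaining ~60% of reads are left unchanged
    rate = random.uniform(min_rate, max_rate)
    bases = list(read)
    for i, base in enumerate(bases):
        if random.random() < rate:
            bases[i] = random.choice([b for b in "ACGT" if b != base])
    return "".join(bases)

print(add_substitution_noise("ATGCGTACGTAGCTAGCTAGC"))
```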
|
|
|
|
|
## Evaluation |
|
|
|
|
|
The model was evaluated against a traditional aligner (Bowtie-2) and transformer-based baselines (DNABERT-2, HyenaDNA). The evaluation metrics include:
|
|
- **Alignment Recall:** >99% for high-quality reads. |
|
|
- **Cross-Species Transfer:** Successfully aligns sequences from different species, including *Thermus aquaticus* and *Rattus norvegicus*.
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use this model, please cite: |
|
|
|
|
|
```bibtex |
|
|
@article{10.1093/bioinformatics/btaf041, |
|
|
author = {Holur, Pavan and Enevoldsen, K C and Rajesh, Shreyas and Mboning, Lajoyce and Georgiou, Thalia and Bouchard, Louis-S and Pellegrini, Matteo and Roychowdhury, Vwani}, |
|
|
title = {Embed-Search-Align: DNA Sequence Alignment using Transformer models}, |
|
|
journal = {Bioinformatics}, |
|
|
pages = {btaf041}, |
|
|
year = {2025}, |
|
|
month = {02}, |
|
|
issn = {1367-4811}, |
|
|
doi = {10.1093/bioinformatics/btaf041}, |
|
|
url = {https://doi.org/10.1093/bioinformatics/btaf041}, |
|
|
eprint = {https://academic.oup.com/bioinformatics/advance-article-pdf/doi/10.1093/bioinformatics/btaf041/61778456/btaf041.pdf}, |
|
|
} |
|
|
``` |
|
|
|
|
|
For more details, check the [full paper](https://arxiv.org/abs/2309.11087v6). |
|
|
|