---
library_name: transformers
tags:
- DNA
- genomics
datasets:
- omicseye/prok_heavy
---

## Introduction

The seqLens models are a collection of genomic language models.
They draw on two datasets: an extensive dataset of 19,551 reference genomes,
including over 18,000 prokaryotic genomes (115B nucleotides),
and a more balanced dataset of 1,354 genomes spanning 1,166 prokaryotic and 188 eukaryotic reference genomes (180B nucleotides).
Through systematic evaluation of 52 DNA language models with varying architectures, hyperparameters, and classification heads,
we developed seqLens, a family of models based on disentangled attention with relative positional encoding.
These models outperform state-of-the-art methods in phenotypic predictions.
They provide a robust foundation for optimizing DNA language models and advancing genome annotations across diverse biological contexts.

- **Developed by:** omicseye
- **Model type:** Encoder
- **Language(s) (NLP):** DNA
- **Pretraining dataset:** omicseye/prok_heavy
- **License:** The model is made available under the CC-BY-NC 4.0 License. For inquiries about commercial licensing, please contact rahnavard@gwu.edu.

<p align="center">
  <img width="100%" src="https://github.com/omicsEye/seqLens/blob/main/visualizations/plots/png/deberta_merged.png?raw=true">
</p>

### Model Sources

<!-- Provide the basic links for the model. -->

- **Repository:** https://github.com/omicsEye/seqLens
- **Paper:** https://doi.org/10.1101/2025.03.12.642848

## How to Get Started with the Model

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load the tokenizer and the masked-language-model checkpoint from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("omicseye/seqLens_4096_512_89M")
model = AutoModelForMaskedLM.from_pretrained("omicseye/seqLens_4096_512_89M")
```
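
Once the checkpoint loads, a common next step is to turn DNA sequences into fixed-size embeddings for downstream prediction. The sketch below is one possible way to do that and is not taken from the seqLens repository: `window_sequence` and `embed_windows` are hypothetical helper names, mean pooling is a generic choice rather than the authors' method, and the 512-token limit is an assumption read off the model name.

```python
from typing import List


def window_sequence(seq: str, window: int, stride: int) -> List[str]:
    """Split a DNA string into fixed-size, possibly overlapping windows."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    seq = seq.upper()
    if len(seq) <= window:
        return [seq]
    starts = list(range(0, len(seq) - window + 1, stride))
    # Add a final window so the tail of the sequence is always covered.
    if starts[-1] + window < len(seq):
        starts.append(len(seq) - window)
    return [seq[s:s + window] for s in starts]


def embed_windows(windows: List[str],
                  model_id: str = "omicseye/seqLens_4096_512_89M",
                  max_length: int = 512):  # assumption: context size from the model name
    """Return one mean-pooled embedding per window, shape [n_windows, hidden]."""
    # Heavy dependencies are imported lazily so window_sequence works without them.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)  # encoder only, no masked-LM head
    model.eval()

    batch = tokenizer(windows, return_tensors="pt", padding=True,
                      truncation=True, max_length=max_length)
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # [batch, tokens, hidden]

    # Average over real tokens only, ignoring padding positions.
    mask = batch["attention_mask"].unsqueeze(-1).to(hidden.dtype)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
```

For example, `embed_windows(window_sequence(genome, window=1000, stride=500))` would yield one vector per 1,000-nucleotide window. The window and stride values are illustrative only: window lengths are counted in nucleotides, while the model's limit is counted in tokens, so suitable values depend on how the tokenizer segments the sequence.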

## Citation

```bibtex
@article{seqLens,
  author = {Baghbanzadeh, Mahdi and Mann, Brendan and Crandall, Keith A and Rahnavard, Ali},
  title = {seqLens: optimizing language models for genomic predictions},
  elocation-id = {2025.03.12.642848},
  year = {2025},
  doi = {10.1101/2025.03.12.642848},
  publisher = {Cold Spring Harbor Laboratory},
  URL = {https://www.biorxiv.org/content/early/2025/03/14/2025.03.12.642848},
  eprint = {https://www.biorxiv.org/content/early/2025/03/14/2025.03.12.642848.full.pdf},
  journal = {bioRxiv}
}
```