	---
	library_name: transformers
	tags:
	- DNA
	- genomics
	datasets:
	- omicseye/prok_heavy
	---

	## Introduction

	The seqLens models are a collection of genomic language models.
	They draw on two datasets: an extensive set of 19,551 reference genomes,
	including over 18,000 prokaryotic genomes (115B nucleotides),
	and a more balanced set of 1,354 genomes spanning 1,166 prokaryotic and 188 eukaryotic reference genomes (180B nucleotides).
	Through systematic evaluation of 52 DNA language models with varying architectures, hyperparameters, and classification heads,
	we developed seqLens, a family of models based on disentangled attention with relative positional encoding.
	In the accompanying paper, these models outperform state-of-the-art methods on phenotypic prediction tasks.
	The seqLens models provide a foundation for optimizing DNA language models and for advancing genome annotation across diverse biological contexts.

	- Developed by: omicseye
	- Model type: Encoder
	- Language(s) (NLP): DNA
	- Pretraining dataset: omicseye/prok_heavy
	- License: The model is made available under the CC-BY-NC 4.0 license. For inquiries about commercial licensing, please contact rahnavard@gwu.edu.

	<p align="center">
	<img width="100%" src="https://github.com/omicsEye/seqLens/blob/main/visualizations/plots/png/deberta_merged.png?raw=true">
	</p>

	### Model Sources

	- Repository: https://github.com/omicsEye/seqLens
	- Paper: https://doi.org/10.1101/2025.03.12.642848

	## How to Get Started with the Model

	```python
	from transformers import AutoTokenizer, AutoModelForMaskedLM

	# Download the tokenizer and the masked-language-model weights from the Hugging Face Hub.
	tokenizer = AutoTokenizer.from_pretrained("omicseye/seqLens_4096_512_89M")
	model = AutoModelForMaskedLM.from_pretrained("omicseye/seqLens_4096_512_89M")
	```
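
As a next step after loading the model, the sketch below shows one possible way to turn DNA strings into fixed-size embeddings. It is illustrative rather than the authors' documented workflow. The helper names `sliding_windows` and `embed` are hypothetical. The sketch assumes that the checkpoint loads through `AutoModel`, that the tokenizer defines a padding token, and that mean pooling over non-padding tokens is an acceptable summary. This card does not state the maximum input length, so the window size is left to the caller and `truncation=True` serves only as a safety net.

```python
from typing import Iterator, List

MODEL_ID = "omicseye/seqLens_4096_512_89M"


def sliding_windows(sequence: str, size: int, stride: int) -> Iterator[str]:
    """Yield windows of `size` nucleotides every `stride` positions.

    The final window may be shorter than `size`. Hypothetical helper,
    not part of the seqLens repository.
    """
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    if not sequence:
        return
    start = 0
    while True:
        yield sequence[start:start + size]
        if start + size >= len(sequence):
            break
        start += stride


def embed(sequences: List[str]):
    """Return one mean-pooled vector per input sequence (downloads the model on first use)."""
    # Imported here so the pure-Python helper above works without these packages.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModel.from_pretrained(MODEL_ID)  # encoder only, without the masked-LM head
    model.eval()

    # Assumption: the tokenizer has a padding token, so sequences can be batched.
    batch = tokenizer(sequences, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (batch, tokens, hidden)

    # Average over real tokens only, ignoring padding positions.
    mask = batch["attention_mask"].unsqueeze(-1).to(hidden.dtype)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
```

Under these assumptions, `embed(list(sliding_windows(genome, size, stride)))` would return one vector per window of a longer sequence `genome`. Suitable values of `size` and `stride` depend on how many nucleotides the tokenizer packs into each token.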

	## Citation

	```bibtex
	@article{seqLens,
	  author = {Baghbanzadeh, Mahdi and Mann, Brendan and Crandall, Keith A and Rahnavard, Ali},
	  title = {seqLens: optimizing language models for genomic predictions},
	  elocation-id = {2025.03.12.642848},
	  year = {2025},
	  doi = {10.1101/2025.03.12.642848},
	  publisher = {Cold Spring Harbor Laboratory},
	  URL = {https://www.biorxiv.org/content/early/2025/03/14/2025.03.12.642848},
	  eprint = {https://www.biorxiv.org/content/early/2025/03/14/2025.03.12.642848.full.pdf},
	  journal = {bioRxiv}
	}
	```