---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- genomics
- nucleotide
- dna
- sequence-modeling
- biology
- bioinformatics
datasets:
- genome
pipeline_tag: feature-extraction
---

# NucEL: Single-Nucleotide ELECTRA-Style Genomic Pre-training for Efficient and Interpretable Representations

NucEL is a language model for nucleotide sequence analysis and genomic applications, pre-trained ELECTRA-style at single-nucleotide resolution. It produces embeddings for DNA sequences and can be fine-tuned for downstream genomic tasks.

## Model Details

- **Model Type**: Transformer-based sequence model
- **Domain**: Genomics and nucleotide sequences
- **Architecture**: Based on the ModernBERT architecture, optimized for nucleotide sequences

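The title describes the pre-training as ELECTRA-style. In ELECTRA, a discriminator is trained to tell original tokens from replaced ones. The toy sketch below shows how such per-position labels can be built for a DNA string. It is an illustration only: random substitution stands in for ELECTRA's learned generator, and the function name `corrupt`, the `replace_prob` value, and the four-letter alphabet are assumptions, not details of NucEL's actual training setup.

```python
import random

NUCLEOTIDES = "ACGT"  # assumed alphabet for this illustration


def corrupt(sequence, replace_prob=0.15, seed=0):
    """Randomly substitute nucleotides and label each position.

    Returns (corrupted_sequence, labels), where labels[i] is 1 if position i
    differs from the original sequence and 0 otherwise.
    """
    rng = random.Random(seed)
    corrupted = []
    for base in sequence:
        if rng.random() < replace_prob:
            # Random substitution stands in for a learned generator.
            corrupted.append(rng.choice(NUCLEOTIDES))
        else:
            corrupted.append(base)
    # A sampled base that happens to match the original counts as "original".
    labels = [int(new != old) for new, old in zip(corrupted, sequence)]
    return "".join(corrupted), labels


corrupted, labels = corrupt("ATCGATCGATCGATCG", replace_prob=0.3)
print(corrupted)
print(labels)
```

A discriminator trained on such pairs receives a learning signal at every position, not only at masked ones, which is the usual argument for the efficiency of ELECTRA-style objectives.
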
## Features

- Nucleotide-level tokenization and embedding
- Pre-trained on the human genome
- Optimized for biological sequence understanding

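Nucleotide-level tokenization means that every base becomes its own token, so token positions map one-to-one onto sequence positions. A minimal sketch of the idea follows. The vocabulary, the special tokens, and the `encode`/`decode` helpers are hypothetical and chosen for illustration; the IDs used by the NucEL tokenizer may differ.

```python
# Hypothetical vocabulary for illustration; NucEL's real token IDs may differ.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "A": 2, "C": 3, "G": 4, "T": 5, "N": 6}


def encode(sequence, max_length=None):
    """Map each nucleotide to one token ID, optionally truncating and padding."""
    ids = [VOCAB.get(base, VOCAB["[UNK]"]) for base in sequence.upper()]
    if max_length is not None:
        ids = ids[:max_length]
        ids += [VOCAB["[PAD]"]] * (max_length - len(ids))
    return ids


def decode(ids):
    """Invert encode(), dropping padding tokens."""
    inverse = {token_id: token for token, token_id in VOCAB.items()}
    return "".join(inverse[i] for i in ids if i != VOCAB["[PAD]"])


ids = encode("ATCGN", max_length=8)
print(ids)
print(decode(ids))
```

Because one token corresponds to one base, per-token outputs can be read directly as per-nucleotide signals, with no k-mer or subword boundaries to map back.
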
## Usage

### Basic Usage

```python
import torch
from transformers import AutoModel
# NucEL_Tokenizer comes from a local tokenizer module (tokenizer.py must be importable)
from tokenizer import NucEL_Tokenizer

# Load model and tokenizer
model = AutoModel.from_pretrained("FreakingPotato/NucEL", trust_remote_code=True)
tokenizer = NucEL_Tokenizer.from_pretrained("FreakingPotato/NucEL", trust_remote_code=True)
model.eval()  # disable dropout for inference

# Example DNA sequence
sequence = "ATCGATCGATCGATCG"

# Tokenize and encode
inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():  # gradients are not needed for feature extraction
    outputs = model(**inputs)

# Get per-token embeddings: (batch_size, sequence_length, hidden_size)
embeddings = outputs.last_hidden_state
print(f"Sequence embeddings shape: {embeddings.shape}")
```

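The `last_hidden_state` tensor holds one vector per token. Many downstream tasks need a single vector per sequence instead, and one common way to get it is masked mean pooling. The sketch below shows the technique on dummy tensors so that it runs without downloading the model. The helper name `mean_pool` is hypothetical, and the code assumes the tokenizer returns an `attention_mask` alongside `input_ids`, which is not confirmed here.

```python
import torch


def mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings over non-padding positions.

    last_hidden_state: (batch, seq_len, hidden) float tensor
    attention_mask:    (batch, seq_len) tensor of 1s (real) and 0s (padding)
    """
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1.0)  # avoid division by zero
    return summed / counts


# Dummy tensors standing in for model outputs:
# 2 sequences, 4 tokens each, hidden size 3.
hidden = torch.arange(24, dtype=torch.float32).reshape(2, 4, 3)
mask = torch.tensor([[1, 1, 1, 1], [1, 1, 0, 0]])
pooled = mean_pool(hidden, mask)
print(pooled.shape)  # torch.Size([2, 3])
```

With the real model, the call would look like `mean_pool(outputs.last_hidden_state, inputs["attention_mask"])`, provided the tokenizer output contains that key.
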
## Installation

```bash
pip install transformers torch
# Install any additional dependencies for your specific use case
```

## Requirements

- transformers >= 4.21.0
- torch >= 1.9.0
- Python >= 3.7

## Citation

If you use NucEL in your research, please cite:

```bibtex
@misc{nucel2024,
  title={NucEL: Single-Nucleotide ELECTRA-Style Genomic Pre-training for Efficient and Interpretable Representations},
  author={Ke Ding and Brian Parker and Jiayu Wen},
  year={2025},
  howpublished={\url{https://huggingface.co/FreakingPotato/NucEL}}
}
```

## License

This model is released under the Apache 2.0 License.