---
license: mit
base_model:
- zhihan1996/DNABERT-2-117M
tags:
- genomics
- bioinformatics
- DNA
- nucleotide-classification
- introns
- exons
- DNABERT2
---
# Nucleotide Classifier

A DNABERT-2 model fine-tuned for **classifying nucleotides** into **introns** and **exons**, trained on a large cross-species GenBank dataset covering 34,627 species.
---

## Architecture

- Base model: DNABERT2
- Approach: Nucleotide classification
- Framework: PyTorch + Hugging Face Transformers

---
## Usage

You can use this model through its own custom pipeline:

```python
from transformers import pipeline

pipe = pipeline(
    task="dnabert2-nucleotide-classification",
    model="GustavoHCruz/NuclDNABERT2",
    trust_remote_code=True,
)

out = pipe(
    {
        "before": "ATGATCCAGTTAAAAAATATATTC",
        "sequence": "C",
        "after": ""
    }
)
print(out)  # EXON

out = pipe(
    {
        "before": "GTAACATTAAAATAAAAAACAAAA",
        "sequence": "T",
        "after": "ATTATTTAAAGAAAAATATAATTA"
    }
)
print(out)  # INTRON
```
The maximum context of this model is the same as DNABERT2 (512 tokens), but training was limited to 24 nucleotides before and after the target nucleotide.

The pipeline enforces these limits automatically: the target sequence is restricted to a single nucleotide, the before and after contexts are truncated to at most 24 nucleotides each (even if longer sequences are provided), and the organism field is limited to 10 characters.
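The input limits above can be sketched as a small standalone helper. This is a hypothetical illustration, not part of the model's published code; in particular, whether the pipeline keeps the leading or trailing 24 nucleotides of each context window is an assumption here.

```python
def apply_input_limits(example: dict) -> dict:
    """Mimic the pipeline's input limits: a single target nucleotide,
    at most 24 context nucleotides on each side, and an organism name
    of at most 10 characters."""
    return {
        # Assumption: keep the 24 nucleotides closest to the target.
        "before": example.get("before", "")[-24:],
        # The target is restricted to a single nucleotide.
        "sequence": example.get("sequence", "")[:1],
        "after": example.get("after", "")[:24],
        "organism": example.get("organism", "")[:10],
    }

limited = apply_input_limits({
    "before": "A" * 40,
    "sequence": "CT",
    "after": "G" * 5,
    "organism": "Homo sapiens",
})
print(limited["sequence"])     # C
print(len(limited["before"]))  # 24
print(limited["organism"])     # Homo sapie
```

Sequences shorter than the limits pass through unchanged, so the helper is safe to apply to every example before inference.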
---

## Custom Usage Information

**Prompt format:** the model expects the following input:

```
TCGG...[SEP]T[SEP]AGCT...
```

The model predicts one of three class labels: 0 (Exon), 1 (Intron), or 2 (Unknown).
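For manual use without the custom pipeline, the prompt string can be assembled and the predicted class id decoded along these lines. This is a minimal sketch under the format stated above; the helper name and the label casing are illustrative assumptions.

```python
# Class-id-to-name mapping from the model card (casing is an assumption).
LABELS = {0: "EXON", 1: "INTRON", 2: "UNKNOWN"}

def build_prompt(before: str, target: str, after: str) -> str:
    """Join the context windows and the target nucleotide with [SEP],
    matching the documented input format."""
    return f"{before}[SEP]{target}[SEP]{after}"

prompt = build_prompt("TCGG", "T", "AGCT")
print(prompt)     # TCGG[SEP]T[SEP]AGCT
print(LABELS[1])  # INTRON
```

The resulting string would then be tokenized and passed to the classifier, whose argmax class id is looked up in `LABELS`.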
---

## Dataset

The model was trained on a processed version of GenBank sequences spanning multiple species, available as the [DNA Coding Regions Dataset](https://huggingface.co/datasets/GustavoHCruz/DNA_coding_regions).
---

## Publications

- **Full Paper**
  Achieved **2nd place** at the _Symposium on Knowledge Discovery, Mining and Learning (KDMiLe 2025)_, organized by the Brazilian Computer Society (SBC), held in Fortaleza, Ceará, Brazil.
  DOI: [https://doi.org/10.5753/kdmile.2025.247575](https://doi.org/10.5753/kdmile.2025.247575)
- **Short Paper**
  Presented at the _IEEE International Conference on Bioinformatics and BioEngineering (BIBE 2025)_, held in Athens, Greece.
  DOI: [https://doi.org/10.1109/BIBE66822.2025.00113](https://doi.org/10.1109/BIBE66822.2025.00113)
---

## Training

- Trained on 8× H100 GPUs.
---

## Metrics

**Average accuracy:** **0.7890**

| Class       | Precision | Recall | F1-Score |
| ----------- | --------- | ------ | -------- |
| **Intron**  | 0.6007    | 0.9113 | 0.6060   |
| **Exon**    | 0.8011    | 0.9011 | 0.8482   |
| **Unknown** | 0.8096    | 0.6697 | 0.7331   |
### Notes

- Metrics were computed on a fully held-out test set.
---

## GitHub Repository

The full code for **data processing, model training, and inference** is available on GitHub:
[CodingDNATransformers](https://github.com/GustavoHCruz/CodingDNATransformers)

You can find scripts for:

- Preprocessing GenBank sequences
- Fine-tuning models
- Evaluating and using the trained models