---
license: mit
base_model:
- zhihan1996/DNABERT-2-117M
tags:
- genomics
- bioinformatics
- DNA
- nucleotide-classification
- introns
- exons
- DNABERT2
---
# Nucleotide Classifier
A DNABERT-2 model fine-tuned for **classifying nucleotides** as **introns** or **exons**, trained on a large cross-species GenBank dataset covering 34,627 species.
---
## Architecture
- Base model: DNABERT2
- Approach: Nucleotide classification
- Framework: PyTorch + Hugging Face Transformers
---
## Usage
You can use this model through its own custom pipeline:
```python
from transformers import pipeline

pipe = pipeline(
    task="dnabert2-nucleotide-classification",
    model="GustavoHCruz/NuclDNABERT2",
    trust_remote_code=True,
)

out = pipe(
    {
        "before": "ATGATCCAGTTAAAAAATATATTC",
        "sequence": "C",
        "after": "",
    }
)
print(out)  # EXON

out = pipe(
    {
        "before": "GTAACATTAAAATAAAAAACAAAA",
        "sequence": "T",
        "after": "ATTATTTAAAGAAAAATATAATTA",
    }
)
print(out)  # INTRON
```
The maximum context of this model is the same as DNABERT-2 (512 tokens), but training used only the 24 nucleotides before and after the target nucleotide.
The pipeline enforces these limits automatically: the `sequence` field is truncated to a single nucleotide, the `before` and `after` fields are each truncated to at most 24 nucleotides (even if longer sequences are provided), and the organism name is limited to 10 characters.
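The truncation rules above can be sketched as a small helper. `build_inputs` is a hypothetical function for illustration, not part of the released pipeline; the slicing choices (keeping the nucleotides closest to the target) are assumptions about how the pipeline truncates:

```python
def build_inputs(before: str, sequence: str, after: str, organism: str = "") -> dict:
    """Apply truncation rules equivalent to those the pipeline enforces (sketch)."""
    return {
        "before": before[-24:],    # keep the 24 nucleotides closest to the target
        "sequence": sequence[:1],  # a single target nucleotide
        "after": after[:24],       # keep the 24 nucleotides after the target
        "organism": organism[:10], # organism name capped at 10 characters
    }

# Oversized inputs are cut down to the trained context window:
inputs = build_inputs("A" * 30, "CT", "G" * 30, "Homo sapiens")
print(len(inputs["before"]), inputs["sequence"], len(inputs["after"]))  # 24 C 24
```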
---
## Custom Usage Information
Prompt format:
The model expects the following input format:
```
TCGG...[SEP]T[SEP]AGCT...
```
The model predicts one of three class labels: 0 (Exon), 1 (Intron), or 2 (Unknown).
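Assembling the prompt and decoding the predicted class can be sketched as follows. `format_prompt` and the `ID2LABEL` mapping are illustrative helpers built from the format and label list above, not names exported by the model:

```python
# Label ids as described above: 0 = Exon, 1 = Intron, 2 = Unknown.
ID2LABEL = {0: "EXON", 1: "INTRON", 2: "UNKNOWN"}

def format_prompt(before: str, target: str, after: str) -> str:
    """Build the before[SEP]target[SEP]after input string."""
    return f"{before}[SEP]{target}[SEP]{after}"

prompt = format_prompt("TCGG", "T", "AGCT")
print(prompt)          # TCGG[SEP]T[SEP]AGCT
print(ID2LABEL[0])     # EXON
```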
---
## Dataset
The model was trained on a processed version of GenBank sequences spanning multiple species, available at the [DNA Coding Regions Dataset](https://huggingface.co/datasets/GustavoHCruz/DNA_coding_regions).
---
## Publications
- **Full Paper**
Achieved **2nd place** at the _Symposium on Knowledge Discovery, Mining and Learning (KDMiLe 2025)_, organized by the Brazilian Computer Society (SBC), held in Fortaleza, Ceará, Brazil.
DOI: [https://doi.org/10.5753/kdmile.2025.247575](https://doi.org/10.5753/kdmile.2025.247575).
- **Short Paper**
Presented at the _IEEE International Conference on Bioinformatics and BioEngineering (BIBE 2025)_, held in Athens, Greece.
DOI: [https://doi.org/10.1109/BIBE66822.2025.00113](https://doi.org/10.1109/BIBE66822.2025.00113).
---
## Training
- Trained on 8× NVIDIA H100 GPUs.
---
## Metrics
**Average accuracy:** **0.7890**
| Class | Precision | Recall | F1-Score |
| ----------- | --------- | ------ | -------- |
| **Intron** | 0.6007 | 0.9113 | 0.6060 |
| **Exon** | 0.8011 | 0.9011 | 0.8482 |
| **Unknown** | 0.8096 | 0.6697 | 0.7331 |
### Notes
- Metrics were computed on a fully held-out test set.
---
## GitHub Repository
The full code for **data processing, model training, and inference** is available on GitHub:
[CodingDNATransformers](https://github.com/GustavoHCruz/CodingDNATransformers)
You can find scripts for:
- Preprocessing GenBank sequences
- Fine-tuning models
- Evaluating and using the trained models