Nucleotide Classifier

A BERT model fine-tuned to classify nucleotides as introns or exons, trained on a large cross-species GenBank dataset (34,627 different species).


Architecture

  • Base model: DNABERT2
  • Approach: Nucleotide classification
  • Framework: PyTorch + Hugging Face Transformers

Usage

You can use this model through its own custom pipeline:

from transformers import pipeline

pipe = pipeline(
  task="dnabert2-nucleotide-classification",
  model="GustavoHCruz/NuclDNABERT2",
  trust_remote_code=True,
)

out = pipe(
  {
    "before": "ATGATCCAGTTAAAAAATATATTC",
    "sequence": "C",
    "after": ""
  }
)

print(out) # EXON

out = pipe(
  {
    "before": "GTAACATTAAAATAAAAAACAAAA",
    "sequence": "T",
    "after": "ATTATTTAAAGAAAAATATAATTA"
  }
)

print(out) # INTRON

The maximum context of this model is the same as DNABERT2 (512 tokens), but its training was limited to 24 nucleotides before and after the target nucleotide.

When using the pipeline, these limits are enforced automatically: the sequence is truncated to a single nucleotide, the before and after contexts are each truncated to a maximum of 24 nucleotides (even if longer sequences are provided), and the organism field is limited to 10 characters.
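As a rough sketch, the truncation rules above could be applied like this (a hypothetical helper for illustration; the released pipeline does this internally):

```python
def preprocess(example, max_context=24, max_organism=10):
    """Apply the pipeline's input limits: a single target nucleotide,
    at most `max_context` flanking nucleotides on each side, and an
    organism string capped at `max_organism` characters."""
    return {
        # keep the nucleotides closest to the target on each side
        "before": example.get("before", "")[-max_context:],
        "sequence": example.get("sequence", "")[:1],
        "after": example.get("after", "")[:max_context],
        "organism": example.get("organism", "")[:max_organism],
    }
```

The result can then be passed directly to the pipeline without risking silent truncation surprises.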


Custom Usage Information

Prompt format:

The model expects the following input format:

TCGG...[SEP]T[SEP]AGCT...

The model predicts a class label: 0 (Exon), 1 (Intron), or 2 (Unknown).
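A minimal sketch of the prompt format and label mapping described above (the helper name and label strings are illustrative assumptions; the released pipeline constructs this string itself):

```python
# Class ids as listed above; label spellings are an assumption.
ID2LABEL = {0: "EXON", 1: "INTRON", 2: "UNKNOWN"}

def build_prompt(before: str, sequence: str, after: str) -> str:
    """Join flanking context and the target nucleotide with [SEP],
    matching the expected input format: BEFORE[SEP]TARGET[SEP]AFTER."""
    return f"{before}[SEP]{sequence}[SEP]{after}"
```

For example, `build_prompt("TCGG", "T", "AGCT")` yields `"TCGG[SEP]T[SEP]AGCT"`, and the predicted class id can be mapped back through `ID2LABEL`.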


Dataset

The model was trained on a processed version of GenBank sequences spanning multiple species, available at the DNA Coding Regions Dataset.


Training

  • Trained on 8x NVIDIA H100 GPUs.

Metrics

Average accuracy: 0.7890

Class     Precision  Recall  F1-Score
Intron    0.6007     0.9113  0.6060
Exon      0.8011     0.9011  0.8482
Unknown   0.8096     0.6697  0.7331

Notes

  • Metrics were computed on a fully isolated (held-out) test set.

GitHub Repository

The full code for data processing, model training, and inference is available on GitHub:
CodingDNATransformers

You can find scripts for:

  • Preprocessing GenBank sequences
  • Fine-tuning models
  • Evaluating and using the trained models