Exons and Introns Classifier

A DNABERT2 model fine-tuned to classify DNA sequences as introns or exons, trained on a large cross-species GenBank dataset (34,627 species).


Architecture

  • Base model: DNABERT2
  • Approach: Full-sequence classification

Usage

You can use this model through its own custom pipeline:

from transformers import pipeline

pipe = pipeline(
  task="dnabert2-exon-intron-classification",
  model="GustavoHCruz/ExInDNABERT2",
  trust_remote_code=True,
)

out = pipe(
  "GCAGCAACAGTGCCCAGGGCTCTGATGAGTCTCTCATCACTTGTAAAG"
)

print(out) # EXON

This model uses the same maximum context length as the standard DNABERT2 (512 tokens), but it was trained on DNA sequences of up to 256 nucleotides.

The pipeline automatically truncates nucleotide sequences that exceed this limit.
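
If you want to control truncation yourself before calling the pipeline, a minimal sketch could look like the following. It assumes the limit is applied at the nucleotide level (256 characters, the training length stated above) and that inputs use the alphabet A, C, G, T, N; both are assumptions, and `prepare_sequence` is a hypothetical helper, not part of the published pipeline.

```python
MAX_NUCLEOTIDES = 256  # assumption: training-time sequence length stated in this card


def prepare_sequence(seq: str, max_len: int = MAX_NUCLEOTIDES) -> str:
    """Hypothetical helper: normalize a DNA string and cut it to max_len nucleotides."""
    cleaned = seq.strip().upper()
    # Assumption: only A, C, G, T and N are valid input characters.
    invalid = set(cleaned) - set("ACGTN")
    if invalid:
        raise ValueError(f"unexpected characters: {sorted(invalid)}")
    # Keep the first max_len nucleotides; shorter sequences pass through unchanged.
    return cleaned[:max_len]


if __name__ == "__main__":
    long_sequence = "gcagcaacag" * 40  # 400 nucleotides, lower case
    print(len(prepare_sequence(long_sequence)))
```

The returned string can then be passed to `pipe(...)` in place of the raw sequence.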


Custom Usage Information

The model expects the same tokens as DNABERT2, that is, raw input nucleotides, for example:

GTAAGGAGGGGGAT

The model should predict the class label: 0 (Intron) or 1 (Exon).
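
When calling the model directly rather than through the pipeline, the two output logits have to be mapped back to these labels. The sketch below shows one way to do that in plain Python. The label strings, the `decode_logits` helper, and the example logits are illustrative assumptions, not values produced by the model.

```python
import math

# Mapping taken from this card: class 0 is Intron, class 1 is Exon.
ID2LABEL = {0: "INTRON", 1: "EXON"}


def decode_logits(logits):
    """Hypothetical helper: turn two raw logits into (label, probability)."""
    # Numerically stable softmax over the two classes.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Pick the class with the highest probability.
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]


if __name__ == "__main__":
    # Made-up logits for illustration only.
    label, score = decode_logits([-1.2, 3.4])
    print(label, round(score, 4))
```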


Dataset

The model was trained on a processed version of GenBank sequences spanning multiple species, available at the DNA Coding Regions Dataset.


Publications


Training

  • Trained on a system with 8x H100 GPUs.

Metrics

Average accuracy: 0.9956

Class    Precision  Recall  F1-Score
Intron   0.9943     0.9922  0.9932
Exon     0.9962     0.9972  0.9967
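
As a quick sanity check, the F1 scores can be recomputed from the reported precision and recall using the standard harmonic-mean formula, F1 = 2PR / (P + R). The sketch below only reuses the numbers from the table; `f1_score` is a small helper written for this example.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the metrics table.
REPORTED = {
    "Intron": (0.9943, 0.9922, 0.9932),
    "Exon": (0.9962, 0.9972, 0.9967),
}

if __name__ == "__main__":
    for name, (p, r, reported_f1) in REPORTED.items():
        print(f"{name}: recomputed={f1_score(p, r):.4f} reported={reported_f1:.4f}")
```

Both recomputed values agree with the reported ones to within rounding.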

Notes

  • Metrics were computed on a full isolated test set.
  • The classes follow a ratio of approximately two exons to one intron, allowing for direct interpretation of the scores.
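
The class ratio also lets you relate the per-class recall to the overall accuracy, since accuracy equals the mean of per-class recall weighted by each class's share of the test set. The sketch below assumes an exact 2:1 exon-to-intron split, which the card states only approximately, so a small gap from the reported 0.9956 is expected.

```python
def accuracy_from_recall(recalls, weights):
    """Overall accuracy as the class-share-weighted mean of per-class recall."""
    total = sum(weights.values())
    return sum(recalls[name] * weights[name] for name in recalls) / total


# Recall values from the metrics table.
RECALL = {"Intron": 0.9922, "Exon": 0.9972}
# Assumption: exactly two exons per intron (the card says "approximately").
CLASS_WEIGHT = {"Intron": 1, "Exon": 2}

if __name__ == "__main__":
    estimate = accuracy_from_recall(RECALL, CLASS_WEIGHT)
    print(f"estimated accuracy: {estimate:.4f} (reported: 0.9956)")
```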

GitHub Repository

The full code for data processing, model training, and inference is available on GitHub:
CodingDNATransformers

You can find scripts for:

  • Preprocessing GenBank sequences
  • Fine-tuning models
  • Evaluating and using the trained models