Exons and Introns Classifier

A GPT-2 model fine-tuned to classify DNA sequences as introns or exons, trained on a large cross-species GenBank dataset covering 34,627 species.

Model Architecture

Base model: GPT-2
Approach: Full-sequence classification

Usage

You can use this model through its own custom pipeline:

from transformers import pipeline

pipe = pipeline(
  task="gpt2-exon-intron-classification",
  model="GustavoHCruz/ExInGPT",
  trust_remote_code=True,
)

out = pipe(
  {
    "sequence": "GCAGCAACAGTGCCCAGGGCTCTGATGAGTCTCTCATCACTTGTAAAG",
    "organism": "Homo sapiens",
    "gene": "HLA-C",
    "before": "GGTCTTTTTTTTTGTTCTACCCCAG",
    "after": "GTGAGATTCTGGGGAGCTGAAGTGG",
  }
)

print(out) # EXON

This model uses the same maximum context length as the standard GPT-2 (1024 tokens), but it was trained on DNA sequences of up to 512 nucleotides. The additional context fields (organism, gene, before, after) were also constrained during training:

Organism and gene names were truncated to 10 characters.
Flanking sequences (before and after) were limited to 25 nucleotides each.

The pipeline applies the same limits: the nucleotide sequence, organism, gene, before, and after fields are truncated automatically if they exceed them.
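
The limits above can be sketched as a small preprocessing helper. This is only an illustration: the function name `truncate_inputs` and the constants are hypothetical, and it assumes that truncation keeps the leading characters of each field, which the pipeline may or may not do (for example, it could keep the end of the upstream flank instead).

```python
# Limits taken from the text above.
MAX_SEQUENCE = 512  # nucleotides in the main sequence
MAX_NAME = 10       # characters for organism and gene names
MAX_FLANK = 25      # nucleotides for each flanking sequence


def truncate_inputs(sequence, organism=None, gene=None, before=None, after=None):
    """Return a dict of fields cut to the documented limits.

    Assumption: each field keeps its first N characters. Optional
    fields that are None are left as None.
    """
    def cut(value, limit):
        return None if value is None else value[:limit]

    return {
        "sequence": sequence[:MAX_SEQUENCE],
        "organism": cut(organism, MAX_NAME),
        "gene": cut(gene, MAX_NAME),
        "before": cut(before, MAX_FLANK),
        "after": cut(after, MAX_FLANK),
    }


example = truncate_inputs(
    "A" * 600, organism="Homo sapiens", gene="HLA-C", before="G" * 30
)
print(len(example["sequence"]), example["organism"], len(example["before"]))
```

Under these assumptions, an organism name such as "Homo sapiens" is shortened to "Homo sapie", so two organisms sharing the same first 10 characters would look identical to the model.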

Custom Usage Information

Prompt format:

The model expects the following input format:

<|SEQUENCE|>[G][C][A][G]...
<|ORGANISM|>Homo sapiens
<|GENE|>HLA-C
<|FLANK_BEFORE|>[G][G][T][C]...
<|FLANK_AFTER|>[G][T][G][A]...
<|TARGET|>

<|SEQUENCE|>: Full DNA sequence. Maximum of 512 nucleotides.
<|ORGANISM|>: Optional organism name (truncated to a maximum of 10 characters in training).
<|GENE|>: Optional gene name (truncated to a maximum of 10 characters in training).
<|FLANK_BEFORE|> and <|FLANK_AFTER|>: Optional upstream/downstream context sequences. Maximum of 25 nucleotides.
<|TARGET|>: Separation token for label prediction.
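
For use outside the pipeline, the format above could be assembled with a helper like the following. It is a minimal sketch, not the pipeline's own code: `build_prompt` and `wrap_nucleotides` are hypothetical names, the separator between fields (`sep`) is an assumption based on the layout shown above, and omitting the tag of an absent optional field is also an assumption. Check both against the pipeline source before relying on them.

```python
def wrap_nucleotides(seq):
    """Wrap each nucleotide in brackets, e.g. 'GC' -> '[G][C]'."""
    return "".join(f"[{base}]" for base in seq.upper())


def build_prompt(sequence, organism=None, gene=None, before=None, after=None, sep="\n"):
    """Assemble the prompt in the documented field order.

    Assumptions: fields are joined with `sep`, and optional fields that
    are missing are skipped entirely. Limits (512 / 10 / 25) follow the
    field descriptions above.
    """
    parts = [f"<|SEQUENCE|>{wrap_nucleotides(sequence[:512])}"]
    if organism:
        parts.append(f"<|ORGANISM|>{organism[:10]}")
    if gene:
        parts.append(f"<|GENE|>{gene[:10]}")
    if before:
        parts.append(f"<|FLANK_BEFORE|>{wrap_nucleotides(before[:25])}")
    if after:
        parts.append(f"<|FLANK_AFTER|>{wrap_nucleotides(after[:25])}")
    parts.append("<|TARGET|>")  # the class label is predicted right after this token
    return sep.join(parts)


print(build_prompt("GCAG", organism="Homo sapiens", gene="HLA-C",
                   before="GGTC", after="GTGA"))
```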

The model should predict the next token as the class label: [EXON] or [INTRON].
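
One way to turn the next-token prediction into a class is to compare only the scores of the two label tokens. The sketch below is dependency-free and uses made-up numbers: `pick_label` is a hypothetical helper, the token IDs are placeholders, and it assumes `[EXON]` and `[INTRON]` are each a single token in the vocabulary. In practice the IDs would be looked up from the model's tokenizer, and the logits would come from the model's output at the position after `<|TARGET|>`.

```python
import math


def pick_label(next_token_logits, label_token_ids):
    """Choose the most likely label among the given label tokens.

    next_token_logits: list of floats, one score per vocabulary entry.
    label_token_ids:   dict mapping label string -> token ID (placeholders here).
    Returns (best_label, probabilities renormalised over the labels only).
    """
    scores = {label: next_token_logits[tid] for label, tid in label_token_ids.items()}
    top = max(scores.values())
    # Softmax restricted to the label tokens; subtracting `top` avoids overflow.
    exps = {label: math.exp(score - top) for label, score in scores.items()}
    total = sum(exps.values())
    probs = {label: value / total for label, value in exps.items()}
    best = max(probs, key=probs.get)
    return best, probs


# Toy vocabulary of four entries with placeholder IDs for the two labels.
toy_logits = [0.1, 2.0, -1.0, 0.5]
toy_ids = {"[EXON]": 1, "[INTRON]": 3}
label, probs = pick_label(toy_logits, toy_ids)
print(label, probs)
```

Restricting the comparison to the two label tokens means any probability mass the model places on other tokens is ignored, which keeps the output a valid two-class decision.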

Dataset

The model was trained on a processed version of GenBank sequences spanning multiple species, available at the DNA Coding Regions Dataset.

Publications

Full Paper
Achieved 2nd place at the Symposium on Knowledge Discovery, Mining and Learning (KDMiLe 2025), organized by the Brazilian Computer Society (SBC), held in Fortaleza, Ceará, Brazil.
DOI: https://doi.org/10.5753/kdmile.2025.247575.
Short Paper
Presented at the IEEE International Conference on Bioinformatics and BioEngineering (BIBE 2025), held in Athens, Greece.
DOI: https://doi.org/10.1109/BIBE66822.2025.00113.

Training

Trained on a system with 8x H100 GPUs.

Metrics

Average accuracy: 0.9985

Class	Precision	Recall	F1-Score
Intron	0.9977	0.9973	0.9975
Exon	0.9988	0.9990	0.9989
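
As a quick consistency check, the F1 scores in the table can be recomputed from precision and recall using the standard definition F1 = 2PR / (P + R). The snippet below is only a reader-side check using the reported values; `f1_score` is a local helper, not part of the model's code.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above.
reported = {
    "Intron": (0.9977, 0.9973, 0.9975),
    "Exon": (0.9988, 0.9990, 0.9989),
}

for name, (p, r, f1) in reported.items():
    computed = f1_score(p, r)
    print(f"{name}: computed F1 = {computed:.4f}, reported F1 = {f1:.4f}")
```

Both recomputed values agree with the table to four decimal places.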

Notes

Metrics were computed on the full test set, which was kept isolated from training.
The classes follow a ratio of approximately two exons to one intron, so the scores can be interpreted directly.
The model can operate on raw nucleotide sequences without additional biological features (e.g. organism, gene, before or after).
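
To illustrate how the class ratio relates the per-class scores to the overall accuracy: for a single-label classifier, accuracy equals the mean of per-class recall weighted by class frequency. The sketch below assumes an exact 2:1 exon-to-intron ratio, which the notes state only approximately, so the result is an estimate rather than a reproduction of the reported figure. `weighted_recall` is a hypothetical helper.

```python
def weighted_recall(recalls, weights):
    """Mean of per-class recall, weighted by class frequency."""
    total = sum(weights.values())
    return sum(recalls[name] * weights[name] for name in recalls) / total


recalls = {"exon": 0.9990, "intron": 0.9973}  # recall values from the metrics table
weights = {"exon": 2, "intron": 1}            # assumption: exactly 2 exons per intron

estimate = weighted_recall(recalls, weights)
print(f"estimated accuracy: {estimate:.4f}")
```

With these inputs the estimate is about 0.9984, close to the reported average accuracy of 0.9985; the small gap is consistent with the ratio being only approximately 2:1.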