---
license: mit
base_model:
- openai-community/gpt2
tags:
- genomics
- bioinformatics
- DNA
- sequence-classification
- introns
- exons
- GPT
---
# Exons and Introns Classifier
A GPT-2 model fine-tuned to **classify DNA sequences** as **introns** or **exons**, trained on a large cross-species GenBank dataset covering 34,627 species.
---
## Model Architecture
- Base model: GPT-2
- Approach: Full-sequence classification
---
## Usage
You can use this model through its own custom pipeline:
```python
from transformers import pipeline

pipe = pipeline(
    task="gpt2-exon-intron-classification",
    model="GustavoHCruz/ExInGPT",
    trust_remote_code=True,
)

out = pipe(
    {
        "sequence": "GCAGCAACAGTGCCCAGGGCTCTGATGAGTCTCTCATCACTTGTAAAG",
        "organism": "Homo sapiens",
        "gene": "HLA-C",
        "before": "GGTCTTTTTTTTTGTTCTACCCCAG",
        "after": "GTGAGATTCTGGGGAGCTGAAGTGG",
    }
)
print(out)  # EXON
```
This model uses the same maximum context length as standard GPT-2 (1024 tokens), but it was trained on DNA sequences of up to 512 nucleotides. The additional context fields (`organism`, `gene`, `before`, `after`) were preprocessed with specific rules during training:
- Organism and gene names were truncated to 10 characters.
- The flanking sequences `before` and `after` were truncated to 25 nucleotides.

The pipeline enforces the same rules: the `sequence`, `organism`, `gene`, `before`, and `after` fields are automatically truncated if they exceed these limits.
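The truncation behavior described above can be sketched as follows. This is an illustrative approximation of the documented rules, not the pipeline's actual code; `LIMITS` and `truncate_inputs` are hypothetical names.

```python
# Documented per-field maximum lengths (assumed from the model card, not
# read from the pipeline's source).
LIMITS = {
    "sequence": 512,  # nucleotides
    "organism": 10,   # characters
    "gene": 10,       # characters
    "before": 25,     # nucleotides
    "after": 25,      # nucleotides
}

def truncate_inputs(example: dict) -> dict:
    """Clip each known field to its documented maximum length."""
    return {
        key: value[: LIMITS[key]] if key in LIMITS else value
        for key, value in example.items()
    }

example = {
    "sequence": "GCAG" * 200,    # 800 nt, exceeds the 512-nt limit
    "organism": "Homo sapiens",  # 12 chars, exceeds the 10-char limit
    "gene": "HLA-C",             # within the limit, left unchanged
}
clipped = truncate_inputs(example)
print(len(clipped["sequence"]), clipped["organism"])  # 512 Homo sapie
```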
---
## Custom Usage Information
The model expects prompts in the following format:
```
<|SEQUENCE|>[G][C][A][G]...
<|ORGANISM|>Homo sapiens
<|GENE|>HLA-C
<|FLANK_BEFORE|>[G][G][T][C]...
<|FLANK_AFTER|>[G][T][G][A]...
<|TARGET|>
```
- `<|SEQUENCE|>`: Full DNA sequence. Maximum of 512 nucleotides.
- `<|ORGANISM|>`: Optional organism name (truncated to a maximum of 10 characters in training).
- `<|GENE|>`: Optional gene name (truncated to a maximum of 10 characters in training).
- `<|FLANK_BEFORE|>` and `<|FLANK_AFTER|>`: Optional upstream/downstream context sequences. Maximum of 25 nucleotides.
- `<|TARGET|>`: Separation token for label prediction.
The model then predicts the next token as the class label: `[EXON]` or `[INTRON]`.
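A prompt matching the format above can be assembled with plain string handling. This is a hypothetical helper for illustration (`build_prompt` is not part of the released pipeline), and it assumes the fields are newline-separated exactly as shown in the template:

```python
def wrap_nucleotides(seq: str) -> str:
    """Render each nucleotide in the bracketed per-base form, e.g. [G][C]."""
    return "".join(f"[{base}]" for base in seq)

def build_prompt(sequence, organism=None, gene=None, before=None, after=None):
    """Assemble the prompt, including optional fields only when provided."""
    parts = [f"<|SEQUENCE|>{wrap_nucleotides(sequence)}"]
    if organism:
        parts.append(f"<|ORGANISM|>{organism}")
    if gene:
        parts.append(f"<|GENE|>{gene}")
    if before:
        parts.append(f"<|FLANK_BEFORE|>{wrap_nucleotides(before)}")
    if after:
        parts.append(f"<|FLANK_AFTER|>{wrap_nucleotides(after)}")
    parts.append("<|TARGET|>")
    return "\n".join(parts)

prompt = build_prompt("GCAG", organism="Homo sapiens", gene="HLA-C")
print(prompt)
```

Optional fields that are not supplied (here, the flanks) are simply omitted from the prompt.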
---
## Dataset
The model was trained on a processed version of GenBank sequences spanning multiple species, available at the [DNA Coding Regions Dataset](https://huggingface.co/datasets/GustavoHCruz/DNA_coding_regions).
---
## Publications
- **Full Paper**
Achieved **2nd place** at the _Symposium on Knowledge Discovery, Mining and Learning (KDMiLe 2025)_, organized by the Brazilian Computer Society (SBC), held in Fortaleza, Ceará, Brazil.
DOI: [https://doi.org/10.5753/kdmile.2025.247575](https://doi.org/10.5753/kdmile.2025.247575).
- **Short Paper**
Presented at the _IEEE International Conference on Bioinformatics and BioEngineering (BIBE 2025)_, held in Athens, Greece.
DOI: [https://doi.org/10.1109/BIBE66822.2025.00113](https://doi.org/10.1109/BIBE66822.2025.00113).
---
## Training
- Trained on 8× NVIDIA H100 GPUs.
---
## Metrics
**Average accuracy:** **0.9985**
| Class | Precision | Recall | F1-Score |
| ---------- | --------- | ------ | -------- |
| **Intron** | 0.9977 | 0.9973 | 0.9975 |
| **Exon** | 0.9988 | 0.9990 | 0.9989 |
---
### Notes
- Metrics were computed on a fully isolated test set.
- The classes occur at a ratio of approximately two exons per intron, a mild imbalance, so the scores can be interpreted directly.
- The model can operate on raw nucleotide sequences alone, without the optional biological context fields (`organism`, `gene`, `before`, `after`).
---
## GitHub Repository
The full code for **data processing, model training, and inference** is available on GitHub:
[CodingDNATransformers](https://github.com/GustavoHCruz/CodingDNATransformers)
You can find scripts for:
- Preprocessing GenBank sequences
- Fine-tuning models
- Evaluating and using the trained models