---
license: mit
base_model:
- openai-community/gpt2
tags:
- genomics
- bioinformatics
- DNA
- sequence-classification
- introns
- exons
- GPT
---
|
|
|
|
|
# Exons and Introns Classifier |
|
|
|
|
|
A GPT-2 model fine-tuned to **classify DNA sequences** as **introns** or **exons**, trained on a large cross-species GenBank dataset covering 34,627 species.
|
|
|
|
|
--- |
|
|
|
|
|
## Model Architecture |
|
|
|
|
|
- Base model: GPT-2 |
|
|
- Approach: Full-sequence classification |
|
|
|
|
|
--- |
|
|
|
|
|
## Usage |
|
|
|
|
|
You can use this model through its own custom pipeline: |
|
|
|
|
|
```python
from transformers import pipeline

pipe = pipeline(
    task="gpt2-exon-intron-classification",
    model="GustavoHCruz/ExInGPT",
    trust_remote_code=True,
)

out = pipe(
    {
        "sequence": "GCAGCAACAGTGCCCAGGGCTCTGATGAGTCTCTCATCACTTGTAAAG",
        "organism": "Homo sapiens",
        "gene": "HLA-C",
        "before": "GGTCTTTTTTTTTGTTCTACCCCAG",
        "after": "GTGAGATTCTGGGGAGCTGAAGTGG",
    }
)

print(out)  # EXON
```
|
|
|
|
|
This model keeps the standard GPT‑2 maximum context length (1024 tokens), but it was trained on DNA sequences of up to 512 nucleotides. The additional context fields (`organism`, `gene`, `before`, `after`) were preprocessed with specific rules during training:
|
|
|
|
|
- Organism and gene names were truncated to 10 characters.
- Flanking sequences (`before` and `after`) were truncated to 25 nucleotides.
|
|
|
|
|
The pipeline enforces these rules automatically: the sequence, organism, gene, and flanking inputs are truncated whenever they exceed their limits.
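
As a rough illustration, the truncation the pipeline performs could be sketched like this (the field names match the pipeline input dictionary; the helper itself is hypothetical and not part of the released code):

```python
# Hypothetical sketch of the input limits described above; the actual
# pipeline implements its own truncation internally.
LIMITS = {"sequence": 512, "organism": 10, "gene": 10, "before": 25, "after": 25}

def truncate_inputs(example: dict) -> dict:
    """Clip each known field to its training-time maximum length."""
    return {k: (v[:LIMITS[k]] if k in LIMITS else v) for k, v in example.items()}

clipped = truncate_inputs({"sequence": "A" * 600, "organism": "Homo sapiens"})
print(len(clipped["sequence"]), clipped["organism"])  # 512 Homo sapie
```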
|
|
|
|
|
--- |
|
|
|
|
|
## Custom Usage Information |
|
|
|
|
|
The model expects prompts in the following format:
|
|
|
|
|
```
<|SEQUENCE|>[G][C][A][G]...
<|ORGANISM|>Homo sapiens
<|GENE|>HLA-C
<|FLANK_BEFORE|>[G][G][T][C]...
<|FLANK_AFTER|>[G][T][G][A]...
<|TARGET|>
```
|
|
|
|
|
- `<|SEQUENCE|>`: Full DNA sequence. Maximum of 512 nucleotides. |
|
|
- `<|ORGANISM|>`: Optional organism name (truncated to a maximum of 10 characters in training). |
|
|
- `<|GENE|>`: Optional gene name (truncated to a maximum of 10 characters in training). |
|
|
- `<|FLANK_BEFORE|>` and `<|FLANK_AFTER|>`: Optional upstream/downstream context sequences. Maximum of 25 nucleotides. |
|
|
- `<|TARGET|>`: Separator token after which the label is predicted.
|
|
|
|
|
The model then predicts the next token as the class label: `[EXON]` or `[INTRON]`.
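
If you want to bypass the custom pipeline, a prompt in this format could be assembled manually. The helper below is a hypothetical sketch: in particular, joining the segments with newlines (rather than direct concatenation) is an assumption based on the format shown above.

```python
def wrap_nucleotides(seq: str) -> str:
    """Wrap each nucleotide in brackets: 'GCAG' -> '[G][C][A][G]'."""
    return "".join(f"[{n}]" for n in seq)

def build_prompt(sequence, organism=None, gene=None, before=None, after=None):
    """Assemble a prompt, applying the training-time truncation limits."""
    parts = [f"<|SEQUENCE|>{wrap_nucleotides(sequence[:512])}"]
    if organism:
        parts.append(f"<|ORGANISM|>{organism[:10]}")
    if gene:
        parts.append(f"<|GENE|>{gene[:10]}")
    if before:
        parts.append(f"<|FLANK_BEFORE|>{wrap_nucleotides(before[:25])}")
    if after:
        parts.append(f"<|FLANK_AFTER|>{wrap_nucleotides(after[:25])}")
    parts.append("<|TARGET|>")
    return "\n".join(parts)

prompt = build_prompt("GCAG", organism="Homo sapiens", gene="HLA-C")
print(prompt)
```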
|
|
|
|
|
--- |
|
|
|
|
|
## Dataset |
|
|
|
|
|
The model was trained on a processed version of GenBank sequences spanning multiple species, available at the [DNA Coding Regions Dataset](https://huggingface.co/datasets/GustavoHCruz/DNA_coding_regions). |
|
|
|
|
|
--- |
|
|
|
|
|
## Publications |
|
|
|
|
|
- **Full Paper** |
|
|
Achieved **2nd place** at the _Symposium on Knowledge Discovery, Mining and Learning (KDMiLe 2025)_, organized by the Brazilian Computer Society (SBC), held in Fortaleza, Ceará, Brazil. |
|
|
DOI: [https://doi.org/10.5753/kdmile.2025.247575](https://doi.org/10.5753/kdmile.2025.247575). |
|
|
- **Short Paper** |
|
|
Presented at the _IEEE International Conference on Bioinformatics and BioEngineering (BIBE 2025)_, held in Athens, Greece. |
|
|
DOI: [https://doi.org/10.1109/BIBE66822.2025.00113](https://doi.org/10.1109/BIBE66822.2025.00113). |
|
|
|
|
|
--- |
|
|
|
|
|
## Training |
|
|
|
|
|
- Trained on 8× NVIDIA H100 GPUs.
|
|
|
|
|
--- |
|
|
|
|
|
## Metrics |
|
|
|
|
|
**Average accuracy:** **0.9985** |
|
|
|
|
|
| Class      | Precision | Recall | F1-Score |
| ---------- | --------- | ------ | -------- |
| **Intron** | 0.9977    | 0.9973 | 0.9975   |
| **Exon**   | 0.9988    | 0.9990 | 0.9989   |
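
The per-class F1 scores are consistent with the reported precision and recall, as a quick arithmetic check confirms:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(round(f1_score(0.9977, 0.9973), 4))  # intron: 0.9975
print(round(f1_score(0.9988, 0.9990), 4))  # exon: 0.9989
```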
|
|
|
|
|
--- |
|
|
|
|
|
### Notes |
|
|
|
|
|
- Metrics were computed on a fully held-out test set.
- The class distribution is approximately two exons per intron, so the scores can be interpreted directly.
- The model can operate on raw nucleotide sequences without the additional biological features (organism, gene, or flanking context).
|
|
|
|
|
--- |
|
|
|
|
|
## GitHub Repository |
|
|
|
|
|
The full code for **data processing, model training, and inference** is available on GitHub: |
|
|
[CodingDNATransformers](https://github.com/GustavoHCruz/CodingDNATransformers) |
|
|
|
|
|
You can find scripts for: |
|
|
|
|
|
- Preprocessing GenBank sequences |
|
|
- Fine-tuning models |
|
|
- Evaluating and using the trained models |
|
|
|