---
license: mit
base_model:
- openai-community/gpt2
tags:
- genomics
- bioinformatics
- DNA
- sequence-classification
- introns
- exons
- GPT
---
# Exons and Introns Classifier
A GPT-2 model fine-tuned to **classify DNA sequences** as **introns** or **exons**, trained on a large cross-species GenBank dataset (34,627 different species).
---
## Model Architecture
- Base model: GPT-2
- Approach: Full-sequence classification
---
## Usage
You can use this model through its own custom pipeline:
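```python
from transformers import pipeline

# trust_remote_code=True is needed because the task is a custom pipeline
# shipped with the model repository rather than with transformers itself.
pipe = pipeline(
    task="gpt2-exon-intron-classification",
    model="GustavoHCruz/ExInGPT",
    trust_remote_code=True,
)

# "sequence" is the DNA to classify; the remaining fields are optional context.
out = pipe(
    {
        "sequence": "GCAGCAACAGTGCCCAGGGCTCTGATGAGTCTCTCATCACTTGTAAAG",
        "organism": "Homo sapiens",
        "gene": "HLA-C",
        "before": "GGTCTTTTTTTTTGTTCTACCCCAG",
        "after": "GTGAGATTCTGGGGAGCTGAAGTGG",
    }
)

print(out)  # EXON
```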
This model uses the same maximum context length as standard GPT-2 (1024 tokens), but it was trained on DNA sequences of up to 512 nucleotides. The additional context fields (`organism`, `gene`, `before`, `after`) were included in training under the following limits:
- Organism and gene names were truncated to 10 characters.
- Flanking sequences (`before` and `after`) were at most 25 nucleotides long.

The pipeline enforces these limits: any `sequence`, `organism`, `gene`, `before`, or `after` value that exceeds its limit is truncated automatically.
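
For callers who prepare inputs outside the pipeline, the limits above can be sketched as a small helper. This is an illustration only, not the pipeline's actual code: the function name `truncate_inputs` and the constant names are hypothetical, and keeping the leading characters of every field is an assumption, since the text does not say which end is kept.

```python
# Limits taken from the text above; the names are hypothetical.
MAX_SEQUENCE_NT = 512   # nucleotides in the main sequence
MAX_NAME_CHARS = 10     # characters in organism and gene names
MAX_FLANK_NT = 25       # nucleotides in each flanking sequence


def truncate_inputs(sequence, organism="", gene="", before="", after=""):
    """Return a dict with every field cut to its training-time limit.

    Assumption: truncation keeps the leading characters of each field.
    """
    return {
        "sequence": sequence[:MAX_SEQUENCE_NT],
        "organism": organism[:MAX_NAME_CHARS],
        "gene": gene[:MAX_NAME_CHARS],
        "before": before[:MAX_FLANK_NT],
        "after": after[:MAX_FLANK_NT],
    }


example = truncate_inputs(
    "A" * 600, organism="Homo sapiens", gene="HLA-C", before="G" * 30
)
print(len(example["sequence"]), example["organism"], len(example["before"]))
```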
---
## Custom Usage Information
The model expects prompts in the following format:
```
<|SEQUENCE|>[G][C][A][G]...
<|ORGANISM|>Homo sapiens
<|GENE|>HLA-C
<|FLANK_BEFORE|>[G][G][T][C]...
<|FLANK_AFTER|>[G][T][G][A]...
<|TARGET|>
```
- `<|SEQUENCE|>`: Full DNA sequence. Maximum of 512 nucleotides.
- `<|ORGANISM|>`: Optional organism name (truncated to a maximum of 10 characters in training).
- `<|GENE|>`: Optional gene name (truncated to a maximum of 10 characters in training).
- `<|FLANK_BEFORE|>` and `<|FLANK_AFTER|>`: Optional upstream/downstream context sequences. Maximum of 25 nucleotides.
- `<|TARGET|>`: Separator token that marks where the label is predicted.

The model then predicts the class label as the next token: `[EXON]` or `[INTRON]`.
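
As a rough sketch of how such a prompt could be assembled by hand, the helper below wraps each nucleotide in brackets and joins the tagged fields. The function name `build_prompt` is hypothetical, and two details are assumptions rather than documented behavior: optional fields are skipped entirely when absent, and fields are joined with a newline to mirror the layout shown above (pass `sep=""` if the model was trained without separators). The pipeline's own preprocessing in the repository remains the reference.

```python
def build_prompt(sequence, organism=None, gene=None, before=None, after=None, sep="\n"):
    """Assemble a prompt in the documented tag format (illustrative sketch)."""

    def bracket(nucleotides):
        # "GCAG" -> "[G][C][A][G]"
        return "".join(f"[{n}]" for n in nucleotides)

    parts = [f"<|SEQUENCE|>{bracket(sequence)}"]
    # Assumption: optional fields are omitted when not provided.
    if organism:
        parts.append(f"<|ORGANISM|>{organism}")
    if gene:
        parts.append(f"<|GENE|>{gene}")
    if before:
        parts.append(f"<|FLANK_BEFORE|>{bracket(before)}")
    if after:
        parts.append(f"<|FLANK_AFTER|>{bracket(after)}")
    # The label ([EXON] or [INTRON]) is predicted right after this token.
    parts.append("<|TARGET|>")
    return sep.join(parts)


print(build_prompt("GCAG", organism="Homo sapiens", gene="HLA-C", before="GGTC", after="GTGA"))
```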
---
## Dataset
The model was trained on a processed version of GenBank sequences spanning multiple species, available at the [DNA Coding Regions Dataset](https://huggingface.co/datasets/GustavoHCruz/DNA_coding_regions).
---
## Publications
- **Full Paper**
Achieved **2nd place** at the _Symposium on Knowledge Discovery, Mining and Learning (KDMiLe 2025)_, organized by the Brazilian Computer Society (SBC), held in Fortaleza, Ceará, Brazil.
DOI: [https://doi.org/10.5753/kdmile.2025.247575](https://doi.org/10.5753/kdmile.2025.247575).
- **Short Paper**
Presented at the _IEEE International Conference on Bioinformatics and BioEngineering (BIBE 2025)_, held in Athens, Greece.
DOI: [https://doi.org/10.1109/BIBE66822.2025.00113](https://doi.org/10.1109/BIBE66822.2025.00113).
---
## Training
- Trained on 8x H100 GPUs.
---
## Metrics
**Average accuracy:** **0.9985**
| Class | Precision | Recall | F1-Score |
| ---------- | --------- | ------ | -------- |
| **Intron** | 0.9977 | 0.9973 | 0.9975 |
| **Exon** | 0.9988 | 0.9990 | 0.9989 |
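
For readers who want to sanity-check the table, F1 is the harmonic mean of precision and recall. The short script below recomputes it from the reported precision and recall values; the helper name `f1_score` is chosen here for illustration and is not taken from the project's code.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above.
reported = {
    "Intron": (0.9977, 0.9973, 0.9975),
    "Exon": (0.9988, 0.9990, 0.9989),
}

for name, (precision, recall, reported_f1) in reported.items():
    recomputed = round(f1_score(precision, recall), 4)
    print(f"{name}: recomputed F1 = {recomputed}, reported F1 = {reported_f1}")
```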
---
### Notes
- Metrics were computed on the full test set, which was kept isolated from training.
- The classes occur at a ratio of approximately two exons to one intron; the imbalance is mild enough that the scores can be interpreted directly.
- The model can operate on raw nucleotide sequences without additional biological features (e.g. organism, gene, before or after).
---
## GitHub Repository
The full code for **data processing, model training, and inference** is available on GitHub:
[CodingDNATransformers](https://github.com/GustavoHCruz/CodingDNATransformers)
You can find scripts for:
- Preprocessing GenBank sequences
- Fine-tuning models
- Evaluating and using the trained models