---
license: mit
base_model:
- openai-community/gpt2
tags:
- genomics
- bioinformatics
- DNA
- sequence-classification
- introns
- exons
- GPT
---
|
|
|
|
|
# Exons and Introns Classifier |
|
|
|
|
|
A GPT-2 model fine-tuned to **classify DNA sequences** as **introns** or **exons**, trained on a large cross-species GenBank dataset covering 34,627 species.
|
|
|
|
|
--- |
|
|
|
|
|
## Model Architecture |
|
|
|
|
|
- Base model: GPT-2 |
|
|
- Approach: Full-sequence classification |
|
|
|
|
|
--- |
|
|
|
|
|
## Usage |
|
|
|
|
|
You can use this model through its own custom pipeline: |
|
|
|
|
|
```python
from transformers import pipeline

pipe = pipeline(
    task="gpt2-exon-intron-classification",
    model="GustavoHCruz/ExInGPT",
    trust_remote_code=True,
)

out = pipe(
    {
        "sequence": "GCAGCAACAGTGCCCAGGGCTCTGATGAGTCTCTCATCACTTGTAAAG",
        "organism": "Homo sapiens",
        "gene": "HLA-C",
        "before": "GGTCTTTTTTTTTGTTCTACCCCAG",
        "after": "GTGAGATTCTGGGGAGCTGAAGTGG",
    }
)

print(out)  # EXON
```
|
|
|
|
|
This model keeps the standard GPT‑2 maximum context length (1024 tokens), but it was trained on DNA sequences of up to 512 nucleotides. The additional context fields (`organism`, `gene`, `before`, `after`) were preprocessed with specific rules during training:
|
|
|
|
|
- Organism and gene names were truncated to 10 characters.
- Flanking sequences (`before` and `after`) were truncated to 25 nucleotides.
|
|
|
|
|
The pipeline enforces these rules automatically: the sequence, organism, gene, and flanking inputs are truncated whenever they exceed their limits.
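
As a rough illustration, the truncation the pipeline performs could be sketched like this (the field names match the pipeline input dictionary; the helper itself is hypothetical and not part of the released code):

```python
# Hypothetical sketch of the input limits described above; the actual
# pipeline implements its own truncation internally.
LIMITS = {"sequence": 512, "organism": 10, "gene": 10, "before": 25, "after": 25}

def truncate_inputs(example: dict) -> dict:
    """Clip each known field to its training-time maximum length."""
    return {k: (v[:LIMITS[k]] if k in LIMITS else v) for k, v in example.items()}

clipped = truncate_inputs({"sequence": "A" * 600, "organism": "Homo sapiens"})
print(len(clipped["sequence"]), clipped["organism"])  # 512 Homo sapie
```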
|
|
|
|
|
--- |
|
|
|
|
|
## Custom Usage Information |
|
|
|
|
|
The model expects prompts in the following format:
|
|
|
|
|
```
<|SEQUENCE|>[G][C][A][G]...
<|ORGANISM|>Homo sapiens
<|GENE|>HLA-C
<|FLANK_BEFORE|>[G][G][T][C]...
<|FLANK_AFTER|>[G][T][G][A]...
<|TARGET|>
```
|
|
|
|
|
- `<|SEQUENCE|>`: Full DNA sequence. Maximum of 512 nucleotides. |
|
|
- `<|ORGANISM|>`: Optional organism name (truncated to a maximum of 10 characters in training). |
|
|
- `<|GENE|>`: Optional gene name (truncated to a maximum of 10 characters in training). |
|
|
- `<|FLANK_BEFORE|>` and `<|FLANK_AFTER|>`: Optional upstream/downstream context sequences. Maximum of 25 nucleotides. |
|
|
- `<|TARGET|>`: Separator token after which the label is predicted.
|
|
|
|
|
The model then predicts the next token as the class label: `[EXON]` or `[INTRON]`.
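
If you want to bypass the custom pipeline, a prompt in this format could be assembled manually. The helper below is a hypothetical sketch: in particular, joining the segments with newlines (rather than direct concatenation) is an assumption based on the format shown above.

```python
def wrap_nucleotides(seq: str) -> str:
    """Wrap each nucleotide in brackets: 'GCAG' -> '[G][C][A][G]'."""
    return "".join(f"[{n}]" for n in seq)

def build_prompt(sequence, organism=None, gene=None, before=None, after=None):
    """Assemble a prompt, applying the training-time truncation limits."""
    parts = [f"<|SEQUENCE|>{wrap_nucleotides(sequence[:512])}"]
    if organism:
        parts.append(f"<|ORGANISM|>{organism[:10]}")
    if gene:
        parts.append(f"<|GENE|>{gene[:10]}")
    if before:
        parts.append(f"<|FLANK_BEFORE|>{wrap_nucleotides(before[:25])}")
    if after:
        parts.append(f"<|FLANK_AFTER|>{wrap_nucleotides(after[:25])}")
    parts.append("<|TARGET|>")
    return "\n".join(parts)

prompt = build_prompt("GCAG", organism="Homo sapiens", gene="HLA-C")
print(prompt)
```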
|
|
|
|
|
--- |
|
|
|
|
|
## Dataset |
|
|
|
|
|
The model was trained on a processed version of GenBank sequences spanning multiple species, available at the [DNA Coding Regions Dataset](https://huggingface.co/datasets/GustavoHCruz/DNA_coding_regions). |
|
|
|
|
|
--- |
|
|
|
|
|
## Publications |
|
|
|
|
|
- **Full Paper** |
|
|
Achieved **2nd place** at the _Symposium on Knowledge Discovery, Mining and Learning (KDMiLe 2025)_, organized by the Brazilian Computer Society (SBC), held in Fortaleza, Ceará, Brazil. |
|
|
DOI: [https://doi.org/10.5753/kdmile.2025.247575](https://doi.org/10.5753/kdmile.2025.247575). |
|
|
- **Short Paper** |
|
|
Presented at the _IEEE International Conference on Bioinformatics and BioEngineering (BIBE 2025)_, held in Athens, Greece. |
|
|
DOI: [https://doi.org/10.1109/BIBE66822.2025.00113](https://doi.org/10.1109/BIBE66822.2025.00113). |
|
|
|
|
|
--- |
|
|
|
|
|
## Training |
|
|
|
|
|
- Trained on 8× NVIDIA H100 GPUs.
|
|
|
|
|
--- |
|
|
|
|
|
## Metrics |
|
|
|
|
|
**Average accuracy:** **0.9985** |
|
|
|
|
|
| Class      | Precision | Recall | F1-Score |
| ---------- | --------- | ------ | -------- |
| **Intron** | 0.9977    | 0.9973 | 0.9975   |
| **Exon**   | 0.9988    | 0.9990 | 0.9989   |
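
The per-class F1 scores are consistent with the reported precision and recall, as a quick arithmetic check confirms:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(round(f1_score(0.9977, 0.9973), 4))  # intron: 0.9975
print(round(f1_score(0.9988, 0.9990), 4))  # exon: 0.9989
```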
|
|
|
|
|
--- |
|
|
|
|
|
### Notes |
|
|
|
|
|
- Metrics were computed on a fully held-out test set.
- The class distribution is approximately two exons per intron, so the scores can be interpreted directly.
- The model can operate on raw nucleotide sequences without the additional biological features (organism, gene, or flanking context).
|
|
|
|
|
--- |
|
|
|
|
|
## GitHub Repository |
|
|
|
|
|
The full code for **data processing, model training, and inference** is available on GitHub: |
|
|
[CodingDNATransformers](https://github.com/GustavoHCruz/CodingDNATransformers) |
|
|
|
|
|
You can find scripts for: |
|
|
|
|
|
- Preprocessing GenBank sequences |
|
|
- Fine-tuning models |
|
|
- Evaluating and using the trained models |
|
|
|