GustavoHCruz
/

ExInGPT

 - introns
 - exons
 - GPT
+---
+# Exons and Introns Classifier
+A fine-tuned GPT-2 model for **classifying DNA sequences** as **introns** or **exons**, trained on a large cross-species GenBank dataset.
+## Architecture
+- Base model: GPT-2
+- Approach: Full-sequence classification
+- Framework: PyTorch + Hugging Face Transformers
+## Usage
+```python
+from transformers import AutoTokenizer, AutoModelForSequenceClassification
+
+# Load the tokenizer and the fine-tuned model weights from the Hugging Face Hub
+tokenizer = AutoTokenizer.from_pretrained("gu-dudi/ExInGPT")
+model = AutoModelForSequenceClassification.from_pretrained("gu-dudi/ExInGPT")
+```
+The model expects input in the following prompt format:
+```
+<|SEQUENCE|>ACGAAGGGTAAGCC...
+<|FLANK_BEFORE|>ACGT...
+<|FLANK_AFTER|>ACGT...
+<|ORGANISM|>...
+<|GENE|>...
+<|TARGET|>
+```
+- `<|SEQUENCE|>`: Full DNA sequence.
+- `<|FLANK_BEFORE|>` and `<|FLANK_AFTER|>`: Optional upstream/downstream context sequences.
+- `<|ORGANISM|>`: Optional organism name (truncated to a maximum of 10 characters in training).
+- `<|GENE|>`: Optional gene name (truncated to a maximum of 10 characters in training).
+- `<|TARGET|>`: Separation token for label prediction.
+The model is expected to predict the class label, `[EXON]` or `[INTRON]`, as the next token after `<|TARGET|>`.
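The prompt format described above can be assembled with a small helper. This is a minimal sketch, not the repository's own code: the function name `build_prompt`, the `separator` parameter (the card does not say whether fields are joined by newlines or concatenated directly), and the choice to skip optional fields that are not provided are all assumptions. Only the special tokens and the 10-character truncation come from the card.

```python
from typing import Optional

# Truncation length the card states for organism and gene names during training.
MAX_META_CHARS = 10


def build_prompt(
    sequence: str,
    flank_before: Optional[str] = None,
    flank_after: Optional[str] = None,
    organism: Optional[str] = None,
    gene: Optional[str] = None,
    separator: str = "",  # assumption: the card does not specify how fields are joined
) -> str:
    """Assemble a prompt string from the special tokens listed in the card.

    Hypothetical helper: optional fields are simply omitted when not given,
    which may differ from how the training data was built.
    """
    parts = [f"<|SEQUENCE|>{sequence}"]
    if flank_before:
        parts.append(f"<|FLANK_BEFORE|>{flank_before}")
    if flank_after:
        parts.append(f"<|FLANK_AFTER|>{flank_after}")
    if organism:
        # Mirror the 10-character truncation used in training.
        parts.append(f"<|ORGANISM|>{organism[:MAX_META_CHARS]}")
    if gene:
        parts.append(f"<|GENE|>{gene[:MAX_META_CHARS]}")
    # The prompt ends with the separation token; the label is predicted after it.
    parts.append("<|TARGET|>")
    return separator.join(parts)


prompt = build_prompt(
    "ACGAAGGGTAAGCC",
    flank_before="ACGT",
    flank_after="ACGT",
    organism="Homo sapiens",
    gene="BRCA1",
)
print(prompt)
```

The resulting string can then be passed to the tokenizer loaded in the Usage section. If predictions look off, the field separator is the first thing to check against the preprocessing scripts in the GitHub repository.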
+## Data
+The model was trained on a processed version of GenBank sequences spanning multiple species.
+## Publications
+- Awarded **2nd place** at the [Symposium on Knowledge Discovery, Mining and Learning (KDMiLe) - SBC](https://doi.org/10.5753/kdmile.2025.247575), a national event held in Fortaleza, Ceará, Brazil.
+- Later accepted for publication at the [International Conference on BioInformatics and BioEngineering (BIBE) - IEEE](pending), held in Athens, Greece.
+## Training
+- Trained on a system with 8x H100 GPUs.
+## GitHub Repository
+The full code for **data processing, model training, and inference** is available on GitHub:
+[CodingDNATransformers](https://github.com/GustavoHCruz/CodingDNATransformers)
+You can find scripts for:
+- Preprocessing GenBank sequences
+- Fine-tuning models
+- Evaluating and using the trained models
+## Reference
+If you use this model in scientific research, please cite:
+> [future IEEE link](pending)