GustavoHCruz/ExInGPT

sequence-classification

GustavoHCruz committed on Nov 17, 2025

Commit d44e01f · verified · 1 parent: aa2d98d

Update README.md

Files changed (1)

README.md +30 -4

@@ -16,11 +16,15 @@ tags:
 GPT-2 finetuned model for **classifying DNA sequences** into **introns** and **exons**, trained on a large cross-species GenBank dataset.
-## Architecture
 - Base model: GPT-2
 - Approach: Full-sequence classification
 - Framework: PyTorch + Hugging Face Transformers
 ## Usage
 ```python
@@ -51,7 +55,9 @@ The model expects the following input format:
 The model should predict the next token as the class label: `[EXON]` or `[INTRON]`.
-## Data
 The model was trained on a processed version of GenBank sequences spanning multiple species.
@@ -63,11 +69,31 @@ The model was trained on a processed version of GenBank sequences spanning multi
 - **Short Paper (International)**
   Presented at the _IEEE International Conference on Bioinformatics and BioEngineering (BIBE 2025)_, held in Athens, Greece.
   [https://doi.org/10.1109/BIBE66822.2025.00113](https://doi.org/10.1109/BIBE66822.2025.00113)
 ## Training
 - Trained on an architecture with 8x H100 GPUs.
 ## GitHub Repository
 The full code for **data processing, model training, and inference** is available on GitHub:

 GPT-2 finetuned model for **classifying DNA sequences** into **introns** and **exons**, trained on a large cross-species GenBank dataset.
+---
+## Model Architecture
 - Base model: GPT-2
 - Approach: Full-sequence classification
 - Framework: PyTorch + Hugging Face Transformers
+---
 ## Usage
 ```python
 The model should predict the next token as the class label: `[EXON]` or `[INTRON]`.
+---
+## Dataset
 The model was trained on a processed version of GenBank sequences spanning multiple species.
 - **Short Paper (International)**
   Presented at the _IEEE International Conference on Bioinformatics and BioEngineering (BIBE 2025)_, held in Athens, Greece.
   [https://doi.org/10.1109/BIBE66822.2025.00113](https://doi.org/10.1109/BIBE66822.2025.00113)
 ## Training
 - Trained on an architecture with 8x H100 GPUs.
+---
+## Metrics
+**Average accuracy:** **0.9985**
+| Class  | Precision | Recall | F1-Score |
+|--------|-----------|--------|----------|
+| **Intron** | 0.9977 | 0.9973 | 0.9975 |
+| **Exon**   | 0.9988 | 0.9990 | 0.9989 |
+---
+### **Notes**
+- Metrics were computed on the full test set.
+- Classes are approximately balanced, allowing direct interpretation of the scores.
+- The model operates on raw nucleotide sequences without additional biological features.
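
As a quick consistency check on the metrics table, the F1 scores can be recomputed from the reported precision and recall. The following is a minimal sketch and is not part of this commit or the repository's code: the helper names `f1_score` and `check_table` are hypothetical, and the numbers are copied by hand from the table.

```python
# Values copied from the Metrics table in this diff (not loaded from the model).
reported = {
    "Intron": {"precision": 0.9977, "recall": 0.9973, "f1": 0.9975},
    "Exon": {"precision": 0.9988, "recall": 0.9990, "f1": 0.9989},
}


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (hypothetical helper)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def check_table(rows: dict, tol: float = 1e-4) -> dict:
    """Recompute F1 per class and flag whether it matches the reported value.

    Returns {class_name: (recomputed_f1_rounded_to_4_places, matches_within_tol)}.
    """
    result = {}
    for name, m in rows.items():
        f1 = f1_score(m["precision"], m["recall"])
        result[name] = (round(f1, 4), abs(f1 - m["f1"]) <= tol)
    return result


if __name__ == "__main__":
    for name, (f1, ok) in check_table(reported).items():
        print(f"{name}: recomputed F1 = {f1:.4f}, consistent = {ok}")
```

The check only confirms that each reported F1 is the harmonic mean of the precision and recall shown in the same row, to four decimal places. It says nothing about how the metrics were measured.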
 ## GitHub Repository
 The full code for **data processing, model training, and inference** is available on GitHub: