GustavoHCruz committed
Commit d44e01f (verified) · 1 parent: aa2d98d

Update README.md

  1. README.md +30 -4
README.md CHANGED
@@ -16,11 +16,15 @@ tags:
16
 
17
  GPT-2 finetuned model for **classifying DNA sequences** into **introns** and **exons**, trained on a large cross-species GenBank dataset.
18
 
19
- ## Architecture
20
  - Base model: GPT-2
21
  - Approach: Full-sequence classification
22
  - Framework: PyTorch + Hugging Face Transformers
23
-
24
  ## Usage
25
 
26
  ```python
@@ -51,7 +55,9 @@ The model expects the following input format:
51
 
52
  The model should predict the next token as the class label: `[EXON]` or `[INTRON]`.
53
 
54
- ## Data
55
 
56
  The model was trained on a processed version of GenBank sequences spanning multiple species.
57
 
@@ -63,11 +69,31 @@ The model was trained on a processed version of GenBank sequences spanning multi
63
  - **Short Paper (International)**
64
  Presented at the _IEEE International Conference on Bioinformatics and BioEngineering (BIBE 2025)_, held in Athens, Greece.
65
  [https://doi.org/10.1109/BIBE66822.2025.00113](https://doi.org/10.1109/BIBE66822.2025.00113)
66
-
67
  ## Training
68
 
69
  - Trained on an architecture with 8x H100 GPUs.
70
 
71
  ## GitHub Repository
72
 
73
  The full code for **data processing, model training, and inference** is available on GitHub:
 
16
 
17
  GPT-2 finetuned model for **classifying DNA sequences** into **introns** and **exons**, trained on a large cross-species GenBank dataset.
18
 
19
+ ---
20
+
21
+ ## Model Architecture
22
  - Base model: GPT-2
23
  - Approach: Full-sequence classification
24
  - Framework: PyTorch + Hugging Face Transformers
25
+
26
+ ---
27
+
28
  ## Usage
29
 
30
  ```python
 
55
 
56
  The model should predict the next token as the class label: `[EXON]` or `[INTRON]`.
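
As a rough illustration of the sentence above, the snippet below reads a class label off next-token scores by comparing only the two label tokens. The `pick_label` helper, the plain score mapping, and the decision to ignore every other token are assumptions made for this sketch. None of them come from the repository's inference code.

```python
from typing import Mapping, Sequence

# Label tokens named in the README.
LABELS = ("[EXON]", "[INTRON]")


def pick_label(
    next_token_scores: Mapping[str, float],
    labels: Sequence[str] = LABELS,
) -> str:
    """Return the label token with the highest next-token score.

    Hypothetical helper: `next_token_scores` maps token strings to scores
    (for example logits) for the position right after the prompt. Tokens
    that are not in `labels` are ignored, which is an assumption of this
    sketch rather than documented model behavior.
    """
    present = [tok for tok in labels if tok in next_token_scores]
    if not present:
        raise ValueError("no label token found in the provided scores")
    return max(present, key=lambda tok: next_token_scores[tok])


if __name__ == "__main__":
    # Made-up scores, for illustration only.
    scores = {"[EXON]": 2.1, "[INTRON]": -0.3, "A": 5.0}
    print(pick_label(scores))
```

Restricting the comparison to the two label tokens means the result is always a valid class, even when some other token scores higher.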
57
 
58
+ ---
59
+
60
+ ## Dataset
61
 
62
  The model was trained on a processed version of GenBank sequences spanning multiple species.
63
 
 
69
  - **Short Paper (International)**
70
  Presented at the _IEEE International Conference on Bioinformatics and BioEngineering (BIBE 2025)_, held in Athens, Greece.
71
  [https://doi.org/10.1109/BIBE66822.2025.00113](https://doi.org/10.1109/BIBE66822.2025.00113)
72
+
73
  ## Training
74
 
75
  - Trained on an architecture with 8x H100 GPUs.
76
 
77
+ ---
78
+
79
+ ## Metrics
80
+
81
+ **Average accuracy:** **0.9985**
82
+
83
+ | Class | Precision | Recall | F1-Score |
84
+ |--------|-----------|--------|----------|
85
+ | **Intron** | 0.9977 | 0.9973 | 0.9975 |
86
+ | **Exon** | 0.9988 | 0.9990 | 0.9989 |
87
+
88
+ ---
89
+
90
+ ### **Notes**
91
+ - Metrics were computed on the full test set.
92
+ - Classes are approximately balanced, allowing direct interpretation of the scores.
93
+ - The model operates on raw nucleotide sequences without additional biological features.
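
The per-class F1 scores in the metrics table can be cross-checked against the reported precision and recall, since F1 is their harmonic mean. The following is a minimal sketch: `f1_score` is a hypothetical helper name, and the numbers are copied from the table.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as listed in the metrics table.
REPORTED = {
    "Intron": (0.9977, 0.9973, 0.9975),
    "Exon": (0.9988, 0.9990, 0.9989),
}

if __name__ == "__main__":
    for name, (p, r, reported_f1) in REPORTED.items():
        recomputed = f1_score(p, r)
        print(f"{name}: recomputed F1 = {recomputed:.4f}, reported = {reported_f1:.4f}")
```

For both classes the recomputed value agrees with the table to four decimal places.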
94
+
95
+
96
+
97
  ## GitHub Repository
98
 
99
  The full code for **data processing, model training, and inference** is available on GitHub: