Gustavo Henrique Ferreira Cruz committed on
Commit
65c688a
·
verified ·
1 Parent(s): a5a689e

Update README.md

Files changed (1)
  1. README.md +69 -1
README.md CHANGED
@@ -10,4 +10,72 @@ tags:
  - introns
  - exons
  - GPT
- ---
+ ---
+
+ # Exons and Introns Classifier
+
+ A GPT-2 model fine-tuned for **classifying DNA sequences** as **introns** or **exons**, trained on a large cross-species GenBank dataset.
+
+ ## Architecture
+ - Base model: GPT-2
+ - Approach: Full-sequence classification
+ - Framework: PyTorch + Hugging Face Transformers
+
+ ## Usage
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+
+ # Load the tokenizer and the fine-tuned classifier from the Hugging Face Hub
+ tokenizer = AutoTokenizer.from_pretrained("gu-dudi/ExInGPT")
+ model = AutoModelForSequenceClassification.from_pretrained("gu-dudi/ExInGPT")
+ ```
+
+ Prompt format:
+
+ The model expects the following input format:
+
+ ```
+ <|SEQUENCE|>ACGAAGGGTAAGCC...
+ <|FLANK_BEFORE|>ACGT...
+ <|FLANK_AFTER|>ACGT...
+ <|ORGANISM|>...
+ <|GENE|>...
+ <|TARGET|>
+ ```
+
+ - `<|SEQUENCE|>`: Full DNA sequence to classify.
+ - `<|FLANK_BEFORE|>` and `<|FLANK_AFTER|>`: Optional upstream/downstream context sequences.
+ - `<|ORGANISM|>`: Optional organism name (truncated to at most 10 characters during training).
+ - `<|GENE|>`: Optional gene name (truncated to at most 10 characters during training).
+ - `<|TARGET|>`: Separator token that marks where the label is predicted.
+
+ The model should predict the next token as the class label: `[EXON]` or `[INTRON]`.
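The prompt layout described above can be assembled with a small helper. This is a minimal sketch, not code from the repository: `build_prompt`, `MAX_META_CHARS`, and the `sep` parameter are hypothetical names, and two behaviours are assumptions rather than documented facts: optional fields are omitted entirely when not supplied, and fields are joined with no separator by default. The 10-character truncation of the organism and gene names mirrors what the README says was done in training.

```python
from typing import Optional

# README: organism and gene names were truncated to 10 characters in training.
MAX_META_CHARS = 10


def build_prompt(
    sequence: str,
    flank_before: Optional[str] = None,
    flank_after: Optional[str] = None,
    organism: Optional[str] = None,
    gene: Optional[str] = None,
    sep: str = "",  # assumption: fields are concatenated without a separator
) -> str:
    """Assemble a prompt using the field order shown in the README."""
    parts = [f"<|SEQUENCE|>{sequence}"]
    # Assumption: an optional field is left out entirely when it is not given.
    if flank_before:
        parts.append(f"<|FLANK_BEFORE|>{flank_before}")
    if flank_after:
        parts.append(f"<|FLANK_AFTER|>{flank_after}")
    if organism:
        parts.append(f"<|ORGANISM|>{organism[:MAX_META_CHARS]}")
    if gene:
        parts.append(f"<|GENE|>{gene[:MAX_META_CHARS]}")
    # The prompt always ends with the token after which the label is predicted.
    parts.append("<|TARGET|>")
    return sep.join(parts)


prompt = build_prompt(
    "ACGAAGGGTAAGCC",
    flank_before="ACGT",
    flank_after="ACGT",
    organism="Homo sapiens",
    gene="BRCA1",
)
print(prompt)
```

If the training prompts placed each field on its own line, passing `sep="\n"` reproduces that layout; the preprocessing scripts in the GitHub repository are the place to confirm which variant the model was trained on.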
+
+ ## Data
+
+ The model was trained on a processed version of GenBank sequences spanning multiple species.
+
+ ## Publications
+
+ - Awarded **2nd place** at the [Symposium on Knowledge Discovery, Mining and Learning (KDMiLe) - SBC](https://doi.org/10.5753/kdmile.2025.247575), a national event held in Fortaleza, Ceará, Brazil.
+ - Later accepted for publication at the [International Conference on BioInformatics and BioEngineering (BIBE) - IEEE](pending), held in Athens, Greece.
+
+ ## Training
+
+ - Trained on 8x H100 GPUs.
+
+ ## GitHub Repository
+
+ The full code for **data processing, model training, and inference** is available on GitHub:
+ [CodingDNATransformers](https://github.com/GustavoHCruz/CodingDNATransformers)
+
+ You can find scripts for:
+ - Preprocessing GenBank sequences
+ - Fine-tuning models
+ - Evaluating and using the trained models
+
+ ## Reference
+
+ If you use this model in scientific research, please cite:
+
+ > [future IEEE link](pending)