GustavoHCruz committed on
Commit
a6c29ec
·
verified ·
1 Parent(s): 36b2010

Upload folder using huggingface_hub

Files changed (1)
  1. README.md +105 -0
README.md CHANGED
@@ -1,3 +1,108 @@
1
  ---
2
  license: mit
3
+ base_model:
4
+ - openai-community/gpt2
5
+ tags:
6
+ - genomics
7
+ - bioinformatics
8
+ - DNA
9
+ - sequence-translation
10
+ - proteins
11
+ - GPT
12
  ---
13
+
14
+ # DNA To Proteins Translator
15
+
16
+ A GPT-2 model fine-tuned to **translate DNA** into **protein sequences**, trained on a large cross-species GenBank dataset.
17
+
18
+ ---
19
+
20
+ ## Model Architecture
21
+
22
+ - Base model: GPT-2
23
+ - Approach: DNA-to-protein translation
24
+
25
+ ---
26
+
27
+ ## Usage
28
+
29
+ You can use this model through its own custom pipeline:
30
+
31
+ ```python
32
+ from transformers import pipeline
33
+
34
+ pipe = pipeline(
35
+ task="gpt2-dna-translator",
36
+ model="GustavoHCruz/DNATranslatorGPT2",
37
+ trust_remote_code=True,
38
+ )
39
+
40
+ out = pipe({
41
+ "sequence": "GTTTCTTTGCTTTTTAMGCTTGTATCTATTCTTCCATCGTAGACTGACCTGGTCATTTCTTTGCATCCAACGTA",
42
+ "organism": "Homo sapiens"
43
+ })
44
+ print(out) # LTWSFLCIQR
45
+
46
+ out = pipe({
47
+ "sequence": "ACACCAGCCTAGTTCTATGTCAGGTTCTAAAATATTTTCTGGTTCAATAAATAAAACATCAACATCTCACATAAAAGAAGTACGGAAAAGATTTAAAGGCAGTAACATATGAACGTAGGACGTTTAGGAGAAAAATGCTAAAAAAGTAGCTATTGTTAATTGAACATTACTCAGGGATGATCGGTTGTTTTTGTATTGACTTACCAAGACCACCATTGCCGAGTGCTGCATCCATTTCACGTTCTTCTAATTCTTCAATATCTAAATTCAACTCATAAAGAGCTTAATCA",
48
+ "organism": "Rotaria socialis"
49
+ })
50
+ print(out) # MDAALGNGGL
51
+ ```
52
+
53
+ This model keeps the standard GPT-2 maximum context length of 1024 tokens. Training examples were built so that the DNA sequence and its resulting protein always fit within this context. An optional (and highly recommended) extra input is available: `organism`.
54
+
55
+ The pipeline applies the following rules so that inference matches the conditions used during training:
56
+
57
+ - DNA sequences are limited to 1000 tokens (each nucleotide is one token).
58
+ - The organism (raw text) is limited to a maximum of 10 characters.
59
+ - The generated response is limited to 1024 tokens minus the size of the received input. When the input is at its limit, 11 new tokens can still be generated.
60
+
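The rules above can be sketched as a small pre-processing helper. This is only an illustration, not the pipeline's actual code: the names `apply_input_limits` and `generation_budget`, the choice to truncate over-long inputs rather than reject them, and the `prompt_token_count` argument are all assumptions.

```python
MAX_CONTEXT = 1024        # GPT-2 context length stated above
MAX_DNA_TOKENS = 1000     # one token per nucleotide
MAX_ORGANISM_CHARS = 10   # raw-text limit for the organism


def apply_input_limits(sequence: str, organism: str = "") -> tuple[str, str]:
    """Cut the inputs down to the documented limits (truncation is an assumption)."""
    return sequence[:MAX_DNA_TOKENS], organism[:MAX_ORGANISM_CHARS]


def generation_budget(prompt_token_count: int) -> int:
    """Tokens left for the protein: 1024 minus the size of the tokenized input."""
    return max(MAX_CONTEXT - prompt_token_count, 0)


dna, org = apply_input_limits("ACGT" * 300, "Homo sapiens")
print(len(dna), repr(org))     # 1000 'Homo sapie'
print(generation_budget(900))  # 124
```

The budget takes the tokenized prompt length as an argument because the exact number of special and organism tokens depends on the tokenizer shipped with the model.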
61
+ ---
62
+
63
+ ## Custom Usage Information
64
+
65
+ Prompt format:
66
+
67
+ The model expects the following input format:
68
+
69
+ ```
70
+ <|DNA|>[DNA_G][DNA_T][DNA_T][DNA_T]...<|ORGANISM|>Homo sapiens
71
+ ```
72
+
73
+ The model generates a response in the following format:
74
+
75
+ ```
76
+ <|PROTEIN|>[PROT_L][PROT_T][PROT_W]...<|END|>
77
+ ```
78
+
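For use without the custom pipeline, the two formats above can be assembled and decoded with plain string handling. The sketch below is a hedged illustration: `build_prompt` and `parse_protein` are hypothetical helper names, and omitting the `<|ORGANISM|>` tag when no organism is given is an assumption, not documented behavior.

```python
import re

# Matches one protein token such as [PROT_L] and captures the residue letter.
PROT_TOKEN = re.compile(r"\[PROT_([^\]]+)\]")


def build_prompt(sequence: str, organism: str = "") -> str:
    """Wrap each nucleotide as [DNA_x] and append the optional organism text."""
    dna_tokens = "".join(f"[DNA_{nt}]" for nt in sequence.upper())
    prompt = f"<|DNA|>{dna_tokens}"
    if organism:  # assumption: the tag is left out when no organism is provided
        prompt += f"<|ORGANISM|>{organism}"
    return prompt


def parse_protein(generated: str) -> str:
    """Extract the residues between <|PROTEIN|> and <|END|> as a plain string."""
    body = generated.split("<|PROTEIN|>", 1)[-1].split("<|END|>", 1)[0]
    return "".join(PROT_TOKEN.findall(body))


print(build_prompt("GTTT", "Homo sapiens"))
# <|DNA|>[DNA_G][DNA_T][DNA_T][DNA_T]<|ORGANISM|>Homo sapiens
print(parse_protein("<|PROTEIN|>[PROT_L][PROT_T][PROT_W]<|END|>"))
# LTW
```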
79
+ ---
80
+
81
+ ## Dataset
82
+
83
+ The model was trained on a processed version of GenBank sequences spanning multiple species, available as the [DNA Coding Regions Dataset](https://huggingface.co/datasets/GustavoHCruz/DNA_coding_regions).
84
+
85
+ ---
86
+
87
+ ## Training
88
+
89
+ - Trained on a system with 8x H100 GPUs.
90
+
91
+ ---
92
+
93
+ ## Metrics
94
+
95
+ Evaluation is still at an early stage. The model currently reaches an average similarity of approximately 0.75 with the target sequences in the test set, where similarity is calculated from the edit distance.
96
+
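One common way to turn an edit distance into a similarity score is to normalize it by the length of the longer sequence. The sketch below assumes Levenshtein distance and that normalization; the exact formula behind the reported 0.75 is not stated here, so treat both as assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(predicted: str, target: str) -> float:
    """1.0 for identical sequences, 0.0 when every position must change."""
    longest = max(len(predicted), len(target))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(predicted, target) / longest


print(similarity("LTWSFLCIQR", "LTWSFLCIQR"))  # 1.0
print(similarity("LTWSFLCIQR", "LTWSALCIQK"))  # 0.8
```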
97
+ ---
98
+
99
+ ## GitHub Repository
100
+
101
+ The full code for **data processing, model training, and inference** is available on GitHub:
102
+ [CodingDNATransformers](https://github.com/GustavoHCruz/CodingDNATransformers)
103
+
104
+ You can find scripts for:
105
+
106
+ - Preprocessing GenBank sequences
107
+ - Fine-tuning models
108
+ - Evaluating and using the trained models