GustavoHCruz/DNATranslatorGPT2

 ---
 license: mit
+base_model:
+  - openai-community/gpt2
+tags:
+  - genomics
+  - bioinformatics
+  - DNA
+  - sequence-translation
+  - proteins
+  - GPT
 ---
+# DNA to Protein Translator
+A GPT-2 model fine-tuned to **translate DNA** into **protein sequences**, trained on a large cross-species GenBank dataset.
+---
+## Model Architecture
+- Base model: GPT-2 (`openai-community/gpt2`)
+- Task: DNA-to-protein sequence translation
+---
+## Usage
+You can use this model through its own custom pipeline:
+```python
+from transformers import pipeline
+pipe = pipeline(
+  task="gpt2-dna-translator",
+  model="GustavoHCruz/DNATranslatorGPT2",
+  trust_remote_code=True,  # required: the custom pipeline code is loaded from the model repo
+)
+out = pipe({
+  "sequence": "GTTTCTTTGCTTTTTAMGCTTGTATCTATTCTTCCATCGTAGACTGACCTGGTCATTTCTTTGCATCCAACGTA",
+  "organism": "Homo sapiens"
+})
+print(out)  # LTWSFLCIQR
+out = pipe({
+  "sequence": "ACACCAGCCTAGTTCTATGTCAGGTTCTAAAATATTTTCTGGTTCAATAAATAAAACATCAACATCTCACATAAAAGAAGTACGGAAAAGATTTAAAGGCAGTAACATATGAACGTAGGACGTTTAGGAGAAAAATGCTAAAAAAGTAGCTATTGTTAATTGAACATTACTCAGGGATGATCGGTTGTTTTTGTATTGACTTACCAAGACCACCATTGCCGAGTGCTGCATCCATTTCACGTTCTTCTAATTCTTCAATATCTAAATTCAACTCATAAAGAGCTTAATCA",
+  "organism": "Rotaria socialis"
+})
+print(out) # MDAALGNGGL
+```
+This model uses the same maximum context length as standard GPT-2 (1024 tokens). Training examples were constructed so that the DNA sequence and the resulting protein always fit together within this context. An additional, highly recommended input is available: `organism`.
+The pipeline applies the following rules so that inputs match the conditions used during training:
+- DNA sequences are limited to 1000 tokens (each nucleotide becomes one token).
+- The organism (raw text) is limited to a maximum of 10 characters.
+- The generated response is limited to 1024 minus the number of input tokens. Even when the input is at its limit, the generation budget is at least 11 new tokens.
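The limits above amount to a simple token budget, sketched below using only the numbers stated in this card. The helper name `max_new_tokens` is hypothetical and is not part of the pipeline. How the pipeline counts the organism text and marker tokens toward the input length is an assumption.

```python
CONTEXT_LENGTH = 1024   # GPT-2 context window, as stated in this card
MAX_DNA_TOKENS = 1000   # one token per nucleotide
MIN_NEW_TOKENS = 11     # generation budget when the input is at its limit

# Largest input length implied by the context window and the minimum budget.
MAX_INPUT_TOKENS = CONTEXT_LENGTH - MIN_NEW_TOKENS


def max_new_tokens(input_length: int) -> int:
    """Return the generation budget for a prompt of `input_length` tokens.

    Hypothetical helper: it only restates the rule
    "1024 minus the number of input tokens".
    """
    if not 0 < input_length <= MAX_INPUT_TOKENS:
        raise ValueError(f"input_length must be in 1..{MAX_INPUT_TOKENS}")
    return CONTEXT_LENGTH - input_length


print(MAX_INPUT_TOKENS)                   # 1013
print(MAX_INPUT_TOKENS - MAX_DNA_TOKENS)  # 13 tokens left for organism and markers
print(max_new_tokens(1013))               # 11
print(max_new_tokens(100))                # 924
```

The 13 tokens that remain after a full-length DNA sequence would have to cover the organism text and any marker tokens. The exact split is not documented here.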
+---
+## Custom Usage Information
+When calling the model directly instead of through the pipeline, it expects the following prompt format:
+```
+<|DNA|>[DNA_G][DNA_T][DNA_T][DNA_T]...<|ORGANISM|>Homo sapiens
+```
+The model generates a response in the following format:
+```
+<|PROTEIN|>[PROT_L][PROT_T][PROT_W]...<|END|>
+```
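As a rough illustration of the two formats above, the sketch below builds a prompt string and extracts the amino-acid letters from a response. The function names `build_prompt` and `parse_protein` are hypothetical. The sketch works on plain strings only, does not apply the length limits, and assumes nothing about how the tokenizer or the pipeline handle these markers internally.

```python
import re


def build_prompt(sequence: str, organism: str) -> str:
    """Format a DNA string and organism name in the prompt layout shown above."""
    # Each nucleotide is wrapped in its own [DNA_x] marker.
    dna_tokens = "".join(f"[DNA_{base}]" for base in sequence.upper())
    return f"<|DNA|>{dna_tokens}<|ORGANISM|>{organism}"


def parse_protein(generated: str) -> str:
    """Extract the amino-acid letters from a response in the layout shown above."""
    # Take everything after <|PROTEIN|> up to <|END|> (or the end of the text).
    match = re.search(r"<\|PROTEIN\|>(.*?)(?:<\|END\|>|$)", generated, flags=re.DOTALL)
    if match is None:
        raise ValueError("no <|PROTEIN|> section found")
    # Collect the single character inside each [PROT_x] marker.
    return "".join(re.findall(r"\[PROT_(.)\]", match.group(1)))


print(build_prompt("GTTT", "Homo sapiens"))
# <|DNA|>[DNA_G][DNA_T][DNA_T][DNA_T]<|ORGANISM|>Homo sapiens
print(parse_protein("<|PROTEIN|>[PROT_L][PROT_T][PROT_W]<|END|>"))
# LTW
```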
+---
+## Dataset
+The model was trained on a processed version of GenBank sequences spanning multiple species, available at the [DNA Coding Regions Dataset](https://huggingface.co/datasets/GustavoHCruz/DNA_coding_regions).
+---
+## Training
+- Hardware: 8x H100 GPUs.
+---
+## Metrics
+The model is still in the early stages of evaluation. It currently reaches an average similarity of approximately 0.75 (derived from the edit distance) to the target sequences in the test set.
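The card does not state how the similarity score is normalized. One common choice, shown here purely as an assumption, is one minus the Levenshtein distance divided by the length of the longer sequence. The function names are hypothetical and this is not necessarily the evaluation code used for the reported number.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(predicted: str, target: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer sequence."""
    longest = max(len(predicted), len(target))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(predicted, target) / longest


print(similarity("LTWS", "LTWA"))  # 0.75 (one substitution in four residues)
```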
+---
+## GitHub Repository
+The full code for **data processing, model training, and inference** is available on GitHub:
+[CodingDNATransformers](https://github.com/GustavoHCruz/CodingDNATransformers)
+You can find scripts for:
+- Preprocessing GenBank sequences
+- Fine-tuning models
+- Evaluating and using the trained models