DNA To Proteins Translator

A fine-tuned GPT-2 model that translates DNA into protein sequences, trained on a large cross-species GenBank dataset.


Model Architecture

  • Base model: GPT-2
  • Approach: DNA To Proteins Translation

Usage

You can use this model through its own custom pipeline:

from transformers import pipeline

pipe = pipeline(
  task="gpt2-dna-translator",
  model="GustavoHCruz/DNATranslatorGPT2",
  trust_remote_code=True,
)

out = pipe({
  "sequence": "GTTTCTTTGCTTTTTAMGCTTGTATCTATTCTTCCATCGTAGACTGACCTGGTCATTTCTTTGCATCCAACGTA",
  "organism": "Homo sapiens"
})
print(out) # LTWSFLCIQR

out = pipe({
  "sequence": "ACACCAGCCTAGTTCTATGTCAGGTTCTAAAATATTTTCTGGTTCAATAAATAAAACATCAACATCTCACATAAAAGAAGTACGGAAAAGATTTAAAGGCAGTAACATATGAACGTAGGACGTTTAGGAGAAAAATGCTAAAAAAGTAGCTATTGTTAATTGAACATTACTCAGGGATGATCGGTTGTTTTTGTATTGACTTACCAAGACCACCATTGCCGAGTGCTGCATCCATTTCACGTTCTTCTAATTCTTCAATATCTAAATTCAACTCATAAAGAGCTTAATCA",
  "organism": "Rotaria socialis"
})
print(out) # MDAALGNGGL

This model uses the same maximum context length as standard GPT-2 (1024 tokens). Training was performed so that the DNA sequence and the resulting protein always fit within this context. An additional (and highly recommended) context field is available: organism.

When using this pipeline, the following rules are applied so that inputs match the conditions used during training:

  • DNA sequences will be limited to 1000 tokens (each nucleotide becomes a token).
  • The organism (raw text) is limited to a maximum of 10 characters.
  • The generated response is limited to 1024 minus the size of the received input (in tokens). Even when the input is at its limit, a minimum of 11 new tokens is generated.
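The rules above can be sketched as a small helper function. This is an illustration only (the function name and the exact token accounting are assumptions, not the pipeline's actual implementation):

```python
def prepare_inputs(dna: str, organism: str):
    """Sketch of the pipeline's documented input rules (hypothetical helper)."""
    dna = dna[:1000]          # DNA limited to 1000 nucleotides (one token each)
    organism = organism[:10]  # organism text capped at 10 characters
    # Rough input-token estimate; the real tokenizer may count differently.
    input_len = len(dna) + len(organism) + 2
    # Generation budget: remaining context, but never fewer than 11 new tokens.
    max_new_tokens = max(1024 - input_len, 11)
    return dna, organism, max_new_tokens
```

For example, a 1000-nucleotide sequence with organism "Homo sapiens" would leave a budget of 12 new tokens under this accounting.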

Custom Usage Information

Prompt format:

The model expects the following input format:

<|DNA|>[DNA_G][DNA_T][DNA_T][DNA_T]...<|ORGANISM|>Homo sapiens

The model will generate a response in the following expected format:

<|PROTEIN|>[PROT_L][PROT_T][PROT_W]...<|END|>
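As a minimal sketch of these formats, the prompt can be built and the response parsed as follows. The helper names are assumptions for illustration; the special tokens (`<|DNA|>`, `<|ORGANISM|>`, `<|PROTEIN|>`, `<|END|>`, `[DNA_X]`, `[PROT_X]`) come from the formats documented above:

```python
import re

def build_prompt(dna: str, organism: str) -> str:
    # Encode each nucleotide as a [DNA_X] token, then append the organism context.
    dna_tokens = "".join(f"[DNA_{nt}]" for nt in dna)
    return f"<|DNA|>{dna_tokens}<|ORGANISM|>{organism}"

def parse_output(text: str) -> str:
    # Extract amino-acid letters from the <|PROTEIN|>[PROT_X]...<|END|> span.
    body = text.split("<|PROTEIN|>", 1)[1].split("<|END|>", 1)[0]
    return "".join(re.findall(r"\[PROT_(.)\]", body))
```

The custom pipeline handles this formatting internally; these helpers only show what the model sees and emits.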

Dataset

The model was trained on a processed version of GenBank sequences spanning multiple species, available at the DNA Coding Regions Dataset.


Training

  • Trained on 8x NVIDIA H100 GPUs.

Metrics

The original similarity scores (approximately 70% similarity) were obtained under a random sequence-level split, which allows shared proteins and closely related organisms to appear in both the training and test sets. While this setup is useful for measuring performance on easy and redundant cases, it may overestimate generalization performance.

For subsequent work, we adopt stricter organism-disjoint and unique-target evaluation protocols, which significantly reduce shortcut effects and lead to lower but more realistic scores.


GitHub Repository

The full code for data processing, model training, and inference is available on GitHub:
CodingDNATransformers

You can find scripts for:

  • Preprocessing GenBank sequences
  • Fine-tuning models
  • Evaluating and using the trained models