AmpGPT2

AmpGPT2 is a language model capable of generating de novo antimicrobial peptides (AMPs). Over 95% of the sequences generated by AmpGPT2 are predicted to have antimicrobial activity.

Model description

AmpGPT2 is a fine-tuned version of nferruz/ProtGPT2, which is based on the GPT2 Transformer architecture.

Model      Sequences generated    AMP percentage (AMP%)    Average length
AmpGPT2    1000                   95.86                    64.08
ProtGPT2   1000                   51.85                    222.59

The results demonstrate that AmpGPT2 outperforms ProtGPT2 in AMP%, suggesting that the model has learned from the AMP-specific training data.
To validate the results, the Antimicrobial Peptide Scanner vr.2 (https://www.dveltri.com/ascan/v2/ascan.html) was used, a deep learning tool specifically designed for AMP recognition.
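The AMP% values above correspond to the fraction of generated sequences that the scanner classifies as antimicrobial. A minimal sketch of that calculation, assuming the scanner's per-sequence labels have been collected into a Python list (the 'predictions' list and its label strings are placeholders):

# Sketch: compute AMP% from per-sequence scanner predictions.
# 'predictions' is a hypothetical list of labels exported from the scanner.
predictions = ["AMP", "Non-AMP", "AMP"]  # placeholder example
amp_percentage = 100 * sum(label == "AMP" for label in predictions) / len(predictions)
print(f"AMP%: {amp_percentage:.2f}")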

Training and evaluation data

AmpGPT2 was trained on 32,014 AMP sequences from the Compass database (https://compass.mathematik.uni-marburg.de/).
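As a rough illustration, sequences exported from Compass in FASTA format could be read into a plain Python list before tokenization. The file name 'compass_amps.fasta' is a placeholder, and the actual preprocessing used for this model is not documented here.

# Minimal sketch: read AMP sequences from a FASTA export (placeholder file name).
def read_fasta(path):
    sequences, current = [], []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if current:
                    sequences.append("".join(current))
                    current = []
            elif line:
                current.append(line)
        if current:
            sequences.append("".join(current))
    return sequences

training_sequences = read_fasta("compass_amps.fasta")
print(f"Loaded {len(training_sequences)} AMP sequences")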

How to use AmpGPT2

The example code below uses the generation settings that worked best during testing. The 'num_return_sequences' parameter specifies the number of sequences generated. When generating more than 100 sequences, I recommend doing so in batches (see the sketch after the example). The results can then be checked with the peptide scanner.

from transformers import pipeline
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the fine-tuned model and tokenizer, then build a generation pipeline from them.
model_amp = GPT2LMHeadModel.from_pretrained('wabu/AmpGPT2')
tokenizer_amp = GPT2Tokenizer.from_pretrained('wabu/AmpGPT2')
ampgpt2 = pipeline('text-generation', model=model_amp, tokenizer=tokenizer_amp)

# Generate 10 candidate sequences with the settings found during testing.
amp_sequences = ampgpt2(
    "",
    do_sample=True,
    repetition_penalty=1.2,
    num_return_sequences=10,
    eos_token_id=0,
)

# Print the sequences in FASTA format, removing line breaks from the generated text.
for i, seq in enumerate(amp_sequences):
    sequence_identifier = f"Sequence_{i + 1}"
    sequence = seq['generated_text'].replace('\n', '').strip()
    print(f">{sequence_identifier}\n{sequence}")

Training hyperparameters and results

The following hyperparameters were used during training:

  • learning_rate: 1e-05
  • train_batch_size: 32
  • eval_batch_size: 32
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • num_epochs: 50.0

Training Loss    Epoch    Validation Loss    Accuracy
3.7948           50.0     3.9890             0.4213
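For reference, a configuration along these lines can be expressed with the Transformers Trainer API. This is only a sketch of the hyperparameters listed above: the output directory, dataset variables, and the exact training script are placeholders, not the code actually used for this model.

from transformers import (
    GPT2LMHeadModel,
    GPT2Tokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Start from the base ProtGPT2 checkpoint.
base_model = GPT2LMHeadModel.from_pretrained('nferruz/ProtGPT2')
base_tokenizer = GPT2Tokenizer.from_pretrained('nferruz/ProtGPT2')

# Causal language modeling collator (no masked LM objective).
data_collator = DataCollatorForLanguageModeling(tokenizer=base_tokenizer, mlm=False)

# Hyperparameters as listed above; 'output_dir' is a placeholder.
training_args = TrainingArguments(
    output_dir="ampgpt2-finetune",
    learning_rate=1e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    seed=42,
    lr_scheduler_type="linear",
    num_train_epochs=50,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
)

trainer = Trainer(
    model=base_model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,   # tokenized AMP sequences (placeholder)
    eval_dataset=eval_dataset,     # held-out AMP sequences (placeholder)
    tokenizer=base_tokenizer,
)
# trainer.train()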

Framework versions

  • Transformers 4.38.0.dev0
  • Pytorch 2.2.0+cu121
  • Datasets 2.16.1
  • Tokenizers 0.15.0

The model was trained on four NVIDIA A100 GPUs.
