Update README.md
README.md
CHANGED
|
@@ -18,9 +18,11 @@ AmpGPT2 is a language model capable of generating de novo antimicrobial peptides
|
|
| 18 |
|
| 19 |
AmpGPT2 is a fine-tuned version of [nferruz/ProtGPT2](https://huggingface.co/nferruz/ProtGPT2) based on the GPT2 Transformer architecture.
|
| 20 |
|
| 21 |
-
|
| 22 |
-
|
|
|
|
| 23 |
|
|
|
|
| 24 |
|
| 25 |
## Training and evaluation data
|
| 26 |
|
|
@@ -30,7 +32,7 @@ AmpGPT2 was trained using 32014 AMP sequences from the Compass (https://compass.
|
|
| 30 |
|
| 31 |
The example code below contains the ideal generation settings found while testing.
|
| 32 |
The 'num_return_sequences' parameter specifies the number of sequences generated. When generating more than 100 sequences at once, I recommend doing so in batches.
|
| 33 |
-
The results can then be checked with the peptide scanner
|
| 34 |
```
|
| 35 |
from transformers import pipeline
|
| 36 |
from transformers import GPT2LMHeadModel, GPT2Tokenizer
|
|
@@ -49,7 +51,7 @@ for i, seq in enumerate(amp_sequences):
|
|
| 49 |
print(f">{sequence_identifier}\n{sequence}")
|
| 50 |
```
|
| 51 |
|
| 52 |
-
### Training hyperparameters
|
| 53 |
|
| 54 |
The following hyperparameters were used during training:
|
| 55 |
- learning_rate: 1e-05
|
|
@@ -60,14 +62,24 @@ The following hyperparameters were used during training:
|
|
| 60 |
- lr_scheduler_type: linear
|
| 61 |
- num_epochs: 50.0
|
| 62 |
|
| 63 |
-
|
| 64 |
-
|
| 65 |
-
|
| 66 |
-
|
| 67 |
-
|
| 68 |
-
|
| 69 |
-
|
| 70 |
-
|
| 71 |
|
| 72 |
### Framework versions
|
| 73 |
|
|
@@ -75,3 +87,5 @@ The model was trained on four NVIDIA A100 GPUs.
|
|
| 75 |
- PyTorch 2.2.0+cu121
|
| 76 |
- Datasets 2.16.1
|
| 77 |
- Tokenizers 0.15.0
|
|
| 18 |
|
| 19 |
AmpGPT2 is a fine-tuned version of [nferruz/ProtGPT2](https://huggingface.co/nferruz/ProtGPT2) based on the GPT2 Transformer architecture.
|
| 20 |
|
| 21 |
+
| Training Loss | Epoch | Validation Loss | Accuracy |
|
| 22 |
+
|:-------------:|:-----:|:---------------:|:--------:|
|
| 23 |
+
| 3.7948 | 50.0 | 3.9890 | 0.4213 |
|
| 24 |
|
| 25 |
+
To validate the results, the Antimicrobial Peptide Scanner vr.2 (https://www.dveltri.com/ascan/v2/ascan.html), a deep learning tool designed specifically for AMP recognition, was used.
|
| 26 |
|
| 27 |
## Training and evaluation data
|
| 28 |
|
|
|
|
| 32 |
|
| 33 |
The example code below contains the ideal generation settings found while testing.
|
| 34 |
The 'num_return_sequences' parameter specifies the number of sequences generated. When generating more than 100 sequences at once, I recommend doing so in batches.
|
| 35 |
+
The results can then be checked with the peptide scanner.
|
| 36 |
```
|
| 37 |
from transformers import pipeline
|
| 38 |
from transformers import GPT2LMHeadModel, GPT2Tokenizer
|
|
|
|
| 51 |
print(f">{sequence_identifier}\n{sequence}")
|
| 52 |
```
|
| 53 |
|
| 54 |
+
### Training hyperparameters and results
|
| 55 |
|
| 56 |
The following hyperparameters were used during training:
|
| 57 |
- learning_rate: 1e-05
|
|
|
|
| 62 |
- lr_scheduler_type: linear
|
| 63 |
- num_epochs: 50.0
|
| 64 |
|
| 65 |
+
\begin{table}[h!]
|
| 66 |
+
\centering
|
| 67 |
+
\caption{AMP Yield Comparison between AmpGPT2 and ProtGPT2}
|
| 68 |
+
\begin{tabular}{lccc}
|
| 69 |
+
\toprule
|
| 70 |
+
Model & Total Sequences & AMP Classified & AMP Percentage (AMP\%) \\
|
| 71 |
+
\midrule
|
| 72 |
+
AmpGPT2 & 10000 & 9541 & 95.41\% \\
|
| 73 |
+
ProtGPT2 & 10000 & 5530 & 55.3\% \\
|
| 74 |
+
\bottomrule
|
| 75 |
+
\end{tabular}
|
| 76 |
+
\label{tab:amp_yield}
|
| 77 |
+
\end{table}
|
| 78 |
+
|
| 79 |
+
| Model | AMP% | Length |
|
| 80 |
+
|:-------:|:-----:|:-------:|
|
| 81 |
+
| AmpGPT2 | 95.86 | 64.08 |
|
| 82 |
+
|ProtGPT2| 51.85 | 222.59 |
|
| 83 |
|
| 84 |
### Framework versions
|
| 85 |
|
|
|
|
| 87 |
- PyTorch 2.2.0+cu121
|
| 88 |
- Datasets 2.16.1
|
| 89 |
- Tokenizers 0.15.0
|
| 90 |
+
|
| 91 |
+
The model was trained on four NVIDIA A100 GPUs.
|
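Review note: the README recommends generating in batches when more than 100 sequences are requested, but the diff does not show how. The following is a minimal sketch of one way to do that, not the author's method. The names `batch_sizes`, `generate_in_batches`, and `fake_generate` are hypothetical, and a stand-in generator replaces the real model so the snippet runs without downloading anything.

```python
from typing import Callable, Iterator, List


def batch_sizes(total: int, batch_size: int) -> Iterator[int]:
    """Yield chunk sizes that sum to `total`, each at most `batch_size`."""
    if total < 0 or batch_size <= 0:
        raise ValueError("total must be >= 0 and batch_size must be > 0")
    remaining = total
    while remaining > 0:
        n = min(batch_size, remaining)
        yield n
        remaining -= n


def generate_in_batches(
    generate_fn: Callable[[int], List[str]],
    total: int,
    batch_size: int = 100,
) -> List[str]:
    """Call `generate_fn(n)` once per batch and concatenate the results."""
    sequences: List[str] = []
    for n in batch_sizes(total, batch_size):
        sequences.extend(generate_fn(n))
    return sequences


# Hypothetical stand-in for the model call; it returns n placeholder strings.
def fake_generate(n: int) -> List[str]:
    return ["M" * 10 for _ in range(n)]


amp_sequences = generate_in_batches(fake_generate, total=250, batch_size=100)
print(len(amp_sequences))  # 250
```

In real use, `generate_fn` would presumably wrap the README's generation call and pass `n` as 'num_return_sequences'. The batch size of 100 mirrors the threshold mentioned in the README and is otherwise an assumption.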