---
license: apache-2.0
pipeline_tag: text-generation
widget:
- text: <|endoftext|>
inference:
  parameters:
    top_k: 950
    repetition_penalty: 1.2
---
# **GPepT: A Language Model for Peptides and Peptidomimetics**

GPepT is a cutting-edge language model designed to understand and generate sequences in the specialized domain of peptides and peptidomimetics. It serves as a powerful tool for _de novo_ protein design and engineering. As demonstrated in our research, the incorporation of peptidomimetics significantly broadens the chemical space accessible through generated sequences, enabling innovative approaches to peptide-based therapeutics.
## **Model Overview**
GPepT builds upon the GPT-2 Transformer architecture, comprising 36 layers and a model dimensionality of 1280, with a total of 738 million parameters. This decoder-only model has been pre-trained on a curated dataset of peptides and peptidomimetics mined from bioactivity-labeled chemical formulas in ChEMBL.
To leverage GPepT’s pre-trained weights, input molecules must be converted into a standardized sequence-like representation of peptidomimetics using [**Monomerizer**](https://github.com/tsudalab/Monomerizer/tree/main). Detailed insights into the training process and datasets are provided in our accompanying publication.
Unlike traditional protein design models, GPepT is trained in a self-supervised manner, using raw sequence data without explicit annotation. This design enables the model to generalize across diverse sequence spaces, producing functional antimicrobial peptidomimetics upon fine-tuning.
Each token in a GPepT sequence corresponds to a non-canonical amino acid or a terminal modification, with an associated SMILES representation and selected chemical properties.
---
## **Using GPepT for Sequence Generation**
GPepT is fully compatible with the HuggingFace Transformers Python library. Installation instructions can be found [here](https://huggingface.co/docs/transformers/installation).
The model excels at generating peptidomimetic sequences in a zero-shot fashion, but it can also be fine-tuned on custom datasets to generate sequences tailored to specific requirements.
### **Example 1: Zero-Shot Sequence Generation**
GPepT generates sequences that extend from a specified input token (e.g., `<|endoftext|>`). If no input is provided, it selects the start token automatically and generates likely sequences. Here’s a Python example:
```python
from transformers import pipeline

# Initialize GPepT for text generation
GPepT = pipeline('text-generation', model="Playingyoyo/GPepT")

# Generate sequences
sequences = GPepT("<|endoftext|>",
                  max_length=25,
                  do_sample=True,
                  top_k=950,
                  repetition_penalty=1.5,
                  num_return_sequences=5,
                  eos_token_id=0)

# Print generated sequences
for seq in sequences:
    print(seq['generated_text'])
```
Sample output:
```
<|endoftext|>R K A L E Z1649
<|endoftext|>G K A L Z341
<|endoftext|>G V A G K X4097 V A P
```
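Generated strings like those above can be post-processed into token lists by stripping the start token and splitting on whitespace. A minimal sketch (the helper name `parse_sequence` is illustrative, not part of the GPepT API):

```python
def parse_sequence(generated_text):
    """Strip the start token and split a generated string into monomer tokens."""
    return generated_text.replace("<|endoftext|>", " ").split()

tokens = parse_sequence("<|endoftext|>R K A L E Z1649")
print(tokens)  # ['R', 'K', 'A', 'L', 'E', 'Z1649']
```

Each resulting token is a monomer symbol (a residue, a non-canonical amino acid such as `X4097`, or a terminal modification such as `Z1649`).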
---
### **Example 2: Fine-Tuning for Directed Sequence Generation**
Fine-tuning enables GPepT to generate sequences with user-defined properties. To prepare training data:
1. `git clone https://github.com/tsudalab/Monomerizer.git`
2. `cd Monomerizer`
3. `python3 Monomerizer/run_pipeline.py --input_file path_to_your_smiles_file.txt` (see the repository for the required input format).
4. Step 3 monomerizes the SMILES strings and splits the resulting sequences into training (`output/datetime/for_GPepT/train90.txt`) and validation (`output/datetime/for_GPepT/val10.txt`) files.
To fine-tune the model:
```bash
python run_clm.py --model_name_or_path Playingyoyo/GPepT \
--train_file path_to_train90.txt \
--validation_file path_to_val10.txt \
--tokenizer_name Playingyoyo/GPepT \
--do_train \
--do_eval \
--output_dir ./output \
--learning_rate 1e-5
```
This uses the HuggingFace [run_clm.py](https://github.com/huggingface/transformers/blob/master/examples/pytorch/language-modeling/run_clm.py) script; the required dependencies are listed in [requirements.txt](https://huggingface.co/Playingyoyo/GPepT/blob/main/requirements.txt).
Note that `train90.txt` and `val10.txt` must each contain at least 50 samples.
The fine-tuned model will be saved in the `./output` directory, ready to generate tailored sequences.
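The fine-tuned checkpoint can then be loaded the same way as the base model. A sketch, assuming `./output` is the `--output_dir` from the command above and reusing the generation parameters from Example 1 (the helper name `generate_finetuned` is illustrative):

```python
def generate_finetuned(model_dir="./output", n=5):
    """Load a fine-tuned GPepT checkpoint and sample n sequences from it."""
    # Imported lazily so the helper can be defined before the checkpoint exists
    from transformers import pipeline
    gen = pipeline('text-generation', model=model_dir)
    outputs = gen("<|endoftext|>",
                  max_length=25,
                  do_sample=True,
                  top_k=950,
                  repetition_penalty=1.5,
                  num_return_sequences=n,
                  eos_token_id=0)
    return [o['generated_text'] for o in outputs]

# Example (requires a completed fine-tuning run in ./output):
# for seq in generate_finetuned():
#     print(seq)
```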
---
## **Selecting Valid Sequences**
While GPepT generates diverse peptidomimetic sequences, not all are chemically valid. For example:
- **Invalid sequences:** those with terminal-modification tokens (e.g., `Z`-prefixed) embedded within the interior of the sequence.
- **Valid sequences:** those that place terminal modifications only at the sequence termini and otherwise adhere to standard peptidomimetic rules.
By filtering out invalid sequences, GPepT users can ensure the generation of high-quality candidates for further study.
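A simple filter along these lines can be sketched as follows. It assumes, as in the examples above, that terminal modifications are the `Z`-prefixed tokens; this is an illustrative convention, not the model's full token vocabulary:

```python
def is_valid(tokens):
    """Reject sequences with a terminal-modification token (Z-prefixed)
    in an interior position; tokens is a list of monomer symbols."""
    interior = tokens[1:-1]
    return not any(t.startswith("Z") for t in interior)

print(is_valid(["R", "K", "A", "L", "E", "Z1649"]))  # True: Z token at the terminus
print(is_valid(["R", "Z341", "A", "L"]))             # False: Z token embedded mid-sequence
```

In practice, additional chemistry-aware checks (e.g., reconstructing and validating the full SMILES) would be applied on top of this token-level screen.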
---
GPepT stands as a powerful tool for researchers at the forefront of peptide and peptidomimetic innovation, enabling both exploration and application in vast chemical and biological spaces.