---
license: apache-2.0
pipeline_tag: text-generation
widget:
- text: <|endoftext|>
inference:
  parameters:
    top_k: 950
    repetition_penalty: 1.2
---
# **GPepT: A Language Model for Peptides and Peptidomimetics**

GPepT is a cutting-edge language model designed to understand and generate sequences in the specialized domain of peptides and peptidomimetics. It serves as a powerful tool for _de novo_ protein design and engineering. As demonstrated in our research, the incorporation of peptidomimetics significantly broadens the chemical space accessible through generated sequences, enabling innovative approaches to peptide-based therapeutics.
## **Model Overview**
GPepT builds upon the GPT-2 Transformer architecture, comprising 36 layers and a model dimensionality of 1280, with a total of 738 million parameters. This decoder-only model has been pre-trained on a curated dataset of peptides and peptidomimetics mined from bioactivity-labeled chemical formulas in ChEMBL.
To leverage GPepT’s pre-trained weights, input molecules must be converted into a standardized sequence-like representation of peptidomimetics using [**Monomerizer**](https://github.com/tsudalab/Monomerizer/tree/main). Detailed insights into the training process and datasets are provided in our accompanying publication.
Unlike traditional protein design models, GPepT is trained in a self-supervised manner, using raw sequence data without explicit annotation. This design enables the model to generalize across diverse sequence spaces, producing functional antimicrobial peptidomimetics upon fine-tuning.
Each token in a GPepT sequence corresponds to a non-canonical amino acid or a terminal modification, with an associated SMILES representation and selected chemical properties.
---
## **Using GPepT for Sequence Generation**
GPepT is fully compatible with the HuggingFace Transformers Python library. Installation instructions can be found [here](https://huggingface.co/docs/transformers/installation).
The model excels at generating peptidomimetic sequences in a zero-shot fashion, but it can also be fine-tuned on custom datasets to generate sequences tailored to specific requirements.
### **Example 1: Zero-Shot Sequence Generation**
GPepT generates sequences that extend from a specified input token (e.g., `<|endoftext|>`). If no input is provided, it selects the start token automatically and generates likely sequences. Here’s a Python example:
```python
from transformers import pipeline

# Initialize GPepT for text generation
GPepT = pipeline('text-generation', model="Playingyoyo/GPepT")

# Generate sequences
sequences = GPepT("<|endoftext|>",
                  max_length=25,
                  do_sample=True,
                  top_k=950,
                  repetition_penalty=1.5,
                  num_return_sequences=5,
                  eos_token_id=0)

# Print generated sequences
for seq in sequences:
    print(seq['generated_text'])
```
Sample output:
```
<|endoftext|>R K A L E Z1649
<|endoftext|>G K A L Z341
<|endoftext|>G V A G K X4097 V A P
```
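Generated strings like those above can be post-processed into token lists by stripping the start token and splitting on whitespace. A minimal sketch (the helper name `parse_sequence` is illustrative, not part of the GPepT API):

```python
def parse_sequence(generated_text):
    """Strip the start token and split a generated string into monomer tokens."""
    return generated_text.replace("<|endoftext|>", " ").split()

tokens = parse_sequence("<|endoftext|>R K A L E Z1649")
print(tokens)  # ['R', 'K', 'A', 'L', 'E', 'Z1649']
```

Each resulting token is a monomer symbol (a residue, a non-canonical amino acid such as `X4097`, or a terminal modification such as `Z1649`).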
---
### **Example 2: Fine-Tuning for Directed Sequence Generation**
Fine-tuning enables GPepT to generate sequences with user-defined properties. To prepare training data:
1. `git clone https://github.com/tsudalab/Monomerizer.git`
2. `cd Monomerizer`
3. `python3 Monomerizer/run_pipeline.py --input_file path_to_your_smiles_file.txt` (see the repository for the required input format).
4. Step 3 monomerizes the SMILES strings and splits the resulting sequences into training (`output/datetime/for_GPepT/train90.txt`) and validation (`output/datetime/for_GPepT/val10.txt`) files.
To fine-tune the model:
```bash
python run_clm.py --model_name_or_path Playingyoyo/GPepT \
--train_file path_to_train90.txt \
--validation_file path_to_val10.txt \
--tokenizer_name Playingyoyo/GPepT \
--do_train \
--do_eval \
--output_dir ./output \
--learning_rate 1e-5
```
This uses the HuggingFace [run_clm.py](https://github.com/huggingface/transformers/blob/master/examples/pytorch/language-modeling/run_clm.py) script; the required dependencies are listed in [requirements.txt](https://huggingface.co/Playingyoyo/GPepT/blob/main/requirements.txt).
Note that `train90.txt` and `val10.txt` must each contain at least 50 samples.
The fine-tuned model will be saved in the `./output` directory, ready to generate tailored sequences.
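The fine-tuned checkpoint can then be loaded the same way as the base model. A sketch, assuming `./output` is the `--output_dir` from the command above and reusing the generation parameters from Example 1 (the helper name `generate_finetuned` is illustrative):

```python
def generate_finetuned(model_dir="./output", n=5):
    """Load a fine-tuned GPepT checkpoint and sample n sequences from it."""
    # Imported lazily so the helper can be defined before the checkpoint exists
    from transformers import pipeline
    gen = pipeline('text-generation', model=model_dir)
    outputs = gen("<|endoftext|>",
                  max_length=25,
                  do_sample=True,
                  top_k=950,
                  repetition_penalty=1.5,
                  num_return_sequences=n,
                  eos_token_id=0)
    return [o['generated_text'] for o in outputs]

# Example (requires a completed fine-tuning run in ./output):
# for seq in generate_finetuned():
#     print(seq)
```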
---
## **Selecting Valid Sequences**
While GPepT generates diverse peptidomimetic sequences, not all are chemically valid. For example:
- **Invalid sequences:** those with terminal-modification tokens (e.g., `Z`-prefixed) embedded within the interior of the sequence.
- **Valid sequences:** those that place terminal modifications only at the sequence termini and otherwise adhere to standard peptidomimetic rules.
By filtering out invalid sequences, GPepT users can ensure the generation of high-quality candidates for further study.
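A simple filter along these lines can be sketched as follows. It assumes, as in the examples above, that terminal modifications are the `Z`-prefixed tokens; this is an illustrative convention, not the model's full token vocabulary:

```python
def is_valid(tokens):
    """Reject sequences with a terminal-modification token (Z-prefixed)
    in an interior position; tokens is a list of monomer symbols."""
    interior = tokens[1:-1]
    return not any(t.startswith("Z") for t in interior)

print(is_valid(["R", "K", "A", "L", "E", "Z1649"]))  # True: Z token at the terminus
print(is_valid(["R", "Z341", "A", "L"]))             # False: Z token embedded mid-sequence
```

In practice, additional chemistry-aware checks (e.g., reconstructing and validating the full SMILES) would be applied on top of this token-level screen.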
---
GPepT stands as a powerful tool for researchers at the forefront of peptide and peptidomimetic innovation, enabling both exploration and application in vast chemical and biological spaces.