---
license: apache-2.0
pipeline_tag: text-generation
widget:
- text: <|endoftext|>
inference:
parameters:
top_k: 950
repetition_penalty: 1.2
---
# **GPepT: A Language Model for Peptides and Peptidomimetics**
![GPepT overview graphic](TOC.png)
GPepT is a cutting-edge language model designed to understand and generate sequences in the specialized domain of peptides and peptidomimetics. It serves as a powerful tool for _de novo_ protein design and engineering. As demonstrated in our research, the incorporation of peptidomimetics significantly broadens the chemical space accessible through generated sequences, enabling innovative approaches to peptide-based therapeutics.
## **Model Overview**
GPepT builds upon the GPT-2 Transformer architecture, comprising 36 layers and a model dimensionality of 1280, with a total of 738 million parameters. This decoder-only model has been pre-trained on a curated dataset of peptides and peptidomimetics mined from bioactivity-labeled molecules in ChEMBL.
To leverage GPepT’s pre-trained weights, input molecules must be converted into a standardized sequence-like representation of peptidomimetics using [**Monomerizer**](https://github.com/tsudalab/Monomerizer/tree/main). Detailed insights into the training process and datasets are provided in our accompanying publication.
Unlike traditional protein design models, GPepT is trained in a self-supervised manner, using raw sequence data without explicit annotation. This design enables the model to generalize across diverse sequence spaces, producing functional antimicrobial peptidomimetics upon fine-tuning.
Each token in the model's vocabulary corresponds to a non-canonical amino acid or a terminal modification; the SMILES representation and selected chemical properties of each token are provided in the accompanying publication.
---
## **Using GPepT for Sequence Generation**
GPepT is fully compatible with the HuggingFace Transformers Python library. Installation instructions can be found [here](https://huggingface.co/docs/transformers/installation).
The model excels at generating peptidomimetic sequences in a zero-shot fashion, but it can also be fine-tuned on custom datasets to generate sequences tailored to specific requirements.
### **Example 1: Zero-Shot Sequence Generation**
GPepT generates sequences that extend from a specified input token (e.g., `<|endoftext|>`). If no input is provided, it selects the start token automatically and generates likely sequences. Here’s a Python example:
```python
from transformers import pipeline

# Load GPepT as a HuggingFace text-generation pipeline
GPepT = pipeline('text-generation', model="Playingyoyo/GPepT")

# Sample 5 sequences, starting from the start token
sequences = GPepT("<|endoftext|>",
                  max_length=25,
                  do_sample=True,
                  top_k=950,
                  repetition_penalty=1.5,
                  num_return_sequences=5,
                  eos_token_id=0)

# Print the generated sequences
for seq in sequences:
    print(seq['generated_text'])
```
Sample output:
```
<|endoftext|>R K A L E Z1649
<|endoftext|>G K A L Z341
<|endoftext|>G V A G K X4097 V A P
```
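For finer control over decoding, or to run generation on a specific device, the model can also be loaded through the Transformers `AutoTokenizer`/`AutoModelForCausalLM` API. The snippet below is a minimal sketch that mirrors the pipeline call above; the `pad_token_id=0` setting is an assumption made here to silence a padding warning:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and model weights
tokenizer = AutoTokenizer.from_pretrained("Playingyoyo/GPepT")
model = AutoModelForCausalLM.from_pretrained("Playingyoyo/GPepT")
model.eval()

# Encode the start token and sample with the same parameters as above
inputs = tokenizer("<|endoftext|>", return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(**inputs,
                             max_length=25,
                             do_sample=True,
                             top_k=950,
                             repetition_penalty=1.5,
                             num_return_sequences=5,
                             eos_token_id=0,
                             pad_token_id=0)  # assumed pad id, not specified in the card

for ids in outputs:
    print(tokenizer.decode(ids))
```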
---
### **Example 2: Fine-Tuning for Directed Sequence Generation**
Fine-tuning enables GPepT to generate sequences with user-defined properties. To prepare training data:
1. ```git clone https://github.com/tsudalab/Monomerizer.git```
2. ```cd Monomerizer```
3. ```python3 Monomerizer/run_pipeline.py --input_file path_to_your_smiles_file.txt```. See the repository for the required input format.
4. Step 3 monomerizes the SMILES and splits the resulting sequences into training (`output/datetime/for_GPepT/train90.txt`) and validation (`output/datetime/for_GPepT/val10.txt`) files.
To fine-tune the model:
```bash
python run_clm.py --model_name_or_path Playingyoyo/GPepT \
--train_file path_to_train90.txt \
--validation_file path_to_val10.txt \
--tokenizer_name Playingyoyo/GPepT \
--do_train \
--do_eval \
--output_dir ./output \
--learning_rate 1e-5
```
Refer to the HuggingFace [run_clm.py script](https://github.com/huggingface/transformers/blob/master/examples/pytorch/language-modeling/run_clm.py) for the full set of training options, and to the model's [requirements.txt](https://huggingface.co/Playingyoyo/GPepT/blob/main/requirements.txt) for dependencies.
Note that `train90.txt` and `val10.txt` must each contain at least 50 samples.
The fine-tuned model will be saved in the `./output` directory, ready to generate tailored sequences.
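Once training completes, the checkpoint can be loaded exactly like the pre-trained model. A minimal sketch, assuming the default `./output` path from the command above:
```python
from transformers import pipeline

# Load the fine-tuned checkpoint saved by run_clm.py
finetuned = pipeline('text-generation', model="./output")

# Sample sequences from the fine-tuned model
sequences = finetuned("<|endoftext|>",
                      max_length=25,
                      do_sample=True,
                      top_k=950,
                      repetition_penalty=1.5,
                      num_return_sequences=5,
                      eos_token_id=0)
for seq in sequences:
    print(seq['generated_text'])
```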
---
## **Selecting Valid Sequences**
While GPepT generates diverse peptidomimetic sequences, not all are chemically valid. For example:
- **Invalid Sequences:** those with terminal-modification tokens (e.g., `Z`-prefixed tokens) embedded within the sequence rather than at its ends.
- **Valid Sequences:** those that adhere to standard peptidomimetic structure, with terminal modifications (if any) appearing only at the sequence termini.
By filtering out invalid sequences, GPepT users can ensure the generation of high-quality candidates for further study.
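As a concrete illustration, below is a minimal filtering sketch. It assumes, based on the examples above, that terminal-modification tokens are `Z`-prefixed; for real use, adapt the check to the full token vocabulary. The sequences in the example list are hypothetical:
```python
def is_valid(sequence: str) -> bool:
    """Heuristic check: terminal-modification tokens (assumed here to be
    Z-prefixed, as in the sample output above) may only appear at the ends."""
    tokens = sequence.replace("<|endoftext|>", "").split()
    for i, tok in enumerate(tokens):
        if tok.startswith("Z") and i not in (0, len(tokens) - 1):
            return False
    return True

# Illustrative generated sequences (hypothetical)
generated = [
    "<|endoftext|>R K A L E Z1649",  # valid: Z-token at the terminus
    "<|endoftext|>G K A Z341 L",     # invalid: Z-token mid-sequence
]
valid = [s for s in generated if is_valid(s)]
print(valid)
```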
---
GPepT stands as a powerful tool for researchers at the forefront of peptide and peptidomimetic innovation, enabling both exploration and application in vast chemical and biological spaces.