---
license: apache-2.0
pipeline_tag: text-generation
widget:
- text: <|endoftext|>
inference:
  parameters:
    top_k: 950
    repetition_penalty: 1.2
---

# **GPepT: A Language Model for Peptides and Peptidomimetics**



GPepT is a cutting-edge language model designed to understand and generate sequences in the specialized domain of peptides and peptidomimetics. It serves as a powerful tool for _de novo_ peptide design and engineering. As demonstrated in our research, incorporating peptidomimetics significantly broadens the chemical space accessible through generated sequences, enabling innovative approaches to peptide-based therapeutics.

## **Model Overview**

GPepT builds upon the GPT-2 Transformer architecture, comprising 36 layers and a model dimensionality of 1280, with a total of 738 million parameters. This decoder-only model has been pre-trained on a curated dataset of peptides and peptidomimetics mined from bioactivity-labeled chemical structures in ChEMBL.
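
These hyperparameters can be checked programmatically. A minimal sketch using the Transformers `AutoConfig` API (attribute names follow the standard GPT-2 configuration):

```python
from transformers import AutoConfig

# Load GPepT's configuration from the HuggingFace Hub
config = AutoConfig.from_pretrained("Playingyoyo/GPepT")

print(config.n_layer)  # number of Transformer layers (36)
print(config.n_embd)   # model dimensionality (1280)
```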

To leverage GPepT’s pre-trained weights, input molecules must first be converted into a standardized sequence-like representation of peptidomimetics using [**Monomerizer**](https://github.com/tsudalab/Monomerizer/tree/main). Detailed insights into the training process and datasets are provided in our accompanying publication.

Unlike traditional protein design models, GPepT is trained in a self-supervised manner on raw sequence data without explicit annotation. This design enables the model to generalize across diverse sequence spaces, producing functional antimicrobial peptidomimetics upon fine-tuning.

Each token in GPepT's extended vocabulary corresponds to a non-canonical amino acid or terminal modification, together with its SMILES representation and selected chemical properties.

---

## **Using GPepT for Sequence Generation**

GPepT is fully compatible with the HuggingFace Transformers Python library. Installation instructions can be found [here](https://huggingface.co/docs/transformers/installation).

The model excels at generating peptidomimetic sequences in a zero-shot fashion, but it can also be fine-tuned on custom datasets to generate sequences tailored to specific requirements.

### **Example 1: Zero-Shot Sequence Generation**

GPepT generates sequences that extend from a specified input token (e.g., `<|endoftext|>`). If no input is provided, it selects the start token automatically and generates likely sequences. Here’s a Python example:

```python
from transformers import pipeline

# Initialize GPepT for text generation
GPepT = pipeline('text-generation', model="Playingyoyo/GPepT")

# Generate sequences
sequences = GPepT("<|endoftext|>",
                  max_length=25,
                  do_sample=True,
                  top_k=950,
                  repetition_penalty=1.5,
                  num_return_sequences=5,
                  eos_token_id=0)

# Print generated sequences
for seq in sequences:
    print(seq['generated_text'])
```

Sample output:

```
<|endoftext|>R K A L E Z1649
<|endoftext|>G K A L Z341
<|endoftext|>G V A G K X4097 V A P
```
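
Note that the generated text retains the `<|endoftext|>` start token. A minimal post-processing sketch, assuming monomer tokens are whitespace-separated as in the sample output above:

```python
# Continuing from the example above: strip the start token and split each
# generated sequence into monomer tokens
token_lists = [
    seq['generated_text'].replace("<|endoftext|>", "").split()
    for seq in sequences
]

for tokens in token_lists:
    print(tokens)  # e.g., ['R', 'K', 'A', 'L', 'E', 'Z1649']
```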

---

### **Example 2: Fine-Tuning for Directed Sequence Generation**

Fine-tuning enables GPepT to generate sequences with user-defined properties. To prepare training data:

1. ```git clone https://github.com/tsudalab/Monomerizer```
2. ```cd Monomerizer```
3. ```python3 Monomerizer/run_pipeline.py --input_file path_to_your_smiles_file.txt```. Check the repo for the required format.
4. Step 3 monomerizes the SMILES strings and splits the resulting sequences into training (`output/datetime/for_GPepT/train90.txt`) and validation (`output/datetime/for_GPepT/val10.txt`) files.

To fine-tune the model:

```bash
python run_clm.py --model_name_or_path Playingyoyo/GPepT \
    --train_file path_to_train90.txt \
    --validation_file path_to_val10.txt \
    --tokenizer_name Playingyoyo/GPepT \
    --do_train \
    --do_eval \
    --output_dir ./output \
    --learning_rate 1e-5
```

Refer to the HuggingFace [run_clm.py script](https://github.com/huggingface/transformers/blob/master/examples/pytorch/language-modeling/run_clm.py) and this model's [requirements.txt](https://huggingface.co/Playingyoyo/GPepT/blob/main/requirements.txt). Note that `train90.txt` and `val10.txt` must each contain at least 50 samples.

The fine-tuned model will be saved in the `./output` directory, ready to generate tailored sequences.
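
The fine-tuned weights can then be loaded exactly as in Example 1, pointing the pipeline at the local output directory. A minimal sketch:

```python
from transformers import pipeline

# Load the fine-tuned model from the local output directory
generator = pipeline('text-generation', model="./output")

# Generate sequences with the same sampling settings as Example 1
sequences = generator("<|endoftext|>",
                      max_length=25,
                      do_sample=True,
                      top_k=950,
                      repetition_penalty=1.5,
                      num_return_sequences=5,
                      eos_token_id=0)
```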

---

## **Selecting Valid Sequences**

While GPepT generates diverse peptidomimetic sequences, not all of them are chemically valid. For example:

- **Invalid sequences:** those with terminal modifications (e.g., `Z` tokens) embedded within the sequence.
- **Valid sequences:** those that adhere to standard peptidomimetic rules.

By filtering out invalid sequences, GPepT users can ensure the generation of high-quality candidates for further study. A simple filter is sketched below.
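
As an illustration, here is a minimal heuristic filter. It assumes, based on the sample output above, that terminal-modification tokens begin with `Z` and are only acceptable at the ends of a sequence; adapt the rules to your own requirements:

```python
def is_valid(sequence: str) -> bool:
    """Heuristic check: terminal-modification tokens (assumed here to
    begin with 'Z') may only appear at the ends of a sequence."""
    tokens = sequence.replace("<|endoftext|>", "").split()
    for i, token in enumerate(tokens):
        # A Z token embedded in the middle of the sequence is invalid
        if token.startswith("Z") and 0 < i < len(tokens) - 1:
            return False
    return True

# Illustrative candidates mirroring the sample output above
candidates = [
    "<|endoftext|>R K A L E Z1649",  # terminal modification at the end: kept
    "<|endoftext|>G K Z341 A L",     # terminal modification embedded: filtered out
]
valid_sequences = [s for s in candidates if is_valid(s)]
print(valid_sequences)
```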

---

GPepT stands as a powerful tool for researchers at the forefront of peptide and peptidomimetic innovation, enabling both exploration and application in vast chemical and biological spaces.