---
license: apache-2.0
pipeline_tag: text-generation
widget:
- text: <|endoftext|>
inference:
  parameters:
    top_k: 950
    repetition_penalty: 1.2
---

# **GPepT: A Language Model for Peptides and Peptidomimetics**



GPepT is a cutting-edge language model designed to understand and generate sequences in the specialized domain of peptides and peptidomimetics. It serves as a powerful tool for _de novo_ peptide design and engineering. As demonstrated in our research, incorporating peptidomimetics significantly broadens the chemical space accessible through generated sequences, enabling innovative approaches to peptide-based therapeutics.

## **Model Overview**

GPepT builds upon the GPT-2 Transformer architecture, comprising 36 layers and a model dimensionality of 1280, with a total of 738 million parameters. This decoder-only model has been pre-trained on a curated dataset of peptides and peptidomimetics mined from bioactivity-labeled chemical structures in ChEMBL.
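
These hyperparameters can be checked programmatically. A minimal sketch using the Transformers `AutoConfig` API (attribute names follow the standard GPT-2 configuration):

```python
from transformers import AutoConfig

# Load GPepT's configuration from the HuggingFace Hub
config = AutoConfig.from_pretrained("Playingyoyo/GPepT")

print(config.n_layer)  # number of Transformer layers (36)
print(config.n_embd)   # model dimensionality (1280)
```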

To leverage GPepT’s pre-trained weights, input molecules must first be converted into a standardized sequence-like representation of peptidomimetics using [**Monomerizer**](https://github.com/tsudalab/Monomerizer/tree/main). Detailed insights into the training process and datasets are provided in our accompanying publication.

Unlike traditional protein design models, GPepT is trained in a self-supervised manner on raw sequence data without explicit annotation. This design enables the model to generalize across diverse sequence spaces, producing functional antimicrobial peptidomimetics upon fine-tuning.

Each token in GPepT's extended vocabulary corresponds to a non-canonical amino acid or terminal modification, together with its SMILES representation and selected chemical properties.

---

## **Using GPepT for Sequence Generation**

GPepT is fully compatible with the HuggingFace Transformers Python library. Installation instructions can be found [here](https://huggingface.co/docs/transformers/installation).

The model excels at generating peptidomimetic sequences in a zero-shot fashion, but it can also be fine-tuned on custom datasets to generate sequences tailored to specific requirements.

### **Example 1: Zero-Shot Sequence Generation**

GPepT generates sequences that extend from a specified input token (e.g., `<|endoftext|>`). If no input is provided, it selects the start token automatically and generates likely sequences. Here’s a Python example:

```python
from transformers import pipeline

# Initialize GPepT for text generation
GPepT = pipeline('text-generation', model="Playingyoyo/GPepT")

# Generate sequences
sequences = GPepT("<|endoftext|>",
                  max_length=25,
                  do_sample=True,
                  top_k=950,
                  repetition_penalty=1.5,
                  num_return_sequences=5,
                  eos_token_id=0)

# Print generated sequences
for seq in sequences:
    print(seq['generated_text'])
```

Sample output:

```
<|endoftext|>R K A L E Z1649
<|endoftext|>G K A L Z341
<|endoftext|>G V A G K X4097 V A P
```
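
Note that the generated text retains the `<|endoftext|>` start token. A minimal post-processing sketch, assuming monomer tokens are whitespace-separated as in the sample output above:

```python
# Continuing from the example above: strip the start token and split each
# generated sequence into monomer tokens
token_lists = [
    seq['generated_text'].replace("<|endoftext|>", "").split()
    for seq in sequences
]

for tokens in token_lists:
    print(tokens)  # e.g., ['R', 'K', 'A', 'L', 'E', 'Z1649']
```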

---

### **Example 2: Fine-Tuning for Directed Sequence Generation**

Fine-tuning enables GPepT to generate sequences with user-defined properties. To prepare training data:

1. ```git clone https://github.com/tsudalab/Monomerizer```
2. ```cd Monomerizer```
3. ```python3 Monomerizer/run_pipeline.py --input_file path_to_your_smiles_file.txt```. Check the repo for the required format.
4. Step 3 monomerizes the SMILES strings and splits the resulting sequences into training (`output/datetime/for_GPepT/train90.txt`) and validation (`output/datetime/for_GPepT/val10.txt`) files.

To fine-tune the model:

```bash
python run_clm.py --model_name_or_path Playingyoyo/GPepT \
    --train_file path_to_train90.txt \
    --validation_file path_to_val10.txt \
    --tokenizer_name Playingyoyo/GPepT \
    --do_train \
    --do_eval \
    --output_dir ./output \
    --learning_rate 1e-5
```

Refer to the HuggingFace [run_clm.py script](https://github.com/huggingface/transformers/blob/master/examples/pytorch/language-modeling/run_clm.py) and this model's [requirements.txt](https://huggingface.co/Playingyoyo/GPepT/blob/main/requirements.txt). Note that `train90.txt` and `val10.txt` must each contain at least 50 samples.

The fine-tuned model will be saved in the `./output` directory, ready to generate tailored sequences.
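
The fine-tuned weights can then be loaded exactly as in Example 1, pointing the pipeline at the local output directory. A minimal sketch:

```python
from transformers import pipeline

# Load the fine-tuned model from the local output directory
generator = pipeline('text-generation', model="./output")

# Generate sequences with the same sampling settings as Example 1
sequences = generator("<|endoftext|>",
                      max_length=25,
                      do_sample=True,
                      top_k=950,
                      repetition_penalty=1.5,
                      num_return_sequences=5,
                      eos_token_id=0)
```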

---

## **Selecting Valid Sequences**

While GPepT generates diverse peptidomimetic sequences, not all of them are chemically valid. For example:

- **Invalid sequences:** those with terminal modifications (e.g., `Z` tokens) embedded within the sequence.
- **Valid sequences:** those that adhere to standard peptidomimetic rules.

By filtering out invalid sequences, GPepT users can ensure the generation of high-quality candidates for further study. A simple filter is sketched below.
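
As an illustration, here is a minimal heuristic filter. It assumes, based on the sample output above, that terminal-modification tokens begin with `Z` and are only acceptable at the ends of a sequence; adapt the rules to your own requirements:

```python
def is_valid(sequence: str) -> bool:
    """Heuristic check: terminal-modification tokens (assumed here to
    begin with 'Z') may only appear at the ends of a sequence."""
    tokens = sequence.replace("<|endoftext|>", "").split()
    for i, token in enumerate(tokens):
        # A Z token embedded in the middle of the sequence is invalid
        if token.startswith("Z") and 0 < i < len(tokens) - 1:
            return False
    return True

# Illustrative candidates mirroring the sample output above
candidates = [
    "<|endoftext|>R K A L E Z1649",  # terminal modification at the end: kept
    "<|endoftext|>G K Z341 A L",     # terminal modification embedded: filtered out
]
valid_sequences = [s for s in candidates if is_valid(s)]
print(valid_sequences)
```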

---

GPepT stands as a powerful tool for researchers at the forefront of peptide and peptidomimetic innovation, enabling both exploration and application in vast chemical and biological spaces.