Duplicate from liyuesen/druggpt

Files changed (11)

.gitattributes +34 -0
README.md +78 -0
added_tokens.json +4 -0
config.json +39 -0
generation_config.json +6 -0
merges.txt +0 -0
pytorch_model.bin +3 -0
special_tokens_map.json +12 -0
tokenizer.json +0 -0
tokenizer_config.json +33 -0
vocab.json +0 -0

.gitattributes ADDED

	@@ -0,0 +1,34 @@

+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text

README.md ADDED

	@@ -0,0 +1,78 @@

+---
+license: gpl-3.0
+tags:
+- chemistry
+- biology
+- medical
+- gpt2
+---
+# DrugGPT
+A generative drug design model based on GPT2.
+<img src="https://img.shields.io/github/license/LIYUESEN/druggpt"><img src="https://img.shields.io/badge/python-3.7-blue"><img src="https://img.shields.io/github/stars/LIYUESEN/druggpt?style=social">
+## 🚩 Introduction
+DrugGPT is a generative drug design strategy based on the GPT architecture. It aims to bring innovation to drug design by applying natural language processing techniques.
+This project applies the GPT model to the exploration of chemical space in order to discover new molecules with the potential to bind specific proteins.
+Trained on up to 1.8 million protein-ligand binding records, DrugGPT provides a fast and efficient method for generating drug candidate molecules.
+## 📥 Deployment
+1. Clone
+    ```shell
+    git clone https://github.com/LIYUESEN/druggpt.git
+    cd druggpt
+    ```
+   Alternatively, visit our [GitHub repo](https://github.com/LIYUESEN/druggpt) and click *Code > Download ZIP* to download the repository.
+2. Create a virtual environment
+    ```shell
+    conda create -n druggpt python=3.7
+    conda activate druggpt
+    ```
+3. Install Python dependencies
+    ```shell
+    pip install datasets transformers scipy scikit-learn
+    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117
+    conda install -c openbabel openbabel
+    ```
+## 🗝 How to use
+Run [drug_generator.py](https://github.com/LIYUESEN/druggpt/blob/main/drug_generator.py) from the command line.
+Parameters:
+- `-p` | `--pro_seq`: Input a protein amino acid sequence.
+- `-f` | `--fasta`: Input a FASTA file.
+  > Specify only one of `-p` and `-f`.
+- `-l` | `--ligand_prompt`: Input a ligand prompt.
+- `-e` | `--empty_input`: Enable direct generation mode.
+- `-n` | `--number`: Minimum number of molecules to generate.
+- `-d` | `--device`: Hardware device to use. Default is 'cuda'.
+- `-o` | `--output`: Output directory for generated molecules. Default is './ligand_output/'.
+- `-b` | `--batch_size`: Number of molecules to generate per batch. Reduce this value if you run low on memory. Default is 32.
+## 🔬 Example usage
+- To input a protein FASTA file:
+    ```shell
+    python drug_generator.py -f bcl2.fasta -n 50
+    ```
+- To input the amino acid sequence of the protein directly:
+    ```shell
+    python drug_generator.py -p MAKQPSDVSSECDREGRQLQPAERPPQLRPGAPTSLQTEPQGNPEGNHGGEGDSCPHGSPQGPLAPPASPGPFATRSPLFIFMRRSSLLSRSSSGYFSFDTDRSPAPMSCDKSTQTPSPPCQAFNHYLSAMASMRQAEPADMRPEIWIAQELRRIGDEFNAYYARRVFLNNYQAAEDHPRMVILRLLRYIVRLVWRMH -n 50
+    ```
+- To provide a prompt for the ligand:
+    ```shell
+    python drug_generator.py -f bcl2.fasta -l COc1ccc(cc1)C(=O) -n 50
+    ```
+- Note: In a Linux shell, enclose the ligand prompt in single quotes (''), because unquoted parentheses are interpreted by the shell.
+    ```shell
+    python drug_generator.py -f bcl2.fasta -l 'COc1ccc(cc1)C(=O)' -n 50
+    ```
+## 📝 How to reference this work
+DrugGPT: A GPT-based Strategy for Designing Potential Ligands Targeting Specific Proteins
+Yuesen Li, Chengyi Gao, Xin Song, Xiangyu Wang, Yungang Xu, Suxia Han
+bioRxiv 2023.06.29.543848; doi: [https://doi.org/10.1101/2023.06.29.543848](https://doi.org/10.1101/2023.06.29.543848)
+[![DOI](https://img.shields.io/badge/DOI-10.1101/2023.06.29.543848-blue)](https://doi.org/10.1101/2023.06.29.543848)
+## ⚖ License
+[GNU General Public License v3.0](https://www.gnu.org/licenses/gpl-3.0.html)
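
The README's note about quoting the ligand prompt on Linux can be sidestepped when the script is launched from Python. The following is a minimal sketch, not part of the repository: `build_command` and `run_generator` are hypothetical helper names, and only the `-f`, `-l`, and `-n` flags documented in the README are used. Passing the arguments as a list means no shell ever parses the SMILES string.

```python
import shlex
import subprocess
from typing import List, Optional


def build_command(fasta: str, number: int, ligand_prompt: Optional[str] = None) -> List[str]:
    """Build an argument list for drug_generator.py from the README's flags."""
    cmd = ["python", "drug_generator.py", "-f", fasta]
    if ligand_prompt is not None:
        cmd += ["-l", ligand_prompt]
    cmd += ["-n", str(number)]
    return cmd


def run_generator(cmd: List[str]) -> None:
    """Run the command without a shell (assumes the cloned repo is the working directory)."""
    subprocess.run(cmd, check=True)


cmd = build_command("bcl2.fasta", 50, ligand_prompt="COc1ccc(cc1)C(=O)")

# Show how the same command must be quoted when typed into a POSIX shell.
quoted = " ".join(shlex.quote(arg) for arg in cmd)
print(quoted)
```

The printed string reproduces the single-quoted form from the README's Linux example, while `run_generator(cmd)` itself needs no quoting at all.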

added_tokens.json ADDED

	@@ -0,0 +1,4 @@

+{
+  "<|startoftext|>": 53082,
+  "[PAD]": 53081
+}
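
A quick consistency check between these added token ids and the `vocab_size` of 53083 in config.json can be sketched as follows. The values are copied from this diff; the inference that the base vocabulary holds 53,081 entries is an assumption, since vocab.json is not rendered here.

```python
import json

# Contents of added_tokens.json as shown in this diff.
added_tokens = json.loads('{"<|startoftext|>": 53082, "[PAD]": 53081}')
vocab_size = 53083  # "vocab_size" from config.json

ids = sorted(added_tokens.values())
# Assumption: added ids begin immediately after the base vocabulary.
base_vocab_size = ids[0]
is_contiguous = ids == list(range(base_vocab_size, base_vocab_size + len(ids)))
fills_vocab = ids[-1] + 1 == vocab_size

print(base_vocab_size, is_contiguous, fills_vocab)
```

If both checks hold, the embedding matrix has exactly one row per token id, with the two added tokens occupying the last two rows.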

config.json ADDED

	@@ -0,0 +1,39 @@

+{
+  "_name_or_path": "../model_save/epoch_4",
+  "activation_function": "gelu_new",
+  "architectures": [
+    "GPT2LMHeadModel"
+  ],
+  "attn_pdrop": 0.1,
+  "bos_token_id": 50256,
+  "embd_pdrop": 0.1,
+  "eos_token_id": 50256,
+  "initializer_range": 0.02,
+  "layer_norm_epsilon": 1e-05,
+  "model_type": "gpt2",
+  "n_ctx": 1024,
+  "n_embd": 768,
+  "n_head": 12,
+  "n_inner": null,
+  "n_layer": 12,
+  "n_positions": 1024,
+  "reorder_and_upcast_attn": false,
+  "resid_pdrop": 0.1,
+  "scale_attn_by_inverse_layer_idx": false,
+  "scale_attn_weights": true,
+  "summary_activation": null,
+  "summary_first_dropout": 0.1,
+  "summary_proj_to_labels": true,
+  "summary_type": "cls_index",
+  "summary_use_proj": true,
+  "task_specific_params": {
+    "text-generation": {
+      "do_sample": true,
+      "max_length": 50
+    }
+  },
+  "torch_dtype": "float32",
+  "transformers_version": "4.27.2",
+  "use_cache": true,
+  "vocab_size": 53083
+}

generation_config.json ADDED

	@@ -0,0 +1,6 @@

+{
+  "_from_model_config": true,
+  "bos_token_id": 50256,
+  "eos_token_id": 50256,
+  "transformers_version": "4.27.2"
+}

merges.txt ADDED

The diff for this file is too large to render. See raw diff

pytorch_model.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:e8e5c858ce18e9d2ac195182830525246c01b0741dd9519aa0871d93b742c870
+size 519077053
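
The LFS pointer size can be sanity-checked against config.json with a rough parameter count. This is a sketch under stated assumptions: the standard GPT-2 layer layout, an MLP width of `4 * n_embd` because `n_inner` is null, and an output head tied to the token embedding so it adds no parameters.

```python
# Values from config.json in this diff.
n_embd, n_layer, n_positions, vocab_size = 768, 12, 1024, 53083
n_inner = 4 * n_embd  # GPT-2 default when "n_inner" is null

# Token and position embeddings (output head assumed tied to the token embedding).
embeddings = vocab_size * n_embd + n_positions * n_embd

# Per transformer block: fused QKV projection + output projection, two MLP layers, two LayerNorms.
attention = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
mlp = (n_embd * n_inner + n_inner) + (n_inner * n_embd + n_embd)
layer_norms = 2 * (2 * n_embd)
per_layer = attention + mlp + layer_norms

final_ln = 2 * n_embd
n_params = embeddings + n_layer * per_layer + final_ln

param_bytes = 4 * n_params  # "torch_dtype": "float32"
file_bytes = 519077053      # size from the LFS pointer
print(n_params, param_bytes, file_bytes - param_bytes)
```

Under these assumptions the model has roughly 126.6 million parameters, or about 506 MB in float32. The remaining gap of about 12.6 MB to the pointer size is plausibly non-parameter buffers and serialization overhead, though the diff alone cannot confirm that.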

special_tokens_map.json ADDED

	@@ -0,0 +1,12 @@

+{
+  "bos_token": "<|startoftext|>",
+  "eos_token": "<|endoftext|>",
+  "pad_token": "[PAD]",
+  "unk_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  }
+}

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED

	@@ -0,0 +1,33 @@

+{
+  "add_bos_token": false,
+  "add_prefix_space": false,
+  "bos_token": {
+    "__type": "AddedToken",
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "__type": "AddedToken",
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "errors": "replace",
+  "model_max_length": 1000000000000000019884624838656,
+  "pad_token": null,
+  "special_tokens_map_file": "tokenizer_folder/new_tokenizer_gpt\\special_tokens_map.json",
+  "tokenizer_class": "GPT2Tokenizer",
+  "unk_token": {
+    "__type": "AddedToken",
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  }
+}
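
The special tokens declared in this file do not fully agree with those in special_tokens_map.json above. The sketch below compares the two, using only the `content` strings copied from this diff; which file takes precedence when the tokenizer is loaded depends on the `transformers` version and is not asserted here.

```python
# Token strings copied from special_tokens_map.json in this diff.
special_tokens_map = {
    "bos_token": "<|startoftext|>",
    "eos_token": "<|endoftext|>",
    "pad_token": "[PAD]",
    "unk_token": "<|endoftext|>",
}

# Token strings copied from tokenizer_config.json in this diff ("pad_token" is null there).
tokenizer_config = {
    "bos_token": "<|endoftext|>",
    "eos_token": "<|endoftext|>",
    "pad_token": None,
    "unk_token": "<|endoftext|>",
}

# Collect every token role whose value differs between the two files.
mismatches = {
    name: (special_tokens_map[name], tokenizer_config[name])
    for name in special_tokens_map
    if special_tokens_map[name] != tokenizer_config[name]
}

for name, (in_map, in_config) in mismatches.items():
    print(f"{name}: special_tokens_map={in_map!r}, tokenizer_config={in_config!r}")
```

The two differing roles, `bos_token` and `pad_token`, are exactly the tokens listed in added_tokens.json, so it is worth verifying the loaded tokenizer's `bos_token` and `pad_token` before generating.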

vocab.json ADDED

The diff for this file is too large to render. See raw diff