liyuesen committed · Commit fad01a9 · 0 Parent(s)

Duplicate from liyuesen/druggpt

.gitattributes ADDED
@@ -0,0 +1,34 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,78 @@
+ ---
+ license: gpl-3.0
+ tags:
+ - chemistry
+ - biology
+ - medical
+ - gpt2
+ ---
+ # DrugGPT
+ A generative drug design model based on GPT2.
+ <img src="https://img.shields.io/github/license/LIYUESEN/druggpt"><img src="https://img.shields.io/badge/python-3.7-blue"><img src="https://img.shields.io/github/stars/LIYUESEN/druggpt?style=social">
+ ## 🚩 Introduction
+ DrugGPT is a generative pharmaceutical strategy based on the GPT architecture. It aims to bring innovation to drug design by using natural language processing techniques.
+
+ This project applies the GPT model to the exploration of chemical space to discover new molecules with potential binding abilities for specific proteins.
+
+ DrugGPT provides a fast and efficient method for generating drug candidate molecules, having been trained on up to 1.8 million protein-ligand binding records.
+ ## 📥 Deployment
+ 1. Clone
+ ```shell
+ git clone https://github.com/LIYUESEN/druggpt.git
+ cd druggpt
+ ```
+ Or you can visit our [GitHub repo](https://github.com/LIYUESEN/druggpt) and click *Code>Download ZIP* to download this repo.
+ 2. Create a virtual environment
+ ```shell
+ conda create -n druggpt python=3.7
+ conda activate druggpt
+ ```
+ 3. Install Python dependencies
+ ```shell
+ pip install datasets transformers scipy scikit-learn
+ pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117
+ conda install -c openbabel openbabel
+ ```
+ ## 🗝 How to use
+ Use [drug_generator.py](https://github.com/LIYUESEN/druggpt/blob/main/drug_generator.py)
+
+ Parameters:
+ - `-p` | `--pro_seq`: Input a protein amino acid sequence.
+ - `-f` | `--fasta`: Input a FASTA file.
+
+ > Only one of -p and -f should be specified.
+ - `-l` | `--ligand_prompt`: Input a ligand prompt.
+ - `-e` | `--empty_input`: Enable direct generation mode.
+ - `-n` | `--number`: Minimum number of molecules to generate.
+ - `-d` | `--device`: Hardware device to use. Default is 'cuda'.
+ - `-o` | `--output`: Output directory for generated molecules. Default is './ligand_output/'.
+ - `-b` | `--batch_size`: Number of molecules generated per batch. Try reducing this value if you have limited RAM. Default is 32.
+ ## 🔬 Example usage
+ - If you want to input a protein FASTA file
+ ```shell
+ python drug_generator.py -f bcl2.fasta -n 50
+ ```
+ - If you want to input the amino acid sequence of the protein
+ ```shell
+ python drug_generator.py -p MAKQPSDVSSECDREGRQLQPAERPPQLRPGAPTSLQTEPQGNPEGNHGGEGDSCPHGSPQGPLAPPASPGPFATRSPLFIFMRRSSLLSRSSSGYFSFDTDRSPAPMSCDKSTQTPSPPCQAFNHYLSAMASMRQAEPADMRPEIWIAQELRRIGDEFNAYYARRVFLNNYQAAEDHPRMVILRLLRYIVRLVWRMH -n 50
+ ```
+
+ - If you want to provide a prompt for the ligand
+ ```shell
+ python drug_generator.py -f bcl2.fasta -l COc1ccc(cc1)C(=O) -n 50
+ ```
+
+ - Note: If you are running in a Linux environment, you need to enclose the ligand prompt in single quotes ('').
+ ```shell
+ python drug_generator.py -f bcl2.fasta -l 'COc1ccc(cc1)C(=O)' -n 50
+ ```
+ ## 📝 How to reference this work
+ DrugGPT: A GPT-based Strategy for Designing Potential Ligands Targeting Specific Proteins
+
+ Yuesen Li, Chengyi Gao, Xin Song, Xiangyu Wang, Yungang Xu, Suxia Han
+
+ bioRxiv 2023.06.29.543848; doi: [https://doi.org/10.1101/2023.06.29.543848](https://doi.org/10.1101/2023.06.29.543848)
+
+ [![DOI](https://img.shields.io/badge/DOI-10.1101/2023.06.29.543848-blue)](https://doi.org/10.1101/2023.06.29.543848)
+ ## ⚖ License
+ [GNU General Public License v3.0](https://www.gnu.org/licenses/gpl-3.0.html)
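
The command-line options listed in the README's "How to use" section can be sketched with `argparse`. This is a hypothetical reconstruction from the README text only, not the actual contents of `drug_generator.py`: the function names `build_parser` and `first_fasta_sequence` are illustrative, the README states no default for `-n`, and the real script may validate its inputs differently. The sketch mainly shows how "only one of -p and -f" can be enforced with a mutually exclusive group.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the options listed in the README."""
    parser = argparse.ArgumentParser(description="DrugGPT option sketch")
    # -p and -f are alternatives, so argparse rejects a command line with both.
    source = parser.add_mutually_exclusive_group()
    source.add_argument("-p", "--pro_seq", type=str, help="protein amino acid sequence")
    source.add_argument("-f", "--fasta", type=str, help="path to a FASTA file")
    parser.add_argument("-l", "--ligand_prompt", type=str, default="", help="ligand prompt")
    parser.add_argument("-e", "--empty_input", action="store_true", help="direct generation mode")
    # The README gives no default for -n; None is a placeholder (assumption).
    parser.add_argument("-n", "--number", type=int, default=None, help="minimum number of molecules")
    # The three defaults below are the ones stated in the README.
    parser.add_argument("-d", "--device", type=str, default="cuda")
    parser.add_argument("-o", "--output", type=str, default="./ligand_output/")
    parser.add_argument("-b", "--batch_size", type=int, default=32)
    return parser


def first_fasta_sequence(text: str) -> str:
    """Return the sequence of the first FASTA record, skipping '>' header lines."""
    parts = []
    seen_header = False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seen_header and parts:
                break  # a second record begins; stop after the first one
            seen_header = True
            continue
        parts.append(line)
    return "".join(parts)


if __name__ == "__main__":
    # Parse an explicit argument list that mirrors one of the README examples.
    args = build_parser().parse_args(["-f", "bcl2.fasta", "-l", "COc1ccc(cc1)C(=O)", "-n", "50"])
    print(args.fasta, args.ligand_prompt, args.number, args.batch_size)
```

Passing the ligand prompt as a list element, as above, sidesteps the shell quoting issue that the README notes for Linux, because no shell interprets the parentheses.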
added_tokens.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "<|startoftext|>": 53082,
+   "[PAD]": 53081
+ }
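
The two added token ids line up with the `vocab_size` of 53083 declared in config.json: they occupy the last two slots of the embedding table. The check below is a minimal sketch of that arithmetic. The helper name `check_added_tokens` is illustrative, and the implied base vocabulary size assumes that vocab.json (not rendered in this diff) uses contiguous ids starting at 0.

```python
import json

# Contents of added_tokens.json as shown in this commit.
ADDED_TOKENS_JSON = '{"<|startoftext|>": 53082, "[PAD]": 53081}'
VOCAB_SIZE = 53083  # "vocab_size" from config.json


def check_added_tokens(added: dict, vocab_size: int) -> int:
    """Verify the added ids fill the tail of the vocabulary; return the base size."""
    ids = sorted(added.values())
    expected = list(range(vocab_size - len(ids), vocab_size))
    if ids != expected:
        raise ValueError(f"added ids {ids} do not end at vocab_size - 1")
    return vocab_size - len(ids)


base_size = check_added_tokens(json.loads(ADDED_TOKENS_JSON), VOCAB_SIZE)
print(base_size)  # 53081
```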
config.json ADDED
@@ -0,0 +1,39 @@
+ {
+   "_name_or_path": "../model_save/epoch_4",
+   "activation_function": "gelu_new",
+   "architectures": [
+     "GPT2LMHeadModel"
+   ],
+   "attn_pdrop": 0.1,
+   "bos_token_id": 50256,
+   "embd_pdrop": 0.1,
+   "eos_token_id": 50256,
+   "initializer_range": 0.02,
+   "layer_norm_epsilon": 1e-05,
+   "model_type": "gpt2",
+   "n_ctx": 1024,
+   "n_embd": 768,
+   "n_head": 12,
+   "n_inner": null,
+   "n_layer": 12,
+   "n_positions": 1024,
+   "reorder_and_upcast_attn": false,
+   "resid_pdrop": 0.1,
+   "scale_attn_by_inverse_layer_idx": false,
+   "scale_attn_weights": true,
+   "summary_activation": null,
+   "summary_first_dropout": 0.1,
+   "summary_proj_to_labels": true,
+   "summary_type": "cls_index",
+   "summary_use_proj": true,
+   "task_specific_params": {
+     "text-generation": {
+       "do_sample": true,
+       "max_length": 50
+     }
+   },
+   "torch_dtype": "float32",
+   "transformers_version": "4.27.2",
+   "use_cache": true,
+   "vocab_size": 53083
+ }
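
As a sanity check on this configuration, the number of trainable parameters can be estimated from the dimensions alone. The sketch below assumes the standard GPT-2 layout (token and position embeddings, then per block two layer norms, a fused QKV projection, an attention output projection and a two-layer MLP, followed by a final layer norm), an inner MLP width of `4 * n_embd` when `n_inner` is null, and an output head tied to the token embedding. The function name is illustrative.

```python
# Dimensions copied from config.json above.
CONFIG = {"vocab_size": 53083, "n_positions": 1024, "n_embd": 768, "n_layer": 12, "n_inner": None}


def gpt2_parameter_count(vocab_size, n_positions, n_embd, n_layer, n_inner=None):
    """Estimate GPT-2 parameters, assuming the LM head shares the token embedding."""
    inner = n_inner if n_inner is not None else 4 * n_embd
    embeddings = vocab_size * n_embd + n_positions * n_embd
    layer_norm = 2 * n_embd  # weight and bias
    # Fused QKV projection plus the attention output projection, each with bias.
    attention = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # Two MLP projections, each with bias.
    mlp = (n_embd * inner + inner) + (inner * n_embd + n_embd)
    block = 2 * layer_norm + attention + mlp
    return embeddings + n_layer * block + layer_norm  # plus the final layer norm


count = gpt2_parameter_count(**CONFIG)
print(count)      # 126610176
print(count * 4)  # 506440704 bytes at float32
```

The float32 estimate of about 506 MB is somewhat below the 519077053 bytes recorded for pytorch_model.bin in this commit. The remainder is presumably non-parameter buffers and serialization overhead, although this diff alone does not confirm that.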
generation_config.json ADDED
@@ -0,0 +1,6 @@
+ {
+   "_from_model_config": true,
+   "bos_token_id": 50256,
+   "eos_token_id": 50256,
+   "transformers_version": "4.27.2"
+ }
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e8e5c858ce18e9d2ac195182830525246c01b0741dd9519aa0871d93b742c870
+ size 519077053
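
The three lines above are a Git LFS pointer, not the weights themselves: they record the spec version, the SHA-256 digest of the real file, and its size in bytes. A downloaded copy of `pytorch_model.bin` can be checked against the pointer roughly as follows. This is a minimal sketch using only the standard library; the helper names `parse_lfs_pointer` and `matches_pointer` are illustrative and not part of any Git LFS tooling.

```python
import hashlib

# Pointer text as committed in this diff.
POINTER_TEXT = """version https://git-lfs.github.com/spec/v1
oid sha256:e8e5c858ce18e9d2ac195182830525246c01b0741dd9519aa0871d93b742c870
size 519077053
"""


def parse_lfs_pointer(text: str) -> dict:
    """Split 'key value' lines of a pointer file into typed fields."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, _, value = line.partition(" ")
        fields[key] = value
    algorithm, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algorithm": algorithm,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path: str, pointer: dict, chunk_size: int = 1 << 20) -> bool:
    """Hash the file in chunks and compare its size and digest with the pointer."""
    hasher = hashlib.new(pointer["algorithm"])
    size = 0
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
            size += len(chunk)
    return size == pointer["size"] and hasher.hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(POINTER_TEXT)
print(pointer["algorithm"], pointer["size"])  # sha256 519077053
```

Calling `matches_pointer("pytorch_model.bin", pointer)` on a local download returns `True` only if both the byte count and the digest agree.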
special_tokens_map.json ADDED
@@ -0,0 +1,12 @@
+ {
+   "bos_token": "<|startoftext|>",
+   "eos_token": "<|endoftext|>",
+   "pad_token": "[PAD]",
+   "unk_token": {
+     "content": "<|endoftext|>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,33 @@
+ {
+   "add_bos_token": false,
+   "add_prefix_space": false,
+   "bos_token": {
+     "__type": "AddedToken",
+     "content": "<|endoftext|>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "__type": "AddedToken",
+     "content": "<|endoftext|>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "errors": "replace",
+   "model_max_length": 1000000000000000019884624838656,
+   "pad_token": null,
+   "special_tokens_map_file": "tokenizer_folder/new_tokenizer_gpt\\special_tokens_map.json",
+   "tokenizer_class": "GPT2Tokenizer",
+   "unk_token": {
+     "__type": "AddedToken",
+     "content": "<|endoftext|>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
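
The special tokens declared here do not fully agree with special_tokens_map.json from the same commit. The sketch below compares the two declarations, reduced to the token strings shown in this diff; the helper names are illustrative. Which file takes precedence when the tokenizer is loaded depends on the `transformers` version, so the comparison only surfaces the differences and makes no claim about the effective values.

```python
# Token declarations copied from the two files in this commit (fields other than
# the token text are omitted).
SPECIAL_TOKENS_MAP = {
    "bos_token": "<|startoftext|>",
    "eos_token": "<|endoftext|>",
    "pad_token": "[PAD]",
    "unk_token": {"content": "<|endoftext|>"},
}
TOKENIZER_CONFIG = {
    "bos_token": {"content": "<|endoftext|>"},
    "eos_token": {"content": "<|endoftext|>"},
    "pad_token": None,
    "unk_token": {"content": "<|endoftext|>"},
}


def token_content(value):
    """Normalize a token entry (plain string, AddedToken-style dict, or None) to text."""
    if isinstance(value, dict):
        return value["content"]
    return value


def differing_tokens(first: dict, second: dict) -> dict:
    """Map each token name whose text differs to the pair (first, second)."""
    result = {}
    for name in sorted(set(first) | set(second)):
        a, b = token_content(first.get(name)), token_content(second.get(name))
        if a != b:
            result[name] = (a, b)
    return result


print(differing_tokens(SPECIAL_TOKENS_MAP, TOKENIZER_CONFIG))
# {'bos_token': ('<|startoftext|>', '<|endoftext|>'), 'pad_token': ('[PAD]', None)}
```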
vocab.json ADDED
The diff for this file is too large to render. See raw diff