# Pretrained model

This model is a GPT-style autoregressive transformer, pretrained on a large corpus of protein sequences.

Load the pretrained model:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Select a device for inference
device = "cuda" if torch.cuda.is_available() else "cpu"

pretrained_model_name = 'lamm-mit/GPTProteinPretrained'

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name,
    trust_remote_code=True,
).to(device)

model.config.use_cache = False
```

Sample inference using the "Sequence<...>" task, where the model autocompletes a sequence starting with "AIIAA":

```python
prompt = "Sequence<AIIAA"
generated = torch.tensor(tokenizer.encode(prompt, add_special_tokens=False)).unsqueeze(0).to(device)
print(generated.shape, generated)

sample_outputs = model.generate(
    inputs=generated,
    eos_token_id=tokenizer.eos_token_id,
    do_sample=True,
    top_k=500,
    max_length=300,
    top_p=0.9,
    num_return_sequences=3,
    temperature=1,
)

for i, sample_output in enumerate(sample_outputs):
    print("{}: {}\n\n".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))
```
Output (here, three candidate sequences):
```raw
torch.Size([1, 4]) tensor([[303,  32, 853, 261]], device='cuda:0')
0: Sequence<AIIAAGGDHGAPFNIALESLINQSGRIWDDGISKETVEDLEDLKSLRLQDPTAEQALICSILSSLQLDDTRQAELISQGCEQIIQGNNNLTQQIEQFCCPIDLCGSTLWSNAGISTQWPIYDQLQIIWEQKTEVGCRFVIDSKQLVYQVEFATPVLTLPNLRGFTRLEYLNDYRNSYIYVGGDSMGFPFDGIVNDTCAAGTLAT>
1: Sequence<AIIAASHEQVSRLLGDLIYKVNWGTATDSNTTVDSGSKYDADYAYVLKPDNIATIHTNIIDKWKADVDVTEENVDKFSGKPIYNSFHADGGIDLVGLTVEERMAHVHHRITLKPVYQYAGIEECMFNIDKARVLHIPEGYRKVYDRATAIHTAILDDPDYAEFMAYKMNKTDLVKPVELIEVTKLDKKGMWNGHHGGVVMLGGRGIHHASNGYGVETIEYFRNDNWSEEYHYDRVNLIHGMGGRGMKEAALEEIAKAINNLDYTSMIHDAEDYKILPSGESKDIVGETKLNGAMVGRAYLKLMKINMEELDVYMKPGSHHHHHH>
2: Sequence<AIIAATKHRTRAKQLVEKLNEVSKTKKDLVLVGISASGQHRQIDTTSRRPSSAKKRVVLYGVLEKQFLHDARTYHPTNSRGITGELLLVEDLIHDRRLDNVAYVIQSKKGLIHQRRVTHGHVLVNRTHHVKVKAGSSDIVDFDKVIRVAEETAKESDVLIVLEADDPEALIYLGVKADIDIDVRTLTNEVGDGTTVHIIDLGADGILLPTKEDLKLPANVNKAVIDIKAKNIP>
```
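
Each returned candidate is wrapped in the `Sequence<...>` task markers, so downstream use typically requires stripping them. A minimal sketch of one way to do this; the helper name and regex below are illustrative and not part of the model repository:

```python
import re

def extract_sequences(decoded_outputs):
    """Pull the amino-acid sequence out of each 'Sequence<...>' completion.

    Only completions with a closing '>' (i.e., where generation reached the
    end-of-sequence marker) are kept; unterminated outputs are skipped.
    """
    sequences = []
    for text in decoded_outputs:
        match = re.search(r"Sequence<([A-Z]+)>", text)
        if match:
            sequences.append(match.group(1))
    return sequences

# Example with one complete and one unterminated completion:
decoded = ["0: Sequence<AIIAAGGDH>", "1: Sequence<AIIAASHEQ"]
print(extract_sequences(decoded))  # -> ['AIIAAGGDH']
```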

## Citation
To cite this work:
```
@article{WeiKaplanBuehler_2023,
    title   = {Generative pretrained autoregressive transformer graph neural network applied to the analysis and discovery of novel proteins},
    author  = {M.J. Buehler},
    journal = {J. Appl. Phys.},
    year    = {2023},
    volume  = {},
    pages   = {},
    url     = {https://doi.org/10.1063/5.0157367}
}
```