lamm-mit/GPTProteinPretrained

mjbuehler committed on Dec 4, 2023

Commit 0bb3e26 · 1 Parent(s): 3af12c9

Create README.md

README.md +67 -0

	@@ -0,0 +1,67 @@

+# Pretrained model
+This is a GPT-style autoregressive transformer, pretrained on a large number of protein sequences.
+Load the pretrained model:
+```python
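+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+# Use the GPU when available; otherwise fall back to the CPU.
+device = "cuda" if torch.cuda.is_available() else "cpu"
+
+pretrained_model_name = 'lamm-mit/GPTProteinPretrained'
+# trust_remote_code=True runs custom code shipped with the model repository.
+tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name, trust_remote_code=True)
+tokenizer.pad_token = tokenizer.eos_token  # reuse EOS as the padding token
+model = AutoModelForCausalLM.from_pretrained(
+    pretrained_model_name,
+    trust_remote_code=True
+).to(device)
+model.config.use_cache = False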
+```
+Sample inference with the "Sequence<...>" task. Here, the model simply autocompletes a sequence that starts with "AIIAA":
+```python
+prompt = "Sequence<AIIAA"
+# Encode the prompt without special tokens and add a batch dimension.
+generated = torch.tensor(tokenizer.encode(prompt, add_special_tokens=False)).unsqueeze(0).to(device)
+print(generated.shape, generated)
+sample_outputs = model.generate(
+    inputs=generated,
+    eos_token_id=tokenizer.eos_token_id,
+    do_sample=True,
+    top_k=500,
+    max_length=300,
+    top_p=0.9,
+    num_return_sequences=3,
+    temperature=1,
+)
+# Decode and print each sampled candidate.
+for i, sample_output in enumerate(sample_outputs):
+    print("{}: {}\n\n".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))
+```
+Output (here, three candidate sequences):
+```raw
+torch.Size([1, 4]) tensor([[303,  32, 853, 261]], device='cuda:0')
+0: Sequence<AIIAAGGDHGAPFNIALESLINQSGRIWDDGISKETVEDLEDLKSLRLQDPTAEQALICSILSSLQLDDTRQAELISQGCEQIIQGNNNLTQQIEQFCCPIDLCGSTLWSNAGISTQWPIYDQLQIIWEQKTEVGCRFVIDSKQLVYQVEFATPVLTLPNLRGFTRLEYLNDYRNSYIYVGGDSMGFPFDGIVNDTCAAGTLAT>
+1: Sequence<AIIAASHEQVSRLLGDLIYKVNWGTATDSNTTVDSGSKYDADYAYVLKPDNIATIHTNIIDKWKADVDVTEENVDKFSGKPIYNSFHADGGIDLVGLTVEERMAHVHHRITLKPVYQYAGIEECMFNIDKARVLHIPEGYRKVYDRATAIHTAILDDPDYAEFMAYKMNKTDLVKPVELIEVTKLDKKGMWNGHHGGVVMLGGRGIHHASNGYGVETIEYFRNDNWSEEYHYDRVNLIHGMGGRGMKEAALEEIAKAINNLDYTSMIHDAEDYKILPSGESKDIVGETKLNGAMVGRAYLKLMKINMEELDVYMKPGSHHHHHH>
+2: Sequence<AIIAATKHRTRAKQLVEKLNEVSKTKKDLVLVGISASGQHRQIDTTSRRPSSAKKRVVLYGVLEKQFLHDARTYHPTNSRGITGELLLVEDLIHDRRLDNVAYVIQSKKGLIHQRRVTHGHVLVNRTHHVKVKAGSSDIVDFDKVIRVAEETAKESDVLIVLEADDPEALIYLGVKADIDIDVRTLTNEVGDGTTVHIIDLGADGILLPTKEDLKLPANVNKAVIDIKAKNIP>
+```
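The decoded strings above wrap each candidate in `Sequence<...>`. A minimal sketch of how the raw amino-acid sequence might be pulled out of such a string is shown below. The helper names `extract_sequence` and `is_valid_protein` are hypothetical and not part of the model repository, and the check against the 20 standard one-letter codes is an assumption about what the model emits. A candidate cut off by `max_length` may lack the closing `>`, so the sketch reports that case instead of rejecting it.

```python
import re

# Assumption: generated sequences use only the 20 standard one-letter codes.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def extract_sequence(decoded):
    """Parse text like 'Sequence<AIIAA...>' (hypothetical helper).

    Returns (sequence, complete), where `complete` is True only when the
    closing '>' is present. Returns (None, False) if no 'Sequence<' is found.
    """
    match = re.search(r"Sequence<([^<>]*)(>?)", decoded)
    if match is None:
        return None, False
    return match.group(1), match.group(2) == ">"


def is_valid_protein(sequence):
    """True if the sequence is non-empty and uses only standard residues."""
    return bool(sequence) and set(sequence) <= STANDARD_AMINO_ACIDS


# Usage with a short made-up string (not real model output):
sequence, complete = extract_sequence("Sequence<AIIAAGGDHG>")
print(sequence, complete, is_valid_protein(sequence))
```

In the generation loop, each `tokenizer.decode(...)` result could be passed through `extract_sequence` before further analysis. Incomplete candidates can then be kept or discarded depending on the application.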
+## Citation
+To cite this work:
+```
+@article{WeiKaplanBuehler_2023,
+    title   = {Generative pretrained autoregressive transformer graph neural network applied to the analysis and discovery of novel proteins},
+    author  = {M.J. Buehler},
+    journal = {J. Appl. Phys.},
+    year    = {2023},
+    volume  = {},
+    pages   = {},
+    url     = {https://doi.org/10.1063/5.0157367}
+}
+```