# Pretrained model
This model is a GPT-style pretrained autoregressive transformer, trained on a large corpus of protein sequences. The pretraining task is "Sequence<...>", i.e., completing a protein sequence from its beginning.
Dataset: https://huggingface.co/datasets/lamm-mit/GPTProteinPretrained
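The pretraining corpus can be inspected directly from the Hub; a minimal sketch using the `datasets` library (printing the dataset object shows whatever splits and columns the repository actually exposes):
```python
from datasets import load_dataset

# Load the pretraining corpus from the Hugging Face Hub
dataset = load_dataset('lamm-mit/GPTProteinPretrained')
print(dataset)  # inspect the available splits and columns
```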
Load the pretrained model:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = 'cuda' if torch.cuda.is_available() else 'cpu'

pretrained_model_name = 'lamm-mit/GPTProteinPretrained'

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name,
    trust_remote_code=True,
).to(device)
model.config.use_cache = False
```
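As a quick sanity check after loading, you can report the parameter count and vocabulary size (a minimal sketch; it uses only the `model` and `tokenizer` objects created above):
```python
# Report model size and tokenizer vocabulary as a sanity check
n_params = sum(p.numel() for p in model.parameters())
print(f"Parameters: {n_params / 1e6:.1f}M")
print(f"Vocabulary size: {len(tokenizer)}")
```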
Sample inference using the "Sequence<...>" task, where the model autocompletes a sequence starting with "ETAVPKLLQAL":
```python
prompt = "Sequence<ETAVPKLLQAL"

# Encode the prompt and add a batch dimension
generated = torch.tensor(tokenizer.encode(prompt, add_special_tokens=False)).unsqueeze(0).to(device)
print(generated.shape, generated)

# Autocomplete the sequence with top-k/top-p sampling
sample_outputs = model.generate(
    inputs=generated,
    eos_token_id=tokenizer.eos_token_id,
    do_sample=True,
    top_k=500,
    top_p=0.9,
    max_length=1024,
    num_return_sequences=1,
    temperature=1.0,
)

for i, sample_output in enumerate(sample_outputs):
    print("{}: {}\n\n".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))
```
Output (here, one candidate sequence):
```raw
torch.Size([1, 57]) tensor([[ 86, 104, 116, 120, 104, 113, 102, 104,  63,  80,  74,  84,  72,  73,
          89,  81,  84,  87,  90,  89,  81,  72,  73,  76,  79,  79,  74,  79,
          86,  86,  71,  84,  81,  87,  84,  89,  73,  79,  73,  89,  79,  76,
          79,  89,  80,  92,  76,  76,  87,  89,  89,  74,  81,  86,  79,  76,
          79]], device='cuda:0')

0: Sequence<MGQEFVNQTWVNEFILLGLSSDQNTQVFLFVLILVMYIITVVGNSLILLLIRLDSRLHTPMYFFLSNLSFVDLCFSTTTVPQLLANFLSVHKSISFLGCVAQLYIFLTLGGTEFFLLGAMAYDRYVAVCYPLHYTVIMNWRVCTSLAVASWVSGFLNSLVHTVITFRLPFCGPNEIDHFFCEVPALLKLACADTSLNEMAMNACCVLILLIPFSLILISYTRILITILRMPSATGRRKAFSTCASHIIVVILFYGTAISTYIQPSSDPVADQDKLMALFYAILTPMLNPIIYSLRNKDVKGAWQKLLNKLRVTQKRKFMAVTLH>
```
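Since the model wraps its output in the task markup, the raw amino-acid string can be recovered by stripping the "Sequence<...>" delimiters. A minimal sketch; `extract_sequence` is an illustrative helper, not part of the model's API:
```python
def extract_sequence(decoded: str) -> str:
    """Illustrative helper: pull the amino-acid string out of 'Sequence<...>'."""
    start = decoded.find('<') + 1
    end = decoded.find('>', start)
    return decoded[start:end] if end != -1 else decoded[start:]

text = tokenizer.decode(sample_outputs[0], skip_special_tokens=True)
print(extract_sequence(text))  # e.g. 'MGQEFVNQTWVNEFILL...'
```
Setting `num_return_sequences` higher in `model.generate` yields several candidate completions that can be parsed the same way.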
## Citation
To cite this work:
```
@article{WeiKaplanBuehler_2023,
    title   = {Generative pretrained autoregressive transformer graph neural network applied to the analysis and discovery of novel proteins},
    author  = {M.J. Buehler},
    journal = {J. Appl. Phys.},
    year    = {2023},
    volume  = {},
    pages   = {},
    url     = {https://doi.org/10.1063/5.0157367}
}
```