# Pretrained model
This model is a GPT-style pretrained autoregressive transformer, trained on a large corpus of protein sequences. The pretraining task is "Sequence<...>", i.e., completing a protein sequence from its beginning.
Dataset: https://huggingface.co/datasets/lamm-mit/GPTProteinPretrained
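The pretraining corpus can be inspected directly from the Hub; a minimal sketch using the `datasets` library (printing the dataset object shows whatever splits and columns the repository actually exposes):
```python
from datasets import load_dataset

# Load the pretraining corpus from the Hugging Face Hub
dataset = load_dataset('lamm-mit/GPTProteinPretrained')
print(dataset)  # inspect the available splits and columns
```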
Load the pretrained model:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = 'cuda' if torch.cuda.is_available() else 'cpu'

pretrained_model_name = 'lamm-mit/GPTProteinPretrained'

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name,
    trust_remote_code=True,
).to(device)
model.config.use_cache = False
```
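As a quick sanity check after loading, you can report the parameter count and vocabulary size (a minimal sketch; it uses only the `model` and `tokenizer` objects created above):
```python
# Report model size and tokenizer vocabulary as a sanity check
n_params = sum(p.numel() for p in model.parameters())
print(f"Parameters: {n_params / 1e6:.1f}M")
print(f"Vocabulary size: {len(tokenizer)}")
```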
Sample inference using the "Sequence<...>" task, where the model autocompletes a sequence starting with "ETAVPKLLQAL":
```python
prompt = "Sequence<ETAVPKLLQAL"

# Encode the prompt and add a batch dimension
generated = torch.tensor(tokenizer.encode(prompt, add_special_tokens=False)).unsqueeze(0).to(device)
print(generated.shape, generated)

# Autocomplete the sequence with top-k/top-p sampling
sample_outputs = model.generate(
    inputs=generated,
    eos_token_id=tokenizer.eos_token_id,
    do_sample=True,
    top_k=500,
    top_p=0.9,
    max_length=1024,
    num_return_sequences=1,
    temperature=1.0,
)

for i, sample_output in enumerate(sample_outputs):
    print("{}: {}\n\n".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))
```
Output (here, one candidate sequence):
```raw
torch.Size([1, 57]) tensor([[ 86, 104, 116, 120, 104, 113, 102, 104,  63,  80,  74,  84,  72,  73,
          89,  81,  84,  87,  90,  89,  81,  72,  73,  76,  79,  79,  74,  79,
          86,  86,  71,  84,  81,  87,  84,  89,  73,  79,  73,  89,  79,  76,
          79,  89,  80,  92,  76,  76,  87,  89,  89,  74,  81,  86,  79,  76,
          79]], device='cuda:0')

0: Sequence<MGQEFVNQTWVNEFILLGLSSDQNTQVFLFVLILVMYIITVVGNSLILLLIRLDSRLHTPMYFFLSNLSFVDLCFSTTTVPQLLANFLSVHKSISFLGCVAQLYIFLTLGGTEFFLLGAMAYDRYVAVCYPLHYTVIMNWRVCTSLAVASWVSGFLNSLVHTVITFRLPFCGPNEIDHFFCEVPALLKLACADTSLNEMAMNACCVLILLIPFSLILISYTRILITILRMPSATGRRKAFSTCASHIIVVILFYGTAISTYIQPSSDPVADQDKLMALFYAILTPMLNPIIYSLRNKDVKGAWQKLLNKLRVTQKRKFMAVTLH>
```
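Since the model wraps its output in the task markup, the raw amino-acid string can be recovered by stripping the "Sequence<...>" delimiters. A minimal sketch; `extract_sequence` is an illustrative helper, not part of the model's API:
```python
def extract_sequence(decoded: str) -> str:
    """Illustrative helper: pull the amino-acid string out of 'Sequence<...>'."""
    start = decoded.find('<') + 1
    end = decoded.find('>', start)
    return decoded[start:end] if end != -1 else decoded[start:]

text = tokenizer.decode(sample_outputs[0], skip_special_tokens=True)
print(extract_sequence(text))  # e.g. 'MGQEFVNQTWVNEFILL...'
```
Setting `num_return_sequences` higher in `model.generate` yields several candidate completions that can be parsed the same way.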
## Citation
To cite this work:
```
@article{WeiKaplanBuehler_2023,
    title   = {Generative pretrained autoregressive transformer graph neural network applied to the analysis and discovery of novel proteins},
    author  = {M.J. Buehler},
    journal = {J. Appl. Phys.},
    year    = {2023},
    volume  = {},
    pages   = {},
    url     = {https://doi.org/10.1063/5.0157367}
}
```