---
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
tags:
- protein-language-model
---

# ProLLaMA: A Protein Large Language Model for Multi-Task Protein Language Processing

See the [paper on arXiv](https://arxiv.org/abs/2402.16445) and the [GitHub repository](https://github.com/PKU-YuanGroup/ProLLaMA) for more information.

ProLLaMA_Stage_1 is based on Llama-2-7b, so please also comply with the Llama 2 license.

## Abstract
Recent advances in Protein Language Models (PLMs) have transformed protein engineering, yet unlike their counterparts in Natural Language Processing (NLP), current PLMs exhibit a fundamental limitation: they excel in either Protein Language Understanding (PLU) or Protein Language Generation (PLG), but rarely both. This fragmentation hinders progress in protein engineering. To bridge this gap, we introduce ProLLaMA, a multitask protein language model enhanced by the Evolutionary Protein Generation Framework (EPGF). We construct a comprehensive instruction dataset containing approximately 13 million samples with over 11,000 superfamily annotations to facilitate better modeling of sequence-function landscapes. We leverage a two-stage training approach to develop ProLLaMA, a multitask LLM with protein domain expertise. Our EPGF addresses the mismatch between statistical language modeling and biological constraints through three innovations: a multi-dimensional interpretable scorer, hierarchical efficient decoding, and a probabilistic-biophysical joint selection mechanism. Extensive experiments demonstrate that ProLLaMA excels in both unconditional and controllable protein generation tasks, achieving superior structural quality metrics compared to existing PLMs. Additionally, ProLLaMA demonstrates strong understanding capabilities with a 67.1% exact match rate in superfamily prediction. EPGF significantly enhances the biological viability of generated sequences, as evidenced by improved biophysical scores (+4.3%) and structural metrics (+14.5%).

## Usage

This model is compatible with the `transformers` library. Below is a quick example of how to perform inference.

### Input Format
Instructions given to the model should follow one of these two formats:
```text
[Generate by superfamily] Superfamily=<xxx>
or
[Determine superfamily] Seq=<yyy>
```
Here are some example inputs:
```text
[Generate by superfamily] Superfamily=<Ankyrin repeat-containing domain superfamily>
```
You can also specify the first few amino acids of the protein sequence:
```text
[Generate by superfamily] Superfamily=<Ankyrin repeat-containing domain superfamily> Seq=<MKRVL
```
```text
[Determine superfamily] Seq=<MAPGGMPREFPSFVRTLPEADLGYPALRGWVLQGERGCVLYWEAVTEVALPEHCHAECWGVVVDGRMELMVDGYTRVYTRGDLYVVPPQARHRARVFPGFRGVEHLSDPDLLPVRKR>
```
For a full list of optional superfamilies, refer to [this file](https://github.com/PKU-YuanGroup/ProLLaMA/blob/main/superfamilies.txt) in the GitHub repository.
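If you are scripting many queries, the two instruction formats above can be assembled with a small helper. This is a minimal sketch; the function names here are illustrative and not part of ProLLaMA itself:

```python
def generation_prompt(superfamily: str, seq_prefix: str = "") -> str:
    """Build a '[Generate by superfamily]' instruction; an optional
    sequence prefix constrains the first amino acids (the prefix's
    'Seq=<' span is deliberately left unclosed, as in the examples above)."""
    prompt = f"[Generate by superfamily] Superfamily=<{superfamily}>"
    if seq_prefix:
        prompt += f" Seq=<{seq_prefix}"
    return prompt


def understanding_prompt(sequence: str) -> str:
    """Build a '[Determine superfamily]' instruction for a given sequence."""
    return f"[Determine superfamily] Seq=<{sequence}>"


print(generation_prompt("Ankyrin repeat-containing domain superfamily", "MKRVL"))
print(understanding_prompt("MAPGG"))
```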

### Quick Inference Example

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

# Replace the model ID with a local path if you have downloaded the weights.
tokenizer = AutoTokenizer.from_pretrained("GreatCaptainNemo/ProLLaMA", use_fast=False, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("GreatCaptainNemo/ProLLaMA", device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True)
model.eval()

generation_config = GenerationConfig(
    temperature=0.2,
    top_k=40,
    top_p=0.9,
    do_sample=True,
    num_beams=1,
    repetition_penalty=1.2,
    max_new_tokens=400
)

print("####Enter 'exit' to exit.")
with torch.no_grad():
    while True:
        user = str(input("Input:"))
        if user.strip() == "exit":
            break
        # Move inputs to the same device the model was placed on by device_map.
        inputs = tokenizer(user, return_tensors="pt").to(model.device)
        generate_ids = model.generate(inputs.input_ids, generation_config=generation_config)
        response = tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
        print("Output:", response)
```
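The loop above prints the raw decoded text, which includes the instruction itself. If you only need the amino-acid sequence, a regular-expression sketch like the following can extract it. This assumes the output echoes the `Seq=<...>` format shown earlier, which may not hold for every response:

```python
import re
from typing import Optional


def extract_sequence(response: str) -> Optional[str]:
    """Pull the amino-acid sequence out of a 'Seq=<...>' span, if present."""
    match = re.search(r"Seq=<([A-Z]+)>?", response)
    return match.group(1) if match else None


example = "[Generate by superfamily] Superfamily=<Ankyrin repeat-containing domain superfamily> Seq=<MKRVLAAG>"
print(extract_sequence(example))  # -> MKRVLAAG
```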

## Citation
```bibtex
@article{lv2025prollama,
  title={{ProLLaMA}: A protein large language model for multi-task protein language processing},
  author={Lv, Liuzhenghao and Lin, Zongying and Li, Hao and Liu, Yuyang and Cui, Jiaxi and Chen, Calvin Yu-Chian and Yuan, Li and Tian, Yonghong},
  journal={IEEE Transactions on Artificial Intelligence},
  year={2025},
  publisher={IEEE}
}
```