---
tags:
- Causal Language Modeling
- GPT2
- ESM2
- Proteins
- GNN
library_name: transformers
pipeline_tag: text-generation
language:
- en
license: cc-by-nc-4.0
---
# Prot2Text Model Card

## Model Information

**Model Page:** [Prot2Text](http://nlp.polytechnique.fr/prot2text#proteins)  
**Paper:** [https://arxiv.org/abs/2307.14367](https://arxiv.org/abs/2307.14367)  
**Github:** [https://github.com/hadi-abdine/Prot2Text](https://github.com/hadi-abdine/Prot2Text)

```bibtex
@inproceedings{abdine2024prot2text,
  title={Prot2Text: Multimodal Protein's Function Generation with GNNs and Transformers},
  author={Abdine, Hadi and Chatzianastasis, Michail and Bouyioukos, Costas and Vazirgiannis, Michalis},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={38},
  pages={10757--10765},
  year={2024}
}
```

### Description

Prot2Text is a family of models that predict a protein's function in free-text form, moving beyond conventional binary or categorical classification. By combining Graph Neural Networks (GNNs) and Large Language Models (LLMs) in an encoder-decoder framework, Prot2Text integrates diverse data types, including protein sequence, structure, and textual annotation. This multimodal approach yields a holistic representation of a protein's function, enabling the generation of detailed and accurate functional descriptions.

Prot2Text is trained on a [multimodal dataset](https://huggingface.co/datasets/habdine/Prot2Text-Data) of 256,690 proteins. For each protein, the dataset provides three pieces of information: the corresponding amino-acid sequence, the AlphaFold accession ID, and the textual description. The dataset was built from the SwissProt database (Release 2022_04), the only curated protein knowledge base in UniProtKB (Consortium, 2016) that includes full textual descriptions of proteins.

### Models and Results

| Model | #params | BLEU Score | ROUGE-1 | ROUGE-2 | ROUGE-L | BERT Score | Link |
|:--------------------------:|:-------:|:----------:|:-------:|:-------:|:-------:|:----------:|:----:|
| Prot2Text<sub>SMALL</sub>  | 256M | 30.01 | 45.78 | 38.08 | 43.97 | 82.60 | [v1.0](https://huggingface.co/habdine/Prot2Text-Small-v1-0) · [v1.1](https://huggingface.co/habdine/Prot2Text-Small-v1-1) |
| Prot2Text<sub>BASE</sub>   | 283M | 35.11 | 50.59 | 42.71 | 48.49 | 84.30 | [v1.0](https://huggingface.co/habdine/Prot2Text-Base-v1-0) · [v1.1](https://huggingface.co/habdine/Prot2Text-Base-v1-1) |
| Prot2Text<sub>MEDIUM</sub> | 398M | 36.51 | 52.13 | 44.17 | 50.04 | 84.83 | [v1.0](https://huggingface.co/habdine/Prot2Text-Medium-v1-0) · [v1.1](https://huggingface.co/habdine/Prot2Text-Medium-v1-1) |
| Prot2Text<sub>LARGE</sub>  | 898M | 36.29 | 53.68 | 45.60 | 51.40 | 85.20 | [v1.0](https://huggingface.co/habdine/Prot2Text-Large-v1-0) · [v1.1](https://huggingface.co/habdine/Prot2Text-Large-v1-1) |
| Esm2Text<sub>BASE</sub>    | 225M | 32.11 | 47.46 | 39.18 | 45.31 | 83.21 | [v1.0](https://huggingface.co/habdine/Esm2Text-Base-v1-0) · [v1.1](https://huggingface.co/habdine/Esm2Text-Base-v1-1) |

The reported results are computed using v1.0 of each model.

### Usage

Below are some code snippets to help you get started running the models. First, install the Transformers library, graphein, DSSP, PyTorch, and PyTorch Geometric with:
```sh
pip install -U transformers
git clone https://github.com/a-r-j/graphein.git
pip install -e graphein/
pip install torch
pip install torch_geometric
pip install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv
sudo apt-get install dssp
sudo ln -s /usr/bin/mkdssp /usr/bin/dssp
```
You might need to install different versions or variants of these packages depending on your environment.

Then, copy the snippet from the section that is relevant to your use case.

#### Running Prot2Text to generate a protein's function using both its structure and sequence
To generate a protein's function from both its structure and amino-acid sequence, load one of the Prot2Text models and pass the protein's AlphaFold database ID.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('habdine/Prot2Text-Base-v1-1',
                                          trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained('habdine/Prot2Text-Base-v1-1',
                                             trust_remote_code=True)

function = model.generate_protein_description(protein_pdbID='Q10MK9',
                                              tokenizer=tokenizer,
                                              device='cuda'  # replace with 'mps' to run on a Mac device
                                              )

print(function)
# 'Carboxylate--CoA ligase that may use 4-coumarate as substrate. Follows a two-step reaction mechanism, wherein the carboxylate substrate first undergoes adenylation by ATP, followed by a thioesterification in the presence of CoA to yield the final CoA thioester.'
```
<br>

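The `protein_pdbID` above is the protein's accession in the AlphaFold database (a UniProt accession such as `Q10MK9`). If you want to inspect the predicted structure yourself, the public AlphaFold DB serves files under a predictable naming scheme; a minimal sketch, assuming the current `v4` release naming:

```python
def alphafold_pdb_url(accession: str, version: int = 4) -> str:
    """Build the public AlphaFold DB download URL for a UniProt accession.

    Assumes the AF-{accession}-F1-model_v{version}.pdb naming used by the
    AlphaFold database; adjust `version` if the database release changes.
    """
    return f"https://alphafold.ebi.ac.uk/files/AF-{accession}-F1-model_v{version}.pdb"

print(alphafold_pdb_url('Q10MK9'))
# https://alphafold.ebi.ac.uk/files/AF-Q10MK9-F1-model_v4.pdb
```

This is only for manual inspection; the model itself handles structure retrieval when given the ID.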
#### Running Esm2Text to generate a protein's function using only its sequence
To generate a protein's function from its amino-acid sequence alone, load the Esm2Text-Base model and pass the sequence.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('habdine/Esm2Text-Base-v1-1',
                                          trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained('habdine/Esm2Text-Base-v1-1',
                                             trust_remote_code=True)

function = model.generate_protein_description(protein_sequence='AEQAERYEEMVEFMEKL',
                                              tokenizer=tokenizer,
                                              device='cuda'  # replace with 'mps' to run on a Mac device
                                              )

print(function)
# 'A cytochrome b6-f complex catalyzes the calcium-dependent hydrolysis of the 2-acyl groups in 3-sn-phosphoglycerides. Its physiological function is not known.'
```
<br>
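
`protein_sequence` should be a plain string of amino-acid letters, as in the example above. A small illustrative helper (not part of the library) to check a sequence against the 20 canonical one-letter codes before calling the model:

```python
# Hypothetical pre-check, not part of Prot2Text/Esm2Text itself.
CANONICAL_AA = set('ACDEFGHIKLMNPQRSTVWY')  # the 20 standard amino acids

def is_canonical_sequence(seq: str) -> bool:
    """True if seq is non-empty and uses only canonical one-letter codes."""
    return bool(seq) and set(seq.upper()) <= CANONICAL_AA

print(is_canonical_sequence('AEQAERYEEMVEFMEKL'))  # True
print(is_canonical_sequence('AEQ4ERB'))            # False ('4' and 'B' are not canonical)
```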

## Notice
THE INFORMATION PROVIDED IS THEORETICAL MODELLING ONLY AND CAUTION SHOULD BE EXERCISED IN ITS USE. IT IS PROVIDED "AS-IS" WITHOUT ANY WARRANTY OF ANY KIND, WHETHER EXPRESSED OR IMPLIED. NO WARRANTY IS GIVEN THAT USE OF THE INFORMATION SHALL NOT INFRINGE THE RIGHTS OF ANY THIRD PARTY. THE INFORMATION IS NOT INTENDED TO BE A SUBSTITUTE FOR PROFESSIONAL MEDICAL ADVICE, DIAGNOSIS, OR TREATMENT, AND DOES NOT CONSTITUTE MEDICAL OR OTHER PROFESSIONAL ADVICE.