pavm595 committed on
Commit 7bb020b · verified · 1 Parent(s): b712bbf

Update README.md

Files changed (1)
  1. README.md +59 -20
README.md CHANGED
@@ -1,20 +1,59 @@
- # ProtAlbert
-
- ```python
- from transformers import AutoModel, AlbertTokenizer, pipeline
- import re
-
- tokenizer = AlbertTokenizer.from_pretrained("Rostlab/prot_albert", do_lower_case=False)
-
- model = AutoModel.from_pretrained("Rostlab/prot_albert")
-
- fe = pipeline('feature-extraction', model=model, tokenizer=tokenizer, device=0)
-
- sequences_Example = ["A E T C Z A O", "S K T Z P"]
-
- sequences_Example = [re.sub(r"[UZOB]", "X", sequence) for sequence in sequences_Example]
-
- embedding = fe(sequences_Example)
-
- print(embedding)
- ```
+ ---
+ license: afl-3.0
+ datasets:
+ - bloyal/uniref100
+ pipeline_tag: fill-mask
+ library_name: transformers
+ tags:
+ - biology
+ - protein-language-model
+ - protein
+ ---
+ # ProtAlbert
+
+ ProtAlbert is a protein language model (pLM) pretrained on UniRef100 with a masked language modeling (MLM) objective. It is suitable both for creating embeddings (feature extraction) and for fill-mask prediction. The model was developed by Ahmed Elnaggar et al.; more information can be found in the [GitHub repository](https://github.com/agemagician/ProtTrans) and the [accompanying paper](https://ieeexplore.ieee.org/document/9477085). This repository is a fork of their [HuggingFace repository](https://huggingface.co/Rostlab/prot_albert/tree/main).
+
+ ## Inference examples
+
+ Example for masked language modeling:
+
+ ```python
+ >>> from transformers import AlbertForMaskedLM, AlbertTokenizer, pipeline
+
+ >>> tokenizer = AlbertTokenizer.from_pretrained("virtual-human-chc/prot_albert", do_lower_case=False)
+ >>> model = AlbertForMaskedLM.from_pretrained("virtual-human-chc/prot_albert")
+ >>> unmasker = pipeline('fill-mask', model=model, tokenizer=tokenizer)
+ >>> unmasker('D L I P T S S K L V V [MASK] D T S L Q V K K A F F A L V T')
+
+ [{'score': 0.10074187070131302, 'token': 13, 'token_str': 'L', 'sequence': 'D L I P T S S K L V V L D T S L Q V K K A F F A L V T'},
+  {'score': 0.08413360267877579, 'token': 14, 'token_str': 'S', 'sequence': 'D L I P T S S K L V V S D T S L Q V K K A F F A L V T'},
+  {'score': 0.07617155462503433, 'token': 18, 'token_str': 'V', 'sequence': 'D L I P T S S K L V V V D T S L Q V K K A F F A L V T'},
+  {'score': 0.06521160155534744, 'token': 19, 'token_str': 'T', 'sequence': 'D L I P T S S K L V V T D T S L Q V K K A F F A L V T'},
+  {'score': 0.06321343779563904, 'token': 15, 'token_str': 'A', 'sequence': 'D L I P T S S K L V V A D T S L Q V K K A F F A L V T'}]
+ ```
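+
+ The pipeline returns the top-5 predictions by default. As a minimal follow-up sketch (our addition, not part of the original README), the fill-mask pipeline's `top_k` and `targets` arguments can restrict the output, for example to score specific candidate amino acids at the masked position:
+
+ ```python
+ # Score only the candidates 'L' and 'S' at the masked position,
+ # reusing the unmasker pipeline defined above
+ unmasker('D L I P T S S K L V V [MASK] D T S L Q V K K A F F A L V T',
+          targets=['L', 'S'], top_k=2)
+ ```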
+
+ Example for feature extraction:
+
+ ```python
+ from transformers import AutoModel, AlbertTokenizer, pipeline
+ import re
+
+ tokenizer = AlbertTokenizer.from_pretrained("virtual-human-chc/prot_albert", do_lower_case=False)
+ model = AutoModel.from_pretrained("virtual-human-chc/prot_albert")
+
+ # device=0 runs the pipeline on the first GPU; use device=-1 for CPU
+ fe = pipeline('feature-extraction', model=model, tokenizer=tokenizer, device=0)
+
+ # Sequences are space-separated; map rare/ambiguous amino acids (U, Z, O, B) to X
+ sequences_Example = ["A E T C Z A O", "S K T Z P"]
+ sequences_Example = [re.sub(r"[UZOB]", "X", sequence) for sequence in sequences_Example]
+
+ # Returns one list of per-token embeddings for each input sequence
+ embedding = fe(sequences_Example)
+ print(embedding)
+ ```
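+
+ The output contains one embedding per token, including the tokenizer's special tokens. If a single fixed-size vector per protein is needed, one common choice is to mean-pool over the token dimension; a minimal sketch (the pooling step is our addition, not part of the original ProtTrans example):
+
+ ```python
+ import numpy as np
+
+ # Average the token embeddings of each sequence into one vector per protein.
+ # np.squeeze drops the leading batch dimension that some transformers
+ # versions include in each pipeline output.
+ per_protein = [np.squeeze(np.asarray(e)).mean(axis=0) for e in embedding]
+ print(per_protein[0].shape)  # (hidden_size,)
+ ```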
+
+ # Copyright
+
+ Code derived from https://github.com/agemagician/ProtTrans is licensed under the MIT License, Copyright (c) 2025 Ahmed Elnaggar. The ProtTrans pretrained models are released under the terms of the [Academic Free License v3.0](https://choosealicense.com/licenses/afl-3.0/), Copyright (c) 2025 Ahmed Elnaggar. The remaining code is licensed under the MIT License, Copyright (c) 2025 Maksim Pavlov.