---
datasets:
- multimolecule/uniref
library_name: multimolecule
license: agpl-3.0
mask_token: <mask>
pipeline_tag: fill-mask
tags:
- Biology
- Protein
- protein
widget:
- example_title: prion protein (Kanno blood group)
  mask_index: 13
  mask_index_1based: 14
  masked_char: A
  output:
  - label: W
    score: 0.627241
  - label: L
    score: 0.064748
  - label: J
    score: 0.035412
  - label: V
    score: 0.029481
  - label: S
    score: 0.025956
  pipeline_tag: fill-mask
  sequence_type: Protein
  task: fill-mask
  text: MANLGCWMLVLFV<mask>TWSDLGLCKKRPKPGGWNTGGSRYPGQGSPGGNRYPPQGGGGWGQPHGGGWGQPHGGGWGQPHGGGWGQPHGGGWGQGGGTHSQWNKPSKPKTNMKHMAGAAAAGAVVGGLGGYMLGSAMSRPIIHFGSDYEDRYYRENMHRYPNQVYYRPMDEYSNQNNFVHDCVNITIKQHTVTTTTKGENFTETDVKMMERVVEQMCITQYERESQAYYQRGSSMVLFSSPPVILLISFLIFLIVG
- example_title: interleukin 10
  mask_index: 17
  mask_index_1based: 18
  masked_char: A
  output:
  - label: R
    score: 0.60463
  - label: G
    score: 0.055521
  - label: P
    score: 0.02906
  - label: S
    score: 0.028023
  - label: '?'
    score: 0.022019
  pipeline_tag: fill-mask
  sequence_type: Protein
  task: fill-mask
  text: MHSSALLCCLVLLTGVR<mask>SPGQGTQSENSCTHFPGNLPNMLRDLRDAFSRVKTFFQMKDQLDNLLLKESLLEDFKGYLGCQALSEMIQFYLEEVMPQAENQDPDIKAHVNSLGENLKTLRLRLRRCHRFLPCENKSKAVEQVKNAFNKLQEKGIYKAMSEFDIFINYIEAYMTMKIRN
- example_title: Zaire ebolavirus
  mask_index: 10
  mask_index_1based: 11
  masked_char: A
  output:
  - label: H
    score: 0.436416
  - label: D
    score: 0.147794
  - label: B
    score: 0.048469
  - label: C
    score: 0.030239
  - label: S
    score: 0.022767
  pipeline_tag: fill-mask
  sequence_type: Protein
  task: fill-mask
  text: NVQTLCEALL<mask>DGLAKAFPSNMMVVTEREQKESLLHQASWHHTSDDFGEHATVRGSSFVTDLEKYNLAFRYEFTAPFIEYCNRCYGVKNVFNWMHYTIPQCY
- example_title: SARS coronavirus
  mask_index: 26
  mask_index_1based: 27
  masked_char: A
  output:
  - label: D
    score: 0.201616
  - label: B
    score: 0.138675
  - label: N
    score: 0.095383
  - label: F
    score: 0.088915
  - label: I
    score: 0.073027
  pipeline_tag: fill-mask
  sequence_type: Protein
  task: fill-mask
  text: MFIFLLFLTLTSGSDLDRCTTFDDVQ<mask>PNYTQHTSSMRGVYYPDEIFRSDTLYLTQDLFLPFYSNVTGFHTINHTFDNPVIPFKDGIYFAATEKSNVVRGWVFGSTMNNKSQSVIIINNSTNVVIRACNFELCDNPFFAVSKPMGTQTHTMIFDNAFKCTFEYIS
- example_title: insulin
  mask_index: 11
  mask_index_1based: 12
  masked_char: A
  output:
  - label: L
    score: 0.495459
  - label: C
    score: 0.367089
  - label: P
    score: 0.034614
  - label: A
    score: 0.017155
  - label: J
    score: 0.016473
  pipeline_tag: fill-mask
  sequence_type: Protein
  task: fill-mask
  text: MALWMRLLPLL<mask>LLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN
- example_title: cyclin dependent kinase inhibitor 2A
  mask_index: 12
  mask_index_1based: 13
  masked_char: A
  output:
  - label: P
    score: 0.372832
  - label: R
    score: 0.110636
  - label: D
    score: 0.09743
  - label: A
    score: 0.090202
  - label: L
    score: 0.072687
  pipeline_tag: fill-mask
  sequence_type: Protein
  task: fill-mask
  text: MEPAAGSSMEPS<mask>DWLATAAARGRVEEVRALLEAGALPNAPNSYGRRPIQVMMMGSARVAELLLLHGAEPNCADPATLTRPVHDAAREGFLDTLVVLHRAGARLDVRDAWGRLPVDLAEELGHRDVARYLRAAAGGTRGSNHARIDAAEGPSDIPD
- example_title: human papillomavirus type 16 E6
  mask_index: 52
  mask_index_1based: 53
  masked_char: A
  output:
  - label: C
    score: 0.242568
  - label: D
    score: 0.230786
  - label: P
    score: 0.049231
  - label: B
    score: 0.049184
  - label: L
    score: 0.033364
  pipeline_tag: fill-mask
  sequence_type: Protein
  task: fill-mask
  text: MHQKRTAMFQDPQERPRKLPQLCTELQTTIHDIILECVYCKQQLLRREVYDF<mask>FRDLCIVYRDGNPYAVCDKCLKFYSKISEYRHYCYSVYGTTLEQQYNKPLCDLLIRCINCQKPLCPEEKQRHLDKKQRFHNIRGRWTGRCMSCCRSSRTRRETQL
---
# ProteinBERT
Pre-trained model on protein sequences and Gene Ontology annotations using a combined language modeling and annotation prediction objective.
## Disclaimer
This is an UNOFFICIAL implementation of [ProteinBERT: a universal deep-learning model of protein sequence and function](https://doi.org/10.1093/bioinformatics/btac020) by Nadav Brandes et al.
The OFFICIAL repository of ProteinBERT is at [nadavbra/protein_bert](https://github.com/nadavbra/protein_bert).
> [!TIP]
> The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.
**The team releasing ProteinBERT did not write a model card for this model, so this model card has been written by the MultiMolecule team.**
## Model Details
ProteinBERT is a protein language model with coupled local residue representations and a global protein representation.
It is pre-trained on UniRef90 with a sequence language modeling objective and a Gene Ontology annotation recovery objective.
ProteinBERT uses convolutional local branches and global-attention layers instead of quadratic self-attention, so the architecture has no learned positional table and can be evaluated on variable sequence lengths.
### Model Specification
| Num Layers | Hidden Size | Global Hidden Size | Num Heads | Num Parameters (M) | FLOPs (G) | MACs (G) | Max Num Tokens |
| ---------- | ----------- | ------------------ | --------- | ------------------ | --------- | -------- | -------------- |
| 6 | 128 | 512 | 4 | 15.98 | 7.16 | 3.54 | 1024 |
### Links
- **Code**: [multimolecule.proteinbert](https://github.com/DLS5-Omics/multimolecule/tree/master/multimolecule/models/proteinbert)
- **Data**: [UniRef90](https://www.uniprot.org/help/uniref)
- **Paper**: [ProteinBERT: a universal deep-learning model of protein sequence and function](https://doi.org/10.1093/bioinformatics/btac020)
- **Developed by**: Nadav Brandes, Dan Ofer, Yam Peleg, Nadav Rappoport, Michal Linial
- **Model type**: Protein language model with local convolutional branches and global-attention layers
- **Original Repository**: [nadavbra/protein_bert](https://github.com/nadavbra/protein_bert)
## Usage
The model file depends on the [`multimolecule`](https://multimolecule.danling.org) library. You can install it using pip:
```bash
pip install multimolecule
```
### Direct Use
#### Masked Language Modeling
You can use this model directly with a pipeline for masked language modeling:
```python
import multimolecule # you must import multimolecule to register models
from transformers import pipeline
predictor = pipeline("fill-mask", model="multimolecule/proteinbert")
output = predictor("MVLSPADKTNVKAAW<mask>KVGAHAGEYGAEALER")
```
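The masked input above can also be built from an unmasked sequence, and the predictions can be post-processed before use. The following is a minimal sketch rather than part of the MultiMolecule API: `mask_residue` and `keep_standard` are hypothetical helper names, and the prediction dictionaries are assumed to follow the `score` / `token_str` layout returned by the `transformers` fill-mask pipeline. The filter is motivated by the widget examples in this card's metadata, where non-standard symbols such as `B`, `J`, and `?` appear among the top predictions.

```python
MASK_TOKEN = "<mask>"  # mask token declared in this model card's metadata
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 canonical residues


def mask_residue(sequence, index):
    """Replace the residue at a 0-based index with the mask token."""
    if not 0 <= index < len(sequence):
        raise IndexError(f"index {index} is outside a sequence of length {len(sequence)}")
    return sequence[:index] + MASK_TOKEN + sequence[index + 1:]


def keep_standard(predictions):
    """Keep only predictions whose token is one of the 20 canonical amino acids.

    Assumes each prediction is a dict with a "token_str" key, as produced by
    the transformers fill-mask pipeline. Scores are left unchanged.
    """
    return [p for p in predictions if p["token_str"].strip() in STANDARD_AMINO_ACIDS]


masked = mask_residue("MVLSPADKTNVKAAWGKVGAHAGEYGAEALER", 15)
print(masked)  # MVLSPADKTNVKAAW<mask>KVGAHAGEYGAEALER
```

With the pipeline from the snippet above, `keep_standard(predictor(masked))` would then return only the canonical-residue candidates, in the order the pipeline ranked them.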
### Downstream Use
#### Extract Features
Here is how to use this model to get the features of a given sequence in PyTorch:
```python
from multimolecule import ProteinTokenizer, ProteinBertModel
tokenizer = ProteinTokenizer.from_pretrained("multimolecule/proteinbert")
model = ProteinBertModel.from_pretrained("multimolecule/proteinbert")
text = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALER"
input = tokenizer(text, return_tensors="pt")
output = model(**input)
```
#### Sequence Classification / Regression
> [!NOTE]
> This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for sequence classification or regression.
Here is how to use this model as a backbone to fine-tune for a sequence-level task in PyTorch:
```python
import torch
from multimolecule import ProteinTokenizer, ProteinBertForSequencePrediction
tokenizer = ProteinTokenizer.from_pretrained("multimolecule/proteinbert")
model = ProteinBertForSequencePrediction.from_pretrained("multimolecule/proteinbert")
text = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALER"
input = tokenizer(text, return_tensors="pt")
label = torch.tensor([1])
output = model(**input, labels=label)
```
#### Token Classification / Regression
> [!NOTE]
> This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for token classification or regression.
Here is how to use this model as a backbone to fine-tune for a residue-level task in PyTorch:
```python
import torch
from multimolecule import ProteinTokenizer, ProteinBertForTokenPrediction
tokenizer = ProteinTokenizer.from_pretrained("multimolecule/proteinbert")
model = ProteinBertForTokenPrediction.from_pretrained("multimolecule/proteinbert")
text = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALER"
input = tokenizer(text, return_tensors="pt")
label = torch.randint(2, (1, len(text)))
output = model(**input, labels=label)
```
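The snippet above uses random labels with one entry per residue (`len(text)` entries). For real data, labels usually come from annotated positions. Below is a minimal sketch of that conversion for a binary residue-level task; `residue_labels` is a hypothetical helper, and the annotated positions are made up for illustration.

```python
def residue_labels(sequence, positive_positions):
    """Return one 0/1 label per residue; positions are 0-based indices."""
    out_of_range = sorted(p for p in positive_positions if not 0 <= p < len(sequence))
    if out_of_range:
        raise ValueError(f"positions outside the sequence: {out_of_range}")
    positives = set(positive_positions)
    return [1 if i in positives else 0 for i in range(len(sequence))]


text = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALER"
labels = residue_labels(text, {0, 15})  # hypothetical annotated positions
print(len(labels) == len(text))  # True
```

Wrapping the result as `torch.tensor([labels])` gives a tensor of shape `(1, len(text))`, the same shape as the random `label` in the snippet above.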
## Training Details
### Training Data
ProteinBERT is pre-trained on approximately 106 million protein sequences from UniRef90, together with Gene Ontology annotations.
### Training Procedure
ProteinBERT is trained with a combined objective over masked protein sequence recovery and Gene Ontology annotation prediction.
Please refer to the original paper for details on the training setup.
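As an illustration of what a combined objective of this kind can look like, the sketch below adds a categorical cross-entropy over residues to a binary cross-entropy over annotation terms. This is an assumption-laden toy in plain Python: the unweighted sum, the function names, and the example probabilities are illustrative only, and the exact loss, input corruption scheme, and weighting are defined in the original paper.

```python
import math

EPS = 1e-12  # guards against log(0)


def sequence_loss(token_probs, targets):
    """Categorical cross-entropy summed over residues.

    token_probs[i] is the predicted distribution at position i and
    targets[i] is the index of the true residue at that position.
    """
    return -sum(math.log(max(p[t], EPS)) for p, t in zip(token_probs, targets))


def annotation_loss(annotation_probs, targets):
    """Binary cross-entropy summed over annotation terms (targets are 0 or 1)."""
    total = 0.0
    for p, t in zip(annotation_probs, targets):
        p = min(max(p, EPS), 1.0 - EPS)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total


def combined_loss(token_probs, token_targets, annotation_probs, annotation_targets):
    """Unweighted sum of both terms (an assumption made for this sketch)."""
    return sequence_loss(token_probs, token_targets) + annotation_loss(
        annotation_probs, annotation_targets
    )


# One residue with a two-symbol toy vocabulary and one annotation term.
loss = combined_loss([[0.5, 0.5]], [0], [0.5], [1])
print(round(loss, 4))  # 1.3863
```

Both terms contribute `ln 2` in this toy case, since each prediction assigns probability 0.5 to the correct outcome.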
## Citation
```bibtex
@article{brandes2022proteinbert,
title = {ProteinBERT: a universal deep-learning model of protein sequence and function},
author = {Brandes, Nadav and Ofer, Dan and Peleg, Yam and Rappoport, Nadav and Linial, Michal},
year = {2022},
journal = {Bioinformatics},
volume = {38},
number = {8},
pages = {2102--2110},
doi = {10.1093/bioinformatics/btac020},
url = {https://doi.org/10.1093/bioinformatics/btac020},
}
```
> [!NOTE]
> The artifacts distributed in this repository are part of the MultiMolecule project.
> If MultiMolecule supports your research, please cite the MultiMolecule project as follows:
```bibtex
@software{chen_2024_12638419,
author = {Chen, Zhiyuan and Zhu, Sophia Y.},
title = {MultiMolecule},
doi = {10.5281/zenodo.12638419},
publisher = {Zenodo},
url = {https://doi.org/10.5281/zenodo.12638419},
year = 2024,
month = may,
day = 4
}
```
## Contact
Please use the [MultiMolecule](https://github.com/DLS5-Omics/multimolecule/issues) GitHub issues for any questions or comments on the model card.
Please contact the authors of the [ProteinBERT paper](https://doi.org/10.1093/bioinformatics/btac020) for questions or comments on the paper/model.
## License
This model implementation is licensed under the [GNU Affero General Public License](license.md).
For additional terms and clarifications, please refer to our [License FAQ](license-faq.md).
```spdx
SPDX-License-Identifier: AGPL-3.0-or-later
```