---
datasets:
  - multimolecule/uniref
library_name: multimolecule
license: agpl-3.0
mask_token: <mask>
pipeline_tag: fill-mask
tags:
  - Biology
  - Protein
  - protein
widget:
  - example_title: prion protein (Kanno blood group)
    mask_index: 13
    mask_index_1based: 14
    masked_char: A
    output:
      - label: W
        score: 0.627241
      - label: L
        score: 0.064748
      - label: J
        score: 0.035412
      - label: V
        score: 0.029481
      - label: S
        score: 0.025956
    pipeline_tag: fill-mask
    sequence_type: Protein
    task: fill-mask
    text: MANLGCWMLVLFV<mask>TWSDLGLCKKRPKPGGWNTGGSRYPGQGSPGGNRYPPQGGGGWGQPHGGGWGQPHGGGWGQPHGGGWGQPHGGGWGQGGGTHSQWNKPSKPKTNMKHMAGAAAAGAVVGGLGGYMLGSAMSRPIIHFGSDYEDRYYRENMHRYPNQVYYRPMDEYSNQNNFVHDCVNITIKQHTVTTTTKGENFTETDVKMMERVVEQMCITQYERESQAYYQRGSSMVLFSSPPVILLISFLIFLIVG
  - example_title: interleukin 10
    mask_index: 17
    mask_index_1based: 18
    masked_char: A
    output:
      - label: R
        score: 0.60463
      - label: G
        score: 0.055521
      - label: P
        score: 0.02906
      - label: S
        score: 0.028023
      - label: '?'
        score: 0.022019
    pipeline_tag: fill-mask
    sequence_type: Protein
    task: fill-mask
    text: MHSSALLCCLVLLTGVR<mask>SPGQGTQSENSCTHFPGNLPNMLRDLRDAFSRVKTFFQMKDQLDNLLLKESLLEDFKGYLGCQALSEMIQFYLEEVMPQAENQDPDIKAHVNSLGENLKTLRLRLRRCHRFLPCENKSKAVEQVKNAFNKLQEKGIYKAMSEFDIFINYIEAYMTMKIRN
  - example_title: Zaire ebolavirus
    mask_index: 10
    mask_index_1based: 11
    masked_char: A
    output:
      - label: H
        score: 0.436416
      - label: D
        score: 0.147794
      - label: B
        score: 0.048469
      - label: C
        score: 0.030239
      - label: S
        score: 0.022767
    pipeline_tag: fill-mask
    sequence_type: Protein
    task: fill-mask
    text: NVQTLCEALL<mask>DGLAKAFPSNMMVVTEREQKESLLHQASWHHTSDDFGEHATVRGSSFVTDLEKYNLAFRYEFTAPFIEYCNRCYGVKNVFNWMHYTIPQCY
  - example_title: SARS coronavirus
    mask_index: 26
    mask_index_1based: 27
    masked_char: A
    output:
      - label: D
        score: 0.201616
      - label: B
        score: 0.138675
      - label: N
        score: 0.095383
      - label: F
        score: 0.088915
      - label: I
        score: 0.073027
    pipeline_tag: fill-mask
    sequence_type: Protein
    task: fill-mask
    text: MFIFLLFLTLTSGSDLDRCTTFDDVQ<mask>PNYTQHTSSMRGVYYPDEIFRSDTLYLTQDLFLPFYSNVTGFHTINHTFDNPVIPFKDGIYFAATEKSNVVRGWVFGSTMNNKSQSVIIINNSTNVVIRACNFELCDNPFFAVSKPMGTQTHTMIFDNAFKCTFEYIS
  - example_title: insulin
    mask_index: 11
    mask_index_1based: 12
    masked_char: A
    output:
      - label: L
        score: 0.495459
      - label: C
        score: 0.367089
      - label: P
        score: 0.034614
      - label: A
        score: 0.017155
      - label: J
        score: 0.016473
    pipeline_tag: fill-mask
    sequence_type: Protein
    task: fill-mask
    text: MALWMRLLPLL<mask>LLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN
  - example_title: cyclin dependent kinase inhibitor 2A
    mask_index: 12
    mask_index_1based: 13
    masked_char: A
    output:
      - label: P
        score: 0.372832
      - label: R
        score: 0.110636
      - label: D
        score: 0.09743
      - label: A
        score: 0.090202
      - label: L
        score: 0.072687
    pipeline_tag: fill-mask
    sequence_type: Protein
    task: fill-mask
    text: MEPAAGSSMEPS<mask>DWLATAAARGRVEEVRALLEAGALPNAPNSYGRRPIQVMMMGSARVAELLLLHGAEPNCADPATLTRPVHDAAREGFLDTLVVLHRAGARLDVRDAWGRLPVDLAEELGHRDVARYLRAAAGGTRGSNHARIDAAEGPSDIPD
  - example_title: human papillomavirus type 16 E6
    mask_index: 52
    mask_index_1based: 53
    masked_char: A
    output:
      - label: C
        score: 0.242568
      - label: D
        score: 0.230786
      - label: P
        score: 0.049231
      - label: B
        score: 0.049184
      - label: L
        score: 0.033364
    pipeline_tag: fill-mask
    sequence_type: Protein
    task: fill-mask
    text: MHQKRTAMFQDPQERPRKLPQLCTELQTTIHDIILECVYCKQQLLRREVYDF<mask>FRDLCIVYRDGNPYAVCDKCLKFYSKISEYRHYCYSVYGTTLEQQYNKPLCDLLIRCINCQKPLCPEEKQRHLDKKQRFHNIRGRWTGRCMSCCRSSRTRRETQL
---

# ProteinBERT

Pre-trained model on protein sequences and Gene Ontology annotations using a combined language modeling and annotation prediction objective.

## Disclaimer

This is an UNOFFICIAL implementation of [ProteinBERT: a universal deep-learning model of protein sequence and function](https://doi.org/10.1093/bioinformatics/btac020) by Nadav Brandes et al.

The OFFICIAL repository of ProteinBERT is at [nadavbra/protein_bert](https://github.com/nadavbra/protein_bert).

> [!TIP]
> The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.

**The team that released ProteinBERT did not write this model card; it was written by the MultiMolecule team.**

## Model Details

ProteinBERT is a protein language model with coupled local residue representations and a global protein representation.
It is pre-trained on UniRef90 with a sequence language modeling objective and a Gene Ontology annotation recovery objective.
ProteinBERT uses convolutional local branches and global-attention layers instead of quadratic self-attention, so the architecture has no learned positional table and can be evaluated on variable sequence lengths.
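
To illustrate why attending from a global representation scales linearly with sequence length, here is a minimal pure-Python sketch. It is not the MultiMolecule or the original ProteinBERT code: it assumes a single head and a single query vector derived from the global representation, and the function name `global_attention` and the toy numbers are hypothetical.

```python
import math


def global_attention(global_query, local_keys, local_values):
    """Pool L local positions with one global query (single head, illustrative).

    One score is computed per position, so the cost grows linearly with the
    sequence length instead of quadratically as in full self-attention.
    """
    scale = math.sqrt(len(global_query))
    scores = [
        sum(q * k for q, k in zip(global_query, key)) / scale
        for key in local_keys
    ]
    # Numerically stable softmax over the sequence positions.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The weighted sum of value vectors gives a fixed-size pooled vector.
    width = len(local_values[0])
    pooled = [
        sum(w * value[j] for w, value in zip(weights, local_values))
        for j in range(width)
    ]
    return pooled, weights


# Toy inputs: 3 positions, key size 2, value size 2 (hypothetical numbers).
query = [1.0, 0.0]
keys = [[2.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
pooled, weights = global_attention(query, keys, values)
```

Because nothing in the function depends on a fixed number of positions, the same code accepts sequences of any length, which mirrors the variable-length property described above.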

### Model Specification

| Num Layers | Hidden Size | Global Hidden Size | Num Heads | Num Parameters (M) | FLOPs (G) | MACs (G) | Max Num Tokens |
| ---------- | ----------- | ------------------ | --------- | ------------------ | --------- | -------- | -------------- |
| 6          | 128         | 512                | 4         | 15.98              | 7.16      | 3.54     | 1024           |

### Links

- **Code**: [multimolecule.proteinbert](https://github.com/DLS5-Omics/multimolecule/tree/master/multimolecule/models/proteinbert)
- **Data**: [UniRef90](https://www.uniprot.org/help/uniref)
- **Paper**: [ProteinBERT: a universal deep-learning model of protein sequence and function](https://doi.org/10.1093/bioinformatics/btac020)
- **Developed by**: Nadav Brandes, Dan Ofer, Yam Peleg, Nadav Rappoport, Michal Linial
- **Model type**: Protein language model with local convolutional branches and global-attention layers
- **Original Repository**: [nadavbra/protein_bert](https://github.com/nadavbra/protein_bert)

## Usage

The model file depends on the [`multimolecule`](https://multimolecule.danling.org) library. You can install it using pip:

```bash
pip install multimolecule
```

### Direct Use

#### Masked Language Modeling

You can use this model directly with a pipeline for masked language modeling:

```python
import multimolecule  # you must import multimolecule to register models
from transformers import pipeline

predictor = pipeline("fill-mask", model="multimolecule/proteinbert")
output = predictor("MVLSPADKTNVKAAW<mask>KVGAHAGEYGAEALER")
```
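
The masked input above can also be built programmatically, and the pipeline result can be summarized with a small helper. The sketch below runs without downloading the model: `mask_residue` and `top_predictions` are hypothetical helper names, and `mock_results` is a stand-in that uses the `score` and `token_str` keys of the Transformers fill-mask output format, filled with scores listed for the prion protein widget example in this card's metadata. Whether this tokenizer reports each `token_str` as a bare one-letter residue code is an assumption.

```python
MASK_TOKEN = "<mask>"  # mask token listed in this card's metadata


def mask_residue(sequence, index, mask_token=MASK_TOKEN):
    """Replace the residue at a 0-based index with the mask token."""
    if not 0 <= index < len(sequence):
        raise IndexError(f"index {index} is outside a sequence of length {len(sequence)}")
    return sequence[:index] + mask_token + sequence[index + 1:]


def top_predictions(results, k=3):
    """Return (residue, rounded score) pairs for the k highest-scoring candidates."""
    ranked = sorted(results, key=lambda item: item["score"], reverse=True)
    return [(item["token_str"], round(item["score"], 3)) for item in ranked[:k]]


sequence = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALER"
masked = mask_residue(sequence, 15)  # masks the G that follows "...KAAW"

# Stand-in for `predictor(masked)`; real output also carries `token` and `sequence`.
mock_results = [
    {"token_str": "L", "score": 0.064748},
    {"token_str": "W", "score": 0.627241},
    {"token_str": "J", "score": 0.035412},
]
best = top_predictions(mock_results, k=2)
```

With the real pipeline, `top_predictions(predictor(masked))` would be the corresponding call.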

### Downstream Use

#### Extract Features

Here is how to use this model to get the features of a given sequence in PyTorch:

```python
from multimolecule import ProteinTokenizer, ProteinBertModel

tokenizer = ProteinTokenizer.from_pretrained("multimolecule/proteinbert")
model = ProteinBertModel.from_pretrained("multimolecule/proteinbert")

text = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALER"
inputs = tokenizer(text, return_tensors="pt")  # named `inputs` to avoid shadowing the built-in `input`

output = model(**inputs)
```

#### Sequence Classification / Regression

> [!NOTE]
> This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for sequence classification or regression.

Here is how to use this model as a backbone to fine-tune for a sequence-level task in PyTorch:

```python
import torch
from multimolecule import ProteinTokenizer, ProteinBertForSequencePrediction

tokenizer = ProteinTokenizer.from_pretrained("multimolecule/proteinbert")
model = ProteinBertForSequencePrediction.from_pretrained("multimolecule/proteinbert")

text = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALER"
inputs = tokenizer(text, return_tensors="pt")
label = torch.tensor([1])  # one label for the whole sequence

output = model(**inputs, labels=label)
```

#### Token Classification / Regression

> [!NOTE]
> This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for token classification or regression.

Here is how to use this model as a backbone to fine-tune for a residue-level task in PyTorch:

```python
import torch
from multimolecule import ProteinTokenizer, ProteinBertForTokenPrediction

tokenizer = ProteinTokenizer.from_pretrained("multimolecule/proteinbert")
model = ProteinBertForTokenPrediction.from_pretrained("multimolecule/proteinbert")

text = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALER"
inputs = tokenizer(text, return_tensors="pt")
label = torch.randint(2, (1, len(text)))  # one random binary label per residue

output = model(**inputs, labels=label)
```

## Training Details

### Training Data

ProteinBERT is pre-trained on approximately 106 million protein sequences from UniRef90 and Gene Ontology annotations.

### Training Procedure

ProteinBERT is trained with a combined objective over masked protein sequence recovery and Gene Ontology annotation prediction.

Please refer to the original paper for details on the training setup.
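
As a rough illustration of what a combined objective of this kind can look like, the sketch below adds a per-residue cross-entropy term to a multi-label binary cross-entropy term. This card only states that the two objectives are combined; the specific loss functions, the unweighted sum, the function names, and the toy probabilities are all assumptions of this sketch and not the documented training setup.

```python
import math


def sequence_loss(token_probs, target_ids):
    """Mean cross-entropy over residue positions."""
    losses = [-math.log(probs[target]) for probs, target in zip(token_probs, target_ids)]
    return sum(losses) / len(losses)


def annotation_loss(annotation_probs, annotation_targets):
    """Mean binary cross-entropy over annotation labels (multi-label)."""
    losses = [
        -(t * math.log(p) + (1 - t) * math.log(1 - p))
        for p, t in zip(annotation_probs, annotation_targets)
    ]
    return sum(losses) / len(losses)


def combined_loss(token_probs, target_ids, annotation_probs, annotation_targets):
    """Unweighted sum of both terms (the weighting is an assumption)."""
    return sequence_loss(token_probs, target_ids) + annotation_loss(
        annotation_probs, annotation_targets
    )


# Toy example: 2 residue positions over a 2-letter alphabet, 2 annotations.
total = combined_loss(
    token_probs=[[0.5, 0.5], [0.25, 0.75]],
    target_ids=[0, 1],
    annotation_probs=[0.5, 0.5],
    annotation_targets=[1, 0],
)
```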

## Citation

```bibtex
@article{brandes2022proteinbert,
  title   = {ProteinBERT: a universal deep-learning model of protein sequence and function},
  author  = {Brandes, Nadav and Ofer, Dan and Peleg, Yam and Rappoport, Nadav and Linial, Michal},
  year    = {2022},
  journal = {Bioinformatics},
  volume  = {38},
  number  = {8},
  pages   = {2102--2110},
  doi     = {10.1093/bioinformatics/btac020},
  url     = {https://doi.org/10.1093/bioinformatics/btac020},
}
```

> [!NOTE]
> The artifacts distributed in this repository are part of the MultiMolecule project.
> If MultiMolecule supports your research, please cite the MultiMolecule project as follows:

```bibtex
@software{chen_2024_12638419,
  author    = {Chen, Zhiyuan and Zhu, Sophia Y.},
  title     = {MultiMolecule},
  doi       = {10.5281/zenodo.12638419},
  publisher = {Zenodo},
  url       = {https://doi.org/10.5281/zenodo.12638419},
  year      = 2024,
  month     = may,
  day       = 4
}
```

## Contact

Please use the [MultiMolecule GitHub issues](https://github.com/DLS5-Omics/multimolecule/issues) for any questions or comments on the model card.

Please contact the authors of the [ProteinBERT paper](https://doi.org/10.1093/bioinformatics/btac020) for questions or comments on the paper or the model.

## License

This model implementation is licensed under the [GNU Affero General Public License](license.md).

For additional terms and clarifications, please refer to our [License FAQ](license-faq.md).

```spdx
SPDX-License-Identifier: AGPL-3.0-or-later
```