---
datasets:
  - multimolecule/uniref
library_name: multimolecule
license: agpl-3.0
mask_token: <mask>
pipeline_tag: fill-mask
tags:
  - Biology
  - Protein
  - protein
widget:
  - example_title: prion protein (Kanno blood group)
    mask_index: 13
    mask_index_1based: 14
    masked_char: A
    output:
      - label: L
        score: 0.43529
      - label: V
        score: 0.157319
      - label: I
        score: 0.115491
      - label: A
        score: 0.068044
      - label: F
        score: 0.064057
    pipeline_tag: fill-mask
    sequence_type: Protein
    task: fill-mask
    text: MANLGCWMLVLFV<mask>TWSDLGLCKKRPKPGGWNTGGSRYPGQGSPGGNRYPPQGGGGWGQPHGGGWGQPHGGGWGQPHGGGWGQPHGGGWGQGGGTHSQWNKPSKPKTNMKHMAGAAAAGAVVGGLGGYMLGSAMSRPIIHFGSDYEDRYYRENMHRYPNQVYYRPMDEYSNQNNFVHDCVNITIKQHTVTTTTKGENFTETDVKMMERVVEQMCITQYERESQAYYQRGSSMVLFSSPPVILLISFLIFLIVG
  - example_title: interleukin 10
    mask_index: 17
    mask_index_1based: 18
    masked_char: A
    output:
      - label: A
        score: 0.153797
      - label: G
        score: 0.115835
      - label: L
        score: 0.109841
      - label: S
        score: 0.089235
      - label: V
        score: 0.059057
    pipeline_tag: fill-mask
    sequence_type: Protein
    task: fill-mask
    text: MHSSALLCCLVLLTGVR<mask>SPGQGTQSENSCTHFPGNLPNMLRDLRDAFSRVKTFFQMKDQLDNLLLKESLLEDFKGYLGCQALSEMIQFYLEEVMPQAENQDPDIKAHVNSLGENLKTLRLRLRRCHRFLPCENKSKAVEQVKNAFNKLQEKGIYKAMSEFDIFINYIEAYMTMKIRN
  - example_title: Zaire ebolavirus
    mask_index: 10
    mask_index_1based: 11
    masked_char: A
    output:
      - label: K
        score: 0.087737
      - label: A
        score: 0.079069
      - label: R
        score: 0.074701
      - label: L
        score: 0.061874
      - label: E
        score: 0.060291
    pipeline_tag: fill-mask
    sequence_type: Protein
    task: fill-mask
    text: NVQTLCEALL<mask>DGLAKAFPSNMMVVTEREQKESLLHQASWHHTSDDFGEHATVRGSSFVTDLEKYNLAFRYEFTAPFIEYCNRCYGVKNVFNWMHYTIPQCY
  - example_title: SARS coronavirus
    mask_index: 26
    mask_index_1based: 27
    masked_char: A
    output:
      - label: L
        score: 0.093341
      - label: N
        score: 0.072193
      - label: I
        score: 0.071112
      - label: F
        score: 0.068745
      - label: S
        score: 0.052662
    pipeline_tag: fill-mask
    sequence_type: Protein
    task: fill-mask
    text: MFIFLLFLTLTSGSDLDRCTTFDDVQ<mask>PNYTQHTSSMRGVYYPDEIFRSDTLYLTQDLFLPFYSNVTGFHTINHTFDNPVIPFKDGIYFAATEKSNVVRGWVFGSTMNNKSQSVIIINNSTNVVIRACNFELCDNPFFAVSKPMGTQTHTMIFDNAFKCTFEYIS
  - example_title: insulin
    mask_index: 11
    mask_index_1based: 12
    masked_char: A
    output:
      - label: L
        score: 0.436841
      - label: A
        score: 0.146552
      - label: P
        score: 0.083974
      - label: G
        score: 0.083079
      - label: V
        score: 0.045703
    pipeline_tag: fill-mask
    sequence_type: Protein
    task: fill-mask
    text: MALWMRLLPLL<mask>LLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN
  - example_title: cyclin dependent kinase inhibitor 2A
    mask_index: 12
    mask_index_1based: 13
    masked_char: A
    output:
      - label: A
        score: 0.136787
      - label: P
        score: 0.130256
      - label: G
        score: 0.087531
      - label: L
        score: 0.066999
      - label: D
        score: 0.06274
    pipeline_tag: fill-mask
    sequence_type: Protein
    task: fill-mask
    text: MEPAAGSSMEPS<mask>DWLATAAARGRVEEVRALLEAGALPNAPNSYGRRPIQVMMMGSARVAELLLLHGAEPNCADPATLTRPVHDAAREGFLDTLVVLHRAGARLDVRDAWGRLPVDLAEELGHRDVARYLRAAAGGTRGSNHARIDAAEGPSDIPD
  - example_title: human papillomavirus type 16 E6
    mask_index: 52
    mask_index_1based: 53
    masked_char: A
    output:
      - label: L
        score: 0.102573
      - label: I
        score: 0.067941
      - label: V
        score: 0.057858
      - label: H
        score: 0.057704
      - label: C
        score: 0.057218
    pipeline_tag: fill-mask
    sequence_type: Protein
    task: fill-mask
    text: MHQKRTAMFQDPQERPRKLPQLCTELQTTIHDIILECVYCKQQLLRREVYDF<mask>FRDLCIVYRDGNPYAVCDKCLKFYSKISEYRHYCYSVYGTTLEQQYNKPLCDLLIRCINCQKPLCPEEKQRHLDKKQRFHNIRGRWTGRCMSCCRSSRTRRETQL
---
# CARP

Pre-trained convolutional protein language model using a masked language modeling (MLM) objective.

## Disclaimer

This is an UNOFFICIAL implementation of [Convolutions are competitive with transformers for protein sequence pretraining](https://doi.org/10.1016/j.cels.2024.01.008) by Kevin K. Yang, et al.

The OFFICIAL repository of CARP is at [microsoft/protein-sequence-models](https://github.com/microsoft/protein-sequence-models).

> [!TIP]
> The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.

**The team releasing CARP did not write a model card for this model, so this model card was written by the MultiMolecule team.**
## Model Details

CARP is a family of ByteNet-style convolutional protein language models. It uses learned token embeddings, a stack of residual dilated 1D convolution blocks, and a final layer normalization before the masked-language-model decoder. The models were pre-trained on the March 2020 release of UniRef50 using the same masked language modeling task as BERT and ESM-1b.
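
To make "a stack of residual dilated 1D convolution blocks" more concrete, the sketch below computes how many residues a stack of stride-1 dilated convolutions can see. The kernel size of 5 and the dilation schedule that doubles from 1 to 128 and then repeats are assumptions chosen for illustration, as is the simplification of one convolution per block; they are not values stated in this card, so check the model configuration for the real ones.

```python
def dilation_schedule(num_layers: int, max_dilation: int = 128) -> list[int]:
    """Cycle through 1, 2, 4, ..., max_dilation (assumed schedule, for illustration)."""
    cycle = []
    dilation = 1
    while dilation <= max_dilation:
        cycle.append(dilation)
        dilation *= 2
    return [cycle[i % len(cycle)] for i in range(num_layers)]


def receptive_field(dilations: list[int], kernel_size: int = 5) -> int:
    """Receptive field of stacked stride-1 dilated convolutions.

    A layer with dilation d widens the field by (kernel_size - 1) * d positions.
    """
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Layer counts are taken from the Model Specification table of this card;
# kernel size and dilations are the assumptions described above.
for name, layers in [("CARP-600k", 16), ("CARP-38M", 16), ("CARP-76M", 32), ("CARP-640M", 56)]:
    field = receptive_field(dilation_schedule(layers))
    print(f"{name}: {layers} layers -> receptive field of {field} residues")
```

Under these assumed settings, even the 16-layer variants would cover about 1020 residues on each side of a position, which illustrates how a convolutional stack can draw on long-range context without self-attention.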
### Variants

- **[multimolecule/carp-600k](https://huggingface.co/multimolecule/carp-600k)**: The CARP model with about 600 thousand parameters.
- **[multimolecule/carp-38m](https://huggingface.co/multimolecule/carp-38m)**: The CARP model with about 38 million parameters.
- **[multimolecule/carp-76m](https://huggingface.co/multimolecule/carp-76m)**: The CARP model with about 76 million parameters.
- **[multimolecule/carp-640m](https://huggingface.co/multimolecule/carp-640m)**: The CARP model with about 640 million parameters.
### Model Specification

<table>
  <thead>
    <tr>
      <th>Variant</th>
      <th>Num Layers</th>
      <th>Hidden Size</th>
      <th>Intermediate Size</th>
      <th>Num Parameters (M)</th>
      <th>FLOPs (G)</th>
      <th>MACs (G)</th>
      <th>Max Num Tokens</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>CARP-600k</td>
      <td>16</td>
      <td>128</td>
      <td>64</td>
      <td>0.61</td>
      <td>1.25</td>
      <td>0.61</td>
      <td rowspan="4">1024</td>
    </tr>
    <tr>
      <td>CARP-38M</td>
      <td>16</td>
      <td rowspan="2">1024</td>
      <td rowspan="2">512</td>
      <td>37.90</td>
      <td>77.68</td>
      <td>38.70</td>
    </tr>
    <tr>
      <td><b>CARP-76M</b></td>
      <td>32</td>
      <td>75.74</td>
      <td>155.26</td>
      <td>77.36</td>
    </tr>
    <tr>
      <td>CARP-640M</td>
      <td>56</td>
      <td>1280</td>
      <td>1280</td>
      <td>642.96</td>
      <td>1317.22</td>
      <td>657.73</td>
    </tr>
  </tbody>
</table>
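
As a quick sanity check on the table, one multiply-accumulate (MAC) is conventionally counted as two floating-point operations, so the FLOPs column should be roughly twice the MACs column. The snippet below only re-uses numbers from the table; the 2x convention is the one assumption, and `flops_per_mac` is an illustrative helper name.

```python
# (FLOPs in G, MACs in G) for each variant, copied from the table above.
SPECS = {
    "CARP-600k": (1.25, 0.61),
    "CARP-38M": (77.68, 38.70),
    "CARP-76M": (155.26, 77.36),
    "CARP-640M": (1317.22, 657.73),
}


def flops_per_mac(flops: float, macs: float) -> float:
    """Ratio of reported FLOPs to reported MACs."""
    return flops / macs


for name, (flops, macs) in SPECS.items():
    print(f"{name}: {flops_per_mac(flops, macs):.2f} FLOPs per MAC")
```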
### Links

- **Code**: [multimolecule.carp](https://github.com/DLS5-Omics/multimolecule/tree/master/multimolecule/models/carp)
- **Data**: [UniRef50](https://www.uniprot.org/help/uniref)
- **Paper**: [Convolutions are competitive with transformers for protein sequence pretraining](https://doi.org/10.1016/j.cels.2024.01.008)
- **Developed by**: Kevin K. Yang, Nicolo Fusi, Alex X. Lu
- **Model type**: ByteNet-style convolutional protein masked language model
- **Original Repository**: [microsoft/protein-sequence-models](https://github.com/microsoft/protein-sequence-models)
## Usage

The model file depends on the [`multimolecule`](https://multimolecule.danling.org) library. You can install it using pip:

```bash
pip install multimolecule
```
### Direct Use

#### Masked Language Modeling

You can use this model directly with a pipeline for masked language modeling:

```python
import multimolecule  # you must import multimolecule to register models
from transformers import pipeline

# Build a fill-mask pipeline backed by the CARP checkpoint
predictor = pipeline("fill-mask", model="multimolecule/carp-76m")

# The residue to predict is replaced by the <mask> token
output = predictor("MVLSPADKTNVKAAW<mask>KVGAHAGEYGAEALER")
```
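
The pipeline expects a sequence in which one residue is replaced by `<mask>` and, like other `transformers` fill-mask pipelines, returns a list of dictionaries with `score` and `token_str` keys. The helpers below are a minimal sketch of preparing such an input and reading the result; `mask_residue` and `top_residues` are hypothetical names, and the sample predictions are copied from the insulin example in this card's metadata rather than produced by running the model here.

```python
MASK_TOKEN = "<mask>"


def mask_residue(sequence: str, index: int, mask_token: str = MASK_TOKEN) -> str:
    """Replace the residue at a 0-based index with the mask token."""
    if not 0 <= index < len(sequence):
        raise IndexError(f"index {index} is outside a sequence of length {len(sequence)}")
    return sequence[:index] + mask_token + sequence[index + 1:]


def top_residues(predictions: list[dict], k: int = 3) -> list[tuple[str, float]]:
    """Return the k highest-scoring (residue, score) pairs from fill-mask output."""
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [(p["token_str"], p["score"]) for p in ranked[:k]]


# First 24 residues of the insulin example, masked at 0-based position 11.
masked = mask_residue("MALWMRLLPLLALLALWGPDPAAA", 11)
print(masked)  # MALWMRLLPLL<mask>LLALWGPDPAAA

# Shaped like pipeline output; scores copied from the insulin widget entry.
sample = [
    {"token_str": "A", "score": 0.146552},
    {"token_str": "L", "score": 0.436841},
    {"token_str": "P", "score": 0.083974},
]
print(top_residues(sample, k=2))  # [('L', 0.436841), ('A', 0.146552)]
```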
### Downstream Use

#### Extract Features

Here is how to use this model to get the features of a given sequence in PyTorch:

```python
from multimolecule import ProteinTokenizer, CarpModel

# Load the tokenizer and the bare encoder (no task head)
tokenizer = ProteinTokenizer.from_pretrained("multimolecule/carp-76m")
model = CarpModel.from_pretrained("multimolecule/carp-76m")

text = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALER"
inputs = tokenizer(text, return_tensors="pt")  # named `inputs` to avoid shadowing the built-in `input`

output = model(**inputs)
embeddings = output.last_hidden_state  # per-token features
```
#### Sequence Classification / Regression

> [!NOTE]
> This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for sequence classification or regression.

Here is how to use this model as a backbone to fine-tune for a sequence-level task in PyTorch:

```python
import torch
from multimolecule import ProteinTokenizer, CarpForSequencePrediction

tokenizer = ProteinTokenizer.from_pretrained("multimolecule/carp-76m")
model = CarpForSequencePrediction.from_pretrained("multimolecule/carp-76m")

text = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALER"
inputs = tokenizer(text, return_tensors="pt")
label = torch.tensor([1])  # one label for the whole sequence

# Passing labels makes the model compute a loss alongside its predictions
output = model(**inputs, labels=label)
```
#### Token Classification / Regression

> [!NOTE]
> This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for token classification or regression.

Here is how to use this model as a backbone to fine-tune for a residue-level task in PyTorch:

```python
import torch
from multimolecule import ProteinTokenizer, CarpForTokenPrediction

tokenizer = ProteinTokenizer.from_pretrained("multimolecule/carp-76m")
model = CarpForTokenPrediction.from_pretrained("multimolecule/carp-76m")

text = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALER"
inputs = tokenizer(text, return_tensors="pt")
label = torch.randint(2, (len(text), ))  # one random binary label per residue

output = model(**inputs, labels=label)
```
#### Contact Classification / Regression

> [!NOTE]
> This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for contact classification or regression.

Here is how to use this model as a backbone to fine-tune for a contact-level task in PyTorch:

```python
import torch
from multimolecule import ProteinTokenizer, CarpForContactPrediction

tokenizer = ProteinTokenizer.from_pretrained("multimolecule/carp-76m")
model = CarpForContactPrediction.from_pretrained("multimolecule/carp-76m")

text = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALER"
inputs = tokenizer(text, return_tensors="pt")
label = torch.randint(2, (len(text), len(text)))  # one random binary label per residue pair

output = model(**inputs, labels=label)
```
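
In practice, contact labels are usually a symmetric L x L matrix marking residue pairs that are close in 3D space, not random values. As an illustration only, the sketch below builds such a matrix from per-residue coordinates. The 8 Å cutoff is a common convention and, like the toy coordinates and the `contact_map` name, is an assumption here, not something this card prescribes.

```python
import math


def contact_map(coords: list[tuple[float, float, float]], cutoff: float = 8.0) -> list[list[int]]:
    """Return an L x L matrix with 1 where two residues are closer than `cutoff`."""
    size = len(coords)
    return [
        [1 if math.dist(coords[i], coords[j]) < cutoff else 0 for j in range(size)]
        for i in range(size)
    ]


# Toy coordinates (in angstroms): four residues spaced 3.8 apart on a line.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
labels = contact_map(coords)
for row in labels:
    print(row)
```

The nested list has the same (L, L) shape as the random `label` tensor in the example above, so it could be wrapped with `torch.tensor(labels)` before being passed as `labels`.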
## Training Details

CARP was trained with Masked Language Modeling (MLM) as the pre-training objective. Masked residues are predicted from the surrounding protein sequence using bidirectional dilated convolution blocks rather than self-attention layers.
### Training Data

CARP was pre-trained on the March 2020 release of [UniRef50](https://www.uniprot.org/help/uniref).

### Training Procedure

#### Preprocessing

The released CARP checkpoints use the protein alphabet from the official `sequence_models` package. During conversion, equivalent amino-acid and special-token rows are mapped into the MultiMolecule protein tokenizer vocabulary.
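
The row mapping described above can be pictured as copying each token's embedding row from its index in the source alphabet to its index in the target vocabulary. The sketch below uses tiny made-up vocabularies and plain lists; the token spellings, orderings, zero-fill behavior, and the `remap_rows` helper are all hypothetical and are not the actual conversion script.

```python
def remap_rows(source_rows, source_vocab, target_vocab, aliases=None):
    """Reorder embedding rows from source_vocab order into target_vocab order.

    `aliases` maps a target token to its equivalent source token when the two
    vocabularies spell the same token differently. Target tokens without a
    source equivalent receive a zero row.
    """
    aliases = aliases or {}
    source_index = {token: i for i, token in enumerate(source_vocab)}
    width = len(source_rows[0])
    remapped = []
    for token in target_vocab:
        source_token = aliases.get(token, token)
        if source_token in source_index:
            remapped.append(list(source_rows[source_index[source_token]]))
        else:
            remapped.append([0.0] * width)
    return remapped


# Toy example: three residues plus a mask token, two-dimensional embeddings.
source_vocab = ["A", "C", "D", "#"]
source_rows = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
target_vocab = ["<pad>", "<mask>", "D", "C", "A"]

rows = remap_rows(source_rows, source_vocab, target_vocab, aliases={"<mask>": "#"})
for token, row in zip(target_vocab, rows):
    print(token, row)
```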
#### Pre-training

The model was trained with masked language modeling over a ByteNet-style residual dilated convolution stack.

Please refer to the original paper for details on the training setup.
## Citation

```bibtex
@article{yang2024convolutions,
  author  = {Yang, Kevin K. and Fusi, Nicolo and Lu, Alex X.},
  title   = {Convolutions are competitive with transformers for protein sequence pretraining},
  journal = {Cell Systems},
  volume  = {15},
  number  = {3},
  pages   = {286--294.e2},
  year    = {2024},
  doi     = {10.1016/j.cels.2024.01.008},
  url     = {https://doi.org/10.1016/j.cels.2024.01.008},
}
```

> [!NOTE]
> The artifacts distributed in this repository are part of the MultiMolecule project.
> If MultiMolecule supports your research, please cite the MultiMolecule project as follows:

```bibtex
@software{chen_2024_12638419,
  author    = {Chen, Zhiyuan and Zhu, Sophia Y.},
  title     = {MultiMolecule},
  doi       = {10.5281/zenodo.12638419},
  publisher = {Zenodo},
  url       = {https://doi.org/10.5281/zenodo.12638419},
  year      = 2024,
  month     = may,
  day       = 4
}
```
## Contact

Please use the [MultiMolecule GitHub issues](https://github.com/DLS5-Omics/multimolecule/issues) for any questions or comments on the model card.

Please contact the authors of the [CARP paper](https://doi.org/10.1016/j.cels.2024.01.008) for questions or comments on the paper or the model.

## License

This model implementation is licensed under the [GNU Affero General Public License](license.md).

For additional terms and clarifications, please refer to our [License FAQ](license-faq.md).

```spdx
SPDX-License-Identifier: AGPL-3.0-or-later
```