---
datasets:
  - multimolecule/uniref
library_name: multimolecule
license: agpl-3.0
mask_token: "<mask>"
pipeline_tag: fill-mask
tags:
  - Biology
  - Protein
  - protein
widget:
  - example_title: prion protein (Kanno blood group)
    mask_index: 13
    mask_index_1based: 14
    masked_char: A
    output:
      - label: "L"
        score: 0.43529
      - label: "V"
        score: 0.157319
      - label: "I"
        score: 0.115491
      - label: "A"
        score: 0.068044
      - label: "F"
        score: 0.064057
    pipeline_tag: fill-mask
    sequence_type: Protein
    task: fill-mask
    text: MANLGCWMLVLFV<mask>TWSDLGLCKKRPKPGGWNTGGSRYPGQGSPGGNRYPPQGGGGWGQPHGGGWGQPHGGGWGQPHGGGWGQPHGGGWGQGGGTHSQWNKPSKPKTNMKHMAGAAAAGAVVGGLGGYMLGSAMSRPIIHFGSDYEDRYYRENMHRYPNQVYYRPMDEYSNQNNFVHDCVNITIKQHTVTTTTKGENFTETDVKMMERVVEQMCITQYERESQAYYQRGSSMVLFSSPPVILLISFLIFLIVG
  - example_title: interleukin 10
    mask_index: 17
    mask_index_1based: 18
    masked_char: A
    output:
      - label: "A"
        score: 0.153797
      - label: "G"
        score: 0.115835
      - label: "L"
        score: 0.109841
      - label: "S"
        score: 0.089235
      - label: "V"
        score: 0.059057
    pipeline_tag: fill-mask
    sequence_type: Protein
    task: fill-mask
    text: MHSSALLCCLVLLTGVR<mask>SPGQGTQSENSCTHFPGNLPNMLRDLRDAFSRVKTFFQMKDQLDNLLLKESLLEDFKGYLGCQALSEMIQFYLEEVMPQAENQDPDIKAHVNSLGENLKTLRLRLRRCHRFLPCENKSKAVEQVKNAFNKLQEKGIYKAMSEFDIFINYIEAYMTMKIRN
  - example_title: Zaire ebolavirus
    mask_index: 10
    mask_index_1based: 11
    masked_char: A
    output:
      - label: "K"
        score: 0.087737
      - label: "A"
        score: 0.079069
      - label: "R"
        score: 0.074701
      - label: "L"
        score: 0.061874
      - label: "E"
        score: 0.060291
    pipeline_tag: fill-mask
    sequence_type: Protein
    task: fill-mask
    text: NVQTLCEALL<mask>DGLAKAFPSNMMVVTEREQKESLLHQASWHHTSDDFGEHATVRGSSFVTDLEKYNLAFRYEFTAPFIEYCNRCYGVKNVFNWMHYTIPQCY
  - example_title: SARS coronavirus
    mask_index: 26
    mask_index_1based: 27
    masked_char: A
    output:
      - label: "L"
        score: 0.093341
      - label: "N"
        score: 0.072193
      - label: "I"
        score: 0.071112
      - label: "F"
        score: 0.068745
      - label: "S"
        score: 0.052662
    pipeline_tag: fill-mask
    sequence_type: Protein
    task: fill-mask
    text: MFIFLLFLTLTSGSDLDRCTTFDDVQ<mask>PNYTQHTSSMRGVYYPDEIFRSDTLYLTQDLFLPFYSNVTGFHTINHTFDNPVIPFKDGIYFAATEKSNVVRGWVFGSTMNNKSQSVIIINNSTNVVIRACNFELCDNPFFAVSKPMGTQTHTMIFDNAFKCTFEYIS
  - example_title: insulin
    mask_index: 11
    mask_index_1based: 12
    masked_char: A
    output:
      - label: "L"
        score: 0.436841
      - label: "A"
        score: 0.146552
      - label: "P"
        score: 0.083974
      - label: "G"
        score: 0.083079
      - label: "V"
        score: 0.045703
    pipeline_tag: fill-mask
    sequence_type: Protein
    task: fill-mask
    text: MALWMRLLPLL<mask>LLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN
  - example_title: cyclin dependent kinase inhibitor 2A
    mask_index: 12
    mask_index_1based: 13
    masked_char: A
    output:
      - label: "A"
        score: 0.136787
      - label: "P"
        score: 0.130256
      - label: "G"
        score: 0.087531
      - label: "L"
        score: 0.066999
      - label: "D"
        score: 0.06274
    pipeline_tag: fill-mask
    sequence_type: Protein
    task: fill-mask
    text: MEPAAGSSMEPS<mask>DWLATAAARGRVEEVRALLEAGALPNAPNSYGRRPIQVMMMGSARVAELLLLHGAEPNCADPATLTRPVHDAAREGFLDTLVVLHRAGARLDVRDAWGRLPVDLAEELGHRDVARYLRAAAGGTRGSNHARIDAAEGPSDIPD
  - example_title: human papillomavirus type 16 E6
    mask_index: 52
    mask_index_1based: 53
    masked_char: A
    output:
      - label: "L"
        score: 0.102573
      - label: "I"
        score: 0.067941
      - label: "V"
        score: 0.057858
      - label: "H"
        score: 0.057704
      - label: "C"
        score: 0.057218
    pipeline_tag: fill-mask
    sequence_type: Protein
    task: fill-mask
    text: MHQKRTAMFQDPQERPRKLPQLCTELQTTIHDIILECVYCKQQLLRREVYDF<mask>FRDLCIVYRDGNPYAVCDKCLKFYSKISEYRHYCYSVYGTTLEQQYNKPLCDLLIRCINCQKPLCPEEKQRHLDKKQRFHNIRGRWTGRCMSCCRSSRTRRETQL
---

# CARP

Pre-trained convolutional protein language model using a masked language modeling (MLM) objective.

## Disclaimer

This is an UNOFFICIAL implementation of [Convolutions are competitive with transformers for protein sequence pretraining](https://doi.org/10.1016/j.cels.2024.01.008) by Kevin K. Yang, et al.

The OFFICIAL repository of CARP is at [microsoft/protein-sequence-models](https://github.com/microsoft/protein-sequence-models).


> [!TIP]
> The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.

**The team releasing CARP did not write a model card for this model, so this model card was written by the MultiMolecule team.**

## Model Details

CARP is a family of ByteNet-style convolutional protein language models. Each model uses learned token embeddings, a stack of residual dilated 1D convolution blocks, and a final layer normalization before the masked-language-model decoder. The models were pre-trained on the March 2020 release of UniRef50 using the same masked language modeling task as BERT and ESM-1b.

### Variants

- **[multimolecule/carp-600k](https://huggingface.co/multimolecule/carp-600k)**: The CARP model with about 600 thousand parameters.
- **[multimolecule/carp-38m](https://huggingface.co/multimolecule/carp-38m)**: The CARP model with about 38 million parameters.
- **[multimolecule/carp-76m](https://huggingface.co/multimolecule/carp-76m)**: The CARP model with about 76 million parameters.
- **[multimolecule/carp-640m](https://huggingface.co/multimolecule/carp-640m)**: The CARP model with about 640 million parameters.

### Model Specification
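
| Variant   | Num Layers | Hidden Size | Intermediate Size | Num Parameters (M) | FLOPs (G) | MACs (G) | Max Num Tokens |
| --------- | ---------- | ----------- | ----------------- | ------------------ | --------- | -------- | -------------- |
| CARP-600k | 16         | 128         | 64                | 0.61               | 1.25      | 0.61     | 1024           |
| CARP-38M  | 16         | 1024        | 512               | 37.90              | 77.68     | 38.70    | 1024           |
| CARP-76M  | 32         | 1024        | 512               | 75.74              | 155.26    | 77.36    | 1024           |
| CARP-640M | 56         | 1280        | 1280              | 642.96             | 1317.22   | 657.73   | 1024           |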
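
As a quick consistency check on the table above: a multiply-accumulate is conventionally counted as two floating-point operations, so the FLOPs column should be roughly twice the MACs column. The sketch below recomputes that ratio from the table values. The `SPECS` dictionary and the `flops_per_mac` helper are illustrative names and are not part of MultiMolecule.

```python
# FLOPs (G) and MACs (G) copied from the Model Specification table.
SPECS = {
    "CARP-600k": (1.25, 0.61),
    "CARP-38M": (77.68, 38.70),
    "CARP-76M": (155.26, 77.36),
    "CARP-640M": (1317.22, 657.73),
}


def flops_per_mac(flops: float, macs: float) -> float:
    """Return the FLOPs-to-MACs ratio for one variant."""
    return flops / macs


# Compute the ratio for every variant and print it.
ratios = {name: flops_per_mac(*values) for name, values in SPECS.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f} FLOPs per MAC")
```

All four ratios land close to 2, which suggests the two columns were measured with the same counting convention.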

### Links

- **Code**: [multimolecule.carp](https://github.com/DLS5-Omics/multimolecule/tree/master/multimolecule/models/carp)
- **Data**: [UniRef50](https://www.uniprot.org/help/uniref)
- **Paper**: [Convolutions are competitive with transformers for protein sequence pretraining](https://doi.org/10.1016/j.cels.2024.01.008)
- **Developed by**: Kevin K. Yang, Nicolo Fusi, Alex X. Lu
- **Model type**: ByteNet-style convolutional protein masked language model
- **Original Repository**: [microsoft/protein-sequence-models](https://github.com/microsoft/protein-sequence-models)

## Usage

The model file depends on the [`multimolecule`](https://multimolecule.danling.org) library. You can install it using pip:

```bash
pip install multimolecule
```

### Direct Use

#### Masked Language Modeling

You can use this model directly with a pipeline for masked language modeling:

```python
import multimolecule  # you must import multimolecule to register models
from transformers import pipeline

predictor = pipeline("fill-mask", model="multimolecule/carp-76m")
# The input must contain the mask token at the position to predict.
output = predictor("MVLSPADKTNVKAAW<mask>KVGAHAGEYGAEALER")
```

### Downstream Use

#### Extract Features

Here is how to use this model to get the features of a given sequence in PyTorch:

```python
from multimolecule import ProteinTokenizer, CarpModel

tokenizer = ProteinTokenizer.from_pretrained("multimolecule/carp-76m")
model = CarpModel.from_pretrained("multimolecule/carp-76m")

text = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALER"
input = tokenizer(text, return_tensors="pt")

output = model(**input)
```

#### Sequence Classification / Regression

> [!NOTE]
> This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for sequence classification or regression.


Here is how to use this model as a backbone to fine-tune for a sequence-level task in PyTorch:

```python
import torch
from multimolecule import ProteinTokenizer, CarpForSequencePrediction

tokenizer = ProteinTokenizer.from_pretrained("multimolecule/carp-76m")
model = CarpForSequencePrediction.from_pretrained("multimolecule/carp-76m")

text = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALER"
input = tokenizer(text, return_tensors="pt")
label = torch.tensor([1])

output = model(**input, labels=label)
```

#### Token Classification / Regression

> [!NOTE]
> This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for token classification or regression.

Here is how to use this model as a backbone to fine-tune for a residue-level task in PyTorch:

```python
import torch
from multimolecule import ProteinTokenizer, CarpForTokenPrediction

tokenizer = ProteinTokenizer.from_pretrained("multimolecule/carp-76m")
model = CarpForTokenPrediction.from_pretrained("multimolecule/carp-76m")

text = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALER"
input = tokenizer(text, return_tensors="pt")
label = torch.randint(2, (len(text), ))

output = model(**input, labels=label)
```

#### Contact Classification / Regression

> [!NOTE]
> This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for contact classification or regression.

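
A contact-level task expects one label per residue pair, so the label for a sequence of length L is an L × L matrix. In practice such labels are usually derived from a 3D structure. The following is a minimal sketch, assuming hypothetical per-residue coordinates and a commonly used 8 Å distance cutoff. Neither the coordinates, the cutoff, nor the `contact_map` helper comes from the CARP paper or from MultiMolecule.

```python
import math


def contact_map(coords, cutoff=8.0):
    """Build a symmetric binary contact map from per-residue 3D coordinates."""
    n = len(coords)
    contacts = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Mark a pair as in contact when the two residues are closer than the cutoff.
            if math.dist(coords[i], coords[j]) < cutoff:
                contacts[i][j] = contacts[j][i] = 1
    return contacts


# Hypothetical coordinates (in angstroms) for a 4-residue toy peptide.
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
labels = contact_map(toy_coords)
```

The nested list can be converted with `torch.tensor(labels)` before it is passed as `labels` to the model.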

Here is how to use this model as a backbone to fine-tune for a contact-level task in PyTorch:

```python
import torch
from multimolecule import ProteinTokenizer, CarpForContactPrediction

tokenizer = ProteinTokenizer.from_pretrained("multimolecule/carp-76m")
model = CarpForContactPrediction.from_pretrained("multimolecule/carp-76m")

text = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALER"
input = tokenizer(text, return_tensors="pt")
label = torch.randint(2, (len(text), len(text)))

output = model(**input, labels=label)
```

## Training Details

CARP was trained with Masked Language Modeling (MLM) as the pre-training objective. Masked residues are predicted from the surrounding protein sequence using bidirectional dilated convolution blocks rather than self-attention layers.

### Training Data

CARP was pre-trained on the March 2020 release of [UniRef50](https://www.uniprot.org/help/uniref).

### Training Procedure

#### Preprocessing

The released CARP checkpoints use the protein alphabet from the official `sequence_models` package. During conversion, equivalent amino-acid and special-token rows are mapped into the MultiMolecule protein tokenizer vocabulary.

#### Pre-training

The model was trained with masked language modeling over a ByteNet-style residual dilated convolution stack. Please refer to the original paper for details on the training setup.

## Citation

```bibtex
@article{yang2024convolutions,
  author  = {Yang, Kevin K. and Fusi, Nicolo and Lu, Alex X.},
  title   = {Convolutions are competitive with transformers for protein sequence pretraining},
  journal = {Cell Systems},
  volume  = {15},
  number  = {3},
  pages   = {286--294.e2},
  year    = {2024},
  doi     = {10.1016/j.cels.2024.01.008},
  url     = {https://doi.org/10.1016/j.cels.2024.01.008},
}
```

> [!NOTE]
> The artifacts distributed in this repository are part of the MultiMolecule project.
> If MultiMolecule supports your research, please cite the MultiMolecule project as follows:

```bibtex
@software{chen_2024_12638419,
  author    = {Chen, Zhiyuan and Zhu, Sophia Y.},
  title     = {MultiMolecule},
  doi       = {10.5281/zenodo.12638419},
  publisher = {Zenodo},
  url       = {https://doi.org/10.5281/zenodo.12638419},
  year      = 2024,
  month     = may,
  day       = 4
}
```

## Contact

Please use GitHub issues of [MultiMolecule](https://github.com/DLS5-Omics/multimolecule/issues) for any questions or comments on the model card.

Please contact the authors of the [CARP paper](https://doi.org/10.1016/j.cels.2024.01.008) for questions or comments on the paper/model.

## License

This model implementation is licensed under the [GNU Affero General Public License](license.md).

For additional terms and clarifications, please refer to our [License FAQ](license-faq.md).

```spdx
SPDX-License-Identifier: AGPL-3.0-or-later
```