---
datasets:
- multimolecule/uniref
library_name: multimolecule
license: agpl-3.0
mask_token: <mask>
pipeline_tag: fill-mask
tags:
- Biology
- Protein
- protein
widget:
- example_title: prion protein (Kanno blood group)
mask_index: 13
mask_index_1based: 14
masked_char: A
output:
- label: L
score: 0.43529
- label: V
score: 0.157319
- label: I
score: 0.115491
- label: A
score: 0.068044
- label: F
score: 0.064057
pipeline_tag: fill-mask
sequence_type: Protein
task: fill-mask
text: MANLGCWMLVLFV<mask>TWSDLGLCKKRPKPGGWNTGGSRYPGQGSPGGNRYPPQGGGGWGQPHGGGWGQPHGGGWGQPHGGGWGQPHGGGWGQGGGTHSQWNKPSKPKTNMKHMAGAAAAGAVVGGLGGYMLGSAMSRPIIHFGSDYEDRYYRENMHRYPNQVYYRPMDEYSNQNNFVHDCVNITIKQHTVTTTTKGENFTETDVKMMERVVEQMCITQYERESQAYYQRGSSMVLFSSPPVILLISFLIFLIVG
- example_title: interleukin 10
mask_index: 17
mask_index_1based: 18
masked_char: A
output:
- label: A
score: 0.153797
- label: G
score: 0.115835
- label: L
score: 0.109841
- label: S
score: 0.089235
- label: V
score: 0.059057
pipeline_tag: fill-mask
sequence_type: Protein
task: fill-mask
text: MHSSALLCCLVLLTGVR<mask>SPGQGTQSENSCTHFPGNLPNMLRDLRDAFSRVKTFFQMKDQLDNLLLKESLLEDFKGYLGCQALSEMIQFYLEEVMPQAENQDPDIKAHVNSLGENLKTLRLRLRRCHRFLPCENKSKAVEQVKNAFNKLQEKGIYKAMSEFDIFINYIEAYMTMKIRN
- example_title: Zaire ebolavirus
mask_index: 10
mask_index_1based: 11
masked_char: A
output:
- label: K
score: 0.087737
- label: A
score: 0.079069
- label: R
score: 0.074701
- label: L
score: 0.061874
- label: E
score: 0.060291
pipeline_tag: fill-mask
sequence_type: Protein
task: fill-mask
text: NVQTLCEALL<mask>DGLAKAFPSNMMVVTEREQKESLLHQASWHHTSDDFGEHATVRGSSFVTDLEKYNLAFRYEFTAPFIEYCNRCYGVKNVFNWMHYTIPQCY
- example_title: SARS coronavirus
mask_index: 26
mask_index_1based: 27
masked_char: A
output:
- label: L
score: 0.093341
- label: N
score: 0.072193
- label: I
score: 0.071112
- label: F
score: 0.068745
- label: S
score: 0.052662
pipeline_tag: fill-mask
sequence_type: Protein
task: fill-mask
text: MFIFLLFLTLTSGSDLDRCTTFDDVQ<mask>PNYTQHTSSMRGVYYPDEIFRSDTLYLTQDLFLPFYSNVTGFHTINHTFDNPVIPFKDGIYFAATEKSNVVRGWVFGSTMNNKSQSVIIINNSTNVVIRACNFELCDNPFFAVSKPMGTQTHTMIFDNAFKCTFEYIS
- example_title: insulin
mask_index: 11
mask_index_1based: 12
masked_char: A
output:
- label: L
score: 0.436841
- label: A
score: 0.146552
- label: P
score: 0.083974
- label: G
score: 0.083079
- label: V
score: 0.045703
pipeline_tag: fill-mask
sequence_type: Protein
task: fill-mask
text: MALWMRLLPLL<mask>LLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN
- example_title: cyclin dependent kinase inhibitor 2A
mask_index: 12
mask_index_1based: 13
masked_char: A
output:
- label: A
score: 0.136787
- label: P
score: 0.130256
- label: G
score: 0.087531
- label: L
score: 0.066999
- label: D
score: 0.06274
pipeline_tag: fill-mask
sequence_type: Protein
task: fill-mask
text: MEPAAGSSMEPS<mask>DWLATAAARGRVEEVRALLEAGALPNAPNSYGRRPIQVMMMGSARVAELLLLHGAEPNCADPATLTRPVHDAAREGFLDTLVVLHRAGARLDVRDAWGRLPVDLAEELGHRDVARYLRAAAGGTRGSNHARIDAAEGPSDIPD
- example_title: human papillomavirus type 16 E6
mask_index: 52
mask_index_1based: 53
masked_char: A
output:
- label: L
score: 0.102573
- label: I
score: 0.067941
- label: V
score: 0.057858
- label: H
score: 0.057704
- label: C
score: 0.057218
pipeline_tag: fill-mask
sequence_type: Protein
task: fill-mask
text: MHQKRTAMFQDPQERPRKLPQLCTELQTTIHDIILECVYCKQQLLRREVYDF<mask>FRDLCIVYRDGNPYAVCDKCLKFYSKISEYRHYCYSVYGTTLEQQYNKPLCDLLIRCINCQKPLCPEEKQRHLDKKQRFHNIRGRWTGRCMSCCRSSRTRRETQL
---
# CARP
Pre-trained convolutional protein language model using a masked language modeling (MLM) objective.
## Disclaimer
This is an UNOFFICIAL implementation of [Convolutions are competitive with transformers for protein sequence pretraining](https://doi.org/10.1016/j.cels.2024.01.008) by Kevin K. Yang, et al.
The OFFICIAL repository of CARP is at [microsoft/protein-sequence-models](https://github.com/microsoft/protein-sequence-models).
> [!TIP]
> The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.
**The team releasing CARP did not write a model card for this model, so this model card was written by the MultiMolecule team.**
## Model Details
CARP is a family of ByteNet-style convolutional protein language models. It uses learned token embeddings, a stack of residual dilated 1D convolution blocks, and a final layer normalization before the masked-language-model decoder. The models were pre-trained on the March 2020 release of UniRef50 using the same masked language modeling task as BERT and ESM-1b.
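To give a feel for how a dilated convolution stack can cover long-range context without attention, the sketch below estimates the receptive field of such a stack. The kernel size of 5, the dilation schedule (powers of two cycling from 1 to 128), and the choice of one dilated convolution per block are assumptions made for illustration and are not stated in this card; the layer count of 32 is the CARP-76M value from the specification table. `dilation_schedule` and `receptive_field` are hypothetical helper names.

```python
def dilation_schedule(num_layers, max_dilation=128):
    """Cycle through 1, 2, 4, ..., max_dilation (an assumed schedule)."""
    cycle = []
    dilation = 1
    while dilation <= max_dilation:
        cycle.append(dilation)
        dilation *= 2
    return [cycle[i % len(cycle)] for i in range(num_layers)]


def receptive_field(dilations, kernel_size=5):
    """Receptive field of stacked stride-1 dilated convolutions.

    Each layer widens the field by (kernel_size - 1) * dilation positions.
    """
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# 32 layers, as listed for CARP-76M; kernel size and schedule are assumptions
dilations = dilation_schedule(num_layers=32)
print(receptive_field(dilations))
```

Under these assumptions the stack would span 4081 positions, which is wider than the 1024-token maximum input length.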
### Variants
- **[multimolecule/carp-600k](https://huggingface.co/multimolecule/carp-600k)**: The CARP model with about 600 thousand parameters.
- **[multimolecule/carp-38m](https://huggingface.co/multimolecule/carp-38m)**: The CARP model with about 38 million parameters.
- **[multimolecule/carp-76m](https://huggingface.co/multimolecule/carp-76m)**: The CARP model with about 76 million parameters.
- **[multimolecule/carp-640m](https://huggingface.co/multimolecule/carp-640m)**: The CARP model with about 640 million parameters.
### Model Specification
<table>
<thead>
<tr>
<th>Variant</th>
<th>Num Layers</th>
<th>Hidden Size</th>
<th>Intermediate Size</th>
<th>Num Parameters (M)</th>
<th>FLOPs (G)</th>
<th>MACs (G)</th>
<th>Max Num Tokens</th>
</tr>
</thead>
<tbody>
<tr>
<td>CARP-600k</td>
<td>16</td>
<td>128</td>
<td>64</td>
<td>0.61</td>
<td>1.25</td>
<td>0.61</td>
<td rowspan="4">1024</td>
</tr>
<tr>
<td>CARP-38M</td>
<td>16</td>
<td rowspan="2">1024</td>
<td rowspan="2">512</td>
<td>37.90</td>
<td>77.68</td>
<td>38.70</td>
</tr>
<tr>
<td><b>CARP-76M</b></td>
<td>32</td>
<td>75.74</td>
<td>155.26</td>
<td>77.36</td>
</tr>
<tr>
<td>CARP-640M</td>
<td>56</td>
<td>1280</td>
<td>1280</td>
<td>642.96</td>
<td>1317.22</td>
<td>657.73</td>
</tr>
</tbody>
</table>
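As a quick consistency check on the table, the snippet below computes the ratio between the FLOPs and MACs columns. It assumes the common convention that one multiply-accumulate counts as two floating-point operations; whether the table was produced with exactly that convention is not stated here.

```python
# (FLOPs in G, MACs in G) copied from the specification table
specs = {
    "CARP-600k": (1.25, 0.61),
    "CARP-38M": (77.68, 38.70),
    "CARP-76M": (155.26, 77.36),
    "CARP-640M": (1317.22, 657.73),
}

# A ratio close to 2 is what the two-operations-per-MAC convention predicts
ratios = {name: flops / macs for name, (flops, macs) in specs.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f} FLOPs per MAC")
```

Each ratio comes out slightly above 2, which would be consistent with a small share of operations that are not multiply-accumulates.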
### Links
- **Code**: [multimolecule.carp](https://github.com/DLS5-Omics/multimolecule/tree/master/multimolecule/models/carp)
- **Data**: [UniRef50](https://www.uniprot.org/help/uniref)
- **Paper**: [Convolutions are competitive with transformers for protein sequence pretraining](https://doi.org/10.1016/j.cels.2024.01.008)
- **Developed by**: Kevin K. Yang, Nicolo Fusi, Alex X. Lu
- **Model type**: ByteNet-style convolutional protein masked language model
- **Original Repository**: [microsoft/protein-sequence-models](https://github.com/microsoft/protein-sequence-models)
## Usage
The model file depends on the [`multimolecule`](https://multimolecule.danling.org) library. You can install it using pip:
```bash
pip install multimolecule
```
### Direct Use
#### Masked Language Modeling
You can use this model directly with a pipeline for masked language modeling:
```python
import multimolecule # you must import multimolecule to register models
from transformers import pipeline
predictor = pipeline("fill-mask", model="multimolecule/carp-76m")
output = predictor("MVLSPADKTNVKAAW<mask>KVGAHAGEYGAEALER")
```
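The masked input can also be built programmatically. The sketch below assumes a 0-based residue index and the `<mask>` token declared in this card's metadata. `mask_residue` and `top_predictions` are hypothetical helpers, and `top_predictions` assumes the pipeline returns the standard `transformers` fill-mask format (a list of dicts with `token_str` and `score` keys).

```python
MASK_TOKEN = "<mask>"


def mask_residue(sequence, index, mask_token=MASK_TOKEN):
    """Replace the residue at a 0-based index with the mask token."""
    if not 0 <= index < len(sequence):
        raise IndexError(f"index {index} is outside a sequence of length {len(sequence)}")
    return sequence[:index] + mask_token + sequence[index + 1:]


def top_predictions(output, k=3):
    """Return the k highest-scoring (residue, score) pairs."""
    ranked = sorted(output, key=lambda item: item["score"], reverse=True)
    return [(item["token_str"], item["score"]) for item in ranked[:k]]


sequence = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALER"
masked = mask_residue(sequence, 15)  # masks the G at 0-based position 15
print(masked)
```

The resulting string matches the input used above, so `top_predictions(predictor(masked))` would list the most likely residues for the masked position.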
### Downstream Use
#### Extract Features
Here is how to use this model to get the features of a given sequence in PyTorch:
```python
from multimolecule import ProteinTokenizer, CarpModel
tokenizer = ProteinTokenizer.from_pretrained("multimolecule/carp-76m")
model = CarpModel.from_pretrained("multimolecule/carp-76m")
text = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALER"
# named `inputs` to avoid shadowing Python's built-in `input`
inputs = tokenizer(text, return_tensors="pt")
output = model(**inputs)
```
#### Sequence Classification / Regression
> [!NOTE]
> This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for sequence classification or regression.
Here is how to use this model as a backbone to fine-tune for a sequence-level task in PyTorch:
```python
import torch
from multimolecule import ProteinTokenizer, CarpForSequencePrediction
tokenizer = ProteinTokenizer.from_pretrained("multimolecule/carp-76m")
model = CarpForSequencePrediction.from_pretrained("multimolecule/carp-76m")
text = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALER"
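# named `inputs` to avoid shadowing Python's built-in `input`
inputs = tokenizer(text, return_tensors="pt")
label = torch.tensor([1])  # dummy sequence-level label for illustration
output = model(**inputs, labels=label)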
```
#### Token Classification / Regression
> [!NOTE]
> This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for token classification or regression.
Here is how to use this model as a backbone to fine-tune for a residue-level task in PyTorch:
```python
import torch
from multimolecule import ProteinTokenizer, CarpForTokenPrediction
tokenizer = ProteinTokenizer.from_pretrained("multimolecule/carp-76m")
model = CarpForTokenPrediction.from_pretrained("multimolecule/carp-76m")
text = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALER"
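# named `inputs` to avoid shadowing Python's built-in `input`
inputs = tokenizer(text, return_tensors="pt")
label = torch.randint(2, (len(text), ))  # random binary label per residue, for illustration
output = model(**inputs, labels=label)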
```
#### Contact Classification / Regression
> [!NOTE]
> This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for contact classification or regression.
Here is how to use this model as a backbone to fine-tune for a contact-level task in PyTorch:
```python
import torch
from multimolecule import ProteinTokenizer, CarpForContactPrediction
tokenizer = ProteinTokenizer.from_pretrained("multimolecule/carp-76m")
model = CarpForContactPrediction.from_pretrained("multimolecule/carp-76m")
text = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALER"
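# named `inputs` to avoid shadowing Python's built-in `input`
inputs = tokenizer(text, return_tensors="pt")
label = torch.randint(2, (len(text), len(text)))  # random binary label per residue pair, for illustration
output = model(**inputs, labels=label)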
```
## Training Details
CARP was trained with Masked Language Modeling (MLM) as the pre-training objective. Masked residues are predicted from the surrounding protein sequence using bidirectional dilated convolution blocks rather than self-attention layers.
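As an illustration of the objective, the sketch below corrupts a protein sequence in the style of BERT: 15% of positions are selected as prediction targets, and of those 80% are replaced by the mask token, 10% by a random residue, and 10% are left unchanged. These rates are BERT's published recipe and are used here as an assumption; the exact settings used for CARP should be taken from the original paper. `corrupt` is a hypothetical helper, not part of `multimolecule`.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
MASK_TOKEN = "<mask>"


def corrupt(sequence, rng, select_prob=0.15, mask_prob=0.8, random_prob=0.1):
    """Return (corrupted tokens, {position: original residue}) for one sequence."""
    tokens = list(sequence)
    targets = {}
    for i, residue in enumerate(sequence):
        if rng.random() >= select_prob:
            continue  # position is not a prediction target
        targets[i] = residue
        roll = rng.random()
        if roll < mask_prob:
            tokens[i] = MASK_TOKEN
        elif roll < mask_prob + random_prob:
            tokens[i] = rng.choice(AMINO_ACIDS)
        # otherwise the residue is kept but still has to be predicted
    return tokens, targets


rng = random.Random(0)  # seeded so the example is reproducible
sequence = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALER"
tokens, targets = corrupt(sequence, rng)
print("".join(tokens))
print(targets)
```

During training, the loss would be computed only at the positions recorded in `targets`.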
### Training Data
CARP was pre-trained on the March 2020 release of [UniRef50](https://www.uniprot.org/help/uniref).
### Training Procedure
#### Preprocessing
The released CARP checkpoints use the protein alphabet from the official `sequence_models` package. During conversion, equivalent amino-acid and special-token rows are mapped into the MultiMolecule protein tokenizer vocabulary.
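The row mapping described above can be pictured with a toy example. This is a minimal sketch, not the actual conversion script: the vocabularies, the alias table, and the `remap_rows` helper are all hypothetical, and rows for tokens with no counterpart in the source vocabulary are simply zero-filled here.

```python
def remap_rows(weights, source_vocab, target_vocab, aliases=None):
    """Reorder embedding rows from the source vocabulary into target order.

    `aliases` maps a target token to its equivalent source token when the
    two vocabularies spell it differently.
    """
    aliases = aliases or {}
    source_index = {token: i for i, token in enumerate(source_vocab)}
    width = len(weights[0])
    remapped = []
    for token in target_vocab:
        source_token = aliases.get(token, token)
        if source_token in source_index:
            remapped.append(list(weights[source_index[source_token]]))
        else:
            remapped.append([0.0] * width)  # no equivalent row in the source
    return remapped


# Toy vocabularies, invented for illustration only
source_vocab = ["A", "C", "-", "#"]
target_vocab = ["<pad>", "<mask>", "A", "C", "X"]
aliases = {"<pad>": "-", "<mask>": "#"}
weights = [[0.1, 0.2], [0.3, 0.4], [0.0, 0.0], [0.9, 0.9]]

remapped = remap_rows(weights, source_vocab, target_vocab, aliases)
for token, row in zip(target_vocab, remapped):
    print(token, row)
```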
#### Pre-training
The model was trained with masked language modeling over a ByteNet-style residual dilated convolution stack.
Please refer to the original paper for details on the training setup.
## Citation
```bibtex
@article{yang2024convolutions,
author = {Yang, Kevin K. and Fusi, Nicolo and Lu, Alex X.},
title = {Convolutions are competitive with transformers for protein sequence pretraining},
journal = {Cell Systems},
volume = {15},
number = {3},
pages = {286--294.e2},
year = {2024},
doi = {10.1016/j.cels.2024.01.008},
url = {https://doi.org/10.1016/j.cels.2024.01.008},
}
```
> [!NOTE]
> The artifacts distributed in this repository are part of the MultiMolecule project.
> If MultiMolecule supports your research, please cite the MultiMolecule project as follows:
```bibtex
@software{chen_2024_12638419,
author = {Chen, Zhiyuan and Zhu, Sophia Y.},
title = {MultiMolecule},
doi = {10.5281/zenodo.12638419},
publisher = {Zenodo},
url = {https://doi.org/10.5281/zenodo.12638419},
year = 2024,
month = may,
day = 4
}
```
## Contact
Please use the [MultiMolecule GitHub issues](https://github.com/DLS5-Omics/multimolecule/issues) for any questions or comments on the model card.
Please contact the authors of the [CARP paper](https://doi.org/10.1016/j.cels.2024.01.008) for questions or comments on the paper/model.
## License
This model implementation is licensed under the [GNU Affero General Public License](license.md).
For additional terms and clarifications, please refer to our [License FAQ](license-faq.md).
```spdx
SPDX-License-Identifier: AGPL-3.0-or-later
```