---
datasets:
- multimolecule/oas
library_name: multimolecule
license: agpl-3.0
mask_token: <mask>
pipeline_tag: fill-mask
tags:
- Biology
- Protein
- Antibody
- protein
widget:
  - example_title: prion protein (Kanno blood group)
    mask_index: 13
    mask_index_1based: 14
    masked_char: A
    output:
      - label: S
        score: 0.507465
      - label: T
        score: 0.107061
      - label: N
        score: 0.059567
      - label: G
        score: 0.055233
      - label: A
        score: 0.027234
    pipeline_tag: fill-mask
    sequence_type: Protein
    task: fill-mask
    text: MANLGCWMLVLFV<mask>TWSDLGLCKKRPKPGGWNTGGSRYPGQGSPGGNRYPPQGGGGWGQPHGGGWGQPHGGGWGQPHGGGWGQPHGGGWGQGGGTHSQWNKPSKPKTNMKHMAGAAAAGAVVGGLGGYMLGSAMSRPIIHFGSDYEDRYYRENMHRYPNQVYYRPMDEYSNQNNFVHDCVNITIKQHTVTTTTKGENFTETDVKMMERVVEQMCITQYERESQAYYQRGSSMVLFSSPPVILLISFLIFLIVG
  - example_title: interleukin 10
    mask_index: 17
    mask_index_1based: 18
    masked_char: A
    output:
      - label: A
        score: 0.549848
      - label: V
        score: 0.134538
      - label: T
        score: 0.034051
      - label: U
        score: 0.029686
      - label: '*'
        score: 0.023536
    pipeline_tag: fill-mask
    sequence_type: Protein
    task: fill-mask
    text: MHSSALLCCLVLLTGVR<mask>SPGQGTQSENSCTHFPGNLPNMLRDLRDAFSRVKTFFQMKDQLDNLLLKESLLEDFKGYLGCQALSEMIQFYLEEVMPQAENQDPDIKAHVNSLGENLKTLRLRLRRCHRFLPCENKSKAVEQVKNAFNKLQEKGIYKAMSEFDIFINYIEAYMTMKIRN
  - example_title: Zaire ebolavirus
    mask_index: 10
    mask_index_1based: 11
    masked_char: A
    output:
      - label: S
        score: 0.622724
      - label: T
        score: 0.074842
      - label: A
        score: 0.047342
      - label: P
        score: 0.029094
      - label: N
        score: 0.025649
    pipeline_tag: fill-mask
    sequence_type: Protein
    task: fill-mask
    text: NVQTLCEALL<mask>DGLAKAFPSNMMVVTEREQKESLLHQASWHHTSDDFGEHATVRGSSFVTDLEKYNLAFRYEFTAPFIEYCNRCYGVKNVFNWMHYTIPQCY
  - example_title: SARS coronavirus
    mask_index: 26
    mask_index_1based: 27
    masked_char: A
    output:
      - label: S
        score: 0.215669
      - label: K
        score: 0.128916
      - label: D
        score: 0.085707
      - label: G
        score: 0.072274
      - label: E
        score: 0.066373
    pipeline_tag: fill-mask
    sequence_type: Protein
    task: fill-mask
    text: MFIFLLFLTLTSGSDLDRCTTFDDVQ<mask>PNYTQHTSSMRGVYYPDEIFRSDTLYLTQDLFLPFYSNVTGFHTINHTFDNPVIPFKDGIYFAATEKSNVVRGWVFGSTMNNKSQSVIIINNSTNVVIRACNFELCDNPFFAVSKPMGTQTHTMIFDNAFKCTFEYIS
  - example_title: insulin
    mask_index: 11
    mask_index_1based: 12
    masked_char: A
    output:
      - label: S
        score: 0.289015
      - label: T
        score: 0.115097
      - label: N
        score: 0.106958
      - label: B
        score: 0.057565
      - label: U
        score: 0.033646
    pipeline_tag: fill-mask
    sequence_type: Protein
    task: fill-mask
    text: MALWMRLLPLL<mask>LLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN
  - example_title: cyclin dependent kinase inhibitor 2A
    mask_index: 12
    mask_index_1based: 13
    masked_char: A
    output:
      - label: L
        score: 0.192988
      - label: J
        score: 0.162188
      - label: V
        score: 0.15939
      - label: I
        score: 0.136304
      - label: P
        score: 0.120693
    pipeline_tag: fill-mask
    sequence_type: Protein
    task: fill-mask
    text: MEPAAGSSMEPS<mask>DWLATAAARGRVEEVRALLEAGALPNAPNSYGRRPIQVMMMGSARVAELLLLHGAEPNCADPATLTRPVHDAAREGFLDTLVVLHRAGARLDVRDAWGRLPVDLAEELGHRDVARYLRAAAGGTRGSNHARIDAAEGPSDIPD
  - example_title: human papillomavirus type 16 E6
    mask_index: 52
    mask_index_1based: 53
    masked_char: A
    output:
      - label: S
        score: 0.224877
      - label: V
        score: 0.083718
      - label: T
        score: 0.070572
      - label: P
        score: 0.058594
      - label: A
        score: 0.056656
    pipeline_tag: fill-mask
    sequence_type: Protein
    task: fill-mask
    text: MHQKRTAMFQDPQERPRKLPQLCTELQTTIHDIILECVYCKQQLLRREVYDF<mask>FRDLCIVYRDGNPYAVCDKCLKFYSKISEYRHYCYSVYGTTLEQQYNKPLCDLLIRCINCQKPLCPEEKQRHLDKKQRFHNIRGRWTGRCMSCCRSSRTRRETQL
---
# AbLang
Pre-trained antibody language model using a masked language modeling (MLM) objective.
## Disclaimer
This is an UNOFFICIAL implementation of [AbLang: an antibody language model for completing antibody sequences](https://doi.org/10.1093/bioadv/vbac046) by Tobias H. Olsen, et al.
The OFFICIAL repository of AbLang is at [oxpig/AbLang](https://github.com/oxpig/AbLang).
> [!TIP]
> The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.
**The team releasing AbLang did not write a model card for this model, so this model card was written by the MultiMolecule team.**
## Model Details
AbLang v1 is an encoder-only Transformer trained on antibody sequences from the Observed Antibody Space (OAS). The official release provides separate heavy-chain and light-chain checkpoints. Both variants use the same architecture and vocabulary, but they were trained on chain-specific data and are represented as separate MultiMolecule variants.
### Variants
- **[multimolecule/ablang-heavy](https://huggingface.co/multimolecule/ablang-heavy)**: AbLang v1 trained on heavy-chain antibody sequences.
- **[multimolecule/ablang-light](https://huggingface.co/multimolecule/ablang-light)**: AbLang v1 trained on light-chain antibody sequences.
### Model Specification
<table>
<thead>
<tr>
<th>Variant</th>
<th>Chain Type</th>
<th>Num Layers</th>
<th>Hidden Size</th>
<th>Num Heads</th>
<th>Intermediate Size</th>
<th>Num Parameters (M)</th>
<th>FLOPs (G)</th>
<th>MACs (G)</th>
<th>Max Num Tokens</th>
</tr>
</thead>
<tbody>
<tr>
<td>AbLang-Heavy</td>
<td>Heavy</td>
<td rowspan="2">12</td>
<td rowspan="2">768</td>
<td rowspan="2">12</td>
<td rowspan="2">3072</td>
<td rowspan="2">85.83</td>
<td rowspan="2">28.18</td>
<td rowspan="2">14.06</td>
<td rowspan="2">159</td>
</tr>
<tr>
<td><b>AbLang-Light</b></td>
<td>Light</td>
</tr>
</tbody>
</table>
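As a rough sanity check on the table, the sketch below estimates the parameter count of the encoder stack from the listed layer count, hidden size, and intermediate size. It assumes a standard BERT-style layer (biased query/key/value/output projections, a two-layer feed-forward block, and two LayerNorms per layer); this layout is an assumption and is not stated in this model card. The helper name `encoder_layer_params` is hypothetical.

```python
# Rough parameter count for the encoder stack described in the table above.
# Assumption: each layer is a standard BERT-style Transformer encoder layer.

NUM_LAYERS = 12
HIDDEN_SIZE = 768
INTERMEDIATE_SIZE = 3072


def encoder_layer_params(hidden: int, intermediate: int) -> int:
    """Count weights and biases in one assumed BERT-style encoder layer."""
    # Query, key, value, and output projections, each with a bias vector.
    attention = 4 * (hidden * hidden + hidden)
    # Feed-forward block: hidden -> intermediate -> hidden, each with a bias.
    feed_forward = (hidden * intermediate + intermediate) + (intermediate * hidden + hidden)
    # Two LayerNorms, each with a scale and a shift vector.
    layer_norms = 2 * (2 * hidden)
    return attention + feed_forward + layer_norms


per_layer = encoder_layer_params(HIDDEN_SIZE, INTERMEDIATE_SIZE)
encoder_total = NUM_LAYERS * per_layer

print(f"per layer: {per_layer:,}")
print(f"encoder:   {encoder_total:,} (~{encoder_total / 1e6:.2f} M)")
```

Under these assumptions the encoder stack alone accounts for about 85.05 M parameters. The gap to the 85.83 M reported in the table would presumably come from embeddings and any head parameters, which this sketch does not model.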
### Links
- **Code**: [multimolecule.ablang](https://github.com/DLS5-Omics/multimolecule/tree/master/multimolecule/models/ablang)
- **Data**: [Observed Antibody Space](https://opig.stats.ox.ac.uk/webapps/oas/)
- **Paper**: [AbLang: an antibody language model for completing antibody sequences](https://doi.org/10.1093/bioadv/vbac046)
- **Developed by**: Tobias H. Olsen, Iain H. Moal, Charlotte M. Deane
- **Model type**: Encoder-only Transformer for antibody masked language modeling
- **Original Repository**: [oxpig/AbLang](https://github.com/oxpig/AbLang)
## Usage
The model file depends on the [`multimolecule`](https://multimolecule.danling.org) library. You can install it using pip:
```bash
pip install multimolecule
```
### Direct Use
#### Masked Language Modeling
You can use this model directly with a pipeline for masked language modeling:
```python
import multimolecule # you must import multimolecule to register models
from transformers import pipeline
predictor = pipeline("fill-mask", model="multimolecule/ablang-light")
output = predictor("EVQLVESGGGLVQPGGSLRLSCAASGFTFSSY<mask>MSWVRQAPGKGLEWVSA")
```
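The two helpers below are a minimal sketch of the steps around the pipeline call: building a masked input from a sequence and a 0-based position, and summarising the predictions. It assumes the pipeline returns the standard `transformers` fill-mask format (a list of dictionaries with `token_str` and `score` keys). The function names `mask_residue` and `top_predictions` are hypothetical, and the scores in `example_output` are placeholders, not model output.

```python
from typing import Dict, List, Tuple

MASK_TOKEN = "<mask>"  # matches `mask_token` in this model card's metadata


def mask_residue(sequence: str, index: int, mask_token: str = MASK_TOKEN) -> str:
    """Replace the residue at a 0-based index with the mask token."""
    if not 0 <= index < len(sequence):
        raise IndexError(f"index {index} is outside a sequence of length {len(sequence)}")
    return sequence[:index] + mask_token + sequence[index + 1:]


def top_predictions(results: List[Dict], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k highest-scoring (residue, score) pairs.

    Assumes each entry has `token_str` and `score` keys, as in the
    standard transformers fill-mask output.
    """
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    return [(r["token_str"], round(r["score"], 4)) for r in ranked[:k]]


sequence = "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWVSA"
masked = mask_residue(sequence, 32)  # reproduces the masked input used above
print(masked)

# Placeholder values standing in for `predictor(masked)`; not real model scores.
example_output = [
    {"token_str": "A", "score": 0.61},
    {"token_str": "G", "score": 0.12},
    {"token_str": "S", "score": 0.20},
]
print(top_predictions(example_output, k=2))
```

With a loaded pipeline, `top_predictions(predictor(masked))` would replace the placeholder list.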
### Downstream Use
#### Extract Features
Here is how to use this model to get the features of a given antibody sequence in PyTorch:
```python
from multimolecule import AbLangModel, ProteinTokenizer

# Load the tokenizer and the pre-trained encoder
tokenizer = ProteinTokenizer.from_pretrained("multimolecule/ablang-light")
model = AbLangModel.from_pretrained("multimolecule/ablang-light")

text = "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWVSA"
# Tokenize into PyTorch tensors; `inputs` avoids shadowing the built-in `input`
inputs = tokenizer(text, return_tensors="pt")

output = model(**inputs)
```
#### Sequence Classification / Regression
> [!NOTE]
> This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for sequence classification or regression.
Here is how to use this model as a backbone to fine-tune for a sequence-level task in PyTorch:
```python
import torch
from multimolecule import AbLangForSequencePrediction, ProteinTokenizer

# Load the tokenizer and the backbone with a sequence-level prediction head
tokenizer = ProteinTokenizer.from_pretrained("multimolecule/ablang-light")
model = AbLangForSequencePrediction.from_pretrained("multimolecule/ablang-light")

text = "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWVSA"
inputs = tokenizer(text, return_tensors="pt")
# One placeholder label for the whole sequence
label = torch.tensor([1])

output = model(**inputs, labels=label)
```
#### Token Classification / Regression
> [!NOTE]
> This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for token classification or regression.
Here is how to use this model as a backbone to fine-tune for a residue-level task in PyTorch:
```python
import torch
from multimolecule import AbLangForTokenPrediction, ProteinTokenizer

# Load the tokenizer and the backbone with a residue-level prediction head
tokenizer = ProteinTokenizer.from_pretrained("multimolecule/ablang-light")
model = AbLangForTokenPrediction.from_pretrained("multimolecule/ablang-light")

text = "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWVSA"
inputs = tokenizer(text, return_tensors="pt")
# One random placeholder label (0 or 1) per residue
label = torch.randint(2, (len(text), ))

output = model(**inputs, labels=label)
```
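The example above uses random placeholder labels. In practice, residue-level labels often come from annotated positions. The sketch below builds a binary label list with one entry per residue from half-open `[start, end)` spans. The helper `span_labels` and the span `(25, 33)` are hypothetical and only illustrate the shape of the labels, not a real annotation.

```python
from typing import Iterable, List, Tuple


def span_labels(length: int, spans: Iterable[Tuple[int, int]]) -> List[int]:
    """Binary per-residue labels: 1 inside any half-open [start, end) span, else 0."""
    labels = [0] * length
    for start, end in spans:
        if not 0 <= start <= end <= length:
            raise ValueError(f"span ({start}, {end}) does not fit a sequence of length {length}")
        for i in range(start, end):
            labels[i] = 1
    return labels


text = "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWVSA"
# Hypothetical annotated span, chosen only for illustration.
labels = span_labels(len(text), [(25, 33)])

print(len(labels), sum(labels))
```

Assuming the prediction head expects one label per residue, as the `len(text)` in the example above suggests, `torch.tensor(labels)` could then take the place of the random labels.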
## Training Details
AbLang was trained with masked language modeling (MLM) as the pre-training objective.
### Training Data
AbLang was trained on antibody sequences from the [Observed Antibody Space](https://opig.stats.ox.ac.uk/webapps/oas/).
The heavy-chain model was trained on 14,126,724 sequences, and the light-chain model was trained on 187,068 sequences.
### Training Procedure
#### Pre-training
The heavy-chain and light-chain checkpoints were trained separately on chain-specific OAS sequences.
Please refer to the original paper for details on the training setup.
## Citation
```bibtex
@article{olsen2022ablang,
title = {AbLang: an antibody language model for completing antibody sequences},
author = {Olsen, Tobias H. and Moal, Iain H. and Deane, Charlotte M.},
journal = {Bioinformatics Advances},
volume = {2},
number = {1},
pages = {vbac046},
year = {2022},
doi = {10.1093/bioadv/vbac046},
url = {https://doi.org/10.1093/bioadv/vbac046},
}
```
> [!NOTE]
> The artifacts distributed in this repository are part of the MultiMolecule project.
> If MultiMolecule supports your research, please cite the MultiMolecule project as follows:
```bibtex
@software{chen_2024_12638419,
author = {Chen, Zhiyuan and Zhu, Sophia Y.},
title = {MultiMolecule},
doi = {10.5281/zenodo.12638419},
publisher = {Zenodo},
url = {https://doi.org/10.5281/zenodo.12638419},
year = 2024,
month = may,
day = 4
}
```
## Contact
Please use the GitHub issues of [MultiMolecule](https://github.com/DLS5-Omics/multimolecule/issues) for any questions or comments on the model card.
Please contact the authors of the [AbLang paper](https://doi.org/10.1093/bioadv/vbac046) for questions or comments on the paper/model.
## License
This model implementation is licensed under the [GNU Affero General Public License](license.md).
For additional terms and clarifications, please refer to our [License FAQ](license-faq.md).
```spdx
SPDX-License-Identifier: AGPL-3.0-or-later
```