| datasets: | |
| - multimolecule/oas | |
| library_name: multimolecule | |
| license: agpl-3.0 | |
| mask_token: <mask> | |
| pipeline_tag: fill-mask | |
| tags: | |
| - Biology | |
| - Protein | |
| - Antibody | |
| - protein | |
| widget: | |
| - example_title: prion protein (Kanno blood group) | |
| mask_index: 13 | |
| mask_index_1based: 14 | |
| masked_char: A | |
| output: | |
| - label: S | |
| score: 0.507465 | |
| - label: T | |
| score: 0.107061 | |
| - label: N | |
| score: 0.059567 | |
| - label: G | |
| score: 0.055233 | |
| - label: A | |
| score: 0.027234 | |
| pipeline_tag: fill-mask | |
| sequence_type: Protein | |
| task: fill-mask | |
| text: MANLGCWMLVLFV<mask>TWSDLGLCKKRPKPGGWNTGGSRYPGQGSPGGNRYPPQGGGGWGQPHGGGWGQPHGGGWGQPHGGGWGQPHGGGWGQGGGTHSQWNKPSKPKTNMKHMAGAAAAGAVVGGLGGYMLGSAMSRPIIHFGSDYEDRYYRENMHRYPNQVYYRPMDEYSNQNNFVHDCVNITIKQHTVTTTTKGENFTETDVKMMERVVEQMCITQYERESQAYYQRGSSMVLFSSPPVILLISFLIFLIVG | |
| - example_title: interleukin 10 | |
| mask_index: 17 | |
| mask_index_1based: 18 | |
| masked_char: A | |
| output: | |
| - label: A | |
| score: 0.549848 | |
| - label: V | |
| score: 0.134538 | |
| - label: T | |
| score: 0.034051 | |
| - label: U | |
| score: 0.029686 | |
| - label: '*' | |
| score: 0.023536 | |
| pipeline_tag: fill-mask | |
| sequence_type: Protein | |
| task: fill-mask | |
| text: MHSSALLCCLVLLTGVR<mask>SPGQGTQSENSCTHFPGNLPNMLRDLRDAFSRVKTFFQMKDQLDNLLLKESLLEDFKGYLGCQALSEMIQFYLEEVMPQAENQDPDIKAHVNSLGENLKTLRLRLRRCHRFLPCENKSKAVEQVKNAFNKLQEKGIYKAMSEFDIFINYIEAYMTMKIRN | |
| - example_title: Zaire ebolavirus | |
| mask_index: 10 | |
| mask_index_1based: 11 | |
| masked_char: A | |
| output: | |
| - label: S | |
| score: 0.622724 | |
| - label: T | |
| score: 0.074842 | |
| - label: A | |
| score: 0.047342 | |
| - label: P | |
| score: 0.029094 | |
| - label: N | |
| score: 0.025649 | |
| pipeline_tag: fill-mask | |
| sequence_type: Protein | |
| task: fill-mask | |
| text: NVQTLCEALL<mask>DGLAKAFPSNMMVVTEREQKESLLHQASWHHTSDDFGEHATVRGSSFVTDLEKYNLAFRYEFTAPFIEYCNRCYGVKNVFNWMHYTIPQCY | |
| - example_title: SARS coronavirus | |
| mask_index: 26 | |
| mask_index_1based: 27 | |
| masked_char: A | |
| output: | |
| - label: S | |
| score: 0.215669 | |
| - label: K | |
| score: 0.128916 | |
| - label: D | |
| score: 0.085707 | |
| - label: G | |
| score: 0.072274 | |
| - label: E | |
| score: 0.066373 | |
| pipeline_tag: fill-mask | |
| sequence_type: Protein | |
| task: fill-mask | |
| text: MFIFLLFLTLTSGSDLDRCTTFDDVQ<mask>PNYTQHTSSMRGVYYPDEIFRSDTLYLTQDLFLPFYSNVTGFHTINHTFDNPVIPFKDGIYFAATEKSNVVRGWVFGSTMNNKSQSVIIINNSTNVVIRACNFELCDNPFFAVSKPMGTQTHTMIFDNAFKCTFEYIS | |
| - example_title: insulin | |
| mask_index: 11 | |
| mask_index_1based: 12 | |
| masked_char: A | |
| output: | |
| - label: S | |
| score: 0.289015 | |
| - label: T | |
| score: 0.115097 | |
| - label: N | |
| score: 0.106958 | |
| - label: B | |
| score: 0.057565 | |
| - label: U | |
| score: 0.033646 | |
| pipeline_tag: fill-mask | |
| sequence_type: Protein | |
| task: fill-mask | |
| text: MALWMRLLPLL<mask>LLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN | |
| - example_title: cyclin dependent kinase inhibitor 2A | |
| mask_index: 12 | |
| mask_index_1based: 13 | |
| masked_char: A | |
| output: | |
| - label: L | |
| score: 0.192988 | |
| - label: J | |
| score: 0.162188 | |
| - label: V | |
| score: 0.15939 | |
| - label: I | |
| score: 0.136304 | |
| - label: P | |
| score: 0.120693 | |
| pipeline_tag: fill-mask | |
| sequence_type: Protein | |
| task: fill-mask | |
| text: MEPAAGSSMEPS<mask>DWLATAAARGRVEEVRALLEAGALPNAPNSYGRRPIQVMMMGSARVAELLLLHGAEPNCADPATLTRPVHDAAREGFLDTLVVLHRAGARLDVRDAWGRLPVDLAEELGHRDVARYLRAAAGGTRGSNHARIDAAEGPSDIPD | |
| - example_title: human papillomavirus type 16 E6 | |
| mask_index: 52 | |
| mask_index_1based: 53 | |
| masked_char: A | |
| output: | |
| - label: S | |
| score: 0.224877 | |
| - label: V | |
| score: 0.083718 | |
| - label: T | |
| score: 0.070572 | |
| - label: P | |
| score: 0.058594 | |
| - label: A | |
| score: 0.056656 | |
| pipeline_tag: fill-mask | |
| sequence_type: Protein | |
| task: fill-mask | |
| text: MHQKRTAMFQDPQERPRKLPQLCTELQTTIHDIILECVYCKQQLLRREVYDF<mask>FRDLCIVYRDGNPYAVCDKCLKFYSKISEYRHYCYSVYGTTLEQQYNKPLCDLLIRCINCQKPLCPEEKQRHLDKKQRFHNIRGRWTGRCMSCCRSSRTRRETQL | |
| # AbLang | |
| Pre-trained antibody language model using a masked language modeling (MLM) objective. | |
| ## Disclaimer | |
| This is an UNOFFICIAL implementation of [AbLang: an antibody language model for completing antibody sequences](https://doi.org/10.1093/bioadv/vbac046) by Tobias H. Olsen, et al. | |
| The OFFICIAL repository of AbLang is at [oxpig/AbLang](https://github.com/oxpig/AbLang). | |
| > [!TIP] | |
| > The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation. | |
| **The team releasing AbLang did not write a model card for this model, so this model card has been written by the MultiMolecule team.** | |
| ## Model Details | |
| AbLang v1 is an encoder-only Transformer trained on antibody sequences from the Observed Antibody Space (OAS). The official release provides separate heavy-chain and light-chain checkpoints. Both variants use the same architecture and vocabulary, but they were trained on chain-specific data and are represented as separate MultiMolecule variants. | |
| ### Variants | |
| - **[multimolecule/ablang-heavy](https://huggingface.co/multimolecule/ablang-heavy)**: AbLang v1 trained on heavy-chain antibody sequences. | |
| - **[multimolecule/ablang-light](https://huggingface.co/multimolecule/ablang-light)**: AbLang v1 trained on light-chain antibody sequences. | |
| ### Model Specification | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Variant</th> | |
| <th>Chain Type</th> | |
| <th>Num Layers</th> | |
| <th>Hidden Size</th> | |
| <th>Num Heads</th> | |
| <th>Intermediate Size</th> | |
| <th>Num Parameters (M)</th> | |
| <th>FLOPs (G)</th> | |
| <th>MACs (G)</th> | |
| <th>Max Num Tokens</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>AbLang-Heavy</td> | |
| <td>Heavy</td> | |
| <td rowspan="2">12</td> | |
| <td rowspan="2">768</td> | |
| <td rowspan="2">12</td> | |
| <td rowspan="2">3072</td> | |
| <td rowspan="2">85.83</td> | |
| <td rowspan="2">28.18</td> | |
| <td rowspan="2">14.06</td> | |
| <td rowspan="2">159</td> | |
| </tr> | |
| <tr> | |
| <td><b>AbLang-Light</b></td> | |
| <td>Light</td> | |
| </tr> | |
| </tbody> | |
| </table> | |
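
As a rough sanity check on the parameter count in the table, the sketch below estimates the size of the encoder stack from the listed layer count, hidden size, and intermediate size. It assumes a standard BERT-style layer (four attention projections with biases, a two-layer feed-forward block with biases, and two layer norms); the function name `encoder_layer_params` is hypothetical, and the actual AbLang layer layout has not been checked here.

```python
def encoder_layer_params(hidden_size: int, intermediate_size: int) -> int:
    """Estimate the parameters in one Transformer encoder layer.

    Assumption: a standard BERT-style layer with bias terms everywhere.
    """
    # Query, key, value, and output projections: weight matrix plus bias each.
    attention = 4 * (hidden_size * hidden_size + hidden_size)
    # Feed-forward block: up-projection and down-projection, each with a bias.
    feed_forward = (hidden_size * intermediate_size + intermediate_size) + (
        intermediate_size * hidden_size + hidden_size
    )
    # Two layer norms, each with a scale and a shift vector.
    layer_norms = 2 * (2 * hidden_size)
    return attention + feed_forward + layer_norms


# Values taken from the Model Specification table.
num_layers, hidden_size, intermediate_size = 12, 768, 3072

per_layer = encoder_layer_params(hidden_size, intermediate_size)
encoder_total = num_layers * per_layer

print(f"{per_layer:,} parameters per layer")  # 7,087,872 parameters per layer
print(f"{encoder_total / 1e6:.2f} M parameters in the encoder stack")  # 85.05 M
```

Under these assumptions the encoder stack alone accounts for about 85.05 M of the 85.83 M parameters reported above. The remainder presumably comes from components not counted here, such as the embeddings and any prediction head, but that split is an estimate rather than something read from the checkpoint.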
| ### Links | |
| - **Code**: [multimolecule.ablang](https://github.com/DLS5-Omics/multimolecule/tree/master/multimolecule/models/ablang) | |
| - **Data**: [Observed Antibody Space](https://opig.stats.ox.ac.uk/webapps/oas/) | |
| - **Paper**: [AbLang: an antibody language model for completing antibody sequences](https://doi.org/10.1093/bioadv/vbac046) | |
| - **Developed by**: Tobias H. Olsen, Iain H. Moal, Charlotte M. Deane | |
| - **Model type**: Encoder-only Transformer for antibody masked language modeling | |
| - **Original Repository**: [oxpig/AbLang](https://github.com/oxpig/AbLang) | |
| ## Usage | |
| The model file depends on the [`multimolecule`](https://multimolecule.danling.org) library. You can install it using pip: | |
| ```bash | |
| pip install multimolecule | |
| ``` | |
| ### Direct Use | |
| #### Masked Language Modeling | |
| You can use this model directly with a pipeline for masked language modeling: | |
| ```python | |
| import multimolecule # you must import multimolecule to register models | |
| from transformers import pipeline | |
| predictor = pipeline("fill-mask", model="multimolecule/ablang-light") | |
| output = predictor("EVQLVESGGGLVQPGGSLRLSCAASGFTFSSY<mask>MSWVRQAPGKGLEWVSA") | |
| ``` | |
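
The two helpers below sketch how a masked input can be built and how fill-mask output can be summarized. They are illustrative rather than part of MultiMolecule: `mask_residue` and `top_predictions` are hypothetical names, and the code assumes the pipeline returns the usual Transformers list of dictionaries with `token_str` and `score` keys for a single-mask input. To keep the example runnable without downloading the model, the scores are copied from the first widget example in this card's metadata instead of being produced by a live call.

```python
from typing import Dict, List, Tuple

MASK_TOKEN = "<mask>"  # the mask token declared in this model card's metadata


def mask_residue(sequence: str, index: int, mask_token: str = MASK_TOKEN) -> str:
    """Replace the residue at a 0-based index with the mask token."""
    if not 0 <= index < len(sequence):
        raise IndexError(f"index {index} is outside a sequence of length {len(sequence)}")
    return sequence[:index] + mask_token + sequence[index + 1:]


def top_predictions(results: List[Dict], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k highest-scoring (residue, score) pairs from fill-mask output."""
    ranked = sorted(results, key=lambda item: item["score"], reverse=True)
    return [(item["token_str"], item["score"]) for item in ranked[:k]]


# Mask the alanine at 0-based position 32; this reproduces the pipeline input above.
sequence = "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWVSA"
masked = mask_residue(sequence, 32)

# Stand-in for `predictor(masked)`: labels and scores copied from the first
# widget example in the metadata, reshaped into pipeline-style dictionaries.
widget_results = [
    {"token_str": "S", "score": 0.507465},
    {"token_str": "T", "score": 0.107061},
    {"token_str": "N", "score": 0.059567},
    {"token_str": "G", "score": 0.055233},
    {"token_str": "A", "score": 0.027234},
]

for residue, score in top_predictions(widget_results, k=3):
    print(f"{residue}\t{score:.3f}")
```

With a loaded pipeline, `top_predictions(predictor(masked))` would be the equivalent call, provided the input contains exactly one mask token.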
| ### Downstream Use | |
| #### Extract Features | |
| Here is how to use this model to get the features of a given antibody sequence in PyTorch: | |
| ```python | |
| from multimolecule import AbLangModel, ProteinTokenizer | |
| # load the tokenizer and the pre-trained encoder | |
| tokenizer = ProteinTokenizer.from_pretrained("multimolecule/ablang-light") | |
| model = AbLangModel.from_pretrained("multimolecule/ablang-light") | |
| text = "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWVSA" | |
| # named `inputs` to avoid shadowing Python's built-in `input` | |
| inputs = tokenizer(text, return_tensors="pt") | |
| output = model(**inputs) | |
| ``` | |
| #### Sequence Classification / Regression | |
| > [!NOTE] | |
| > This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for sequence classification or regression. | |
| Here is how to use this model as a backbone to fine-tune for a sequence-level task in PyTorch: | |
| ```python | |
| import torch | |
| from multimolecule import AbLangForSequencePrediction, ProteinTokenizer | |
| tokenizer = ProteinTokenizer.from_pretrained("multimolecule/ablang-light") | |
| model = AbLangForSequencePrediction.from_pretrained("multimolecule/ablang-light") | |
| text = "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWVSA" | |
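| # named `inputs` to avoid shadowing Python's built-in `input` | |
| inputs = tokenizer(text, return_tensors="pt") | |
| # placeholder label for the single input sequence | |
| label = torch.tensor([1]) | |
| output = model(**inputs, labels=label) | |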
| ``` | |
| #### Token Classification / Regression | |
| > [!NOTE] | |
| > This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for token classification or regression. | |
| Here is how to use this model as a backbone to fine-tune for a residue-level task in PyTorch: | |
| ```python | |
| import torch | |
| from multimolecule import AbLangForTokenPrediction, ProteinTokenizer | |
| tokenizer = ProteinTokenizer.from_pretrained("multimolecule/ablang-light") | |
| model = AbLangForTokenPrediction.from_pretrained("multimolecule/ablang-light") | |
| text = "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWVSA" | |
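| # named `inputs` to avoid shadowing Python's built-in `input` | |
| inputs = tokenizer(text, return_tensors="pt") | |
| # placeholder data: one random binary label per residue | |
| label = torch.randint(2, (len(text), )) | |
| output = model(**inputs, labels=label) | |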
| ``` | |
| ## Training Details | |
| AbLang was trained with masked language modeling (MLM) as the pre-training objective. | |
| ### Training Data | |
| AbLang was trained on antibody sequences from the [Observed Antibody Space](https://opig.stats.ox.ac.uk/webapps/oas/). | |
| The heavy-chain model was trained on 14,126,724 sequences, and the light-chain model was trained on 187,068 sequences. | |
| ### Training Procedure | |
| #### Pre-training | |
| The heavy-chain and light-chain checkpoints were trained separately on chain-specific OAS sequences. | |
| Please refer to the original paper for details on the training setup. | |
| ## Citation | |
| ```bibtex | |
| @article{olsen2022ablang, | |
| title = {AbLang: an antibody language model for completing antibody sequences}, | |
| author = {Olsen, Tobias H. and Moal, Iain H. and Deane, Charlotte M.}, | |
| journal = {Bioinformatics Advances}, | |
| volume = {2}, | |
| number = {1}, | |
| pages = {vbac046}, | |
| year = {2022}, | |
| doi = {10.1093/bioadv/vbac046}, | |
| url = {https://doi.org/10.1093/bioadv/vbac046}, | |
| } | |
| ``` | |
| > [!NOTE] | |
| > The artifacts distributed in this repository are part of the MultiMolecule project. | |
| > If MultiMolecule supports your research, please cite the MultiMolecule project as follows: | |
| ```bibtex | |
| @software{chen_2024_12638419, | |
| author = {Chen, Zhiyuan and Zhu, Sophia Y.}, | |
| title = {MultiMolecule}, | |
| doi = {10.5281/zenodo.12638419}, | |
| publisher = {Zenodo}, | |
| url = {https://doi.org/10.5281/zenodo.12638419}, | |
| year = 2024, | |
| month = may, | |
| day = 4 | |
| } | |
| ``` | |
| ## Contact | |
| Please use the GitHub issues of [MultiMolecule](https://github.com/DLS5-Omics/multimolecule/issues) for any questions or comments on the model card. | |
| Please contact the authors of the [AbLang paper](https://doi.org/10.1093/bioadv/vbac046) for questions or comments on the paper/model. | |
| ## License | |
| This model implementation is licensed under the [GNU Affero General Public License](license.md). | |
| For additional terms and clarifications, please refer to our [License FAQ](license-faq.md). | |
| ```spdx | |
| SPDX-License-Identifier: AGPL-3.0-or-later | |
| ``` |