---
datasets:
  - multimolecule/oas
library_name: multimolecule
license: agpl-3.0
mask_token: "<mask>"
pipeline_tag: fill-mask
tags:
  - Biology
  - Protein
  - Antibody
  - protein
widget:
  - example_title: prion protein (Kanno blood group)
    mask_index: 13
    mask_index_1based: 14
    masked_char: A
    output:
      - label: S
        score: 0.507465
      - label: T
        score: 0.107061
      - label: N
        score: 0.059567
      - label: G
        score: 0.055233
      - label: A
        score: 0.027234
    pipeline_tag: fill-mask
    sequence_type: Protein
    task: fill-mask
    text: MANLGCWMLVLFV<mask>TWSDLGLCKKRPKPGGWNTGGSRYPGQGSPGGNRYPPQGGGGWGQPHGGGWGQPHGGGWGQPHGGGWGQPHGGGWGQGGGTHSQWNKPSKPKTNMKHMAGAAAAGAVVGGLGGYMLGSAMSRPIIHFGSDYEDRYYRENMHRYPNQVYYRPMDEYSNQNNFVHDCVNITIKQHTVTTTTKGENFTETDVKMMERVVEQMCITQYERESQAYYQRGSSMVLFSSPPVILLISFLIFLIVG
  - example_title: interleukin 10
    mask_index: 17
    mask_index_1based: 18
    masked_char: A
    output:
      - label: A
        score: 0.549848
      - label: V
        score: 0.134538
      - label: T
        score: 0.034051
      - label: U
        score: 0.029686
      - label: '*'
        score: 0.023536
    pipeline_tag: fill-mask
    sequence_type: Protein
    task: fill-mask
    text: MHSSALLCCLVLLTGVR<mask>SPGQGTQSENSCTHFPGNLPNMLRDLRDAFSRVKTFFQMKDQLDNLLLKESLLEDFKGYLGCQALSEMIQFYLEEVMPQAENQDPDIKAHVNSLGENLKTLRLRLRRCHRFLPCENKSKAVEQVKNAFNKLQEKGIYKAMSEFDIFINYIEAYMTMKIRN
  - example_title: Zaire ebolavirus
    mask_index: 10
    mask_index_1based: 11
    masked_char: A
    output:
      - label: S
        score: 0.622724
      - label: T
        score: 0.074842
      - label: A
        score: 0.047342
      - label: P
        score: 0.029094
      - label: N
        score: 0.025649
    pipeline_tag: fill-mask
    sequence_type: Protein
    task: fill-mask
    text: NVQTLCEALL<mask>DGLAKAFPSNMMVVTEREQKESLLHQASWHHTSDDFGEHATVRGSSFVTDLEKYNLAFRYEFTAPFIEYCNRCYGVKNVFNWMHYTIPQCY
  - example_title: SARS coronavirus
    mask_index: 26
    mask_index_1based: 27
    masked_char: A
    output:
      - label: S
        score: 0.215669
      - label: K
        score: 0.128916
      - label: D
        score: 0.085707
      - label: G
        score: 0.072274
      - label: E
        score: 0.066373
    pipeline_tag: fill-mask
    sequence_type: Protein
    task: fill-mask
    text: MFIFLLFLTLTSGSDLDRCTTFDDVQ<mask>PNYTQHTSSMRGVYYPDEIFRSDTLYLTQDLFLPFYSNVTGFHTINHTFDNPVIPFKDGIYFAATEKSNVVRGWVFGSTMNNKSQSVIIINNSTNVVIRACNFELCDNPFFAVSKPMGTQTHTMIFDNAFKCTFEYIS
  - example_title: insulin
    mask_index: 11
    mask_index_1based: 12
    masked_char: A
    output:
      - label: S
        score: 0.289015
      - label: T
        score: 0.115097
      - label: N
        score: 0.106958
      - label: B
        score: 0.057565
      - label: U
        score: 0.033646
    pipeline_tag: fill-mask
    sequence_type: Protein
    task: fill-mask
    text: MALWMRLLPLL<mask>LLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN
  - example_title: cyclin dependent kinase inhibitor 2A
    mask_index: 12
    mask_index_1based: 13
    masked_char: A
    output:
      - label: L
        score: 0.192988
      - label: J
        score: 0.162188
      - label: V
        score: 0.15939
      - label: I
        score: 0.136304
      - label: P
        score: 0.120693
    pipeline_tag: fill-mask
    sequence_type: Protein
    task: fill-mask
    text: MEPAAGSSMEPS<mask>DWLATAAARGRVEEVRALLEAGALPNAPNSYGRRPIQVMMMGSARVAELLLLHGAEPNCADPATLTRPVHDAAREGFLDTLVVLHRAGARLDVRDAWGRLPVDLAEELGHRDVARYLRAAAGGTRGSNHARIDAAEGPSDIPD
  - example_title: human papillomavirus type 16 E6
    mask_index: 52
    mask_index_1based: 53
    masked_char: A
    output:
      - label: S
        score: 0.224877
      - label: V
        score: 0.083718
      - label: T
        score: 0.070572
      - label: P
        score: 0.058594
      - label: A
        score: 0.056656
    pipeline_tag: fill-mask
    sequence_type: Protein
    task: fill-mask
    text: MHQKRTAMFQDPQERPRKLPQLCTELQTTIHDIILECVYCKQQLLRREVYDF<mask>FRDLCIVYRDGNPYAVCDKCLKFYSKISEYRHYCYSVYGTTLEQQYNKPLCDLLIRCINCQKPLCPEEKQRHLDKKQRFHNIRGRWTGRCMSCCRSSRTRRETQL
---

# AbLang

Pre-trained antibody language model using a masked language modeling (MLM) objective.

## Disclaimer

This is an UNOFFICIAL implementation of [AbLang: an antibody language model for completing antibody sequences](https://doi.org/10.1093/bioadv/vbac046) by Tobias H. Olsen, et al.

The OFFICIAL repository of AbLang is at [oxpig/AbLang](https://github.com/oxpig/AbLang).

> [!TIP]
> The MultiMolecule team has confirmed that the provided model and checkpoints are producing the same intermediate representations as the original implementation.

**The team releasing AbLang did not write this model card for this model so this model card has been written by the MultiMolecule team.**

## Model Details

AbLang v1 is an encoder-only Transformer trained on antibody sequences from the Observed Antibody Space (OAS). The official release provides separate heavy-chain and light-chain checkpoints. Both variants use the same architecture and vocabulary, but they were trained on chain-specific data and are represented as separate MultiMolecule variants.

### Variants

- **[multimolecule/ablang-heavy](https://huggingface.co/multimolecule/ablang-heavy)**: AbLang v1 trained on heavy-chain antibody sequences.
- **[multimolecule/ablang-light](https://huggingface.co/multimolecule/ablang-light)**: AbLang v1 trained on light-chain antibody sequences.

### Model Specification
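
| Variant      | Chain Type | Num Layers | Hidden Size | Num Heads | Intermediate Size | Num Parameters (M) | FLOPs (G) | MACs (G) | Max Num Tokens |
| ------------ | ---------- | ---------- | ----------- | --------- | ----------------- | ------------------ | --------- | -------- | -------------- |
| AbLang-Heavy | Heavy      | 12         | 768         | 12        | 3072              | 85.83              | 28.18     | 14.06    | 159            |
| AbLang-Light | Light      | 12         | 768         | 12        | 3072              | 85.83              | 28.18     | 14.06    | 159            |
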
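
The parameter count in the specification table can be sanity-checked with a back-of-the-envelope calculation. The sketch below is not taken from the AbLang or MultiMolecule code. It assumes a standard BERT-style encoder layer (four attention projections with biases, a two-layer feed-forward block with biases, and two LayerNorms), and the helper name `encoder_layer_params` is hypothetical. Embeddings and prediction heads are deliberately left out.

```python
def encoder_layer_params(hidden_size: int, intermediate_size: int) -> int:
    """Count weights and biases in one encoder layer (assumed BERT-style layout)."""
    # Self-attention: query, key, value and output projections, each with a bias.
    attention = 4 * (hidden_size * hidden_size + hidden_size)
    # Feed-forward block: hidden -> intermediate -> hidden, each with a bias.
    feed_forward = (hidden_size * intermediate_size + intermediate_size) + (
        intermediate_size * hidden_size + hidden_size
    )
    # Two LayerNorms, each with a scale vector and a shift vector.
    layer_norms = 2 * (2 * hidden_size)
    return attention + feed_forward + layer_norms


# Values read from the specification table.
num_layers, hidden_size, intermediate_size = 12, 768, 3072

per_layer = encoder_layer_params(hidden_size, intermediate_size)
encoder_total = num_layers * per_layer

print(f"per layer:     {per_layer:,}")
print(f"encoder stack: {encoder_total:,} ({encoder_total / 1e6:.2f} M)")
```

Under these assumptions the encoder stack alone accounts for roughly 85.05 M of the 85.83 M parameters reported. The remainder would come from components this sketch does not model, such as embeddings and any prediction head.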
### Links

- **Code**: [multimolecule.ablang](https://github.com/DLS5-Omics/multimolecule/tree/master/multimolecule/models/ablang)
- **Data**: [Observed Antibody Space](https://opig.stats.ox.ac.uk/webapps/oas/)
- **Paper**: [AbLang: an antibody language model for completing antibody sequences](https://doi.org/10.1093/bioadv/vbac046)
- **Developed by**: Tobias H. Olsen, Iain H. Moal, Charlotte M. Deane
- **Model type**: Encoder-only Transformer for antibody masked language modeling
- **Original Repository**: [oxpig/AbLang](https://github.com/oxpig/AbLang)

## Usage

The model file depends on the [`multimolecule`](https://multimolecule.danling.org) library. You can install it using pip:

```bash
pip install multimolecule
```

### Direct Use

#### Masked Language Modeling

You can use this model directly with a pipeline for masked language modeling:

```python
import multimolecule  # you must import multimolecule to register models
from transformers import pipeline

predictor = pipeline("fill-mask", model="multimolecule/ablang-light")
# The input must contain the mask token at the position to be predicted.
output = predictor("EVQLVESGGGLVQPGGSLRLSCAASGFTFSSY<mask>MSWVRQAPGKGLEWVSA")
```

### Downstream Use

#### Extract Features

Here is how to use this model to get the features of a given antibody sequence in PyTorch:

```python
from multimolecule import AbLangModel, ProteinTokenizer

tokenizer = ProteinTokenizer.from_pretrained("multimolecule/ablang-light")
model = AbLangModel.from_pretrained("multimolecule/ablang-light")

text = "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWVSA"
input = tokenizer(text, return_tensors="pt")

output = model(**input)
```

#### Sequence Classification / Regression

> [!NOTE]
> This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for sequence classification or regression.

Here is how to use this model as a backbone to fine-tune for a sequence-level task in PyTorch:

```python
import torch
from multimolecule import AbLangForSequencePrediction, ProteinTokenizer

tokenizer = ProteinTokenizer.from_pretrained("multimolecule/ablang-light")
model = AbLangForSequencePrediction.from_pretrained("multimolecule/ablang-light")

text = "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWVSA"
input = tokenizer(text, return_tensors="pt")
# One placeholder label for the whole sequence.
label = torch.tensor([1])

output = model(**input, labels=label)
```

#### Token Classification / Regression

> [!NOTE]
> This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for token classification or regression.

Here is how to use this model as a backbone to fine-tune for a residue-level task in PyTorch:

```python
import torch
from multimolecule import AbLangForTokenPrediction, ProteinTokenizer

tokenizer = ProteinTokenizer.from_pretrained("multimolecule/ablang-light")
model = AbLangForTokenPrediction.from_pretrained("multimolecule/ablang-light")

text = "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWVSA"
input = tokenizer(text, return_tensors="pt")
# One random placeholder label (0 or 1) per residue.
label = torch.randint(2, (len(text), ))

output = model(**input, labels=label)
```

## Training Details

AbLang was trained with masked language modeling (MLM) as the pre-training objective.

### Training Data

AbLang was trained on antibody sequences from the [Observed Antibody Space](https://opig.stats.ox.ac.uk/webapps/oas/). The heavy-chain model was trained on 14,126,724 sequences, and the light-chain model was trained on 187,068 sequences.

### Training Procedure

#### Pre-training

The heavy-chain and light-chain checkpoints were trained separately on chain-specific OAS sequences. Please refer to the original paper for details on the training setup.

## Citation

```bibtex
@article{olsen2022ablang,
  title   = {AbLang: an antibody language model for completing antibody sequences},
  author  = {Olsen, Tobias H. and Moal, Iain H. and Deane, Charlotte M.},
  journal = {Bioinformatics Advances},
  volume  = {2},
  number  = {1},
  pages   = {vbac046},
  year    = {2022},
  doi     = {10.1093/bioadv/vbac046},
  url     = {https://doi.org/10.1093/bioadv/vbac046},
}
```

> [!NOTE]
> The artifacts distributed in this repository are part of the MultiMolecule project.
> If MultiMolecule supports your research, please cite the MultiMolecule project as follows:

```bibtex
@software{chen_2024_12638419,
  author    = {Chen, Zhiyuan and Zhu, Sophia Y.},
  title     = {MultiMolecule},
  doi       = {10.5281/zenodo.12638419},
  publisher = {Zenodo},
  url       = {https://doi.org/10.5281/zenodo.12638419},
  year      = 2024,
  month     = may,
  day       = 4
}
```

## Contact

Please use the GitHub issues of [MultiMolecule](https://github.com/DLS5-Omics/multimolecule/issues) for any questions or comments on the model card.

Please contact the authors of the [AbLang paper](https://doi.org/10.1093/bioadv/vbac046) for questions or comments on the paper/model.

## License

This model implementation is licensed under the [GNU Affero General Public License](license.md).

For additional terms and clarifications, please refer to our [License FAQ](license-faq.md).

```spdx
SPDX-License-Identifier: AGPL-3.0-or-later
```