---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- biology
- protein
- antibody
- nanobody
- bert
- masked-lm
pipeline_tag: fill-mask
---

# NanoBodyBERT

NanoBodyBERT is a BERT-based model pre-trained on nanobody sequences for antibody design and analysis tasks.

## Model Description

The model is trained with Masked Language Modeling (MLM) on nanobody sequences, with a special focus on masking strategies for CDRs (Complementarity-Determining Regions).

## Intended Use

This model is designed for:

- Nanobody sequence analysis
- CDR reconstruction
- Sequence embedding generation
- Antibody design applications

## How to Use

### Installation

First, install the required dependencies:

```bash
pip install transformers torch
```

### Loading the Model

```python
import torch
from transformers import BertForMaskedLM

# AATokenizer is a custom tokenizer:
# the tokenizer.py file must be present in your project
from tokenizer import AATokenizer

# Load model and tokenizer
model = BertForMaskedLM.from_pretrained("LLMasterLL/nanobodybert")
tokenizer = AATokenizer.from_pretrained("LLMasterLL/nanobodybert")

# Run on GPU when available, and switch to inference mode
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()
```

### Inference Example

```python
# Example nanobody sequence
sequence = "QVQLVESGGGLVQPGGSLRLSCAASGFTFDDYSIAWFRQAPGKEREGVAAISWGGGSTYYADSVKGRFTISRDNAKNTLYLQMNSLRAEDTAVYYCAKDYWGQGTQVTVSS"

# Encode sequence
input_ids = tokenizer.encode(sequence, add_special_tokens=True)
input_ids = torch.tensor([input_ids], dtype=torch.long).to(device)

# Get embeddings from the underlying BERT encoder
with torch.no_grad():
    outputs = model.bert(input_ids=input_ids, return_dict=True)

embeddings = outputs.last_hidden_state
cls_embedding = embeddings[0, 0, :]  # [CLS] token embedding
print(f"CLS embedding shape: {cls_embedding.shape}")
```

### Masked Prediction Example

```python
# Create a masked sequence (mask CDR3 region for example)
masked_sequence = "QVQLVESGGGLVQPGGSLRLSCAASGFTFDDYSIAWFRQAPGKEREGVAAISWGGGSTYYADSVKGRFTISRDNAKNTLYLQMNSLRAEDTAVYYCAK[MASK][MASK][MASK][MASK]QGTQVTVSS"

# Tokenize
tokens = masked_sequence.replace("[MASK]", tokenizer.mask_token)
input_ids = tokenizer.encode(tokens, add_special_tokens=True)
input_ids = torch.tensor([input_ids], dtype=torch.long).to(device)

# Predict
with torch.no_grad():
    outputs = model(input_ids=input_ids)

# argmax is taken at every position, so the decoded output is the model's
# prediction for the whole sequence, not only for the masked positions
predictions = torch.argmax(outputs.logits, dim=-1)

# Decode
predicted_sequence = tokenizer.decode(predictions[0].cpu().tolist(), skip_special_tokens=True)
print(f"Predicted: {predicted_sequence}")
```

## Model Architecture

- **Architecture:** BERT (Bidirectional Encoder Representations from Transformers)
- **Vocabulary:** 26 tokens (20 amino acids + special tokens)
- **Max sequence length:** 256
- **Special tokens:**
  - `[PAD]`: Padding token
  - `[CLS]`: Classification token (sequence start)
  - `[SEP]`: Separator token (sequence end)
  - `[MASK]`: Mask token for MLM
  - `[UNK]`: Unknown token

## Training Data

The model was pre-trained on a curated dataset of nanobody sequences with strategic CDR masking.

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{nanobodybert,
  title={NanoBodyBERT: BERT-based Pre-trained Model for Nanobody Sequences},
  author={Ling Luo},
  year={2025},
  howpublished={\url{https://huggingface.co/LLMasterLL/nanobodybert}},
}
```

## License

Apache 2.0

## Contact

For questions and feedback, please open an issue on the repository.
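
## Appendix: Building a Masked Sequence

The Masked Prediction Example writes the `[MASK]` tokens into the sequence by hand. As a minimal sketch, the same string can be built programmatically from a residue range. The helper `mask_region` is hypothetical and not part of this repository, and the masked span is located here by searching for the four residues (`DYWG`) that the example replaces. In practice the range would come from your own CDR annotation.

```python
def mask_region(sequence: str, start: int, end: int, mask_token: str = "[MASK]") -> str:
    """Replace each residue in the half-open range [start, end) with one mask token.

    Hypothetical helper for illustration only; not part of the NanoBodyBERT code.
    """
    if not 0 <= start <= end <= len(sequence):
        raise ValueError(f"invalid range [{start}, {end}) for length {len(sequence)}")
    return sequence[:start] + mask_token * (end - start) + sequence[end:]


# Example nanobody sequence from the Inference Example
sequence = "QVQLVESGGGLVQPGGSLRLSCAASGFTFDDYSIAWFRQAPGKEREGVAAISWGGGSTYYADSVKGRFTISRDNAKNTLYLQMNSLRAEDTAVYYCAKDYWGQGTQVTVSS"

# Locate the four residues masked in the Masked Prediction Example.
# Searching for the substring is an assumption made for this sketch;
# a real pipeline would use annotated CDR boundaries instead.
start = sequence.index("DYWG")
masked_sequence = mask_region(sequence, start, start + 4)
print(masked_sequence)
```

The resulting string uses the literal `[MASK]` placeholder, so it can be passed through the same `replace("[MASK]", tokenizer.mask_token)` step as in the example.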