---
language:
  - en
license: apache-2.0
library_name: transformers
tags:
  - biology
  - protein
  - antibody
  - nanobody
  - bert
  - masked-lm
pipeline_tag: fill-mask
---

# NanoBodyBERT

NanoBodyBERT is a BERT-based model specifically pre-trained on nanobody sequences for antibody design and analysis tasks.

## Model Description

This model is trained with Masked Language Modeling (MLM) on nanobody sequences, with a particular focus on CDR (Complementarity-Determining Region) masking strategies; a sketch of this masking idea is shown below.
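
The exact masking schedule is not published with this card. As a rough illustration only, CDR-biased masking could look like the following sketch, where the `cdr_spans` argument and both masking rates are hypothetical placeholders rather than the model's actual training configuration:

```python
import random

def cdr_biased_mask(sequence, cdr_spans, mask_token="[MASK]", p_cdr=0.5, p_framework=0.05):
    """Mask residues with a higher probability inside CDR spans.

    cdr_spans and the two rates are illustrative placeholders; the actual
    spans and masking schedule used in training are not published here.
    """
    in_cdr = set()
    for start, end in cdr_spans:
        in_cdr.update(range(start, end))
    out = []
    for i, aa in enumerate(sequence):
        p = p_cdr if i in in_cdr else p_framework
        out.append(mask_token if random.random() < p else aa)
    return "".join(out)

# Toy example: bias masking toward a hypothetical CDR span
print(cdr_biased_mask("QVQLVESGGGLVQPGGSLRLSCAAS", [(10, 18)]))
```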

## Intended Use

This model is designed for:

- Nanobody sequence analysis
- CDR region reconstruction
- Sequence embedding generation
- Antibody design applications

## How to Use

### Installation

First, install the required dependencies:

```bash
pip install transformers torch
```

### Loading the Model

```python
import torch
from transformers import BertForMaskedLM

# Load the custom amino-acid tokenizer (AATokenizer).
# The tokenizer.py file from this repository must be on your Python path.
from tokenizer import AATokenizer

# Load model and tokenizer
model = BertForMaskedLM.from_pretrained("LLMasterLL/nanobodybert")
tokenizer = AATokenizer.from_pretrained("LLMasterLL/nanobodybert")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()
```

### Inference Example

```python
# Example nanobody sequence
sequence = "QVQLVESGGGLVQPGGSLRLSCAASGFTFDDYSIAWFRQAPGKEREGVAAISWGGGSTYYADSVKGRFTISRDNAKNTLYLQMNSLRAEDTAVYYCAKDYWGQGTQVTVSS"

# Encode sequence
input_ids = tokenizer.encode(sequence, add_special_tokens=True)
input_ids = torch.tensor([input_ids], dtype=torch.long).to(device)

# Get per-residue embeddings from the underlying BERT encoder
with torch.no_grad():
    outputs = model.bert(input_ids=input_ids, return_dict=True)
    embeddings = outputs.last_hidden_state
    cls_embedding = embeddings[0, 0, :]  # [CLS] token embedding

print(f"CLS embedding shape: {cls_embedding.shape}")
```

### Masked Prediction Example

```python
# Create a masked sequence (here part of the CDR3 region is masked)
masked_sequence = "QVQLVESGGGLVQPGGSLRLSCAASGFTFDDYSIAWFRQAPGKEREGVAAISWGGGSTYYADSVKGRFTISRDNAKNTLYLQMNSLRAEDTAVYYCAK[MASK][MASK][MASK][MASK]QGTQVTVSS"

# Normalize the [MASK] literal to the tokenizer's mask token
# (a no-op when they are identical), then tokenize
tokens = masked_sequence.replace("[MASK]", tokenizer.mask_token)
input_ids = tokenizer.encode(tokens, add_special_tokens=True)
input_ids = torch.tensor([input_ids], dtype=torch.long).to(device)

# Predict
with torch.no_grad():
    outputs = model(input_ids=input_ids)
    predictions = torch.argmax(outputs.logits, dim=-1)

# Keep the original residues and take model predictions only at the
# masked positions; decoding the raw argmax could silently rewrite
# unmasked residues as well
mask_id = tokenizer.encode(tokenizer.mask_token, add_special_tokens=False)[0]
filled = torch.where(input_ids == mask_id, predictions, input_ids)

# Decode
predicted_sequence = tokenizer.decode(filled[0].cpu().tolist(), skip_special_tokens=True)
print(f"Predicted: {predicted_sequence}")
```

## Model Architecture

- Architecture: BERT (Bidirectional Encoder Representations from Transformers)
- Vocabulary: 26 tokens (20 amino acids + special tokens)
- Max sequence length: 256
- Special tokens:
  - `[PAD]`: padding token
  - `[CLS]`: classification token (sequence start)
  - `[SEP]`: separator token (sequence end)
  - `[MASK]`: mask token for MLM
  - `[UNK]`: unknown token
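
The repository's `tokenizer.py` is required for loading the model; its exact implementation is not reproduced in this card. As a rough orientation only, a character-level amino-acid tokenizer consistent with the vocabulary above might look like the following minimal sketch (not the actual AATokenizer):

```python
# Minimal sketch of a character-level amino-acid tokenizer; NOT the
# actual AATokenizer shipped with this model.
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")  # 20 standard residues

VOCAB = {tok: i for i, tok in enumerate(SPECIAL_TOKENS + AMINO_ACIDS)}

def encode(sequence, add_special_tokens=True):
    """Map a sequence string to token ids, one id per residue."""
    ids = [VOCAB.get(aa, VOCAB["[UNK]"]) for aa in sequence]
    if add_special_tokens:
        ids = [VOCAB["[CLS]"]] + ids + [VOCAB["[SEP]"]]
    return ids

print(encode("QVQLVE"))  # [2, 18, 22, 18, 14, 22, 8, 3] with this toy vocab
```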

## Training Data

The model was pre-trained on a curated dataset of nanobody sequences with strategic CDR masking.

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{nanobodybert,
  title={NanoBodyBERT: BERT-based Pre-trained Model for Nanobody Sequences},
  author={Ling Luo},
  year={2025},
  howpublished={\url{https://huggingface.co/LLMasterLL/nanobodybert}},
}
```

## License

Apache 2.0

## Contact

For questions and feedback, please open an issue on the repository.