---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- biology
- protein
- antibody
- nanobody
- bert
- masked-lm
pipeline_tag: fill-mask
---

# NanoBodyBERT

NanoBodyBERT is a BERT-based model pre-trained on nanobody sequences for antibody design and analysis tasks.

## Model Description

This model was trained with masked language modeling (MLM) on nanobody sequences, using a masking strategy that focuses on the complementarity-determining regions (CDRs).

## Intended Use

This model is designed for:
- Nanobody sequence analysis
- CDR reconstruction
- Sequence embedding generation
- Antibody design applications

## How to Use

### Installation

First, install the required dependencies:

```bash
pip install transformers torch
```

### Loading the Model

```python
import torch
from transformers import BertForMaskedLM

# Load the custom tokenizer class (AATokenizer).
# tokenizer.py must be importable from your project (e.g. in the working directory).
from tokenizer import AATokenizer

# Load model and tokenizer
model = BertForMaskedLM.from_pretrained("LLMasterLL/nanobodybert")
tokenizer = AATokenizer.from_pretrained("LLMasterLL/nanobodybert")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()
```

### Inference Example

```python
# Example nanobody sequence
sequence = "QVQLVESGGGLVQPGGSLRLSCAASGFTFDDYSIAWFRQAPGKEREGVAAISWGGGSTYYADSVKGRFTISRDNAKNTLYLQMNSLRAEDTAVYYCAKDYWGQGTQVTVSS"

# Encode sequence
input_ids = tokenizer.encode(sequence, add_special_tokens=True)
input_ids = torch.tensor([input_ids], dtype=torch.long).to(device)

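# Run only the BERT encoder (model.bert), skipping the MLM head,
# to get one hidden-state vector per token
with torch.no_grad():
    outputs = model.bert(input_ids=input_ids, return_dict=True)
    embeddings = outputs.last_hidden_state  # shape: (batch, tokens, hidden_size)
    cls_embedding = embeddings[0, 0, :]  # [CLS] token embedding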

print(f"CLS embedding shape: {cls_embedding.shape}")
```

### Masked Prediction Example

```python
# Create a masked sequence (here, four residues in the CDR3 region are masked)
masked_sequence = "QVQLVESGGGLVQPGGSLRLSCAASGFTFDDYSIAWFRQAPGKEREGVAAISWGGGSTYYADSVKGRFTISRDNAKNTLYLQMNSLRAEDTAVYYCAK[MASK][MASK][MASK][MASK]QGTQVTVSS"

# Tokenize: swap the literal "[MASK]" placeholder for the tokenizer's own mask token
tokens = masked_sequence.replace("[MASK]", tokenizer.mask_token)
input_ids = tokenizer.encode(tokens, add_special_tokens=True)
input_ids = torch.tensor([input_ids], dtype=torch.long).to(device)

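# Predict the most likely token at every position
with torch.no_grad():
    outputs = model(input_ids=input_ids)
    predictions = torch.argmax(outputs.logits, dim=-1)

# Decode. Note that the argmax is taken at all positions, so the decoded
# string contains the model's prediction for unmasked residues as well,
# not only for the [MASK] positions.
predicted_sequence = tokenizer.decode(predictions[0].cpu().tolist(), skip_special_tokens=True)
print(f"Predicted: {predicted_sequence}")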
```
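
The masked input above is written out by hand. As a convenience, a small helper can build the same string from residue indices. This is a minimal sketch in plain Python: `mask_span` is a hypothetical helper (not part of this repository or of `transformers`), and it assumes one `[MASK]` placeholder per masked residue, as in the example above.

```python
def mask_span(sequence, start, end, mask_token="[MASK]"):
    """Replace residues sequence[start:end] with one mask token each.

    Hypothetical helper for illustration; indices are 0-based, end is exclusive.
    """
    if not 0 <= start <= end <= len(sequence):
        raise ValueError("span must satisfy 0 <= start <= end <= len(sequence)")
    return sequence[:start] + mask_token * (end - start) + sequence[end:]


sequence = "QVQLVESGGGLVQPGGSLRLSCAASGFTFDDYSIAWFRQAPGKEREGVAAISWGGGSTYYADSVKGRFTISRDNAKNTLYLQMNSLRAEDTAVYYCAKDYWGQGTQVTVSS"

# Locate the four residues masked in the example above and mask them
start = sequence.find("DYWG")
masked_sequence = mask_span(sequence, start, start + 4)
print(masked_sequence)
```

The resulting `masked_sequence` can be passed through the same tokenize, predict, and decode steps shown above.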

## Model Architecture

- **Architecture:** BERT (Bidirectional Encoder Representations from Transformers)
- **Vocabulary:** 26 tokens (20 amino acids + special tokens)
- **Max sequence length:** 256
- **Special tokens:**
  - `[PAD]`: Padding token
  - `[CLS]`: Classification token (sequence start)
  - `[SEP]`: Separator token (sequence end)
  - `[MASK]`: Mask token for MLM
  - `[UNK]`: Unknown token
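
Given the limits listed above, it can help to validate a sequence before encoding it. The following is a minimal sketch, not part of the model code. It assumes that the tokenizer emits one token per residue, that `[CLS]` and `[SEP]` count toward the 256-token limit, and that residues outside the 20 standard amino acids are not in the vocabulary. `check_sequence` and the constants are hypothetical names introduced for illustration.

```python
MAX_LEN = 256  # max sequence length stated in this model card
NUM_SPECIAL = 2  # [CLS] + [SEP]; assumed to count toward MAX_LEN
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard residues


def check_sequence(sequence, max_len=MAX_LEN):
    """Return (ok, problems) for a raw amino-acid string."""
    problems = []

    # Flag characters that are not standard amino acids
    unknown = sorted(set(sequence) - AMINO_ACIDS)
    if unknown:
        problems.append(f"non-standard residues (may map to [UNK]): {unknown}")

    # Assumes one token per residue plus the two special tokens
    n_tokens = len(sequence) + NUM_SPECIAL
    if n_tokens > max_len:
        problems.append(f"{n_tokens} tokens exceeds the limit of {max_len}")

    return (not problems, problems)


ok, problems = check_sequence("QVQLVESGGGLVQPGGSLRLSCAAS")
print(ok, problems)
```

If the actual tokenizer handles length or unknown residues differently, adjust the constants to match it.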

## Training Data

The model was pre-trained on a curated dataset of nanobody sequences with strategic CDR masking.

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{nanobodybert,
  title={NanoBodyBERT: BERT-based Pre-trained Model for Nanobody Sequences},
  author={Ling Luo},
  year={2025},
  howpublished={\url{https://huggingface.co/LLMasterLL/nanobodybert}},
}
```

## License

Apache 2.0

## Contact

For questions and feedback, please open an issue on the repository.